Unable to shutdown

Kevin Oberman kob6558 at gmail.com
Wed Aug 31 06:04:44 UTC 2011


Jeremy,

I think we are simply not communicating. You are arguing
points with which I agree.

Comments in line:
On Tue, Aug 30, 2011 at 4:43 PM, Jeremy Chadwick
<freebsd at jdc.parodius.com> wrote:
> On Tue, Aug 30, 2011 at 04:10:13PM -0700, Kevin Oberman wrote:
>> On Tue, Aug 30, 2011 at 2:48 PM, Jeremy Chadwick
>> <freebsd at jdc.parodius.com> wrote:
>> > On Tue, Aug 30, 2011 at 01:29:02PM -0400, David Magda wrote:
>> >> On Tue, August 30, 2011 11:50, Kevin Oberman wrote:
>> >> [...]
>> >> > The more I look at this, the more it seems to me that it is an issue
>> >> > with the Seagate drive and not a FreeBSD issue. Probably a bug that is
>> >> > never triggered on Windows, so is largely unnoticed. I suspect Windows
>> >> > probably issues the commands in a subtly different order.
>> >> [...]
>> >>
>> >> Or not the drive per se, but the USB-to-IDE/SATA chipset.
>> >>
>> >> A while back on the OpenSolaris zfs-discuss list there was an issue where
>> >> USB drives would have corrupt ZFS pools if a drive was yanked without a
>> >> 'zpool export' being run. Even though ZFS is supposed to always be
>> >> consistent on-disk (because it's transactional), this wasn't happening.
>> >>
>> >> It turned out that the chipset had a list of particular SATA commands that it
>> >> allowed through to the drive, and all others were simply answered with
>> >> "OK", regardless of what actual actions needed to be taken. One of the
>> >> SATA commands that was NOT whitelisted was the 'cache flush'
>> >> command--which ZFS needs to make sure that its data structures were
>> >> written in the proper order.
>> >>
>> >> Turns out the drive and its firmware were fine and doing things properly,
>> >> it's just that the necessary commands weren't getting to it because of the
>> >> USB adaptor's chipset.
>> >
>> > I don't think that advice is applicable in this situation. Here's why:
>> >
>> > Kevin's original description indicates that when the drive (or enclosure
>> > translation ASIC for that matter) is in standby, when the system is shut
>> > down, the drive/ASIC never spins back up on I/O (flushing all I/O
>> > buffers to disk).
>> >
>> > If he issues "ls" commands or similar userland-induced I/O to the drive
>> > prior to shutting the system down, the drive/ASIC spins up normally.
>> >
>> > Here's Kevin's original quote:
>> >
>> >>> The drive is "green" and spins down when idle. If an attempt is made
>> >>> to shut down the system while the drive is spun down, the system goes
>> >>> through the usual shutdown including flushing all buffers out to disk,
>> >>> but at the final disk access to mark the file systems as clean, the
>> >>> drive never spins up and the system hangs until it is powered down.
>> >>> I've found no way to avoid this other than to remember to access the
>> >>> disk and cause it to spin up before shutting down.
>> >>>
>> >>> If I attempt to unmount the file systems when the drive is spun down,
>> >>> the same thing happens, but I can recover as the second file system
>> >>> is still mounted and an ls(1) to that file system will cause the disk
>> >>> to spin up and everything is fine.
>> >
>> > So the question is what's "unique" about flushing all I/O buffers to
>> > disk during shutdown compared to issuing standard I/O in userland. I
>> > can speculate all day as to what the cause is, but it's highly unlikely
>> > that the USB-to-SATA controller ASIC is causing the problem.
>>
>> You are perhaps assuming a bit too much. Since I know that a disk read
>> or write WILL spin up the drive, I can only assume that the msdosfs is
>> not finding anything to flush, so is not writing. I see the full
>> "flushing all buffers" countdown and it always runs successfully to
>> zero. This, without the drive spinning up. This begs at least the
>> question of whether the drive is receiving any writes or whether the
>> "writes" are simply being cached by the drive to save energy. I
>> suspect that the drive only spins up when enough of its write cache is
>> filled.
>
> If there's "nothing to flush", then why is the kernel indefinitely
> looping (finally giving up, and it usually prints something when it
> encounters that condition) when trying to flush buffers when the drive
> is spun down?  What exactly is it trying to flush if there's "nothing to
> flush"?

I think you may be focusing on things you believe I meant when I didn't mean or
say them. I don't have any reason to believe that a cache flush is or is not the
command that is hanging. I have absolutely no doubt that a flush is requested by
the OS during the unmount process. I'm just not sure what other commands might
be issued. And, of course, they are CAM operations that the box is probably
converting to SATA, but I can't even say this for sure as the Seagate drive in
question is a SATA drive in the box. I can only say that the drive is not a
standard 9mm laptop drive. It is longer, thicker and heavier than a laptop
drive. It is the same width as a normal 2.5 in. drive.

As to the issue of "nothing to flush", that was my fault as I was entering
text in a stream of consciousness and I realized that, if there was only a
little data being written, it might not spin up the drive (i.e. take it out
of standby) until more data is written or a cache flush is ordered.

Note that NOTHING is ever printed out and the hang does NOT time out. It
lasts until the drive leaves standby or power is cycled, whichever comes
first. :-)

If I inadvertently used the term "sleep", I am sorry. Clearly the drive
should never enter sleep mode and I really meant "standby".

> instead use UFS2 and see if the problem disappears?  This is in no way a
> permanent solution.  If this workaround fixes the problem, then I'm
> inclined to believe msdosfs is to blame.  There has been a lot of
> discussion of this driver in the kernel as of late, and the general
> opinion of it is that it's crummy.

Actually, for me it is, as I will shortly be re-partitioning this into a GPT
disk without any msdosfs partitions. I will give it a try with a UFS
partition tomorrow and see what happens.

When you say that it is crummy, are you referring to the USB driver, the
AHCI driver, or the msdosfs support? I have long been concerned about the
latter due to occasional unstable behavior that is "fixed" by booting
Windows. fsck_msdosfs seems to do some questionable things, too.

> And here's another thought: what if the issue is limited, somehow, to
> just writes?  Meaning, could the kernel issue a "false" read to the
> device (for some random LBA, even LBA 0 for all I care) and then proceed
> with its write/flushing?  I wonder if that would cause the drive to spin
> up first.  That would be a "quirk" in my opinion.

Interesting idea, but I really doubt that it's an issue with the write other
than that the drive may not leave standby unless the cache is full enough
that it flushes.

> There's also the possibility the USB stack on FreeBSD is doing something
> really stupid... man, I don't even want to go down that road.  Hans
> should be able to help determine if that's the case, but not using
> msdosfs as a test would be a good start.

Yes. I make no claim to understand the USB layer at all, but I do understand
that it is very tricky. Lots of evidence of that in how broken early
Microsoft USB stacks were.

>> In that case, the "flush cache" might actually be what is issued, but
>> I can't claim any certainty about that. I'm not willing to completely
>> clear the USB-SATA chip as the culprit.
>
> I'm pretty certain FLUSH CACHE or -EXT is what's used when the kernel is
> shutting down.  You ABSOLUTELY want all pending disk I/O (writes in
> particular) written to the platters/media on the disk before the
> machine reboots, otherwise you're hoping the drive does it before it
> gets re-initialised during POST or when an option ROM (AHCI) starts.

No argument there. Clearly a flush is mandatory. Not doing so would be
disastrous.

>
> So I'm pretty sure the kernel is iterating over whatever cache buffers
> there are for I/O (I don't know what this is called technically) and
> issuing WRITE DMA or -EXT and either waiting for a non-error response
> from the device or issuing it blindly followed by a FLUSH CACHE or -EXT
> (either once per write or at the very end).

Again, I really believe that the kernel fully believes that all writes are
complete, at least to the disk cache. At that point the FS structures can be
removed and the FS is no longer mounted as seen from the perspective of the
system. This MUST be done before the disk cache is flushed and the FS is
marked "clean". I suspect, but don't know for sure, that the last two
operations performed are to mark the drive clean and then do a cache flush.
Of possible relevance is that none of the file systems is marked "clean"
during a hung shutdown. All need to be FSCKed although nothing ever seems to
need fixing by fsck(8).

>> > Furthermore, Windows doesn't have "special disk/enclosure drivers" for
>> > such drives, so there's nothing "unique" Windows would be sending across
>> > the wire, ATA-protocol-wise, that would explain why Windows works and
>> > FreeBSD doesn't. At least that's my opinion.
>>
>> This is not always quite true, but it is true for the general case. (I
>> know some WD enclosures do install a custom driver.)
>
> It's true 99% of the time.  I use Windows XP exclusively on my
> workstations and make use of USB-class storage devices (hard disks, CF,
> microSD) quite often.  There are no drivers involved, but just like with
> FreeBSD there are potential device quirks.

Yes, and the drives that do use the special driver seem to work fine with
umass. I was just being unnecessarily pedantic. Sorry.

> The only way to find out what Windows is doing in this situation is to
> make use of a hardware ATA protocol analyser (one would need to buy one
> (expensive) and disassemble the drive and stick the analyser between the
> USB/SATA ASIC and the drive).  Fun project?  Not really.

Yes, I remember doing this sort of stuff when I was designing
interfaces and writing drivers for them back in the last millennium.
At least the lab bought the analyzers.

>> > With ATA/SATA, the FLUSH CACHE (0xe7) and -EXT (0xea) (for 48-bit LBAs)
>> > commands are separate from WRITE DMA (0xca) and -EXT (0x35) (for 48-bit
>> > LBAs). Neither FLUSH CACHE command takes an LBA in its input CDB.
>> > To "flush buffers to disk" I imagine what the kernel should be doing is
>> > issuing WRITE commands followed by FLUSH CACHE. The WRITEs should be
>> > "waking" the drive up.
>>
>> Should they? As I pointed out above, that is not necessarily the case.
>
> "It depends".  If the drive is in "sleep", then no.  If "standby", then
> yes.  There is no ATA protocol "wakeup" command, just for the record.

Nope, only a reset. Sleep is not something I would expect to ever be an issue.
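
As an illustration only, the "WRITEs followed by FLUSH CACHE" ordering
discussed above can be sketched in Python using the opcodes quoted in this
thread. The helper name flush_sequence and its arguments are hypothetical,
and this is an assumption about the ordering, not a description of what the
FreeBSD kernel or the USB bridge actually sends:

```python
# ATA command opcodes as quoted earlier in this thread.
WRITE_DMA = 0xCA
WRITE_DMA_EXT = 0x35      # 48-bit LBA variant
FLUSH_CACHE = 0xE7
FLUSH_CACHE_EXT = 0xEA    # 48-bit LBA variant

LBA28_LIMIT = 1 << 28     # first LBA that no longer fits in 28 bits


def flush_sequence(dirty_lbas, use_ext_flush=False):
    """Hypothetical helper: list the opcodes for writing out dirty blocks
    and then issuing a single cache flush at the very end."""
    cmds = []
    for lba in dirty_lbas:
        # Pick the 48-bit write only when the LBA needs it.
        opcode = WRITE_DMA_EXT if lba >= LBA28_LIMIT else WRITE_DMA
        cmds.append((opcode, lba))
    flush = FLUSH_CACHE_EXT if use_ext_flush else FLUSH_CACHE
    cmds.append((flush, None))   # the flush commands carry no LBA
    return cmds


if __name__ == "__main__":
    for opcode, lba in flush_sequence([0, 2048]):
        print(hex(opcode), lba)
```

With no dirty blocks at all the sequence is just the flush by itself, which
is the case where nothing in the sequence is guaranteed to touch the platters.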

> What needs to happen here is that those wanting to participate in this
> ATA protocol discussion *NEED* to familiarise themselves with the
> ATA8-ACS specification.  Please PLEASE **PLEASE** take the time to do
> this before questioning.
>
> http://www.t13.org/Documents/UploadedDocuments/docs2007/D1699r4a-ATA8-ACS.pdf
>
> Section 4.18.3 contains a flow-chart diagram that is difficult to
> understand, so I'll summarise:
>
> PM0 state = ACTIVE state -- spun up and ready to handle any I/O of any kind
>
> PM1 state = IDLE state -- this does not mean "the drive is sitting there
> idle doing nothing".  There is an ATA IDLE command that can be used to
> tell the drive to go into a "lower-power" state.
>
> PM2 state = STANDBY state -- this equates to "camcontrol standby".  This
> is what people here are describing as "the drive has spun down".  Or,
> well, I sure hope that's what people are describing, because "sleep" is
> not the same thing as "standby".
>
> PM3 state = SLEEP state -- this equates to "camcontrol sleep".  It's
> permanent until the entire bus is reset or the physical device is
> power-cycled (which works varies from device to device).
>
> So with those definitions, you can see quite clearly the documentation
> states what should happen when transitioning from one state to another.
> Specifically this is the one that matters (PM2 --> PM0 state):
>
> Transition PM2:PM0: When a media access is required, the device shall
> make a transition to the PM0:Active mode.
>
> Now as for drives which may be in IDLE mode (I'm not sure if FreeBSD
> makes use of that mode automatically or not), it's the same thing:
>
> Transition PM1:PM0: When a media access is required, the device shall
> make a transition to the PM0:Active mode.
>
> So that answers the question: any I/O (read or write) to the device
> should spin the drive up.  If you have an enclosure or an ASIC that is
> screwing this up (I highly doubt it, and this is not the same problem as
> what David was describing!), then it's in violation of the ATA protocol.

Nice description. I understand it, but the standard does not specify EXACTLY
what triggers a transition from standby to ready (PM2 to PM0). Only that it is
something that requires media access. A write does not necessarily require
media access if you define "media" as the disk platter.
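
To make the two readings of "media access" concrete, here is a toy Python
model of the PM0 to PM3 states summarized above. Everything in it (the
ToyDrive class, the cache_absorbs_writes flag) is hypothetical and exists
only to contrast the interpretations; it does not describe the behaviour of
any real drive firmware:

```python
from enum import Enum


class PowerMode(Enum):
    PM0_ACTIVE = 0
    PM1_IDLE = 1
    PM2_STANDBY = 2
    PM3_SLEEP = 3


class ToyDrive:
    """Toy model of the power states; not real firmware behaviour."""

    def __init__(self, mode, cache_absorbs_writes):
        self.mode = mode
        # Assumption: True models the hypothesis that a small write can sit
        # in the drive's cache without counting as a media access.
        self.cache_absorbs_writes = cache_absorbs_writes
        self.cached_writes = 0

    def _check_awake(self):
        if self.mode is PowerMode.PM3_SLEEP:
            raise RuntimeError("drive is asleep; only a reset wakes it")

    def read(self):
        self._check_awake()
        self.mode = PowerMode.PM0_ACTIVE      # media access: PMx -> PM0

    def write(self):
        self._check_awake()
        if self.cache_absorbs_writes:
            self.cached_writes += 1           # no media access, mode unchanged
        else:
            self.mode = PowerMode.PM0_ACTIVE  # write goes straight to media

    def flush_cache(self):
        self._check_awake()
        if self.cached_writes:
            self.mode = PowerMode.PM0_ACTIVE  # cached data must reach media
            self.cached_writes = 0


if __name__ == "__main__":
    drive = ToyDrive(PowerMode.PM2_STANDBY, cache_absorbs_writes=True)
    drive.write()
    print(drive.mode)   # still in standby under this hypothesis
    drive.flush_cache()
    print(drive.mode)   # the flush forces the spin-up
```

Under the strict reading (cache_absorbs_writes=False) any write leaves
standby at once; under the other reading only a read or a flush of cached
data does.
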
>> > But wait, there's more.
>> >
>> > I want to point out to people that "sleep" and "standby" are two very
>> > different things (they're separate ATA commands too). So if you're
>> > using "camcontrol sleep" you probably should be using "camcontrol
>> > standby". The man page is quite clear about the repercussions of the
>> > former (and in the latter case I can imagine I/O to the drive failing or
>> > simply timing out given that a bus reset is not performed during
>> > shutdown TMK).
>>
>> This is a very interesting point. Note that when this happens, whether at
>> shutdown or when unmounting the file system, it hangs forever. There is no
>> timeout.
>>
>> I should also make one oddity completely clear, just in case my initial
>> report failed to do so. I have two msdosfs file systems on the disk (along
>> with an encrypted UFS system which is not normally mounted). I can dismount
>> one file system. It no longer shows up as mounted, but the drive DOES NOT
>> SPIN UP. Only when I attempt to unmount the second FS does that unmount
>> hang. And, since the system is running normally and the drive is still
>> mounted, I can issue a command to read from the disk and it spins up. (I
>> actually use tcsh command completion to do this by typing
>> "ls /media/MUSIC/Ctrl-D". The terminal window freezes at that point for
>> several seconds until the disk is spun up and ready and then completes the
>> operation.) Both disks are then unmounted and the system is clear.
>>
>> Does anyone know what the very last operations of unmount are? Things that
>> are AFTER the system has been removed from all system tables? I'm guessing
>> it is just to mark the system as clean (single block write) and flush the
>> cache. I'm guessing that the write is not going to fill cache to the point
>> of triggering a spin-up, so the system THINKS the first drive is unmounted,
>> but something is still not complete.
>
> This is really starting to sound like idiocy within the msdosfs driver.
> That's just my opinion at this point.  As for what happens during device
> unmount, I believe it's handled per-device (per-layer) as well as
> per-filesystem.  Kirk McKusick might have some insight to this --
> filesystems aren't something I'm really well-versed in.

Yes, you are right. I'll find out when I try it out tomorrow. Kirk almost
certainly does know since this is relevant to ANY file system.

> Sorry for sounding crass, but I really grow tired of people "blaming
> hardware" willy-nilly when in my experience most of these wonky problems
> turn out to be bugs/issues in FreeBSD.  Anyone who thinks this OS is
> infallible is smoking some serious crack.

I really know that the FS is far less than perfect, but the fact that the
two reports of this sort of behavior both involve USB drives from the same
manufacturer and probably running identical firmware does tend to point to a
hardware issue. It's certainly not proof.
-- 
R. Kevin Oberman, Network Engineer - Retired
E-mail: kob6558 at gmail.com


More information about the freebsd-stable mailing list