UFS2 and/or sparse file bug causing copy process to land in 'D' state?
Kostik Belousov
kostikbel at gmail.com
Sun Feb 22 03:01:00 PST 2009
On Sun, Feb 22, 2009 at 12:00:38AM -0800, Carl wrote:
> I've come across what I'm thinking may be a bug in the context of
> FreeBSD 7.0 with a pair of gmirrored drives and gjournaled partitions
> when copying a large number of files into a file-backed memory device.
>
> The consequence of this problem is that a process enters the 'D' state
> (uninterruptible disk wait) indefinitely, cannot be killed, and the system
> cannot be shut down. The only solution is to cold reboot the system, which is a
> really big problem for remote systems. This is happening to me
> intermittently with the standard tar-tar pipeline form of copying, but
> has happened with the rsync 3.0.4 port as well.
>
> I would appreciate it if some of you would see if you can repeat this
> problem. Here is a sequence of tcsh shell commands which manifest the
> problem (on occasion but not every time), which I will refer to as the
> "truncate sequence" (depends on fully populated /usr/src tree as data set):
>
> # truncate -s 671088640 target
> # mdconfig -f target -S 512 -y 255 -x 63 -u 7
> # bsdlabel -w /dev/md7 auto
> # newfs -O2 -m 0 -o space /dev/md7a
> # mount /dev/md7a /media
> # tar -cvf - -C /usr/src . | tar -xvpof - -C /media
> # umount /media ; mdconfig -d -u 7 ; rm target
>
> An alternate version has yet to fail for me and involves replacing the
> first line with this one:
>
> # dd if=/dev/zero of=target bs=1M count=640
>
> I'll call that the "dd sequence". Here is an ordered series of tests I
> just completed:
>
> a) Repeated truncate sequence 7 times - 1st, 5th, and 7th failed.
> b) Repeated dd sequence 7 times - no failures.
> c) Repeated truncate sequence 6 times - no failures.
> d) Used following sequence to ensure all disk caches flushed:
>
> # dd if=/dev/random of=target bs=1M count=4096
> # dd if=target of=/dev/null bs=1M
> # rm target
>
> e) Repeated truncate sequence 4 times - no failures.
> f) Performed orderly reboot.
> g) Repeated truncate sequence 2 times - 2nd failed.
> h) Performed orderly reboot.
> i) Repeated dd sequence 7 times - no failures.
>
> All failures involve the second tar in the pipeline hanging in the 'D'
> state. In each case I do a cold reboot before proceeding with the next test.
>
> It's tempting to speculate that a bug exists in code related to handling
> sparse files specifically, but perhaps it just raises the probability of
> tripping a bug that would eventually manifest in the dd sequence as
> well. OTOH, I don't know how to rule out a physical disk or disk
> firmware problem.
>
> This problem has occurred with different data sets and different sized
> memory disks, but only with the source and destination filesystems being
> UFS2. I have done similar sequences with EXT2 and FAT16 destinations
> with no failures thus far, but the memory disks and data sets were
> smaller so it's conceivable that probability worked against me.
>
> I should note that the drives are Seagate ST31000340AS Barracudas, but
> both drives have been upgraded to firmware version SD1A and are
> therefore supposedly free of the infamous little horror Seagate
> inflicted on so many of us. smartctl tells me that both disks still have
> a raw value of 0 for Reallocated_Sector_Ct and both pass the "short"
> self test.
Please see
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
for instructions on how to gather the required information to diagnose
the issue.
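
As a first userland step before the kernel-level procedure described at that
link, it can help to record which processes are stuck and what they are
sleeping on. The following is only a minimal sketch: the function name
dstate_filter is hypothetical, and it assumes ps output with the columns
pid, state, wchan, command (keywords documented in ps(1)). It is not a
substitute for the information the handbook asks for.

```sh
#!/bin/sh
# Hypothetical helper (not taken from the handbook): keep only processes
# whose state column starts with 'D'. Expects a header line followed by
# rows of "PID STATE WCHAN COMMAND", as printed by:
#   ps -axo pid,state,wchan,command
dstate_filter() {
    # Skip the header (NR > 1) and match on the second field, the state.
    awk 'NR > 1 && $2 ~ /^D/ { print }'
}

# On the affected machine one might run (left commented out so that
# sourcing this file has no side effects):
#   ps -axo pid,state,wchan,command | dstate_filter
```

The wait channel (wchan) column of a hung tar is worth including in any
follow-up report, since it names what the process is sleeping on.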