UFS2 and/or sparse file bug causing copy process to land in 'D' state?

Sun Feb 22 03:01:00 PST 2009

On Sun, Feb 22, 2009 at 12:00:38AM -0800, Carl wrote:
> I've come across what I'm thinking may be a bug in the context of 
> FreeBSD 7.0 with a pair of gmirrored drives and gjournaled partitions 
> when copying a large number of files into a file-backed memory device.
> 
> The consequence of this problem is that a process enters the 'D' state
> (uninterruptible disk wait) indefinitely and cannot be killed, and the
> system cannot be shut down. The only solution is a cold reboot, which is a
> really big problem for remote systems. This is happening to me 
> intermittently with the standard tar-tar pipeline form of copying, but 
> has happened with the rsync 3.0.4 port as well.
> 
> I would appreciate it if some of you would see if you can repeat this 
> problem. Here is a sequence of tcsh shell commands which manifest the 
> problem (on occasion but not every time), which I will refer to as the 
> "truncate sequence" (depends on fully populated /usr/src tree as data set):
> 
>      # truncate -s 671088640 target
>      # mdconfig -f target -S 512 -y 255 -x 63 -u 7
>      # bsdlabel -w /dev/md7 auto
>      # newfs -O2 -m 0 -o space /dev/md7a
>      # mount /dev/md7a /media
>      # tar -cvf - -C /usr/src . | tar -xvpof - -C /media
>      # umount /media ; mdconfig -d -u 7 ; rm target
> 
> An alternate version has yet to fail for me and involves replacing the 
> first line with this one:
> 
>      # dd if=/dev/zero of=target bs=1M count=640
> 
> I'll call that the "dd sequence". Here is an ordered series of tests I 
> just completed:
> 
> a) Repeated truncate sequence 7 times - 1st, 5th, and 7th failed.
> b) Repeated dd sequence 7 times - no failures.
> c) Repeated truncate sequence 6 times - no failures.
> d) Used following sequence to ensure all disk caches flushed:
> 
>      # dd if=/dev/random of=target bs=1M count=4096
>      # dd if=target of=/dev/null bs=1M
>      # rm target
> 
> e) Repeated truncate sequence 4 times - no failures.
> f) Performed orderly reboot.
> g) Repeated truncate sequence 2 times - 2nd failed.
> h) Performed orderly reboot.
> i) Repeated dd sequence 7 times - no failures.
> 
> All failures involve the second tar in the pipeline hanging in the 'D' 
> state. In each case I do a cold reboot before proceeding with the next test.
> 
> It's tempting to speculate that a bug exists in code related to handling 
> sparse files specifically, but perhaps it just raises the probability of 
> tripping a bug that would eventually manifest in the dd sequence as 
> well. OTOH, I don't know how to rule out a physical disk or disk 
> firmware problem.
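
For readers comparing the two sequences: the only difference is that
truncate creates a sparse backing file, whose blocks are allocated as the
memory disk is written, while dd writes every block up front. A minimal
sketch of how to confirm that on a given system follows. It is not part of
the original report; the file names and the 16 MB size are assumptions
chosen only to keep the demonstration small.

```shell
#!/bin/sh
# Create one sparse and one zero-filled file of the same apparent size,
# then compare apparent size (bytes) with allocated size (KB from du).
# The scratch directory, file names, and 16 MB size are illustrative only.
dir=$(mktemp -d)

# Sparse: only the length is recorded, no data blocks are written.
truncate -s 16777216 "$dir/sparse"

# Dense: every block is written with zeros.
dd if=/dev/zero of="$dir/dense" bs=1048576 count=16 2>/dev/null

# Apparent sizes; the arithmetic expansion strips padding from wc output.
sparse_bytes=$(( $(wc -c < "$dir/sparse") ))
dense_bytes=$(( $(wc -c < "$dir/dense") ))

# Allocated sizes in kilobytes.
sparse_kb=$(du -k "$dir/sparse" | cut -f1)
dense_kb=$(du -k "$dir/dense" | cut -f1)

echo "sparse: $sparse_bytes bytes apparent, $sparse_kb KB allocated"
echo "dense:  $dense_bytes bytes apparent, $dense_kb KB allocated"

rm -rf "$dir"
```

Both files should report the same apparent size. On a filesystem that
supports holes, the sparse file should normally show far fewer allocated
kilobytes, though compression or delayed allocation can blur the
difference.
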
> 
> This problem has occurred with different data sets and different sized 
> memory disks, but only with the source and destination filesystems being 
> UFS2. I have done similar sequences with EXT2 and FAT16 destinations 
> with no failures thus far, but the memory disks and data sets were 
> smaller so it's conceivable that probability worked against me.
> 
> I should note that the drives are Seagate ST31000340AS Barracudas, but 
> both drives have been upgraded to firmware version SD1A and are 
> therefore supposedly free of the infamous little horror Seagate 
> inflicted on so many of us. smartctl tells me that both disks still have 
> a raw value of 0 for Reallocated_Sector_Ct and both pass the "short" 
> self test.

Please see
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
for instructions on how to gather the required information to diagnose
the issue.
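
Before going into the kernel debugger, it can help to record from userland
which processes are stuck and what they are sleeping on. The following is
a rough sketch under stated assumptions, not a substitute for the handbook
procedure: the function names are hypothetical, and it assumes ps accepts
the pid, stat, wchan, and comm output keywords.

```shell
#!/bin/sh
# List processes in the 'D' (uninterruptible disk wait) state together
# with their wait channel. Function names here are hypothetical helpers.

# Filter a ps listing with columns PID STAT WCHAN COMMAND:
# keep the header line and every row whose state starts with "D".
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Take a snapshot of all processes and keep only the D-state ones.
list_dstate() {
    ps -A -o pid,stat,wchan,comm | filter_dstate
}
```

Running list_dstate while the second tar is hung should show that process
and its wait channel. Some kernel threads normally sit in the 'D' state as
well, so look for the tar entry specifically.
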