UFS2 and/or sparse file bug causing copy process to land in 'D' state?

Carl k0802647 at telus.net
Sun Feb 22 00:38:49 PST 2009


I've come across what I'm thinking may be a bug in the context of 
FreeBSD 7.0 with a pair of gmirrored drives and gjournaled partitions 
when copying a large number of files into a file-backed memory device.

The consequence of this problem is that a process enters the 'D' state 
(uninterruptible disk wait) indefinitely, cannot be killed, and the 
system cannot be shut down. The only recourse is to cold reboot the 
system, which is a really big problem for remote systems. This is 
happening to me intermittently with the standard tar-to-tar pipeline 
method of copying, but it has happened with the rsync 3.0.4 port as well.
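
If you want to look at a hung instance, something like the following 
should show the stuck process and the kernel wait channel it is 
sleeping on (grep for whichever copy tool you used):

      # ps -axlww | grep tar

The STAT column shows 'D' and the MWCHAN column shows what the process 
is blocked on, which may mean something to someone who knows the 
UFS/gjournal code better than I do.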

I would appreciate it if some of you would see whether you can 
reproduce this problem. Here is a sequence of tcsh shell commands that 
manifests the problem (on occasion, but not every time), which I will 
refer to as the "truncate sequence"; it depends on a fully populated 
/usr/src tree as the data set:

      # truncate -s 671088640 target
      # mdconfig -f target -S 512 -y 255 -x 63 -u 7
      # bsdlabel -w /dev/md7 auto
      # newfs -O2 -m 0 -o space /dev/md7a
      # mount /dev/md7a /media
      # tar -cvf - -C /usr/src . | tar -xvpof - -C /media
      # umount /media ; mdconfig -d -u 7 ; rm target

An alternate version, which has yet to fail for me, replaces the first 
line with this one:

      # dd if=/dev/zero of=target bs=1M count=640

I'll call that the "dd sequence". Here is an ordered series of tests I 
just completed:

a) Repeated truncate sequence 7 times - 1st, 5th, and 7th failed.
b) Repeated dd sequence 7 times - no failures.
c) Repeated truncate sequence 6 times - no failures.
d) Used the following sequence to ensure all disk caches were flushed:

      # dd if=/dev/random of=target bs=1M count=4096
      # dd if=target of=/dev/null bs=1M
      # rm target

e) Repeated truncate sequence 4 times - no failures.
f) Performed orderly reboot.
g) Repeated truncate sequence 2 times - 2nd failed.
h) Performed orderly reboot.
i) Repeated dd sequence 7 times - no failures.

All failures involve the second tar in the pipeline hanging in the 'D' 
state. After each failure I did a cold reboot before proceeding with the 
next test.

It's tempting to speculate that a bug exists specifically in the code 
that handles sparse files, but perhaps the sparse backing file just 
raises the probability of tripping a bug that would eventually manifest 
with the dd sequence as well. OTOH, I don't know how to rule out a 
physical disk or disk firmware problem.
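
For what it's worth, it's easy to confirm that the two sequences really 
do start from differently laid-out backing files. Running 'ls -ls' right 
after the first line of each sequence shows the blocks actually 
allocated versus the apparent size (same sizes as above):

      # truncate -s 671088640 target
      # ls -ls target
      # rm target
      # dd if=/dev/zero of=target bs=1M count=640
      # ls -ls target
      # rm target

The truncate-created file reports essentially no allocated blocks 
despite its 640 MB apparent size, while the dd-created file is fully 
allocated, so the md device is backed by a sparse file in the first 
case and a fully populated one in the second.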

This problem has occurred with different data sets and different-sized 
memory disks, but only when both the source and destination filesystems 
are UFS2. I have done similar sequences with EXT2 and FAT16 destinations 
with no failures thus far, but the memory disks and data sets were 
smaller, so it's conceivable that I simply haven't run enough trials to 
hit it.
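
For anyone who wants to repeat the non-UFS tests, the FAT16 variant 
would be along these lines (not necessarily the exact flags I used, and 
I've dropped tar's ownership/permission flags since they mean nothing 
on msdosfs):

      # truncate -s 671088640 target
      # mdconfig -f target -S 512 -y 255 -x 63 -u 7
      # newfs_msdos -F 16 /dev/md7
      # mount -t msdosfs /dev/md7 /media
      # tar -cvf - -C /usr/src . | tar -xvf - -C /media
      # umount /media ; mdconfig -d -u 7 ; rm target

The EXT2 case is similar, with mke2fs from the e2fsprogs port in place 
of newfs_msdos and a 'mount -t ext2fs' for the mount step.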

I should note that the drives are Seagate ST31000340AS Barracudas, but 
both drives have been upgraded to firmware version SD1A and are 
therefore supposedly free of the infamous little horror Seagate 
inflicted on so many of us. smartctl tells me that both disks still have 
a raw value of 0 for Reallocated_Sector_Ct, and both pass the "short" 
self-test.
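
For reference, those checks are along these lines (ad4 is just a 
placeholder for whatever device node your drive shows up as; smartctl 
is from the smartmontools port):

      # smartctl -i /dev/ad4
      # smartctl -t short /dev/ad4
      # smartctl -a /dev/ad4

-i reports the firmware revision, -t short kicks off the short 
self-test, and -a dumps the attribute table (including 
Reallocated_Sector_Ct) and the self-test log once the test has finished.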

Carl                                             / K0802647


