UFS2 and/or sparse file bug causing copy process to land in 'D' state?
Carl
k0802647 at telus.net
Sun Feb 22 00:38:49 PST 2009
I've come across what I think may be a bug on FreeBSD 7.0 with a pair of
gmirrored drives and gjournaled partitions when copying a large number
of files into a file-backed memory device. The consequence is that a
process enters the 'D' state (uninterruptible disk wait) indefinitely,
cannot be killed, and the system cannot be shut down. The only solution
is to cold reboot the system, which is a really big problem for remote
systems. This happens to me intermittently with the standard tar-tar
pipeline form of copying, but has happened with the rsync 3.0.4 port as
well.
I would appreciate it if some of you would see whether you can reproduce
this problem. Here is a sequence of tcsh shell commands that triggers the
problem (on occasion, but not every time), which I will refer to as the
"truncate sequence" (it depends on a fully populated /usr/src tree as the
data set):
# truncate -s 671088640 target
# mdconfig -f target -S 512 -y 255 -x 63 -u 7
# bsdlabel -w /dev/md7 auto
# newfs -O2 -m 0 -o space /dev/md7a
# mount /dev/md7a /media
# tar -cvf - -C /usr/src . | tar -xvpof - -C /media
# umount /media ; mdconfig -d -u 7 ; rm target
An alternate version has yet to fail for me and involves replacing the
first line with this one:
# dd if=/dev/zero of=target bs=1M count=640
I'll call that the "dd sequence". Here is an ordered series of tests I
just completed:
a) Repeated truncate sequence 7 times - 1st, 5th, and 7th failed.
b) Repeated dd sequence 7 times - no failures.
c) Repeated truncate sequence 6 times - no failures.
d) Used following sequence to ensure all disk caches flushed:
# dd if=/dev/random of=target bs=1M count=4096
# dd if=target of=/dev/null bs=1M
# rm target
e) Repeated truncate sequence 4 times - no failures.
f) Performed orderly reboot.
g) Repeated truncate sequence 2 times - 2nd failed.
h) Performed orderly reboot.
i) Repeated dd sequence 7 times - no failures.
All failures involve the second tar in the pipeline hanging in the 'D'
state. In each case I do a cold reboot before proceeding with the next test.
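For anyone trying to reproduce this, a hang of this kind can be spotted by listing processes whose state begins with 'D'. The following is only a minimal sketch: the helper name filter_d_state is made up for illustration, and the ps keywords (pid, stat, wchan, comm) are assumed to be available on the system in question. The wchan column shows what the process is sleeping on, which may help narrow down where it is stuck.

```shell
#!/bin/sh
# Sketch: list processes whose state column starts with 'D'
# (uninterruptible disk wait). filter_d_state is a hypothetical helper
# name, not something from the original post.
filter_d_state() {
    # Expects lines of the form: PID STAT WCHAN COMMAND
    # The header line is skipped naturally because "STAT" does not
    # start with 'D'.
    awk '$2 ~ /^D/ { print }'
}

# Assumes this ps accepts -A and the pid, stat, wchan and comm keywords.
ps -A -o pid,stat,wchan,comm | filter_d_state
```

Note that some kernel processes may sit in a 'D' state normally, so only an entry for the extracting tar that never goes away is of interest here.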
It's tempting to speculate that a bug exists in code related to handling
sparse files specifically, but perhaps a sparse backing file just raises
the probability of tripping a bug that would eventually manifest in the
dd sequence as well. OTOH, I don't know how to rule out a physical disk
or disk firmware problem.
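The practical difference between the two sequences is that truncate creates the backing file without writing any data, so it starts out sparse, while dd writes every block up front. A small sketch that makes this visible is shown here; the file names and the 8 MiB size are assumptions chosen to keep it quick, not values from the tests above.

```shell
#!/bin/sh
# Sketch: compare a sparse file (truncate) with a fully written one (dd).
# The file names and the 8 MiB size are illustrative assumptions.
workdir=$(mktemp -d)
size=8388608    # 8 MiB

# Sparse: the length is set but no data blocks are written.
truncate -s "$size" "$workdir/sparse"
# Dense: every block is written explicitly.
dd if=/dev/zero of="$workdir/dense" bs=1048576 count=8 2>/dev/null

# Apparent size in bytes (same for both files).
sparse_apparent=$(wc -c < "$workdir/sparse" | tr -d ' ')
dense_apparent=$(wc -c < "$workdir/dense" | tr -d ' ')
# Space actually allocated, in KB (usually far smaller for the sparse file).
sparse_kb=$(du -k "$workdir/sparse" | awk '{ print $1 }')
dense_kb=$(du -k "$workdir/dense" | awk '{ print $1 }')

echo "sparse: $sparse_apparent bytes apparent, $sparse_kb KB allocated"
echo "dense:  $dense_apparent bytes apparent, $dense_kb KB allocated"

rm -rf "$workdir"
```

With a sparse backing file, blocks of "target" are allocated on the underlying filesystem only as the memory disk is written to, so the copy exercises block allocation on both filesystems at once. Whether that is related to the hang is speculation.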
This problem has occurred with different data sets and different sized
memory disks, but only with the source and destination filesystems being
UFS2. I have run similar sequences with EXT2 and FAT16 destinations
with no failures thus far, but the memory disks and data sets were
smaller, so it's conceivable that the odds of hitting the problem were
simply lower.
I should note that the drives are Seagate ST31000340AS Barracudas, but
both drives have been upgraded to firmware version SD1A and are
therefore supposedly free of the infamous little horror Seagate
inflicted on so many of us. smartctl tells me that both disks still have
a raw value of 0 for Reallocated_Sector_Ct and both pass the "short"
self test.
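For reference, the raw Reallocated_Sector_Ct value can be pulled out of the smartctl attribute table roughly as follows. This is a sketch only: the device node /dev/ad4 and the helper name raw_reallocated are placeholders, and it assumes the usual `smartctl -A` table layout in which the raw value is the tenth column.

```shell
#!/bin/sh
# Sketch: print the raw value of Reallocated_Sector_Ct from the
# attribute table that `smartctl -A` writes, read on stdin.
# raw_reallocated is a hypothetical helper name.
raw_reallocated() {
    # Assumes the usual layout where column 2 is the attribute name
    # and column 10 is the raw value.
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

dev=/dev/ad4    # placeholder device node; substitute the real one
if command -v smartctl >/dev/null 2>&1; then
    smartctl -A "$dev" | raw_reallocated
fi
```

A raw value of 0 only says that no sectors have been remapped so far; it does not rule out a firmware problem.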
Carl / K0802647