Bizarre IO errors
Michael Conlen
m at obmail.net
Sun Apr 5 00:55:37 PDT 2009
First, the background:
FreeBSD nfs4.tarhost.com 7.1-RELEASE-p3 FreeBSD 7.1-RELEASE-p3 #0: Sat Mar 7 00:15:02 EST 2009 root@nfs4.tarhost.com:/usr/obj/usr/src/sys/GENERIC amd64
(Two of these processors)
CPU: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz (2500.10-MHz K8-class CPU)
Memory
usable memory = 17165377536 (16370 MB)
avail memory = 16626044928 (15855 MB)
The Disk system
aac0: <Adaptec RAID 52445> mem 0xd8000000-0xd81fffff irq 48 at device 0.0 on pci1
aac0: Enabling 64-bit address support
aac0: Enable Raw I/O
aac0: Enable 64-bit array
aac0: New comm. interface enabled
aac0: [ITHREAD]
aac0: Adaptec 52445, aac driver 2.0.0-1
Controller Status : Optimal
Channel description : SAS/SATA
Controller Model : Adaptec 52445
Controller Serial Number : 8A321083874
Physical Slot : 5
Temperature : 60 C/ 140 F (Normal)
Installed memory : 512 MB
Copyback : Disabled
Background consistency check : Disabled
Automatic Failover : Enabled
Global task priority : High
Performance Mode : Default/Dynamic
Defunct disk drive count : 0
Logical devices/Failed/Degraded : 2/0/0
Logical device number 0
Logical device name : system
RAID level : 1
Status of logical device : Optimal
Size : 953334 MB
Read-cache mode : Enabled
Write-cache mode : Enabled (write-back)
Write-cache setting : Enabled (write-back)
Partitioned : Yes
Protected by Hot-Spare : Yes
Dedicated Hot-Spare : 0,4
Dedicated Hot-Spare : 0,12
Bootable : Yes
Failed stripes : No
--------------------------------------------------------
Logical device segment information
--------------------------------------------------------
Segment 0 : Present (0,0)
Segment 1 : Present (0,8)
Logical device number 1
Logical device name : data
RAID level : 10
Status of logical device : Optimal
Size : 5720064 MB
Stripe-unit size : 256 KB
Read-cache mode : Enabled
Write-cache mode : Disabled (write-through)
Write-cache setting : Enabled (write-back) when protected by battery
Partitioned : Yes
Protected by Hot-Spare : No
Bootable : No
Failed stripes : No
Filesystem Size Used Avail Capacity Mounted on
/dev/aacd0s1a 15G 417M 14G 3% /
devfs 1.0K 1.0K 0B 100% /dev
/dev/aacd0s1e 15G 30K 14G 0% /tmp
/dev/aacd0s1f 62G 3.2G 54G 6% /usr
/dev/aacd0s1g 762G 354M 701G 0% /usr/local
/dev/aacd0s1d 15G 3.5G 11G 24% /var
/dev/aacd1p1 5.3T 874G 4.0T 18% /usr/local/export
This system is set up as an NFS file server. It handled several stress
tests over two weeks before going online. The files were transferred to
the system, it was placed online, and it ran fine for a few days. There
are 10 web servers which access the file server, but the file servers
have a cluster of caches in front of them, so the load isn't too bad. I
see peaks around 60 Mbit/sec of traffic from the NFS server when taking
backups and 30 Mbit/sec otherwise. IO is minimized due to the large
amount of available RAM. It takes about three hours before disk
caching fills the available memory, so there isn't a lot of really hot
data it's going to the disks for, mostly just writes. iostat generally
reports about 1 MB/sec, maybe 2 at most. During stress tests I'd seen
in excess of 800 MB/sec, though usually the 300-400 MB/sec range.
Now for the strange part. First notice the first two g_vfs_done lines
(error = 5 is EIO): the offset is negative. After the second line
repeats 290 times, we see the odd pattern of the error taking many
lines to display. After that the logs continue, occasionally showing a
full error line followed by a line broken up as below. The first 8 or
so lines after the first have the same offset. After that the offset
switches to offset=1666490991559323648. That's in excess of 1 exabyte;
I've got a lot of disk, but not that much. That seems to indicate that
it's not really an IO error, since the offset is way off the end of
anything that could possibly be disk (can someone confirm or deny
that?). The offset occasionally changes to a negative number or some
other value, but this particular offset is repeated over and over.
Between 1 and 2 AM over 2 GB of this log was generated.
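For scale, here's a quick sketch of the arithmetic; the only inputs are
the "Size : 5720064 MB" reported for the data array above and the
repeated offset from the log, nothing here touches the live system:

/* sketch: compare the repeated offset against the data array size;
 * both numbers are taken from the output quoted earlier in this mail */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	int64_t offset = 1666490991559323648LL;    /* repeated offset from syslog */
	int64_t volume = 5720064LL * 1024 * 1024;  /* "Size : 5720064 MB" in bytes */

	printf("volume size  : %jd bytes (~%.1f TB)\n",
	    (intmax_t)volume, (double)volume / 1e12);
	printf("logged offset: %jd bytes (~%.2f EB)\n",
	    (intmax_t)offset, (double)offset / 1e18);
	printf("offset is ~%.0fx past the end of the volume\n",
	    (double)offset / (double)volume);
	return (0);
}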
About two minutes after this started, NFS stopped responding to the
clients in a prompt manner.
Once the server is restarted it runs fine for some time, but this
pattern soon (within minutes) repeats.
I have noticed that the read errors seem to get logged in bursts,
several at a time, with occasional pauses in between; further below is
a sample of just the "last message repeated" entries that repeated 10
or more times.
At the moment I can only assume that the negative and exceedingly
large offsets are a symptom of something beyond simple "disk problems",
since the messages logged indicate the OS was attempting to do
something it shouldn't. Can this be confirmed or denied? The
controller reports no problems. Is there anything else you can point
me to?
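In case it's useful to anyone checking this, a minimal sketch of how
the logged offsets could be checked against the real size of the
provider using the standard FreeBSD DIOCGMEDIASIZE disk ioctl (the
/dev/aacd1p1 default is just the device from the df output above):

/* sketch: print the media size of a provider so logged offsets can be
 * compared against it */
#include <sys/types.h>
#include <sys/disk.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>

int
main(int argc, char **argv)
{
	const char *dev = (argc > 1) ? argv[1] : "/dev/aacd1p1";
	off_t mediasize;
	int fd;

	fd = open(dev, O_RDONLY);
	if (fd < 0) {
		perror(dev);
		return (1);
	}
	if (ioctl(fd, DIOCGMEDIASIZE, &mediasize) < 0) {
		perror("DIOCGMEDIASIZE");
		return (1);
	}
	printf("%s: %jd bytes\n", dev, (intmax_t)mediasize);
	return (0);
}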
Start of problems in syslog
Apr 4 00:00:00 nfs4 newsyslog[986]: logfile turned over due to size>100K
Apr 4 01:11:59 nfs4 rpc.statd: unmon request from localhost, no matching monitor
Apr 4 14:10:44 nfs4 rpc.statd: unmon request from localhost, no matching monitor
Apr 4 14:10:44 nfs4 rpc.statd: unmon request from localhost, no matching monitor
Apr 5 01:08:48 nfs4 kernel: g_vfs_done():aacd1p1[READ(offset=-6163487656308658176, length=32768)]error = 5
Apr 5 01:08:48 nfs4 kernel: g_vfs_done():aacd1p1[READ(offset=-2344660732015456256, length=32768)]error = 5
Apr 5 01:08:48 nfs4 last message repeated 290 times
Apr 5 01:08:48 nfs4 kernel: g_vfs_done():
Apr 5 01:08:48 nfs4 kernel: aacd1p
Apr 5 01:08:48 nfs4 kernel: 1[RE
Apr 5 01:08:48 nfs4 kernel: AD(
Apr 5 01:08:48 nfs4 kernel: off
Apr 5 01:08:48 nfs4 kernel: set
Apr 5 01:08:48 nfs4 kernel: =-
Apr 5 01:08:48 nfs4 kernel: 234
Apr 5 01:08:48 nfs4 kernel: 466
Apr 5 01:08:48 nfs4 kernel: 073
Apr 5 01:08:48 nfs4 kernel: 201
Apr 5 01:08:48 nfs4 kernel: 545
Apr 5 01:08:48 nfs4 kernel: 6256,
Apr 5 01:08:48 nfs4 kernel: len
Apr 5 01:08:48 nfs4 kernel: gth
Apr 5 01:08:48 nfs4 kernel: =32
Apr 5 01:08:48 nfs4 kernel: 76
Apr 5 01:08:48 nfs4 kernel: 8)]
Apr 5 01:08:48 nfs4 kernel: err
Apr 5 01:08:48 nfs4 kernel: or
Apr 5 01:08:48 nfs4 kernel: = 5
Apr 5 01:08:48 nfs4 kernel:
Log of "last message repeated" entries with 10 or more repeats:
Apr 5 01:57:27 nfs4 last message repeated 75 times
Apr 5 01:57:28 nfs4 last message repeated 434 times
Apr 5 01:57:38 nfs4 last message repeated 18848 times
Apr 5 01:57:43 nfs4 last message repeated 9894 times
Apr 5 01:57:45 nfs4 last message repeated 435 times
Apr 5 01:57:45 nfs4 last message repeated 105 times
Apr 5 01:57:45 nfs4 last message repeated 433 times
Apr 5 01:57:45 nfs4 last message repeated 303 times
Apr 5 01:57:46 nfs4 last message repeated 421 times