6.2-Release ..ish.. CF + ata == freeze?

Wed Feb 15 02:10:56 UTC 2012

2 of the 3 cf cards are very new, like less then 6 months old. 

I think around 65-70 percent is in use. This number doesn't change unless the user dumps data in a home dir, which isn't the case so far. 

You are correct that only writes are failing. Msgbuf has more then what I pasted but I'm pretty sure its just more of the same errors. Ill redouble my check. 

The other slices are very small. One is 35 meg the other is 100 some odd meg. H is 1.2 gig.  

I don't know if ill be able to try the dd test for a few reasons but ill check it out. Let me ask you this. Say zeroing out the drive works without error. Does that tell me anything?  

I also don't have access to smart tools as this is basically a closed system and the vendor would never give us access to a complier. Granted I haven't tried just throwing on gcc from 6.2. I could play with that or maybe since said vendor's dev team is keeping track of this thread they could provide said binary :). 

I really don't like the idea of replacing hardware as I'm looking at around 200 boxes. I really hope it doesn't come to that. 

Thanks for the reply!

Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Jeremy Chadwick <freebsd at jdc.parodius.com>
Date: Mon, 13 Feb 2012 21:18:28 
To: john fleming<jflemingeds at yahoo.com>
Cc: freebsd-stable at freebsd.org<freebsd-stable at freebsd.org>
Subject: Re: 6.2-Release ..ish.. CF + ata == freeze?

On Mon, Feb 13, 2012 at 08:43:08PM -0800, john fleming wrote:
> Just thought i would post over here as i'm not getting a warm fuzzy from checkpoint about being able to find the root cause of an issue. I have a large install base of IPSO checkpoint firewalls, which are based on FreeBSD 6.2. I've had 3 firewalls hang basically the same way, with something that looks like a filesystem issue or an?issue with a CF card. 

FreeBSD 6.2 was EOL'd in early-to-mid-2008.  The ATA driver has changed
significantly since then (present-day uses CAM).

> Does anyone happen to know of any bugs (i've been looking around) that could cause something like that? Granted, it could be a batch of bad CF cards, but its odd that i'm seeing the same thing on 3 different boxes and once rebooted they seem ok.
> ?
> Also is it possible to get useful info form the atacontroller when things go south like this from the ddb prompt?

Not particularly.  What's shown below indicates that the driver had
issued some form of ATA write command (there are multiple kinds per ATA
specification), and either the underlying media (CF/disk) or controller
stalled/locked up/took too long.  I forget what the timeout value is in
6.2; I can't be bothered to remember such from 6 years ago.  :-)

> This is what shows in show msgbuf
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> g_vfs_done():ad0s4h[WRITE(offset=33849344, length=131072)]error = 5 
> g_vfs_done():ad0s4h[WRITE(offset=33980416, length=131072)]error = 5 
> g_vfs_done():ad0s4h[WRITE(offset=34111488, length=131072)]error = 5
> ?g_vfs_done():ad0s4h[WRITE(offset=34242560, length=131072)]error = 5 
> g_vfs_done():ad0s4h[WRITE(offset=34373632, length=131072)]error = 5 

error 5 = EIO = Input/output error.  But this isn't too big of a
surprise given the timeouts you see prior.

Are these CF cards brand new -- meaning, are they completely unused
(having never had any writes done to them), or have they been in use a
while?  I'm betting they've been in use a while, and have probably been
doing many writes over the years.

Two things to note here:

1) The errors you've shown are only happening on writes, not reads.  Of
course if you omitted information then this isn't an accurate statement.
2) Timeouts are seen when issuing writes to some LBA regions.

How full is the CF card, disk-space-wise?  Not just ad0s4h, I'm talking
about the entire card.  How much space is roughly available?  They're
very small CF cards (1.8GByte roughly), and the less space available,
the less effectiveness of wear levelling (and in some cases the slower
the writes are).

Reason I ask: given that these are CF cards, this smells of cards which
are simply "worn down".  CF cards have limited numbers of writes, and
the card may be "freaking out" internally when attempting to write to
some LBAs which map to CF sectors that are, in effect, "bad".  The CF
cards' ECC implementation may be buggy, or may simply be "spinning hard"
for too long.  You can read about this sort of behaviour on Wikipedia's
CompactFlash article.

You wouldn't be able to verify this with dd if=/dev/ad0, because those
are read operations.  You could zero the media (dd if=/dev/zero
of=/dev/ad0) as a form of verification if you wanted.

Do you happen to know if these CF cards support SMART?  If so,
installing smartmontools (version 5.42 or newer please) and providing
output from "smartctl -a /dev/ad0" may be helpful to me, but I make no
guarantees anything of use will be shown there.

Overall my advice would be to replace the CF cards, especially if they
have been in use for a long while.  It really doesn't matter to me that
it's happening on 3 machines (honest), especially if these are 6.2
machines with CF cards that have been in use for years.  We're lucky to
get 2 years out of our CF cards on our Juniper M120/320s before they
start spitting I/O errors.  Pick larger CF cards as well; more space =
more room for effective wear levelling.

> ?
> ad0: 1882MB <STEC M2+ CF 9.0.2 K1186-2> at ata0-master PIO4
> atapci0: <Intel 6300ESB UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x5070-0x507f mem 0x80301000-0x803013ff at device 31.1 on pci0
> ata0: <ATA channel 0> on atapci0
> ata1: <ATA channel 1> on atapci0
> atapci1: <Intel 6300ESB SATA150 controller> port 0x5088-0x508f,0x50a4-0x50a7,0x5080-0x5087,0x50a0-0x50a3,0x5060-0x506f irq 15 at device 31.2 on pci0
> ata2: <ATA channel 0> on atapci1
> ata3: <ATA channel 1> on atapci1ad0s4h is basically a r/w ufs partition on the box where almost anything that needs to be written goes.
> trace
> Tracing pid 1101 tid 100043 td 0x656d8460
> kdb_enter(608cc388,6246,656d8460,64ba1400,6095d580,...) at kdb_enter+0x2b
> siointr1(64ba1400) at siointr1+0xf0
> siointr(64ba1400) at siointr+0x38
> intr_execute_handler(6095d580,f0a4ab04,6,6095d580,f0a4aafc,...) at intr_execute_handler+0x61
> intr_execute_handlers(6095d580,f0a4ab04,6,0,656d8460,...) at intr_execute_handlers+0x40
> atpic_handle_intr(4) at atpic_handle_intr+0x96
> Xatpic_intr4() at Xatpic_intr4+0x20
> --- interrupt, eip = 0x606044af, esp = 0xf0a4ab48, ebp = 0xf0a4ab5c ---
> lockmgr(e1456a04,6,0,656d8460) at lockmgr+0x58f
> getdirtybuf(e14569a4,60a405e4,1) at getdirtybuf+0x2e2
> flush_deplist(68b30850,1,f0a4abb8) at flush_deplist+0x30
> flush_inodedep_deps(656fa28c,1f235) at flush_inodedep_deps+0xcf
> softdep_sync_metadata(65964618) at softdep_sync_metadata+0x61
> ffs_syncvnode(65964618,1) at ffs_syncvnode+0x3a2
> ffs_fsync(f0a4ac74) at ffs_fsync+0x12
> VOP_FSYNC_APV(60949260,f0a4ac74) at VOP_FSYNC_APV+0x38
> fsync(656d8460,f0a4acb4) at fsync+0x170
> syscall(805003b,806003b,5fbf003b,8050000,288be450,...) at syscall+0x2ee
> Xint0x80_syscall() at Xint0x80_syscall+0x1f

-- 
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                     http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, US |
| Making life hard for others since 1977.             PGP 4BD6C0CB |