6.2-Release ..ish.. CF + ata == freeze?

Jeremy Chadwick freebsd at jdc.parodius.com
Tue Feb 14 05:18:29 UTC 2012

On Mon, Feb 13, 2012 at 08:43:08PM -0800, john fleming wrote:
> Just thought i would post over here as i'm not getting a warm fuzzy from checkpoint about being able to find the root cause of an issue. I have a large install base of IPSO checkpoint firewalls, which are based on FreeBSD 6.2. I've had 3 firewalls hang basically the same way, with something that looks like a filesystem issue or an?issue with a CF card. 

FreeBSD 6.2 was EOL'd in early-to-mid-2008.  The ATA driver has changed
significantly since then (present-day uses CAM).

> Does anyone happen to know of any bugs (i've been looking around) that could cause something like that? Granted, it could be a batch of bad CF cards, but its odd that i'm seeing the same thing on 3 different boxes and once rebooted they seem ok.
> ?
> Also is it possible to get useful info form the atacontroller when things go south like this from the ddb prompt?

Not particularly.  What's shown below indicates that the driver had
issued some form of ATA write command (there are multiple kinds per ATA
specification), and either the underlying media (CF/disk) or controller
stalled/locked up/took too long.  I forget what the timeout value is in
6.2; I can't be bothered to remember such from 6 years ago.  :-)

> This is what shows in show msgbuf
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> ad0: timeout waiting to issue command
> ad0: error issuing WRITE command
> g_vfs_done():ad0s4h[WRITE(offset=33849344, length=131072)]error = 5 
> g_vfs_done():ad0s4h[WRITE(offset=33980416, length=131072)]error = 5 
> g_vfs_done():ad0s4h[WRITE(offset=34111488, length=131072)]error = 5
> ?g_vfs_done():ad0s4h[WRITE(offset=34242560, length=131072)]error = 5 
> g_vfs_done():ad0s4h[WRITE(offset=34373632, length=131072)]error = 5 

error 5 = EIO = Input/output error.  But this isn't too big of a
surprise given the timeouts you see prior.

Are these CF cards brand new -- meaning, are they completely unused
(having never had any writes done to them), or have they been in use a
while?  I'm betting they've been in use a while, and have probably been
doing many writes over the years.

Two things to note here:

1) The errors you've shown are only happening on writes, not reads.  Of
course if you omitted information then this isn't an accurate statement.
2) Timeouts are seen when issuing writes to some LBA regions.

How full is the CF card, disk-space-wise?  Not just ad0s4h, I'm talking
about the entire card.  How much space is roughly available?  They're
very small CF cards (1.8GByte roughly), and the less space available,
the less effectiveness of wear levelling (and in some cases the slower
the writes are).

Reason I ask: given that these are CF cards, this smells of cards which
are simply "worn down".  CF cards have limited numbers of writes, and
the card may be "freaking out" internally when attempting to write to
some LBAs which map to CF sectors that are, in effect, "bad".  The CF
cards' ECC implementation may be buggy, or may simply be "spinning hard"
for too long.  You can read about this sort of behaviour on Wikipedia's
CompactFlash article.

You wouldn't be able to verify this with dd if=/dev/ad0, because those
are read operations.  You could zero the media (dd if=/dev/zero
of=/dev/ad0) as a form of verification if you wanted.

Do you happen to know if these CF cards support SMART?  If so,
installing smartmontools (version 5.42 or newer please) and providing
output from "smartctl -a /dev/ad0" may be helpful to me, but I make no
guarantees anything of use will be shown there.

Overall my advice would be to replace the CF cards, especially if they
have been in use for a long while.  It really doesn't matter to me that
it's happening on 3 machines (honest), especially if these are 6.2
machines with CF cards that have been in use for years.  We're lucky to
get 2 years out of our CF cards on our Juniper M120/320s before they
start spitting I/O errors.  Pick larger CF cards as well; more space =
more room for effective wear levelling.

> ?
> ad0: 1882MB <STEC M2+ CF 9.0.2 K1186-2> at ata0-master PIO4
> atapci0: <Intel 6300ESB UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x5070-0x507f mem 0x80301000-0x803013ff at device 31.1 on pci0
> ata0: <ATA channel 0> on atapci0
> ata1: <ATA channel 1> on atapci0
> atapci1: <Intel 6300ESB SATA150 controller> port 0x5088-0x508f,0x50a4-0x50a7,0x5080-0x5087,0x50a0-0x50a3,0x5060-0x506f irq 15 at device 31.2 on pci0
> ata2: <ATA channel 0> on atapci1
> ata3: <ATA channel 1> on atapci1ad0s4h is basically a r/w ufs partition on the box where almost anything that needs to be written goes.
> trace
> Tracing pid 1101 tid 100043 td 0x656d8460
> kdb_enter(608cc388,6246,656d8460,64ba1400,6095d580,...) at kdb_enter+0x2b
> siointr1(64ba1400) at siointr1+0xf0
> siointr(64ba1400) at siointr+0x38
> intr_execute_handler(6095d580,f0a4ab04,6,6095d580,f0a4aafc,...) at intr_execute_handler+0x61
> intr_execute_handlers(6095d580,f0a4ab04,6,0,656d8460,...) at intr_execute_handlers+0x40
> atpic_handle_intr(4) at atpic_handle_intr+0x96
> Xatpic_intr4() at Xatpic_intr4+0x20
> --- interrupt, eip = 0x606044af, esp = 0xf0a4ab48, ebp = 0xf0a4ab5c ---
> lockmgr(e1456a04,6,0,656d8460) at lockmgr+0x58f
> getdirtybuf(e14569a4,60a405e4,1) at getdirtybuf+0x2e2
> flush_deplist(68b30850,1,f0a4abb8) at flush_deplist+0x30
> flush_inodedep_deps(656fa28c,1f235) at flush_inodedep_deps+0xcf
> softdep_sync_metadata(65964618) at softdep_sync_metadata+0x61
> ffs_syncvnode(65964618,1) at ffs_syncvnode+0x3a2
> ffs_fsync(f0a4ac74) at ffs_fsync+0x12
> VOP_FSYNC_APV(60949260,f0a4ac74) at VOP_FSYNC_APV+0x38
> fsync(656d8460,f0a4acb4) at fsync+0x170
> syscall(805003b,806003b,5fbf003b,8050000,288be450,...) at syscall+0x2ee
> Xint0x80_syscall() at Xint0x80_syscall+0x1f

| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                     http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, US |
| Making life hard for others since 1977.             PGP 4BD6C0CB |

More information about the freebsd-stable mailing list