ZFS Crash

Fri May 29 17:44:47 UTC 2009

On Thu, 28 May 2009, Larry Rosenman wrote:

> On Thu, 28 May 2009, Kip Macy wrote:
>
>> On Tue, May 26, 2009 at 5:04 AM, Larry Rosenman <ler at lerctr.org> wrote:
>>> On Mon, 25 May 2009, Larry Rosenman wrote:
>>> 
>>>> On Mon, 25 May 2009, Larry Rosenman wrote:
>>>> 
>>>>> after looking at the code, never mind the "don't call doadump", so we'll
>>>>> get the textdump.
>>>>> 
>>>>> Thanks rwatson for the textdump stuff!
>>>>> 
>>>> Here are the current stats before we crash.  Does any of this look
>>>> totally out of line?
>>>> 
>>> It crashed again, but did *NOT* make it into ddb enough to do the 
>>> textdump.
>>> 
>>> It was hung with the backtrace (looks like the same, but I couldn't
>>> scroll the screen back).
>>> 
>>> Ideas?
>>> 
>>> I'm really concerned that there is a problem.
>>> 
>>> 
>>> 
>> 
>> 
>> - Type of disks?
> 6 SATA Seagate 400GB (5) / 500 GB (1).
>
>
> ATA channel 0:
>    Master: acd0 <Memorex DVD+-RAM 510L v1/MWS7> ATA/ATAPI revision 7
>    Slave:       no device present
> ATA channel 2:
>    Master:  ad4 <ST3400620AS/3.AAJ> SATA revision 2.x
>    Slave:       no device present
> ATA channel 3:
>    Master:  ad6 <ST3400620AS/3.AAJ> SATA revision 2.x
>    Slave:       no device present
> ATA channel 4:
>    Master:  ad8 <ST3500630AS/3.AAE> SATA revision 2.x
>    Slave:       no device present
> ATA channel 5:
>    Master: ad10 <ST3400620AS/3.AAJ> SATA revision 2.x
>    Slave:       no device present
> ATA channel 6:
>    Master: ad12 <ST3400620AS/3.AAJ> SATA revision 2.x
>    Slave:       no device present
> ATA channel 7:
>    Master: ad14 <ST3400620AS/3.AAJ> SATA revision 2.x
>    Slave:       no device present
>> 
>> 
>> - Size of zpools?
> All 6.
>
>  pool: vault
> state: ONLINE
> status: One or more devices has experienced an error resulting in data
> 	corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
> 	entire pool from backup.
>   see: http://www.sun.com/msg/ZFS-8000-8A
> scrub: none requested
> config:
>
> 	NAME        STATE     READ WRITE CKSUM
> 	vault       ONLINE       0     0     0
> 	  raidz1    ONLINE       0     0     0
> 	    ad6     ONLINE       0     0     0
> 	    ad8     ONLINE       0     0     0
> 	    ad10    ONLINE       0     0     0
> 	    ad12    ONLINE       0     0     0
> 	    ad14    ONLINE       0     0     0
> 	  ad4s1f    ONLINE       0     0     0
> 	  ad4s1e    ONLINE       0     0     0
> 	  ad4s1d    ONLINE       0     0     0
>
> errors: 10 data errors, use '-v' for a list
>
>
>  pool: vault
> state: ONLINE
> status: One or more devices has experienced an error resulting in data
> 	corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
> 	entire pool from backup.
>   see: http://www.sun.com/msg/ZFS-8000-8A
> scrub: none requested
> config:
>
> 	NAME        STATE     READ WRITE CKSUM
> 	vault       ONLINE       0     0     0
> 	  raidz1    ONLINE       0     0     0
> 	    ad6     ONLINE       0     0     0
> 	    ad8     ONLINE       0     0     0
> 	    ad10    ONLINE       0     0     0
> 	    ad12    ONLINE       0     0     0
> 	    ad14    ONLINE       0     0     0
> 	  ad4s1f    ONLINE       0     0     0
> 	  ad4s1e    ONLINE       0     0     0
> 	  ad4s1d    ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
>        /usr/local/sbin/p4d
>        /var/db/bacula/borg-dir.conmsg
>        vault/usr/obj:<0x16c3a>
>        vault/usr/obj:<0x169bb>
>        /usr/obj/usr/src/lib/libc/random.o
>
>> 
>> 
>> - Compression enabled?
> Yes.
>
>
>

Ok, it just crashed.  Unfortunately, I'm at work and the box is at home.

I did have my script running every minute of that entire boot.

What I saw was a full backup running; then the system started paging, and
the backup jobs got pager errors and were killed.

I'm not sure what else went on.  I restarted the bacula daemons that had
been killed, and was in the bacula console when the box died.

I'll see if I can get a cell-phone camera shot of the console.

I'll also tar up the vmstat outputs and put them on my web server.
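
For anyone wanting to collect similar per-minute data, here is a minimal
sketch of such a logger.  It is not the actual script from this thread.  The
command list, the function name snapshot(), and the log path are assumptions
chosen for illustration; only vmstat is mentioned above.  Each sample is
fsync'ed so that the last minute before a hang has a chance of reaching the
disk, which only helps if the log lives on a filesystem that is still
responding.

```python
import os
import subprocess
import time

# Hypothetical command list (assumption): adjust to what the box provides.
COMMANDS = [
    ["vmstat"],
    ["vmstat", "-z"],
    ["sysctl", "vm.kmem_size", "kstat.zfs.misc.arcstats"],
]


def snapshot(commands, log_path):
    """Append one timestamped sample per command, then force it to disk."""
    with open(log_path, "a") as log:
        for cmd in commands:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write("=== %s  %s ===\n" % (stamp, " ".join(cmd)))
            try:
                result = subprocess.run(
                    cmd,
                    stdout=subprocess.PIPE,
                    stderr=subprocess.STDOUT,
                    universal_newlines=True,
                    timeout=30,
                )
                log.write(result.stdout)
            except (OSError, subprocess.TimeoutExpired) as exc:
                # A missing or hung command should not stop the logger.
                log.write("FAILED: %s\n" % exc)
        # Push the data out of the process and ask the kernel to write it.
        log.flush()
        os.fsync(log.fileno())


def run_forever(log_path="/var/tmp/zfs-stats.log", interval=60):
    """Take a snapshot every `interval` seconds until interrupted."""
    while True:
        snapshot(COMMANDS, log_path)
        time.sleep(interval)
```

Calling run_forever() from a small wrapper started at boot would give one
sample per minute; the log can then be tarred up after the next reboot.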

What other forensics should I get?  Bear in mind the system is probably
locked up with no dump taken :(

-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 512-248-2683                 E-Mail: ler at lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893