kern/153771: Unkillable process after a SCSI tape write error

Fri Jan 7 18:50:06 UTC 2011

>Number:         153771
>Category:       kern
>Synopsis:       Unkillable process after a SCSI tape write error
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Jan 07 18:50:05 UTC 2011
>Closed-Date:
>Last-Modified:
>Originator:     Martin Simmons
>Release:        8.0-RELEASE-p6
>Organization:
>Environment:
FreeBSD lwfs1-cam.cam.lispworks.com 8.0-RELEASE-p6 FreeBSD 8.0-RELEASE-p6 #0: Wed Dec 22 16:21:50 GMT 2010     root at lwfs1-cam.cam.lispworks.com:/usr/obj/usr/src/sys/GENERICLOCAL  amd64
>Description:
I'm not 100% sure of the sequence of system calls, because they are generated by Bacula, but here is what I think happened.

The kernel is built from the 8.0-RELEASE-p6 GENERIC with a small patch for nfs.

The hardware consists of an Intel chassis with an HP LTO-1 tape drive connected via an LSI U160 SCSI controller:

Jan  3 23:06:24 lwfs1-cam kernel: sym0: <1010-33> port 0x1100-0x11ff mem 0xb1904000-0xb19043ff,0xb1900000-0xb1901fff irq 16 at device 0.0 on pci7
Jan  3 23:06:24 lwfs1-cam kernel: sym0: Symbios NVRAM, ID 7, Fast-80, LVD, parity checking
Jan  3 23:06:24 lwfs1-cam kernel: sym0: open drain IRQ line driver, using on-chip SRAM
Jan  3 23:06:24 lwfs1-cam kernel: sym0: using LOAD/STORE-based firmware.
Jan  3 23:06:24 lwfs1-cam kernel: sym0: handling phase mismatch from SCRIPTS.
Jan  3 23:06:24 lwfs1-cam kernel: sym0: [ITHREAD]
Jan  3 23:06:24 lwfs1-cam kernel: sym1: <1010-33> port 0x1000-0x10ff mem 0xb1905000-0xb19053ff,0xb1902000-0xb1903fff irq 17 at device 0.1 on pci7
Jan  3 23:06:24 lwfs1-cam kernel: sym1: Symbios NVRAM, ID 7, Fast-80, SE, parity checking
Jan  3 23:06:24 lwfs1-cam kernel: sym1: open drain IRQ line driver, using on-chip SRAM
Jan  3 23:06:24 lwfs1-cam kernel: sym1: using LOAD/STORE-based firmware.
Jan  3 23:06:24 lwfs1-cam kernel: sym1: handling phase mismatch from SCRIPTS.
Jan  3 23:06:24 lwfs1-cam kernel: sym1: [ITHREAD]
Jan  3 23:06:24 lwfs1-cam kernel: sa0 at sym0 bus 0 target 3 lun 0
Jan  3 23:06:24 lwfs1-cam kernel: sa0: <HP Ultrium 1-SCSI P61D> Removable Sequential Access SCSI-3 device 
Jan  3 23:06:24 lwfs1-cam kernel: sa0: 160.000MB/s transfers (80.000MHz DT, offset 62, 16bit)

There was a MEDIUM ERROR during a backup to the LTO-1 drive, but the write system call returned as if the error hadn't happened (AFAICS).  The next write was reported as failing with EPERM and a subsequent ioctl to rewind the tape hung in an unkillable state.

These syslog messages were generated

Jan  3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): WRITE FILEMARKS(6). CDB: 10 0 0 0 1 0 
Jan  3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): CAM Status: SCSI Status Error
Jan  3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): SCSI Status: Check Condition
Jan  3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): MEDIUM ERROR info:1 asc:c,0
Jan  3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): Write error
Jan  3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): Retries Exhausted
Jan  3 00:27:18 lwfs1-cam kernel: sym0:3:control msgout: 80 6.
Jan  3 00:31:18 lwfs1-cam kernel: sym0:3:control msgout: 80 6.
Jan  3 00:35:18 lwfs1-cam kernel: sym0:3:control msgout: 80 6.
Jan  3 01:23:43 lwfs1-cam kernel: sym0:3:control msgout: 80 6.
Jan  3 02:12:08 lwfs1-cam kernel: sym0:3:control msgout: 80 6.
Jan  3 03:00:33 lwfs1-cam kernel: sym0:3:control msgout: 80 6.
Jan  3 03:49:02 lwfs1-cam kernel: sym0:3:control msgout: 80 6.
Jan  3 04:37:27 lwfs1-cam kernel: sym0:3:control msgout: 80 6.

The message at 00:31:18 corresponds to when it tried to rewind the tape and the last message then repeats roughly every 50 minutes.

The output of procstat -k -k for the unkillable backup process is:

  PID    TID COMM             TDNAME           KSTACK                       
 1166 100179 bacula-sd        -                mi_switch+0x16f sleepq_catch_signals+0x31f sleepq_wait_sig+0xc _cv_wait_sig+0x120 seltdwait+0xf6 kern_select+0x62f select+0x5d syscall+0x246 Xfast_syscall+0xe1 
 1166 100259 bacula-sd        -                mi_switch+0x16f sleepq_catch_signals+0x31f sleepq_timedwait_sig+0x19 _sleep+0x1a3 do_cv_wait+0x640 __umtx_op_cv_wait+0x5c syscall+0x246 Xfast_syscall+0xe1 
 1166 100316 bacula-sd        -                mi_switch+0x16f sleepq_wait+0x42 _sleep+0x31c cam_periph_runccb+0x4e sarewind+0x6c saioctl+0x977 giant_ioctl+0x7d devfs_ioctl_f+0x77 kern_ioctl+0xf6 ioctl+0xfd syscall+0x246 Xfast_syscall+0xe1 

Attempts to attach gdb to the process caused gdb to hang (but it was killable).

Attempts to use "camcontrol periphlist sa0" and "camcontrol reset 0:3:0" also hung in an unkillable way.

"camcontrol reset 0" gave "Reset of bus 0 was successful" but didn't fix the hanging processes.

The machine has been rebooted since then so I can't test anything else directly.

>How-To-Repeat:
I don't know if it is repeatable.

>Fix:

>Release-Note:
>Audit-Trail:
>Unformatted: