kern/54616: System hangs writing CD-Rs with "atapicam" device

Fri Jul 18 09:00:26 PDT 2003

>Number:         54616
>Category:       kern
>Synopsis:       System hangs writing CD-Rs with "atapicam" device
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Jul 18 09:00:24 PDT 2003
>Closed-Date:
>Last-Modified:
>Originator:     Daniel Lang
>Release:        FreeBSD 4.8-STABLE i386
>Organization:
TU Muenchen
>Environment:
System: FreeBSD atrbg11.informatik.tu-muenchen.de 4.8-STABLE FreeBSD 4.8-STABLE #20: Thu Jul 17 13:38:00 CEST 2003 root at atrbg11.informatik.tu-muenchen.de:/usr/obj/usr/src/sys/ATRBG11 i386

Dmesg excerpt:
[..]
atapci0: <VIA 82C686 ATA66 controller> port 0xffa0-0xffaf at device 4.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
[..]
sym0: <895> port 0xd800-0xd8ff mem 0xefffa000-0xefffafff,0xefffff00-0xefffffff irq 11 at device 13.0 on pci0
sym0: Tekram NVRAM, ID 7, Fast-40, LVD, parity checking
[..]
da1 at sym0 bus 0 target 2 lun 0
da1: <FUJITSU MAM3367MP 0106> Fixed Direct Access SCSI-3 device 
da1: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da1: 35044MB (71770616 512 byte sectors: 255H 63S/T 4467C)
cd1 at ata1 bus 0 target 0 lun 0
cd1: <HL-DT-ST CD-RW GCE-8520B 1.00> Removable CD-ROM SCSI-0 device 
cd1: 16.000MB/s transfers
cd1: Attempt to query device size failed: NOT READY, Medium not present - tray closed
cd0 at ata0 bus 0 target 0 lun 0
cd0: <TOSHIBA CD-ROM XM-6602B 1017> Removable CD-ROM SCSI-0 device 
cd0: 16.000MB/s transfers
cd0: Attempt to query device size failed: NOT READY, Medium not present
da0 at sym0 bus 0 target 1 lun 0
da0: <IBM DNES-318350W SA30> Fixed Direct Access SCSI-3 device 
da0: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da0: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)
[..]

camcontrol devlist:
<IBM DNES-318350W SA30>            at scbus0 target 1 lun 0 (pass0,da0)
<FUJITSU MAM3367MP 0106>           at scbus0 target 2 lun 0 (pass1,da1)
<TOSHIBA CD-ROM XM-6602B 1017>     at scbus1 target 0 lun 0 (pass2,cd0)
<HL-DT-ST CD-RW GCE-8520B 1.00>    at scbus2 target 0 lun 0 (pass3,cd1)

# sysctl hw.ata.atapi_dma:
hw.ata.atapi_dma: 0

>Description:

Using the system described above, writing CD-Rs with either
"cdrecord" (cdrtools-2.0_1) or "cdrdao" (cdrdao-1.1.7_4)
hangs the system if the writing speed is > 12, and more likely
if it is >= 32.

I already contacted the maintainer of atapicam,
Thomas Quinot <thomas at FreeBSD.ORG>, and the maintainer
of the cdrdao port, Marius Strobl <marius at alchemy.franken.de>.
However, I am filing this PR to better coordinate
input and to collect an audit trail.

I attach below the relevant conversation that has already
taken place (various repetitions/quotes omitted):

========Previous Discussions Follow===================

Daniel:
---------------------------------------
Now, I would like to burn CD-Rs with cdrdao, like this:

cdrdao write --device 2,0,0 --driver generic-mmc ...
or
cdrecord dev=2,0,0 ...

I successfully burned 2 CD-Rs and scrapped about 5 because,
after a varying number of written blocks, the system suddenly hung.

One time I had a complete freeze and could only press the reset button.
The other times, the recording/burning process stopped, but I could
still move the mouse and type things into my xterms.
However, every command that needed to access the filesystems
hung the shell: using filename completion, or trying to
use 'top', 'ps' or even reboot, resulted in a hang of the shell.
Soon all my xterms were unusable and again I needed to press reset.
Any clues or suggestions on how to debug this situation further?
As there was no panic, I could not get a stack trace.
I still have acdX devices; maybe they don't get along well with
the SCSI devices?

burncd using /dev/acd1c did work very well, though.
-----------------------------------------------

Daniel (own f'up):
-----------------------------------------------
Hi Thomas,

here is an update:

Daniel Lang wrote on Fri, Jul 04, 2003 at 11:49:35AM +0200:
[..]
> acd0: CDROM <TOSHIBA CD-ROM XM-6602B> at ata0-master PIO4
> acd1: CD-RW <HL-DT-ST GCE-8520B> at ata1-master PIO4
[..]

I've removed the ATAPI CD (and disk) devices from the kernel.
At first I thought this had solved the problem,
but then it hung again. :(
------------------------------------------------

Thomas (atapicam maintainer):
------------------------------------------------
Having acd in the kernel should not hurt as long as you are not *using*
acd and cd simultaneously to access the same device.

Your problem sounds like the ATA bus hanging. When it hangs and you can
still access your xterms, it would be nice to see what dmesg says.
You can also try to trigger the hang outside of X, and see if there are
messages on the console. Another possible solution is to drop into DDB
using Ctrl+Alt+Esc, and then manually trigger a panic.
-------------------------------------------------

Daniel:
-------------------------------------------------
[..]
So, back in the office, I went back to the issue. Before anything
else, I've updated my system to 4.8-STABLE, then built a
debugging kernel with options DDB.

[..]
# camcontrol devlist
<IBM DNES-318350W SA30>            at scbus0 target 1 lun 0 (pass0,da0)
<FUJITSU MAM3367MP 0106>           at scbus0 target 2 lun 0 (pass1,da1)
<TOSHIBA CD-ROM XM-6602B 1017>     at scbus1 target 0 lun 0 (pass2,cd0)
<HL-DT-ST CD-RW GCE-8520B 1.00>    at scbus2 target 0 lun 0 (pass3,cd1)

The last one is the CD writer I used with the following command:

# cdrdao write --device 2,0,0 --driver generic-mmc --speed 48 -v 2 -n bla.toc

After two successful writes, the system hung again, as before.
_No_ messages on the console.

I entered the debugger with Ctrl-Alt-Esc and did a trace.
I did not copy everything, because I thought I could use gdb -k
later on (I was wrong). However, what I've saved from the trace was:

Apparently the system hung in

camisr(c02f3250,c02b7078,c253aa3,0,10) at camisr+0x8f

eip: 0xc01279d7, esp: 0xc0297008, ebp: 0xc0297020

Please advise what to examine and how.

Forcing a panic did not work. I could call panic, but
'continue' did not write a crash dump to disk; it hung again.

'call boot(0)' did not work either. From then on, I could not even
get back into the debugger and had to hit the reset button.

So it seems that SCSI disk operations are not working either,
so maybe it is not the ATA bus that is hanging but the SCSI subsystem?
Remote-GDB debugging is not an option, unfortunately. I don't have
another RELENG_4 machine ready. However, I have a laptop with
some half-working 5.1-CURRENT. There are no 4.x sources on it...
maybe it could work, if I copy the source tree, but I would like
to have some confirmation, that it works, before I put effort
into this.
-------------------------------------------------

Daniel (again):
-------------------------------------------------
I managed to do some more investigations.

Daniel Lang wrote on Tue, Jul 15, 2003 at 04:25:56PM +0200:
[..]
> After two successful writes, the system hung again, as before.
> _No_ messages on the console.

I found out that the hangs do not appear (or are far less likely)
if the writing speed used is <= 12. But they seem very likely
to occur if the (attempted) writing speed is around 48.

Maybe this is an important hint. Although the drive and the
media claim to support speed 48, it seems that the
overall throughput is in fact slower, around 20-30 as far as
I can tell. I still manage to write a CD now and then using this
setting, but maybe after a while something gets confused if
the application tries to sustain a high writing speed while
the drive (or the rest of the system, bus, etc.) cannot keep up
with it. Does this sound reasonable, or am I poking around in utter
darkness here?
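
For orientation, the sustained data rates implied by these speeds can
be estimated with a little arithmetic. This is only a rough sketch: it
assumes Mode 1 data sectors (1x = 75 sectors/s with 2048 bytes of user
data each), and the function name cd_rate_bytes is made up for this
illustration. Audio or raw sectors (2352 bytes) would raise the
figures by roughly 15%.

```sh
#!/bin/sh
# Approximate sustained data rate for a given CD write speed.
# Assumption: Mode 1 data, i.e. 75 sectors/s at 1x, 2048 bytes per sector.
cd_rate_bytes() {
    echo $(( $1 * 75 * 2048 ))
}

for speed in 12 32 48; do
    rate=$(cd_rate_bytes "$speed")
    echo "${speed}x: ${rate} bytes/s ($(( rate / 1024 )) KiB/s)"
done
```

Under these assumptions even 48x (about 7200 KiB/s) stays below the
nominal 16.000MB/s reported for cd1 in the dmesg excerpt, so raw bus
bandwidth alone would not obviously explain the hang. That is an
inference from the arithmetic, not a measurement.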

> I entered the debugger with Ctrl-Alt-Esc and did a trace.
> I did not copy everything, because I thought I could use gdb -k
> later on (I was wrong). However, what I've saved from the trace was:
> 
> Apparently the system hung in
> 
> camisr(c02f3250,c02b7078,c253aa3,0,10) at camisr+0x8f
> 
> eip: 0xc01279d7, esp: 0xc0297008, ebp: 0xc0297020
> 
> Please advise what to examine and how.
[..]
> Remote-GDB debugging is not an option, unfortunately. I don't have
[..]
I withdraw that statement! I did set up a remote gdb session
successfully!

But it was sort of useless.

After the system hung again, I used Ctrl-Alt-Esc to enter DDB.
I fired up the remote gdb and told it to remote connect.
Then I issued the 'gdb' command to DDB.
The remote gdb took over and I was in control.
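
For reference, the setup behind these steps can be sketched roughly as
follows. This is a sketch under assumptions, not the exact
configuration used here: the serial port (sio0 on the hung machine,
/dev/cuaa0 on the debugging host) and the kernel.debug file name are
illustrative, and details may differ between 4.x and 5.x.

```
# Kernel configuration on the machine being debugged (assumed fragment)
options         DDB                 # in-kernel debugger
makeoptions     DEBUG=-g            # build the kernel with debug symbols
device          sio0 at isa? port IO_COM1 flags 0x80 irq 4
                                    # flags 0x80: allow remote gdb on this port

# On the debugging host, connected via null-modem cable
# gdb -k kernel.debug
(kgdb) target remote /dev/cuaa0

# On the hung machine, after Ctrl-Alt-Esc
db> gdb
```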

But it seems useless, because the stack only contained
the DDB routines. I include the (seemingly useless)
backtrace here:

Program received signal SIGTRAP, Trace/breakpoint trap.
Debugger (msg=0xc02a7c49 "manual escape to debugger")
    at /usr/src/sys/i386/i386/db_interface.c:319
319              * XXX
(kgdb) bt
#0  Debugger (msg=0xc02a7c49 "manual escape to debugger")
	at /usr/src/sys/i386/i386/db_interface.c:319
#1  0xc024ce92 in scgetc (sc=0xc030cb20, flags=2)
	at /usr/src/sys/dev/syscons/syscons.c:3164
#2  0xc0249645 in sckbdevent (thiskbd=0xc0305540, event=0, arg=0xc030cb20)
	at /usr/src/sys/dev/syscons/syscons.c:617
#3  0xc0240ea6 in atkbd_intr (kbd=0xc0305540, arg=0x0)
	at /usr/src/sys/dev/kbd/atkbd.c:462
#4  0xc026c48c in atkbd_isa_intr (arg=0xc0305540)
	at /usr/src/sys/isa/atkbd_isa.c:140
#5  0xc02531ef in Xresume1 ()

How do I get to the hanging routine from here?

I'm willing to trace the problem from here, but I need advice
on how to proceed.
-------------------------------------------------

Thomas:
-------------------------------------------------
> I found out that the hangs do not appear (or are far less likely)
> if the writing speed used is <= 12. But they seem very likely
> to occur if the (attempted) writing speed is around 48.

Hum, nasty, nasty. Looks like the number of interrupts caused by
high-speed burning might trigger a race condition between two
instances of camisr(). camisr() does splcam(), but maybe this is not
sufficient to correctly prevent concurrent execution when interrupts
from the ATA driver occur. Maybe the freebsd-scsi people will have a
clearer idea of what is going on here.
-------------------------------------------------

==============================================
So much for that. I got further hints from
Marius Strobl <marius at alchemy.franken.de>, who maintains
the "cdrdao" port.
Since the emails were in German, I post a translated summary:

Marius suggests the following:

- try cdrdao with WITHOUT_SCGLIB=yes (in case the scglib interface
  to SCSI causes the problem).

- update scglib in cdrdao to the current version as in cdrtools,
  since cdrdao uses an older version.

Daniel> Hmmm, the second is of questionable value, since a fairly recent
        cdrtools also showed the problem.

Daniel> Why are there different versions of cdrdao, with and without
        scglib? Any reasons to keep scglib in cdrdao? -> SCSI folk

Marius further suggests:

- try DMA mode for the ATAPI CD-RW; it should generate fewer interrupts,
  thus maybe avoiding (but not solving?) the problem.

  => Will try hw.ata.atapi_dma=1
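
  A minimal sketch of how this could be set, assuming that on 4.x
  hw.ata.atapi_dma is a boot-time loader tunable rather than a
  sysctl that can be changed at runtime (check ata(4) on the system
  in question):

```
# /boot/loader.conf (assumed location for the tunable)
hw.ata.atapi_dma="1"
```

  After a reboot, the "sysctl hw.ata.atapi_dma" query shown in the
  Environment section should report 1, and the acd/cd probe lines in
  dmesg should no longer say PIO4 if DMA was actually negotiated.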

================================================

That is all the information so far. I will add whatever more
I can find out in my future experiments.

Best regards,
 Daniel

>How-To-Repeat:

On a system as above (4.8-STABLE, options atapicam, fast ATAPI CD-RW,
two SCSI disks), try to write CD-Rs at high speed using the CAM
interface.

>Fix:

Not known yet.

>Release-Note:
>Audit-Trail:
>Unformatted: