ATA DMA dump failures: take 2
Dmitry Pryanishnikov
dmitry at atlantis.dp.ua
Sat Feb 25 06:44:17 PST 2006
Hello!
I'm trying to find out why the new ATA DMA dump code in CURRENT fails under
some conditions. My conditions are IMHO very common: I issue
cd /usr/ports/editors/openoffice.org-2.0
NOCLEANDEPENDS=yes make extract clean
(just to create and then delete a LOT of files) on an ASUS M5A notebook
with "only" 256MB of RAM. This reliably panics my system during the clean
pass, when the softupdates code runs into a shortage of kmem_map:
panic: kmem_malloc(4096): kmem_map too small: 82014208 total allocated
Before trying to understand how to tune my system better (alas, tuning(7)
doesn't mention kmem_map at all) I'm trying to obtain a crash dump, but
I'm just getting the infamous "FAILURE - out of memory in start" error in
ad_strategy. OK, it's very unwise to rely on the availability of kernel
memory in situations like mine. But we can easily guard against it by
preallocating a spare "struct ata_request". I've created a simple patch:
ftp://external.atlantis.dp.ua/FreeBSD/CURRENT/nodump/ata-disk.c.patch
which solves this allocation problem and instruments the code in order to
understand the code flow. Note that it's unclear to me _what_ guarantees
that ad_strategy() will always finish its job, so I've added a check
for BIO_DONE. Actually, this check has failed once, and my system
just kept printing '.'s (the request never finished).
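
To illustrate the idea (this is NOT the patch above, just a user-space
sketch), here is how a preallocated spare request and a bounded wait for
completion could fit together. All names here (struct dump_request,
request_alloc, wait_done, simulate_oom, the poll_* helpers) are
hypothetical and do not exist in the ata driver; the "done" field only
plays the role of BIO_DONE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the driver's "struct ata_request". */
struct dump_request {
    unsigned long lba;
    size_t        count;
    bool          done;   /* plays the role of the BIO_DONE flag */
};

/* One spare request reserved up front, before memory gets scarce. */
static struct dump_request spare_request;
static bool spare_in_use;

/* Test knob: pretend the normal allocator is out of memory. */
static bool simulate_oom;

static struct dump_request *request_alloc(void)
{
    struct dump_request *req = simulate_oom ? NULL : calloc(1, sizeof(*req));

    if (req == NULL && !spare_in_use) {
        /* Fall back to the preallocated spare instead of failing. */
        memset(&spare_request, 0, sizeof(spare_request));
        spare_in_use = true;
        req = &spare_request;
    }
    return req;   /* NULL only if the spare is already taken as well */
}

static void request_free(struct dump_request *req)
{
    assert(req != NULL);
    if (req == &spare_request)
        spare_in_use = false;
    else
        free(req);
}

/* Hypothetical hardware stand-ins: one completes, one never does. */
static void poll_complete(struct dump_request *req) { req->done = true; }
static void poll_stuck(struct dump_request *req)    { (void)req; }

/* Poll a bounded number of times instead of spinning forever. */
static bool wait_done(struct dump_request *req, int max_polls,
                      void (*poll)(struct dump_request *))
{
    for (int i = 0; i < max_polls && !req->done; i++)
        poll(req);
    return req->done;
}
```

With this shape the dump path can report an error when wait_done()
returns false, rather than printing dots indefinitely.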
But the most serious problem is that in more than 90% of cases I don't even
reach printf("}"); at all! I just get another "panic: double fault" instead.
Look at the pictures DSCN1971-4 in the same folder as the patch. On the 1st
picture you can see that the panic happens during the execution of
ad_strategy() (there is a "{" w/o a matching "}"). On the 2nd you see the
start of the 'bt' output. I've no idea about trap 0x17 - is it a stack
overflow or something else? On the 3rd you can see what the main part of
the stack is filled with: a repetitive sequence of nested
ata_start()
ata_interrupt()
ata_finish()
ata_completed()
The 4th picture shows the point where the initial ad_dump() call takes
place. My theory is that the ata driver tries to finish off all queued I/O
requests and runs out of stack. And the question here is whether the driver
should try to complete those previously queued requests at all: the OS has
just crashed, so the data (and disk block numbers!) in those requests can be
invalid.
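
To show why such nesting would eat the stack (again only a user-space
sketch of my theory, not driver code): if each completion restarts the
queue from inside the completion handler, the nesting depth grows with the
number of queued requests, while a plain loop keeps it constant. The names
start_nested, start_flat and measure are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical depth counters; not part of the real ATA driver. */
static int depth, max_depth;

static void enter(void) { if (++depth > max_depth) max_depth = depth; }
static void leave(void) { assert(depth > 0); depth--; }

/*
 * Nested style: completing one request starts the next one from inside
 * the completion handler, so no frame returns until the queue is empty.
 */
static void start_nested(int queued)
{
    if (queued == 0)
        return;
    enter();                   /* models start -> interrupt -> finish -> completed */
    start_nested(queued - 1);  /* the completion handler restarts the queue */
    leave();
}

/* Flat style: a loop drains the queue; each request returns first. */
static void start_flat(int queued)
{
    while (queued-- > 0) {
        enter();
        leave();
    }
}

/* Run one drain strategy and report the deepest nesting it reached. */
static int measure(void (*drain)(int), int queued)
{
    depth = max_depth = 0;
    drain(queued);
    return max_depth;
}
```

Here measure(start_nested, n) grows linearly with n, while
measure(start_flat, n) stays at 1 for any non-empty queue.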
My main question is whether the dump speed increase is worth the loss of
dump robustness. I think it's not. Alas, this new dump code has already been
committed to RELENG_6, so IMHO we should try to fix this issue before the
upcoming 6.1-RELEASE. Being unable to obtain a crash dump can make a
developer's life really difficult. IMHO we should try to make the new code
robust (so it won't fail in the case of OS resource shortages), but if we
fail, the good old (slow but always working) dump code should be restored.
Sincerely, Dmitry
--
Atlantis ISP, System Administrator
e-mail: dmitry at atlantis.dp.ua
nic-hdl: LYNX-RIPE