mfi and "copy out failed" messages

Charles Sprickman spork at bway.net
Fri May 4 05:29:20 UTC 2012


I'm wondering if anyone has some interest in this issue.  I think I recently tracked down the cause of a long-standing fs corruption and panic issue on a Dell 2970 that I was never able to solve:

http://lists.freebsd.org/pipermail/freebsd-fs/2010-July/008858.html (there are other threads, but that's the gist of the issue)

I'd read in various threads that "mfiX: Copy out failed" was a harmless message.  But recently I started thinking that there had to be some relation between those messages and the panics.  The timing fits - I had megacli performing a status check on the controller in a periodic script that kicked off with the daily run.  Most of my panics happened during or shortly after the daily run, and the "Copy out failed" messages always corresponded to megacli being run.

132 days ago I removed the daily megacli check, and the box has not had a kernel panic since.  Before that, my longest uptime was no more than a few months.  While this is by no means 100% definitive, it sure seems like something is going on here.  My best guess is that megacli and/or the mfi driver are interacting in a bad way, and that the "Copy out failed" message indicates that something that should have reached the controller did not.  My earlier assumption was that it was just some control message from megacli that didn't make it, but now I'm thinking it's a request to write actual data to the drive that's failing.

As a reminder, the card in question is:

mfi0: <Dell PERC 6> port 0xec00-0xecff mem 0xe9f80000-0xe9fbffff,0xe9fc0000-0xe9ffffff irq 37 at device 0.0 on pci7
mfi0: 3049 (boot + 3s/0x0020/info) - Firmware version 1.22.02-0612
mfi0: 3051 (boot + 23s/0x0020/info) - Controller hardware revision ID (0x0)
mfi0: 3052 (boot + 23s/0x0020/info) - Package version 6.2.0-0013

If anyone with knowledge of the mfi driver would like to comment, I'd very much appreciate it.  This box is going to be repurposed in the coming months as an ESXi host to hold some backup/standby VMs, but before that I would not mind taking some time to test any patches, extra debug printfs in mfi, etc.  I suspect I can probably trigger the panic pretty easily by mimicking the daily run conditions - just kick off a find from "/" and then repeatedly loop the megacli command to check the array health.  
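
For anyone who wants to try reproducing this, the test I have in mind could be sketched roughly as below.  This is only an illustration, not something that has been run on the affected box: the function name reproduce() and its parameters are made up for the example, and the exact megacli status command is deliberately left as an argument to fill in with whatever the periodic script was running.  It also assumes a Python 3 interpreter is available on the machine.

```python
import subprocess
import sys
import time


def reproduce(status_cmd, walk_cmd, max_checks=1000, delay=1.0):
    """Run walk_cmd in the background and repeat status_cmd while it runs.

    status_cmd and walk_cmd are argument lists, e.g. ["find", "/"].
    Returns the list of exit codes, one per status_cmd run.
    """
    # The filesystem walk generates disk I/O, standing in for the daily run.
    walker = subprocess.Popen(
        walk_cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    exit_codes = []
    try:
        # Keep polling the controller until the walk ends or the cap is hit.
        while walker.poll() is None and len(exit_codes) < max_checks:
            result = subprocess.run(
                status_cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
            exit_codes.append(result.returncode)
            time.sleep(delay)
    finally:
        # Do not leave the walk running if we stop early or get interrupted.
        if walker.poll() is None:
            walker.terminate()
        walker.wait()
    return exit_codes


# Hypothetical usage: pass the real status command as an argument list.
# codes = reproduce(status_cmd=[...], walk_cmd=["find", "/"])
# print(len(codes), "status checks,", sum(c != 0 for c in codes), "non-zero")
```

While this runs, the thing to watch is whether "Copy out failed" shows up on the console or in dmesg for each status check, and whether a panic follows.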

The box is still on 7.3, but I'd gladly upgrade to 8.3 and test there if needed once the box is freed up.

Thanks,

Charles

--
Charles Sprickman
NetEng/SysAdmin
Bway.net - New York's Best Internet www.bway.net
spork at bway.net - 212.655.9344

More information about the freebsd-scsi mailing list