kernel panic on 6.2-RC2 with GENERIC.

Scott Long scottl at samsco.org
Mon Jan 8 04:15:43 UTC 2007


Jan Mikkelsen wrote:
> (Scott:  I should have emailed you this earlier, but Christmas and various
> other things got in the way.)
> 
> Ian West wrote:
>> On Sun, Jan 07, 2007 at 02:25:02PM -0500, Mike Tancsa wrote:
>>> At 11:43 AM 1/7/2007, Craig Rodrigues wrote:
>>>> On Fri, Jan 05, 2007 at 06:59:10PM +0200, Nikolay Pavlov wrote:
>>>> [ Areca kernel panic, IO failures ... ]
>> I have seen this identical fault with the new areca driver. My machine
>> is Opteron hardware, but running a regular i386/SMP kernel/world. With
>> everything at 6.2-RC2 (as of 29th of December) except the areca driver
>> the machine is rock solid; with the 29th of December version of the
>> areca driver the box will crash on extract of a large tar file, removal
>> of a large directory structure, or pretty much anything that does a lot
>> of disk I/O to different files/locations. There is no error log prior to
>> seeing the following messages:
>>
>> Dec 29 14:26:44 aleph kernel: g_vfs_done():da0s1g[WRITE(offset=433078272, length=8192)]error = 5
>> Dec 29 14:26:44 aleph kernel: g_vfs_done():da0s1g[WRITE(offset=433111040, length=16384)]error = 5
>> Dec 29 14:26:44 aleph kernel: g_vfs_done():da0s1g[WRITE(offset=433209344, length=16384)]error = 5
>> Dec 29 14:26:44 aleph kernel: g_vfs_done():da0s1g[WRITE(offset=433242112, length=32768)]error = 5
>> Dec 29 14:26:44 aleph kernel: g_vfs_done():da0s1g[WRITE(offset=437612544, length=4096)]error = 5
>> Dec 29 14:26:44 aleph kernel: g_vfs_done():da0s1g[WRITE(offset=437616640, length=12288)]error = 5
>> Dec 29 14:26:44 aleph kernel: g_vfs_done():da0s1g[WRITE(offset=437633024, length=6144)]error = 5
>> Dec 29 14:26:44 aleph kernel: g_vfs_done():da0s1g[WRITE(offset=437639168, length=2048)]error = 5
>> Dec 29 14:26:44 aleph kernel: g_vfs_done():da0s1g[WRITE(offset=437641216, length=6144)]error = 5
>>
>> There are a string of these, followed by a crash and reboot. The
>> filesystem state can be left very dirty, to the point where background
>> fsck seems unable to recover it.
>>
>> The areca card in question is running the latest firmware/boot and
>> has shown no problems either before, or since backing out the areca
>> driver.
>>
>> The volume I ran the tests on was a 250G volume in a RAID-6 raid set.
> 
> I have seen various problems with various Areca drivers.  All on
> 6.2-RC1/amd64 with an Areca RAID-6 volume.
> 
> Areca 1.20.00.02 seems to work fine.
> 
> Areca 1.20.00.12 (from the Areca website) seems to have data corruption
> problems.  My tests involve doing a "diff -r" on a filesystem with 2GB of
> data.  It will occasionally find differences in files.  On examination, the
> last 640 bytes of the first block of the affected file contain data from
> another file "nearby" in the filesystem.  Unmounting and remounting the
> filesystems and rerunning the test shows no problem, or a difference in
> another file entirely.  I think this is the cause of the g_vfs_done failures
> with this version of the driver;  the offsets are wrong because the data is
> corrupted.
> 
> Areca 1.20.00.13 (as currently in the tree) does not seem to have data
> corruption problems, but I can trigger g_vfs_done failures under heavy I/O.
> 
> I have raised this with Areca support, and I'm waiting to hear back from
> Erich Chen.
> 
> Regards,
> 
> Jan Mikkelsen
> 

I discussed this issue at length with the release engineering team
today, and we're going to go ahead with keeping the .013 version in
6.2 since it has been working very reliably for a number of other
testers, and reverting it at this late stage of the release represents
more risk.  A note about this issue will likely be put into the 6.2
errata document as well.

I plan to dig into this problem next week unless Areca fixes it first.
Please let me know if you hear anything from them.

Scott
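
[Editorial note: Jan's test above, a "diff -r" over ~2GB of data followed by inspecting the first block of any mismatched file, can be automated. The sketch below is a hypothetical reconstruction, not the actual procedure used; the 16 KB block size, tree paths, and reporting format are all assumptions.]

```python
import os

BLOCK_SIZE = 16384  # assumed filesystem block size; adjust for the volume under test


def first_block_diff(path_a, path_b, block_size=BLOCK_SIZE):
    """Return the (start, end) byte range within the first block where two
    files differ, or None if their first blocks are identical."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a = fa.read(block_size)
        b = fb.read(block_size)
    if a == b:
        return None
    limit = min(len(a), len(b))
    diffs = [i for i in range(limit) if a[i] != b[i]]
    if not diffs:
        # Blocks match up to the shorter length; the difference is the tail.
        return (limit, max(len(a), len(b)))
    return (diffs[0], diffs[-1] + 1)


def compare_trees(root_a, root_b):
    """Walk two supposedly identical trees and report first-block mismatches,
    similar in spirit to rerunning 'diff -r' after a copy."""
    for dirpath, _, filenames in os.walk(root_a):
        rel = os.path.relpath(dirpath, root_a)
        for name in filenames:
            pa = os.path.join(dirpath, name)
            pb = os.path.join(root_b, rel, name)
            if not os.path.exists(pb):
                continue
            span = first_block_diff(pa, pb)
            if span is not None:
                print(f"{pa}: first block differs in bytes {span[0]}..{span[1]}")
```

Rerunning compare_trees() after an unmount/remount cycle, as Jan did, helps distinguish in-memory cache corruption from data actually written to disk.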




More information about the freebsd-stable mailing list