Stable SATA pci card for FreeBSD 6.x/7.0

Wed Aug 6 03:17:51 UTC 2008

On Tue, Aug 05, 2008 at 07:08:20PM +0200, Sebastiaan van Erk wrote:
> Jeremy Chadwick wrote:
>> On Tue, Aug 05, 2008 at 02:47:45PM +0200, Sebastiaan van Erk wrote:
>>> Hi,
>>>
>>> Thanks for the reply.
>>>
>>> Jeremy Chadwick wrote:
>>>> Yes, most of the Silicon Image ICs I've read about have odd driver
>>>> problems or general issues (even under Windows).  The system rebooting
>>>> is an odd one; you sure your PSU can handle two disks?
>>> Well, I've got a 450W Asus PSU in there, but I've also got 6 hard
>>> disks and 1 dvd-rom drive (mostly inactive) in there. The hard disks
>>> are mostly 250/300GB but the two new ones are 1TB SATA drives. But
>>> the 450W should easily be enough, shouldn't it?
>>
>> Without getting into semantics, a 450W PSU may be on the light side for
>> 6 disks.  I'm fairly amazed you're able to power up that machine without
>> disk errors or other problems during POST.  All 6 disks will spin up
>> simultaneously -- and spin-up is when disks draw the most power -- so
>> problems are likely then, and possibly during normal operation too.
>>
>> If you have a different (or larger) PSU, I would recommend trying that
>> to see if it addresses your problem.  A PSU which isn't providing enough
>> power will cause the disks to occasionally disconnect from the bus, or
>> the machine to sporadically lock up, reboot (power-cycle), or do other odd
>> things.
>
> Unfortunately I don't have a larger PSU lying around, but I could buy  
> one; though I'd like to try some other stuff first because I've had 6  
> disks in my PC before without any problems.

See the very bottom of my mail.  I don't believe the PSU is the problem,
after reviewing your SMART statistics.

> <...parts of thread cut...>

> My other (on-board) SATA controller is a VIA controller; and I've never  
> had any problems with it (although the hardware raid messed up once a  
> year or 2 ago, and since then I've been using software raid without any  
> issues).

Okay, so you've got an onboard VIA (VT6410) SATA controller, an onboard
VIA IDE controller, and a PCI SATA controller.  I'd still like to know
which disks are attached to what controller, and if any of the devices
are sharing IRQs.  Can you provide the output from the following two
commands?

dmesg | egrep 'atapci|(ad|ata)[0-9]+'
vmstat -i

I'm just trying to narrow stuff down.
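
If the vmstat -i output is long, something like the following can pull
out the lines where more than one device sits on the same interrupt.
This is only a sketch: shared_irqs is a made-up helper name, and it
assumes each line reads "irqN: device [device ...] total rate" (some
releases instead mark a shared line with a trailing "+" on the device
name), so check it against what your system actually prints.

```shell
# shared_irqs: read `vmstat -i` output on stdin and print only the IRQ
# lines that appear to be shared between devices.
# Assumption: an unshared line has exactly 4 fields
# ("irqN:", device, total, rate); more fields, or a device name ending
# in "+", is treated as a shared interrupt.
shared_irqs() {
    awk '/^irq[0-9]+:/ && (NF > 4 || $2 ~ /\+$/)'
}

# Usage on the live system:
#   vmstat -i | shared_irqs
```

No output means no sharing was detected under those assumptions; it does
not prove the controllers are on separate interrupts.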

>> Your recommended method of troubleshooting (swapping the 250G for the
>> 1TB) is a good idea.  But hear me loud and clear: just because you
>> switch the disks and the problem disappears for a few hours doesn't mean
>> it's gone.  There have been **many** people who have shown up on the
>> mailing lists stating "I did <X thing> and now it works!", only to find
>> that a week later it *didn't* fix the problem.
>
> Yes, I don't really expect it to solve the problem, but was thinking  
> that at least I could try and stress test the known working disks on the  
> controller and try to see if it's the controller that's the problem or  
> the disks (or something else). I've been able to reproduce the crashes  
> pretty well by just doing a lot of disk IO on the 1TB disks only (so the  
> other disks were pretty idle during the tests).

It's interesting that the disks which are giving you trouble are Samsung
disks.  There's some history here which you should be made aware of:

In July, Daniel Eriksson reported data corruption occurring with his
nVidia MCP55 chipset when 1TB Samsung disks were attached to it.  The
same disks on another controller performed fine.  The corruption was
being detected by ZFS as checksum errors.  (UFS/UFS2 won't detect this
sort of thing, unless the corruption is occurring somewhere within the
filesystem tables.)

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043427.html

Soren Schmidt (ata(4) author) replied that there are some nVidia
chipset-related fixes for ATA in -CURRENT, and provided a patch.  Daniel
reported that the patch made absolutely no difference:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043434.html

Daniel also tried using a firmware patch for his Samsung disks, which
limits the SATA speed to SATA150, but the speed was still negotiated as
SATA300 (indicating the vendor's own f/w patch is broken, or FreeBSD
does not play well with it).  The f/w patch didn't fix his problem
either:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043432.html

zbeeble at gmail.com reported using his MCP55 controller without any
problem -- as long as he didn't use Samsung disks.  He stated that he
believes Samsung disks are PATA disks that use a PATA-to-SATA adapter
inside of the drive, leading to problems (and yes, those adapters are
known to cause all sorts of mayhem):

http://lists.freebsd.org/pipermail/freebsd-stable/2008-July/043485.html

I'm not sure what became of the thread; Daniel never provided a
post-mortem.  I'm left to believe he probably took zbeeble at gmail.com's
advice and switched to another disk vendor.

> <...parts of thread cut...>
> <...smartctl output for both Samsung disks...>

Thanks for upgrading to smartmontools 5.38.  All the SMART statistics for
these disks look okay.

Can you run some SMART tests on the disks?  You can run these tests
while the disks are in use (but I/O will make the test take longer to
complete):

smartctl -t short /dev/ad4
smartctl -t short /dev/ad6

Then you'll need to look at the SMART self test log, as well as the
SMART error log, to see if anything is returned.  Make sure the tests
have completed (the Status field should be "Completed without error",
unless an error was found of course):

smartctl -a /dev/ad4
smartctl -a /dev/ad6
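
For what it's worth, smartctl can also print just those two logs with
"-l selftest" and "-l error", which is easier to read than the full -a
output.  Below is a rough sketch of checking the newest self-test entry
from a script.  selftest_ok is a hypothetical helper, and it assumes the
newest entry is the log line beginning with "# 1", so compare it with
the log your disks actually produce.

```shell
# selftest_ok: read `smartctl -l selftest` output on stdin and succeed
# only if the most recent entry (assumed to be the line starting with
# "# 1") reports "Completed without error".
selftest_ok() {
    grep -E '^# *1 ' | grep -q 'Completed without error'
}

# Usage on the live system:
#   smartctl -l selftest /dev/ad4 | selftest_ok && echo "ad4: last test clean"
#   smartctl -l error /dev/ad4      # SMART error log on its own
```

A failure here only means the newest entry did not read "Completed
without error"; the test may still be in progress.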

If nothing is found, try a different test (also safe to run during
operation; don't let the word "offline" scare you), then check the logs
again.  This test may take some time, though:

smartctl -t offline /dev/ad4
smartctl -t offline /dev/ad6

At this point, I'm inclined to believe the issue is specific to those
Samsung disks.  I do not believe your PSU is a problem; the SMART
statistics would be showing a higher number of power-cycles if the disks
were losing power.
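
If you want to keep an eye on that number yourself, a sketch like the
one below records the power-cycle count so you can compare it before and
after a crash.  power_cycles is a hypothetical helper; it assumes the
disk reports an attribute named Power_Cycle_Count and that the raw value
is the tenth column of "smartctl -A" output, which is not guaranteed for
every vendor.

```shell
# power_cycles: read `smartctl -A` output on stdin and print the raw
# value of the Power_Cycle_Count attribute.
# Assumption: the attribute table has the usual 10 columns, with the
# attribute name in column 2 and RAW_VALUE in column 10.
power_cycles() {
    awk '$2 == "Power_Cycle_Count" { print $10 }'
}

# Usage on the live system:
#   smartctl -A /dev/ad4 | power_cycles
#   smartctl -A /dev/ad6 | power_cycles
```

If the count goes up by more than the number of times you rebooted the
machine, the disk lost power on its own and the PSU is back on the
suspect list.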

Worth noting (about Samsung disks) is that smartctl has options to work
around 3 different firmware bugs.  The bugs are SMART statistics-related,
but those kinds of mistakes don't give me "warm fuzzies".  Be wary.  :-)

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |