Reset Problem with SATA Port Multiplier
Alexander Motin
mav at FreeBSD.org
Sun Jul 28 10:23:49 UTC 2013
On 28.07.2013 03:08, Dieter BSD wrote:
> Bob writes:
>> After a few hours of a database-like workload
>
> A faster way to trigger the problem would be useful.
>
>> We're actually more interested in archive type workloads than this
>> database workload and we have not observed the problem with an archive
>> workload.
>
> So perhaps something about the timing triggers the bug?
>
> Sam writes:
>> if you have a script or a way to build a kernel to help debug this I will
>> run it if you post it here... I have the same issue on a 3 port multiplier
>> using -HEAD
>
> Can you share the make and model of this 3 port multiplier?
> If it is happening with more than one model of pm, it is more likely
> some generic problem, rather than triggering some model-specific quirk/bug.
> Has anyone seen this problem with an older OS release? (say 7.x or 8.x?)
> If the problem was introduced recently, we might be able to find it
> by looking at what changed in the source code. I haven't seen the
> problem with 8.2 or earlier.
>
> Looks like a verbose boot will give a little more info.
> But I suspect that adding more log(9) statements will be needed.
> Unless mav has a better idea?
There are two sides to this problem: the original issue and imperfect error
recovery. The first one is a big question. I can't say what is actually
going on there that causes the problem. Just recently I made one more
attempt to get some documentation on SATA controllers from Marvell. But
even after signing an NDA, the process stopped again, since I am neither
buying thousands of their chips as a vendor, nor do they provide support
for end-users. The situation is similar with other vendors.
As for the recovery, the problem is that neither CAM nor the mvs driver
currently tracks the faulty status of devices. So if some disk's firmware
gets stuck and starts causing endless timeouts, that will substantially
disrupt operation of the other devices sharing that SATA port. Probably the
mechanism for dropping a faulty device could be improved somehow.
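
To illustrate what "tracking faulty status" could mean, here is a minimal
sketch in plain C. It is not CAM or mvs code: every name in it
(pmp_port_state, pmp_note_timeout, FAULT_THRESHOLD and so on) is
hypothetical, and the threshold of three consecutive timeouts is an
arbitrary assumption. The idea is only to count consecutive timeouts per
device behind the port multiplier and stop issuing commands to a device
once it crosses the threshold, so the other devices on the port keep
working.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* All names here are hypothetical; this is not CAM or mvs(4) code. */
#define PMP_MAX_TARGETS  15  /* a SATA port multiplier has at most 15 device ports */
#define FAULT_THRESHOLD   3  /* assumed: consecutive timeouts before dropping */

struct pmp_port_state {
    unsigned timeouts[PMP_MAX_TARGETS]; /* consecutive timeouts per target */
    bool     faulty[PMP_MAX_TARGETS];   /* true once the target is dropped */
};

/* Start with all counters at zero and no target marked faulty. */
static void pmp_state_init(struct pmp_port_state *st)
{
    memset(st, 0, sizeof(*st));
}

/*
 * Record a command timeout for one target.
 * Returns true only on the call that moves the target into the faulty state.
 */
static bool pmp_note_timeout(struct pmp_port_state *st, unsigned target)
{
    assert(target < PMP_MAX_TARGETS);
    if (st->faulty[target])
        return false;
    if (++st->timeouts[target] >= FAULT_THRESHOLD) {
        st->faulty[target] = true;
        return true;
    }
    return false;
}

/* A completed command resets the consecutive-timeout counter. */
static void pmp_note_success(struct pmp_port_state *st, unsigned target)
{
    assert(target < PMP_MAX_TARGETS);
    if (!st->faulty[target])
        st->timeouts[target] = 0;
}

/* Check before queueing a command: faulty targets get no more I/O. */
static bool pmp_may_issue(const struct pmp_port_state *st, unsigned target)
{
    assert(target < PMP_MAX_TARGETS);
    return !st->faulty[target];
}
```

A real driver would also have to fail the requests already queued to the
dropped device and decide when, if ever, to probe it again; the sketch
leaves both out.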
As for SAS, mentioned here: that is a quite different, more
expensive market. And even though the protocols there are much more
sophisticated, and the hardware, firmware and software are much better
tested, situations still happen sometimes where a single misbehaving
device can bring down the whole fabric.
--
Alexander Motin