Botched NCQ on SSD - cannot disable?

Alexander Motin mav at FreeBSD.org
Fri May 22 01:11:52 UTC 2015


On 21.05.2015 21:54, Warner Losh wrote:
> 
>> On May 21, 2015, at 12:42 PM, Neffi <nefftd at gmail.com> wrote:
>> 
>> I was discussing this issue in freenode/#freebsd and I was
>> recommended to shoot an email to you fellows about it.
>> 
>> I've got a Samsung 840 EVO SSD (model MZ-7TE250BW), which uses
>> Samsung's own controller from what I can gather. I had issues of
>> mass data corruption when used under Linux, and several programs
>> crashing unexpectedly when used under FreeBSD. I've gone through
>> 2 drives under warranty with the same issue before customer
>> service suggested to disable drive queuing.
>> 
>> After some research it seems as though this drive (and several
>> other common SSDs) report that they support NCQ, but in fact are
>> botched and will have all sorts of problems with NCQ enabled
>> ranging from poor performance, to I/O stalls to data corruption.
>> 
>> Sure enough the logs on Linux spit out something along the lines
>> of:
>> 
>>> ata1: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x10 frozen 
>>> ata1.00: failed command: READ FPDMA QUEUED
>> 
>> This happens several times when used on Linux, in the few hours
>> leading up to total filesystem corruption.
>> 
>> The recommendation in the Linux world is to disable NCQ on these
>> drives, for which there is an easy boot-time tunable. This
>> fixes the issue. No more data corruption.
>> 
>> There doesn't seem to be a tunable for this anywhere on FreeBSD.
>> camcontrol(8) mentions setting the tags used, but only between
>> some hardcoded limits, with a default of 2 -- not sufficient to
>> disable NCQ on the drive. It looks like presently the only option
>> is to manually patch the quirks for this drive in the kernel and
>> recompile before I can even install the system to the drive.
> 
> One option is to use drives that don’t suck so bad.
> 
> If you are using the AHCI controller, it has quirks for some cards
> that don’t properly fill in the NCQ tags, but so far that’s a tiny
> list of mostly older gear. What’s the host controller you are
> using?
> 
> Also, just because the command that hung on the drive is an NCQ
> command, that doesn’t mean disabling NCQ commands will keep you
> safe. That’s just the first one that’s issued after the firmware
> wedges (or could be: that’s a very common scenario for this kind of
> failure mode).
> 
> There’s a quirk for the 840 EVO, but that’s just to force 4k sector
> size.
> 
> While I haven’t used this generation of Samsung SSDs, I’d be highly
> surprised if this issue was really a problem in the drive instead
> of some cabling issue, or other environmental issue leading to the
> wedge.
> 
> It’s true there’s no way to totally disable NCQ, but if the drive
> is hanging with NCQ depth of 2, I’d be highly surprised if it is
> actually NCQ causing this...

IIRC camcontrol can disable NCQ, even though it is not very intuitive:
`camcontrol negotiate adaX -T disable ; camcontrol reset <CAM bus
number where adaX is connected>`
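
A dry-run sketch of the sequence above, assuming the drive is ada0 on CAM
bus 0. Both values are placeholders, and `camcontrol devlist -v` shows which
scbus a given adaX actually sits on. The helper name is made up for
illustration; it only prints the two commands so they can be checked before
being run as root.

```shell
#!/bin/sh
# Placeholders (assumptions): override with DEV=adaN BUS=N in the environment.
DEV="${DEV:-ada0}"
BUS="${BUS:-0}"

# Hypothetical helper: prints the commands instead of executing them.
ncq_disable_cmds() {
    # $1 = device name (adaX), $2 = CAM bus number the device is attached to
    printf 'camcontrol negotiate %s -T disable\n' "$1"   # turn off tagged queueing
    printf 'camcontrol reset %s\n' "$2"                  # bus reset so it is renegotiated
}

ncq_disable_cmds "$DEV" "$BUS"
```

The setting is presumably not persistent, so it would need to be reapplied
after each boot.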

-- 
Alexander Motin

