nand performance

Fri Dec 21 03:02:18 UTC 2012

On Dec 20, 2012, at 5:50 PM, Ian Lepore wrote:

> On Thu, 2012-12-20 at 12:07 -0800, John-Mark Gurney wrote:
>> Ian Lepore wrote this message on Wed, Dec 19, 2012 at 17:41 -0700:
>>> I've been working to get nandfs going on a low-end Atmel arm system.
>>> Performance is horrible.  Last weekend I got my nand-based DreamPlug
>>> unbricked and got nandfs working on it too.  Performance is horrible.
>>> 
>>> By that I'm referring not to the slow nature of the nand chips
>>> themselves, but to the fact that accessing them locks out userland
>>> processes, sometimes for many seconds at a time.  The problem is real
>>> easy to see, just format and populate a nandfs filesystem, then do
>>> something like this
>>> 
>>>  mount -r -t nandfs /dev/gnand0s.root /mnt
>>>  nice +20 find /mnt -type f | xargs -J% cat % > /dev/null
>>> 
>>> and then try to type in another terminal -- sometimes what you're typing
>>> doesn't get echoed for 10+ seconds at a time.
>>> 
>>> The problem is that the "I/O" on a nand chip is really just the cpu
>>> copying from one memory interface to another, a byte at a time, and it
>>> must also use busy-wait loops to wait for chip-ready and status info.
>>> This is being done by high-priority kernel threads, so everything else
>>> is locked out.
>>> 
>>> It seems to me that this is about the same situation as classic ATA PIO
>>> mode, but PIO doesn't make a system that unresponsive.  
>>> 
>>> I'm curious what techniques are used to mitigate performance problems
>>> for ATA PIO modes, and whether we can do something similar for nand.  I
>>> poked around a bit in dev/ata but the PIO code I saw (which surely
>>> wasn't the whole picture) just used a bus_space_read_multi().  Can
>>> someone clue me in as to how ATA manages to do PIO without usurping the
>>> whole system?
>> 
>> Looks like the problem is all the DELAY calls in dev/nand/nand_generic.c..
>> DELAY is a busy wait not letting the cpu do anything else...  The bad one
>> is probably generic_erase_block as it looks like the default is 3ms,
>> plenty of time to let other code run...  If it could be interrupt driven,
>> that'd be best...
>> 
>> I can't find the interface that would allow sub-hz sleeping, but there is
>> tsleep that could be used for some of the larger sleeps...  But switching
>> to interrupts + wakeup would be best...
>> 
> 
> Yeah, the DELAY() calls were actually not working for me (I think I'm
> the first to test this stuff with an ONFI type chip), and I've replaced
> them all with loops that poll for ready status, which at least minimizes
> the wait time, but it's still a busy-loop.  Real-world times for the
> chips I'm working with are 30uS to open a page for read, ~270uS to write
> a page, and ~750uS to erase a block.

You're the first one to use it with Intel or Micron NAND?  I find that kinda hard to believe given their ubiquity...

But those times look about right for 3xnm parts...  With newer parts, according to published specifications, those times get longer.  Expect them to double over the next year (meaning through Intel/Micron's 20nm parts now rolling out). Other NAND vendors publish similar specs, and there's plenty of public information about this.
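
To put those numbers in CPU terms, here is a rough sketch that converts a busy-wait time into cycles. It assumes the 180 MHz clock Ian mentions for his ARM further down, and the helper name busy_wait_cycles is made up for illustration; it is not part of the nand code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * CPU cycles burned while busy-waiting for wait_us microseconds.
 * Hypothetical helper for back-of-the-envelope estimates only.
 */
uint64_t
busy_wait_cycles(uint64_t cpu_hz, uint64_t wait_us)
{
	assert(cpu_hz > 0);
	return (cpu_hz * wait_us / 1000000ULL);
}
```

At 180 MHz that works out to 5,400 cycles for a 30us page open, 48,600 for a 270us page write, and 135,000 for a 750us block erase. If the chip times double, so do the cycle counts.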

> But whether busy-looping for status or busy-looping polling a clock for
> DELAY, or transferring a byte at a time for the actual IO, it's all the
> same... it's cpu and memory bus cycles that are happening in a
> high-priority kernel thread.  

But usually the transfer goes quickly (a few microseconds with dedicated hardware) compared to the waiting (tens or hundreds of microseconds).  The RM9200 doesn't have dedicated NAND hardware, so byte-banging the data to the device is the only choice...

It looks like you'll also have to coordinate it with a number of GPIO pins, which is good...  That means you'll be able to have an interrupt service the state change of the GPIO pins (well, you may need to augment the current lame AT91 gpio support that I wrote to allow for this). But the NAND subsystem looks like it needs some support to do that...

> The interface between the low-level controller and the nand layer
> doesn't allow for interrupt handling right now.  Not all hardware
> designs would allow for using interrupts, but mine does, so reworking
> things to allow its use would help some.  Well, it would help for writes
> and erases.  The 180mhz ARM I'm working with doesn't get much done in
> 30uS, so reads wouldn't get any better.   Reads are all I really care
> about, since the product in the field will have a read-only filesystem,
> and firmware updates are infrequent and it's okay if they're a bit slow.

Any idea what the interrupt and scheduling delay runs these days on the RM9200?  It has been forever since I tried to measure it. You may be able to signal a waiting process rather than using DELAY to busy wait for things.  But that likely means a thread of some sort to defer the work once the chip returns done.  Reads might get better, from a system load point of view, but maybe not from a performance point of view.  While 30us isn't a lot, you may find that your console performance goes to hell with that long a block...
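
The trade-off could be sketched as a simple per-operation decision: sleep and wait for the interrupt only when the expected chip time is clearly longer than the cost of getting back onto the CPU, otherwise keep polling. Everything here is an assumption for illustration: the names, the factor-of-two margin, and above all wakeup_cost_us, which stands in for the interrupt and scheduling delay that still has to be measured. This is not how the nand layer works today.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wait strategies; the names are illustrative only. */
enum wait_strategy {
	WAIT_SPIN,	/* poll the ready status in a busy loop */
	WAIT_SLEEP	/* block and let an interrupt wake the thread */
};

/*
 * Pick a strategy for one NAND operation.
 * expected_us:    typical time until the chip reports ready.
 * wakeup_cost_us: assumed interrupt + scheduling latency (to be measured).
 * Sleeping only pays off when the wait is clearly longer than the cost of
 * being rescheduled; the factor of 2 is an arbitrary safety margin.
 */
enum wait_strategy
choose_wait(uint32_t expected_us, uint32_t wakeup_cost_us)
{
	assert(wakeup_cost_us > 0);
	if (expected_us > 2 * wakeup_cost_us)
		return (WAIT_SLEEP);
	return (WAIT_SPIN);
}
```

With a purely hypothetical wakeup cost of 50us, choose_wait(30, 50) picks WAIT_SPIN for the page open, while the 270us write and 750us erase both come out as WAIT_SLEEP.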

Warner