nand performance

Sat Dec 22 00:06:04 UTC 2012

On Thu, 2012-12-20 at 20:02 -0700, Warner Losh wrote:
> On Dec 20, 2012, at 5:50 PM, Ian Lepore wrote:
> 
> > On Thu, 2012-12-20 at 12:07 -0800, John-Mark Gurney wrote:
> >> Ian Lepore wrote this message on Wed, Dec 19, 2012 at 17:41 -0700:
> >>> I've been working to get nandfs going on a low-end Atmel arm system.
> >>> Performance is horrible.  Last weekend I got my nand-based DreamPlug
> >>> unbricked and got nandfs working on it too.  Performance is horrible.
> >>> 
> >>> By that I'm referring not to the slow nature of the nand chips
> >>> themselves, but to the fact that accessing them locks out userland
> >>> processes, sometimes for many seconds at a time.  The problem is real
> >>> easy to see, just format and populate a nandfs filesystem, then do
> >>> something like this
> >>> 
> >>>  mount -r -t nandfs /dev/gnand0s.root /mnt
> >>>  nice +20 find /mnt -type f | xargs -J% cat % > /dev/null
> >>> 
> >>> and then try to type in another terminal -- sometimes what you're typing
> >>> doesn't get echoed for 10+ seconds at a time.
> >>> 
> >>> The problem is that the "I/O" on a nand chip is really just the cpu
> >>> copying from one memory interface to another, a byte at a time, and it
> >>> must also use busy-wait loops to wait for chip-ready and status info.
> >>> This is being done by high-priority kernel threads, so everything else
> >>> is locked out.
> >>> 
> >>> It seems to me that this is about the same situation as classic ATA PIO
> >>> mode, but PIO doesn't make a system that unresponsive.  
> >>> 
> >>> I'm curious what techniques are used to mitigate performance problems
> >>> for ATA PIO modes, and whether we can do something similar for nand.  I
> >>> poked around a bit in dev/ata but the PIO code I saw (which surely
> >>> wasn't the whole picture) just used a bus_space_read_multi().  Can
> >>> someone clue me in as to how ATA manages to do PIO without usurping the
> >>> whole system?
> >> 
> >> Looks like the problem is all the DELAY calls in dev/nand/nand_generic.c..
> >> DELAY is a busy wait not letting the cpu do anything else...  The bad one
> >> is probably generic_erase_block as it looks like the default is 3ms,
> >> plenty of time to let other code run...  If it could be interrupt driven,
> >> that'd be best...
> >> 
> >> I can't find the interface that would allow sub-hz sleeping, but there is
> >> tsleep that could be used for some of the larger sleeps...  But switching
> >> to interrupts + wakeup would be best...
> >> 
> > 
> > Yeah, the DELAY() calls were actually not working for me (I think I'm
> > the first to test this stuff with an ONFI type chip), and I've replaced
> > them all with loops that poll for ready status, which at least minimizes
> > the wait time, but it's still a busy-loop.  Real-world times for the
> > chips I'm working with are 30uS to open a page for read, ~270uS to write
> > a page, and ~750uS to erase a block.
> 
> You're the first one to use it with Intel or Micron NAND?  I find that kinda hard to believe given their ubiquity...
> 

Yep, I'm certain of it.  The code to detect an onfi part is wrong (it
looks for "onfi" rather than "ONFI"), and the code to read and parse the
onfi data doesn't deal with endianness, packing, and alignment.  I have
patches, but I need to review all my XXXes and tidy things up.
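
A minimal sketch of the kind of fix described above, not the actual
patch: match the signature as upper-case "ONFI" and decode multi-byte
fields one byte at a time, so the result does not depend on host
endianness, struct packing, or buffer alignment.  The struct, the
function names, and the subset of fields are hypothetical, and the byte
offsets are assumptions that should be checked against the ONFI
parameter page layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical subset of geometry fields; not the driver's real struct. */
struct onfi_geom {
	uint32_t bytes_per_page;
	uint16_t spare_per_page;
	uint32_t pages_per_block;
};

/* Assumed offsets into the ONFI parameter page; verify against the spec. */
#define ONFI_OFF_BYTES_PER_PAGE		80
#define ONFI_OFF_SPARE_PER_PAGE		84
#define ONFI_OFF_PAGES_PER_BLOCK	92
#define ONFI_MIN_LEN			96

/* Byte-wise little-endian decode: safe for any alignment or host order. */
static uint16_t
get_le16(const uint8_t *p)
{
	return ((uint16_t)(p[0] | (p[1] << 8)));
}

static uint32_t
get_le32(const uint8_t *p)
{
	return ((uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24));
}

/* Returns 0 and fills *g if buf starts with "ONFI", otherwise -1. */
static int
onfi_parse_geom(const uint8_t *buf, size_t len, struct onfi_geom *g)
{
	if (len < ONFI_MIN_LEN || memcmp(buf, "ONFI", 4) != 0)
		return (-1);
	g->bytes_per_page = get_le32(buf + ONFI_OFF_BYTES_PER_PAGE);
	g->spare_per_page = get_le16(buf + ONFI_OFF_SPARE_PER_PAGE);
	g->pages_per_block = get_le32(buf + ONFI_OFF_PAGES_PER_BLOCK);
	return (0);
}
```

Reading the fields out of the raw byte buffer, rather than overlaying a
C struct on it, sidesteps the packing and alignment issues entirely.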

> But those times look about right for 3xnm parts...  With newer parts, according to published specifications, those times get longer.  Expect them to double over the next year (meaning through Intel/Micron's 20nm parts now rolling out). Other NAND vendors have similar published specs, or there's much public information about this.
> 
> > But whether busy-looping for status or busy-looping polling a clock for
> > DELAY, or transferring a byte at a time for the actual IO, it's all the
> > same... it's cpu and memory bus cycles that are happening in a
> > high-priority kernel thread.  
> 
> But usually the transfer goes quickly (a few microseconds with dedicated hardware) compared to the waiting (tens or hundreds of microseconds).  The RM9200 doesn't have dedicated NAND hardware, so byte-banging the data to the device is the only choice...
> 

It's more than a few microseconds.  The base speed on the Micron parts
we're using is a 10mhz bus, so that's ~400uS to transfer a page.  The
part can be told to run as fast as 40mhz (and oddly, I find that I can
just set the wait states to run almost that fast without telling the
chip I intend to do so, and it works just fine), but even so the page
xfer is ~100uS.
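
The arithmetic behind those figures can be sketched as below.  The page
size is not stated in this thread; 4096 bytes is an assumption that
happens to be consistent with the ~400uS and ~100uS numbers.  The sketch
also assumes an 8-bit bus moving one byte per bus cycle and ignores the
spare area and any command/address overhead.  The function name is
hypothetical.

```c
#include <stdint.h>

/*
 * Nanoseconds needed to move nbytes across an 8-bit bus that transfers
 * one byte per cycle at bus_hz (an assumption, see the text above).
 */
static uint64_t
nand_xfer_ns(uint32_t nbytes, uint32_t bus_hz)
{
	return ((uint64_t)nbytes * 1000000000ULL / bus_hz);
}
```

With an assumed 4096-byte page, nand_xfer_ns(4096, 10000000) gives
409600 ns and nand_xfer_ns(4096, 40000000) gives 102400 ns, all of it
cpu time when the cpu is doing the copy itself.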

The Marvell chips also require the cpu to do all the data movement, so
the only two types of hardware I have to test with have the same
responsiveness problems.  Given how easy it is to implement nand access
in a SoC by sharing the regular memory data bus lines and using a couple
address lines for ALE/CLE, there are probably plenty of others that'll
work the same.

> It looks like you'll also have to coordinate it with a number of GPIO pins, which is good...  That means you'll be able to have an interrupt service the state change of the GPIO pins (well, you may need to augment the current lame AT91 gpio support that I wrote to allow for this). But the NAND subsystem looks like it needs some support to do that...
> 

Yeah, that's what I was alluding to earlier.  There's an unused function
now to read the RB# line but that doesn't really buy much (which is
probably why the middle layer doesn't ever call it).  It would be better
if there were a function that could wait for a change in RB#, but it
would have to be optional to implement, because it isn't really required
to make the hardware work and you could envision systems where getting
an interrupt on RB# change just isn't possible.
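
A rough userland sketch of what such an optional hook could look like,
under stated assumptions: every name here (nfc_ops, read_rnb, wait_rnb,
nand_wait_ready) is hypothetical and none of it is the existing nand
framework interface.  A controller that can take an interrupt on RB#
supplies wait_rnb; one that cannot leaves it NULL and the middle layer
falls back to bounded polling.  The two fake_* functions stand in for
real hardware so the fallback logic can be exercised.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical controller method table. */
struct nfc_ops {
	/* Required: sample the RB# line, true means the chip is ready. */
	bool	(*read_rnb)(void *ctx);
	/* Optional (may be NULL): block until ready; 0 on success. */
	int	(*wait_rnb)(void *ctx, unsigned timeout_us);
};

/* Prefer the blocking hook when present, otherwise poll a bounded number of times. */
static int
nand_wait_ready(const struct nfc_ops *ops, void *ctx, unsigned timeout_us,
    unsigned max_polls)
{
	unsigned i;

	if (ops->wait_rnb != NULL)
		return (ops->wait_rnb(ctx, timeout_us));
	for (i = 0; i < max_polls; i++) {
		if (ops->read_rnb(ctx))
			return (0);
	}
	return (-1);
}

/* Fake hardware: ctx is an int counting how many polls report "busy". */
static bool
fake_read_rnb(void *ctx)
{
	int *busy = ctx;

	if (*busy > 0) {
		(*busy)--;
		return (false);
	}
	return (true);
}

/* Fake blocking wait: pretends the interrupt fired and the chip is ready. */
static int
fake_wait_rnb(void *ctx, unsigned timeout_us)
{
	int *busy = ctx;

	(void)timeout_us;
	*busy = 0;
	return (0);
}
```

In a kernel driver the wait_rnb implementation would sleep until the
interrupt handler wakes it instead of spinning, which is where the cpu
time would actually be given back.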

> > The interface between the low-level controller and the nand layer
> > doesn't allow for interrupt handling right now.  Not all hardware
> > designs would allow for using interrupts, but mine does, so reworking
> > things to allow its use would help some.  Well, it would help for writes
> > and erases.  The 180mhz ARM I'm working with doesn't get much done in
> > 30uS, so reads wouldn't get any better.   Reads are all I really care
> > about, since the product in the field will have a read-only filesystem,
> > and firmware updates are infrequent and it's okay if they're a bit slow.
> 
> Any idea what the interrupt and scheduling delay runs these days on the RM9200?  It has been forever since I tried to measure it. You may be able to signal a waiting process rather than using DELAY to busy wait for things.  But that likely means a thread of some sort to defer the work once the chip returns done.  Read might get better, from a system load point of view, but maybe not from a performance point of view.  While 30us isn't a lot, you may find that your console performance goes to hell with that long a block...
> 
> Warner

I also haven't measured interrupt latency for quite a while.  In fact,
since the freebsd 6.2 days.  One thing I have measured recently is
syscall performance... calling getpid() on an rm9200 takes 7 to 9
microseconds.  IMO, that's crazy-long, but that's a problem for another
day.  But that's the kind of thing that makes me a bit skeptical that
trying to buy back 30uS while waiting for a read page-open is going to
restore a lot of responsiveness to userland apps.

-- Ian