hardware for home use large storage

Alexander Leidinger Alexander at Leidinger.net
Mon Feb 15 15:11:17 UTC 2010

Quoting Jeremy Chadwick <freebsd at jdc.parodius.com> (from Mon, 15 Feb  
2010 04:27:44 -0800):

> On Mon, Feb 15, 2010 at 10:50:00AM +0100, Alexander Leidinger wrote:
>> Quoting Jeremy Chadwick <freebsd at jdc.parodius.com> (from Mon, 15 Feb
>> 2010 01:07:56 -0800):
>> >On Mon, Feb 15, 2010 at 10:49:47AM +0200, Dan Naumov wrote:
>> >>> I had a feeling someone would bring up L2ARC/cache devices.  This gives
>> >>> me the opportunity to ask something that's been on my mind for quite
>> >>> some time now:
>> >>>
>> >>> Aside from the capacity difference (e.g. 40GB vs. 1GB), is there a
>> >>> benefit to using a dedicated RAM disk (e.g. md(4)) to a pool for
>> >>> L2ARC/cache?  The ZFS documentation explicitly states that cache
>> >>> device content is considered volatile.
>> >>
>> >>Using a ramdisk as an L2ARC vdev doesn't make any sense at all. If you
>> >>have RAM to spare, it should be used by regular ARC.
>> >
>> >...except that it's already been proven on FreeBSD that the ARC getting
>> >out of control can cause kernel panics[1], horrible performance until
> First and foremost, sorry for the long post.  I tried to keep it short,
> but sometimes there's just a lot to be said.

And sometimes a shorter answer takes longer...

>> There are other ways (not related to ZFS) to shoot yourself in the
>> foot too, I'm tempted to say that this is
>>  a) a documentation bug
>> and
>>  b) a lack of sanity checking of the values... anyone out there with
>> a good algorithm for something like this?
>> Normally you do some testing with the values you use, so once you
>> have resolved the issues, the system should be stable.
> What documentation?  :-)  The Wiki?  If so, that's been outdated for

Hehe... :)

> some time; I know Ivan Voras was doing his best to put good information
> there, but it's hard given the below chaos.

Do you want write access to it (in case you don't have it, I didn't check)?

> The following tunables are recurrently mentioned as focal points, but no
> one's explained in full how to tune these "properly", or which does what
> (perfect example: vm.kmem_size_max vs. vm.kmem_size.  _max used to be
> what you'd adjust to solve kmem exhaustion issues, but now people are
> saying otherwise?).  I realise it may differ per system (given how much
> RAM the system has), so different system configurations/examples would
> need to be provided.  I realise that the behaviour of some have changed
> too (e.g. -RELEASE differs from -STABLE, and 7.x differs from 8.x).
> I've marked commonly-referred-to tunables with an asterisk:

It may also be that some people just pass on advice without really
knowing what it does (based upon some kind of observed evidence, not
out of bad intent).

>   kern.maxvnodes

Needs to be tuned if you run out of vnodes... ok, this is obvious. I
do not know how running out will show up (panic or graceful error
handling, e.g. ENOMEM).

> * vm.kmem_size
> * vm.kmem_size_max

I tried kmem_size_max on -current (this year) and got a panic during
use. After I changed kmem_size to the same value I have for _max, it
did not panic anymore. It looks (from mails on the lists) like _max
is supposed to be the upper limit for auto-tuning, but at least it
was not working with ZFS last month (and I doubt it works now).

> * vfs.zfs.arc_min
> * vfs.zfs.arc_max

_min = the minimum the ARC keeps even when the system is running out
of memory (above that, the ARC gives back memory if other parts of the
kernel need it).
_max = the maximum (with a recent ZFS on 7/8/9 (7.3 will have it, 8.1
will have it too) I've never seen the size exceed _max anymore).

>   vfs.zfs.prefetch_disable  (auto-tuned based on available RAM on 8-STABLE)
>   vfs.zfs.txg.timeout

It looks like tuning the txg timeout is just a workaround. I've read a
little bit in Brendan's blog and it seems they noticed the periodic
writes too (with the nice graphical performance monitoring of
OpenStorage) and they are investigating the issue. It looks like we
are more affected by this (for whatever reason). What it is doing
(attention, this is an observation, not a technical description of
code I've read!) seems to be to write out data to the disks earlier
(and thus there is less data to write -> less blocking to notice).

>   vfs.zfs.vdev.cache.size
>   vfs.zfs.vdev.cache.bshift
>   vfs.zfs.vdev.max_pending

Uhm... this smells like you got it out of one of my posts where I said
that I was experimenting with this on a system. I can tell you that I
have no system with this tuned anymore; tuning kmem_size (and
KVA_PAGES during kernel compile) has a bigger impact.

>   vfs.zfs.zil_disable

What it does should be obvious. IMHO this should not help much
regarding stability (changing kmem_size should have a bigger impact).
As I don't know what was tested on systems where this is disabled, I
want to highlight the "IMHO" in the sentence before...

> Then, when it comes to debugging problems as a result of tuning
> improperly (or entire lack of), the following counters (not tunables)
> are thrown into the mix as "things people should look at":
>   kstat.zfs.misc.arcstats.c
>   kstat.zfs.misc.arcstats.c_min
>   kstat.zfs.misc.arcstats.c_max

c_max is vfs.zfs.arc_max, c_min is vfs.zfs.arc_min.

>   kstat.zfs.misc.arcstats.evict_skip
>   kstat.zfs.misc.arcstats.memory_throttle_count
>   kstat.zfs.misc.arcstats.size

I'm not very sure about size and c... both represent some kind of  
current size, but they are not the same.

For tuning I would recommend relying on a more human-readable
representation of these counters. I've seen someone post something
like this, but I do not know how it was generated (some kind of
script, but I do not know where to get it).
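
A minimal sketch of what such a script could look like, in Python. This
is not the script mentioned above, just an illustration. It assumes the
usual "name: value" output of sysctl(8) and that size, c, c_min and
c_max are byte counts; the function names are made up for this example.

```python
import subprocess

PREFIX = "kstat.zfs.misc.arcstats."
# Byte-valued counters named in this thread (assumption: all are in bytes).
SIZE_COUNTERS = ("size", "c", "c_min", "c_max")


def human(nbytes):
    """Format a byte count using binary units."""
    value = float(nbytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if value < 1024 or unit == "TiB":
            return "%.1f %s" % (value, unit)
        value /= 1024


def parse_arcstats(text):
    """Turn 'kstat.zfs.misc.arcstats.NAME: VALUE' lines into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.startswith(PREFIX) and value.strip().isdigit():
            stats[name[len(PREFIX):]] = int(value)
    return stats


def report(stats):
    """Return one formatted line per size counter that is present."""
    return ["%-6s %s" % (key, human(stats[key]))
            for key in SIZE_COUNTERS if key in stats]


def read_arcstats():
    """Run sysctl on the arcstats subtree and return its raw output."""
    result = subprocess.run(["sysctl", PREFIX.rstrip(".")],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

On a machine with ZFS loaded, something like
print("\n".join(report(parse_arcstats(read_arcstats())))) would print
the four values side by side, which makes it easier to compare size
and c against c_min and c_max.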

> All that said:
> I would be more than happy to write some coherent documentation that
> folks could refer to "officially", but rather than spend my entire
> lifetime reverse-engineering the ZFS code I think it'd make more sense
> to get some official parties involved to explain things.


> I'd like to add some kind of monitoring section as well -- how
> administrators could keep an eye on things and detect, semi-early, if
> additional tuning is required or something along those lines.
>> >ZFS has had its active/inactive lists flushed[2], and brings into
>> Someone needs to sit down and play a little bit with ways to tell
>> the ARC that there is free memory. The mail you reference already
>> says that the inactive/cached lists should maybe be taken into
>> account too (I didn't have a look at this part of the ZFS code).
>> >question how proper tuning is to be established and what the effects are
>> >on the rest of the system[3].  There are still reports of people
>> That's what I was talking about regarding b) above. If you specify
>> an arc_max which is too big (arc_max > kmem_size - SOME_SAVE_VALUE),
>> there should be a message from the kernel and the value should be
>> adjusted to a safe amount.
>> Until the problems are fixed, a MD for L2ARC may be a viable
>> alternative (if you have enough mem to give for this). Feel free to
>> provide benchmark numbers, but in general I see this just as a
>> workaround for the current issues.
> I've played with this a bit (2-disk mirror + one 256MB md), but I'm not
> completely sure how to read the bonnie++ results, nor am I sure I'm
> using the right arguments (bonnie++ -s8192 -n64 -d/pool on a machine
> that has 4GB).
> L2ARC ("cache" vdev) is supposed to improve random reads, while a "log"

It is supposed to improve random reads, if the working set is in the cache...

> vdev (presumably something that links in with the ZIL) improves random
> writes.  I'm not sure where bonnie++ tests random reads, but I do see it

It is not supposed to improve random writes, it is supposed to improve
synchronous writes (man 2 open, search for O_FSYNC... in Solaris it is

> testing random seeks.


> The options as I see them are (a) figure out some *reliable* way to
> describe to folks how to tune their systems to not experience ARC or
> memory exhaustion related issues, or (b) utilise L2ARC exclusively and
> set the ARC (arc_max) to something fairly small.

I would prefer a) together with some more sanity checking when  
changing the values. :)

It is just that it is not easy to come up with a correct sanity check...
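
To make the idea concrete, here is a rough Python sketch of the check
quoted earlier (arc_max > kmem_size - some safety value). It is only an
illustration, not kernel code: the function name and the safe_margin
parameter are hypothetical, and picking a margin that is right for
every system is exactly the hard part this does not solve.

```python
def sanitize_arc_max(arc_max, kmem_size, safe_margin):
    """Clamp arc_max so it leaves safe_margin bytes of kmem free.

    Returns (value, warning); warning is None if nothing was changed.
    safe_margin is a hypothetical placeholder, not a known-good value.
    """
    limit = kmem_size - safe_margin
    if limit <= 0:
        raise ValueError("safe_margin must be smaller than kmem_size")
    if arc_max > limit:
        warning = ("arc_max %d is too big for kmem_size %d, adjusted to %d"
                   % (arc_max, kmem_size, limit))
        return limit, warning
    return arc_max, None
```

With kmem_size at 1536 MB and an assumed margin of 512 MB, a requested
arc_max of 1500 MB would be reduced to 1024 MB and a warning returned,
while a request of 512 MB would pass through unchanged.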


If sarcasm were posted on Usenet, would anybody notice?
		-- James Nicoll

http://www.Leidinger.net    Alexander @ Leidinger.net: PGP ID = B0063FE7
http://www.FreeBSD.org       netchild @ FreeBSD.org  : PGP ID = 72077137

More information about the freebsd-stable mailing list