4.8 ffs_dirpref problem

Mon Nov 24 14:07:24 PST 2003

Don Lewis wrote:
> On 17 Nov, Ken Marx wrote:
> 
>>
>>Don Lewis wrote:
> 
> 
>>>Ok, I'll do the commit as soon as I can do some testing on my -STABLE
>>>box.
>>>
>>
>>Great. Please let us know when this happens. In fact,
>>I kind of got lost which you were planning to commit.
>>Can you point me to it, and I'll do one last overnight run.
> 
> 
> I just committed version which sets minbfree to:
> 	max(1, avgbfree - avgbfree / 4)
> 
> You may want to continue to use the version that you are already running
> which sets minbfree to avgbfree.  I'm not committing my more complex
> version because it benchmarked worse for me than the version I
> committed.
> 
> I'm pretty sure that we can do better than this, but it will require a
> fair amount of tweaking and benchmarking, but for now this version
> should work a lot better than the previous version of the code.
> 
> 
>>>>I was able to run a couple more tests here, and *belive* that the
>>>>fix to the hash table in vfs_bio.c will provide some relief
>>>>for cg block searches when things do fall into the linear search case.
>>>
>>>
>>>I'll see about cranking out patch to use a Fibonacci hash.  It'll
>>>probably be a little while before I can find sufficient time, though.
>>>
>>
>>Ditto the above: thanks/keep us posted. Our clients are
>>anxious to have a 'final' kernel to run with. I think we'll
>>just give them what you commit, and sneak the hash fix in with
>>the security patch or some such. So, no rush, but do let me
>>know if you think it might happen sooner than, say, 2 weeks
>>so I can try and get it all in one release to them.
> 
> 
> I had some time to crank out a patch.  Give this a try and compare it to
> your hash patch.  It hasn't blown up my system, but I don't have any
> benchmark data on it.  You can just do the test where you fill the
> remaining space in the filesystem.  You won't need to do a newfs and
> start from scratch.  It would be great if you could compare the hash
> bucket sizes for the different versions of the hash.
> 
> 
> Index: sys/kern/vfs_bio.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
> retrieving revision 1.242.2.21
> diff -u -r1.242.2.21 vfs_bio.c
> --- sys/kern/vfs_bio.c	9 Aug 2003 16:21:19 -0000	1.242.2.21
> +++ sys/kern/vfs_bio.c	18 Nov 2003 02:10:55 -0000
> @@ -140,6 +140,7 @@
>  	&bufreusecnt, 0, "");
>  
>  static int bufhashmask;
> +static int bufhashshift;
>  static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
>  struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };
>  char *buf_wmesg = BUF_WMESG;
> @@ -160,7 +161,20 @@
>  struct bufhashhdr *
>  bufhash(struct vnode *vnp, daddr_t bn)
>  {
> -	return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);
> +	u_int64_t hashkey64;
> +	int hashkey; 
> +	
> +	/*
> +	 * Fibonacci hash, see Knuth's
> +	 * _Art of Computer Programming, Volume 3 / Sorting and Searching_
> +	 *
> +         * We reduce the argument to 32 bits before doing the hash to
> +	 * avoid the need for a slow 64x64 multiply on 32 bit platforms.
> +	 */
> +	hashkey64 = (u_int64_t)(uintptr_t)vnp + (u_int64_t)bn;
> +	hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 2654435769u) >>
> +	    bufhashshift) & bufhashmask;
> +	return(&bufhashtbl[hashkey]);
>  }
>  
>  /*
> @@ -319,8 +333,9 @@
>  bufhashinit(caddr_t vaddr)
>  {
>  	/* first, make a null hash table */
> +	bufhashshift = 29;
>  	for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)
> -		;
> +		bufhashshift--;
>  	bufhashtbl = (void *)vaddr;
>  	vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;
>  	--bufhashmask;
> 
> 

Well, I'm mildly beflummoxed - I tried to compare hashtable preformance
between all three known versions of the hashing - legacy power of 2,
the Vicor ^= hash, and Don's fibonacci hash.

Running with 

       minifree = max( 1, avgifree / 4 );
       minbfree = max( 1, avgbfree );

all perform about the same, with no performance problems all
the way up to 100% disk capacity (didn't test into reserved space).

Looking at instrumentation to show freq and avg depth of the
hash buckets, everything seems very calm (mainly because
we're not hitting the linear searching very often, I'd presume).

I can't explain why I seemlingly got performance problems
with similar (identical) minbfree code previously.

So, out of spite, I went back to 

	minbfree = max( 1, avgbfree/4 );

This does hit the hashtable harder for the legacy version
and not so much for either new flavor. Here are a few
samplings of calling my dump routine from the debugger.
"avgdepth" really means 'search depth' since we use
the depth reached after finding a bp in gbincore.

The line below such as,

	 0: avgdepth[1] cnt=801

means that 801 of the hashtable buckets had an avg search
depth of 1 at the time the debug routine was called.
The 'N:' prefix means the N-th unique non-zero such value.
So large cnt's for small []'d depth values means an efficient hash.

I've edited out the details as much as possible.

LEGACY:
--------
Nov 24 13:34:54 oos0b /kernel: bh[442/0x1ba]: freq=2706110, avgdepth = 154
...
Nov 24 13:34:54 oos0b /kernel: 0: avgdepth[1] cnt=1015
Nov 24 13:34:54 oos0b /kernel: 1: avgdepth[2] cnt=7
Nov 24 13:34:54 oos0b /kernel: 2: avgdepth[154] cnt=1	<- !!
Nov 24 13:34:54 oos0b /kernel: 3: avgdepth[3] cnt=1
 -----------

Nov 24 13:36:49 oos0b /kernel: bh[442/0x1ba]: freq=3416953, avgdepth = 141
...
Nov 24 13:36:49 oos0b /kernel: 0: avgdepth[1] cnt=1017
Nov 24 13:36:49 oos0b /kernel: 1: avgdepth[141] cnt=1
Nov 24 13:36:49 oos0b /kernel: 2: avgdepth[2] cnt=6

VICOR x-or hashtable:
---------------------
Nov 24 13:07:24 oos0b /kernel: 0: avgdepth[1] cnt=762
Nov 24 13:07:24 oos0b /kernel: 1: avgdepth[2] cnt=259
Nov 24 13:07:24 oos0b /kernel: 2: avgdepth[3] cnt=3
 -----------

Nov 24 13:08:07 oos0b /kernel: 0: avgdepth[1] cnt=744
Nov 24 13:08:07 oos0b /kernel: 1: avgdepth[2] cnt=275
Nov 24 13:08:07 oos0b /kernel: 2: avgdepth[3] cnt=5

FIBONACCI:
----------
Nov 24 11:56:50 oos0b /kernel: 0: avgdepth[1] cnt=811
Nov 24 11:56:50 oos0b /kernel: 1: avgdepth[3] cnt=88
Nov 24 11:56:50 oos0b /kernel: 2: avgdepth[2] cnt=124
Nov 24 11:56:50 oos0b /kernel: 3: avgdepth[0] cnt=1
 -----------

Nov 24 11:57:48 oos0b /kernel: 0: avgdepth[1] cnt=801
Nov 24 11:57:48 oos0b /kernel: 1: avgdepth[3] cnt=93
Nov 24 11:57:48 oos0b /kernel: 2: avgdepth[2] cnt=130

So, while this is far from analytically eshaustive,
it almost appears the fibonacci hash has more entries
of depth 3, while the Vicor one has more at depth 2.

I'm happy to run more tests if you have ideas. I'm also fine
to cut bait and go with whatever you decide. It *seems* like
putting the fibonacci hash is prudent since the current hash
has been observed to be expensive. I had trouble proving this
unequivocally though. So, perhaps Don's minbfree fix is sufficient
after all. I'm tempted at this point to go with the 100% flavor.

Apologies for the delays and any confusion,
k
--
Ken Marx, kmarx at vicor-nb.com
If we form a subcomittee we will reach agreement and stop beating around the 
bush on the bandwith issues.
		- http://www.bigshed.com/cgi-bin/speak.cgi