4.8 ffs_dirpref problem

Fri Nov 28 13:35:15 PST 2003

On 28 Nov, To: kmarx at vicor.com wrote:
> On 24 Nov, Ken Marx wrote:
>> 
>> 
>> Don Lewis wrote:
> 
>>> Index: sys/kern/vfs_bio.c
>>> ===================================================================
>>> RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
>>> retrieving revision 1.242.2.21
>>> diff -u -r1.242.2.21 vfs_bio.c
>>> --- sys/kern/vfs_bio.c	9 Aug 2003 16:21:19 -0000	1.242.2.21
>>> +++ sys/kern/vfs_bio.c	18 Nov 2003 02:10:55 -0000
>>> @@ -140,6 +140,7 @@
>>>  	&bufreusecnt, 0, "");
>>>  
>>>  static int bufhashmask;
>>> +static int bufhashshift;
>>>  static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
>>>  struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };
>>>  char *buf_wmesg = BUF_WMESG;
>>> @@ -160,7 +161,20 @@
>>>  struct bufhashhdr *
>>>  bufhash(struct vnode *vnp, daddr_t bn)
>>>  {
>>> -	return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);
>>> +	u_int64_t hashkey64;
>>> +	int hashkey; 
>>> +	
>>> +	/*
>>> +	 * Fibonacci hash, see Knuth's
>>> +	 * _Art of Computer Programming, Volume 3 / Sorting and Searching_
>>> +	 *
>>> +	 * We reduce the argument to 32 bits before doing the hash to
>>> +	 * avoid the need for a slow 64x64 multiply on 32 bit platforms.
>>> +	 */
>>> +	hashkey64 = (u_int64_t)(uintptr_t)vnp + (u_int64_t)bn;
>>> +	hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 2654435769u) >>
>>> +	    bufhashshift) & bufhashmask;
>>> +	return(&bufhashtbl[hashkey]);
>>>  }
>>>  
>>>  /*
>>> @@ -319,8 +333,9 @@
>>>  bufhashinit(caddr_t vaddr)
>>>  {
>>>  	/* first, make a null hash table */
>>> +	bufhashshift = 29;
>>>  	for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)
>>> -		;
>>> +		bufhashshift--;
>>>  	bufhashtbl = (void *)vaddr;
>>>  	vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;
>>>  	--bufhashmask;
>>> 
>>> 
>> 
>> Well, I'm mildly beflummoxed - I tried to compare hashtable performance
>> between all three known versions of the hashing - legacy power of 2,
>> the Vicor ^= hash, and Don's Fibonacci hash.
>> 
>> Running with 
>> 
>>        minifree = max( 1, avgifree / 4 );
>>        minbfree = max( 1, avgbfree );
>> 
>> all perform about the same, with no performance problems all
>> the way up to 100% disk capacity (didn't test into reserved space).
>> 
>> Looking at instrumentation to show freq and avg depth of the
>> hash buckets, everything seems very calm (mainly because
>> we're not hitting the linear searching very often, I'd presume).
>> 
>> I can't explain why I seemingly got performance problems
>> with similar (identical) minbfree code previously.
>> 
>> So, out of spite, I went back to 
>> 
>> 	minbfree = max( 1, avgbfree/4 );
>> 
>> This does hit the hashtable harder for the legacy version
>> and not so much for either new flavor. Here are a few
>> samplings of calling my dump routine from the debugger.
>> "avgdepth" really means 'search depth' since we use
>> the depth reached after finding a bp in gbincore.
>> 
>> The line below such as,
>> 
>> 	 0: avgdepth[1] cnt=801
>> 
>> means that 801 of the hashtable buckets had an avg search
>> depth of 1 at the time the debug routine was called.
>> The 'N:' prefix means the N-th unique non-zero such value.
>> So large cnt's for small []'d depth values means an efficient hash.
>> 
>> I've edited out the details as much as possible.
>> 
>> LEGACY:
>> --------
>> Nov 24 13:34:54 oos0b /kernel: bh[442/0x1ba]: freq=2706110, avgdepth = 154
>> ...
>> Nov 24 13:34:54 oos0b /kernel: 0: avgdepth[1] cnt=1015
>> Nov 24 13:34:54 oos0b /kernel: 1: avgdepth[2] cnt=7
>> Nov 24 13:34:54 oos0b /kernel: 2: avgdepth[154] cnt=1	<- !!
>> Nov 24 13:34:54 oos0b /kernel: 3: avgdepth[3] cnt=1
>>  -----------
>> 
>> Nov 24 13:36:49 oos0b /kernel: bh[442/0x1ba]: freq=3416953, avgdepth = 141
>> ...
>> Nov 24 13:36:49 oos0b /kernel: 0: avgdepth[1] cnt=1017
>> Nov 24 13:36:49 oos0b /kernel: 1: avgdepth[141] cnt=1
>> Nov 24 13:36:49 oos0b /kernel: 2: avgdepth[2] cnt=6
>> 
>> VICOR x-or hashtable:
>> ---------------------
>> Nov 24 13:07:24 oos0b /kernel: 0: avgdepth[1] cnt=762
>> Nov 24 13:07:24 oos0b /kernel: 1: avgdepth[2] cnt=259
>> Nov 24 13:07:24 oos0b /kernel: 2: avgdepth[3] cnt=3
>>  -----------
>> 
>> Nov 24 13:08:07 oos0b /kernel: 0: avgdepth[1] cnt=744
>> Nov 24 13:08:07 oos0b /kernel: 1: avgdepth[2] cnt=275
>> Nov 24 13:08:07 oos0b /kernel: 2: avgdepth[3] cnt=5
>> 
>> FIBONACCI:
>> ----------
>> Nov 24 11:56:50 oos0b /kernel: 0: avgdepth[1] cnt=811
>> Nov 24 11:56:50 oos0b /kernel: 1: avgdepth[3] cnt=88
>> Nov 24 11:56:50 oos0b /kernel: 2: avgdepth[2] cnt=124
>> Nov 24 11:56:50 oos0b /kernel: 3: avgdepth[0] cnt=1
>>  -----------
>> 
>> Nov 24 11:57:48 oos0b /kernel: 0: avgdepth[1] cnt=801
>> Nov 24 11:57:48 oos0b /kernel: 1: avgdepth[3] cnt=93
>> Nov 24 11:57:48 oos0b /kernel: 2: avgdepth[2] cnt=130
>> 
>> So, while this is far from analytically exhaustive,
>> it almost appears the Fibonacci hash has more entries
>> of depth 3, while the Vicor one has more at depth 2.
>> 
>> I'm happy to run more tests if you have ideas. I'm also fine
>> to cut bait and go with whatever you decide. It *seems* like
>> putting in the Fibonacci hash is prudent since the current hash
>> has been observed to be expensive. I had trouble proving this
>> unequivocally though. So, perhaps Don's minbfree fix is sufficient
>> after all. I'm tempted at this point to go with the 100% flavor.
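
For anyone who wants to reproduce that kind of summary, here is a
minimal user-space sketch of how per-bucket search depths could be
grouped into "avgdepth[D] cnt=N" lines.  The actual instrumentation
is not shown in this thread, so the structures and function names
below (bucket_stat, depth_count, depth_histogram) are hypothetical,
as is the assumption that each bucket keeps a lookup count and a sum
of the depths at which the buffer was found.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-bucket counters; not the real instrumentation. */
struct bucket_stat {
	uint64_t lookups;	/* successful searches in this bucket */
	uint64_t depth_sum;	/* sum of chain depths where bp was found */
};

struct depth_count {
	uint64_t avgdepth;	/* integer average search depth */
	size_t cnt;		/* number of buckets with that average */
};

/* Integer average search depth for one bucket (0 if never searched). */
static uint64_t
bucket_avgdepth(const struct bucket_stat *b)
{
	return (b->lookups == 0 ? 0 : b->depth_sum / b->lookups);
}

/*
 * Group buckets by average search depth, in order of first appearance.
 * Returns the number of distinct values stored in out[] (at most outmax).
 */
static size_t
depth_histogram(const struct bucket_stat *tbl, size_t nbuckets,
    struct depth_count *out, size_t outmax)
{
	size_t i, j, n = 0;

	for (i = 0; i < nbuckets; i++) {
		uint64_t d = bucket_avgdepth(&tbl[i]);

		/* Look for an existing entry with this average depth. */
		for (j = 0; j < n; j++)
			if (out[j].avgdepth == d)
				break;
		if (j == n) {
			if (n == outmax)
				continue;	/* no room: drop this value */
			out[n].avgdepth = d;
			out[n].cnt = 0;
			n++;
		}
		out[j].cnt++;
	}
	return (n);
}
```

A caller would fill one bucket_stat per hash bucket, call
depth_histogram(), and print each out[i] as "i: avgdepth[D] cnt=N".
A well-distributed hash shows large counts at small depths.
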
> 
> I think we're running into one of the weaknesses in the Fibonacci hash.
> There are a large number of hash entries for the cylinder group blocks,
> which are located at offsets which are multiples of 89 * 2^10 in your
> example, or something on the order of 2^16.  The effect of this is for
> the cylinder group number to be hashed using the least significant bits
> of the hash multiplier, which don't work as well for distributing the
> hash values.  I tried some of Knuth's suggestions, and got better
> results with the hash multiplier 0x9E376DB1u.  The most significant 16
> bits of the multiplier are the same as the original constant, and the
> least significant bits act as a fraction in the desirable range of 1/3
> to 3/7.  Please give this new hash multiplier a try.
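
To make that split concrete, here is a small user-space sketch (not
part of the patch; the helper names are made up for illustration)
that reads each 16-bit half of a multiplier as a fraction of 2^16.
For 0x9E376DB1 the upper half, 0x9E37, comes out at about 0.618, and
the lower half, 0x6DB1, at about 0.4285, just under 3/7.  The lower
half of the original 2654435769 (0x9E3779B9) is about 0.4755, which
is outside the 1/3 to 3/7 range.

```c
#include <stdint.h>

/* Upper 16 bits of a 32-bit multiplier, read as a fraction of 2^16. */
static double
mult_hi_fraction(uint32_t mult)
{
	return ((double)(mult >> 16) / 65536.0);
}

/* Lower 16 bits of a 32-bit multiplier, read as a fraction of 2^16. */
static double
mult_lo_fraction(uint32_t mult)
{
	return ((double)(mult & 0xffffu) / 65536.0);
}
```

Keys that are multiples of roughly 2^16 push the bits that vary into
the upper half of the key, so the bits that reach the bucket index
come from multiplying those varying bits by the lower half of the
multiplier.  That is why the lower half's fraction matters.
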

I went ahead and spun a new version of my patch with the new multiplier,
one other tweak to the formula, and updated comments.

Index: sys/kern/vfs_bio.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
retrieving revision 1.242.2.21
diff -u -r1.242.2.21 vfs_bio.c
--- sys/kern/vfs_bio.c	9 Aug 2003 16:21:19 -0000	1.242.2.21
+++ sys/kern/vfs_bio.c	28 Nov 2003 20:02:06 -0000
@@ -140,6 +140,7 @@
 	&bufreusecnt, 0, "");
 
 static int bufhashmask;
+static int bufhashshift;
 static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
 struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };
 char *buf_wmesg = BUF_WMESG;
@@ -160,7 +161,40 @@
 struct bufhashhdr *
 bufhash(struct vnode *vnp, daddr_t bn)
 {
-	return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);
+	u_int64_t hashkey64;
+	int hashkey; 
+	
+	/*
+	 * A variation on the Fibonacci hash that Knuth credits to
+	 * R. W. Floyd, see Knuth's _Art of Computer Programming,
+	 * Volume 3 / Sorting and Searching_
+	 *
+	 * We reduce the argument to 32 bits before doing the hash to
+	 * avoid the need for a slow 64x64 multiply on 32 bit platforms.
+	 *
+	 * sizeof(struct vnode) is 168 on i386, so toss some of the lower
+	 * bits of the vnode address to reduce the key range, which
+	 * improves the distribution of keys across buckets.
+	 *
+	 * The file system cylinder group blocks are very heavily
+	 * used.  They are located at intervals of fpg, which is
+	 * on the order of 89 to 94 * 2^10, depending on other
+	 * filesystem parameters, for a 16k block size.  Smaller block
+	 * sizes will reduce fpg approximately proportionally.  This
+	 * will cause the cylinder group index to be hashed using the
+	 * lower bits of the hash multiplier, which will not distribute
+	 * the keys as uniformly in a classic Fibonacci hash where a
+	 * relatively small number of the upper bits of the result
+	 * are used.  Using 2^16 as a close-enough approximation to
+	 * fpg, split the hash multiplier in half, with the upper 16
+	 * bits being the inverse of the golden ratio, and the lower
+	 * 16 bits being a fraction between 1/3 and 3/7 (closer to
+	 * 3/7 in this case), that gives good experimental results.
+	 */
+	hashkey64 = ((u_int64_t)(uintptr_t)vnp >> 3) + (u_int64_t)bn;
+	hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 0x9E376DB1u) >>
+	    bufhashshift) & bufhashmask;
+	return(&bufhashtbl[hashkey]);
 }
 
 /*
@@ -319,8 +353,9 @@
 bufhashinit(caddr_t vaddr)
 {
 	/* first, make a null hash table */
+	bufhashshift = 29;
 	for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)
-		;
+		bufhashshift--;
 	bufhashtbl = (void *)vaddr;
 	vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;
 	--bufhashmask;
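
For experimenting outside the kernel, the following is a rough
user-space sketch of the same arithmetic.  It is not the patch
itself: bufhash_params, bufhash_size(), bufhash_index() and
cg_buckets_used() are names invented here, the vnode pointer is
passed as a plain integer, and using a fixed stride between cylinder
group keys is a simplifying assumption taken from the comment in the
patch.

```c
#include <stdint.h>
#include <stdlib.h>

struct bufhash_params {
	uint32_t mask;	/* number of buckets - 1 */
	int shift;	/* right shift applied to the 32-bit product */
};

/* Mirror the sizing loop in bufhashinit(): shift + log2(buckets) == 32. */
static struct bufhash_params
bufhash_size(int nbuf)
{
	struct bufhash_params p;
	uint32_t nbuckets;

	p.shift = 29;
	for (nbuckets = 8; nbuckets < (uint32_t)(nbuf / 4); nbuckets <<= 1)
		p.shift--;
	p.mask = nbuckets - 1;
	return (p);
}

/* Same key folding and multiply/shift as the patched bufhash(). */
static uint32_t
bufhash_index(uintptr_t vnp, int64_t bn, uint32_t mult,
    struct bufhash_params p)
{
	uint64_t key64 = ((uint64_t)vnp >> 3) + (uint64_t)bn;
	uint32_t key32 = (uint32_t)(key64 + (key64 >> 32));

	return (((key32 * mult) >> p.shift) & p.mask);
}

/*
 * Count the distinct buckets hit by ncg keys spaced stride apart,
 * a stand-in for cylinder group blocks.  Returns 0 if out of memory.
 */
static uint32_t
cg_buckets_used(uint32_t mult, uint32_t stride, uint32_t ncg,
    struct bufhash_params p)
{
	unsigned char *seen;
	uint32_t cg, idx, used = 0;

	seen = calloc((size_t)p.mask + 1, 1);
	if (seen == NULL)
		return (0);
	for (cg = 0; cg < ncg; cg++) {
		idx = bufhash_index(0, (int64_t)cg * stride, mult, p);
		if (!seen[idx]) {
			seen[idx] = 1;
			used++;
		}
	}
	free(seen);
	return (used);
}
```

With an assumed nbuf of 4096, bufhash_size() gives 1024 buckets and a
shift of 22, which matches the bucket counts in the dumps above (each
dump adds up to 1024).  Calling cg_buckets_used() with 2654435769u
and then 0x9E376DB1u, using a stride such as 89 * 1024, gives a quick
way to compare how the two multipliers spread that kind of key.
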