4.8 ffs_dirpref problem

Tue Dec 2 18:06:42 PST 2003

Don Lewis wrote:
> On 28 Nov, To: kmarx at vicor.com wrote:
> 
>>On 24 Nov, Ken Marx wrote:
>>
>>>
>>>Don Lewis wrote:
>>
>>>>Index: sys/kern/vfs_bio.c
>>>>===================================================================
>>>>RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
>>>>retrieving revision 1.242.2.21
>>>>diff -u -r1.242.2.21 vfs_bio.c
>>>>--- sys/kern/vfs_bio.c	9 Aug 2003 16:21:19 -0000	1.242.2.21
>>>>+++ sys/kern/vfs_bio.c	18 Nov 2003 02:10:55 -0000
>>>>@@ -140,6 +140,7 @@
>>>> 	&bufreusecnt, 0, "");
>>>> 
>>>> static int bufhashmask;
>>>>+static int bufhashshift;
>>>> static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
>>>> struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };
>>>> char *buf_wmesg = BUF_WMESG;
>>>>@@ -160,7 +161,20 @@
>>>> struct bufhashhdr *
>>>> bufhash(struct vnode *vnp, daddr_t bn)
>>>> {
>>>>-	return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);
>>>>+	u_int64_t hashkey64;
>>>>+	int hashkey; 
>>>>+	
>>>>+	/*
>>>>+	 * Fibonacci hash, see Knuth's
>>>>+	 * _Art of Computer Programming, Volume 3 / Sorting and Searching_
>>>>+	 *
>>>>+         * We reduce the argument to 32 bits before doing the hash to
>>>>+	 * avoid the need for a slow 64x64 multiply on 32 bit platforms.
>>>>+	 */
>>>>+	hashkey64 = (u_int64_t)(uintptr_t)vnp + (u_int64_t)bn;
>>>>+	hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 2654435769u) >>
>>>>+	    bufhashshift) & bufhashmask;
>>>>+	return(&bufhashtbl[hashkey]);
>>>> }
>>>> 
>>>> /*
>>>>@@ -319,8 +333,9 @@
>>>> bufhashinit(caddr_t vaddr)
>>>> {
>>>> 	/* first, make a null hash table */
>>>>+	bufhashshift = 29;
>>>> 	for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)
>>>>-		;
>>>>+		bufhashshift--;
>>>> 	bufhashtbl = (void *)vaddr;
>>>> 	vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;
>>>> 	--bufhashmask;
>>>>
>>>>
>>>
>>>Well, I'm mildly beflummoxed - I tried to compare hashtable preformance
>>>between all three known versions of the hashing - legacy power of 2,
>>>the Vicor ^= hash, and Don's fibonacci hash.
>>>
>>>Running with 
>>>
>>>       minifree = max( 1, avgifree / 4 );
>>>       minbfree = max( 1, avgbfree );
>>>
>>>all perform about the same, with no performance problems all
>>>the way up to 100% disk capacity (didn't test into reserved space).
>>>
>>>Looking at instrumentation to show freq and avg depth of the
>>>hash buckets, everything seems very calm (mainly because
>>>we're not hitting the linear searching very often, I'd presume).
>>>
>>>I can't explain why I seemlingly got performance problems
>>>with similar (identical) minbfree code previously.
>>>
>>>So, out of spite, I went back to 
>>>
>>>	minbfree = max( 1, avgbfree/4 );
>>>
>>>This does hit the hashtable harder for the legacy version
>>>and not so much for either new flavor. Here are a few
>>>samplings of calling my dump routine from the debugger.
>>>"avgdepth" really means 'search depth' since we use
>>>the depth reached after finding a bp in gbincore.
>>>
>>>The line below such as,
>>>
>>>	 0: avgdepth[1] cnt=801
>>>
>>>means that 801 of the hashtable buckets had an avg search
>>>depth of 1 at the time the debug routine was called.
>>>The 'N:' prefix means the N-th unique non-zero such value.
>>>So large cnt's for small []'d depth values means an efficient hash.
>>>
>>>I've edited out the details as much as possible.
>>>
>>>LEGACY:
>>>--------
>>>Nov 24 13:34:54 oos0b /kernel: bh[442/0x1ba]: freq=2706110, avgdepth = 154
>>>...
>>>Nov 24 13:34:54 oos0b /kernel: 0: avgdepth[1] cnt=1015
>>>Nov 24 13:34:54 oos0b /kernel: 1: avgdepth[2] cnt=7
>>>Nov 24 13:34:54 oos0b /kernel: 2: avgdepth[154] cnt=1	<- !!
>>>Nov 24 13:34:54 oos0b /kernel: 3: avgdepth[3] cnt=1
>>> -----------
>>>
>>>Nov 24 13:36:49 oos0b /kernel: bh[442/0x1ba]: freq=3416953, avgdepth = 141
>>>...
>>>Nov 24 13:36:49 oos0b /kernel: 0: avgdepth[1] cnt=1017
>>>Nov 24 13:36:49 oos0b /kernel: 1: avgdepth[141] cnt=1
>>>Nov 24 13:36:49 oos0b /kernel: 2: avgdepth[2] cnt=6
>>>
>>>VICOR x-or hashtable:
>>>---------------------
>>>Nov 24 13:07:24 oos0b /kernel: 0: avgdepth[1] cnt=762
>>>Nov 24 13:07:24 oos0b /kernel: 1: avgdepth[2] cnt=259
>>>Nov 24 13:07:24 oos0b /kernel: 2: avgdepth[3] cnt=3
>>> -----------
>>>
>>>Nov 24 13:08:07 oos0b /kernel: 0: avgdepth[1] cnt=744
>>>Nov 24 13:08:07 oos0b /kernel: 1: avgdepth[2] cnt=275
>>>Nov 24 13:08:07 oos0b /kernel: 2: avgdepth[3] cnt=5
>>>
>>>FIBONACCI:
>>>----------
>>>Nov 24 11:56:50 oos0b /kernel: 0: avgdepth[1] cnt=811
>>>Nov 24 11:56:50 oos0b /kernel: 1: avgdepth[3] cnt=88
>>>Nov 24 11:56:50 oos0b /kernel: 2: avgdepth[2] cnt=124
>>>Nov 24 11:56:50 oos0b /kernel: 3: avgdepth[0] cnt=1
>>> -----------
>>>
>>>Nov 24 11:57:48 oos0b /kernel: 0: avgdepth[1] cnt=801
>>>Nov 24 11:57:48 oos0b /kernel: 1: avgdepth[3] cnt=93
>>>Nov 24 11:57:48 oos0b /kernel: 2: avgdepth[2] cnt=130
>>>
>>>So, while this is far from analytically eshaustive,
>>>it almost appears the fibonacci hash has more entries
>>>of depth 3, while the Vicor one has more at depth 2.
>>>
>>>I'm happy to run more tests if you have ideas. I'm also fine
>>>to cut bait and go with whatever you decide. It *seems* like
>>>putting the fibonacci hash is prudent since the current hash
>>>has been observed to be expensive. I had trouble proving this
>>>unequivocally though. So, perhaps Don's minbfree fix is sufficient
>>>after all. I'm tempted at this point to go with the 100% flavor.
>>
>>I think we're running into one of the weaknesses in the Fibonacci hash.
>>There are a large number of hash entries for the cylinder group blocks,
>>which are located at offsets which are multiples of 89 * 2^10 in your
>>example, or something on the order of 2^16.  The effect of this is for
>>the cylinder group number to be hashed using the least significant bits
>>of the hash multiplier, which don't work as well for distributing the
>>hash values.  I tried some of Knuth's suggestions, and got better
>>results with the hash multiplier 0x9E376DB1u.  The most significant 16
>>bits of the multplier are the same as the original constant, and the
>>least significant bits act as a fraction in the desirable range of 1/3
>>to 3/7.  Please give this new hash multiplier a try.
> 
> 
> I went ahead and spun a new version of my patch with the new multiplier,
> one other tweak to the formula, and updated comments.
> 
> Index: sys/kern/vfs_bio.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
> retrieving revision 1.242.2.21
> diff -u -r1.242.2.21 vfs_bio.c
> --- sys/kern/vfs_bio.c	9 Aug 2003 16:21:19 -0000	1.242.2.21
> +++ sys/kern/vfs_bio.c	28 Nov 2003 20:02:06 -0000
> @@ -140,6 +140,7 @@
>  	&bufreusecnt, 0, "");
>  
>  static int bufhashmask;
> +static int bufhashshift;
>  static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
>  struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };
>  char *buf_wmesg = BUF_WMESG;
> @@ -160,7 +161,40 @@
>  struct bufhashhdr *
>  bufhash(struct vnode *vnp, daddr_t bn)
>  {
> -	return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);
> +	u_int64_t hashkey64;
> +	int hashkey; 
> +	
> +	/*
> +	 * A variation on the Fibonacci hash that Knuth credits to
> +	 * R. W. Floyd, see Knuth's _Art of Computer Programming,
> +	 * Volume 3 / Sorting and Searching_
> +	 *
> +         * We reduce the argument to 32 bits before doing the hash to
> +	 * avoid the need for a slow 64x64 multiply on 32 bit platforms.
> +	 *
> +	 * sizeof(struct vnode) is 168 on i386, so toss some of the lower
> +	 * bits of the vnode address to reduce the key range, which
> +	 * improves the distribution of keys across buckets.
> +	 *
> +	 * The file system cylinder group blocks are very heavily
> +	 * used.  They are located at invervals of fbg, which is
> +	 * on the order of 89 to 94 * 2^10, depending on other
> +	 * filesystem parameters, for a 16k block size.  Smaller block
> +	 * sizes will reduce fpg approximately proportionally.  This
> +	 * will cause the cylinder group index to be hashed using the
> +	 * lower bits of the hash multiplier, which will not distribute
> +	 * the keys as uniformly in a classic Fibonacci hash where a
> +	 * relatively small number of the upper bits of the result
> +	 * are used.  Using 2^16 as a close-enough approximation to
> +	 * fpg, split the hash multiplier in half, with the upper 16
> +	 * bits being the inverse of the golden ratio, and the lower
> +	 * 16 bits being a fraction between 1/3 and 3/7 (closer to
> +	 * 3/7 in this case), that gives good experimental results.
> +	 */
> +	hashkey64 = ((u_int64_t)(uintptr_t)vnp >> 3) + (u_int64_t)bn;
> +	hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 0x9E376DB1u) >>
> +	    bufhashshift) & bufhashmask;
> +	return(&bufhashtbl[hashkey]);
>  }
>  
>  /*
> @@ -319,8 +353,9 @@
>  bufhashinit(caddr_t vaddr)
>  {
>  	/* first, make a null hash table */
> +	bufhashshift = 29;
>  	for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)
> -		;
> +		bufhashshift--;
>  	bufhashtbl = (void *)vaddr;
>  	vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;
>  	--bufhashmask;
> 
> 

Hi, Sorry for the delay. Tested, then realized a stupid mistake,
started over, panic'd the box, and finally tested. Panic was
a bit unsettling. Happened when I tried to call my dump routine
from the debugger. Might have been a typo for all I know(!?) -
was doing quickly to keep benchmark script from bailing due to
"Profiling timer expired" killing my timer. (That's another kettle
of wax, I guess.)

Other than that, the newer hashkey stuff does seem to improve
the shallowness of the table. Below are some samples while doing
the 1.4gb untar commands on 98% and 99%+ full disk. Again, 
running w/ 75% avgbfree, avgifree so as to hit the table harder.

Let me know if this helps or you need something further.

regards,
k.
-- 
Ken Marx, kmarx at vicor-nb.com
We need to optimize and focus hard to resolve the long pole in the tent.
		- http://www.bigshed.com/cgi-bin/speak.cgi

------- snip --------
at 98% full:
==========

----------
Dec  2 16:03:44 oos0b /kernel: 0: avgdepth[0] cnt=602
Dec  2 16:03:44 oos0b /kernel: 1: avgdepth[1] cnt=61
Dec  2 16:03:44 oos0b /kernel: 2: avgdepth[2] cnt=187
----------
Dec  2 16:04:59 oos0b /kernel: 0: avgdepth[1] cnt=1008
Dec  2 16:04:59 oos0b /kernel: 1: avgdepth[2] cnt=15
Dec  2 16:04:59 oos0b /kernel: 2: avgdepth[3] cnt=1
----------
Dec  2 16:06:20 oos0b /kernel: 0: avgdepth[1] cnt=971
Dec  2 16:06:20 oos0b /kernel: 1: avgdepth[2] cnt=51
Dec  2 16:06:20 oos0b /kernel: 2: avgdepth[3] cnt=2
----------
Dec  2 16:06:41 oos0b /kernel: 0: avgdepth[2] cnt=132
Dec  2 16:06:41 oos0b /kernel: 1: avgdepth[1] cnt=885
Dec  2 16:06:41 oos0b /kernel: 2: avgdepth[3] cnt=7

(then panic'd on next call to dump routine from debugger)

-------------------------------------------------------------------------------

at 99% full:
============

----------
Dec  2 17:41:36 oos0b /kernel: 0: avgdepth[2] cnt=211
Dec  2 17:41:36 oos0b /kernel: 1: avgdepth[1] cnt=808
Dec  2 17:41:36 oos0b /kernel: 2: avgdepth[3] cnt=5
----------
Dec  2 17:42:12 oos0b /kernel: 0: avgdepth[2] cnt=68
Dec  2 17:42:13 oos0b /kernel: 1: avgdepth[1] cnt=954
Dec  2 17:42:13 oos0b /kernel: 2: avgdepth[3] cnt=2
----------
Dec  2 17:43:17 oos0b /kernel: 0: avgdepth[1] cnt=965
Dec  2 17:43:17 oos0b /kernel: 1: avgdepth[2] cnt=57
Dec  2 17:43:17 oos0b /kernel: 2: avgdepth[3] cnt=2
----------
Dec  2 17:43:28 oos0b /kernel: 0: avgdepth[2] cnt=124
Dec  2 17:43:28 oos0b /kernel: 1: avgdepth[1] cnt=897
Dec  2 17:43:28 oos0b /kernel: 2: avgdepth[3] cnt=3
----------
Dec  2 17:43:43 oos0b /kernel: 0: avgdepth[2] cnt=183
Dec  2 17:43:43 oos0b /kernel: 1: avgdepth[1] cnt=833
Dec  2 17:43:43 oos0b /kernel: 2: avgdepth[3] cnt=8
----------
Dec  2 17:44:04 oos0b /kernel: 0: avgdepth[2] cnt=201
Dec  2 17:44:04 oos0b /kernel: 1: avgdepth[1] cnt=818
Dec  2 17:44:04 oos0b /kernel: 2: avgdepth[3] cnt=5
----------
Dec  2 17:44:25 oos0b /kernel: 0: avgdepth[2] cnt=205
Dec  2 17:44:25 oos0b /kernel: 1: avgdepth[1] cnt=813
Dec  2 17:44:25 oos0b /kernel: 2: avgdepth[3] cnt=6
----------
Dec  2 17:45:39 oos0b /kernel: 0: avgdepth[2] cnt=216
Dec  2 17:45:39 oos0b /kernel: 1: avgdepth[1] cnt=802
Dec  2 17:45:39 oos0b /kernel: 2: avgdepth[3] cnt=6
----------
Dec  2 17:45:44 oos0b /kernel: 0: avgdepth[2] cnt=215
Dec  2 17:45:44 oos0b /kernel: 1: avgdepth[1] cnt=802
Dec  2 17:45:44 oos0b /kernel: 2: avgdepth[3] cnt=7
----------
Dec  2 17:46:15 oos0b /kernel: 0: avgdepth[2] cnt=212
Dec  2 17:46:15 oos0b /kernel: 1: avgdepth[1] cnt=804
Dec  2 17:46:15 oos0b /kernel: 2: avgdepth[3] cnt=8
----------
Dec  2 17:47:51 oos0b /kernel: 0: avgdepth[2] cnt=202
Dec  2 17:47:51 oos0b /kernel: 1: avgdepth[1] cnt=818
Dec  2 17:47:51 oos0b /kernel: 2: avgdepth[3] cnt=4
----------
Dec  2 17:48:05 oos0b /kernel: 0: avgdepth[2] cnt=210
Dec  2 17:48:05 oos0b /kernel: 1: avgdepth[1] cnt=809
Dec  2 17:48:05 oos0b /kernel: 2: avgdepth[3] cnt=5
----------
Dec  2 17:48:18 oos0b /kernel: 0: avgdepth[2] cnt=214
Dec  2 17:48:18 oos0b /kernel: 1: avgdepth[1] cnt=802
Dec  2 17:48:18 oos0b /kernel: 2: avgdepth[3] cnt=8
----------
Dec  2 17:48:33 oos0b /kernel: 0: avgdepth[2] cnt=215
Dec  2 17:48:33 oos0b /kernel: 1: avgdepth[1] cnt=801
Dec  2 17:48:33 oos0b /kernel: 2: avgdepth[3] cnt=8
----------
Dec  2 17:48:50 oos0b /kernel: 0: avgdepth[2] cnt=229
Dec  2 17:48:50 oos0b /kernel: 1: avgdepth[1] cnt=785
Dec  2 17:48:50 oos0b /kernel: 2: avgdepth[3] cnt=10
----------
Dec  2 17:48:57 oos0b /kernel: 0: avgdepth[2] cnt=231
Dec  2 17:48:57 oos0b /kernel: 1: avgdepth[1] cnt=781
Dec  2 17:48:57 oos0b /kernel: 2: avgdepth[3] cnt=12
----------
Dec  2 17:49:11 oos0b /kernel: 0: avgdepth[2] cnt=231
Dec  2 17:49:11 oos0b /kernel: 1: avgdepth[1] cnt=781
Dec  2 17:49:11 oos0b /kernel: 2: avgdepth[3] cnt=12