4.8 ffs_dirpref problem
Ken Marx
kmarx at vicor.com
Tue Dec 2 18:06:42 PST 2003
Don Lewis wrote:
> On 28 Nov, To: kmarx at vicor.com wrote:
>
>>On 24 Nov, Ken Marx wrote:
>>
>>>
>>>Don Lewis wrote:
>>
>>>>Index: sys/kern/vfs_bio.c
>>>>===================================================================
>>>>RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
>>>>retrieving revision 1.242.2.21
>>>>diff -u -r1.242.2.21 vfs_bio.c
>>>>--- sys/kern/vfs_bio.c 9 Aug 2003 16:21:19 -0000 1.242.2.21
>>>>+++ sys/kern/vfs_bio.c 18 Nov 2003 02:10:55 -0000
>>>>@@ -140,6 +140,7 @@
>>>> &bufreusecnt, 0, "");
>>>>
>>>> static int bufhashmask;
>>>>+static int bufhashshift;
>>>> static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
>>>> struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };
>>>> char *buf_wmesg = BUF_WMESG;
>>>>@@ -160,7 +161,20 @@
>>>> struct bufhashhdr *
>>>> bufhash(struct vnode *vnp, daddr_t bn)
>>>> {
>>>>- return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);
>>>>+ u_int64_t hashkey64;
>>>>+ int hashkey;
>>>>+
>>>>+ /*
>>>>+ * Fibonacci hash, see Knuth's
>>>>+ * _Art of Computer Programming, Volume 3 / Sorting and Searching_
>>>>+ *
>>>>+ * We reduce the argument to 32 bits before doing the hash to
>>>>+ * avoid the need for a slow 64x64 multiply on 32 bit platforms.
>>>>+ */
>>>>+ hashkey64 = (u_int64_t)(uintptr_t)vnp + (u_int64_t)bn;
>>>>+ hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 2654435769u) >>
>>>>+ bufhashshift) & bufhashmask;
>>>>+ return(&bufhashtbl[hashkey]);
>>>> }
>>>>
>>>> /*
>>>>@@ -319,8 +333,9 @@
>>>> bufhashinit(caddr_t vaddr)
>>>> {
>>>> /* first, make a null hash table */
>>>>+ bufhashshift = 29;
>>>> for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)
>>>>- ;
>>>>+ bufhashshift--;
>>>> bufhashtbl = (void *)vaddr;
>>>> vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;
>>>> --bufhashmask;
>>>>
>>>>
>>>
>>>Well, I'm mildly beflummoxed - I tried to compare hashtable preformance
>>>between all three known versions of the hashing - legacy power of 2,
>>>the Vicor ^= hash, and Don's fibonacci hash.
>>>
>>>Running with
>>>
>>> minifree = max( 1, avgifree / 4 );
>>> minbfree = max( 1, avgbfree );
>>>
>>>all perform about the same, with no performance problems all
>>>the way up to 100% disk capacity (didn't test into reserved space).
>>>
>>>Looking at instrumentation to show freq and avg depth of the
>>>hash buckets, everything seems very calm (mainly because
>>>we're not hitting the linear searching very often, I'd presume).
>>>
>>>I can't explain why I seemlingly got performance problems
>>>with similar (identical) minbfree code previously.
>>>
>>>So, out of spite, I went back to
>>>
>>> minbfree = max( 1, avgbfree/4 );
>>>
>>>This does hit the hashtable harder for the legacy version
>>>and not so much for either new flavor. Here are a few
>>>samplings of calling my dump routine from the debugger.
>>>"avgdepth" really means 'search depth' since we use
>>>the depth reached after finding a bp in gbincore.
>>>
>>>The line below such as,
>>>
>>> 0: avgdepth[1] cnt=801
>>>
>>>means that 801 of the hashtable buckets had an avg search
>>>depth of 1 at the time the debug routine was called.
>>>The 'N:' prefix means the N-th unique non-zero such value.
>>>So large cnt's for small []'d depth values means an efficient hash.
>>>
>>>I've edited out the details as much as possible.
>>>
>>>LEGACY:
>>>--------
>>>Nov 24 13:34:54 oos0b /kernel: bh[442/0x1ba]: freq=2706110, avgdepth = 154
>>>...
>>>Nov 24 13:34:54 oos0b /kernel: 0: avgdepth[1] cnt=1015
>>>Nov 24 13:34:54 oos0b /kernel: 1: avgdepth[2] cnt=7
>>>Nov 24 13:34:54 oos0b /kernel: 2: avgdepth[154] cnt=1 <- !!
>>>Nov 24 13:34:54 oos0b /kernel: 3: avgdepth[3] cnt=1
>>> -----------
>>>
>>>Nov 24 13:36:49 oos0b /kernel: bh[442/0x1ba]: freq=3416953, avgdepth = 141
>>>...
>>>Nov 24 13:36:49 oos0b /kernel: 0: avgdepth[1] cnt=1017
>>>Nov 24 13:36:49 oos0b /kernel: 1: avgdepth[141] cnt=1
>>>Nov 24 13:36:49 oos0b /kernel: 2: avgdepth[2] cnt=6
>>>
>>>VICOR x-or hashtable:
>>>---------------------
>>>Nov 24 13:07:24 oos0b /kernel: 0: avgdepth[1] cnt=762
>>>Nov 24 13:07:24 oos0b /kernel: 1: avgdepth[2] cnt=259
>>>Nov 24 13:07:24 oos0b /kernel: 2: avgdepth[3] cnt=3
>>> -----------
>>>
>>>Nov 24 13:08:07 oos0b /kernel: 0: avgdepth[1] cnt=744
>>>Nov 24 13:08:07 oos0b /kernel: 1: avgdepth[2] cnt=275
>>>Nov 24 13:08:07 oos0b /kernel: 2: avgdepth[3] cnt=5
>>>
>>>FIBONACCI:
>>>----------
>>>Nov 24 11:56:50 oos0b /kernel: 0: avgdepth[1] cnt=811
>>>Nov 24 11:56:50 oos0b /kernel: 1: avgdepth[3] cnt=88
>>>Nov 24 11:56:50 oos0b /kernel: 2: avgdepth[2] cnt=124
>>>Nov 24 11:56:50 oos0b /kernel: 3: avgdepth[0] cnt=1
>>> -----------
>>>
>>>Nov 24 11:57:48 oos0b /kernel: 0: avgdepth[1] cnt=801
>>>Nov 24 11:57:48 oos0b /kernel: 1: avgdepth[3] cnt=93
>>>Nov 24 11:57:48 oos0b /kernel: 2: avgdepth[2] cnt=130
>>>
>>>So, while this is far from analytically eshaustive,
>>>it almost appears the fibonacci hash has more entries
>>>of depth 3, while the Vicor one has more at depth 2.
>>>
>>>I'm happy to run more tests if you have ideas. I'm also fine
>>>to cut bait and go with whatever you decide. It *seems* like
>>>putting the fibonacci hash is prudent since the current hash
>>>has been observed to be expensive. I had trouble proving this
>>>unequivocally though. So, perhaps Don's minbfree fix is sufficient
>>>after all. I'm tempted at this point to go with the 100% flavor.
>>
>>I think we're running into one of the weaknesses in the Fibonacci hash.
>>There are a large number of hash entries for the cylinder group blocks,
>>which are located at offsets which are multiples of 89 * 2^10 in your
>>example, or something on the order of 2^16. The effect of this is for
>>the cylinder group number to be hashed using the least significant bits
>>of the hash multiplier, which don't work as well for distributing the
>>hash values. I tried some of Knuth's suggestions, and got better
>>results with the hash multiplier 0x9E376DB1u. The most significant 16
>>bits of the multplier are the same as the original constant, and the
>>least significant bits act as a fraction in the desirable range of 1/3
>>to 3/7. Please give this new hash multiplier a try.
>
>
> I went ahead and spun a new version of my patch with the new multiplier,
> one other tweak to the formula, and updated comments.
>
> Index: sys/kern/vfs_bio.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
> retrieving revision 1.242.2.21
> diff -u -r1.242.2.21 vfs_bio.c
> --- sys/kern/vfs_bio.c 9 Aug 2003 16:21:19 -0000 1.242.2.21
> +++ sys/kern/vfs_bio.c 28 Nov 2003 20:02:06 -0000
> @@ -140,6 +140,7 @@
> &bufreusecnt, 0, "");
>
> static int bufhashmask;
> +static int bufhashshift;
> static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
> struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };
> char *buf_wmesg = BUF_WMESG;
> @@ -160,7 +161,40 @@
> struct bufhashhdr *
> bufhash(struct vnode *vnp, daddr_t bn)
> {
> - return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);
> + u_int64_t hashkey64;
> + int hashkey;
> +
> + /*
> + * A variation on the Fibonacci hash that Knuth credits to
> + * R. W. Floyd, see Knuth's _Art of Computer Programming,
> + * Volume 3 / Sorting and Searching_
> + *
> + * We reduce the argument to 32 bits before doing the hash to
> + * avoid the need for a slow 64x64 multiply on 32 bit platforms.
> + *
> + * sizeof(struct vnode) is 168 on i386, so toss some of the lower
> + * bits of the vnode address to reduce the key range, which
> + * improves the distribution of keys across buckets.
> + *
> + * The file system cylinder group blocks are very heavily
> + * used. They are located at invervals of fbg, which is
> + * on the order of 89 to 94 * 2^10, depending on other
> + * filesystem parameters, for a 16k block size. Smaller block
> + * sizes will reduce fpg approximately proportionally. This
> + * will cause the cylinder group index to be hashed using the
> + * lower bits of the hash multiplier, which will not distribute
> + * the keys as uniformly in a classic Fibonacci hash where a
> + * relatively small number of the upper bits of the result
> + * are used. Using 2^16 as a close-enough approximation to
> + * fpg, split the hash multiplier in half, with the upper 16
> + * bits being the inverse of the golden ratio, and the lower
> + * 16 bits being a fraction between 1/3 and 3/7 (closer to
> + * 3/7 in this case), that gives good experimental results.
> + */
> + hashkey64 = ((u_int64_t)(uintptr_t)vnp >> 3) + (u_int64_t)bn;
> + hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 0x9E376DB1u) >>
> + bufhashshift) & bufhashmask;
> + return(&bufhashtbl[hashkey]);
> }
>
> /*
> @@ -319,8 +353,9 @@
> bufhashinit(caddr_t vaddr)
> {
> /* first, make a null hash table */
> + bufhashshift = 29;
> for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)
> - ;
> + bufhashshift--;
> bufhashtbl = (void *)vaddr;
> vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;
> --bufhashmask;
>
>
Hi, Sorry for the delay. Tested, then realized a stupid mistake,
started over, panic'd the box, and finally tested. Panic was
a bit unsettling. Happened when I tried to call my dump routine
from the debugger. Might have been a typo for all I know(!?) -
was doing quickly to keep benchmark script from bailing due to
"Profiling timer expired" killing my timer. (That's another kettle
of wax, I guess.)
Other than that, the newer hashkey stuff does seem to improve
the shallowness of the table. Below are some samples while doing
the 1.4gb untar commands on 98% and 99%+ full disk. Again,
running w/ 75% avgbfree, avgifree so as to hit the table harder.
Let me know if this helps or you need something further.
regards,
k.
--
Ken Marx, kmarx at vicor-nb.com
We need to optimize and focus hard to resolve the long pole in the tent.
- http://www.bigshed.com/cgi-bin/speak.cgi
------- snip --------
at 98% full:
==========
----------
Dec 2 16:03:44 oos0b /kernel: 0: avgdepth[0] cnt=602
Dec 2 16:03:44 oos0b /kernel: 1: avgdepth[1] cnt=61
Dec 2 16:03:44 oos0b /kernel: 2: avgdepth[2] cnt=187
----------
Dec 2 16:04:59 oos0b /kernel: 0: avgdepth[1] cnt=1008
Dec 2 16:04:59 oos0b /kernel: 1: avgdepth[2] cnt=15
Dec 2 16:04:59 oos0b /kernel: 2: avgdepth[3] cnt=1
----------
Dec 2 16:06:20 oos0b /kernel: 0: avgdepth[1] cnt=971
Dec 2 16:06:20 oos0b /kernel: 1: avgdepth[2] cnt=51
Dec 2 16:06:20 oos0b /kernel: 2: avgdepth[3] cnt=2
----------
Dec 2 16:06:41 oos0b /kernel: 0: avgdepth[2] cnt=132
Dec 2 16:06:41 oos0b /kernel: 1: avgdepth[1] cnt=885
Dec 2 16:06:41 oos0b /kernel: 2: avgdepth[3] cnt=7
(then panic'd on next call to dump routine from debugger)
-------------------------------------------------------------------------------
at 99% full:
============
----------
Dec 2 17:41:36 oos0b /kernel: 0: avgdepth[2] cnt=211
Dec 2 17:41:36 oos0b /kernel: 1: avgdepth[1] cnt=808
Dec 2 17:41:36 oos0b /kernel: 2: avgdepth[3] cnt=5
----------
Dec 2 17:42:12 oos0b /kernel: 0: avgdepth[2] cnt=68
Dec 2 17:42:13 oos0b /kernel: 1: avgdepth[1] cnt=954
Dec 2 17:42:13 oos0b /kernel: 2: avgdepth[3] cnt=2
----------
Dec 2 17:43:17 oos0b /kernel: 0: avgdepth[1] cnt=965
Dec 2 17:43:17 oos0b /kernel: 1: avgdepth[2] cnt=57
Dec 2 17:43:17 oos0b /kernel: 2: avgdepth[3] cnt=2
----------
Dec 2 17:43:28 oos0b /kernel: 0: avgdepth[2] cnt=124
Dec 2 17:43:28 oos0b /kernel: 1: avgdepth[1] cnt=897
Dec 2 17:43:28 oos0b /kernel: 2: avgdepth[3] cnt=3
----------
Dec 2 17:43:43 oos0b /kernel: 0: avgdepth[2] cnt=183
Dec 2 17:43:43 oos0b /kernel: 1: avgdepth[1] cnt=833
Dec 2 17:43:43 oos0b /kernel: 2: avgdepth[3] cnt=8
----------
Dec 2 17:44:04 oos0b /kernel: 0: avgdepth[2] cnt=201
Dec 2 17:44:04 oos0b /kernel: 1: avgdepth[1] cnt=818
Dec 2 17:44:04 oos0b /kernel: 2: avgdepth[3] cnt=5
----------
Dec 2 17:44:25 oos0b /kernel: 0: avgdepth[2] cnt=205
Dec 2 17:44:25 oos0b /kernel: 1: avgdepth[1] cnt=813
Dec 2 17:44:25 oos0b /kernel: 2: avgdepth[3] cnt=6
----------
Dec 2 17:45:39 oos0b /kernel: 0: avgdepth[2] cnt=216
Dec 2 17:45:39 oos0b /kernel: 1: avgdepth[1] cnt=802
Dec 2 17:45:39 oos0b /kernel: 2: avgdepth[3] cnt=6
----------
Dec 2 17:45:44 oos0b /kernel: 0: avgdepth[2] cnt=215
Dec 2 17:45:44 oos0b /kernel: 1: avgdepth[1] cnt=802
Dec 2 17:45:44 oos0b /kernel: 2: avgdepth[3] cnt=7
----------
Dec 2 17:46:15 oos0b /kernel: 0: avgdepth[2] cnt=212
Dec 2 17:46:15 oos0b /kernel: 1: avgdepth[1] cnt=804
Dec 2 17:46:15 oos0b /kernel: 2: avgdepth[3] cnt=8
----------
Dec 2 17:47:51 oos0b /kernel: 0: avgdepth[2] cnt=202
Dec 2 17:47:51 oos0b /kernel: 1: avgdepth[1] cnt=818
Dec 2 17:47:51 oos0b /kernel: 2: avgdepth[3] cnt=4
----------
Dec 2 17:48:05 oos0b /kernel: 0: avgdepth[2] cnt=210
Dec 2 17:48:05 oos0b /kernel: 1: avgdepth[1] cnt=809
Dec 2 17:48:05 oos0b /kernel: 2: avgdepth[3] cnt=5
----------
Dec 2 17:48:18 oos0b /kernel: 0: avgdepth[2] cnt=214
Dec 2 17:48:18 oos0b /kernel: 1: avgdepth[1] cnt=802
Dec 2 17:48:18 oos0b /kernel: 2: avgdepth[3] cnt=8
----------
Dec 2 17:48:33 oos0b /kernel: 0: avgdepth[2] cnt=215
Dec 2 17:48:33 oos0b /kernel: 1: avgdepth[1] cnt=801
Dec 2 17:48:33 oos0b /kernel: 2: avgdepth[3] cnt=8
----------
Dec 2 17:48:50 oos0b /kernel: 0: avgdepth[2] cnt=229
Dec 2 17:48:50 oos0b /kernel: 1: avgdepth[1] cnt=785
Dec 2 17:48:50 oos0b /kernel: 2: avgdepth[3] cnt=10
----------
Dec 2 17:48:57 oos0b /kernel: 0: avgdepth[2] cnt=231
Dec 2 17:48:57 oos0b /kernel: 1: avgdepth[1] cnt=781
Dec 2 17:48:57 oos0b /kernel: 2: avgdepth[3] cnt=12
----------
Dec 2 17:49:11 oos0b /kernel: 0: avgdepth[2] cnt=231
Dec 2 17:49:11 oos0b /kernel: 1: avgdepth[1] cnt=781
Dec 2 17:49:11 oos0b /kernel: 2: avgdepth[3] cnt=12
More information about the freebsd-fs
mailing list