4.8 ffs_dirpref problem
Don Lewis
truckman at FreeBSD.org
Fri Nov 28 13:35:15 PST 2003
On 28 Nov, To: kmarx at vicor.com wrote:
> On 24 Nov, Ken Marx wrote:
>>
>>
>> Don Lewis wrote:
>
>>> Index: sys/kern/vfs_bio.c
>>> ===================================================================
>>> RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
>>> retrieving revision 1.242.2.21
>>> diff -u -r1.242.2.21 vfs_bio.c
>>> --- sys/kern/vfs_bio.c 9 Aug 2003 16:21:19 -0000 1.242.2.21
>>> +++ sys/kern/vfs_bio.c 18 Nov 2003 02:10:55 -0000
>>> @@ -140,6 +140,7 @@
>>> &bufreusecnt, 0, "");
>>>
>>> static int bufhashmask;
>>> +static int bufhashshift;
>>> static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
>>> struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };
>>> char *buf_wmesg = BUF_WMESG;
>>> @@ -160,7 +161,20 @@
>>> struct bufhashhdr *
>>> bufhash(struct vnode *vnp, daddr_t bn)
>>> {
>>> - return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);
>>> + u_int64_t hashkey64;
>>> + int hashkey;
>>> +
>>> + /*
>>> + * Fibonacci hash, see Knuth's
>>> + * _Art of Computer Programming, Volume 3 / Sorting and Searching_
>>> + *
>>> + * We reduce the argument to 32 bits before doing the hash to
>>> + * avoid the need for a slow 64x64 multiply on 32 bit platforms.
>>> + */
>>> + hashkey64 = (u_int64_t)(uintptr_t)vnp + (u_int64_t)bn;
>>> + hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 2654435769u) >>
>>> + bufhashshift) & bufhashmask;
>>> + return(&bufhashtbl[hashkey]);
>>> }
>>>
>>> /*
>>> @@ -319,8 +333,9 @@
>>> bufhashinit(caddr_t vaddr)
>>> {
>>> /* first, make a null hash table */
>>> + bufhashshift = 29;
>>> for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)
>>> - ;
>>> + bufhashshift--;
>>> bufhashtbl = (void *)vaddr;
>>> vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;
>>> --bufhashmask;
>>>
>>>
>>
>> Well, I'm mildly flummoxed - I tried to compare hashtable performance
>> between all three known versions of the hashing - legacy power of 2,
>> the Vicor ^= hash, and Don's Fibonacci hash.
>>
>> Running with
>>
>> minifree = max( 1, avgifree / 4 );
>> minbfree = max( 1, avgbfree );
>>
>> all perform about the same, with no performance problems all
>> the way up to 100% disk capacity (didn't test into reserved space).
>>
>> Looking at instrumentation to show freq and avg depth of the
>> hash buckets, everything seems very calm (mainly because
>> we're not hitting the linear searching very often, I'd presume).
>>
>> I can't explain why I seemingly got performance problems
>> with similar (identical) minbfree code previously.
>>
>> So, out of spite, I went back to
>>
>> minbfree = max( 1, avgbfree/4 );
>>
>> This does hit the hashtable harder for the legacy version
>> and not so much for either new flavor. Here are a few
>> samplings of calling my dump routine from the debugger.
>> "avgdepth" really means 'search depth' since we use
>> the depth reached after finding a bp in gbincore.
>>
>> The line below such as,
>>
>> 0: avgdepth[1] cnt=801
>>
>> means that 801 of the hashtable buckets had an avg search
>> depth of 1 at the time the debug routine was called.
>> The 'N:' prefix means the N-th unique non-zero such value.
>> So large cnt's for small []'d depth values means an efficient hash.
>>
>> I've edited out the details as much as possible.
>>
>> LEGACY:
>> --------
>> Nov 24 13:34:54 oos0b /kernel: bh[442/0x1ba]: freq=2706110, avgdepth = 154
>> ...
>> Nov 24 13:34:54 oos0b /kernel: 0: avgdepth[1] cnt=1015
>> Nov 24 13:34:54 oos0b /kernel: 1: avgdepth[2] cnt=7
>> Nov 24 13:34:54 oos0b /kernel: 2: avgdepth[154] cnt=1 <- !!
>> Nov 24 13:34:54 oos0b /kernel: 3: avgdepth[3] cnt=1
>> -----------
>>
>> Nov 24 13:36:49 oos0b /kernel: bh[442/0x1ba]: freq=3416953, avgdepth = 141
>> ...
>> Nov 24 13:36:49 oos0b /kernel: 0: avgdepth[1] cnt=1017
>> Nov 24 13:36:49 oos0b /kernel: 1: avgdepth[141] cnt=1
>> Nov 24 13:36:49 oos0b /kernel: 2: avgdepth[2] cnt=6
>>
>> VICOR x-or hashtable:
>> ---------------------
>> Nov 24 13:07:24 oos0b /kernel: 0: avgdepth[1] cnt=762
>> Nov 24 13:07:24 oos0b /kernel: 1: avgdepth[2] cnt=259
>> Nov 24 13:07:24 oos0b /kernel: 2: avgdepth[3] cnt=3
>> -----------
>>
>> Nov 24 13:08:07 oos0b /kernel: 0: avgdepth[1] cnt=744
>> Nov 24 13:08:07 oos0b /kernel: 1: avgdepth[2] cnt=275
>> Nov 24 13:08:07 oos0b /kernel: 2: avgdepth[3] cnt=5
>>
>> FIBONACCI:
>> ----------
>> Nov 24 11:56:50 oos0b /kernel: 0: avgdepth[1] cnt=811
>> Nov 24 11:56:50 oos0b /kernel: 1: avgdepth[3] cnt=88
>> Nov 24 11:56:50 oos0b /kernel: 2: avgdepth[2] cnt=124
>> Nov 24 11:56:50 oos0b /kernel: 3: avgdepth[0] cnt=1
>> -----------
>>
>> Nov 24 11:57:48 oos0b /kernel: 0: avgdepth[1] cnt=801
>> Nov 24 11:57:48 oos0b /kernel: 1: avgdepth[3] cnt=93
>> Nov 24 11:57:48 oos0b /kernel: 2: avgdepth[2] cnt=130
>>
>> So, while this is far from analytically exhaustive,
>> it almost appears the Fibonacci hash has more entries
>> at depth 3, while the Vicor one has more at depth 2.
>>
>> I'm happy to run more tests if you have ideas. I'm also fine
>> to cut bait and go with whatever you decide. It *seems* like
>> putting in the Fibonacci hash is prudent, since the current hash
>> has been observed to be expensive. I had trouble proving this
>> unequivocally, though. So, perhaps Don's minbfree fix is sufficient
>> after all. I'm tempted at this point to go with the 100% flavor.
>
> I think we're running into one of the weaknesses in the Fibonacci hash.
> There are a large number of hash entries for the cylinder group blocks,
> which are located at offsets which are multiples of 89 * 2^10 in your
> example, or something on the order of 2^16. The effect of this is for
> the cylinder group number to be hashed using the least significant bits
> of the hash multiplier, which don't work as well for distributing the
> hash values. I tried some of Knuth's suggestions, and got better
> results with the hash multiplier 0x9E376DB1u. The most significant 16
> bits of the multiplier are the same as the original constant, and the
> least significant bits act as a fraction in the desirable range of 1/3
> to 3/7. Please give this new hash multiplier a try.
I went ahead and spun a new version of my patch with the new multiplier,
one other tweak to the formula, and updated comments.
Index: sys/kern/vfs_bio.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
retrieving revision 1.242.2.21
diff -u -r1.242.2.21 vfs_bio.c
--- sys/kern/vfs_bio.c 9 Aug 2003 16:21:19 -0000 1.242.2.21
+++ sys/kern/vfs_bio.c 28 Nov 2003 20:02:06 -0000
@@ -140,6 +140,7 @@
&bufreusecnt, 0, "");
static int bufhashmask;
+static int bufhashshift;
static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } };
char *buf_wmesg = BUF_WMESG;
@@ -160,7 +161,40 @@
struct bufhashhdr *
bufhash(struct vnode *vnp, daddr_t bn)
{
- return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]);
+ u_int64_t hashkey64;
+ int hashkey;
+
+ /*
+ * A variation on the Fibonacci hash that Knuth credits to
+ * R. W. Floyd, see Knuth's _Art of Computer Programming,
+ * Volume 3 / Sorting and Searching_
+ *
+ * We reduce the argument to 32 bits before doing the hash to
+ * avoid the need for a slow 64x64 multiply on 32 bit platforms.
+ *
+ * sizeof(struct vnode) is 168 on i386, so toss some of the lower
+ * bits of the vnode address to reduce the key range, which
+ * improves the distribution of keys across buckets.
+ *
+ * The file system cylinder group blocks are very heavily
+ * used. They are located at intervals of fpg, which is
+ * on the order of 89 to 94 * 2^10, depending on other
+ * filesystem parameters, for a 16k block size. Smaller block
+ * sizes will reduce fpg approximately proportionally. This
+ * will cause the cylinder group index to be hashed using the
+ * lower bits of the hash multiplier, which will not distribute
+ * the keys as uniformly as in a classic Fibonacci hash, where a
+ * relatively small number of the upper bits of the result
+ * are used. Using 2^16 as a close-enough approximation to
+ * fpg, split the hash multiplier in half, with the upper 16
+ * bits being the inverse of the golden ratio, and the lower
+ * 16 bits being a fraction between 1/3 and 3/7 (closer to
+ * 3/7 in this case), that gives good experimental results.
+ */
+ hashkey64 = ((u_int64_t)(uintptr_t)vnp >> 3) + (u_int64_t)bn;
+ hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 0x9E376DB1u) >>
+ bufhashshift) & bufhashmask;
+ return(&bufhashtbl[hashkey]);
}
/*
@@ -319,8 +353,9 @@
bufhashinit(caddr_t vaddr)
{
/* first, make a null hash table */
+ bufhashshift = 29;
for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1)
- ;
+ bufhashshift--;
bufhashtbl = (void *)vaddr;
vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask;
--bufhashmask;
More information about the freebsd-fs mailing list