git: 4862e8ac0223 - main - vfs cache: describe various optimization ideas
Date: Tue, 03 Oct 2023 13:41:50 UTC
The branch main has been updated by mjg:

URL: https://cgit.FreeBSD.org/src/commit/?id=4862e8ac0223d7b19c8b3e070af1e2b38b18f333

commit 4862e8ac0223d7b19c8b3e070af1e2b38b18f333
Author:     Mateusz Guzik <mjg@FreeBSD.org>
AuthorDate: 2023-10-03 13:36:50 +0000
Commit:     Mateusz Guzik <mjg@FreeBSD.org>
CommitDate: 2023-10-03 13:36:50 +0000

    vfs cache: describe various optimization ideas

    While here report a sample result from running on Sapphire Rapids:

    An access(2) loop slapped into will-it-scale, like so:

            while (1) {
                    int error = access(tmpfile, R_OK);
                    assert(error == 0);

                    (*iterations)++;
            }

    .. operating on /usr/obj/usr/src/amd64.amd64/sys/GENERIC/vnode_if.c

    In operations per second:
    lockless: 3462164
    locked:   1362376

    While over 3.4 million may seem like a big number, a critical look
    shows it should be significantly higher.

    A poor man's profiler, counting how many times a given routine was
    sampled:

    dtrace -w -n 'profile:::profile-4999 /execname == "a.out"/ {
        @[sym(arg0)] = count(); } tick-5s { system("clear"); trunc(@, 40);
        printa("%40a %@16d\n", @); clear(@); }'

    [snip]
                    kernel`kern_accessat               231
           kernel`cpu_fetch_syscall_args               324
       kernel`cache_fplookup_cross_mount               340
                           kernel`namei               346
                   kernel`amd64_syscall               352
            kernel`tmpfs_fplookup_vexec               388
                            kernel`vput               467
                     kernel`vget_finish               499
                  kernel`lockmgr_unlock               529
                   kernel`lockmgr_slock               558
                   kernel`vget_prep_smr               571
                      kernel`vput_final               578
                          kernel`vdropl              1070
                          kernel`memcmp              1174
                    kernel`0xffffffff80              2080
                                    0x0              2231
                  kernel`copyinstr_smap              2492
                   kernel`cache_fplookup              9246
---
 sys/kern/vfs_cache.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 77 insertions(+), 2 deletions(-)

diff --git a/sys/kern/vfs_cache.c b/sys/kern/vfs_cache.c
index 7e059d374c31..6ae4239cc11d 100644
--- a/sys/kern/vfs_cache.c
+++ b/sys/kern/vfs_cache.c
@@ -197,10 +197,85 @@
  * - vnodes are subject to being recycled even if target inode is left in memory,
  *   which loses the name cache entries when it perhaps should not. in case of tmpfs
  *   names get duplicated -- kept by filesystem itself and namecache separately
- * - struct namecache has a fixed size and comes in 2 variants, often wasting space.
- *   now hard to replace with malloc due to dependence on SMR.
+ * - struct namecache has a fixed size and comes in 2 variants, often wasting
+ *   space. now hard to replace with malloc due to dependence on SMR, which
+ *   requires UMA zones to opt in
  * - lack of better integration with the kernel also turns nullfs into a layered
  *   filesystem instead of something which can take advantage of caching
+ *
+ * Appendix A: where is the time lost, expanding on paragraph III
+ *
+ * While some care went into optimizing lookups, there is still plenty of
+ * performance left on the table, most notably from a single-threaded
+ * standpoint. Below is a woefully incomplete list of changes which can help.
+ * Ideas are mostly sketched out, no claim is made that all kinks or
+ * prerequisites are laid out.
+ *
+ * Note there is performance lost all over VFS.
+ *
+ * === SMR-only lookup
+ *
+ * For commonly used ops like stat(2), when the terminal vnode *is* cached,
+ * lockless lookup could refrain from refing/locking the found vnode and
+ * instead return while within the SMR section. Then a call to, say,
+ * vop_stat_smr could do the work (or fail with EAGAIN), and finally the
+ * result would be validated with seqc not changing. This would be faster
+ * single-threaded as it dodges atomics and would provide full scalability
+ * for multicore uses.
+ * This would *not* work for open(2) or other calls which need the vnode to
+ * hang around for the long haul, but it would work for the aforementioned
+ * stat(2) as well as access(2), readlink(2), realpathat(2) and probably more.
+ *
+ * === hotpatching for sdt probes
+ *
+ * They result in *tons* of branches all over with rather regrettable codegen
+ * at times. Removing sdt probes altogether gives an over 2% boost in lookup
+ * rate. Reworking the code to patch itself at runtime with asm goto would
+ * solve it. asm goto is fully supported by gcc and clang.
+ *
+ * === copyinstr
+ *
+ * On all architectures it operates one byte at a time, while it could be
+ * word-sized instead thanks to the Mycroft trick.
+ *
+ * The API itself is rather pessimal for path lookup, accepting arbitrary
+ * sizes and only *optionally* filling in the length parameter.
+ *
+ * Instead a new routine (copyinpath?) could be introduced, demanding a buffer
+ * size which is a multiple of the word size (and never zero), with the length
+ * always returned. On top of that the routine could be allowed to transform
+ * the buffer in arbitrary ways, most notably writing past the found length
+ * (not to be confused with writing past the buffer size) -- this would allow
+ * word-sized movs while checking for '\0' later.
+ *
+ * === detour through namei
+ *
+ * Currently the lockless lookup suffers from being called from namei, which
+ * then has to check whether things worked out locklessly. Instead the
+ * lockless lookup could be the actual entry point which calls what is
+ * currently namei as a fallback.
+ *
+ * === avoidable branches in cache_can_fplookup
+ *
+ * The cache_fast_lookup_enabled flag check could be hotpatchable (in fact, if
+ * this is off, none of the fplookup code should execute).
+ *
+ * Both the audit and capsicum branches can be combined into one, but that
+ * requires paying off a lot of tech debt first.
+ *
+ * ni_startdir could be indicated with a flag in cn_flags, eliminating the
+ * branch.
+ *
+ * === mount stacks
+ *
+ * Crossing a mount requires checking if perhaps something is mounted on top.
+ * Instead, an additional entry could be added to struct mount with a pointer
+ * to the final mount on the stack. This would be recalculated on each
+ * mount/unmount.
+ *
+ * === root vnodes
+ *
+ * It could become part of the API contract to *always* have a root vnode set
+ * in mnt_rootvnode. Such vnodes are annotated with VV_ROOT and vnlru would
+ * have to be modified to always skip them.
  */
 static SYSCTL_NODE(_vfs, OID_AUTO, cache, CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
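
For the "SMR-only lookup" idea in the comment above, the non-obvious part is
the final validation step: the attribute copy is only trusted if the sequence
counter did not move while it was taken. Below is a minimal userspace model of
that scheme using C11 atomics. struct vattr_model, struct obj and
stat_lockless are made up for illustration; the real vnode seqc and SMR
machinery in the tree has different names and semantics.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Stand-in for the attributes a vop_stat_smr-style call would copy out. */
    struct vattr_model {
            long size;
            long mtime;
    };

    struct obj {
            _Atomic unsigned seqc;  /* odd while a writer is modifying attr */
            struct vattr_model attr;
    };

    /*
     * Copy the attributes without taking any lock or reference.  The copy is
     * trusted only if the counter was even before the copy and unchanged
     * afterwards; otherwise the caller falls back to the locked path (the
     * EAGAIN case described in the comment).
     */
    static bool
    stat_lockless(struct obj *o, struct vattr_model *out)
    {
            unsigned seq;

            seq = atomic_load_explicit(&o->seqc, memory_order_acquire);
            if (seq & 1)
                    return (false);
            *out = o->attr;         /* speculative; may observe a torn copy */
            atomic_thread_fence(memory_order_acquire);
            return (atomic_load_explicit(&o->seqc,
                memory_order_relaxed) == seq);
    }

A real implementation also needs the writer side to bump the counter around
every modification, and SMR to keep the object from being freed underneath
the reader; the sketch only shows why the reader can skip refcounting and
locking entirely.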
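The "hotpatching for sdt probes" entry refers to the asm goto jump-label
pattern. A rough amd64-only illustration of its shape follows; the
.probe_sites section name, the recorded address pair and the runtime code
which would rewrite the nop into a jmp are all hypothetical and omitted, so
this is a sketch of the mechanism, not FreeBSD code.

    /*
     * Disabled probe: the only cost at the call site is a 5-byte nop.
     * Enabling the probe means patching that nop into a jump to "hit",
     * using the site/target pair recorded in the (hypothetical)
     * .probe_sites section -- no flag load and no conditional branch on
     * the fast path.
     */
    static inline int
    probe_site_hit(void)
    {
            asm goto (
                "1: .byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n\t" /* 5-byte nop */
                ".pushsection .probe_sites, \"aw\"\n\t"
                ".quad 1b, %l[hit]\n\t"     /* site address, jump target */
                ".popsection"
                : : : : hit);
            return (0);
    hit:
            return (1);
    }

A probe could then read something like
if (probe_site_hit()) { /* fire the probe */ }, with the disabled case
reduced to straight-line code -- which is what removes the branches the
comment complains about.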
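The "Mycroft trick" mentioned under copyinstr is the classic word-at-a-time
zero-byte test. A self-contained 64-bit version is below (the helper name is
made up); this is the check that would let a copyinpath-style routine move
whole words and only afterwards locate the terminating '\0'.

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * True if any byte of the 64-bit word is 0x00.  The subtraction puts a
     * high bit into any byte that was zero (or disturbed by a borrow or a
     * value above 0x80), ~w keeps only bytes whose high bit was clear to
     * begin with, and the final mask leaves a non-zero result exactly when
     * some byte of w was zero.
     */
    static inline bool
    word_has_zero_byte(uint64_t w)
    {
            return (((w - 0x0101010101010101ULL) & ~w &
                0x8080808080808080ULL) != 0);
    }

With a test like this, a copy loop can move 8 bytes per iteration instead of
the current byte-at-a-time behaviour noted in the comment, deferring the
exact length calculation until a word containing '\0' shows up.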