ignore duplicates (Was: request for review of exports.5 update)

Wed Jul 13 08:37:09 UTC 2011

On Tue, 12 Jul 2011, John Baldwin wrote:

> On Tuesday, July 12, 2011 10:52:28 am Pan Tsu wrote:
>> As for whether it matters to descend here is an example
>>
>>   # disable caching metadata/data before test
>>   $ zfs set primarycache=none foo/usr/src
>>   $ zfs set secondarycache=none foo/usr/src
>>
>>   $ time find /usr/src/sys ! -path '*.svn*' >/dev/null
>>   $ time find /usr/src/sys ! -path '*.svn*' -or -prune >/dev/null

Not exactly what I'm looking for, but it seems that a script that adds
exotic args to some utility is needed.  (I have only 1 nontrivial vcs
script, for un-applying and then re-applying local patches to
cvs checkouts).

>> On my 3yo box I don't even need ministat(1) to decide
>>
>>   26.78sr 0.21su 1.09ss 4% 1420k 45s+2194u 217pr+0pf+0w 28377+0io 28394+8935cs
>>   3.68sr 0.07su 0.13ss 5% 1420k 46s+2260u 217pr+0pf+0w 3156+0io 3158+876cs

This still has some problems:
- still extremely slow.  On my 6yo system running ~5.2-CURRENT, there are
   only 10760 files in /usr/src/sys on ffs (including CVS files and about
   1000 local files, but no object files).  These take 0.04 seconds to
   find, once cached.  Breaking the cache to test the uncached case is
   too hard with ffs.  On a FreeBSD cluster machine running ~9.0-
   CURRENT, exponential bloat (mainly almost quadrupling for .svn files)
   results in 48556 files in /usr/src/sys on ffs, but only 12138 files
   after removing all .svn files and 13698 files after removing all
   .svn files except the .svn directories.  These take 0.90 seconds to
   find with a plain find(1); 1.59 seconds with the first of the above,
   and 0.34 seconds with the second of the above.  Breaking the cache to
   test the uncached case is too hard with ffs.
- the first version works, but the one with -prune finds .svn directories
   (1560 of them in FreeBSD-9-not-quite-current).
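
The second point can be reproduced on a tiny tree.  This is only a sketch:
the directory layout under mktemp -d and the variable names are made up for
illustration, and -o is used as the portable spelling of -or.

```shell
#!/bin/sh
# Throwaway tree with one .svn directory (hypothetical layout).
tmp=$(mktemp -d)
mkdir -p "$tmp/sys/kern/.svn"
touch "$tmp/sys/kern/a.c" "$tmp/sys/kern/.svn/entries"

# Filter only: everything is visited, .svn paths are just not printed.
filtered=$(cd "$tmp" && find sys ! -path '*.svn*' | sort)

# Filter or prune: .svn is not descended into, but -prune evaluates to
# true, so the .svn directory itself is still printed.
pruned=$(cd "$tmp" && find sys ! -path '*.svn*' -o -prune | sort)

echo "filter only:";  printf '  %s\n' $filtered
echo "with -prune:";  printf '  %s\n' $pruned

rm -rf "$tmp"
```

The pruned listing should contain sys/kern/.svn but nothing below it, while
the filtered listing contains no .svn entries at all.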

> Ah, nice.  This is a definite improvement.  I've modified my script as such:

Pruning apparently reduces the number of files stat'ed by a factor of
almost 4, since svn almost quadruples the number of files.  But why
is "find -path" so much slower than plain find?
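
As a rough check of the "almost 4", the ratios can be computed from the file
counts quoted above for the cluster machine.  The variable names are only
illustrative; the numbers are the ones from this message.

```shell
#!/bin/sh
total=48556          # files in /usr/src/sys including everything under .svn
no_svn=12138         # after removing all .svn files
svn_dirs_kept=13698  # .svn files removed, .svn directories kept

# Bloat caused by .svn, and the reduction in visited files when pruning
# (pruning still visits the .svn directories themselves).
ratio_all=$(LC_ALL=C awk -v a="$total" -v b="$no_svn" \
    'BEGIN { printf "%.2f", a / b }')
ratio_pruned=$(LC_ALL=C awk -v a="$total" -v b="$svn_dirs_kept" \
    'BEGIN { printf "%.2f", a / b }')

echo "bloat from .svn: ${ratio_all}x; saved by pruning: ${ratio_pruned}x"
```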

> #!/bin/sh
> #
> # Grep inside a kernel directory skipping compile directories and revision
> # control directories
>
> find `ls` '(' ! '(' -name compile -o -name .svn -o -name CVS ')' -o -prune ')' \
>    ! -name '*cscope*' ! -type d -print0 | xargs -0 grep -H "$@"

"find -name" is much faster than "find -path" on the FreeBSD cluster machine.
It takes only 0.08 seconds, which is acceptably slower than the 0.04 seconds
on my old machine (due to the nfs overhead and 30% more files).  It has the
same problem as "find -path" when pruning -- it doesn't remove the .svn
directories.  These can be removed with another "! -name" of course.
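
A minimal sketch of that extra "! -name", again on a made-up tree under
mktemp -d (the layout and variable names are assumptions, not part of the
script quoted above):

```shell
#!/bin/sh
# Throwaway tree with one .svn directory (hypothetical layout).
tmp=$(mktemp -d)
mkdir -p "$tmp/sys/kern/.svn"
touch "$tmp/sys/kern/a.c" "$tmp/sys/kern/.svn/entries"

# Prune by name: .svn is not descended into, but is itself printed.
with_dirs=$(cd "$tmp" && find sys '(' ! -name .svn -o -prune ')' | sort)

# Same expression followed by another "! -name" to drop the pruned
# directory from the output as well.
without_dirs=$(cd "$tmp" && \
    find sys '(' ! -name .svn -o -prune ')' ! -name .svn | sort)

echo "prune only:";       printf '  %s\n' $with_dirs
echo "prune + ! -name:";  printf '  %s\n' $without_dirs

rm -rf "$tmp"
```

A script that already ends in "! -type d" gets the same effect for free,
since the pruned entries are directories.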

The first version with -path should be the best one.  find(1) should be
smarter and not descend into directories that already match "! -path".

On my old machine, "find ... ! -name CVS -o -prune" (to prune a couple
of thousand CVS files but not the 910 CVS directories) takes only 0.02
seconds.  "find ... ! -path '*CVS*' -prune" also takes 0.02 seconds;
"find ... ! -path '*CVS*'" is what takes 0.04 seconds, and a plain find
takes 0.03 seconds.  In other words, -name is imperceptibly faster than
-prune.  So there seems to be another problem with -path in -current --
it is 0.35/0.08 times slower than -name.  On second thoughts, this is
probably just the nfs close-to-open-consistency pessimization, perhaps
combined with nfs opening directories more than necessary.  [l]stat(2)'s
should be cached even in nfs, but every directory open requires RPCs.

Bruce