svn commit: r211463 - head/usr.bin/grep

Mon Nov 29 20:16:08 UTC 2010

On Wed, Aug 18, 2010, Dimitry Andric wrote:
> On 2010-08-18 22:48, mdf at FreeBSD.org wrote:
> >>  - Refactor file reading code to use pure syscalls and an internal buffer
> >>    instead of stdio.  This gives BSD grep a very big performance boost,
> >>    its speed is now almost comparable to GNU grep.
> > 
> > I didn't read all of the details in the profiling mails in the thread,
> > but does this mean that work on stdio would give a performance boost
> > to many apps?  Or is there something specific about how grep(1) is
> > using its input that makes it a horse of a different color?
> 
> Originally, it was reading files 1 character at a time, using fgetc(3),
> the locking version even.  This is usually not the fastest way to read
> a large file with stdio. :)
> 
> If grep did not have to support .gz or .bz2 files, we could just have
> plugged in stdio's fgetln(3).  I tried this approach first on some
> non-compressed files, and it performed much better than fgetc'ing.
> 
> The reading code that has now been committed is basically the same
> algorithm that fgetln() uses internally, but it can handle gzip and
> bzip2 input too.

The gzip and bzip2 limitation you refer to could perhaps be worked
around with funopen(3), by wrapping the decompressor's read routine
in a stdio stream.  IIRC, the overhead inherent in using fgetln(3)
or getline(3) on reasonably long lines is very small; if it's not,
we should look at ways to improve stdio.

There's still a locking operation and a memcpy() that can't really
be avoided with stdio, though.  With getline(), you'd be able to
delete most of file.c, but it would never be quite as fast as the
hand-rolled buffer.
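
For illustration, a getline()-based reader might look roughly like
the sketch below.  This is not the code in file.c:
count_matching_lines() is a hypothetical helper, and the strstr()
fixed-string match is only a stand-in for grep's real matching.  It
assumes uncompressed input that is already open as a FILE stream.

```c
#define _POSIX_C_SOURCE 200809L	/* for getline(3) */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not taken from grep's file.c: count the lines
 * of an already-open stream that contain the fixed string `needle`.
 */
long
count_matching_lines(FILE *fp, const char *needle)
{
	char *line = NULL;	/* getline() allocates and grows this */
	size_t cap = 0;		/* current capacity of `line` */
	ssize_t len;
	long matches = 0;

	assert(fp != NULL && needle != NULL);

	/* getline() locks the stream and copies each line into `line`. */
	while ((len = getline(&line, &cap, fp)) != -1) {
		/* Drop the trailing newline, if any, before matching. */
		if (len > 0 && line[len - 1] == '\n')
			line[len - 1] = '\0';
		if (strstr(line, needle) != NULL)
			matches++;
	}
	free(line);
	return (matches);
}
```

The whole read loop fits in a few lines because stdio owns the
buffering; the cost is the per-call lock and copy mentioned above.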