libregex library

Tim Robbins tjr at freebsd.org
Mon Nov 22 02:48:43 PST 2004


On Sun, 2004-11-21 at 10:06 -0800, Sean Chittenden wrote:
> >>   Has there been any thought given to moving to the modified Henry
> >> Spencer regex library used in NetBSD & OpenBSD's libc?
> >
> > des at dwp ~% head -3 /usr/src/lib/libc/regex/COPYRIGHT
> > Copyright 1992, 1993, 1994 Henry Spencer.  All rights reserved.
> > This software is not subject to any license of the American Telephone
> > and Telegraph Company or of the Regents of the University of 
> > California.
> 
> I think maybe what Ben was referring to was that Spencer has released 
> an updated version of his regexp library that doesn't penalize wide 
> character locales.  I believe our current one performs terribly on 
> everything but one byte character sets, whereas the newer Spencer 
> library performs as well as one could hope with wide characters.  The 
> PostgreSQL group did some testing and found Spencer's library to be the
> fastest wide character regexp engine while still maintaining very good 
> levels of performance for single byte character sets.  You'll have to 
> check the PostgreSQL archives for details: it's been two years since 
> that change was committed to their tree.  -sc

I think you'd be surprised at how poorly Henry Spencer's new code
performs in all but the most contrived test cases, regardless of locale.

You'll find that it performs especially poorly in multibyte locales
because the matcher itself does not work directly with multibyte
characters. Instead, the strings must first be entirely converted to
wide characters, which means reading every single input byte, calling
mbrtowc() on it, then storing the results in temporary scratch space,
even if the characters don't participate in the match at all (e.g. all
characters but the first when matching against patterns like "^x"). The
FreeBSD 5 regex code only performs the conversion when necessary, and
can often reject impossible matches without performing a single
conversion in single-byte and UTF-8 locales.

(This assumes your input strings are given as multibyte character
strings, as is common in UNIX, not wide character strings, as may be
common in PostgreSQL.)
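
As a rough illustration of the difference for a pattern like "^x", here
is a minimal C sketch. The names match_caret_eager(), match_caret_lazy()
and the nconv counter are made up for this example; neither function is
taken from Henry Spencer's code or from FreeBSD's libc. The first
converts the whole string with mbrtowc() before looking at anything, the
second converts only the first character.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical eager matcher for "^<want>": convert the entire multibyte
 * string to wide characters first, then inspect the first one.
 * Returns 1 on match, 0 on no match, -1 on error.
 * *nconv is set to the number of mbrtowc() calls made.
 */
static int
match_caret_eager(const char *s, wchar_t want, size_t *nconv)
{
    size_t n = 0, r;
    wchar_t *buf;
    mbstate_t mbs;
    int matched;

    assert(s != NULL && nconv != NULL);
    *nconv = 0;
    /* One wide character per input byte is always enough scratch space. */
    buf = malloc((strlen(s) + 1) * sizeof(*buf));
    if (buf == NULL)
        return (-1);
    memset(&mbs, 0, sizeof(mbs));
    while (*s != '\0') {
        r = mbrtowc(&buf[n], s, MB_CUR_MAX, &mbs);
        (*nconv)++;
        if (r == (size_t)-1 || r == (size_t)-2) {
            free(buf);
            return (-1);
        }
        s += r;
        n++;
    }
    matched = (n > 0 && buf[0] == want);
    free(buf);
    return (matched);
}

/*
 * Hypothetical lazy matcher for "^<want>": convert only the character
 * that can take part in the match and ignore the rest of the string.
 */
static int
match_caret_lazy(const char *s, wchar_t want, size_t *nconv)
{
    mbstate_t mbs;
    wchar_t wc;
    size_t r;

    assert(s != NULL && nconv != NULL);
    *nconv = 0;
    if (*s == '\0')
        return (0);
    memset(&mbs, 0, sizeof(mbs));
    r = mbrtowc(&wc, s, MB_CUR_MAX, &mbs);
    (*nconv)++;
    if (r == (size_t)-1 || r == (size_t)-2)
        return (-1);
    return (wc == want);
}
```

With an input of N characters, the eager version makes N mbrtowc()
calls and allocates scratch space; the lazy one makes at most one call
and allocates nothing. Call setlocale(LC_CTYPE, "") first if you want to
try it in a multibyte locale.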


Tim



More information about the freebsd-arch mailing list