[Bug 223532] GNU egrep -i is terrible slow if utf-8 locale is enabled

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 02 Jun 2021 20:19:55 +0000
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223532

--- Comment #8 from Stefan Eßer <se_at_FreeBSD.org> ---
(In reply to Helge Oldach from comment #5)

My comment #4 referred to comment #3, which used BSD fgrep (despite the
title of the PR referring to GNU egrep).

I first compared fgrep under the C and UTF-8 locales and found that they had
about the same performance.

Adding -i in the UTF-8 case increased the run time from 0.03 seconds to 4.47
seconds (or by a factor of more than 100). With LANG=C the run time is 3.36
seconds, BTW.

The patch that I have attached speeds this case up to 0.09 seconds by using an
internal function instead of the regex library.
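
As a rough illustration of the two approaches (this is not the attached patch),
the sketch below matches a fixed string case-insensitively once through the
POSIX regex API and once through strcasestr(). The helper names match_regex()
and match_literal() are made up for this example, and the regex variant assumes
the pattern contains no regex metacharacters.

```c
#define _GNU_SOURCE /* glibc may hide strcasestr() otherwise; harmless on FreeBSD */
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Regex route (hypothetical helper): the caller compiles the pattern once
 * with REG_ICASE, then every input line goes through regexec().
 */
static bool match_regex(const regex_t *re, const char *line)
{
    assert(re != NULL && line != NULL);
    return regexec(re, line, 0, NULL, 0) == 0;
}

/*
 * Direct route (hypothetical helper): a case-insensitive substring search
 * that never enters the regex engine.
 */
static bool match_literal(const char *pattern, const char *line)
{
    assert(pattern != NULL && line != NULL);
    return strcasestr(line, pattern) != NULL;
}
```

A caller would run regcomp(&re, pattern, REG_ICASE | REG_NOSUB) once before
using match_regex() and regfree(&re) afterwards; match_literal() needs no setup.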

By fgrep-FBSD I meant fgrep-ORIG (sorry for the confusion). This is the binary
as built in -CURRENT without the patch.

WITH_INTERNAL_NOSPEC is not documented, except by a comment in the sources
(in util.c), which explains that this option exists for systems that lack
REG_NOSPEC or REG_LITERAL and specifically mentions libgnuregex.

In fact, this function has a bit more overhead than necessary. An optimized
variant of the strcasestr_l() function could be inlined in util.c, but I did
not try to measure the performance difference. (The optimization would cache
the locale instead of calling __getlocale() and FIX_LOCALE for each invocation
of strcasestr().)
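
A minimal sketch of the caching idea, under stated assumptions: it is not the
libc implementation and not code from util.c. The names eq_nocase(),
find_nocase() and count_matches() are hypothetical, the comparison handles
single-byte characters only, and the POSIX calls duplocale() and tolower_l()
stand in for the libc-internal locale lookup. The point is only that the locale
handle is obtained once, outside the per-line loop.

```c
#define _POSIX_C_SOURCE 200809L /* for duplocale(), freelocale(), tolower_l() */
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: compare two bytes case-insensitively in a given locale. */
static int eq_nocase(unsigned char a, unsigned char b, locale_t loc)
{
    return tolower_l(a, loc) == tolower_l(b, loc);
}

/*
 * Hypothetical search routine: the locale is a parameter, so nothing is
 * looked up per call. Single-byte comparison only (an assumption of this
 * sketch; multibyte UTF-8 case folding is not handled).
 */
static const char *find_nocase(const char *hay, const char *needle, locale_t loc)
{
    assert(hay != NULL && needle != NULL);
    if (*needle == '\0')
        return hay;
    for (; *hay != '\0'; hay++) {
        size_t i = 0;
        while (needle[i] != '\0' && hay[i] != '\0' &&
               eq_nocase(hay[i], needle[i], loc))
            i++;
        if (needle[i] == '\0')
            return hay;
    }
    return NULL;
}

/* Count matching lines; the locale handle is fetched once, not once per line. */
static size_t count_matches(const char *const *lines, size_t n, const char *needle)
{
    size_t hits = 0;
    locale_t loc = duplocale(LC_GLOBAL_LOCALE); /* cached for the whole run */

    if (loc == (locale_t)0)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (find_nocase(lines[i], needle, loc) != NULL)
            hits++;
    freelocale(loc);
    return hits;
}
```

Whether this saves a measurable amount of time compared with calling
strcasestr() per line has not been measured here either.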

-- 
You are receiving this mail because:
You are the assignee for the bug.
Received on Wed Jun 02 2021 - 20:19:55 UTC