Grepping a list of words

Jack L. Stone jacks at sage-american.com
Fri Aug 13 13:47:45 UTC 2010


At 10:56 AM 8.12.2010 -0700, Chip Camden wrote:
>Quoth Anonymous on Thursday, 12 August 2010:
>> Oliver Fromme <olli at lurza.secnetix.de> writes:
>> 
>> > John Levine <johnl at iecc.com> wrote:
>> >  > > > % egrep 'word1|word2|word3|...|wordn' filename.txt
>> >  > 
>> >  > > Thanks for the replies. This suggestion won't do the job as the
>> >  > > list of words is very long, maybe 50-60. This is why I asked how
>> >  > > to place them all in a file. One reply dealt with using a file
>> >  > > with egrep. I'll try that.
>> >  > 
>> >  > Gee, 50 words, that's about a 300 character pattern, that's not a
>> >  > problem for any shell or version of grep I know.
>> >  > 
>> >  > But reading the words from a file is equivalent and as you note most
>> >  > likely easier to do.
>> >
>> > The question is what is more efficient.  This might be
>> > important if that kind of grep command is run very often
>> > by a script, or if it's run on very large files.
>> >
>> > My guess is that one large regular expression is more
>> > efficient than many small ones.  But I haven't done real
>> > benchmarks to prove this.
>> 
>> BTW, not using regular expressions is even more efficient, e.g.
>> 
>>   $ fgrep -f /usr/share/dict/words /etc/group
>> 
>> When using egrep(1) it takes considerably more time and memory.
>
>Having written a regex engine myself, I can see why.  Though I'm sure
>egrep is highly optimized, even the most optimized DFA table is going to
>take more cycles to navigate than a simple string comparison.  Not to
>mention the initial overhead of parsing the regex and building that table.
>
>-- 
>Sterling (Chip) Camden    | sterling at camdensoftware.com | 2048D/3A978E4F

Many thanks for all of the suggestions. I found this worked very well,
ignoring concerns about use of resources:

egrep -i -o -w -f word.file main.file

The only thing it didn't do for me was the next step. My final objective
was really to determine which words in the "word.file" were not in the
"main.file." I figured finding matches would be easy, and that I could
then run a sort|uniq comparison to determine the "new words" not yet in
the main.file.
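One way to get that list of unmatched words directly, sketched under the
assumption that word.file holds one word per line (file names as in the
thread; untested against your actual data):

```shell
#!/bin/sh
# Normalize the word list: lower case, sorted, deduplicated.
tr 'A-Z' 'a-z' < word.file | sort -u > all.words

# Extract the words that DO appear in main.file, normalized the same way.
# -F: fixed strings (no regex), -w: whole words, -o: print only the match.
grep -oiwFf word.file main.file | tr 'A-Z' 'a-z' | sort -u > found.words

# Lines present in all.words but absent from found.words are the words
# from word.file that never matched anything in main.file.
comm -23 all.words found.words
```

comm(1) requires both inputs to be sorted, which the sort -u steps
guarantee here.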

Since I will have a need to run this check frequently, any suggestions for
a better approach are welcome.
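For frequent runs, a single pass with awk(1) avoids scanning main.file
more than once. A sketch, assuming words in main.file are separated by
whitespace (punctuation glued to a word will defeat the match):

```shell
#!/bin/sh
# First pass (NR == FNR, i.e. still reading word.file): record each
# wanted word in the array "want".
# Second pass (main.file): delete every word actually seen.
# END: whatever remains in "want" never appeared in main.file.
awk 'NR == FNR { want[tolower($1)]; next }
     { for (i = 1; i <= NF; i++) delete want[tolower($i)] }
     END { for (w in want) print w }' word.file main.file
```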

Thanks again...

Jack

(^_^)
Happy trails,
Jack L. Stone

System Admin
Sage-american
