svn commit: r226056 - user/gabor/tre-integration/lib/libc/regex

Thu Oct 6 11:20:22 UTC 2011

Author: gabor
Date: Thu Oct  6 11:20:21 2011
New Revision: 226056
URL: http://svn.freebsd.org/changeset/base/226056

Log:
  - Clean up and update manual page

Modified:
  user/gabor/tre-integration/lib/libc/regex/re_format.7

Modified: user/gabor/tre-integration/lib/libc/regex/re_format.7
==============================================================================

--- user/gabor/tre-integration/lib/libc/regex/re_format.7	Thu Oct  6 11:17:54 2011	(r226055)
+++ user/gabor/tre-integration/lib/libc/regex/re_format.7	Thu Oct  6 11:20:21 2011	(r226056)
@@ -1,3 +1,4 @@
+.\" Copyright (c) 2011 Gabor Kovesdan <gabor at FreeBSD.org>.
 .\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
 .\" Copyright (c) 1992, 1993, 1994
 .\"	The Regents of the University of California.  All rights reserved.
@@ -36,7 +37,7 @@
 .\"	@(#)re_format.7	8.3 (Berkeley) 3/20/94
 .\" $FreeBSD$
 .\"
-.Dd March 20, 1994
+.Dd October 6, 2011
 .Dt RE_FORMAT 7
 .Os
 .Sh NAME
@@ -48,32 +49,33 @@ Regular expressions
 as defined in
 .St -p1003.2 ,
 come in two forms:
-modern REs (roughly those of
+modern regular expressions (roughly those of
 .Xr egrep 1 ;
 1003.2 calls these
 .Dq extended
-REs)
-and obsolete REs (roughly those of
+regular expressions or
+.Dq EREs )
+and obsolete regular expressions (roughly those of
 .Xr ed 1 ;
-1003.2
+1003.2 calls these
 .Dq basic
-REs).
-Obsolete REs mostly exist for backward compatibility in some old programs;
+regular expressions or
+.Dq BREs ) .
+BREs mostly exist for backward compatibility in some old programs;
 they will be discussed at the end.
 .St -p1003.2
-leaves some aspects of RE syntax and semantics open;
-`\(dd' marks decisions on these aspects that
-may not be fully portable to other
-.St -p1003.2
-implementations.
+leaves some aspects of regular expression syntax and semantics open,
+so this manual will describe the behavior of this implementation
+instead of just reproducing the same information that is already
+available in the standard.
 .Pp
-A (modern) RE is one\(dd or more non-empty\(dd
+An extended regular expression is one or more non-empty
 .Em branches ,
 separated by
 .Ql \&| .
 It matches anything that matches one of the branches.
 .Pp
-A branch is one\(dd or more
+A branch is one or more
 .Em pieces ,
 concatenated.
 It matches a match for the first, followed by a match for the second, etc.
@@ -81,7 +83,7 @@ It matches a match for the first, follow
 A piece is an
 .Em atom
 possibly followed
-by a single\(dd
+by a single
 .Ql \&* ,
 .Ql \&+ ,
 .Ql \&? ,
@@ -99,42 +101,30 @@ matches a sequence of 0 or 1 matches of 
 .Pp
 A
 .Em bound
-is
-.Ql \&{
-followed by an unsigned decimal integer,
-possibly followed by
-.Ql \&,
-possibly followed by another unsigned decimal integer,
-always followed by
-.Ql \&} .
+is an expression that allows the repetition of the atom
+according to the specified constraints.
+A
+.Em bound
+starts with an opening brace
+.Pq Ql \&{
+character, followed by an unsigned decimal integer, an optional comma
+.Pq Ql \&,
+followed by another unsigned decimal integer,
+always followed by a closing brace
+.Pq Ql \&} .
 The integers must lie between 0 and
 .Dv RE_DUP_MAX
-(255\(dd) inclusive,
-and if there are two of them, the first may not exceed the second.
-An atom followed by a bound containing one integer
-.Em i
-and no comma matches
-a sequence of exactly
-.Em i
-matches of the atom.
-An atom followed by a bound
-containing one integer
-.Em i
-and a comma matches
-a sequence of
-.Em i
-or more matches of the atom.
-An atom followed by a bound
-containing two integers
-.Em i
-and
-.Em j
-matches
-a sequence of
-.Em i
-through
-.Em j
-(inclusive) matches of the atom.
+inclusive.
+The integers restrict the minimum and maximum repetition count of the atom
+and the first number may not exceed the second.
+The second integer is optional and if it is missing but the comma is present,
+there is no upper limit on the repetition count.
+If there is only one integer specified and the comma is also missing,
+exactly the specified number of repetitions is required.
+In this implementation,
+it is also possible to leave out the first integer and only specify the
+comma and the upper limit.
+In this case 0 is implied as a minimum repetition count.
 .Pp
 An atom is a regular expression enclosed in
 .Ql ()
@@ -142,7 +132,7 @@ An atom is a regular expression enclosed
 regular expression),
 an empty set of
 .Ql ()
-(matching the null string)\(dd,
+(matching the null string),
 a
 .Em bracket expression
 (see below),
@@ -155,47 +145,46 @@ a
 .Ql \e
 followed by one of the characters
 .Ql ^.[$()|*+?{\e
-(matching that character taken as an ordinary character),
-a
-.Ql \e
-followed by any other character\(dd
-(matching that character taken as an ordinary character,
-as if the
-.Ql \e
-had not been present\(dd),
-or a single character with no other significance (matching that character).
+(matching the escaped character taken as an ordinary character)
+or a single character with no other significance (matching the
+same character).
 A
 .Ql \&{
 followed by a character other than a digit is an ordinary
-character, not the beginning of a bound\(dd.
-It is illegal to end an RE with
+character, not the beginning of a bound.
+It is illegal to end a regular expression with
 .Ql \e .
 .Pp
 A
 .Em bracket expression
 is a list of characters enclosed in
 .Ql [] .
-It normally matches any single character from the list (but see below).
+It always matches a single character but the set of matching characters
+is determined by more specific rules.
 If the list begins with
 .Ql \&^ ,
-it matches any single character
-(but see below)
-.Em not
-from the rest of the list.
-If two characters in the list are separated by
+it matches any single character that is not present in the rest of the
+list.
+If the list does not begin with
+.Ql \&^ ,
+normally all characters that are listed in the brackets will match.
+An exception to this is the use of collating ranges.
+If there is a
 .Ql \&- ,
-this is shorthand
-for the full
-.Em range
-of characters between those two (inclusive) in the
-collating sequence,
-.No e.g. Ql [0-9]
-in ASCII matches any decimal digit.
-It is illegal\(dd for two ranges to share an
-endpoint,
-.No e.g. Ql a-c-e .
-Ranges are very collating-sequence-dependent,
-and portable programs should avoid relying on them.
+which is not the first character in the bracket,
+it will be interpreted as a collating range and will match all
+characters that fall in between the preceding and following characters
+(inclusive) in the current locale's collating order.
+.No For example, Ql [a0-9]
+in ASCII matches
+.Ql a
+or any decimal digit.
+.No For example, Ql [^agh]
+matches any character that is not
+.Ql a ,
+.Ql g ,
+or
+.Ql h .
 .Pp
 To include a literal
 .Ql \&]
@@ -235,7 +224,7 @@ can thus match more than one character,
 e.g.\& if the collating sequence includes a
 .Ql ch
 collating element,
-then the RE
+then the regular expression
 .Ql [[.ch.]]*c
 matches the first five characters
 of
@@ -263,7 +252,7 @@ then
 and
 .Ql [xy]
 are all synonymous.
-An equivalence class may not\(dd be an endpoint
+An equivalence class may not be an endpoint
 of a range.
 .Pp
 Within a bracket expression, the name of a
@@ -284,7 +273,7 @@ Standard character class names are:
 .Pp
 These stand for the character classes defined in
 .Xr ctype 3 .
-A locale may provide others.
+A particular locale may provide others.
 A character class may not be used as an endpoint of a range.
 .Pp
 A bracketed expression like
@@ -295,35 +284,16 @@ The reverse, matching any character that
 class, the negation operator of bracket expressions may be used:
 .Ql [^[:class:]] .
 .Pp
-There are two special cases\(dd of bracket expressions:
-the bracket expressions
-.Ql [[:<:]]
-and
-.Ql [[:>:]]
-match the null string at the beginning and end of a word respectively.
-A word is defined as a sequence of word characters
-which is neither preceded nor followed by
-word characters.
-A word character is an
-.Em alnum
-character (as defined by
-.Xr ctype 3 )
-or an underscore.
-This is an extension,
-compatible with but not specified by
-.St -p1003.2 ,
-and should be used with
-caution in software intended to be portable to other systems.
-.Pp
-In the event that an RE could match more than one substring of a given
-string,
-the RE matches the one starting earliest in the string.
-If the RE could match more than one substring starting at that point,
+In the event that a regular expression could match more than one
+substring of a given string,
+the regular expression matches the one starting earliest in the string.
+If the regular expression could match more than one substring starting
+at that point,
 it matches the longest.
 Subexpressions also match the longest possible substrings, subject to
 the constraint that the whole match be as long as possible,
-with subexpressions starting earlier in the RE taking priority over
-ones starting later.
+with subexpressions starting earlier in the regular expression taking
+priority over ones starting later.
 Note that higher-level subexpressions thus take priority over
 their lower-level component subexpressions.
 .Pp
@@ -346,15 +316,14 @@ when
 .Ql (a*)*
 is matched against
 .Ql bc
-both the whole RE and the parenthesized
+both the whole regular expression and the parenthesized
 subexpression match the null string.
 .Pp
-If case-independent matching is specified,
-the effect is much as if all case distinctions had vanished from the
-alphabet.
-When an alphabetic that exists in multiple cases appears as an
-ordinary character outside a bracket expression, it is effectively
-transformed into a bracket expression containing both cases,
+The effect of case-independent matching is as if all case distinctions
+had vanished from the alphabet.
+It can also be modelled as if each and every character were replaced
+by a bracket expression,
+containing both cases of the same letter,
 .No e.g. Ql x
 becomes
 .Ql [xX] .
@@ -368,15 +337,13 @@ and
 becomes
 .Ql [^xX] .
 .Pp
-No particular limit is imposed on the length of REs\(dd.
-Programs intended to be portable should not employ REs longer
-than 256 bytes,
-as an implementation can refuse to accept such REs and remain
-POSIX-compliant.
-.Pp
-Obsolete
-.Pq Dq basic
-regular expressions differ in several respects.
+No particular limit is imposed on the length of regular expressions.
+Programs intended to be portable should not employ regular expressions
+longer than 256 bytes,
+as an implementation can refuse to accept such regular expressions and
+remain POSIX-compliant.
+.Pp
+Basic regular expressions differ in several respects.
 .Ql \&|
 is an ordinary character and there is no equivalent
 for its functionality.
@@ -391,7 +358,7 @@ or
 respectively).
 Also note that
 .Ql x+
-in modern REs is equivalent to
+in extended regular expressions is equivalent to
 .Ql xx* .
 The delimiters for bounds are
 .Ql \e{
@@ -412,15 +379,15 @@ and
 .Ql \&)
 by themselves ordinary characters.
 .Ql \&^
-is an ordinary character except at the beginning of the
-RE or\(dd the beginning of a parenthesized subexpression,
+is an ordinary character except at the beginning of the regular expression
+or the beginning of a parenthesized subexpression,
 .Ql \&$
 is an ordinary character except at the end of the
-RE or\(dd the end of a parenthesized subexpression,
+regular expression or the end of a parenthesized subexpression,
 and
 .Ql \&*
 is an ordinary character if it appears at the beginning of the
-RE or the beginning of a parenthesized subexpression
+regular expression or the beginning of a parenthesized subexpression
 (after a possible leading
 .Ql \&^ ) .
 Finally, there is one new type of atom, a
@@ -442,6 +409,9 @@ or
 .Ql cc
 but not
 .Ql bc .
+.Pp
+Back references are not defined for extended regular expressions but
+most implementations (including this one) support them.
 .Sh SEE ALSO
 .Xr regex 3
 .Rs
@@ -450,34 +420,12 @@ but not
 .%N 1003.2
 .%P section 2.8
 .Re
-.Sh BUGS
-Having two kinds of REs is a botch.
-.Pp
-The current
-.St -p1003.2
-spec says that
-.Ql \&)
-is an ordinary character in
-the absence of an unmatched
-.Ql \&( ;
-this was an unintentional result of a wording error,
-and change is likely.
-Avoid relying on it.
-.Pp
-Back references are a dreadful botch,
-posing major problems for efficient implementations.
-They are also somewhat vaguely defined
-(does
-.Ql a\e(\e(b\e)*\e2\e)*d
-match
-.Ql abbbd ? ) .
-Avoid using them.
-.Pp
-.St -p1003.2
-specification of case-independent matching is vague.
-The
-.Dq one case implies all cases
-definition given above
-is current consensus among implementors as to the right interpretation.
-.Pp
-The syntax for word boundaries is incredibly ugly.
+.Sh HISTORY
+This manual was originally written by
+.An Henry Spencer
+for an older implementation and later extended and
+tailored for TRE by
+.An Gabor Kovesdan .
+The regex implementation comes from the TRE project
+and it was included first in
+.Fx 10-CURRENT.
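
The bound semantics described in the revised page can be tried out with
a minimal sketch in C built on the POSIX regcomp(3) and regexec(3)
interface. The helper name ere_match_len is hypothetical and is not part
of this commit. The sketch only exercises bounds with an explicit
minimum, because the page notes that omitting the first integer is
specific to this implementation.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the commit): compile `pattern` as an
 * extended regular expression and return the length of the first match
 * in `subject`, -1 if nothing matches, or -2 if regcomp() rejects the
 * pattern.
 */
static int
ere_match_len(const char *pattern, const char *subject)
{
	regex_t re;
	regmatch_t m[1];
	int len;

	/* REG_EXTENDED selects EREs, so bounds are written with plain braces. */
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-2);

	/* m[0] receives the offsets of the whole match. */
	if (regexec(&re, subject, 1, m, 0) == 0)
		len = (int)(m[0].rm_eo - m[0].rm_so);
	else
		len = -1;

	regfree(&re);
	return (len);
}
```

Assuming a POSIX-conforming regex library, ere_match_len("a{2}", "xaaay")
returns 2 (exactly two repetitions), ere_match_len("a{2,}", "aaaa")
returns 4 (no upper limit), and ere_match_len("a{2,3}", "aaaa") returns 3
(the longest match allowed by the upper limit).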