svn commit: r226056 - user/gabor/tre-integration/lib/libc/regex
Gabor Kovesdan
gabor at FreeBSD.org
Thu Oct 6 11:20:22 UTC 2011
Author: gabor
Date: Thu Oct 6 11:20:21 2011
New Revision: 226056
URL: http://svn.freebsd.org/changeset/base/226056
Log:
- Clean up and update manual page
Modified:
user/gabor/tre-integration/lib/libc/regex/re_format.7
Modified: user/gabor/tre-integration/lib/libc/regex/re_format.7
==============================================================================
--- user/gabor/tre-integration/lib/libc/regex/re_format.7 Thu Oct 6 11:17:54 2011 (r226055)
+++ user/gabor/tre-integration/lib/libc/regex/re_format.7 Thu Oct 6 11:20:21 2011 (r226056)
@@ -1,3 +1,4 @@
+.\" Copyright (c) 2011 Gabor Kovesdan <gabor at FreeBSD.org>.
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
@@ -36,7 +37,7 @@
.\" @(#)re_format.7 8.3 (Berkeley) 3/20/94
.\" $FreeBSD$
.\"
-.Dd March 20, 1994
+.Dd October 6, 2011
.Dt RE_FORMAT 7
.Os
.Sh NAME
@@ -48,32 +49,33 @@ Regular expressions
as defined in
.St -p1003.2 ,
come in two forms:
-modern REs (roughly those of
+modern regular expressions (roughly those of
.Xr egrep 1 ;
1003.2 calls these
.Dq extended
-REs)
-and obsolete REs (roughly those of
+regular expressions or
+.Dq EREs )
+and obsolete regular expressions (roughly those of
.Xr ed 1 ;
-1003.2
+1003.2 calls these
.Dq basic
-REs).
-Obsolete REs mostly exist for backward compatibility in some old programs;
+regular expressions or
+.Dq BREs ) .
+BREs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
.St -p1003.2
-leaves some aspects of RE syntax and semantics open;
-`\(dd' marks decisions on these aspects that
-may not be fully portable to other
-.St -p1003.2
-implementations.
+leaves some aspects of regular expression syntax and semantics open,
+so this manual will describe the behavior of this implementation
+instead of just reproducing the same information that is already
+available in the standard.
.Pp
-A (modern) RE is one\(dd or more non-empty\(dd
+An extended regular expression is one or more non-empty
.Em branches ,
separated by
.Ql \&| .
It matches anything that matches one of the branches.
.Pp
-A branch is one\(dd or more
+A branch is one or more
.Em pieces ,
concatenated.
It matches a match for the first, followed by a match for the second, etc.
@@ -81,7 +83,7 @@ It matches a match for the first, follow
A piece is an
.Em atom
possibly followed
-by a single\(dd
+by a single
.Ql \&* ,
.Ql \&+ ,
.Ql \&? ,
@@ -99,42 +101,30 @@ matches a sequence of 0 or 1 matches of
.Pp
A
.Em bound
-is
-.Ql \&{
-followed by an unsigned decimal integer,
-possibly followed by
-.Ql \&,
-possibly followed by another unsigned decimal integer,
-always followed by
-.Ql \&} .
+is an expression that allows the repetition of the atom
+according to the specified constraints.
+A
+.Em bound
+starts with an opening brace
+.Pq Ql \&{
+character, followed by an unsigned decimal integer, an optional comma
+.Pq Ql \&,
+optionally followed by another unsigned decimal integer,
+always followed by a closing brace
+.Pq Ql \&} .
The integers must lie between 0 and
.Dv RE_DUP_MAX
-(255\(dd) inclusive,
-and if there are two of them, the first may not exceed the second.
-An atom followed by a bound containing one integer
-.Em i
-and no comma matches
-a sequence of exactly
-.Em i
-matches of the atom.
-An atom followed by a bound
-containing one integer
-.Em i
-and a comma matches
-a sequence of
-.Em i
-or more matches of the atom.
-An atom followed by a bound
-containing two integers
-.Em i
-and
-.Em j
-matches
-a sequence of
-.Em i
-through
-.Em j
-(inclusive) matches of the atom.
+inclusive.
+The integers restrict the minimum and maximum repetition count of the atom
+and the first number may not exceed the second.
+The second integer is optional and if it is missing but the comma is present,
+there is no upper limit on the repetition count.
+If there is only one integer specified and the comma is also missing,
+exactly the specified number of repetitions is required.
+In this implementation,
+it is also possible to leave out the first integer and only specify the
+comma and the upper limit.
+In this case 0 is implied as a minimum repetition count.
.Pp
An atom is a regular expression enclosed in
.Ql ()
@@ -142,7 +132,7 @@ An atom is a regular expression enclosed
regular expression),
an empty set of
.Ql ()
-(matching the null string)\(dd,
+(matching the null string),
a
.Em bracket expression
(see below),
@@ -155,47 +145,46 @@ a
.Ql \e
followed by one of the characters
.Ql ^.[$()|*+?{\e
-(matching that character taken as an ordinary character),
-a
-.Ql \e
-followed by any other character\(dd
-(matching that character taken as an ordinary character,
-as if the
-.Ql \e
-had not been present\(dd),
-or a single character with no other significance (matching that character).
+(matching the escaped character taken as an ordinary character)
+or a single character with no other significance (matching the
+same character).
A
.Ql \&{
followed by a character other than a digit is an ordinary
-character, not the beginning of a bound\(dd.
-It is illegal to end an RE with
+character, not the beginning of a bound.
+It is illegal to end a regular expression with
.Ql \e .
.Pp
A
.Em bracket expression
is a list of characters enclosed in
.Ql [] .
-It normally matches any single character from the list (but see below).
+It always matches a single character but the set of matching characters
+is determined by more specific rules.
If the list begins with
.Ql \&^ ,
-it matches any single character
-(but see below)
-.Em not
-from the rest of the list.
-If two characters in the list are separated by
+it matches any single character that is not present in the rest of the
+list.
+If the list does not begin with
+.Ql \&^ ,
+normally all characters that are listed in the brackets will match.
+An exception to this is the use of collating ranges.
+If there is a
.Ql \&- ,
-this is shorthand
-for the full
-.Em range
-of characters between those two (inclusive) in the
-collating sequence,
-.No e.g. Ql [0-9]
-in ASCII matches any decimal digit.
-It is illegal\(dd for two ranges to share an
-endpoint,
-.No e.g. Ql a-c-e .
-Ranges are very collating-sequence-dependent,
-and portable programs should avoid relying on them.
+which is not the first character in the bracket,
+it will be interpreted as a collating range and will match all
+characters that fall in between the preceding and following characters
+(inclusive) in the current locale's collating order.
+.No For example, Ql [a0-9]
+in ASCII matches
+.Ql a
+or any decimal digit.
+.No For example, Ql [^agh]
+matches any character that is not
+.Ql a ,
+.Ql g ,
+or
+.Ql h .
.Pp
To include a literal
.Ql \&]
@@ -235,7 +224,7 @@ can thus match more than one character,
e.g.\& if the collating sequence includes a
.Ql ch
collating element,
-then the RE
+then the regular expression
.Ql [[.ch.]]*c
matches the first five characters
of
@@ -263,7 +252,7 @@ then
and
.Ql [xy]
are all synonymous.
-An equivalence class may not\(dd be an endpoint
+An equivalence class may not be an endpoint
of a range.
.Pp
Within a bracket expression, the name of a
@@ -284,7 +273,7 @@ Standard character class names are:
.Pp
These stand for the character classes defined in
.Xr ctype 3 .
-A locale may provide others.
+A particular locale may provide others.
A character class may not be used as an endpoint of a range.
.Pp
A bracketed expression like
@@ -295,35 +284,16 @@ The reverse, matching any character that
class, the negation operator of bracket expressions may be used:
.Ql [^[:class:]] .
.Pp
-There are two special cases\(dd of bracket expressions:
-the bracket expressions
-.Ql [[:<:]]
-and
-.Ql [[:>:]]
-match the null string at the beginning and end of a word respectively.
-A word is defined as a sequence of word characters
-which is neither preceded nor followed by
-word characters.
-A word character is an
-.Em alnum
-character (as defined by
-.Xr ctype 3 )
-or an underscore.
-This is an extension,
-compatible with but not specified by
-.St -p1003.2 ,
-and should be used with
-caution in software intended to be portable to other systems.
-.Pp
-In the event that an RE could match more than one substring of a given
-string,
-the RE matches the one starting earliest in the string.
-If the RE could match more than one substring starting at that point,
+In the event that a regular expression could match more than one
+substring of a given string,
+the regular expression matches the one starting earliest in the string.
+If the regular expression could match more than one substring starting
+at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
-with subexpressions starting earlier in the RE taking priority over
-ones starting later.
+with subexpressions starting earlier in the regular expression taking
+priority over ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.Pp
@@ -346,15 +316,14 @@ when
.Ql (a*)*
is matched against
.Ql bc
-both the whole RE and the parenthesized
+both the whole regular expression and the parenthesized
subexpression match the null string.
.Pp
-If case-independent matching is specified,
-the effect is much as if all case distinctions had vanished from the
-alphabet.
-When an alphabetic that exists in multiple cases appears as an
-ordinary character outside a bracket expression, it is effectively
-transformed into a bracket expression containing both cases,
+The effect of case-independent matching is as if all case distinctions
+had vanished from the alphabet.
+It can also be modelled as if each and every character were replaced
+by a bracket expression,
+containing both cases of the same letter,
.No e.g. Ql x
becomes
.Ql [xX] .
@@ -368,15 +337,13 @@ and
becomes
.Ql [^xX] .
.Pp
-No particular limit is imposed on the length of REs\(dd.
-Programs intended to be portable should not employ REs longer
-than 256 bytes,
-as an implementation can refuse to accept such REs and remain
-POSIX-compliant.
-.Pp
-Obsolete
-.Pq Dq basic
-regular expressions differ in several respects.
+No particular limit is imposed on the length of regular expressions.
+Programs intended to be portable should not employ regular expressions
+longer than 256 bytes,
+as an implementation can refuse to accept such regular expressions and
+remain POSIX-compliant.
+.Pp
+Basic regular expressions differ in several respects.
.Ql \&|
is an ordinary character and there is no equivalent
for its functionality.
@@ -391,7 +358,7 @@ or
respectively).
Also note that
.Ql x+
-in modern REs is equivalent to
+in extended regular expressions is equivalent to
.Ql xx* .
The delimiters for bounds are
.Ql \e{
@@ -412,15 +379,15 @@ and
.Ql \&)
by themselves ordinary characters.
.Ql \&^
-is an ordinary character except at the beginning of the
-RE or\(dd the beginning of a parenthesized subexpression,
+is an ordinary character except at the beginning of the regular expression
+or the beginning of a parenthesized subexpression,
.Ql \&$
is an ordinary character except at the end of the
-RE or\(dd the end of a parenthesized subexpression,
+regular expression or the end of a parenthesized subexpression,
and
.Ql \&*
is an ordinary character if it appears at the beginning of the
-RE or the beginning of a parenthesized subexpression
+regular expression or the beginning of a parenthesized subexpression
(after a possible leading
.Ql \&^ ) .
Finally, there is one new type of atom, a
@@ -442,6 +409,9 @@ or
.Ql cc
but not
.Ql bc .
+.Pp
+Back references are not defined for extended regular expressions but
+most implementations (including this one) support them.
.Sh SEE ALSO
.Xr regex 3
.Rs
@@ -450,34 +420,12 @@ but not
.%N 1003.2
.%P section 2.8
.Re
-.Sh BUGS
-Having two kinds of REs is a botch.
-.Pp
-The current
-.St -p1003.2
-spec says that
-.Ql \&)
-is an ordinary character in
-the absence of an unmatched
-.Ql \&( ;
-this was an unintentional result of a wording error,
-and change is likely.
-Avoid relying on it.
-.Pp
-Back references are a dreadful botch,
-posing major problems for efficient implementations.
-They are also somewhat vaguely defined
-(does
-.Ql a\e(\e(b\e)*\e2\e)*d
-match
-.Ql abbbd ? ) .
-Avoid using them.
-.Pp
-.St -p1003.2
-specification of case-independent matching is vague.
-The
-.Dq one case implies all cases
-definition given above
-is current consensus among implementors as to the right interpretation.
-.Pp
-The syntax for word boundaries is incredibly ugly.
+.Sh HISTORY
+This manual was originally written by
+.An Henry Spencer
+for an older implementation and later extended and
+tailored for TRE by
+.An Gabor Kovesdan .
+The regex implementation comes from the TRE project
+and it was included first in
+.Fx 10-CURRENT.
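
The bound and leftmost-longest rules described in the updated manual page can be checked from C with the standard regcomp(3), regexec(3) and regfree(3) calls. The following is only a minimal sketch: ere_match() is a hypothetical helper name that is not part of this commit, and it exercises only the portable bound forms, not the implementation-specific leading-comma form.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the commit): compile `pattern` as an
 * extended regular expression and match it against `text`.
 * Returns the length of the match, or -1 if the pattern does not compile
 * or nothing matches.  If `start` is not NULL, it receives the offset of
 * the match within `text`.
 */
int
ere_match(const char *pattern, const char *text, int *start)
{
	regex_t re;
	regmatch_t m;
	int len = -1;

	assert(pattern != NULL && text != NULL);

	/* REG_EXTENDED selects EREs; without it the pattern is a BRE. */
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);

	/* Ask for one regmatch_t: the span of the whole match. */
	if (regexec(&re, text, 1, &m, 0) == 0) {
		if (start != NULL)
			*start = (int)m.rm_so;
		len = (int)(m.rm_eo - m.rm_so);
	}

	regfree(&re);
	return (len);
}
```

On a POSIX-conforming implementation, ere_match("a{2,3}", "aaaa", NULL) should return 3 and ere_match("a{2,}", "aaaa", NULL) should return 4, reflecting the upper limit and its absence. ere_match("a|ab", "xab", &start) should return 2 with start set to 1, because the match begins at the earliest possible position and is then the longest one available there.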