bin/143369: awk(1) doesn't handle RS as a regexp but as a single character

Mikolaj Golub to.my.trociny at gmail.com
Sat Jan 30 11:30:01 UTC 2010


>Number:         143369
>Category:       bin
>Synopsis:       awk(1) doesn't handle RS as a regexp but as a single character
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Jan 30 11:30:00 UTC 2010
>Closed-Date:
>Last-Modified:
>Originator:     Mikolaj Golub
>Release:        8.0-STABLE, 7.2-STABLE
>Organization:
>Environment:
FreeBSD zhuzha.ua1 8.0-STABLE FreeBSD 8.0-STABLE #6: Sun Jan 24 21:36:17 EET 2010     root at zhuzha.ua1:/usr/obj/usr/src/sys/GENERIC  i386
>Description:
This problem with awk(1) was reported to NetBSD by John Darrow and it was fixed there.

awk accepts an arbitrary string in the RS variable but does not treat that string as a regular expression when splitting records; instead, it splits only on the first character of the string.

http://www.netbsd.org/cgi-bin/query-pr-single.pl?number=30294

FreeBSD has the same problem, and it would be nice to fix it here as well.
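
To make the difference concrete, here is a minimal standalone sketch contrasting the two splitting behaviors. It is not the awk code: it uses POSIX regcomp(3)/regexec(3) in place of awk's own matcher, and struct records, add_record, split_first_char and split_regex are hypothetical names made up for this illustration. Buffer limits are fixed and arbitrary.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

#define MAX_RECORDS 16
#define MAX_RECLEN  64

/* Hypothetical container for the split result (not part of awk). */
struct records {
	int  count;
	char text[MAX_RECORDS][MAX_RECLEN];
};

/*
 * Store start[0..len) as one record.  Over-long records are truncated
 * and records beyond MAX_RECORDS are dropped.
 */
void
add_record(struct records *out, const char *start, size_t len)
{
	if (out->count >= MAX_RECORDS)
		return;
	if (len >= MAX_RECLEN)
		len = MAX_RECLEN - 1;
	memcpy(out->text[out->count], start, len);
	out->text[out->count][len] = '\0';
	out->count++;
}

/* Reported behavior: only the first character of rs acts as separator. */
void
split_first_char(const char *input, const char *rs, struct records *out)
{
	const char *p = input, *q;

	assert(input != NULL && rs != NULL && rs[0] != '\0');
	out->count = 0;
	while ((q = strchr(p, rs[0])) != NULL) {
		add_record(out, p, (size_t)(q - p));
		p = q + 1;
	}
	if (*p != '\0')		/* trailing text without a separator */
		add_record(out, p, strlen(p));
}

/*
 * Requested behavior: rs is a POSIX extended regular expression.
 * Returns -1 if rs does not compile, 0 otherwise.  Anchors are not
 * handled specially, and an empty match ends the splitting.
 */
int
split_regex(const char *input, const char *rs, struct records *out)
{
	regex_t re;
	regmatch_t m;
	const char *p = input;

	assert(input != NULL && rs != NULL);
	out->count = 0;
	if (regcomp(&re, rs, REG_EXTENDED) != 0)
		return -1;
	while (*p != '\0' && regexec(&re, p, 1, &m, 0) == 0 &&
	    m.rm_eo > m.rm_so) {
		add_record(out, p, (size_t)m.rm_so);
		p += m.rm_eo;
	}
	if (*p != '\0')		/* trailing text without a separator */
		add_record(out, p, strlen(p));
	regfree(&re);
	return 0;
}
```

Called with the input "a b c d\n" and RS set to "[[:space:]]", split_first_char() looks only for '[' and returns the whole line as a single record, while split_regex() returns the four records a, b, c and d.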
>How-To-Repeat:
zhuzha:~% echo 'a b c d' | awk 'BEGIN {RS=" ";} {print $0}'
a
b
c
d

zhuzha:~% echo 'a b c d' | awk 'BEGIN {RS="[[:space:]]";} {print $0}'
a b c d

zhuzha:~% echo 'a[b[c[d' | awk 'BEGIN {RS="[[:space:]]";} {print $0}'
a
b
c
d

>Fix:
See the attached patch, adapted from NetBSD (PR/30294: John Darrow: nawk doesn't handle RS as a RE but as a single character).

Patch attached with submission follows:

diff -ru contrib/one-true-awk.orig/lib.c contrib/one-true-awk/lib.c
--- contrib/one-true-awk.orig/lib.c	2007-10-25 15:38:02.000000000 +0300
+++ contrib/one-true-awk/lib.c	2010-01-30 13:04:13.000000000 +0200
@@ -194,22 +194,62 @@
 			;
 		if (c != EOF)
 			ungetc(c, inf);
-	}
-	for (rr = buf; ; ) {
-		for (; (c=getc(inf)) != sep && c != EOF; ) {
-			if (rr-buf+1 > bufsize)
-				if (!adjbuf(&buf, &bufsize, 1+rr-buf, recsize, &rr, "readrec 1"))
-					FATAL("input record `%.30s...' too long", buf);
+	} else if ((*RS)[1]) {
+		fa *pfa = makedfa(*RS, 1);
+		int tempstat = pfa->initstat;
+		char *brr = buf;
+		char *rrr = NULL;
+		int x;
+		for (rr = buf; ; ) {
+			while ((c = getc(inf)) != EOF) {
+				if (rr-buf+3 > bufsize)
+					if (!adjbuf(&buf, &bufsize, 3+rr-buf,
+					    recsize, &rr, "readrec 2"))
+						FATAL("input record `%.30s...'"
+						    " too long", buf);
+				*rr++ = c;
+				*rr = '\0';
+				if (!(x = nematch(pfa, brr))) {
+					pfa->initstat = tempstat;
+					if (rrr) {
+						rr = rrr;
+						ungetc(c, inf);
+						break;
+					}
+				} else {
+					pfa->initstat = 2;
+					brr = rrr = rr = patbeg;
+				}
+			}
+			if (rrr || c == EOF)
+				break;
+			if ((c = getc(inf)) == '\n' || c == EOF)
+				/* 2 in a row */
+				break;
+			*rr++ = '\n';
+			*rr++ = c;
+		}
+	} else {
+		for (rr = buf; ; ) {
+			for (; (c=getc(inf)) != sep && c != EOF; ) {
+				if (rr-buf+1 > bufsize)
+					if (!adjbuf(&buf, &bufsize, 1+rr-buf,
+					    recsize, &rr, "readrec 1"))
+						FATAL("input record `%.30s...'"
+						    " too long", buf);
+				*rr++ = c;
+			}
+			if (**RS == sep || c == EOF)
+				break;
+			if ((c = getc(inf)) == '\n' || c == EOF)
+				/* 2 in a row */
+				break;
+			if (!adjbuf(&buf, &bufsize, 2+rr-buf, recsize, &rr,
+			    "readrec 2"))
+				FATAL("input record `%.30s...' too long", buf);
+			*rr++ = '\n';
 			*rr++ = c;
 		}
-		if (**RS == sep || c == EOF)
-			break;
-		if ((c = getc(inf)) == '\n' || c == EOF) /* 2 in a row */
-			break;
-		if (!adjbuf(&buf, &bufsize, 2+rr-buf, recsize, &rr, "readrec 2"))
-			FATAL("input record `%.30s...' too long", buf);
-		*rr++ = '\n';
-		*rr++ = c;
 	}
 	if (!adjbuf(&buf, &bufsize, 1+rr-buf, recsize, &rr, "readrec 3"))
 		FATAL("input record `%.30s...' too long", buf);


>Release-Note:
>Audit-Trail:
>Unformatted:

