Tidy and HTML tab spacing

Warren Block wblock at wonkity.com
Wed Jan 18 22:49:50 UTC 2012


HTML versions of FreeBSD documents are fed through tidy (www/tidy or 
www/tidy-devel) for cleanup.  There's a bug in tidy[1] that can cause tab 
stops to be wrong:
http://www.freebsd.org/doc/en_US.ISO8859-1/books/porters-handbook/makefile-distfiles.html#AEN1623

Note how the values of DISTNAME and EXTRACT_SUFX do not line up.  They are
aligned correctly in the source, book.sgml.

So what to do?

1. It might be possible to fix tidy.  This would be the neatest
    solution.  (See [1].)

2. An option could be added to tidy to ignore tabs.  The HTML standard
    "strongly discourages" tabs in PRE elements[2], but does not disallow
    them.  Keeping actual tabs has an added benefit for users: when they
    cut-and-paste or drag-select Makefile examples, the embedded tabs
    are preserved and visible.

3. Tidy could be replaced with some other tool.  However, the others
    I've found have additional dependencies on either PHP or Java, so I
    did not test them for correct handling of tabs[3],[4].  Either
    dependency adds overhead not just for doc build machines but for
    anyone who wants to work on FreeBSD documentation.

4. Move the newline in the HTML during the build, before it gets to
    tidy, so that the '>' ends the tag line instead of starting the
    first line of text:
      s/CLASS="PROGRAMLISTING"\n>/CLASS="PROGRAMLISTING">\n/

5. Don't tidy HTML files at all (suggested as an option by Benedict
    Reuschling).  The unprocessed HTML is ugly, but few people are going
    to look at it directly.  Files that haven't been through tidy are a
    little larger, about 4% in the case of the Porter's Handbook.
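
As a rough sketch of option 4, assuming the build could run a small
Python filter (an assumption; the real build would more likely use sed
or a similar tool), the substitution might look like the following.  The
function name move_listing_newline is hypothetical and only serves this
illustration.

```python
import re

# Hypothetical helper, not part of the FreeBSD doc build.  It applies the
# same substitution as the s/// expression in option 4.
_LISTING_TAG = re.compile(r'CLASS="PROGRAMLISTING"\n>')


def move_listing_newline(html: str) -> str:
    """Move the newline that precedes '>' to just after it."""
    return _LISTING_TAG.sub('CLASS="PROGRAMLISTING">\n', html)


# The pre-tidy output shown in footnote [1], with real tab characters.
sample = (
    '<PRE\n'
    'CLASS="PROGRAMLISTING"\n'
    '>DISTNAME=\tfoo\n'
    'EXTRACT_SUFX=\t.tgz</PRE\n'
    '>'
)

print(move_listing_newline(sample))
```

After the substitution, DISTNAME starts at the beginning of its own line,
so the closing '>' of the tag can no longer be counted toward the tab
stop.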


Footnotes:

[1] In www/tidy-devel, the code at line 355 of streamio.c does not take
into account that characters at the beginning of the line may be inside a
tag and so should not count as visible.  The pre-tidy HTML output of the
example above is
----
<PRE
CLASS="PROGRAMLISTING"
>DISTNAME=	foo
EXTRACT_SUFX=	.tgz</PRE
> 
----
The '>' before DISTNAME is being wrongly counted toward the tab stop.
See http://www.wonkity.com/~wblock/tidy/ for a slightly more detailed
example.  Tidy is mature software, and a report for this problem has been
in its bug database since 2008:
https://sourceforge.net/tracker/?func=detail&aid=1885471&group_id=27659&atid=390963
So a fix for this from the tidy project seems unlikely.
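
The miscount can be modeled in a few lines of Python.  This is only an
illustration of the arithmetic, not tidy's actual code: the expand()
function and the tab width of 8 are assumptions made for the example.

```python
TAB_WIDTH = 8  # assumed tab width for this illustration


def expand(line: str, start_col: int = 0) -> str:
    """Expand tabs to spaces, starting the column count at start_col."""
    out = []
    col = start_col
    for ch in line:
        if ch == "\t":
            pad = TAB_WIDTH - (col % TAB_WIDTH)
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += 1
    return "".join(out)


first = "DISTNAME=\tfoo"
second = "EXTRACT_SUFX=\t.tgz"

# Correct: the '>' belongs to the tag, so both lines start at column 0.
print(expand(first).index("foo"), expand(second).index(".tgz"))

# Buggy: the '>' before DISTNAME is counted as one visible column on the
# first line only, so that line gets one space too few.
print(expand(first, start_col=1).index("foo"), expand(second).index(".tgz"))
```

With these assumptions the first print shows the same column for both
values, while the second shows the first line's value one column to the
left of the second's.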

[2] http://www.w3.org/TR/html401/struct/text.html#edef-PRE

[3] http://htmlpurifier.org/

[4] http://htmlcleaner.sourceforge.net/index.php


