Validating docbook articles...

Chuck Swiger chuck at pkix.net
Sun Feb 22 19:26:02 UTC 2004


Thierry Thomas wrote:

>On Mon 16 Feb 04 at 22:29:46 +0100, Chuck Swiger <chuck at pkix.net>
> wrote:
>
>>...tidy-devel doesn't understand the -preserve option.  Something like the 
>>following, as www/tidy-devel/files/patch-console-tidy.c:
>
>Some days ago, we were speaking of this option (with Alex Dupre). It
>seems useful for documents encoded with charsets unsupported by Tidy.

Hi, Thierry--

Thanks for your response and interest in the change I suggested.  I 
would be happy to spend more time on this issue and "do the right thing" 
rather than just turn that option into a null operation.  However, as 
you've noticed:

>There exist two possibilities:
>
>- we encode all documents in supported charsets (e.g. UTF8), and this
>option is not necessary (we can apply your patch to keep compatibility
>with old scripts);
>
>- we have documents written in such encodings, and tidy-devel should be
>patched to actually preserve entities, or we have to keep the original
>Tidy.
>
Your latter comment suggests that the -preserve functionality in tidy is 
no longer available in tidy-devel, which matches my own experience: I 
looked through the tidy-devel code for a comparable flag to set and 
did not find one.  Maybe we should ask the author, <dsr at w3.org>, or 
<html-tidy at w3.org>...?

I just checked, and the difference that -preserve makes in the old 
version of tidy (vers 4th August 2000) is fairly common; it tends to 
involve things like angle brackets in email addresses.  For example, 
the input source of:

<P CLASS="ADDRESS"><CODE CLASS="EMAIL"><
<A HREF="mailto:chuck at pkix.net">chuck at pkix.net</A>></CODE></P>

...becomes either of (results compared via diff):

-<p class="ADDRESS"><code class="EMAIL"><<a href=
-"mailto:chuck at pkix.net">chuck at pkix.net</a>></code></p>
+<p class="ADDRESS"><code class="EMAIL"><<a href=
+"mailto:chuck at pkix.net">chuck at pkix.net</a>></code></p>

However, which form is used to escape the angle brackets is purely a 
detail of encoding, and I am willing to use tidy-devel without having 
the -preserve capability.
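
As a small illustration of why the escape form is only a detail of 
encoding, here is a minimal sketch in Python.  It has nothing to do with 
tidy's own code; it only uses the standard library's html.unescape to 
show that the named, decimal, and hexadecimal references for the angle 
brackets all decode to the same characters:

```python
import html

# Named, decimal, and hexadecimal references for the same two characters.
forms_lt = ["&lt;", "&#60;", "&#x3C;"]
forms_gt = ["&gt;", "&#62;", "&#x3E;"]

# Decode every form and collect the distinct results.
decoded_lt = {html.unescape(s) for s in forms_lt}
decoded_gt = {html.unescape(s) for s in forms_gt}

print(decoded_lt, decoded_gt)
```

A browser or XML parser should treat the forms the same way, so a diff 
between the two tidy outputs shows a textual change but not a change in 
the parsed document.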

Then again, now that I think about it, using &copy; rather than 
&#xA9; (I think?) is more portable-- the issue of whether 0xA9 actually 
is the copyright symbol in the particular character set being used 
could be a problem.  Isn't there one of UTF8 or ISO-8859-1 in which 
0xA9 is not the copyright symbol?  [ I ran into this issue using the 
W3C HTML validator as well. ]
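
A quick way to check the 0xA9 question is the following Python sketch 
(standard library only; the variable names are just for illustration).  
It separates two things that are easy to conflate: the raw byte 0xA9 in 
a file, whose meaning depends on the file's encoding, and a character 
reference, which names the Unicode code point U+00A9 regardless of how 
the file is encoded:

```python
import html

copyright_sign = "\u00a9"  # U+00A9 COPYRIGHT SIGN

# How the character is stored as bytes in each encoding.
latin1_bytes = copyright_sign.encode("iso-8859-1")  # single byte 0xA9
utf8_bytes = copyright_sign.encode("utf-8")         # two bytes 0xC2 0xA9

# A lone 0xA9 byte is a continuation byte in UTF-8, so it cannot be
# decoded on its own.
try:
    b"\xa9".decode("utf-8")
    lone_byte_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_valid_utf8 = False

# The named and numeric references all name the same code point.
refs_agree = (
    html.unescape("&copy;")
    == html.unescape("&#xA9;")
    == html.unescape("&#169;")
    == copyright_sign
)

print(latin1_bytes, utf8_bytes, lone_byte_valid_utf8, refs_agree)
```

So the raw byte is the copyright symbol in ISO-8859-1 but is not a valid 
character by itself in UTF-8, while the references &copy;, &#xA9; and 
&#169; should be safe in either encoding.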

A broader issue is whether tidy should generate a charset declaration 
(particularly when used with -xml/-asxml), and what it should pick if 
neither the user nor the source document specifies one.  I think it 
would be useful for tidy to do so by default...
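
To make the idea concrete, here is a rough Python sketch of the behaviour 
I have in mind.  It is not tidy code: the function names, the regular 
expressions, and the "utf-8" default are all assumptions chosen for 
illustration.  It looks for an encoding in an XML declaration or a meta 
tag, and only adds a declaration when the document has none:

```python
import re

# Hypothetical patterns: encoding in an XML declaration, or charset in a meta tag.
XML_DECL = re.compile(
    r'^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([^"\']+)["\']', re.I)
META_CHARSET = re.compile(
    r'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.I)


def declared_charset(document):
    """Return the charset the document declares, or None if it declares none."""
    match = XML_DECL.search(document) or META_CHARSET.search(document)
    return match.group(1) if match else None


def with_default_charset(document, default="utf-8"):
    """Prepend an XML declaration only when no charset is declared.

    The default of "utf-8" is an assumption, not tidy's actual behaviour.
    """
    if declared_charset(document) is not None:
        return document
    return '<?xml version="1.0" encoding="%s"?>\n%s' % (default, document)


print(with_default_charset("<html><body>hello</body></html>"))
```

A document that already names its charset passes through unchanged, so 
an existing declaration is never overridden.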

-- 
-Chuck


