filtering HTML tags from email

Simon Barner barner at gmx.de
Wed Feb 23 09:42:47 GMT 2005


Mike Hauber wrote:
> > Mutt saves to a temp file then calls the following command:
> > lynx -localhost -dump %s
> > where '%s' is the temporary file you saved it to.
> >
> > You could also just pipe it to the following:
> > lynx -localhost -dump -stdin
> >
> > the -localhost argument prevents lynx from simply following
> > links external to your machine - helpful to avoid generating
> > hits for unscrupulous spammers that get paid for hits on a URL.
> >
> > Just make sure lynx is installed.
> >
> > Lou
> 
> Okay, so to be sure, there is no filter (as of yet) to simply open 
> an email file, strip the HTML tags, and resave it?  I'm not 
> complaining, as this may actually be something I'm capable of 
> creating myself.  (I'll make this my first python project. :) )
> 
> I'm just making sure I'm not missing anything obvious before I 
> start working on it.  It's irritating to spend time on something 
> only to find out that it's already been done.

You could probably also do it with procmail plus lynx (or w3m) during
the delivery process.
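
If you would rather do the conversion step in Python, here is a minimal
sketch of a filter that could be called at delivery time. It assumes
Python 3.7 or later and reuses the lynx command quoted above. The names
lynx_dump and convert_html_parts are made up for this example, and the
hook into procmail is not shown.

```python
import subprocess
from email import message_from_string, policy


def lynx_dump(html: str) -> str:
    """Render HTML as plain text using the lynx invocation quoted above."""
    result = subprocess.run(
        ["lynx", "-localhost", "-dump", "-stdin"],
        input=html, capture_output=True, text=True, check=True,
    )
    return result.stdout


def convert_html_parts(raw: str, render=lynx_dump) -> str:
    """Return the message with every text/html part replaced by plain text."""
    msg = message_from_string(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            text = render(part.get_content())
            part.set_content(text)  # clears the old body, sets text/plain
    return msg.as_string()

# Used as a filter, the script would end with something like:
#   sys.stdout.write(convert_html_parts(sys.stdin.read()))
```

Note that a multipart/alternative message would end up with two
text/plain parts this way, so treat it as a starting point only.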

Another possibility is to put the following entries in your ~/.mailcap
file; they convert HTML, Word (.doc) and RTF to plain text.

text/html; w3m -dump -T text/html; copiousoutput
application/msword; antiword %s; copiousoutput
application/rtf; rtfreader %s; copiousoutput

As for your Python script: I don't think it is correct to simply strip
everything matching the following expressions, because such text can
appear in non-HTML emails, too: <.*> <\/.*> (Perl syntax).
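
To illustrate the problem, here is a small sketch that applies the first
expression to two lines of ordinary plain text. The sample strings and
the address in them are invented for the example.

```python
import re

# The naive "anything between angle brackets" expression from above.
NAIVE_TAG = re.compile(r"<.*>")

samples = [
    "if a < b and c > d then stop",
    "Reply to Simon <user@example.org> by Friday",
]

# Neither sample contains HTML, yet both lose text.
stripped = [NAIVE_TAG.sub("", s) for s in samples]
for before, after in zip(samples, stripped):
    print(repr(before), "->", repr(after))
```

The comparison in the first line and the address in the second are both
removed, although neither is an HTML tag.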

At the very least, you'd need a list of valid HTML tags, i.e. a regular
grammar for the tags: <b> | </b> | <i> | </i> | ... (BNF notation).
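
A rough sketch of that idea in Python could look like this. The tag list
is deliberately short and incomplete, and the names KNOWN_TAGS and
strip_known_tags are only placeholders.

```python
import re

# Illustrative, incomplete list of tag names; a real filter needs far more.
KNOWN_TAGS = ("a", "b", "body", "br", "div", "font", "html", "i", "p", "span")

# Match an opening or closing tag whose name is in the list, with optional
# attributes. The \b keeps "b" from matching the start of a longer word.
TAG_RE = re.compile(
    r"</?(?:%s)\b[^<>]*>" % "|".join(KNOWN_TAGS),
    re.IGNORECASE,
)


def strip_known_tags(text: str) -> str:
    """Remove only those tags whose name appears in KNOWN_TAGS."""
    return TAG_RE.sub("", text)


print(strip_known_tags("<p>Hello <b>world</b></p>"))
print(strip_known_tags("Reply to Simon <user@example.org>"))
```

This is still a heuristic: an address such as <a@example.org> would be
mistaken for an anchor tag, and comments, entities and scripts are not
handled at all.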

While this is not too hard to implement (and possibly a good project for
learning a new programming language), it would be too much work for
something that can be achieved more easily with existing tools (that is,
for me personally ;-)

Simon


More information about the freebsd-questions mailing list