what's the easiest way to de-html-ize files?

Ian Smith smithi at nimnet.asn.au
Tue May 15 05:34:48 UTC 2007


On Sat, 12 May 2007 14:34:52 -0700 Gary Kline <kline at tao.thought.org> wrote:
 > On Mon, May 14, 2007 at 12:09:07PM -0700, Chuck Swiger wrote:
 > > On May 12, 2007, at 12:54 PM, Gary Kline wrote:
 > > >This is for those of us who appreciate ASCII or straight
 > > >	ISO_8859-15 rather than marked up files.  I have slapped together
 > > >	a crude C program that does scotch (or *cleanse*) text of
 > > >	<B></B> and so on.   Still... is there some standalone converter
 > > >	that gets rid of markup more elegantly?   Something where I
 > > >	can say
 > > >
 > > >	% cmd file_1.html ... file_N.html and output file_1.text ...
 > > >	file_N.text?
 > > 
 > > Perhaps:
 > > 
 > >   lynx -dump file1.html ... > file.text
 > > 
 > > ...?
 > 
 > 	Hm, maybe I need Bill Campbell's -force_html switch.
 > 
 > 	Yes, seems that way.  Using just -dump got most of them, but
 > 	using the -force_html caught all.  Need to script something to
 > 	reformat, but the worst of it's done!
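
For the file_1.html ... file_N.html -> file_1.text ... file_N.text case
asked about originally, a small sh loop around lynx would probably do.
A rough sketch only: the function name 'dehtml' is my own invention, it
assumes lynx is in your PATH, and it writes foo.text next to each
foo.html (a name without a .html suffix just gets .text appended):

```shell
#!/bin/sh
# dehtml: hypothetical helper, not part of lynx or the base system.
# Converts each HTML file given as an argument to plain text via lynx.
dehtml() {
    for f in "$@"; do
        # Strip a trailing ".html" (if present) and add ".text".
        out="${f%.html}.text"
        # -dump writes the rendered page to stdout; -force_html makes lynx
        # treat the input as HTML even if it does not look like it.
        lynx -dump -force_html "$f" > "$out" || echo "dehtml: failed on $f" >&2
    done
}

dehtml "$@"
```

Saved as, say, 'dehtml' and made executable, it would be run as
'dehtml file_1.html ... file_N.html'.  Each file goes through its own
lynx invocation so every input gets its own output file.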

Also, if using Mozilla (so, I would assume, Firefox) the 'Save Page As'
dialog offers a picklist for 'Files of Type' that includes 'Text Files'.

This does a pretty decent job of producing text from HTML files, and is
quicker than firing up lynx (or links) if you're already viewing a page.

Cheers, Ian



More information about the freebsd-questions mailing list