what's the easiest way to de-html-ize files?
Garrett Cooper
youshi10 at u.washington.edu
Wed May 16 01:46:32 UTC 2007
Gary Kline wrote:
> On Tue, May 15, 2007 at 03:34:14PM +1000, Ian Smith wrote:
>> On Sat, 12 May 2007 14:34:52 -0700 Gary Kline <kline at tao.thought.org> wrote:
>> > On Mon, May 14, 2007 at 12:09:07PM -0700, Chuck Swiger wrote:
>> > > On May 12, 2007, at 12:54 PM, Gary Kline wrote:
>> > > > This is for those of us who appreciate ASCII or straight
>> > > > ISO_8859-15 rather than marked up files. I have slapped together
>> > > > a crude C program that does scotch (or *cleanse*) text of
>> > > > <B></B> and so on. Still... is there some standalone converter
>> > > > that gets rid of markup more elegantly? Something where I
>> > > > can say
>> > > >
>> > > > % cmd file_1.html ... file_N.html and output file_1.text ...
>> > > > file_N.text?
>> > >
>> > > Perhaps:
>> > >
>> > > lynx -dump file1.html ... > file.text
>> > >
>> > > ...?
>> >
>> > Hm, maybe I need Bill Campbell's -force_html switch.
>> >
>> > Yes, seems that way. Using just -dump got most of them, but
>> > using the -force_html caught all. Need to script something to
>> > reformat, but the worst of it's done!
>>
>> Also, if using Mozilla (so, I would assume, Firefox) the 'Save Page As'
>> dialog offers a picklist for 'Files of Type' that includes 'Text Files'.
>>
>> This does a pretty decent job of producing text from HTML files, and is
>> quicker than firing up lynx (or links) if you're already viewing a page.
>
>
> Oh sure; I've been saving html in text, ascii/8859-1 for years.
> But what I've got, and there are more saved **somewhere**, are
> files that are saved by default in markup. I have a slew of
> these on different boxen and have been moving them to one place.
> Problem is: how to de-html the bunch.
>
> I'm too lazy to write something that would automate what can be
> automated--markup like "&foo;" is problematic. So probably the
> easiest way would be to create a dehtml.sh script that is just a
> wrapper around lynx.
>
> I don't think I'm the only hacker who wants just-plain-ascii, so
> this might make a good project for somebody who's new to C or
> perl. That's my two pennies' worth!
>
> gary
>
>> Cheers, Ian
>>
>
If you don't want formatting and the number of tags is trivial, the
solution is fairly simple in Perl (less than 150 lines, if even that).
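
For the lynx wrapper idea mentioned above, here is a rough sketch of what
a dehtml.sh might look like. It assumes lynx is installed and uses only
the -dump and -force_html switches already named in this thread. The
function name dehtml and the DEHTML_CMD override are made up for
illustration; they are not part of lynx. Entities like "&foo;" are left
for lynx to deal with.

```sh
#!/bin/sh
# dehtml.sh: hypothetical wrapper that turns file_N.html into file_N.text.
# Assumption: lynx is on PATH. DEHTML_CMD is an invented override so a
# different converter can be swapped in; it is not a lynx feature.

dehtml() {
    status=0
    for src in "$@"; do
        if [ ! -r "$src" ]; then
            echo "dehtml: cannot read $src" >&2
            status=1
            continue
        fi
        # Strip a trailing .html or .htm, then append .text.
        base=${src%.html}
        base=${base%.htm}
        out=$base.text
        # The command string is left unquoted on purpose so that it is
        # split into the program name and its switches.
        ${DEHTML_CMD:-lynx -dump -force_html} "$src" > "$out" || status=1
    done
    return $status
}

dehtml "$@"
```

It would be run along the lines of

% sh dehtml.sh file_1.html ... file_N.html

and should leave file_1.text ... file_N.text next to the originals. It
converts one file per lynx invocation, which keeps each output separate
instead of lumping everything into a single file.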
-Garrett