How to divide up?

Giorgos Keramidas keramida at ceid.upatras.gr
Sun Jul 20 00:44:22 UTC 2008


On Sat, 19 Jul 2008 17:23:48 -0700, Gary Kline <kline at thought.org> wrote:
> Guys,
> Is there an easy way of splitting up these tags into one-per-line?
>
> I'm not obsessive [[?, :)]], but for what I've got in mind, the tags
> and stuff would look better to my eyes.  ....the outcome of this will
> go into a special database, not HTML.
>
> is there some clever perl one-liner ...

I don't know about 'easy', because this looks pretty much like 'free
form HTML'.  Parsing liberally formatted HTML code from untrusted
sources is a lot like trying to reinvent Firefox's HTML parsing engine
or something similar.  That's bound to sit at the 'insanely difficult'
end of the scale, not at the 'easy to hack with sed and a bit of awk or
some Perl' end.

If you have some sort of guarantee about the well-formedness of the HTML
source though (i.e. it passes some sort of validation suite), then you
can probably use tidy(1) to convert it to XML and then use xsltproc to
convert the XML source to pretty much anything imaginable.

Now, if you want to merely "hack something quick and dirty", a short
Perl script can probably do regexp substitution similar to

        #
        # WARNING: THIS HAS NOT BEEN TESTED :P
        #
        my $foo = do { local $/; <STDIN> };       # slurp all input, not one line
        $foo =~ s:(<[^>]+>[^<]*</[^>]+>):$1\n:g;  # newline after <tag>text</tag>
        print $foo;

but you shouldn't trust the output of such a quick hack too much.
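
One reason not to trust it: a substitution of that shape only matches an
opening tag, some text, and a closing tag with nothing nested in between,
so something like <p><b>bold</b> text</p> is only partly split.  A slightly
different quick hack, sketched below, puts every tag and every run of text
on its own line instead of looking for pairs.  The name split_tags is made
up for this example, and the sketch is still regexp-based: it assumes no
'>' ever appears inside an attribute value, a comment or a script block.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# split_tags() is a hypothetical helper, not part of any module.
# It returns its input with each tag and each run of text on its own line.
sub split_tags {
    my ($html) = @_;
    my @pieces;
    # The capturing group makes split() return the tags themselves
    # as well as the text found between them.
    for my $piece (split /(<[^>]*>)/, $html) {
        $piece =~ s/^\s+|\s+$//g;               # trim surrounding whitespace
        push @pieces, $piece if length $piece;  # skip empty fragments
    }
    return join("\n", @pieces) . "\n";
}

print split_tags('<p><b>bold</b> text</p>');
```

For that sample input the six pieces <p>, <b>, bold, </b>, text and </p>
each end up on a separate line.  To run it over a file, slurp stdin into a
string and pass that to split_tags.  For anything that is not known to be
tidy HTML, the tidy(1) and xsltproc route is still the safer one.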


