PDF to HTML translations

Mon Sep 6 18:51:19 UTC 2010

On Sun, Sep 05, 2010 at 10:31:54AM +0200, Erik Trulsson wrote:
> On Sun, Sep 05, 2010 at 08:57:11AM +0200, Roland Smith wrote:
> > On Sat, Sep 04, 2010 at 05:09:20PM -0600, Chad Perrin wrote:
> > > What PDF to HTML translators, other than pdftohtml, am I likely to be
> > > able to find in ports?  I went looking for pdf2html, expecting to find
> > > that there, but no luck.  Before I spend hours sifting through, still
> > > without knowing whether I missed something that should be obvious, 
> > 
> > Yes, you did. :-)

Apparently not.  See below.

> > 
> > > I
> > > figured I'd ask here whether anyone knows of something off the top of
> > > his/her head.
> > 
> > Try textproc/pdftohtml 
> 
> Uhm, he said "other than pdftohtml" so I suspect he already knew about
> that one.

This is indeed the case.

I appreciate the suggestions I've received, though in retrospect I see
that I wasn't specific enough, since none of the answers so far fits
what I need.

I have "inherited" a Perl script that wraps pdftohtml.  The wrapper
exists because pdftohtml's output needs a substantial amount of cleanup
before the HTML is suitable for our final needs.  That output is far
enough from "perfect" that I would like to test a few other possible
"back ends" for the script, to see whether a significant amount of the
work the script does can be eliminated.

Toward that end, the simpler the tool the better -- and the tool on the
"back end" should not be something that must be contacted across a
network, or that cannot be redistributed freely.  I wanted to start with
things I have in the base system on my FreeBSD laptop (where I'm doing my
development) or through ports.  OpenOffice.org is quite a bit larger and
more unwieldy than we would really want to deal with at this point.
Using Google or Adobe tools online is ruled out, since they require
network access to work.

I've started looking at the Xpdf tools as well as pdftohtml.  Other
suggestions from within ports would be appreciated.  Options from outside
ports might also be useful, provided they meet the needs I sketched out
above.  The script itself is Perl, in case that matters.
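
As a rough illustration of the back-end comparison described above, here
is a minimal sketch.  It is written in Python purely for illustration (the
actual wrapper is Perl), and everything in it is an assumption rather than
part of the real script: the names BACKENDS, run_backend and
compare_backends are hypothetical, and the command-line flags shown for
pdftohtml and for pdftotext (from the Xpdf tools) should be checked
against the man pages of whatever versions are installed.  Each candidate
is run locally on the same PDF, and tools that are missing or fail are
reported as None.

```python
import shutil
import subprocess

# Candidate back ends as argv templates; "{pdf}" is replaced by the input path.
# The flags are assumptions: verify them against each tool's man page.
BACKENDS = {
    "pdftohtml": ["pdftohtml", "-stdout", "-noframes", "{pdf}"],
    "pdftotext": ["pdftotext", "-layout", "{pdf}", "-"],
}


def run_backend(argv_template, pdf_path):
    """Run one back end on pdf_path; return its stdout, or None on failure."""
    argv = [arg.replace("{pdf}", pdf_path) for arg in argv_template]
    if shutil.which(argv[0]) is None:
        return None  # the tool is not installed on this machine
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        return None  # the tool ran but reported an error
    return result.stdout


def compare_backends(pdf_path, backends=BACKENDS):
    """Return {name: output length in characters, or None} per back end."""
    report = {}
    for name, template in backends.items():
        output = run_backend(template, pdf_path)
        report[name] = None if output is None else len(output)
    return report
```

Calling compare_backends("sample.pdf") (the file name is a placeholder)
gives a quick first look at which tools are present and produce output;
judging how much cleanup each one would still need means inspecting the
text that run_backend returns.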

To everyone who has replied so far: thank you for your time.

-- 
Chad Perrin [ original content licensed OWL: http://owl.apotheon.org ]