any way to turn a pdf file into something OCR-able?

Roland Smith rsmith at xs4all.nl
Mon Dec 1 17:07:33 PST 2008


On Mon, Dec 01, 2008 at 03:14:43PM -0800, Gary Kline wrote:
> 	pdftotext fail on the large [32MB] file I've got.  Is there any
> 	other way I can translate this huge textfile to ascii or html or
> 	text?

Please define "fail" in this context. I've used pdftotext on documents
exceeding 40MB. However, there are of course things that don't work:

1) Some PDFs are just wrappers around JPEG images. In this case there is
no text for pdftotext to convert => epic fail.

2) If the text contains ligatures etc. you should use an output
encoding that contains such characters (e.g. '-enc UTF-8') or you will
lose them.

3) Things like equations will not render well, if at all. This also
depends on the encoding.
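
To tell case 1 apart from a real failure, something like the following
might help. It is only a rough sketch: it assumes pdftotext is on your
PATH, the helper names are made up for illustration, and "no visible
characters in the output" is just a heuristic for an image-only PDF.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, encoding="UTF-8"):
    """Build the pdftotext argument list with an explicit output encoding."""
    # '-enc UTF-8' keeps ligatures and other non-ASCII characters (point 2).
    return ["pdftotext", "-enc", encoding, pdf_path, txt_path]


def looks_image_only(text):
    """Return True if the extracted text has no visible characters."""
    # pdftotext separates pages with form feeds; strip() treats them
    # as whitespace, so a file of only form feeds counts as empty.
    return not text.strip()


def extract(pdf_path, txt_path):
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8", errors="replace") as fh:
        return fh.read()
```

If looks_image_only(extract("big.pdf", "big.txt")) comes back True (the
file names are placeholders), the PDF most likely has no text layer and
pdftotext has nothing to convert.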

Roland
-- 
R.F.Smith                                   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914  B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)