any way to turn a pdf file into something OCR-able?
rsmith at xs4all.nl
Tue Dec 2 10:56:23 PST 2008
On Mon, Dec 01, 2008 at 08:23:09PM -0500, Robert Huff wrote:
> Roland Smith writes:
> > > pdftotext fails on the large [32MB] file I've got. Is there any
> > > other way I can translate this huge file to ASCII or HTML or
> > > text?
> > Please define "fail" in this context? I've used pdftotext on
> > documents exceeding 40MB. However, there are of course things that
> > don't work:
> > 1) Some PDFs are just wrappers around JPEG images. In this case
> > there is no text for pdftotext to convert => epic fail.
> In this case "convert" from the ImageMagick port will get you a
> series of .jpg/.gif/.<whatever>. Read the manual carefully before
> attempting; also note this can be a slow process.
Which still doesn't give plain text. But in this case one would need an
OCR program.
There is a new one available in ports called cuneiform. It is supposed
to be quite good, but I haven't had the need to try it yet.
I've tried gocr and tesseract in the past but was not really impressed
with them. For short documents it's easier to do the OCR with the Mk I
eyeball & brain. :-) You'll have to completely check an OCR-ed document
for errors anyway.
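
For illustration, the route discussed in this thread (try pdftotext first, and only if it yields nothing, render the pages with ImageMagick's convert and run an OCR program over them) could be sketched roughly as below. This is a sketch under assumptions, not a tested recipe: the function names, the 20-character threshold, the 300 dpi density and the page-%04d.png naming are all made up for the example, and tesseract is used only because its basic "tesseract image outputbase" invocation is simple; another OCR program could be substituted in ocr_command().

```python
import subprocess
from pathlib import Path
from typing import List


def has_text(extracted: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption): an image-only PDF yields almost no characters."""
    # split() drops all whitespace, including the form feeds pdftotext emits per page
    return len("".join(extracted.split())) >= min_chars


def render_command(pdf: str, workdir: str, dpi: int = 300) -> List[str]:
    """Build the ImageMagick call that writes one PNG per page."""
    return ["convert", "-density", str(dpi), pdf,
            str(Path(workdir) / "page-%04d.png")]


def ocr_command(image: str) -> List[str]:
    """Build the tesseract call; it writes <outputbase>.txt next to the image."""
    return ["tesseract", image, str(Path(image).with_suffix(""))]


def pdf_to_text(pdf: str, workdir: str) -> str:
    # "-" as the output file makes pdftotext write to stdout
    direct = subprocess.run(["pdftotext", pdf, "-"],
                            capture_output=True, text=True, check=True).stdout
    if has_text(direct):
        return direct  # the PDF carried real text, no OCR needed

    # Fallback: render pages to images, then OCR each one (this can be slow)
    Path(workdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(render_command(pdf, workdir), check=True)
    pages = []
    for image in sorted(Path(workdir).glob("page-*.png")):
        subprocess.run(ocr_command(str(image)), check=True)
        pages.append(image.with_suffix(".txt").read_text())
    return "\n".join(pages)
```

Calling pdf_to_text("big.pdf", "ocr-work") would need pdftotext, convert and tesseract installed and in PATH. The OCR output still has to be proofread by hand.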
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914 B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)