any way to turn a pdf file into something OCR-able?

Tue Dec 2 10:56:23 PST 2008

On Mon, Dec 01, 2008 at 08:23:09PM -0500, Robert Huff wrote:
> 
> Roland Smith writes:
> 
> >  > 	pdftotext fails on the large [32MB] file I've got.  Is there any
> >  > 	other way I can translate this huge file to ascii or html or
> >  > 	text?
> >  
> 
> >  Please define "fail" in this context. I've used pdftotext on
> >  documents exceeding 40MB. However, there are of course things that
> >  don't work:
> >  
> >  1) Some PDFs are just wrappers around JPEG images. In this case
> >  there is no text for pdftotext to convert => epic fail.
> 
> 	In this case "convert" from the ImageMagick port will get you a
> series of .jpg/.gif/.<whatever>.  Read the manual carefully before
> attempting; also note this can be a slow process.

That still doesn't give you plain text, though. To get text out of
page images one needs an OCR app.

There is a new one available in ports called cuneiform. It is supposed
to be quite good, but I haven't had the need to try it yet.

I've tried gocr and tesseract in the past but was not really impressed
with them. For short documents it's easier to do the OCR with the Mk I
eyeball & brain. :-) You'll have to completely check an OCR-ed document
for errors anyway.
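
For what it's worth, the whole chain (try pdftotext first, fall back to
page images plus OCR when no text comes out) could be scripted roughly
like this. It is only a sketch and I haven't run it against a real
32MB file. The helper names, the 20-character threshold and the 300 dpi
setting are my own assumptions, and tesseract is used just as an example
OCR engine; it must be a build that reads PNG input.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def looks_empty(text: str, min_chars: int = 20) -> bool:
    """True if the text has fewer than min_chars non-whitespace characters.

    The threshold is an arbitrary assumption used to spot image-only PDFs.
    """
    return sum(1 for ch in text if not ch.isspace()) < min_chars


def rasterize_cmd(pdf: Path, outdir: Path, dpi: int = 300) -> list[str]:
    # ImageMagick "convert": one PNG per page (page-00000.png, ...).
    return ["convert", "-density", str(dpi), str(pdf),
            str(outdir / "page-%05d.png")]


def ocr_cmd(image: Path) -> list[str]:
    # tesseract writes its result to <output base>.txt
    return ["tesseract", str(image), str(image.with_suffix(""))]


def pdf_to_text(pdf: Path, workdir: Path) -> str:
    for tool in ("pdftotext", "convert", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found in PATH")

    # "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(["pdftotext", str(pdf), "-"],
                            capture_output=True, text=True, check=True)
    if not looks_empty(result.stdout):
        return result.stdout

    # No usable text layer: render the pages and OCR them one by one.
    # This can be slow for large documents.
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, workdir), check=True)
    pages = []
    for image in sorted(workdir.glob("page-*.png")):
        subprocess.run(ocr_cmd(image), check=True)
        pages.append(image.with_suffix(".txt").read_text())
    return "\n".join(pages)
```

Calling pdf_to_text(Path("big.pdf"), Path("ocr-work")) would return the
text as one string; the file and directory names here are placeholders.
The result still needs proofreading by hand.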

Roland
-- 
R.F.Smith                                   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914  B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)