any way to turn a pdf file into something OCR-able?

Gary Kline kline at
Tue Dec 2 16:07:46 PST 2008

On Tue, Dec 02, 2008 at 02:22:27PM -0500, Chris Shenton wrote:
> Gary Kline <kline at> writes:
> > 	pdftotext fails on the large [32MB] file I've got.  Is there any other way I
> > 	can translate this huge file to ascii or html or text?
> I wrote some code using the Python PDF library 'pypdf' to split a multipage
> PDF scan into individual pages, then used the tesseract OCR to convert
> them to text.  Not 100% accurate, of course, and it really got confused by
> pages that were not right-side-up, but not a bad start for pages that are
> really scans -- images -- rather than PDF representation of text.
> Sadly, I haven't gotten it into a suitable state to release. 
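
For anyone who wants to try the split-then-OCR approach described above,
here is a rough sketch. It is not Chris's code. It assumes the current
'pypdf' package (PdfReader/PdfWriter), plus the pdftoppm and tesseract
command-line tools on the PATH. The pdftoppm step is my own addition:
tesseract reads images, not PDFs, so each page gets rendered to a PNG
first. The function names, the page-NNNN file naming and the 300 dpi
default are made up for illustration.

```python
import subprocess
from pathlib import Path


def page_pdf_path(out_dir, index):
    """Path of the single-page PDF for a zero-based page index (naming is hypothetical)."""
    return Path(out_dir) / f"page-{index + 1:04d}.pdf"


def rasterize_cmd(page_pdf, dpi=300):
    """pdftoppm command that renders a one-page PDF to <same name>.png."""
    base = Path(page_pdf).with_suffix("")
    return ["pdftoppm", "-r", str(dpi), "-png", "-singlefile",
            str(page_pdf), str(base)]


def ocr_cmd(png, lang="eng"):
    """tesseract command that prints the recognized text to stdout."""
    return ["tesseract", str(png), "stdout", "-l", lang]


def split_pdf(src, out_dir):
    """Write each page of src as its own PDF and return the paths in page order."""
    from pypdf import PdfReader, PdfWriter  # third-party, imported only when needed

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, page in enumerate(PdfReader(src).pages):
        writer = PdfWriter()
        writer.add_page(page)
        target = page_pdf_path(out_dir, i)
        with open(target, "wb") as fh:
            writer.write(fh)
        paths.append(target)
    return paths


def ocr_pdf(src, out_dir):
    """Return a list with the OCR text of every page, in page order."""
    texts = []
    for page_pdf in split_pdf(src, out_dir):
        # Render the page to PNG, then hand the image to tesseract.
        subprocess.run(rasterize_cmd(page_pdf), check=True)
        result = subprocess.run(ocr_cmd(page_pdf.with_suffix(".png")),
                                check=True, capture_output=True, text=True)
        texts.append(result.stdout)
    return texts
```

Something like ocr_pdf("scan.pdf", "pages") would then give one string per
page ("scan.pdf" and "pages" are placeholder names). Working a page at a
time keeps memory use down on a big scan, and a page that fails can be
redone on its own. This does nothing about upside-down pages; those would
still need to be rotated before OCR.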

	Well, that sounds hopeful for when I scan around 200 pages of pre-1923
	journal articles.  These are in columnar form, IIRC.

	--Be WONDERFUL if there were some kind of hardware to translate old books
	and journals automagically.


 Gary Kline  kline at  Public Service Unix
 Flash: The alpha release of Jottings is available:

More information about the freebsd-questions mailing list