any way to turn a pdf file into something OCR-able?

Tue Dec 2 16:07:46 PST 2008

On Tue, Dec 02, 2008 at 02:22:27PM -0500, Chris Shenton wrote:
> Gary Kline <kline at thought.org> writes:
> 
> > 	pdftotext fails on the large [32MB] file I've got.  Is there any other way I
> > 	can translate this huge textfile to ascii or html or text?
> 
> I wrote some code using Python PDF library 'pypdf' to split a multipage
> PDF scan into individual pages, then used the tesseract OCR to convert
> to text.  Not 100% accurate, of course, and it really got confused by pages that
> were not right-side-up, but not a bad start for pages that are really
> scans -- images -- rather than PDF representation of text. 
> 
> Sadly, I haven't gotten it into a suitable state to release. 
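
The split-then-OCR idea described above might look roughly like the
sketch below.  It is not Chris's code, which was never released.  It
assumes the current pypdf package (PdfReader/PdfWriter) and the pdftoppm
and tesseract command-line tools on the PATH.  pdftoppm is an added step
that the post does not mention, included because tesseract wants an image
rather than a PDF page.  The function names, the page-NNNN file naming
and the 300 dpi default are all made up for illustration.

```python
import subprocess
from pathlib import Path


def split_pdf(pdf_path, out_dir):
    """Write each page of pdf_path to its own one-page PDF; return the paths."""
    # Third-party dependency (assumed installed): pip install pypdf
    from pypdf import PdfReader, PdfWriter

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    reader = PdfReader(str(pdf_path))
    paths = []
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)
        target = out_dir / f"page-{number:04d}.pdf"  # hypothetical naming
        with open(target, "wb") as handle:
            writer.write(handle)
        paths.append(target)
    return paths


def render_cmd(page_pdf, dpi=300):
    """Build the pdftoppm command that renders a one-page PDF to <stem>.png."""
    page_pdf = Path(page_pdf)
    return ["pdftoppm", "-r", str(dpi), "-png", "-singlefile",
            str(page_pdf), str(page_pdf.with_suffix(""))]


def ocr_cmd(page_png, lang="eng"):
    """Build the tesseract command that writes recognized text to <stem>.txt."""
    page_png = Path(page_png)
    return ["tesseract", str(page_png), str(page_png.with_suffix("")),
            "-l", lang]


def ocr_pdf(pdf_path, out_dir):
    """Split, render and OCR every page; return the text joined by form feeds."""
    texts = []
    for page_pdf in split_pdf(pdf_path, out_dir):
        subprocess.run(render_cmd(page_pdf), check=True)
        subprocess.run(ocr_cmd(page_pdf.with_suffix(".png")), check=True)
        texts.append(page_pdf.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n\f\n".join(texts)
```

Calling ocr_pdf("journal.pdf", "pages") (both names hypothetical) would
leave one .pdf, .png and .txt per page in the output directory.  Nothing
here detects or rotates upside-down pages, so those would still confuse
the OCR step.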


	Well, that sounds hopeful for when I scan around 200 pages of pre-1923 journal
	articles.  These are in columnar form, IIRC.

	--It would be WONDERFUL if there were some kind of hardware to translate old books
	and journals automagically.

	gary


-- 
 Gary Kline  kline at thought.org  http://www.thought.org  Public Service Unix
        http://jottings.thought.org   http://transfinite.thought.org
 Flash: The alpha release of Jottings is available: http://jottings.thought.org/index.php