any way to turn a pdf file into something OCR-able?
Gary Kline
kline at thought.org
Tue Dec 2 16:07:46 PST 2008
On Tue, Dec 02, 2008 at 02:22:27PM -0500, Chris Shenton wrote:
> Gary Kline <kline at thought.org> writes:
>
> > pdftotext fails on the large [32MB] file I've got. Is there any other way I
> > can translate this huge PDF file to ASCII or HTML or text?
>
> I wrote some code using Python PDF library 'pypdf' to split a multipage
> PDF scan into individual pages, then used the tesseract OCR to convert
> to text. Not 100% of course, and it really got confused by pages that
> were not right-side-up, but not a bad start for pages that are really
> scans -- images -- rather than PDF representation of text.
>
> Sadly, I haven't gotten it into a suitable state to release.
Well, that sounds hopeful for when I scan around 200 pages of pre-1923 journal
articles. These are in columnar form, IIRC.
It would be WONDERFUL if there were some kind of hardware to translate old books
and journals automagically.
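
For the archives, here is a rough sketch of what the split-then-OCR approach
described above might look like. It is not Chris's unreleased code, just a guess
at the shape of it. Assumptions: the current pypdf API (PdfReader, PdfWriter),
the tesseract command-line tool on the PATH, and poppler's pdftoppm to turn each
page into a PNG, since tesseract reads images rather than PDFs. The pdftoppm
step, the function names, and the page-NNNN file naming are all made up for
illustration.

```python
import subprocess
from pathlib import Path


def page_stem(index: int) -> str:
    """Zero-padded base name for a 0-based page index (hypothetical naming)."""
    return f"page-{index + 1:04d}"


def rasterize_cmd(page_pdf: Path, stem: Path, dpi: int = 300) -> list[str]:
    # Assumption: poppler's pdftoppm renders the single page to <stem>.png.
    return ["pdftoppm", "-r", str(dpi), "-png", "-singlefile",
            str(page_pdf), str(stem)]


def ocr_cmd(image: Path, stem: Path) -> list[str]:
    # tesseract <image> <outputbase> writes the recognized text to <stem>.txt.
    return ["tesseract", str(image), str(stem)]


def split_pdf(source: Path, out_dir: Path) -> list[Path]:
    """Write each page of source as its own one-page PDF in out_dir."""
    # Third-party dependency, imported here so the helpers above work without it.
    from pypdf import PdfReader, PdfWriter

    out_dir.mkdir(parents=True, exist_ok=True)
    reader = PdfReader(str(source))
    pages = []
    for i, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)
        target = out_dir / f"{page_stem(i)}.pdf"
        with open(target, "wb") as fh:
            writer.write(fh)
        pages.append(target)
    return pages


def ocr_pdf(source: Path, out_dir: Path) -> list[Path]:
    """Split, rasterize, and OCR every page; return the .txt paths."""
    texts = []
    for page_pdf in split_pdf(source, out_dir):
        stem = page_pdf.with_suffix("")
        subprocess.run(rasterize_cmd(page_pdf, stem), check=True)
        subprocess.run(ocr_cmd(stem.with_suffix(".png"), stem), check=True)
        texts.append(stem.with_suffix(".txt"))
    return texts
```

Calling ocr_pdf(Path("scan.pdf"), Path("out")) would leave one .txt file per
page in out/. Working a page at a time keeps a 32MB scan from being handled in
one gulp, and a page that fails can be rerun by itself. Upside-down pages
and multi-column layouts would still need extra handling.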
gary
--
Gary Kline kline at thought.org http://www.thought.org Public Service Unix
http://jottings.thought.org http://transfinite.thought.org
Flash: The alpha release of Jottings is available: http://jottings.thought.org/index.php
More information about the freebsd-questions mailing list