editing pdf files
Gary Kline
kline at thought.org
Sat Oct 13 20:07:04 UTC 2012
On Sat, Oct 13, 2012 at 04:40:23AM +0200, C. P. Ghost wrote:
> On Sat, Oct 13, 2012 at 1:46 AM, Gary Kline <kline at thought.org> wrote:
> > On Fri, Oct 12, 2012 at 10:40:29PM +0400, Boris Samorodov wrote:
> >> 10.10.2012 02:35, Gary Aitken wrote:
> >>
> >> > Can someone give me advice on editing pdf files?
> >>
> >> Take a look at graphics/inkscape.
> >>
> >> --
> >> WBR, Boris Samorodov (bsam)
> >> FreeBSD Committer, http://www.FreeBSD.org The Power To Serve
> >
> >
> > I've got a question that fits in here, hopefully.
> >
> > last week I found a book from 1901 that google had scanned and listed
> > as a pdf file. it was text plus photos of the rich/famous of the
> > 1800s. somehow, google found the exact string that matched my great
> > grandfather [from the civil war]. I d'loaded the file (maybe 2mbytes)
> > and searched using acroread. nada. I used the pdftotext utility.
> > same: nothing but some 600 page numbers.
> >
> > my guess is that google just took photos of the book and used other
> > tools to create a pdf file. I am not =that= serious about genealogy,
> > but I would like to know if there are any tools to edit this kind of
> > pdf file.
>
> I suspect the following: they scanned the book and put all the images
> into the PDF. The PDF itself is merely a container for scanned pages;
> it thus contains no text (save for the page numbers).
>
> That Google was able to search in this file is probably due to them running
> some OCR program on the image files, and then indexing the (approximate)
> text that the OCR program generated. Probably they used something like
> tesseract-ocr from ports graphics/tesseract:
> http://code.google.com/p/tesseract-ocr/
>
in more recent google stuff--text--sci-tech zines or whatever--it
seems like they have used some very high-end ocr programs and
=then= turned the file into pdf. I have been able to get very
good textfiles from a small sample of google's work.
a few years ago I tried the ocr ports we have. very poor results.
it may be time to see if the newer versions give me better results.
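
A rough sketch of the approach discussed above: first check whether
pdftotext finds anything besides page numbers, and if not, render the
pages to images and run them through tesseract. This is only an
illustration under some assumptions, not a tested recipe: it assumes
pdftotext and pdftoppm (from poppler/xpdf) and a tesseract that accepts
"stdout" as its output base are in PATH; the function names and the
300 dpi default are made up for the example.

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path


def embedded_text(pdf_path: str) -> str:
    """Return whatever text layer pdftotext can find ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def looks_image_only(text: str) -> bool:
    """Heuristic: True if the text is empty or consists only of page numbers."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return all(re.fullmatch(r"\d+", ln) for ln in lines)


def ocr_pdf(pdf_path: str, lang: str = "eng", dpi: int = 300) -> str:
    """Render each page to PNG with pdftoppm, then OCR it with tesseract."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found in PATH")
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        # Writes page-01.png, page-02.png, ... (zero-padded page numbers).
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix],
            check=True,
        )
        for png in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                ["tesseract", str(png), "stdout", "-l", lang],
                check=True, capture_output=True, text=True,
            )
            pages.append(result.stdout)
    # Separate pages with form feeds, the same convention pdftotext uses.
    return "\f".join(pages)


# Possible usage (book.pdf is a placeholder file name):
#   text = embedded_text("book.pdf")
#   if looks_image_only(text):
#       text = ocr_pdf("book.pdf")
```

The result is a plain text file you can grep; it does not put the text
back into the pdf, and the quality depends entirely on the scans and
on the OCR engine.
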
gary
ps: tesseract was one I tried [circa '10] ... time to look at the
actual Code!
>
> -cpghost.
>
> --
> Cordula's Web. http://www.cordula.ws/
--
Gary Kline kline at thought.org http://www.thought.org Public Service Unix
Twenty-six years of service to the Unix community.