editing pdf files

Sat Oct 13 20:07:04 UTC 2012

On Sat, Oct 13, 2012 at 04:40:23AM +0200, C. P. Ghost wrote:
> On Sat, Oct 13, 2012 at 1:46 AM, Gary Kline <kline at thought.org> wrote:
> > On Fri, Oct 12, 2012 at 10:40:29PM +0400, Boris Samorodov wrote:
> >> 10.10.2012 02:35, Gary Aitken wrote:
> >>
> >> > Can someone give me advice on editing pdf files?
> >>
> >> Take a look at graphics/inkscape.
> >>
> >> --
> >> WBR, Boris Samorodov (bsam)
> >> FreeBSD Committer, http://www.FreeBSD.org The Power To Serve
> >
> >
> >         I've got a question that fits in here, hopefully.
> >
> >         last week I found a book from 1901 that google had scanned and listed
> >         as a pdf file.  it was text plus photos of the rich/famous of the
> >         1800s.  somehow, google found the exact string that matched my great
> >         grandfather [from the civil war].  I downloaded the file (maybe 2 MB)
> >         and searched it using acroread.  nada.  I used the pdftotext utility.
> >         same: nothing but some 600 page numbers.
> >
> >         my guess is that google just took photos of the book and used other
> >         tools to create a pdf file.  I am not =that= serious  about genealogy,
> >         but I would like to know if there are any tools to edit this kind of
> >         pdf file.
> 
> I suspect the following: they scanned the book and put all the images
> into the PDF. The PDF itself is merely a container for scanned pages;
> it thus contains no text (save for the page numbers).
> 
> That Google was able to search in this file is probably due to them running
> some OCR program on the image files, and then indexing the (approximate)
> text that the OCR program generated. Probably they used something like
> tesseract-ocr from ports graphics/tesseract:
>   http://code.google.com/p/tesseract-ocr/
> 

	in more recent google stuff (text, sci-tech zines or whatever) it
	seems like they have used some very high-end ocr programs and
	=then= turned the file into pdf.  I have been able to get very
	good textfiles from a small sample of google's work.

	a few years ago I tried the ocr ports we have.  very poor results.
	it may be time to see if the newer versions give me better results.
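
	for a scanned-images-only pdf like the one described above, a
	retry could look roughly like this: render each page to an image,
	then run tesseract over every image and collect the text.  this is
	only a sketch.  it assumes poppler's pdftoppm and the tesseract
	command are installed and on the PATH; the function names, the
	"page" prefix and the 300 dpi default are my own picks for
	illustration, not anything that ships with the ports.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    """Build the pdftoppm command that renders every page of `pdf` to PNG.

    pdftoppm names its output <prefix>-<page number>.png.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image, lang="eng"):
    """Build the tesseract command for one page image.

    tesseract appends ".txt" to the output base name it is given.
    """
    outbase = Path(image).with_suffix("")
    return ["tesseract", str(image), str(outbase), "-l", lang]


def ocr_pdf(pdf, workdir, lang="eng", dpi=300):
    """Rasterize `pdf` into `workdir`, OCR each page, return the joined text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, workdir / "page", dpi), check=True)

    pages = []
    # Assumption: pdftoppm zero-pads the page numbers, so a plain
    # lexical sort keeps the pages in reading order.
    for image in sorted(workdir.glob("page-*.png")):
        subprocess.run(ocr_cmd(image, lang), check=True)
        pages.append(image.with_suffix(".txt").read_text(errors="replace"))
    return "\n".join(pages)
```

	calling ocr_pdf("book.pdf", "ocr-work") would leave one .png and
	one .txt per page in ocr-work/ and hand back the whole text, which
	can then be grepped for a name.  how good the text is depends
	entirely on the scan and on the tesseract version.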

	gary

	ps: tesseract was one I tried [circa '10] ...  time to look at the
	actual Code!


> 
> -cpghost.
> 
> -- 
> Cordula's Web. http://www.cordula.ws/

-- 
 Gary Kline  kline at thought.org  http://www.thought.org  Public Service Unix
              Twenty-six years of service to the Unix community.