can i split a pdf file?

cpghost cpghost at cordula.ws
Mon Jan 26 14:38:28 PST 2009


On Mon, Jan 26, 2009 at 02:06:23PM -0800, Gary Kline wrote:
> On Mon, Jan 26, 2009 at 09:16:23AM +0100, Polytropon wrote:
> > On Mon, 26 Jan 2009 00:06:18 -0800, Gary Kline <kline at thought.org> wrote:
> > > 	Thanks, Gents,
> > > 
> > > 	But according to one smallish pdf file that I send to a web based
> > > 	tool, it was not a real pdf.  Or, more accurately, it (the pdf to 
> > > 	speech program) couldn't decode it.
> > 
> > This is a typical problem with "poorly engineered" PDFs where the
> > author puts in the text as images (you'll see this stupidity across
> > the Web, too).
> 
> 
> 	So what kind of moron is going to photograph pages --or maybe just
> 	get-screenshot-of-this-page" and upload it?

It happens quite frequently nowadays. Those PDFs are usually scanned,
and the scanner software (usually on Windows) assembles all the scanned
pages into a PDF of images. That's what you find on the Net.

This is not such a bad idea, esp. when it comes to technical textbooks,
which usually contain a lot of diagrams, formulae, tables etc., since
OCR software that could turn all of this back into LaTeX and
EPS figures has yet to be written (that's a difficult task).

>   Or a Real question:
> 	I read an online pdf of "The Art of War" from the 1880's [?], and
> 	it was in an old-English or olden-Deutsch type font.  In PDF.  i
> 	have other p.d. texts in pdf and am wondering in there is some
> 	sort of scanner than can take a book-length script and create a
> 	pdf file.  Anybody know?  

It all depends on how the PDF was created. Some PDFs embed the fonts
in a special section, and then use text (sometimes compressed
or encrypted) that refers to those fonts. In such a case, you
could extract the pure text from the PDF.
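
To tell which kind of PDF you have, a rough sketch like the following
might help. It assumes pdftotext (from the xpdf port) is installed; the
function name pdf_has_text is made up for illustration and is not part
of xpdf. It only checks whether anything other than whitespace and ^L
page breaks comes out.

```shell
# Sketch only: pdf_has_text is a hypothetical helper, not an xpdf tool.
# Succeeds (exit status 0) if the PDF yields any non-whitespace text.
pdf_has_text() {
    # "-" as the output file makes pdftotext write to stdout.
    # [:space:] also covers the form feeds (^L) emitted between pages,
    # so an image-only PDF produces no matching line.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

Usage would be something like:
	% pdf_has_text bla.pdf && echo "has text" || echo "probably just images"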

Other PDFs simply encode the book as a set of bitmaps (see above),
and then your only chance is to find an OCR program that can not
only recognize the characters in the bitmaps, but also
cope with Fraktur or other exotic fonts. Some OCR programs
are interactive and trainable, so that you can say: this is an 'S',
and that is a 'T'..., but AFAIK, there's no free and open source
OCR program with this capability (yet).

> > A good tool to check if the PDF file can be (audibly) read is the
> > use of the tool pdftotext from the port xpdf.
> > 
> > 	% pdftotext bla.pdf && less bla.txt
> > 
> > Then, even the FF speech plugin should work correctly - as long as
> > the PDF file contains decodable text. If it's just a bunch of images,
> > well, what are we expecting, hm? FF-speech: "You see a pretty image of
> > some text..." :-)
> 
> 	Yeah, that's about right!  I got a bunch of ^L bytes and nothing
> 	else.  Now I'm looking at the file with od -c and, yup, it's and
> 	image. The parts inbetween pages are in ASCII.  Do you know what
> 	"MediaBox" is?

So it's a set of images. There's not much you could do about it.

Oh, you can still try to extract the images from the PDF by using the
program 'pdfimages' (part of the graphics/xpdf port), and look at them
individually with an image editor (GIMP etc.). Then run an OCR
program on those images. Try graphics/gocr for example. But it would
still be tedious, to say the least.
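
The extract-then-OCR steps could be strung together roughly like this.
It is a sketch, assuming pdfimages and gocr are both installed; the
function name ocr_pdf, the "page" file prefix and the default output
directory are arbitrary choices for illustration. Expect the
recognition quality to vary a lot with the scans.

```shell
# Sketch only: ocr_pdf is a hypothetical helper name.
# Usage: ocr_pdf file.pdf [output-directory]
ocr_pdf() {
    pdf=$1
    outdir=${2:-ocr-out}
    mkdir -p "$outdir" || return 1

    # pdfimages writes one file per embedded image, named
    # <root>-NNN.pbm or <root>-NNN.ppm (PNM formats by default).
    pdfimages "$pdf" "$outdir/page" || return 1

    for img in "$outdir"/page-*.p?m; do
        [ -e "$img" ] || continue   # glob matched nothing: no images
        # gocr: -i names the input image, -o the text output file.
        gocr -i "$img" -o "${img%.*}.txt"
    done
}
```

After that, something like
	% ocr_pdf bla.pdf && cat ocr-out/page-*.txt | less
shows what the OCR made of it.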

> 	At least the web article was not an image!  Google had it both in
> 	PDF and HTML.
> 
> 	gary

-cpghost.

-- 
Cordula's Web. http://www.cordula.ws/


More information about the freebsd-questions mailing list