Convert PDF to Excel

Polytropon freebsd at edvax.de
Sat Jan 23 08:40:45 UTC 2021


On Sat, 23 Jan 2021 10:36:21 +0300, Odhiambo Washington wrote:
> On Sat, 23 Jan 2021 at 07:42, Polytropon <freebsd at edvax.de> wrote:
> 
> > On Fri, 22 Jan 2021 19:45:11 +0300, Odhiambo Washington wrote:
> > > I have a situation where I'd like to convert PDF to XLSX.
> > > The documents are 35MB and 105MB but contain several thousand pages.
> > >
> > > Does anyone know a good tool that can handle this?
> >
> > Depends on what is in the PDFs.
> >
> > If this is rendered text, you can maybe extract the text with
> > the tool pdftotext and convert it to CSV, then import the CSV
> > in "Excel".
> >
> > But if it's images of text, use the tool pdfimages to extract the
> > images, and then an OCR tool (maybe tesseract) to obtain the data.
> >
> > It might be worth checking if LibreOffice can open a PDF file and
> > directly export to (or save as) an "Excel"-compatible file, either
> > CSV or one of the binary formats (XLS, XLSX).
> >
> > Restructuring with some sed / awk / perl might be needed, though.
> > Keep in mind those steps can be automated, so if you have lots of
> > PDF files, write a simple shell wrapper that processes all of them,
> > so you get a bunch of result files without further handholding. :-)
> >
> >
> To make the story short, I need to do some manipulation on the two
> documents in this link:
> 
> https://bit.ly/2KEvCwr
> 
> I thought they were simple PDFs, but now I am not sure what the creators
> did, or how.

They contain text, so the OCR problem is out of the way.
Sadly, the text is re-arranged, so the optimal layout (one
line in a table equals one line of text, with the columns
being separated by whitespace) does not appear. Instead it
is the other way round: one line equals one column.
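
If that means each record comes out as a fixed number of consecutive
lines, one field per line, then paste(1) can fold them back into rows.
A minimal sketch, assuming three fields per record and no blank lines
or page headers in between (both assumptions need checking against the
real pdftotext output; the sample data is made up):

```sh
# Fold every 3 consecutive input lines into one tab-separated row.
# One "-" per column; the count of 3 is an assumption, not taken
# from the actual documents.
fold_rows() {
	paste - - -
}

# Made-up sample: six single-field lines become two rows.
printf '%s\n' 1001 Alice Nairobi 1002 Bob Mombasa | fold_rows
```

Each "-" tells paste to read the next line from standard input, so N
dashes produce rows of N fields.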



> I just need to count how many duplicate records are in these.

Define "duplicate". :-)
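
The answer changes the count. As an illustration only, here are two
possible readings, assuming the data is already in tab-separated rows
and (for the second one) that the first field acts as a record key.
Neither assumption comes from the documents themselves:

```sh
# Reading 1: a row is a duplicate if the whole row repeats an
# earlier row. Counts the extra copies only.
dup_whole_rows() {
	awk 'seen[$0]++ { n++ } END { print n + 0 }'
}

# Reading 2: a row is a duplicate if its first tab-separated field
# repeats an earlier row's. Using field 1 as the key is an assumption.
dup_by_key() {
	awk -F '\t' 'seen[$1]++ { n++ } END { print n + 0 }'
}
```

On rows like "1 A", "1 B", "2 C", "2 C" the first reading finds one
duplicate and the second finds two.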



> Any script guru to assist?? :-)

I'd suggest something like this:

	pdftotext <file> - | ... | paste | ... | sort | uniq -d | wc -l

This will probably almost do what you need, given sufficient
assumptions... :-)
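
To make those assumptions explicit, here is one way the pipeline could
be filled in. It is a sketch, not a tested solution for the linked
files: it assumes three fields per record, one field per line, and that
"duplicate" means the whole row repeats. The function names are made up
for illustration.

```sh
# Count distinct records that occur more than once.
# Reads one field per line on stdin; 3 fields (assumed) form a record.
count_duplicates() {
	tr -d '\f' |                  # drop form feeds (page breaks)
	grep -v '^[[:space:]]*$' |    # drop blank lines
	paste - - - |                 # fold 3 lines into one row
	sort |
	uniq -d |                     # one line per repeated row
	wc -l |
	tr -d ' '                     # some wc versions pad with spaces
}

# Hypothetical wrapper: the "-" makes pdftotext write to stdout
# instead of creating a .txt file next to the PDF.
count_pdf_duplicates() {
	pdftotext "$1" - | count_duplicates
}
```

Note that uniq -d prints each repeated row once, so a row that appears
five times still counts as one. Page headers, footers and wrapped cells
would shift the three-line grouping and need filtering out first.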



-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...

