Convert PDF to Excel

Odhiambo Washington odhiambo at gmail.com
Sat Jan 23 07:37:01 UTC 2021


On Sat, 23 Jan 2021 at 07:42, Polytropon <freebsd at edvax.de> wrote:

> On Fri, 22 Jan 2021 19:45:11 +0300, Odhiambo Washington wrote:
> > I have a situation where I'd like to convert PDF to XLSX.
> > The documents are 35MB and 105MB but contain several thousand pages.
> >
> > Does anyone know a good tool that can handle this?
>
> Depends on what is in the PDFs.
>
> If this is rendered text, you can maybe extract the text with
> the tool pdftotext and convert it to CSV, then import the CSV
> in "Excel".
>
> But if it's images of text, use the tool pdfimages to extract the
> images, and then an OCR tool (maybe tesseract) to obtain the data.
>
> It might be worth checking if LibreOffice can open a PDF file and
> export to (or save as) an "Excel"-compatible file directly, either
> CSV or one of the binary formats (XLS, XLSX).
>
> Restructuring with some sed / awk / perl might be needed, though.
> Keep in mind those steps can be automated, so if you have lots of
> PDF files, write a simple shell wrapper that processes all of them,
> so you get a bunch of result files without further handholding. :-)
>
>
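
The batch idea quoted above could be sketched roughly like this. It is
only a sketch: it assumes pdftotext (from poppler or xpdf) is installed
and that the PDFs contain real text rather than scanned images. The
function names and the directory arguments are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, out_dir):
    """Build the pdftotext call for one PDF (hypothetical helper)."""
    pdf = Path(pdf_path)
    txt = Path(out_dir) / (pdf.stem + ".txt")
    # -layout asks pdftotext to keep the physical column layout,
    # which usually helps when the pages are tables.
    return ["pdftotext", "-layout", str(pdf), str(txt)]


def convert_all(pdf_dir, out_dir):
    """Run pdftotext on every *.pdf in pdf_dir, writing .txt files to out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # check=True stops at the first file pdftotext cannot process
        subprocess.run(pdftotext_command(pdf, out_dir), check=True)
```

Something like convert_all("pdfs", "text") would then leave one text file
per PDF, ready for whatever sed / awk / perl cleanup turns out to be
needed.
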
To make the story short, I need to do some manipulation on the two
documents in this link:

https://bit.ly/2KEvCwr

I thought they were simple PDFs, but now I am not sure how the creators
produced them.
I just need to count how many duplicate records are in these.

Any script guru to assist?? :-)
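
For reference, the counting step itself might look something like the
sketch below. It assumes the PDF has already been turned into plain text
with one record per line (for example via pdftotext -layout), which may
not hold for these particular files. The function name and the choice to
collapse whitespace before comparing are assumptions, not a known-good
recipe for these documents.

```python
import sys
from collections import Counter


def count_duplicates(lines):
    """Return (extra_copies, repeated) for an iterable of text lines.

    extra_copies is the number of lines that repeat an earlier record;
    repeated maps each record seen more than once to its total count.
    """
    # Collapse runs of whitespace so spacing differences from the PDF
    # layout do not hide otherwise identical records.
    records = (" ".join(line.split()) for line in lines)
    counts = Counter(rec for rec in records if rec)  # skip blank lines
    repeated = {rec: n for rec, n in counts.items() if n > 1}
    extra_copies = sum(n - 1 for n in repeated.values())
    return extra_copies, repeated


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_dups.py extracted.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        extra, repeated = count_duplicates(fh)
    print(f"{extra} duplicate lines across {len(repeated)} distinct records")
```

Page headers and footers repeat on every page and would be counted as
duplicates too, so they would need to be filtered out first.
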




-- 
Best regards,
Odhiambo WASHINGTON,
Nairobi,KE
+254 7 3200 0004/+254 7 2274 3223
"Oh, the cruft.", grep ^[^#] :-)


More information about the freebsd-questions mailing list