Have you ever tried to get data out of a PDF and onto a spreadsheet, like Microsoft Excel, only to realize it’s pretty much impossible?
Here at TrendCT, this happens to us all the time. Quite often, when we get data from government agencies, it is in a PDF. And immediately, our hearts drop because we know the task of getting it into a spreadsheet will range from annoying to aggravating.
Why are PDFs evil?
But first, why is it important to get data out of a PDF and into a spreadsheet?
A PDF, or Portable Document Format, is meant to display a document exactly the same way on every computer. This is important for, say, tax forms. In addition, PDFs are meant for presentation, which means editing them isn’t a layman’s job. So a lot of organizations publish words, numbers and charts in PDF format.
In short, PDFs make stuff look good to humans.
But quite often, you’ll see massive tables in PDF format. For example, here’s a table of 2012 election results:
And we want to use a computer to add together all the votes, because humans are much slower at this task. But because PDFs are for humans to read, and not computers, you can’t easily turn this PDF into a spreadsheet.
Computers want spreadsheets, though, so what is there to do? Here’s a walkthrough.
Is it an image?
Sometimes, people turn scanned images of text into a PDF document. Here’s the same document as the one above, except turned into an image. You’ll usually know it’s an image because, if you open the document in your browser or on your desktop, you won’t be able to highlight the text with your cursor. (Click here to download and see for yourself.)
If it’s actually text, and not an image, your life just got significantly easier. A computer can read text a lot more easily than it can read images. But if it’s an image, then a computer needs to do something called “Optical Character Recognition” — or OCR. It needs to actually read the document! Sometimes, it can even read handwriting, although it’s not always accurate.
Keep this distinction in mind as we move forward.
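If you have a stack of PDFs and don’t want to test each one with your cursor, the same check can be roughly automated. Here is a minimal Python sketch, not a definitive test. It assumes the third-party pypdf package is installed; the function names, the 20-character threshold and the sample page text are made up for illustration.

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if any page yielded at least min_chars of non-whitespace text."""
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def extract_page_texts(pdf_path):
    """Return the extractable text of each page in the PDF at pdf_path."""
    # Assumption: the third-party pypdf package is installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# Stand-in page text (made up), so the check runs without a PDF on hand.
scanned_pages = ["", "  \n"]
text_pages = ["Town  Candidate A  Candidate B\nAndover  1,012  987", ""]
print(has_text_layer(scanned_pages))  # False: nothing to highlight, likely an image
print(has_text_layer(text_pages))     # True: real text, no OCR needed
```

With a real file, you would call has_text_layer(extract_page_texts("your-file.pdf")). A scanned document that someone has already run through OCR will also pass this check, so treat a True as "probably text," not a guarantee.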
There are a handful of free tools that allow you to convert a PDF into an Excel spreadsheet. But as with most free things, there’s a give and take.
For example, here’s one site called smallpdf that is quite good at converting PDFs into spreadsheets. But if your PDF is an image, it won’t be able to turn it into a spreadsheet because it doesn’t perform OCR.
Here’s another one called PDF Converter, and here’s one by Nitro. They can actually read your document even if it’s an image, but there’s a limit to how much you can convert because they want you to sign up for their paid service.
Lastly, here’s one called Tabula, which is an app that runs in your browser. Tabula only works on text-based PDFs, not scanned images, and mass conversions are a bit of a pain. If you only have a page or two to convert, this is a great option.
So at the end of the day, if you just need to convert one or two small documents a month — and not on deadline — this array of tools is perfect. That said, because you’re not paying for these services, they may not work when you need them to.
When you need to convert big documents, or convert documents often, paying for a service isn’t a bad idea.
The one I prefer is CometDocs, because I’ve found it to be the most reliable with large documents — and it’s relatively affordable.
All these tools have relatively easy interfaces, so they require very little technical expertise.
If you know of other good options, let us know in the comments.
Check your data
After you’ve converted the document, check your data against the original. In fact, check it thoroughly, especially if it’s converted from an image, because there’s a chance your data is off by a row or there’s some systematic error that you need to fix.
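One quick check, if the PDF prints totals at the bottom of its table, is to add up each converted column and compare. Here is a rough Python sketch using only the standard library. The town names, column names and numbers are made up for illustration, and it assumes you have saved the converted spreadsheet as a CSV.

```python
import csv
import io


def to_int(cell):
    """Parse a count such as '1,234' into an int."""
    return int(cell.replace(",", "").strip())


def column_total_mismatches(csv_text, expected_totals):
    """Return {column: (computed, expected)} for every column whose sum is off."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    mismatches = {}
    for column, expected in expected_totals.items():
        computed = sum(to_int(row[column]) for row in rows)
        if computed != expected:
            mismatches[column] = (computed, expected)
    return mismatches


# Made-up sample: the converter mangled Bethel's count for candidate B.
converted = """town,candidate_a,candidate_b
Andover,"1,012",987
Bethel,"4,530","5,20"
"""
# Totals as printed in the (hypothetical) PDF.
totals_in_pdf = {"candidate_a": 5542, "candidate_b": 6188}
print(column_total_mismatches(converted, totals_in_pdf))
# {'candidate_b': (1507, 6188)}
```

Matching totals don’t prove every cell is right, since two errors can cancel out, so spot-check a few rows by eye as well.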
Clean your data
Inevitably, the spreadsheet won’t look the way you want it to. It’ll have some odd column headers, you’ll get extra noise at the bottom of the document or the spreadsheet will look like a 4-year-old’s room. The downside is that you have to clean this. The upside is that, because it was programmatically converted, there will probably be a pattern you can follow to clean your data.
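As an example of following that pattern, here is a small Python sketch that keeps one header row and the data rows while dropping title lines, repeated headers, blank lines and page footers. The sample data and the clean_rows helper are hypothetical; the rules you need will depend on how your own converted file is broken.

```python
import csv
import io


def clean_rows(csv_text, header_starts_with, n_columns):
    """Keep one header row and the data rows; drop blanks, repeated headers and footers."""
    cleaned = []
    header_seen = False
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # blank line
        if cells[0] == header_starts_with:
            if header_seen:
                continue  # header repeated at a page break
            header_seen = True
            cleaned.append(cells)
            continue
        if not header_seen or len(cells) != n_columns:
            continue  # title lines before the table, footers after it
        # Strip thousands separators so the counts can be summed later.
        cleaned.append([cells[0]] + [c.replace(",", "") for c in cells[1:]])
    return cleaned


# Made-up example of what a converted two-page table might look like.
messy = """Statement of Vote,,
Town,Candidate A,Candidate B
Andover,"1,012",987

Town,Candidate A,Candidate B
Bethel,"4,530","5,201"
Page 2 of 2
"""
for row in clean_rows(messy, header_starts_with="Town", n_columns=3):
    print(row)
```

This prints the header followed by the Andover and Bethel rows, with the counts as plain digits. Because the rules drop anything that doesn’t fit, compare the row count against the PDF afterward to make sure no real data was thrown away.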
Share your data
At TrendCT, we’re trying to get better at sharing our cleaned data, especially if it comes out of a PDF. I assume that, if you’re reading this, you will eventually be publishing your findings somewhere. Be a good data citizen and upload your cleaned dataset so others can save themselves a lot of pain. If you did something to the dataset beyond reproducing the PDF version, just say so.
On the flip side, if you see someone who has already cleaned the data, it doesn’t hurt to ask for the clean data. We get these e-mails all the time, and we’re always willing to share.