Tutorial: How to get data out of PDFs

Have you ever tried to get data out of a PDF and into a spreadsheet program like Microsoft Excel, only to realize it’s pretty much impossible?

Here at TrendCT, this happens to us all the time. Quite often when we get data from government agencies, it is in a PDF. And immediately, our hearts sink because we know the task of getting it into a spreadsheet will range from annoying to aggravating.

Why are PDFs evil?

But first, why is it important to get data out of a PDF and into a spreadsheet?

A PDF, or Portable Document Format, is meant to display a document exactly the same way on every computer. This is important for, say, tax forms. In addition, PDFs are meant for presentation, which means editing them isn’t a layman’s job. So a lot of organizations publish words, numbers and charts in PDF format.

In short, PDFs make stuff look good to humans.

But quite often, you’ll see massive tables in PDF format. For example, here’s a table of 2012 election results:

[Embedded document: 2012 Election]

And we want to use a computer to add together all the votes, because humans are much slower at this task. But because PDFs are for humans to read, and not computers, you can’t easily turn this PDF into a spreadsheet.
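Once the table is in a spreadsheet-friendly form such as CSV, the adding-up part is trivial. Here is a minimal Python sketch, assuming a hypothetical export with `town`, `candidate` and `votes` columns. The sample rows are made up for illustration and are not real 2012 results.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows standing in for a converted PDF table.
SAMPLE_CSV = """town,candidate,votes
Town A,Candidate X,"1,204"
Town A,Candidate Y,980
Town B,Candidate X,312
Town B,Candidate Y,455
"""

def total_votes(csv_text):
    """Add up the votes column for each candidate."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Converted tables often keep thousands separators, so strip them.
        totals[row["candidate"]] += int(row["votes"].replace(",", ""))
    return dict(totals)

print(total_votes(SAMPLE_CSV))
```

The hard part, as the rest of this walkthrough shows, is getting from the PDF to something shaped like `SAMPLE_CSV` in the first place.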

Computers want spreadsheets, though, so what is there to do? Here’s a walkthrough.

Is it an image?

Sometimes, people turn scanned images of text into a PDF document. Here’s the same document as the one above, except turned into an image. You’ll usually know it’s an image because, if you open the document in your browser or on your desktop, you won’t be able to highlight the text with your cursor. (Click here to download and see for yourself.)

[Embedded document: election example, image version]

If it’s actually text, and not an image, your life just got significantly easier. A computer can read text a lot more easily than it can read images. But if it’s an image, then a computer needs to do something called “Optical Character Recognition” — or OCR. It needs to actually read the document! Sometimes, it can even read handwriting, although it’s not always accurate.

Keep this distinction in mind as we move forward.
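If you have a pile of PDFs, the highlight test can be roughly approximated in code. This is only a sketch: it assumes you have already pulled the text of each page into a list of strings with whatever PDF library you use (that step is not shown), and the function name and the 20-character threshold are my own assumptions, not a standard.

```python
def looks_scanned(page_texts, min_chars=20):
    """Guess whether a PDF is made of scanned images.

    page_texts: one string per page, as returned by a PDF text extractor.
    A page with almost no extractable text is treated as an image.
    """
    if not page_texts:
        return True  # nothing extractable at all
    empty_pages = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    # Call it scanned when most pages have no selectable text.
    return empty_pages > len(page_texts) / 2

# Two pages with real text, then two pages with nothing to highlight.
print(looks_scanned(["Town A   Candidate X   1,204   Candidate Y   980",
                     "Town B   Candidate X   312   Candidate Y   455"]))
print(looks_scanned(["", ""]))
```

A document flagged this way probably needs a tool that does OCR. Treat the result as a hint, since a page holding only a chart or a short caption would also trip the threshold.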

Free tools

There are a handful of free tools that allow you to convert a PDF into an Excel spreadsheet. But as with most free things, there’s some give and take.

For example, here’s one site called smallpdf that is quite good at converting PDFs into spreadsheets. But if your PDF is an image, it won’t be able to turn it into a spreadsheet, because it doesn’t do OCR.

Here’s another one called PDF Converter, and here’s one by Nitro. They can actually read your document even if it’s an image, but there’s a limit to how much you can convert because they want you to sign up for their paid service.

Lastly, here’s one called Tabula, which is an app that runs in your browser. Unlike the two above, Tabula works only on text-based PDFs, not scanned images, and mass conversions are a bit of a pain. If you only have a page or two to convert, this is a great option.

So at the end of the day, if you just need to convert one or two small documents a month — and not on deadline — this array of tools is perfect. That said, because you’re not paying for these services, they may not work when you need them to.

Paid tools

When you need to convert big documents, or convert documents often, paying for a service isn’t a bad idea.

As we mentioned above, there is PDF Converter and Nitro, which both offer a premium service. In my experience, they are mostly reliable, although they have failed at crucial times once or twice.

The one I prefer is CometDocs, because I’ve found it to be the most reliable with large documents — and it’s relatively affordable.

All these tools have relatively easy interfaces, so they require very little technical expertise.

If you know of other good options, let us know in the comments.

Check your data

After you’ve converted the document, check your data with the original. In fact, check it thoroughly — especially if it’s converted from an image — because there’s a chance your data is off by a row or there’s some systematic error that you need to fix.
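One cheap check, when the original table prints its own totals, is to add up the converted rows and compare. A minimal Python sketch, assuming the rows are already parsed into numbers. The column names and figures here are made up for illustration.

```python
def check_against_totals(rows, printed_totals):
    """Compare column sums of converted rows with totals printed in the PDF.

    rows: list of dicts mapping column name to an integer.
    printed_totals: dict of column name to the total shown in the original.
    Returns a list of human-readable problems (empty if everything matches).
    """
    problems = []
    for column, expected in printed_totals.items():
        actual = sum(row[column] for row in rows)
        if actual != expected:
            problems.append(
                f"{column}: converted rows sum to {actual}, PDF says {expected}"
            )
    return problems

rows = [
    {"candidate_x": 1204, "candidate_y": 980},
    {"candidate_x": 312, "candidate_y": 455},
]
print(check_against_totals(rows, {"candidate_x": 1516, "candidate_y": 1435}))
print(check_against_totals(rows, {"candidate_x": 1516, "candidate_y": 1485}))
```

A mismatch doesn’t tell you which row is wrong, only that something is, so it complements reading the converted table against the original rather than replacing it.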

Clean your data

Inevitably, the spreadsheet won’t look the way you want it to. It’ll have some odd column headers, you’ll get extra noise at the bottom of the document or the spreadsheet will look like a 4-year-old’s room. The downside is that you have to clean this. The upside is that, because it was programmatically converted, there will probably be a pattern you can follow to clean your data.
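Those patterns are usually simple enough to script. Here is a rough Python sketch, assuming a hypothetical layout where the first column is a name and the rest are whole numbers, and where the converter repeated the header on every page. Rows that don’t fit are set aside for a human to look at rather than silently thrown away.

```python
def clean_rows(raw_rows, header):
    """Drop repeated headers and blank rows, and parse the number columns.

    raw_rows: list of lists of strings, as read from the converted sheet.
    header: the list of column names repeated at the top of each page.
    Returns (cleaned, rejected); rejected rows need a manual look.
    """
    cleaned, rejected = [], []
    for row in raw_rows:
        cells = [cell.strip() for cell in row]
        if not any(cells) or cells == header:
            continue  # blank line or repeated page header
        try:
            if len(cells) != len(header):
                raise ValueError("unexpected number of columns")
            name, *numbers = cells
            # Strip thousands separators before converting to integers.
            cleaned.append([name] + [int(n.replace(",", "")) for n in numbers])
        except ValueError:
            rejected.append(row)  # footer text, page numbers, garbled cells
    return cleaned, rejected

header = ["Town", "Candidate X", "Candidate Y"]
raw = [
    header,
    ["Town A", "1,204", "980"],
    ["", "", ""],
    header,
    ["Town B", " 312", "455 "],
    ["Page 2 of 2"],
]
cleaned, rejected = clean_rows(raw, header)
print(cleaned)
print(rejected)
```

Keeping the rejected rows visible matters: if a real data row lands there because of a misread digit, you want to find out before you publish.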

Share your data

At TrendCT, we’re trying to get better at sharing our cleaned data, especially if it comes out of a PDF. I assume that, if you’re reading this, you will eventually be publishing your findings somewhere. Be a good data citizen and upload your cleaned dataset so others can save themselves a lot of pain. If you did something to the dataset beyond reproducing the PDF version, just say so.

On the flip side, if you see someone who has already cleaned the data, it doesn’t hurt to ask for the clean data. We get these e-mails all the time, and we’re always willing to share.

What do you think?

  • Fly_Dog

    Thanks, looking forward to using these tools.

  • Joseph Brzezinski

    For Microsoft Office users, Word can read PDF files, and items can be selected and copied into spreadsheets. Some versions of Word have premium add-in services to convert to and from PDF documents as well. Depending on the need, Office may be worth trying if you have it available. I don’t know if OpenOffice has a similar capability.

    • alvinschang

      Great point. I’ve had about a 30 to 40 percent success rate with converting from PDF to Word in a way that I can pull out the table easily. But it’s definitely one of the easier ways to accomplish this.