Real-world Python for data-crunching journalists


Reporters don’t interview human sources without a notebook, and they shouldn’t analyze data without one either.

In this walkthrough, we’ll see how to use code, instead of visual spreadsheet applications like Excel, to keep an audit trail of our data analysis, as we do for many of our stories and projects at Trend CT.

In the hands-on portion, we’ll use the Python programming language to find the biggest recipients of vendor payments from the Connecticut Airport Authority, working in a “notebook,” a format readable by anyone, not just people who know how to write code.

The main drawback with visual tools is that, while they’re intuitive, they don’t provide a mechanism to show your work step-by-step. (See Appendix B for why we think this is important).

The tools we’ll use are the Python programming language, a Python library for statistical analysis called Pandas, and a notebook program called Jupyter. There are other tools you can use that offer similar capabilities (See Appendix E).

The end product will be a Python notebook file that looks like this (and looks like this when published on GitHub):

[Screenshot: the finished notebook]
Difficulty level

Challenging. If you’ve never written any code before, you’ll need to spend time on the resources linked in the article.

This isn’t meant to be an exhaustive introduction to any of the technologies used, but a look at how they can be used in concert with a real-world example. That means there are some real-world problems we’ll solve as they arise.

For a more comprehensive introduction, check out the Pandas Cookbook or the other resources in Appendix C of this guide.

Installing the software

For this guide, you’ll need to install Python, a Python library called Pandas, and an interactive programming environment for data scientists called Jupyter. (If you have a Mac, Python comes pre-installed.)
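If you’d rather install the pieces individually, Python’s package manager pip is one common route (a minimal sketch; it assumes pip is already set up on your machine):

$ pip install pandas jupyter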

I haven’t used it myself, but Anaconda is a popular package that combines a lot of free software for data science in Python (and R), including all the software needed for this tutorial.

Setting up the environment

Before we can start analyzing data, we need some data to analyze and a folder to work in. All the folders, data and code for this tutorial can be found in this GitHub repository.

1. Make a new folder on your Desktop and call it pandas-tutorial. That’s your project folder.

2. Open your project folder and make a new folder inside called “data”. The data folder is where you’ll keep your raw source data.

Best practices: I always put the raw data in a data subdirectory, and I don’t alter the file directly. This way anyone auditing my code can obtain the data directly from the source and reproduce my analysis. I try never to “pre-process” raw data in some other program, like Excel.

3. Download the source csv from the repo for this tutorial on GitHub and put it in your data folder.

4. Open up a terminal emulator. On Mac, it’s a program called “Terminal,” which can be found in the /Applications/Utilities folder or in Spotlight (Command+Space) by searching for “Terminal.” If you’ve never used Terminal, check out this iMore article.

On a Windows computer, this process will be different. This guide should help you get familiar with the Windows command line.

5. In the terminal, navigate to the folder you created on the desktop (or downloaded from the GitHub repo).

On a Mac you can do that with the cd command:

$ cd ~/Desktop/pandas-tutorial

6. Launch Jupyter Notebook:

$ jupyter notebook

You’ll see a bunch of text in the terminal window, which you can safely ignore, before a new browser window appears that looks like this:

[Screenshot: the Jupyter Notebook home screen]
Creating a notebook

In the Jupyter Notebook home page you can see the data directory. (If you don’t see it, you launched Jupyter Notebook from the wrong folder.)

Now we’re ready to start a notebook.

1. Click the “New” button on the right side of the page and select Python 2 or Python 3, depending on which version of Python you have installed.


2. On the new page that opens, give the notebook a name by clicking the text area that says “Untitled” and entering a name, like “Intro to Pandas.”

3. Click in the first “cell” and then click the dropdown that says “code” and change it to “Markdown.”

Markdown is a shorthand that gets converted to HTML. In Markdown cells, like the one we just created, we can put our annotations. Headlines are preceded with #, ## or ###; the more #s, the smaller the headline appears. For more on Markdown syntax, read this.
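For example, a Markdown cell containing the two lines below (the headline text is just a placeholder) renders as a large headline over a smaller one:

# Analyzing CAA vendor payments
## Who received the most money?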

4. Click in the markdown field and enter a headline and an explanation of the project. Then hit shift+enter to render the cell into HTML and jump to the next cell (in this case, a new cell is created; the default cell type is “code”).


Let’s add a field for code:

5. Click in the newly created cell and type

print "HELLO, WORLD!"

6. Hit shift+enter to execute the code (same as the shortcut to render the Markdown cell).

Unlike when the markdown cell rendered, the code cell doesn’t change in appearance. Instead the output from the code appears below, if there is any. In this case, the output is the string “HELLO, WORLD!”

Best practices: Don’t skimp on the markdown cells. Use them throughout your code to explain what topics you’re going to explore, and what questions you want to answer. Non-technical readers should be able to read just the markdown sections of your notebook and get a better understanding of how you analyzed the data.
Tinkering with some real data

This section moves quickly. Describing how to write code in Python is beyond the scope of this tutorial, so if you’re confused about any of the Python code in this section, you should read (or watch) any of the great resources in Appendix D.

1. Import the pandas library with a commonly used alias “pd” by putting the following in a new code cell and executing it (remember, shift+enter):

import pandas as pd

2. Next, we’ll import the spreadsheet to a data frame. We import data using pandas’ “read_” methods. In this case, we’ll use read_csv:

caa_data = pd.read_csv("data/Connecticut_Airport_Authority_Financial_Data_FY_15.csv")

Now the dataframe is stored in the variable caa_data. (A dataframe is a Python object that you can think of as a table, or “relation.”)

3. Use the .head() method on the dataframe to examine the first few rows:
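caa_data.head()

By default, .head() shows the first five rows of the dataframe.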


The .read_csv() method worked well in this case, but we got lucky. Because of the structure of the spreadsheet, it was able to infer the column headings. By default, the read_ methods use the first row for column headings, but you can specify a certain number of rows to skip, if the spreadsheet isn’t formatted as nicely as this one. (Here’s the documentation for .read_csv()).
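For instance, if a spreadsheet had two rows of title text above the real column headings, you could tell Pandas to skip them with the skiprows argument (the file name and row count here are hypothetical):

messy_data = pd.read_csv("data/some_messy_file.csv", skiprows=2)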

4. To view just the values in the “Amount” column, we can use the following syntax:
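caa_data["Amount"]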


This results in too many rows to display, so in the middle of the output there is a row “…” indicating that rows were skipped. For more on selecting just the data you want, read this.

5. To select just rows where the category is “Office Supplies” we can use:

caa_data[caa_data["Expense Category"] == "Office Supplies"]

Note that the results of queries like the one above are also dataframes. When the output from the code in a cell is a dataframe, Jupyter Notebook displays it as an HTML table.

Seeking useful information

Now we’re ready to do something more useful. From this dataset we should be able to calculate the total dollar amount of all the payments combined, and also the total dollar amount for the 10 biggest expense categories.

In a new markdown cell, state what you want to find out before you start your analysis; then start coding.

Best practices: I’ve found it’s a good discipline to state what you’re looking for in the data before you start writing the code to find it. It helps you stay on track and keeps you from focusing only on the easiest interrogations.

1. Use the .sum() function on the “Amount” column:
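caa_data["Amount"].sum()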


Well, that didn’t work! When Pandas imported the spreadsheet, it tried to infer what type of data was in each column. Because of the “$” signs, it didn’t recognize the values in the “Amount” column as numbers, so it treated them as strings of text. Calling the .sum() function on strings just concatenates them together.
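You can confirm this by checking the data type Pandas assigned to each column; columns of strings show up with the dtype “object”:

caa_data.dtypes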

We first have to convert the values in the “Amount” column to numeric values. Here’s some code that does that and saves the result in a new column called “Amount_numeric”:

import re
# Remove the "$" and "," characters, then convert what's left to a float
caa_data["Amount_numeric"] = caa_data["Amount"].apply(lambda x: \
    float(re.sub(r"[$,]", "", x)))

I won’t get into the details of how this code works – I’ll save that for a later article that talks specifically about strategies for cleaning up data. But if you’re interested in unraveling it you’ll want to read up on lambda functions in Python and regular expressions in Python.

2. Now, try that .sum() method again.
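caa_data["Amount_numeric"].sum()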


Huzzah! Now let’s group the data by “Expense Category” and get the sum of the payments for each group, limiting the output to the top 10 in descending order. (Keep in mind, since we haven’t asked anyone questions about this data, we don’t know if the sum of the values in this column is a meaningful number. There could be duplicate entries, for instance. We’d have to ask the maintainer of the data set, or read a data dictionary, before we know enough to use these numbers in a story.)

# We can use the .agg() method to get summary data
# from the grouped object. We'll need to import another library,
# called numpy, to do it. Read more on this groupby method here:
import numpy as np
caa_data.groupby("Expense Category").agg({"Amount_numeric": np.sum})\
    .sort_values(by="Amount_numeric", ascending=False)\
    .head(10)
That’s all, folks!

You should now have a sense of the environment in which you can analyze data in a reproducible, publishable way.

In a future installment I’ll describe how to solve some specific problems data journalists run into, like turning text into useful data.

Appendix A: What is data journalism?

Data journalism looks different in every newsroom but most of it fits into two categories: data analysis and data visualization.

Visualization means illustrating data with charts, graphics and interactive content.

Analysis includes obtaining, cleaning up and “interrogating” the data. But isn’t that just called “journalism”? Yes. Data isn’t just a beat, like tech or business journalism, but a practice virtually every journalist is engaged in.

Appendix B: Guiding principles

These principles are up for debate, but we at Trend CT believe they should apply to any journalist engaging in the practice of data journalism (see previous section), regardless of whether their title includes the word “data” and regardless of what tools they use.

1. Accountability. When we present numbers we calculate ourselves, we should also publish the steps we took to get there from source data, like an Excel file from the FEC.

2. Openness. We believe in friendly, productive competition, and we understand the initial reluctance to share your work. What if the competition uses it to publish the same story and steal your thunder? Well, first of all, that happens with every kind of story, data or not, and they could lift (or “aggregate”) your work without even seeing your methodology. The upside of sharing is that other news outlets, policymakers or just interested readers can check our work. They might find a mistake, which we can correct. The other big upside is that they can take our work and build on it, better serving readers. We want to see good journalism, even if we didn’t produce it.

3. Reproducibility. This isn’t just an ethical thing or an accuracy thing. When you analyze data, you should do it in a way you can reproduce, so that the next time a new version of the data comes out, you can just “hit go” instead of redoing all your work. Time is a precious resource for newsrooms. Think about future-you. Don’t waste their time.

Appendix C: Why Python

I use Python because I don’t just write code for data analysis. I write web applications and other tools in Python. I’ve been writing Python code since long before I got started in data journalism. I’m comfortable with Python, and I’m able to use code I write for data analysis in applications I write, and vice versa. Within Python, there are other data analysis libraries available, such as Agate, a well-designed data analysis library specifically geared for journalists.

Where Python is a general-purpose language, R is finely tuned for data science. See Appendix E for more on alternative tools.

Appendix D: Further reading

If you’ve never done any Python programming at all, you’ll want to start with one or more of these guides:


These guides are interactive:

Or try this excellent computer science introduction course from MIT, available to watch for free on edX:

For more on Pandas, check out this compilation of tutorials, including this very thorough “Pandas Cookbook”:

Appendix E: Different tools for data journalists

In addition to Python with Pandas, there are other programming languages and tools that offer comparable functionality, such as the R statistical programming language with RStudio, or using a database and writing SQL.

One of the key features that makes all of these tools suitable for data analysis is their ability to perform relational algebra operations. A relation is a fancy word for a table. Relational operations take a table as input and return a table as output. For example, we might have a relation (read: table) called census with two “attributes” (fancy word for columns): names of states and estimated population. To narrow down the relation to the “tuples” (rows) with a population greater than 5 million, we’d make a new relation using a selection operation. A selection can be written with Greek symbols, but my Greek is rusty, so I’ll use SQL, which you can probably decipher even if you’ve never read a line of code:

SELECT state, population
FROM census
WHERE population > 5000000

The exact same selection can be written in R or Python with Pandas. These tools also offer a suite of statistical functions that, combined with relational algebra, allow for powerful transformation and analysis.
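For instance, assuming a Pandas dataframe named census with “state” and “population” columns, that selection could be written like this:

census[census["population"] > 5000000][["state", "population"]]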

For an alternative to Pandas, consider Agate, a well-designed data analysis library for journalists maintained by Christopher Groskopf, a talented programmer and journalist.


You can also use database applications to store and analyze data. This guide by AP data journalist Troy Thibodeaux walks through using SQLite, a serverless database.


If you want to learn R, my colleague Andrew Ba Tran has written a tutorial for transitioning from spreadsheets to R.
