Tutorial: How to understand and retrieve Census data — for beginners


The United States Census is a crucial part of how this country functions. The Constitution requires an enumeration every 10 years, which determines how many federal representatives each state receives and how federal resources are parceled out. In the past, surveys filled out as part of the Census have helped policymakers decide what to do in times of crisis, such as during the Great Depression, when they needed an idea of how grim the situation actually was.

But another effect of the Census is that the public can use the data to answer their own questions.

For some social and data scientists, working with Census data is a breeze. But for everyone else, it can be overwhelming. So today we’ll talk about understanding the Census — and how total beginners can retrieve datasets.

The history of counting a lot of people

To best understand how the Census works, it’s important to understand its history.

The first Census, in 1790, was mainly designed to count males 16 and older (to gauge military potential), as well as the number of free people and slaves. It surveyed the head of each household, and the record looked like this:

[Image: screenshot of a 1790 Census record]

As the population grew, one of the biggest challenges was figuring out how to gather large amounts of data and tally it without computers. In 1870, the first tallying machine was introduced, but counting was still laborious: it took nearly a decade to count all the surveys, at which point it was time to begin the next Census. It wasn’t until 1890 that an electric tabulating system helped speed up counting.

Another large problem was — and still is — the undercounting of some minority populations. For example, analysis of the 1980 survey found that African Americans were severely undercounted. One solution was to use statistical sampling to adjust the data for these populations, but the Supreme Court ruled in 1999 that using these methods to allocate seats in the U.S. House of Representatives violated the Census Act of 1976. It did not rule out the use of sampling for redistricting or the allocation of federal resources.

For the 2010 Census, the Census Bureau estimated that it missed 1.5 million members of minority groups.

All this is to say: Counting large numbers of people is hard, and it’s important to keep that in mind when looking at Census data.

There’s a new survey in town

Before 2010, there were two surveys: a “short” survey, which all households received, and a “long-form” survey, which about one in six households received in 2000.

But in 2010, that changed. Every household received a 10-question survey for the decennial Census, and the in-depth questions moved to the American Community Survey, which samples about 295,000 addresses a month. Because it is a running survey, it provides more up-to-date data than the old long-form survey, which was taken only once every 10 years. Not everyone supports this mandatory survey, but it helps us answer some very important questions.

Because the ACS is not a survey of the entire population, you can’t answer questions about small groups of people using just one year’s worth of data, otherwise known as “one-year estimates.” So the Census Bureau combines data from multiple years to provide a better estimate for smaller locales or demographic groups. That’s what it means to use “three-year estimates” or “five-year estimates.”
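
To see roughly why pooling years helps, here is a minimal Python sketch. It assumes a simple random sample, which is a simplification of how the ACS actually samples, and the sample size and share below are made-up numbers for illustration only. It uses the 90 percent confidence level that the Census Bureau uses for its published margins of error.

```python
import math

def margin_of_error(share, sample_size, z=1.645):
    """Approximate margin of error for a share (0 to 1) estimated from a
    simple random sample. z=1.645 corresponds to a 90% confidence level."""
    return z * math.sqrt(share * (1 - share) / sample_size)

# Hypothetical small town: 200 responding households per year,
# and 30% of them report some characteristic we care about.
one_year = margin_of_error(0.30, 200)
# Pooling five years is treated here as simply five times the responses.
five_year = margin_of_error(0.30, 200 * 5)

print(f"one-year estimate:  30% +/- {one_year * 100:.1f} points")
print(f"five-year estimate: 30% +/- {five_year * 100:.1f} points")
```

Under these assumptions, five times the responses shrinks the margin of error by a factor of about 2.2 (the square root of five). The trade-off is that a five-year estimate describes an average over the whole period, not a single year.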

Find out what’s in the survey

I find that one of the best ways to begin asking questions of Census and ACS data is to actually read the survey. You can see the Census survey here and the ACS one here.

It can help you understand what kinds of queries you can make of the data.

Getting the data

There are multiple ways to get the data, each with upsides and downsides. In short, the easier the interface, the less flexibility you have to build custom datasets. I’ll run through three options here.

Easy: Census Reporter

The easiest site is CensusReporter.org. You input a location you’re looking for, then it walks you through the available datasets. For example, we can find the median age by sex for Connecticut.

From there, you can create various geographic breakdowns. In Connecticut, if you want to divide the data by town, use “county subdivisions.” If you want to divide it by Census tract, select that instead. Tracts are small areas with about 4,000 people each.

The magic, but also the downside, is that Census Reporter tries to guess which dataset you want. It’ll switch between ACS five-year estimates and one-year estimates, and you don’t have to make that choice.

Once you have the dataset you want, you can download the data from the top right of the page.
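
If you would rather work with a downloaded file in code than in a spreadsheet, a minimal Python sketch looks like this. The column names (`name`, `median_age`) and the values are hypothetical placeholders, not the real layout of a Census Reporter download, so check the header row of your own file and adjust accordingly.

```python
import csv
import io

# Stand-in for a downloaded CSV file. The column names and values
# are made-up placeholders, not real Census figures.
SAMPLE_CSV = """name,median_age
Town A,41.2
Town B,36.8
Town C,44.5
"""

def load_rows(handle):
    """Read rows from a CSV file handle, converting median_age to a float."""
    rows = []
    for row in csv.DictReader(handle):
        row["median_age"] = float(row["median_age"])
        rows.append(row)
    return rows

rows = load_rows(io.StringIO(SAMPLE_CSV))

# Find the place with the highest median age in the sample.
oldest = max(rows, key=lambda r: r["median_age"])
print(oldest["name"], oldest["median_age"])
```

With a real download, you would pass `open("your_file.csv", newline="")` to `load_rows` instead of the in-memory sample.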

Medium: American FactFinder

American FactFinder gives you a little more flexibility.

You can select your topic and your geographic constraints, and then it will push out datasets that match your query. It’s a lot easier to explore what’s available with FactFinder, because it categorizes the different datasets available.

If you want your hand held as you explore this, go to this slideshow I made. Otherwise, it’s pretty easy.


Hard: IPUMS

The Integrated Public Use Microdata Series, or IPUMS, is an incredibly powerful tool that lets you extract data from 1850 to the present. The other tools don’t allow for this.

For example: Do you want to find the average household size from 1900 to the present? This is your tool.

There are a lot of nuances to historical Census data. Various things affect it, such as when the Census Bureau began primarily mailing the surveys rather than conducting them in person. But if you’re using IPUMS, you might already know this. Since this tutorial is for beginners, I’m mentioning IPUMS mainly so you know that it exists.

I hardly consider myself an expert with IPUMS, but if you want to have a long weekend learning, I suggest watching these YouTube videos, which have helped me extract data in the past.

If any readers want to write a tutorial on how to better use IPUMS, please let me know.

What do you think?

  • Joseph Brzezinski

    Good discussion. Perhaps some future article could address enumerating additional public sources of data such as economic measures, labor statistics, education statistics, government data at town/county/state breakdown levels, taxation, weather/climate measures, crime/public safety data, etc.

    • alvinschang

      We’re on the same page! We’re touching on one of these subjects on Monday.

      • Joseph Brzezinski

        Sounds great. Any insights you can offer on synchronizing data from multiple sources (e.g. time period, geographic level, etc.) would be helpful as well.

        • alvinschang

          Good suggestion. Do you have a good example?

          • Joseph Brzezinski

            I always have trouble with “town” for Connecticut. Some sources have a FIPS code, some have codes that are unique to the source, and names vary across sources: mixed case or upper case, with some having county and state as part of the name. Linking town data to a spatial shapefile then becomes an additional issue.
            Year is another example: is it calendar, fiscal, or what?