Pivot tables is a quick and powerful way to analyze data in Excel (and now Google Spreadsheets, too). It’s a way for a user to to summarize and large large sets of data.
We’ll be looking the State of Connecticut employee compensation data set for June 2013 – 2014.
It’d be easy to do in Excel with its clicking, dragging, and dropping. But it involves a lot of copying and pasting, which can get difficult to track for a large project. Plus, it’s not time efficient reproducing the process at a larger scale. However it’s much easier to replicate in R with its command line structure.
This walkthrough will help you learn how to do it in base R (to better grasp the basics, hopefully) and then with the assistance of the dplyr package.
We’ll also be using the gender R package to infer if a worker is female or male to better see if there’s a government pay gap.
Beginner to intermediate.
If you’re reading this then you already know how to use pivot tables in Excel or have hopefully already gone through my previous tutorial, Introduction to R for Excel users.
Setting up R
- Download and install R [cran.r-project.org]
- Next, download and install RStudio [rstudio.com]
- Finally, download these files and open pivot_tables.Rpoj to follow along.
Importing the data into R
Two options when getting started. Set the working directory in RStudio to where you downloaded the csv of the state payroll. Or download the folder from TrendCT Github repo and open the pivot_tables.Rproj— the directory will be set automatically.
Type this in the RStudio console to import the data and check it out:
payroll <- read.csv("CT-Payroll-FY2014.csv", stringsAsFactors=FALSE)
This is just the first 10 rows of almost 96,000 state employees. That's a lot of data. Let's think about what this type of information can tell us.
- How many employees there are in each department
- What the average or median pay is by department
Pivot table I
It's a fairly quick process to get an employee count by department in Excel. Though it might take a little bit of time since the file size is 8 mb. However, to sort generated pivot table requires copying and pasting the data into a new dataframe. Here's how to do it in R.
The basic R command to summarize by type is the
depts <- table(payroll$AGENCY_DESCRIPTION)
This command counted the similar instances that occurred in the AGENCY_DESCRIPTION column that you saw from above by using the
depts <- depts[order(-depts)]
This command reorders the dataframe in descending order.
This command lists the first 10 in the new reordered depts dataframe.
At any point you can export the object or dataframe as a spreadsheet with
In Excel, to determine the average pay per department, drag in the TOTAL_AMT variable into the values window and click the i to select the average.
In R, a couple lines of code will do the same thing and sort it as well. This time instead of using
table(), we'll use
tapply(), which applies a function on an array of data.
depts_avg <- tapply(payroll$TOTAL_AMT, payroll$AGENCY_DESCRIPTION, mean)
Note: You can't do a pivot table analysis in Excel with median. You can in R. To do so, replace
The code from above would look like
depts_median <- tapply(payroll$TOTAL_AMT, payroll$AGENCY_DESCRIPTION, median)
Pivot table II
Sure it's interesting to see that the Capital Region Dev Authority gets paid on average more than any other state employee in Connecticut. But you're probably curious for more.
Ready for some advanced stuff?
Let's see if we can drill down further into the salary dataset by inferring the gender based on the first names of the employees. We'll use the gender package from the rOpenSci project that predicts gender based an algorithm using historical data from the U.S. Social Security Administration and the U.S. Census Bureau.
That's a lot of unknowns. Way too many. What's going on?
View(payroll) in the console, scroll around, and you'll see that Students are listed in the name frame. In fact, there's more than 16,000 on the state payroll with no name listed except as Student. So let's take them out of the data frame to get a cleaner data set.
payroll <- subset(payroll, first_name!="Student")
Much better. That brought 20,000 down to 3,000.
What's the gender breakdown?
Men who work for Connecticut get paid an average of about $10,000 more than women, according to the data analysis we've just conducted. Pay for both genders go up when looking at the median salary. The difference is about $8,000.
This is by no means official. This data is based on what an algorithm inferred. For the most accurate figures, we'll have to request it from the state.
Pivot table III
Let's go deeper!
Which departments pay women or men the most? Which departments have the greatest disparity in pay?
- The highest average pay for men is in the Capital Region Dev Authority at $178,000.
- The highest average pay for women is in the Asian Pacific American Commission at $138,000. But there's only one woman in that department, so moving on to a the next department with more than one female employee is the Psychiatric Security Review Board with an average of $110,000 total pay.
- The biggest disparity in pay between men and women is in the Capital Region Dev Authority. The difference between average salaries is $97,000.
- On the other side, women are paid more than men on average in the Office of Early Childhood by about $79,000 on average.
Pivot table simplified
We won't get into this too deeply. But I wanted to show you a less-involved way to conduct this analysis using pipes and the dplyr package.
These five lines of code replicated everything in the Pivot table II chapter.
I'll try to break it down for you.
gender_df <- payroll %>% -
- everything that happens after the arrow will be assigned to the gender_df object
- start with the payroll data frame that was originally brought in on the first line
%>%is an operator that allows functions to be strung together, which just looks nicer and is much more efficient than stringing or nesting line after line of code. [Read more about pipes].
group_by() is a function that will pivot the data frame based on the column name passed to it. In this instance, it's gender.
summarize() is where you can add columns to the data frame based on calculations like count(
mean()), and median(
median())). You can also assign it column names within the function.
mutate() lets you create more columns with advanced functions. In this example, we're calculating percent.
arrange() is a much simpler, more intuitive way to rearrange a data frame than what we've used before (
And how many lines does it take to recreate the analysis from the Pivot Table III chapter?
Five lines, just like above. But the only difference is that there's an extra column name in the
The structure of the resulting data frame looks different from the one we created before. But it's much better suited this way to work with visualization packages like ggplot2 and d3.
Here's the final R script.