Tutorial: How to create R functions and packages

In your data analysis process, do you find yourself repeating the same tasks over and over again?

For example:

  • Analyzing data that’s published regularly
  • Always replacing a single county name so it matches a lookup file
  • Adjusting summary state data for population

There’s a principle in programming called DRY, short for “don’t repeat yourself.” The goal is to cut down on repeated work because a) it saves you time and b) it helps you avoid the bugs you might introduce each time you recreate a process from scratch.

Don’t be like Sisyphus.

Save your most common tasks as functions and packages in R. This tutorial will show you how.

Difficulty level

Intermediate.

We’ll use the dplyr package to join some data frames.

We’ll also be creating a function and then turning that function into a package. That involves editing package metadata and namespaces and using the roxygen2 package.

Adjust for population

Let’s say we’re working with some Starbucks store location data.

What would be a problem with this data set if we tried to map it?

That’s right, it would just show how states with a lot of people have a lot of Starbucks. The raw data hasn’t been adjusted for population yet.

We need to 1) join the sb data frame with data on population by state and then 2) calculate the number of Starbucks by state per capita.

This is a problem that we have to deal with all the time.

Let’s make it easier on ourselves in the future.

Lookup file

Start with a lookup file that has columns for state names, state abbreviations, and population data from the Census.

I like to use Google Sheets to store my lookup tables. They’re quick and easy to alter, and the address stays the same.

You can bring a Google Sheet into R if you publish it to the web as a CSV and copy the link over.
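As a minimal sketch: base R’s read.csv() accepts a URL, so a published sheet can be read directly. The column names State, Abbrev, and Population are assumptions about the lookup file, and the rows are typed inline as a stand-in so the example runs without a network connection.

```r
# Stand-in for the published Google Sheet. The column names (State, Abbrev,
# Population) are assumed; the figures are the ones used in this tutorial.
lookup_csv <- "State,Abbrev,Population
Alaska,AK,741894
Alabama,AL,4863300
California,CA,39250017"

# With a real sheet, pass the published CSV link instead of `text`:
#   state_pop <- read.csv(sheet_url, stringsAsFactors = FALSE)
# where `sheet_url` (a hypothetical name) holds the link Google Sheets gives you.
state_pop <- read.csv(text = lookup_csv, stringsAsFactors = FALSE)

# Check the column names and types before joining anything to it.
str(state_pop)
```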

Next, we figure out the columns to join by.
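Here is a rough sketch of the join and the per capita step, using two rows of the Starbucks data. The lookup column names (State, Abbrev, Population) and the object names state_pop and sb_adjusted are assumptions, not necessarily what the original analysis used.

```r
library(dplyr)

# Two rows of the Starbucks data from this tutorial.
sb <- data.frame(
  State_Abbreviation = c("AK", "AL"),
  Starbucks = c(42, 65),
  stringsAsFactors = FALSE
)

# Stand-in lookup table; the column names are assumed.
state_pop <- data.frame(
  State = c("Alaska", "Alabama"),
  Abbrev = c("AK", "AL"),
  Population = c(741894, 4863300),
  stringsAsFactors = FALSE
)

# Step 1: join on the abbreviation, which has a different name in each table.
sb_adjusted <- left_join(sb, state_pop, by = c("State_Abbreviation" = "Abbrev"))

# Step 2: the math. Stores per 100,000 residents.
sb_adjusted$per_capita <- sb_adjusted$Starbucks / sb_adjusted$Population * 100000

sb_adjusted
```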

This is what it looks like when joined.

State_Abbreviation  Starbucks  State     Population
AK                  42         Alaska    741894
AL                  65         Alabama   4863300
AR                  37         Arkansas  6931071

And how it looks after some math. Here, per_capita is the number of Starbucks per 100,000 residents.

State_Abbreviation  Starbucks  State     Population  per_capita
AK                  42         Alaska    741894      5.661186
AL                  65         Alabama   4863300     1.336541
AR                  37         Arkansas  6931071     0.533828

Generalizing the code

Alright, we’ve walked through a specific scenario to normalize data for population.

Let’s adjust what we’ve written so it can handle other data sets.

First, establish some rules

For this to work, we need to make some assumptions about the data set we’ll be joining with the population lookup table.

  • The first column will contain either the full state name or the state abbreviation
  • The second column will have values you want to adjust for population

Now, rewrite the code so it’s more generalized.

As in, it can deal with any data set you give it. That means renaming the data frame and renaming the columns so it’s easier to join later on in the process.
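One way to sketch that generalization: call the incoming data frame df, give its first two columns fixed names for the join, and restore the original value name at the end. The names df, value_name, and state_pop are assumptions made for illustration.

```r
library(dplyr)

# Any data set that follows the two rules: state ID first, values second.
df <- data.frame(
  State_Abbreviation = c("AK", "AL"),
  Starbucks = c(42, 65),
  stringsAsFactors = FALSE
)

# Stand-in lookup table; the column names are assumed.
state_pop <- data.frame(
  State = c("Alaska", "Alabama"),
  Abbrev = c("AK", "AL"),
  Population = c(741894, 4863300),
  stringsAsFactors = FALSE
)

# Remember what the value column was called, then use fixed names so the
# join and the math no longer depend on the incoming column names.
value_name <- colnames(df)[2]
colnames(df)[1:2] <- c("Abbrev", "value")

df <- left_join(df, state_pop, by = "Abbrev")
df$per_capita <- df$value / df$Population * 100000

# Put the original value name back.
colnames(df)[2] <- value_name

df
```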

Abbrev  Starbucks  State     Population  per_capita
AK      42         Alaska    741894      5.661186
AL      65         Alabama   4863300     1.336541
AR      37         Arkansas  6931071     0.533828

Turning the code into a function

Function and variables

Turn your lines of code into a function by wrapping your generalized code with

function(arg1, arg2, ... ){ and closing it off with }

Notice arg1 and arg2 in the code above?

Those are arguments: placeholders for the values that get passed into the function’s code.
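To make that concrete, here is a tiny function with two arguments. The name per_100k is hypothetical and only serves to show the syntax.

```r
# `count` and `population` are the arguments. The body between the braces
# works with whatever values are passed in when the function is called.
per_100k <- function(count, population) {
  count / population * 100000
}

# Alaska's Starbucks figures from earlier, passed by position.
per_100k(42, 741894)

# Alabama's figures, passed by name.
per_100k(count = 65, population = 4863300)
```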

Remember how there were two types of state ID data?

Full names and abbreviations. That’s one of the arguments we’ll pass into the function, so you can tell it which type to join by.

Save the function as pc_adjust.

Run that and test it with the sb data set we imported before.

Apply the function pc_adjust to sb with the argument "Abbrev".
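A minimal sketch of what pc_adjust could look like, not necessarily the original implementation. The lookup argument, its default of state_pop, and the lookup column names are assumptions. The multiplier of 1,000,000 is chosen because it reproduces the per_capita values in the output tables for this function.

```r
library(dplyr)

# Stand-in lookup table; the column names are assumed.
state_pop <- data.frame(
  State = c("Alaska", "Alabama"),
  Abbrev = c("AK", "AL"),
  Population = c(741894, 4863300),
  stringsAsFactors = FALSE
)

# df:         state IDs in column 1, values to adjust in column 2
# state_type: "Abbrev" or "State", whichever kind of ID column 1 holds
# lookup:     population table (assumed argument; defaults to state_pop)
pc_adjust <- function(df, state_type, lookup = state_pop) {
  if (!state_type %in% c("Abbrev", "State")) {
    stop('state_type must be "Abbrev" or "State"')
  }
  value_name <- colnames(df)[2]
  colnames(df)[1:2] <- c(state_type, "value")
  df <- left_join(df, lookup, by = state_type)
  df$per_capita <- df$value / df$Population * 1000000  # per million residents
  colnames(df)[2] <- value_name
  df
}

# Two rows of the Starbucks data from this tutorial.
sb <- data.frame(
  State_Abbreviation = c("AK", "AL"),
  Starbucks = c(42, 65),
  stringsAsFactors = FALSE
)

test <- pc_adjust(sb, "Abbrev")
head(test, 3)
```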

This is what the first three rows of test look like. Here, per_capita is expressed per million residents, not per 100,000 as before.

Abbrev  Starbucks  State     Population  per_capita
AK      42         Alaska    741894      56.61186
AL      65         Alabama   4863300     13.36541
AR      37         Arkansas  6931071     5.33828

Success!

Test it on another data set.

Alright, we’ve got it working with Starbucks data.

Let’s try it with Dunkin’ Donuts data.

State       Dunkin
Alabama     18
Alaska      0
Arizona     59
Arkansas    7
California  2
Colorado    8
Colorado 8

The state identification is spelled out this time and not abbreviated.

Fortunately, we accounted for that when writing the function.

Run this code.
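A sketch of that call with four of the Dunkin’ rows. The function is repeated here so the block runs on its own. The object names dd and dd_adjusted, the lookup argument, and the lookup column names are assumptions.

```r
library(dplyr)

# Stand-in lookup table; the column names are assumed.
state_pop <- data.frame(
  State = c("Alabama", "Alaska", "California", "Colorado"),
  Abbrev = c("AL", "AK", "CA", "CO"),
  Population = c(4863300, 741894, 39250017, 5540545),
  stringsAsFactors = FALSE
)

# Same sketch of pc_adjust as before: join on the chosen ID type, then
# calculate values per million residents.
pc_adjust <- function(df, state_type, lookup = state_pop) {
  value_name <- colnames(df)[2]
  colnames(df)[1:2] <- c(state_type, "value")
  df <- left_join(df, lookup, by = state_type)
  df$per_capita <- df$value / df$Population * 1000000
  colnames(df)[2] <- value_name
  df
}

# Dunkin' counts keyed by full state name.
dd <- data.frame(
  State = c("Alabama", "Alaska", "California", "Colorado"),
  Dunkin = c(18, 0, 2, 8),
  stringsAsFactors = FALSE
)

# Tell the function to join on full names this time.
dd_adjusted <- pc_adjust(dd, "State")
dd_adjusted
```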

State       Dunkin  Abbrev  Population  per_capita
Alabama     18      AL      4863300     3.7011905
Alaska      0       AK      741894      0.0000000
Arizona     59      AZ      2988248     19.7440105
Arkansas    7       AR      6931071     1.0099449
California  2       CA      39250017    0.0509554
Colorado    8       CO      5540545     1.4439013

Yay, we did it!

pc_adjust() is your tiny perfect function.

Keep going!

Turn a function into a package

Save it so you can share it with your future self (perhaps when you work on a different computer) and with others.

We’ll be using RStudio for this part.

Select File > New Project > New Directory > R Package

Name the package

Keep it to one word: package names can only contain letters, numbers, and periods, and must start with a letter.

This is what pops up in RStudio.

Three components

  • An R/ folder where you save your function code
  • A basic DESCRIPTION file for package metadata
  • A basic NAMESPACE file, which you only need to fill out completely if you’re submitting to CRAN

Welcome script

When you first create a new package, a hello.R script will appear with some template code.

Next, edit the DESCRIPTION file

Fill it out with details like the name of the package, the version number, a line describing the purpose of the package, the author name, and license information, as well as any packages that need to be imported.

Questions about which license to use? Check out the options.

In this instance, add Imports: dplyr because this function won’t work without the left_join function from dplyr.
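For reference, here is a sketch of the fields described above. Every value is a hypothetical placeholder. Writing the file with base R’s write.dcf() is just a convenient way to show the format; editing DESCRIPTION by hand in RStudio gives the same result.

```r
# Hypothetical package details; swap in your own.
desc <- data.frame(
  Package = "whateveryoucalledyourpackage",
  Type = "Package",
  Title = "Adjust State Data for Population",
  Version = "0.1.0",
  Author = "Your Name",
  Maintainer = "Your Name <you@example.com>",
  Description = "Joins state-level data to a population lookup table.",
  License = "GPL-3",
  Imports = "dplyr",
  stringsAsFactors = FALSE
)

# Written to a temporary folder here so nothing in your project is overwritten.
desc_path <- file.path(tempdir(), "DESCRIPTION")
write.dcf(desc, file = desc_path)

# Print the file as it would appear in the editor.
writeLines(readLines(desc_path))
```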

Create a new script

Copy and paste the pc_adjust function you made into a new script file.

Save as new script in the R folder

Name the file after the function, pc_adjust.R, and save it in the R/ folder.

This R/ folder is where you can save other function scripts that you’ll build up over time.

Add documentation to your script

Go back to your pc_adjust.R script and add these lines above the code.
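The special comments are roxygen2 tags, each line starting with #'. Here is a sketch of what they might look like for pc_adjust. The wording of the documentation, the lookup argument, and its default are assumptions. The body calls dplyr::left_join explicitly, which relies on dplyr being listed under Imports in DESCRIPTION. Inside a package, the state_pop default would also need to be available to the function, for example as data shipped with the package.

```r
#' Adjust state data for population
#'
#' Joins a data frame of state-level values to a population lookup table and
#' adds a per_capita column (values per million residents).
#'
#' @param df A data frame with state IDs in the first column and the values
#'   to adjust in the second column.
#' @param state_type Either "Abbrev" or "State", matching the first column.
#' @param lookup A data frame with State, Abbrev, and Population columns.
#' @return The input data frame with the lookup columns and per_capita added.
#' @export
#' @examples
#' \dontrun{
#' pc_adjust(sb, "Abbrev")
#' }
pc_adjust <- function(df, state_type, lookup = state_pop) {
  value_name <- colnames(df)[2]
  colnames(df)[1:2] <- c(state_type, "value")
  df <- dplyr::left_join(df, lookup, by = state_type)
  df$per_capita <- df$value / df$Population * 1000000
  colnames(df)[2] <- value_name
  df
}
```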

What is all that gibberish?

These special comments above the function will be compiled as metadata into other files for the package to operate correctly.

Watch.

Run these lines from the roxygen2 package in the console.
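A minimal sketch of those lines, assuming the package project is open so that the working directory is the package folder. If roxygen2 reports that it skipped NAMESPACE because the file was not generated by roxygen2, deleting the template NAMESPACE file and running this again is the usual fix.

```r
# install.packages("roxygen2")  # once, if it is not installed yet
library(roxygen2)

# roxygenise() works on the package in the current working directory, so this
# only calls it when a DESCRIPTION file is present there.
if (file.exists("DESCRIPTION")) {
  roxygenise()
} else {
  message("Open your package project (or setwd() to it) and run this again.")
}
```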

It took the special comments you put above the function in the pc_adjust.R script, wrote to the NAMESPACE file, and created a pc_adjust.Rd file based on them.

Find and open pc_adjust.Rd in the man folder.

This would’ve been tough to put together by hand.

Build your package

Press Cmd + Shift + B (Ctrl + Shift + B on Windows or Linux) to build and install the package.

Now, you have your package forever and ever.

Just run

library(whateveryoucalledyourpackage)

and you can run pc_adjust whenever you want.

Your help file

Type ?pc_adjust in your console.

This is what your special comments above your R function helped generate.

That’s it!

Hold up…

Upload package to GitHub

Upload your package folder to GitHub so that others can download, install, and run your awesome package.

We won’t get into the mechanics of how to do that now.

But here’s how to do it the easy way and here’s the official way.

This means you have to add some clean documentation, such as a README.md file.

But after you upload your package folder to GitHub, this is how others would download it.

They’d use the devtools package and its install_github function.
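As a sketch, install_github() takes the repository as "username/repository". The helper name install_from_github and the repository in the commented call are hypothetical placeholders.

```r
# Small wrapper (hypothetical) that installs devtools if needed and then
# installs a package straight from its GitHub repository.
install_from_github <- function(repo) {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github(repo)
}

# Placeholder repository; replace it with the real account and package name.
# install_from_github("yourusername/whateveryoucalledyourpackage")
# library(whateveryoucalledyourpackage)
```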

Next steps

Keep adding functions to your package.

Perhaps create a Shiny version of it for those who don’t use R.

Over time you’ll build up a bunch that you’ll rely on over and over again.

If it’s awesome, submit it to CRAN.

This was an extremely simple version of making a package.

For more detail on creating R packages, check out Hadley Wickham’s free book, R Packages.
