Matt Roumaya


July 25, 2020

The past few months of work-life have been constantly busy with survey design and analysis. COVID has really changed the landscape for just about everything, and leaders of projects and departments want and need to implement change quickly, and survey research is helping to inform speedy decision making.

(From Luyi Wang’s Reigning Heads)

I’m here to demo the first R package that I’ve ever written that has been helping me quickly clean and tidy data from SurveyMonkey to prepare for analysis and reporting. This is also a continuation of my previous post about tidying the SurveyMonkey Double Header.

Background Info

doubleheadr will be useful if you:
* Download or inherit data from SurveyMonkey as .xlsx or .csv
* Commonly find yourself struggling to quickly tidy the header and sub-header, and intuitively rename columns
* Are fed up with your current process for the two steps above and are looking for a new approach

Getting survey data ready for analysis shouldn’t take more than a few lines of code (if any!), and that has been my personal objective ever since I started using R for survey analysis, reporting, and dashboard creation.

As a quick side note, if you have a SurveyMonkey account, you should absolutely check out Sam Firke’s surveymonkey package, which makes it super easy to pull data from SurveyMonkey’s API. One limitation is that the API is limited to 500 calls per day, (meaning you will max out with any survey having over 50k responses). It’s also common for analysts and data wranglers to inherit response files from other departments or clients, and that’s where doubleheadr can help out.

doubleheadr in Action

You’ll need devtools (to run install_github()), tidyverse, and janitor (also by Sam Firke) to run doubleheadr.

Step 1: install packages

I’ll assume you are familiar with installing from GitHub, and I’m working on being more concise.


Step 2: check out the demo file


demo <- doubleheadr::demo

doubleheadr comes with a built-in demo data set, which helps illustrate the two functions available: clean_headr and trim_headr. The data set demo mimics a .xlsx export from SurveyMonkey and looks like this:

Respondent ID Please provide your contact information: ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11 I wish it would have snowed more this winter.
NA Name Company Address Address 2 City/Town State/Province ZIP/Postal Code Country Email Address Phone Number Response
11385284375 Benjamin Franklin Poor Richard's NA NA Philadelphia PA 19104 NA 215-555-4444 Strongly disagree
11385273621 Mae Jemison NASA NA NA Decatur Alabama 20104 NA 221-134-4646 Strongly agree
11385258069 Carl Sagan Smithsonian NA NA Washington D.C. 33321 NA 999-999-4422 Neither agree nor disagree

As analyzers of this data, we will want to paste the column names and values in the first row together. It’s a bit more complicated than that, which is detailed in my previous post, but we’re being concise here.

Step 3: clean_headr

We can call clean_headr on our data.frame or tibble object, which takes three arguments:
* dat: a data.frame object (in this case, inherited from LHS)
* rep_val: the repeated value as a character string. Our column names have a repeated value of ‘…’ (or ‘..’, or ‘.’)
* clean_names: this is a janitor function that will convert all column names to snake_case and will strip out any non-alphanumeric characters.

demo %>% 
  clean_headr(rep_val = '...', clean_names = TRUE) %>% 
 [1] "respondent_id"                                          
 [2] "please_provide_your_contact_information_name"           
 [3] "please_provide_your_contact_information_company"        
 [4] "please_provide_your_contact_information_address"        
 [5] "please_provide_your_contact_information_address_2"      
 [6] "please_provide_your_contact_information_city_town"      
 [7] "please_provide_your_contact_information_state_province" 
 [8] "please_provide_your_contact_information_zip_postal_code"
 [9] "please_provide_your_contact_information_country"        
[10] "please_provide_your_contact_information_email_address"  
[11] "please_provide_your_contact_information_phone_number"   
[12] "i_wish_it_would_have_snowed_more_this_winter_response"  

We can compare this with clean_names = FALSE, which creates column names that are very similar to the actual survey questions that we’ve asked.

demo %>% 
  clean_headr(rep_val = '...', clean_names = FALSE) %>% 
 [1] "Respondent ID"                                           
 [2] "Please provide your contact information: Name"           
 [3] "Please provide your contact information: Company"        
 [4] "Please provide your contact information: Address"        
 [5] "Please provide your contact information: Address 2"      
 [6] "Please provide your contact information: City/Town"      
 [7] "Please provide your contact information: State/Province" 
 [8] "Please provide your contact information: ZIP/Postal Code"
 [9] "Please provide your contact information: Country"        
[10] "Please provide your contact information: Email Address"  
[11] "Please provide your contact information: Phone Number"   
[12] "I wish it would have snowed more this winter. Response"  

Step 4: trim_headr

Whether we use clean_names or not, we will most likely want to shorten some of the column names so that our code is more legible. The demo data here is pretty conservative - if you have a verbose survey question, you can imagine how long your column names could be!

trim_headr makes it easy to shorten column names, and is really just composed of a concatenated gsub call on all of the column names. Let’s say we do use clean_names and we want to remove “please_provide_your_contact_” from our column names, so that we will still be left with shorter, descriptive names like information_name, information_company, etc. Let’s also shorten our last column name to snowed_more_this_winter.

demo %>% 
  clean_headr(rep_val = '...') %>% 
  trim_headr(c('please_provide_your_contact_', 'i_wish_it_would_have_', '_response')) %>% 
 [1] "respondent_id"               "information_name"           
 [3] "information_company"         "information_address"        
 [5] "information_address_2"       "information_city_town"      
 [7] "information_state_province"  "information_zip_postal_code"
 [9] "information_country"         "information_email_address"  
[11] "information_phone_number"    "snowed_more_this_winter"    

Easy! Depending on the number of survey questions, getting to this point could take a really long time by first manually renaming columns and then deleting the first row. I’ve found this workflow to be more intuitive, and leaves me more time to focus on the data and creating effective dashboards and reports.

If you happen to try out doubleheadr and find it helpful, or more likely, find some bugs or inefficiencies, I would love to hear about it!