The past few months of work-life have been constantly busy with survey design and analysis. COVID has really changed the landscape for just about everything, and leaders of projects and departments want and need to implement change quickly, and survey research is helping to inform speedy decision making.
I’m here to demo the first R package that I’ve ever written that has been helping me quickly clean and tidy data from SurveyMonkey to prepare for analysis and reporting. This is also a continuation of my previous post about tidying the SurveyMonkey Double Header.
doubleheadr will be useful if you:
* Download or inherit data from SurveyMonkey as
* Commonly find yourself struggling to quickly tidy the header and sub-header, and intuitively rename columns
* Are fed up with your current process for the two steps above and are looking for a new approach
Getting survey data ready for analysis shouldn’t take more than a few lines of code (if any!), and that has been my personal objective ever since I started using R for survey analysis, reporting, and dashboard creation.
As a quick side note, if you have a SurveyMonkey account, you should absolutely check out Sam Firke’s
surveymonkey package, which makes it super easy to pull data from SurveyMonkey’s API. One limitation is that the API is limited to 500 calls per day, (meaning you will max out with any survey having over 50k responses). It’s also common for analysts and data wranglers to inherit response files from other departments or clients, and that’s where
doubleheadr can help out.
doubleheadr in Action
devtools (to run
janitor (also by Sam Firke) to run
Step 1: install packages
I’ll assume you are familiar with installing from GitHub, and I’m working on being more concise.
Step 2: check out the demo file
library(tidyverse) library(doubleheadr) demo <- doubleheadr::demo
doubleheadr comes with a built-in demo data set, which helps illustrate the two functions available:
trim_headr. The data set
demo mimics a
.xlsx export from SurveyMonkey and looks like this:
|Respondent ID||Please provide your contact information:||…3||…4||…5||…6||…7||…8||…9||…10||…11||I wish it would have snowed more this winter.|
|NA||Name||Company||Address||Address 2||City/Town||State/Province||ZIP/Postal Code||Country||Email Address||Phone Number||Response|
|11385284375||Benjamin Franklin||Poor Richard’s||NA||NA||Philadelphia||PA||19104||NAfirstname.lastname@example.org||215-555-4444||Strongly disagree|
|11385273621||Mae Jemison||NASA||NA||NA||Decatur||Alabama||20104||NAemail@example.com||221-134-4646||Strongly agree|
|11385258069||Carl Sagan||Smithsonian||NA||NA||Washington||D.C.||33321||NAfirstname.lastname@example.org||999-999-4422||Neither agree nor disagree|
As analyzers of this data, we will want to paste the column names and values in the first row together. It’s a bit more complicated than that, which is detailed in my previous post, but we’re being concise here.
We can call
clean_headr on our
tibble object, which takes three arguments:
data.frame object (in this case, inherited from LHS)
rep_val: the repeated value as a character string. Our column names have a repeated value of ‘…’ (or ‘..’, or ‘.’)
clean_names: this is a
janitor function that will convert all column names to snake_case and will strip out any non-alphanumeric characters.
demo %>% clean_headr(rep_val = '...', clean_names = TRUE) %>% colnames()
##  "respondent_id" ##  "please_provide_your_contact_information_name" ##  "please_provide_your_contact_information_company" ##  "please_provide_your_contact_information_address" ##  "please_provide_your_contact_information_address_2" ##  "please_provide_your_contact_information_city_town" ##  "please_provide_your_contact_information_state_province" ##  "please_provide_your_contact_information_zip_postal_code" ##  "please_provide_your_contact_information_country" ##  "please_provide_your_contact_information_email_address" ##  "please_provide_your_contact_information_phone_number" ##  "i_wish_it_would_have_snowed_more_this_winter_response"
We can compare this with
clean_names = FALSE, which creates column names that are very similar to the actual survey questions that we’ve asked.
demo %>% clean_headr(rep_val = '...', clean_names = FALSE) %>% colnames()
##  "Respondent ID" ##  "Please provide your contact information: Name" ##  "Please provide your contact information: Company" ##  "Please provide your contact information: Address" ##  "Please provide your contact information: Address 2" ##  "Please provide your contact information: City/Town" ##  "Please provide your contact information: State/Province" ##  "Please provide your contact information: ZIP/Postal Code" ##  "Please provide your contact information: Country" ##  "Please provide your contact information: Email Address" ##  "Please provide your contact information: Phone Number" ##  "I wish it would have snowed more this winter. Response"
Whether we use
clean_names or not, we will most likely want to shorten some of the column names so that our code is more legible. The demo data here is pretty conservative - if you have a verbose survey question, you can imagine how long your column names could be!
trim_headr makes it easy to shorten column names, and is really just composed of a concatenated
gsub call on all of the column names. Let’s say we do use
clean_names and we want to remove “please_provide_your_contact_” from our column names, so that we will still be left with shorter, descriptive names like
information_company, etc. Let’s also shorten our last column name to
demo %>% clean_headr(rep_val = '...') %>% trim_headr(c('please_provide_your_contact_', 'i_wish_it_would_have_', '_response')) %>% colnames()
##  "respondent_id" "information_name" ##  "information_company" "information_address" ##  "information_address_2" "information_city_town" ##  "information_state_province" "information_zip_postal_code" ##  "information_country" "information_email_address" ##  "information_phone_number" "snowed_more_this_winter"
Easy! Depending on the number of survey questions, getting to this point could take a really long time by first manually renaming columns and then deleting the first row. I’ve found this workflow to be more intuitive, and leaves me more time to focus on the data and creating effective dashboards and reports.
If you happen to try out
doubleheadr and find it helpful, or more likely, find some bugs or inefficiencies, I would love to hear about it!