Tidying the SurveyMonkey Doubleheader

SurveyMonkey 🐒 is a popular online survey development software that outputs a very frustrating kind of response file. If you’ve ever received exported results from SurveyMonkey, you probably know what I’m talking about.

When response data is exported as a .csv or .xlsx file, it looks something like this:

[Screenshot: The Doubleheader]

You can see that the response data for the question “Please provide your contact information:” contains several inputs for Name, Company, Address, etc. Deleting the first row might seem to make sense at first, but when your survey contains dozens of questions with multiple question types, things can get out of hand really fast.

This post is an overview of one approach for cleaning up SurveyMonkey columns. I struggled with this for a while, but now have a workflow that seems pretty efficient, especially if your goal is to set up RMarkdown reports while your survey is being administered. If it is helpful for you, that’s awesome! If you have a better way of approaching this issue, please let me know!

The workflow

  1. Run a function that builds column names by concatenating the question + response, converted to snake_case (or camelCase, or whatever) using the aptly-named janitor package.
  2. Clean up any annoying/extremely_long_column_names_from_very_long_questions.
  3. If needed, subset columns into dataframes or lists for analysis/visualization. (The dplyr helpers starts_with and ends_with work really well with this workflow.)
  4. That’s it, now you’re ready to do some meaningful analyses!

Let’s start by viewing our response data. You’ll need the packages tidyverse, janitor, and readxl (if your export is .xlsx); the tables below are printed with knitr’s kable.

(I always save the original version without any changes so I can build in QC checks, and tend to use _preserve as an identifier.) You’ll see that some columns are missing header values and will be read in with placeholder names like ...11, ...12, and so on.

library(tidyverse)
library(readxl)
library(janitor)
library(knitr)

path <- "C:/Users/MR/Desktop/Dummy Data Survey.xlsx"

# Keep this unaltered copy for QC
survey_data_preserve <- as_tibble(read_excel(path))
## New names:
## * `` -> ...11
## * `` -> ...12
## * `` -> ...13
## * `` -> ...14
## * `` -> ...15
## * ...
survey_data <- survey_data_preserve
kable(head(survey_data[,10:15]))
| Please provide your contact information: | ...11 | ...12 | ...13 | ...14 | ...15 |
|:---|:---|:---|:---|:---|:---|
| Name | Company | Address | Address 2 | City/Town | State/Province |
| Benjamin Franklin | Poor Richard’s | NA | NA | Philadelphia | PA |
| Mae Jemison | NASA | NA | NA | Decatur | Alabama |
| Carl Sagan | Smithsonian | NA | NA | Washington | D.C. |
| W. E. B. Du Bois | NAACP | NA | NA | Great Barrington | MA |
| Florence Nightingale | Public Health Co | NA | NA | Florence | IT |

Ugh! Look at those terrible column names. 😠


It’s okay, we’ve got a function for that! (thanks to akrun on Stack Overflow)

The big reveal!

Here’s the function in its entirety, where you only need to supply a value for x, which should be your SurveyMonkey dataframe. Below, I’ll go through the function line-by-line (for the most part), and at the end, show how I would clean up this data file.

P.S. If snake_case isn’t your thing, there are a bunch of different cases you can pass to clean_names(case = X) at the end of the function.
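For a quick preview of what a given case will do, janitor also ships make_clean_names(), the string-level cousin of clean_names(). A small sketch (the header string is just an example):

```r
library(janitor)

# Default is snake_case: punctuation is stripped, spaces become underscores
make_clean_names("Please provide your contact information: (Name)")
#> [1] "please_provide_your_contact_information_name"

# Any case supported by the snakecase package works here too
make_clean_names("Please provide your contact information: (Name)",
                 case = "upper_camel")
```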

double_header <- function(x) {
  
  df <- as_tibble(x)
  
  # Build a lookup of old column name -> new concatenated name
  keydat <- df %>%
    slice(1) %>%
    select_if(negate(is.na)) %>%
    pivot_longer(everything()) %>%
    group_by(grp = cumsum(!startsWith(name, "..."))) %>%
    mutate(value = sprintf("%s (%s)", first(name), value)) %>%
    ungroup() %>%
    select(-grp)
  
  # Apply the new names, drop the secondary header row, and clean up
  df %>%
    rename_at(vars(keydat$name), ~ keydat$value) %>%
    slice(-1) %>%
    clean_names()
}

The double header breakdown

  1. slice(1) selects the first row, which contains the secondary names we need.
  2. select_if(negate(is.na)) keeps only the columns where the first row is not NA; these are exactly the columns whose names need fixing, while columns whose first row is NA (like Respondent ID) keep their original names.
  3. pivot_longer(everything()) transforms our dataframe from wide to long, and automatically creates the columns name and value.
  • (name holds all of our original column names and value holds all of our secondary column names.)
## # A tibble: 6 x 2
##   name                                     value         
##   <chr>                                    <chr>         
## 1 Please provide your contact information: Name          
## 2 ...11                                    Company       
## 3 ...12                                    Address       
## 4 ...13                                    Address 2     
## 5 ...14                                    City/Town     
## 6 ...15                                    State/Province
  4. group_by(grp = cumsum(!startsWith(name, "..."))) builds a running group id: the cumulative sum increments at every name that does not start with “...”, so each question is grouped together with the “...” columns that follow it. This is better shown in the table below:
| name | value | grp |
|:---|:---|---:|
| Please provide your contact information: | Name | 1 |
| ...11 | Company | 1 |
| ...12 | Address | 1 |
| ...13 | Address 2 | 1 |
| ...14 | City/Town | 1 |
| ...15 | State/Province | 1 |
| ...16 | ZIP/Postal Code | 1 |
| ...17 | Country | 1 |
| ...18 | Email Address | 1 |
| ...19 | Phone Number | 1 |
| I wish it would have snowed more this winter. | Response | 2 |
| I have consumed these beverages in the past week (select all that apply.) | Water | 3 |
| ...22 | Coffee | 3 |
| ...23 | Tea | 3 |
| ...24 | Soda | 3 |
| ...25 | Beer | 3 |
| ...26 | Wine | 3 |
| ...27 | Hard Liquor | 3 |
| ...28 | Juice | 3 |
| Choose all of the places you have lived for 6 months or longer. | Pennsylvania | 4 |
| ...30 | New Jersey | 4 |
| ...31 | Delaware | 4 |
| ...32 | New York | 4 |
| ...33 | California | 4 |
| ...34 | Texas | 4 |
| ...35 | Florida | 4 |
| ...36 | Other (please specify) | 4 |
| Describe your perfect day off: | Open-Ended Response | 5 |
| Describe your perfect pizza: | Open-Ended Response | 6 |
  5. mutate(value = sprintf("%s (%s)", first(name), value)) updates our value column, concatenating each group’s question (the first name in the group) with its response label so the names are meaningful. We’re almost there!
| name | value | grp |
|:---|:---|---:|
| Please provide your contact information: | Please provide your contact information: (Name) | 1 |
| ...11 | Please provide your contact information: (Company) | 1 |
| ...12 | Please provide your contact information: (Address) | 1 |
| ...13 | Please provide your contact information: (Address 2) | 1 |
| ...14 | Please provide your contact information: (City/Town) | 1 |
| ...15 | Please provide your contact information: (State/Province) | 1 |
  6. Then we just ungroup() and drop the grp column.
  7. Finally, we rename the columns using our updated names in keydat$value, drop the secondary header row with slice(-1), and call clean_names() to convert to snake_case 🐍
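The grouping step is the heart of the function, and you can watch it work on a toy vector of column names (base R only; the names are made up):

```r
# Each name that does NOT start with "..." bumps the counter,
# so a question and its trailing "..." columns share one group id
nms <- c("Contact info:", "...11", "...12", "Snow?", "Beverages?", "...15")
cumsum(!startsWith(nms, "..."))
#> [1] 1 1 1 2 3 3
```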

All together now

Let’s run the function on survey_data:

survey_data <- double_header(survey_data)

Here’s a comparison of our original names and our cleaned names:
| Old Names | New Names |
|:---|:---|
| Respondent ID | respondent_id |
| Collector ID | collector_id |
| Start Date | start_date |
| End Date | end_date |
| IP Address | ip_address |
| Email Address | email_address |
| First Name | first_name |
| Last Name | last_name |
| Custom Data 1 | custom_data_1 |
| Please provide your contact information: | please_provide_your_contact_information_name |
| ...11 | please_provide_your_contact_information_company |
| ...12 | please_provide_your_contact_information_address |
| ...13 | please_provide_your_contact_information_address_2 |
| ...14 | please_provide_your_contact_information_city_town |
| ...15 | please_provide_your_contact_information_state_province |
| ...16 | please_provide_your_contact_information_zip_postal_code |
| ...17 | please_provide_your_contact_information_country |
| ...18 | please_provide_your_contact_information_email_address |
| ...19 | please_provide_your_contact_information_phone_number |
| I wish it would have snowed more this winter. | i_wish_it_would_have_snowed_more_this_winter_response |
| I have consumed these beverages in the past week (select all that apply.) | i_have_consumed_these_beverages_in_the_past_week_select_all_that_apply_water |
| ...22 | i_have_consumed_these_beverages_in_the_past_week_select_all_that_apply_coffee |
| ...23 | i_have_consumed_these_beverages_in_the_past_week_select_all_that_apply_tea |
| ...24 | i_have_consumed_these_beverages_in_the_past_week_select_all_that_apply_soda |
| ...25 | i_have_consumed_these_beverages_in_the_past_week_select_all_that_apply_beer |
| ...26 | i_have_consumed_these_beverages_in_the_past_week_select_all_that_apply_wine |
| ...27 | i_have_consumed_these_beverages_in_the_past_week_select_all_that_apply_hard_liquor |
| ...28 | i_have_consumed_these_beverages_in_the_past_week_select_all_that_apply_juice |
| Choose all of the places you have lived for 6 months or longer. | choose_all_of_the_places_you_have_lived_for_6_months_or_longer_pennsylvania |
| ...30 | choose_all_of_the_places_you_have_lived_for_6_months_or_longer_new_jersey |
| ...31 | choose_all_of_the_places_you_have_lived_for_6_months_or_longer_delaware |
| ...32 | choose_all_of_the_places_you_have_lived_for_6_months_or_longer_new_york |
| ...33 | choose_all_of_the_places_you_have_lived_for_6_months_or_longer_california |
| ...34 | choose_all_of_the_places_you_have_lived_for_6_months_or_longer_texas |
| ...35 | choose_all_of_the_places_you_have_lived_for_6_months_or_longer_florida |
| ...36 | choose_all_of_the_places_you_have_lived_for_6_months_or_longer_other_please_specify |
| Describe your perfect day off: | describe_your_perfect_day_off_open_ended_response |
| Describe your perfect pizza: | describe_your_perfect_pizza_open_ended_response |

Subsetting made easy!

Now if we want to subset data based on certain questions/columns, we can do it really easily using starts_with and ends_with.

Sometimes it’s easier to rename columns so that they’re shorter and easier to work with, and sometimes it’s fine to keep some really long column names if your survey contains a lot of similar questions.

Below, we’ll select all of the columns that start with please_provide_your_contact_information and shorten their names so they start with contact_information followed by the response label.

starts_with example:

contact_info <- survey_data %>%
  select(starts_with("please_provide_your_contact_information")) %>%
  rename_at(vars(starts_with("please")), ~str_remove(.,"please_provide_your_"))

kable(head(contact_info))
| contact_information_name | contact_information_company | contact_information_address | contact_information_address_2 | contact_information_city_town | contact_information_state_province | contact_information_zip_postal_code | contact_information_country | contact_information_email_address | contact_information_phone_number |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| Benjamin Franklin | Poor Richard’s | NA | NA | Philadelphia | PA | 19104 | NA | | 215-555-4444 |
| Mae Jemison | NASA | NA | NA | Decatur | Alabama | 20104 | NA | | 221-134-4646 |
| Carl Sagan | Smithsonian | NA | NA | Washington | D.C. | 33321 | NA | | 999-999-4422 |
| W. E. B. Du Bois | NAACP | NA | NA | Great Barrington | MA | 1230 | NA | | 999-000-1234 |
| Florence Nightingale | Public Health Co | NA | NA | Florence | IT | 33225 | NA | | 123-456-7899 |
| Galileo Galilei | NASA | NA | NA | Pisa | IT | 12345 | NA | | 111-888-9944 |
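As an aside, rename_at() is superseded in dplyr 1.0+, and the same shortening can be written with rename_with(). A sketch on a made-up two-column tibble standing in for survey_data:

```r
library(dplyr)
library(stringr)

# Toy data; the names follow the double_header convention above
toy <- tibble(
  please_provide_your_contact_information_name    = "Mae Jemison",
  please_provide_your_contact_information_company = "NASA",
  respondent_id = 1
)

contact_info <- toy %>%
  select(starts_with("please_provide_your_contact_information")) %>%
  rename_with(~ str_remove(.x, "please_provide_your_"))

names(contact_info)
#> [1] "contact_information_name"    "contact_information_company"
```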

ends_with is really helpful in combination with our double_header function and SurveyMonkey data, because open-ended or free-text response columns all end with _open_ended_response.

Depending on the report or project you’re building, it’s sometimes useful to add all of your free-text responses as dataframes in a list, to include as an appendix, or to select important comments to include in an executive summary.

ends_with example:

open_ended <- survey_data %>%
  select(ends_with("open_ended_response")) %>%
  map(as.data.frame)

print(names(open_ended))
## [1] "describe_your_perfect_day_off_open_ended_response"
## [2] "describe_your_perfect_pizza_open_ended_response"
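If you’d rather collect every free-text answer into one long dataframe (say, to kable the whole appendix at once), pivot_longer does it in a single step. A sketch with made-up responses standing in for survey_data:

```r
library(dplyr)
library(tidyr)

# Toy data; names follow the double_header convention above
toy <- tibble(
  describe_your_perfect_day_off_open_ended_response = c("Hiking", "Reading"),
  describe_your_perfect_pizza_open_ended_response   = c("Margherita", "Plain"),
  respondent_id = 1:2
)

# One row per respondent per question
appendix <- toy %>%
  select(ends_with("open_ended_response")) %>%
  pivot_longer(everything(), names_to = "question", values_to = "response")
```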

That’s it for this post! I’d love to hear from you if you found this workflow helpful, or if there is any way it could be improved.

Some ideas for future posts include building this function into a package (for the sole purpose of learning how to build R packages), showing a few tricks I’ve learned with the HH and lattice packages for visualizing Likert scale responses, and some more trivial posts about Rummy, House Hunters, and any other reality TV my wife and I are currently fixated on.

Matt Roumaya

Data Analyst in Philadelphia, PA