Welcome back to Rawlex. Consistency is key and I am glad you are back to learn more about R. In the last module, we learned how to manage our data (e.g., subset, merge, and transform) using base R.

In R, there are many ways to do the same thing. It is ultimately up to you to choose whichever way best fits your “coding style” and stick with it. Thus, in this module, I will show you other ways to manage your data, this time using a different set of R packages (i.e., the Tidyverse; hope you still remember what R packages are from Module 1).

Goals for this module

By the end of this module, you will

1, Understand what the Tidyverse is, what its components are, and how they are used.

2, Be familiar with different forms of data management, including subsetting and merging, using the Tidyverse.

3, Understand how to transform variables and create new ones using the Tidyverse.

4, Be able to reorder and reshape your data flexibly.

What is the Tidyverse?

The Tidyverse is a collection of powerful and popular R packages designed to make data manipulation, visualization, and analysis easier and more efficient (Wickham et al., 2019). Each Tidyverse package does a different thing (see image below) but they can also work together (hence, the beauty of the whole Tidyverse).

[Image: The Tidyverse]

Here is a short walkthrough of some of the packages in the Tidyverse:

readr: Importing data
dplyr and tidyr: Cleaning data (what we are learning today)
ggplot2: Visualizing data (more on this in Module 7, so you can skip ahead if you wish)
purrr (I know, right? Lovely name): Iterating over data (applying a function to each element of a list or vector)

What’s so special about the Tidyverse?

It is intuitive and efficient (these are just words for now, but once you learn it, you will know what I mean).
We can link multiple steps into a single command, so your code will look and feel much neater.
For our module, tidyr and dplyr make data management much easier.

Installing the Tidyverse

The commands to install the Tidyverse packages are no different from those for any other package (hope you still remember how). You have 2 options: (1) install the whole Tidyverse, which automatically installs all the packages within it (this may take a while, but not too long), or (2) install only the specific Tidyverse packages you need. I demonstrate both here.

# Installing the whole Tidyverse
install.packages("tidyverse")

#Installing individual Tidyverse packages
install.packages("tidyr")
install.packages("dplyr")

Piping (Pipes) in the Tidyverse

The pipe operator %>% (don’t worry if it looks scary now, it’s not) allows you to chain together multiple operations in a sequence, making it easier to read and understand complex data manipulation or analysis code. Here is an example. Don’t worry about understanding what the functions do for now, just try to understand how the pipes %>% work.

# Here are 2 separate commands, one to filter and one to summarize the filtered data
filtered_data <- filter(example_dataframe, variable1 == "Alex")
summary_data <- summarise(filtered_data, avg_homework = mean(homework))

# We can link these 2 commands into 1 using pipes as follows
summary_data <- example_dataframe %>%                # Get into the dataframe
  filter(variable1 == "Alex") %>%                    # Filter
  summarise(avg_homework = mean(homework))          # Then summarize
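To see the pipe run from start to finish, here is a minimal, self-contained sketch. The values inside example_dataframe are invented purely for illustration (this module does not define them); only the column names variable1 and homework come from the example above.

```r
library(dplyr)

# Hypothetical data: these values are made up for illustration only
example_dataframe <- data.frame(
  variable1 = c("Alex", "Alex", "Amy"),
  homework  = c(80, 90, 70)
)

# Nested version: you have to read it from the inside out
nested <- summarise(
  filter(example_dataframe, variable1 == "Alex"),
  avg_homework = mean(homework)
)

# Piped version: you read it from top to bottom
piped <- example_dataframe %>%
  filter(variable1 == "Alex") %>%
  summarise(avg_homework = mean(homework))

print(piped)
```

Both versions return a single row in which avg_homework is 85 (the mean of 80 and 90). The pipe simply passes the result on its left as the first argument of the function on its right.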

And that’s it for your Introduction to the Tidyverse! Remember to load the packages so you can use them!

library(tidyverse) #if you want everything loaded

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.2
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2

## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## Warning: package 'purrr' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## Warning: package 'stringr' was built under R version 4.2.3

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(tidyr) #if you want individual packages loaded
library(dplyr)

Subsetting Data

Last module, we learned about the “subset” function from base R to subset our dataframes. This module, we will learn how to subset our dataframe using “select” and “filter” commands from the tidyverse.

Let’s use the same dataset as we did last module. For more information on the dataset, please consult module 3.

setwd("C:/Users/linht/Downloads/RBlog/Module4")
library(haven)

## Warning: package 'haven' was built under R version 4.2.3

Study2A <- read_sav("Study2a.sav")

Stop: Make sure the data are loaded in correctly. Try to do this yourself.

As you open Study2A, you will see a lot of variables you do not need (e.g., First Name and Last Name, which you should delete for anonymity; you will not need Start Date or End Date either). Thus, you will need to create a subset of the original dataset that excludes those variables. To do this, we use the following code.

Study2A_subset1 <- Study2A %>% 
  select(q0003_0001, q0003_0002, q0003_0003, q0003_0004, q0003_0005,
         q0003_0006, q0003_0007, q0003_0008, q0004_0001, q0004_0002,
         q0004_0003, q0004_0004, q0004_0005, q0004_0006, q0004_0007,
         q0004_0008, q0005_0001, q0005_0002, q0005_0004, q0005_0005,
         q0005_0006, q0006_0001, q0006_0002, q0006_0003, q0006_0004,
         q0006_0005, q0006_0006, q0007)

# Check it!
head(Study2A_subset1)

## # A tibble: 6 × 28
##   q0003_0001   q0003_0002 q0003_0003 q0003_0004 q0003_0005 q0003_0006 q0003_0007
##   <dbl+lbl>    <dbl+lbl>  <dbl+lbl>  <dbl+lbl>  <dbl+lbl>  <dbl+lbl>  <dbl+lbl> 
## 1 1 [Extremel… 6 [Somewh… 4 [Neithe… 3 [Slight… 5 [Slight… 4 [Neithe… 3 [Slight…
## 2 1 [Extremel… 7 [Extrem… 7 [Extrem… 1 [Extrem… 4 [Neithe… 7 [Extrem… 5 [Slight…
## 3 1 [Extremel… 5 [Slight… 4 [Neithe… 2 [Somewh… 3 [Slight… 2 [Somewh… 5 [Slight…
## 4 4 [Neither … 4 [Neithe… 4 [Neithe… 4 [Neithe… 4 [Neithe… 4 [Neithe… 4 [Neithe…
## 5 4 [Neither … 4 [Neithe… 4 [Neithe… 4 [Neithe… 4 [Neithe… 4 [Neithe… 4 [Neithe…
## 6 3 [Slightly… 5 [Slight… 4 [Neithe… 3 [Slight… 3 [Slight… 4 [Neithe… 4 [Neithe…
## # ℹ 21 more variables: q0003_0008 <dbl+lbl>, q0004_0001 <dbl+lbl>,
## #   q0004_0002 <dbl+lbl>, q0004_0003 <dbl+lbl>, q0004_0004 <dbl+lbl>,
## #   q0004_0005 <dbl+lbl>, q0004_0006 <dbl+lbl>, q0004_0007 <dbl+lbl>,
## #   q0004_0008 <dbl+lbl>, q0005_0001 <dbl+lbl>, q0005_0002 <dbl+lbl>,
## #   q0005_0004 <dbl+lbl>, q0005_0005 <dbl+lbl>, q0005_0006 <dbl+lbl>,
## #   q0006_0001 <dbl+lbl>, q0006_0002 <dbl+lbl>, q0006_0003 <dbl+lbl>,
## #   q0006_0004 <dbl+lbl>, q0006_0005 <dbl+lbl>, q0006_0006 <dbl+lbl>, …
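Typing out every variable name gets tedious. As a hedged sketch, here are three shortcuts that select() supports: matching names by prefix with starts_with(), selecting a range of adjacent columns with a colon, and dropping columns with a minus sign. The survey dataframe below is fabricated and much smaller than Study2A; its column names are only stand-ins.

```r
library(dplyr)

# Fabricated mini-survey (not the real Study2A data)
survey <- data.frame(
  FirstName  = c("Ann", "Bob"),
  LastName   = c("Lee", "Ray"),
  q0003_0001 = c(1, 4),
  q0003_0002 = c(6, 4),
  q0004_0001 = c(2, 5),
  q0007      = c("45", "12")
)

# Keep every column whose name starts with "q0003_", plus q0007
by_prefix <- survey %>% select(starts_with("q0003_"), q0007)

# Keep all columns from q0003_0001 through q0004_0001 (by position)
by_range <- survey %>% select(q0003_0001:q0004_0001)

# Drop the identifying columns and keep everything else
dropped <- survey %>% select(-FirstName, -LastName)

names(by_prefix)
names(by_range)
names(dropped)
```

The minus-sign form is often the shortest route when you only need to remove a handful of variables, as in the anonymity example above.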

Unlike select, the filter function allows you to keep only certain observations (while keeping all the variables). For example, in the study, the authors have set up a bot check by having participants answer “45” regardless of the question. This variable is q0007. Before we conduct our analysis, we would want to remove the potential bots who failed this bot check (i.e., did not respond 45), right? We can use the filter function to do this.

Study2A_nobots <- Study2A %>% filter(q0007 == "45")

# Check it!
summary(Study2A_nobots$q0007)

##    Length     Class      Mode 
##       120 character character
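filter() can also take several conditions at once, and it is worth counting how many rows you removed. The sketch below uses a fabricated responses dataframe; the age column and the cut-off of 18 are assumptions made only for illustration and are not part of Study2A.

```r
library(dplyr)

# Fabricated responses (not the real Study2A data)
responses <- data.frame(
  id    = 1:5,
  q0007 = c("45", "45", "12", NA, "45"),
  age   = c(19, 17, 22, 30, 25)
)

# One condition: keep only those who passed the bot check
passed <- responses %>% filter(q0007 == "45")

# Two conditions separated by a comma: both must be TRUE
passed_adults <- responses %>% filter(q0007 == "45", age >= 18)

# How many rows did the bot check remove?
n_removed <- nrow(responses) - nrow(passed)
print(n_removed)
```

Note that filter() drops rows where the condition is NA as well as rows where it is FALSE, so the participant with a missing q0007 is removed along with the one who answered "12".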

Merging data

Unlike base R, the Tidyverse (or dplyr, to be exact) provides a lot of flexibility in how we can merge 2 dataframes together. We will look at 3 ways to do so: left_join, right_join, and full_join. To demonstrate what each of them does, here is a fabricated example of 2 datasets to be merged.

# Fabricating data

df1 <- data.frame(
  student_id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Alex", "Amy", "Ashley", "Andrew") # I'm not the most creative person when it comes to names
)

# Sample data frame 2
df2 <- data.frame(
  student_id = c(3, 4, 5, 6, 7),
  score = c(85, 92, 78, 88, 76)
)

# Here is what happens for each of those commands

left_joined <- left_join(df1, df2, by = "student_id")
print(left_joined)

##   student_id   name score
## 1          1  Alice    NA
## 2          2   Alex    NA
## 3          3    Amy    85
## 4          4 Ashley    92
## 5          5 Andrew    78

right_joined <- right_join(df1, df2, by = "student_id")
print(right_joined)

##   student_id   name score
## 1          3    Amy    85
## 2          4 Ashley    92
## 3          5 Andrew    78
## 4          6   <NA>    88
## 5          7   <NA>    76

full_joined <- full_join(df1, df2, by = "student_id")
print(full_joined)

##   student_id   name score
## 1          1  Alice    NA
## 2          2   Alex    NA
## 3          3    Amy    85
## 4          4 Ashley    92
## 5          5 Andrew    78
## 6          6   <NA>    88
## 7          7   <NA>    76

As you can see, the left_join() function makes the first dataframe the “reference dataframe” and the structure of the final dataframe resembles the structure of the first dataframe, with added information wherever possible from the second dataframe.

Conversely, the right_join() function makes the second dataframe the “reference dataframe” and the structure of the final dataframe resembles the structure of the second dataframe, with added information wherever possible from the first dataframe.

Finally, the full_join() function keeps every row from both dataframes and fills in NA wherever one dataframe has no matching information.
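dplyr also offers other joins beyond the three shown here. As a brief sketch using the same fabricated df1 and df2, inner_join() keeps only the students who appear in both dataframes, and anti_join() returns the rows of the first dataframe that have no match in the second, which is handy for checking who is missing a score.

```r
library(dplyr)

# Same fabricated data as above
df1 <- data.frame(
  student_id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Alex", "Amy", "Ashley", "Andrew")
)
df2 <- data.frame(
  student_id = c(3, 4, 5, 6, 7),
  score = c(85, 92, 78, 88, 76)
)

# Only students present in BOTH dataframes (ids 3, 4, 5)
inner_joined <- inner_join(df1, df2, by = "student_id")
print(inner_joined)

# Students in df1 with NO match in df2 (ids 1, 2)
unmatched <- anti_join(df1, df2, by = "student_id")
print(unmatched)
```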

Transforming variables

Amazing. You have mastered the filter, select, and join functions. Now, we will learn how to calculate and recode variables.

In Study 2A, q0009 is the variable for participants’ gender, with 1 being male and 2 being female. We can recode that variable using the mutate() function:

Study2A$q0009 <- as.character(Study2A$q0009) # Convert the variable to character so the codes "1" and "2" can be matched as text
Study2A_recoded <- Study2A %>% mutate(gender = recode(q0009, "1" = "male", "2" = "female"))

#Don't forget to check
table(Study2A_recoded$gender)

## 
##      3 female   male 
##      1     78     49
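Notice that the one response coded 3 was left as "3", because recode() keeps any value it was not told how to handle. If you would rather decide explicitly what happens to unlisted values, case_when() is one option. This is only a sketch on a fabricated vector; the label "unlabeled" is a placeholder I made up, since the meaning of code 3 is not described here.

```r
library(dplyr)

# Fabricated gender codes (not the real Study2A data)
toy <- data.frame(q0009 = c("1", "2", "3", "2"))

toy_recoded <- toy %>%
  mutate(gender = case_when(
    q0009 == "1" ~ "male",
    q0009 == "2" ~ "female",
    TRUE         ~ "unlabeled"  # placeholder for any code not listed above
  ))

table(toy_recoded$gender)
```

The conditions are checked from top to bottom, so the final TRUE line acts as a catch-all for everything that did not match earlier.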

What about calculating new variables? Try calculating the average of all the q0003 items.

Study2A_newvar <- Study2A %>% mutate(q0003avg = (q0003_0001 + q0003_0002 + q0003_0003 + q0003_0004
                                         + q0003_0005 + q0003_0006 + q0003_0007 + q0003_0008)/8)
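Adding the items by hand works, but it returns NA for any participant who skipped even one item. A possible alternative, sketched here on fabricated data with hypothetical item names, is to combine rowMeans() with across() and starts_with(). I have not run this on the labelled columns that haven produces for Study2A, so treat it as a starting point rather than a drop-in replacement.

```r
library(dplyr)

# Fabricated items (not the real Study2A data); one response is missing
toy_items <- data.frame(
  item_1 = c(1, 4, 7),
  item_2 = c(3, 4, NA),
  item_3 = c(5, 4, 7)
)

scored <- toy_items %>%
  mutate(
    # Manual version: NA if any single item is missing
    avg_manual = (item_1 + item_2 + item_3) / 3,
    # Helper version: averages whatever items were answered
    avg_helper = rowMeans(across(starts_with("item_")), na.rm = TRUE)
  )

print(scored)
```

Whether averaging over the remaining items is appropriate when some are missing is a decision about your analysis, not about R, so choose na.rm deliberately.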

Reordering data

Let’s say you want to sort the data in Study 2A by the number of Psychology courses participants have taken, in ascending order (this is variable q0013, by the way).

Study2A_sorted <- Study2A %>% arrange(q0013)

What about descending instead of ascending?

Study2A_sorted <- Study2A %>% arrange(desc(q0013))

Easy as pie! You got this.
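arrange() also accepts more than one sorting variable, using the later ones to break ties. Here is a small sketch on fabricated data; the courses and age columns are hypothetical stand-ins, not variables from Study 2A.

```r
library(dplyr)

# Fabricated participants (not the real Study2A data)
toy_people <- data.frame(
  id      = 1:4,
  courses = c(2, 5, 2, 3),
  age     = c(20, 21, 19, 22)
)

# Most courses first; among ties on courses, youngest first
toy_sorted <- toy_people %>% arrange(desc(courses), age)

print(toy_sorted)
```

Participants 1 and 3 both took 2 courses, so the second variable decides their order and participant 3 (age 19) comes before participant 1 (age 20).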

Reshaping data

Alright! This will be the hardest section in this module, but stick with me. You got this. The final thing we can do with tidyr (part of the Tidyverse) is to reshape our data between a wide format and a long format. To illustrate, let’s look at an example.

In this example, you have two intervention plans for STEM students (A - learning with interactive exercises, B - learning without interactive exercises), administered to students across different months. Your dependent measure is the students’ score on the final exam.

Let’s make up this data.

#Fabricating the data

Edu_Data <- data.frame(
  Month = c("Jan", "Jan", "Feb", "Feb", "Mar", "Mar"),
  Intervention = c("A", "B", "A", "B", "A", "B"),
  Scores = c(100, 70, 95, 80, 95, 60)
)

print(Edu_Data)

##   Month Intervention Scores
## 1   Jan            A    100
## 2   Jan            B     70
## 3   Feb            A     95
## 4   Feb            B     80
## 5   Mar            A     95
## 6   Mar            B     60

The pivot_wider() function is used to reshape data from a longer format to a wider format. In this example, we’ll pivot the Edu_Data dataframe so that each unique value in the “Intervention” column becomes a separate column, and the corresponding scores are filled in the new columns.

# Pivot wider to create a separate column of scores for each intervention
Edu_Data_Wide <- pivot_wider(
  data = Edu_Data,
  names_from = Intervention,
  values_from = Scores
)

print(Edu_Data_Wide)

## # A tibble: 3 × 3
##   Month     A     B
##   <chr> <dbl> <dbl>
## 1 Jan     100    70
## 2 Feb      95    80
## 3 Mar      95    60

Wow! We can really see the difference between intervention A (with interactive exercises) and intervention B (without interactive exercises) - remember this is fabricated data!
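One practical benefit of the wide format is that comparing the two interventions becomes a simple column subtraction. The sketch below rebuilds the same fabricated data and adds a Difference column; that column name is my own choice for illustration.

```r
library(dplyr)
library(tidyr)

# Same fabricated data as above
Edu_Data <- data.frame(
  Month = c("Jan", "Jan", "Feb", "Feb", "Mar", "Mar"),
  Intervention = c("A", "B", "A", "B", "A", "B"),
  Scores = c(100, 70, 95, 80, 95, 60)
)

Edu_Data_Diff <- Edu_Data %>%
  pivot_wider(names_from = Intervention, values_from = Scores) %>%
  mutate(Difference = A - B)  # how much higher A scored than B each month

print(Edu_Data_Diff)
```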

The pivot_longer() function is used to reshape data from a wider format to a longer format. In this example, we’ll pivot the Edu_Data_Wide dataframe back to the original format, where the interventions are combined into a single column, and the scores are placed in a separate column.

# Pivot longer to combine the intervention columns into a single column and create a scores column
Edu_Data_Long <- pivot_longer(
  data = Edu_Data_Wide,
  cols = c("A", "B"),
  names_to = "Intervention",
  values_to = "Scores"
)

print(Edu_Data_Long)

## # A tibble: 6 × 3
##   Month Intervention Scores
##   <chr> <chr>         <dbl>
## 1 Jan   A               100
## 2 Jan   B                70
## 3 Feb   A                95
## 4 Feb   B                80
## 5 Mar   A                95
## 6 Mar   B                60

Look at you, flexibly reshaping data.
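When there are many columns to gather, listing each one in cols becomes impractical. As a sketch, you can instead tell pivot_longer() which column to leave alone by putting a minus sign in front of it. The wide dataframe is rebuilt here by hand so the example runs on its own.

```r
library(tidyr)

# Same fabricated wide data as above, built directly
Edu_Data_Wide <- data.frame(
  Month = c("Jan", "Feb", "Mar"),
  A = c(100, 95, 95),
  B = c(70, 80, 60)
)

# Pivot every column EXCEPT Month
Edu_Data_Long <- pivot_longer(
  data = Edu_Data_Wide,
  cols = -Month,
  names_to = "Intervention",
  values_to = "Scores"
)

print(Edu_Data_Long)
```

The result has the same six rows as the earlier long format, one for each combination of month and intervention.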

Practice makes perfect

This is the same practice as in Module 3. However, I want you to complete these practice exercises using the Tidyverse commands we learned in this Module 4 instead of base R commands. The purpose of this is for you to experience different methods of doing the same thing in R and determine for yourself which way you prefer.

Download the data for Study 3 in the same project. The datasets may be downloaded here.

1, Load the data into R

2, Eliminate the following variables from the original dataset:

- CollectorNm
- RespondentID
- CollectorID
- StartDate
- EndDate
- EmailAddress
- FirstName
- LastName
- CustomData1

3, Variable q0005_0006 needs to be reverse coded (5 becomes 1, 4 becomes 2, 2 becomes 4, and 1 becomes 5). Reverse code this variable and name it q0005_0006r.

4, This dataset has 4 main variables:

Positive masculine traits: q0006_0005, q0007_0009, q0006_0007, q0004_0007, q0005_0006r, q0007_0002, q0007_0004, q0004_0008

Negative masculine traits: q0006_0004, q0007_0008, q0004_0005, q0004_0006, q0005_0007, q0005_0004

Positive feminine traits: q0007_0001, q0007_0005, q0005_0009, q0005_0010, q0006_0009, q0007_0006, q0004_0003, q0004_0009

Negative feminine traits: q0007_0007, q0006_0001, q0004_0004, q0007_0003, q0005_0008, q0006_0006, q0005_0003, q0006_0003

Calculate these variables.

5, q0011 is the variable that reports the participants’ race.

1 denotes Asian

2 denotes African American/Black

3 denotes Latino/a/Hispanic

4 denotes Multiracial

5 denotes Native American/Alaskan Native

6 denotes White

7 denotes Other race

Dummy code this variable.

Concluding words

We have finished learning the basics of data management, both using base R and Tidyverse commands. Next up, describing your data. Get excited!

References

Boysen, G. A., Chicosky, R. L., Rose, F. R., & Delmore, E. E. (2022). Evidence for a gender stereotype about psychology and its effect on perceptions of men’s and women’s fit in the field. The Journal of Social Psychology, 162(4), 485–503. https://doi.org/10.1080/00224545.2021.1921682

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the Tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Module 4: Data Management II

Alex Tran (Tran Ngo Quang Anh)

2023-12-06