Welcome back to Rawlex. Congratulations on successfully loading your data into R. I guarantee you that was the hardest step. Now you just need to analyze what you have loaded. As I said at the end of the last module, the data you get are often messy, incomplete, and/or contain more information than you need. Thus, in this module, you will learn how to manage data. There are many ways to do the same thing in R. Here, you will learn how to manage your data using base R (the set of functions and packages that come built in with the R programming language; when you download R, you already have these). In the next module, we will learn how to manage data using other packages.

In this module, my example will be drawn from Boysen et al. (2022) - a very interesting paper you should read, by the way. Broadly, the authors demonstrate that people believe women represent the majority of people in both the psychology major and the psychology profession, and that these perceptions affect their beliefs about who fits the field.

Specifically, we will be looking at their Studies 2A and 2B. Their datasets may be downloaded here.

First, let’s load them into R. Note that I renamed the files to Study2a and Study2b before loading them, and I used my own working directory. You should load them from your own working directory, which probably involves a different file path.

Stop: Do you still remember how to load data into R? Try to do it yourself before looking at my code.

setwd("C:/Users/linht/Downloads/RBlog/Module3")
library(haven)

## Warning: package 'haven' was built under R version 4.2.3

Study2A <- read_sav("Study2a.sav")
Study2B <- read_sav("Study2b.sav")

Stop: Check that your data loaded correctly (instructions are in the last module). You should do this yourself. I believe in you.

Goals for this module

By the end of this module, you will

1, Familiarize yourself with different forms of data management, including subsetting and merging, using base R commands.

2, Understand how to transform variables and create new ones, using base R commands.

Subsetting Data

As you open Study2A, you will see a lot of variables you do not need (e.g., First Name and Last Name, which you should delete for anonymity; you will not need Start Date or End Date either). Thus, you will need to create a subset of the original dataset that excludes those variables. To do this, we use this code.

Study2A_subset1 <- subset(Study2A, select = c(q0003_0001, q0003_0002, q0003_0003, q0003_0004, q0003_0005,
                                             q0003_0006, q0003_0007, q0003_0008, q0004_0001, q0004_0002,
                                             q0004_0003, q0004_0004, q0004_0005, q0004_0006, q0004_0007,
                                             q0004_0008, q0005_0001, q0005_0002, q0005_0004, q0005_0005,
                                             q0005_0006, q0006_0001, q0006_0002, q0006_0003, q0006_0004,
                                             q0006_0005, q0006_0006, q0007))

What that code does is create a new dataframe called Study2A_subset1, which contains only the variables you selected from Study2A, the variables you are interested in.

Note: If you give the new dataframe the same name as the old dataframe, R will overwrite the data and you will not get the old dataframe back. In other words, if you write Study2A <- subset(Study2A, select = c()), the original Study2A is replaced and cannot be recovered without reloading the file.

If you are neurotic like me and you want to double-check if R did what you wanted it to, you can always check the names of the remaining variables in the new dataframe (we talked about how to do this last module).

names(Study2A_subset1)

##  [1] "q0003_0001" "q0003_0002" "q0003_0003" "q0003_0004" "q0003_0005"
##  [6] "q0003_0006" "q0003_0007" "q0003_0008" "q0004_0001" "q0004_0002"
## [11] "q0004_0003" "q0004_0004" "q0004_0005" "q0004_0006" "q0004_0007"
## [16] "q0004_0008" "q0005_0001" "q0005_0002" "q0005_0004" "q0005_0005"
## [21] "q0005_0006" "q0006_0001" "q0006_0002" "q0006_0003" "q0006_0004"
## [26] "q0006_0005" "q0006_0006" "q0007"
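
If typing out every column name feels tedious, one base R alternative is to pick columns by a naming pattern. The sketch below is only an illustration: it uses a small made-up dataframe (toy, with hypothetical values) rather than the real Study2A, and it assumes the columns you want share a prefix such as q0003_.

```r
# Toy dataframe standing in for Study2A (values are made up for illustration)
toy <- data.frame(
  FirstName  = c("A", "B", "C"),
  q0003_0001 = c(1, 2, 3),
  q0003_0002 = c(4, 5, 6),
  q0007      = c(45, 45, 12)
)

# grepl() returns TRUE for every column name that starts with "q0003_"
keep <- grepl("^q0003_", names(toy))

# Keep only those columns; drop = FALSE keeps the result a dataframe
toy_q0003 <- toy[, keep, drop = FALSE]

# Check which columns remain
names(toy_q0003)
```

This only works when the column names follow a consistent pattern, so listing the names by hand, as above, remains the safer choice when they do not.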

What if you want to drop variables instead of handpicking the variables you want to keep? Easy, just add a minus sign (-) in front of c().

Study2A_subset2 <- subset(Study2A, select = -c(StartDate, EndDate, EmailAddress, FirstName, LastName))

Double-check please.

names(Study2A_subset2)

##  [1] "CollectorNm"  "RespondentID" "CollectorID"  "CustomData1"  "q0001_0001"  
##  [6] "q0002"        "q0003_0001"   "q0003_0002"   "q0003_0003"   "q0003_0004"  
## [11] "q0003_0005"   "q0003_0006"   "q0003_0007"   "q0003_0008"   "q0004_0001"  
## [16] "q0004_0002"   "q0004_0003"   "q0004_0004"   "q0004_0005"   "q0004_0006"  
## [21] "q0004_0007"   "q0004_0008"   "q0005_0001"   "q0005_0002"   "q0005_0003"  
## [26] "q0005_0004"   "q0005_0005"   "q0005_0006"   "q0006_0001"   "q0006_0002"  
## [31] "q0006_0003"   "q0006_0004"   "q0006_0005"   "q0006_0006"   "q0007"       
## [36] "VAR00007"     "q0008"        "q0009"        "q0010"        "q0011"       
## [41] "q0012"        "q0013"        "q0014"        "q0015"        "clicks"      
## [46] "attention"    "attention1"   "attention2"   "attention3"   "x"           
## [51] "filter_$"

In the study, the authors set up a bot check by having participants answer “45” regardless of the question. This variable is q0007. Before we conduct our analysis, we would want to remove the potential bots who failed this bot check (i.e., did not respond 45), right? We can also use subset() to do this, this time giving it a condition on the rows instead of a list of columns.

Study2A_nobots <- subset(Study2A, q0007 == "45")
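
Here is a small sketch of the same idea on made-up data, so you can see what subset() does to the rows. The dataframe toy and its values are assumptions for illustration only, not the real Study2A. Note that subset() also drops rows where the condition is missing (NA).

```r
# Toy dataframe: five hypothetical participants and their bot-check answers
toy <- data.frame(
  id    = 1:5,
  q0007 = c(45, 45, 12, NA, 45)
)

# Keep only the rows where the bot-check answer is 45
toy_nobots <- subset(toy, q0007 == 45)

# Compare the number of rows before and after
nrow(toy)         # 5 rows to start with
nrow(toy_nobots)  # 3 rows remain: the wrong answer and the NA are both removed
```

Comparing nrow() before and after is a quick way to see how many participants the filter removed.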

Merge data

Great. Now you know how to create a subset of a larger dataframe. Let’s learn how to merge dataframes.

Sometimes the data you want come in two separate datasets. In this case, you have to merge the two (or more) files together. The key thing to remember with merging is that the observations in both datasets must be defined or identified the same way. For example, Study2A and Study2B basically collect the same variables from participants. The only difference is that Study2A was collected online and Study2B was collected in person. Here is the code to merge them.

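# all = TRUE keeps the rows from both dataframes, even rows with no match in the other one
merged_data <- merge(x = Study2A, y = Study2B, all = TRUE)

# Always double check!
View(merged_data)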
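
To see what merge() does when two dataframes share all of their column names, here is a sketch with two made-up dataframes (online and inperson are hypothetical stand-ins for Study2A and Study2B). By default, merge() keeps only the rows that match on every shared column; adding all = TRUE keeps the unmatched rows from both sides, which is what stacks the two groups of participants. rbind() is another base R way to stack rows, assuming both dataframes have exactly the same columns.

```r
# Two toy dataframes with the same columns but different participants
online   <- data.frame(id = c(1, 2), score = c(10, 20))
inperson <- data.frame(id = c(3, 4), score = c(30, 40))

# Default merge: only rows identical on all shared columns are kept
inner <- merge(x = online, y = inperson)

# all = TRUE: rows from both dataframes are kept
stacked <- merge(x = online, y = inperson, all = TRUE)

# rbind() stacks the rows directly when the columns are the same
stacked_rbind <- rbind(online, inperson)

nrow(inner)          # 0, because no participant appears in both dataframes
nrow(stacked)        # 4
nrow(stacked_rbind)  # 4
```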

Ok, what about when your data come in two separate files and you need to merge the variables together, matching on a participant ID? Although I do not have an example for this in the dataset, here is the code you would use.

merged_data <- merge(
  x = Dataframe1,                    # The first data frame to be merged
  y = Dataframe2,                    # The second data frame to be merged
  by.x = c("column_a", "column_b"),  # The columns to use as the merge keys in the first data frame
  by.y = c("column_x", "column_y")   # The columns to use as the merge keys in the second data frame
)

Note: You can merge on as many merge keys/columns as you want. However, make sure to keep the order consistent across the dataframes (e.g., column_a in the first dataframe must correspond to column_x in the second dataframe; column_b in the first dataframe must correspond to column_y in the second dataframe).
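
Since the dataset has no example of this, here is a minimal sketch with made-up data. The dataframes demographics and responses, and the key columns participant_id and pid, are all hypothetical names chosen for illustration. It uses a single merge key, but the same pattern extends to several.

```r
# First toy dataframe: one row per participant, keyed by participant_id
demographics <- data.frame(
  participant_id = c(1, 2, 3),
  age            = c(19, 22, 20)
)

# Second toy dataframe: the same kind of ID, stored under a different name (pid)
responses <- data.frame(
  pid   = c(2, 3, 4),
  score = c(3.5, 4.0, 2.5)
)

# Match rows where participant_id in the first equals pid in the second
matched <- merge(
  x = demographics,
  y = responses,
  by.x = "participant_id",
  by.y = "pid"
)

matched
```

Only participants 2 and 3 appear in both dataframes, so only they end up in matched. The key column keeps its name from the first dataframe (participant_id).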

Transforming data

Great job mastering subset and merge. Now, we learn how to create new variables by transforming existing ones.

There are various ways to transform a variable. Most of these transformations use R’s arithmetic operators and functions. To transform a variable, just use one of these:

+ Addition

- Subtraction

* Multiplication

/ Division

^ Exponentiation

log() Natural log
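
As a quick sketch of how these are used, the example below creates new variables in a made-up dataframe. The dataframe toy and its columns (rt_ms, correct, total) are hypothetical and are not part of Study2A.

```r
# Toy dataframe: response times in milliseconds and number of correct answers
toy <- data.frame(
  rt_ms   = c(500, 1000, 2000),
  correct = c(8, 9, 10),
  total   = c(10, 10, 10)
)

# Division: convert milliseconds to seconds
toy$rt_sec <- toy$rt_ms / 1000

# Subtraction: number of errors
toy$errors <- toy$total - toy$correct

# Division of one variable by another: proportion correct
toy$prop_correct <- toy$correct / toy$total

# Exponentiation: squared response time in seconds
toy$rt_sec_sq <- toy$rt_sec ^ 2

# Natural log of the response time
toy$log_rt <- log(toy$rt_ms)

toy
```

Each line follows the same pattern: dataframe$new_variable <- some calculation on existing variables.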

For example, if you want to calculate each participant’s average across the eight q0003 items, you would write

# Study2A$q0003avg tells R to create the q0003avg variable in the Study2A dataframe.
# na.rm = TRUE ignores missing values within each row, so a participant's average
# is based on the items they actually answered.
Study2A$q0003avg <- rowMeans(Study2A[, c("q0003_0001", "q0003_0002", "q0003_0003", "q0003_0004",
                                         "q0003_0005", "q0003_0006", "q0003_0007", "q0003_0008")],
                             na.rm = TRUE)
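
To see exactly what na.rm does inside rowMeans(), here is a sketch on a made-up dataframe (toy and its item columns are hypothetical). It compares the averages with and without na.rm = TRUE.

```r
# Toy dataframe: three participants, three items, some answers missing
toy <- data.frame(
  item1 = c(1, 2, NA),
  item2 = c(3, NA, NA),
  item3 = c(5, 4, NA)
)

items <- c("item1", "item2", "item3")

# With na.rm = TRUE, missing answers are skipped within each row
toy$avg_narm <- rowMeans(toy[, items], na.rm = TRUE)

# Without na.rm, any missing answer makes the whole row's average NA
toy$avg_strict <- rowMeans(toy[, items])

toy
```

The second participant answered two of the three items, so avg_narm is the mean of those two answers while avg_strict is NA. The third participant answered nothing, so avg_narm is NaN (there is nothing to average).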

Finally, we can recode categorical variables. For example, in Study 2A, q0009 is the variable for participants’ gender, with 1 being male and 2 being female. To recode with the recode() function used here, we need the package “car”. We can recode that variable as follows:

install.packages("car")

library(car)

## Loading required package: carData

Study2A$q0009 <- as.numeric(Study2A$q0009) #I am converting this variable to a numerical variable.
Study2A$q0009 <- recode(Study2A$q0009, "1='male'; 2='female'")

#Don't forget to check
table(Study2A$q0009)

## 
##      3 female   male 
##      1     78     49
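
Since this module is about base R, here is a sketch of one way to do a similar recode without the car package, using factor(). It runs on a made-up vector (gender_code is hypothetical), and it assumes only the two codes described above, 1 and 2, have known labels. Unlike the recode() call above, which leaves other values such as 3 unchanged, factor() turns any code not listed in levels into NA.

```r
# Toy vector of gender codes; 3 has no label in the text above
gender_code <- c(1, 2, 2, 3, 1)

# Map 1 to "male" and 2 to "female"; unlisted codes become NA
gender <- factor(gender_code, levels = c(1, 2), labels = c("male", "female"))

# useNA = "ifany" makes the table show how many values became NA
table(gender, useNA = "ifany")
```

Whether an unlabeled code should stay as it is or become missing depends on what that code means in the codebook, so check before choosing either approach.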

Practice makes perfect

Download the data for Study 3 in the same project. The datasets may be downloaded here.

1, Load the data into R

2, Eliminate the following variables from the original dataset: CollectorNm, RespondentID, CollectorID, StartDate, EndDate, EmailAddress, FirstName, LastName, CustomData1

3, Variable q0005_0006 needs to be reverse coded (5 becomes 1, 4 becomes 2, 2 becomes 4, and 1 becomes 5). Reverse code this variable and name it q0005_0006r.

4, This dataset has 4 main variables:

Positive masculine traits: q0006_0005, q0007_0009, q0006_0007, q0004_0007, q0005_0006r, q0007_0002, q0007_0004, q0004_0008

Negative masculine traits: q0006_0004, q0007_0008, q0004_0005, q0004_0006, q0005_0007, q0005_0004

Positive feminine traits: q0007_0001, q0007_0005, q0005_0009, q0005_0010, q0006_0009, q0007_0006, q0004_0003, q0004_0009

Negative feminine traits: q0007_0007, q0006_0001, q0004_0004, q0007_0003, q0005_0008, q0006_0006, q0005_0003, q0006_0003

Calculate these variables.

5, q0011 is the variable that reports the participants’ race.

1 denotes Asian

2 denotes African American/Black

3 denotes Latino/a/Hispanic

4 denotes Multiracial

5 denotes Native American/Alaskan Native

6 denotes White

7 denotes Other race

Dummy code this variable.

Concluding words

We have finished learning the most basic steps of data management using base R - subsetting, merging, and transforming variables. In the next module, we will continue to learn about data management, but this time, we will explore the powerful package “tidyverse”.

References

Boysen, G. A., Chicosky, R. L., Rose, F. R., & Delmore, E. E. (2022). Evidence for a gender stereotype about psychology and its effect on perceptions of men’s and women’s fit in the field. The Journal of Social Psychology, 162(4), 485–503. https://doi.org/10.1080/00224545.2021.1921682

Module 3: Data Management I

Alex Tran (Tran Ngo Quang Anh)

2023-12-06