Welcome back to Rawlex. In the last module, we went through different ways of visualizing your data using base R commands. In this module, we will learn data visualization using ggplot2 (part of the Tidyverse). Then, you will be able to determine for yourself which ways best fit your goals and coding style.
In this module, we will use the same data as in the last module, from Joshi et al. (2023). Here is the paper. As a refresher, this paper examines how leaders’ facial structures affect people’s perceptions of the opportunities available when working with such leaders. You can download their open-access data here.
By the end of this module, you will be able to:
1, Make basic univariate figures in ggplot2.
2, Customize your figures using ggplot2.
3, Create bivariate visualizations using ggplot2.
The ggplot2 package was designed to create professional-looking figures in R. Its syntax is slightly different from that of the base R plotting functions, and it gives you more fine-grained control over the aesthetics of your figure, so ggplot2 figures can be more elaborate and flexible than those made with base R. Let’s get right to it.
Let’s start with the core ggplot2 command:
ggplot(data=dataname, # data to be used
aes(x= , y= )) # x and y axes information
That’s the core of ggplot2. Anything else, such as changing the color scheme, can be added after it using a + (this will be demonstrated later).
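To make the + idea concrete, here is a minimal sketch that builds a figure one piece at a time. It uses mtcars, a dataset that ships with R, as a stand-in for your own data, and the object names p, p1, p2, and p3 are just my choices for illustration.

```r
library(ggplot2)

# mtcars ships with R; it stands in for your own data here
p <- ggplot(data = mtcars, aes(x = wt, y = mpg))  # core command: data + axes

p1 <- p + geom_point()   # add a layer of points
p2 <- p1 + theme_bw()    # add a white-background theme
p3 <- p2 + ggtitle("Car weight and fuel efficiency")  # add a title

p3  # typing the object's name draws the figure
```

Each + returns a new plot object, so you can stop at any step and look at what you have so far.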
Here are the commands for the most basic figures:
As you can see, most of these commands are somewhat intuitive. You do not have to do much memorization here. And, as usual, practice makes perfect. The more you do it, the more you will fall in love with ggplot2 (hopefully).
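For reference, here is a minimal sketch of the geoms this module uses, one per figure type. It runs on R's built-in mtcars data (my stand-in, not the Joshi et al. data), and the object names are only for illustration.

```r
library(ggplot2)

# cyl is numeric in mtcars; factor() makes it categorical for bars and boxes
cars <- transform(mtcars, cyl = factor(cyl))

hist_plot    <- ggplot(cars, aes(x = mpg)) + geom_histogram(bins = 10)  # histogram
bar_plot     <- ggplot(cars, aes(x = cyl)) + geom_bar()                 # bar plot
box_plot     <- ggplot(cars, aes(x = cyl, y = mpg)) + geom_boxplot()    # box plot
scatter_plot <- ggplot(cars, aes(x = wt, y = mpg)) + geom_point()       # scatterplot

scatter_plot  # print any of the objects to draw it
```

Notice that only the geom_ part and the variables inside aes() change from one figure type to the next.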
To begin, let’s load in Study1 as we did in the last module along with the package.
setwd("C:/Users/linht/Downloads/RBlog/Module6/") # don't forget to change this to your own directory
library(haven)
## Warning: package 'haven' was built under R version 4.2.3
Study1 <- read_sav("study1a_dataset.sav")
# Check it out!
head(Study1)
## # A tibble: 6 × 22
## Block age pGender race education polictical MTageend FTageend MDageend
## <dbl+lb> <dbl> <dbl+l> <dbl+l> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl>
## 1 1 [bloc… 21 1 [Mal… 1 [Whi… 2 [Assoc… 4 [Somewh… 3.75 3.88 4.11
## 2 2 [bloc… 21 1 [Mal… 7 [His… 2 [Assoc… 4 [Somewh… 4 4 4.62
## 3 2 [bloc… 21 2 [Fem… 1 [Whi… 2 [Assoc… 3 [Modera… 5 5.38 5.5
## 4 2 [bloc… 22 2 [Fem… 1 [Whi… 1 [High … 2 [Somewh… 4.75 3.88 6.12
## 5 2 [bloc… 22 2 [Fem… 4 [Mul… 2 [Assoc… 3 [Modera… 4.88 4.5 4.25
## 6 2 [bloc… 22 1 [Mal… 1 [Whi… 1 [High … 4 [Somewh… 5.5 6.75 6.25
## # ℹ 13 more variables: FDageend <dbl>, MTcomend <dbl>, FTcomend <dbl>,
## # MDcomend <dbl>, FDcomend <dbl>, MTageopp <dbl>, FTageopp <dbl>,
## # MDageopp <dbl>, FDageopp <dbl>, MTcomopp <dbl>, FTcomopp <dbl>,
## # MDcomopp <dbl>, FDcomopp <dbl>
names(Study1)
## [1] "Block" "age" "pGender" "race" "education"
## [6] "polictical" "MTageend" "FTageend" "MDageend" "FDageend"
## [11] "MTcomend" "FTcomend" "MDcomend" "FDcomend" "MTageopp"
## [16] "FTageopp" "MDageopp" "FDageopp" "MTcomopp" "FTcomopp"
## [21] "MDcomopp" "FDcomopp"
#load in the ggplot2 package
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
As we learned in the previous module, histograms are great univariate displays for interval/continuous variables. They show the distribution of the variable (i.e. the values of the variable and the number of observations at each value). On the horizontal, or x, axis, there are ranges of values of the variable. On the vertical, or y, axis, there is the frequency, or the number of observations of that variable that fall into the x-axis range. Histograms can give you a sense of the mean, median, mode, and skewness of a variable.
Let’s try to make a histogram for the FDageend variable (for a refresher on what the variable names mean, visit the last module, module 6).
ggplot(data=Study1,
aes(x=FDageend)) +
geom_histogram() #try to see how this resembles the core command of ggplot2 as shown above
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As in the last module, this first draft is somewhat rough. Let’s try adding a title for the whole histogram and a title for the x-axis.
ggplot(data=Study1,
aes(x=FDageend)) +
geom_histogram() +
ggtitle("Distribution of the Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces", subtitle= "Study1") +
xlab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Next, we can change how wide or narrow each bar of the histogram is by changing the number of bins. The default is bins = 30. Let’s lower it to 20, which makes each bar a little wider.
ggplot(data=Study1,
aes(x=FDageend)) +
geom_histogram(bins=20) + #new addition here
ggtitle("Distribution of the Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces", subtitle= "Study1") +
xlab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces")
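The message ggplot2 printed earlier mentions binwidth, which is the other way to control the bars. Here is a minimal sketch comparing the two on simulated 1-7 ratings (made-up numbers standing in for a variable like FDageend, not the real data).

```r
library(ggplot2)

set.seed(1)
# Simulated 1-7 ratings, standing in for a variable like FDageend
toy <- data.frame(rating = pmin(pmax(rnorm(200, mean = 4.5, sd = 1), 1), 7))

by_bins  <- ggplot(toy, aes(x = rating)) + geom_histogram(bins = 20)       # ask for 20 bars
by_width <- ggplot(toy, aes(x = rating)) + geom_histogram(binwidth = 0.5)  # ask for bars 0.5 wide

# layer_data() returns one row per bar, so we can compare the two
nrow(layer_data(by_bins))
nrow(layer_data(by_width))
```

Use bins when you care about how many bars there are, and binwidth when you care about how much of the scale each bar covers.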
Ok. But the color makes it a little hard to distinguish between bars, right? Let’s change the color of the bars and the outline color of the bars.
ggplot(data=Study1,
aes(x=FDageend)) +
geom_histogram(bins=20, color="black", fill="white") + #new addition here
ggtitle("Distribution of the Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces", subtitle= "Study1") +
xlab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces")
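Finally, let’s change the background of the figure from grey to white using theme_bw(). Note that the code below also switches the bars to a grey fill with white outlines.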
ggplot(data=Study1,
aes(x=FDageend)) +
geom_histogram(bins=20, color="white", fill="grey") +
theme_bw() + #new addition here
ggtitle("Distribution of the Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces", subtitle= "Study1") +
xlab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces")
And that’s it. The code may look overwhelming, but it is just the core command with a lot of aesthetic commands added on, one at a time (yes, aesthetics matter!).
A great way to show the distribution of a nominal variable is to use a barplot. Let’s use a barplot to see the distribution of participants’ gender (pGender).
#First, let's recode
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Study1 <- Study1 %>%
mutate(pGender = case_when(
pGender == 1 ~ "Male",
pGender == 2 ~ "Female",
pGender == 3 ~ "Transgender",
pGender == 5 ~ "Rather not say",
TRUE ~ NA_character_ # Default value if none of the conditions are met
))
#ok, to the plot
ggplot(data=Study1,
aes(x=pGender)) +
geom_bar()
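If the recoding step above feels opaque, here is a minimal sketch of case_when() on a toy vector. The codes vector is made up for illustration; the code-to-label pairs are copied from the recode above.

```r
library(dplyr)

# Toy codes; which number means which gender is taken from the recode above
codes <- c(1, 2, 5, 4, NA)

labels <- case_when(
  codes == 1 ~ "Male",
  codes == 2 ~ "Female",
  codes == 3 ~ "Transgender",
  codes == 5 ~ "Rather not say",
  TRUE ~ NA_character_   # anything unmatched (here 4 and NA) becomes NA
)

labels
table(labels, useNA = "ifany")  # count each label, including the NAs
```

case_when() checks the conditions from top to bottom and uses the first one that is true, which is why the catch-all TRUE line goes last.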
Now, let’s add some labels (titles for the figure, label for x and y axes).
ggplot(data=Study1, aes(x=pGender)) +
geom_bar() +
ggtitle("Bar Chart of Participants' Gender") +
xlab("Participants' Gender") +
ylab("Count")
Let’s flip it horizontally, just to see how it looks.
ggplot(data=Study1, aes(x=pGender)) +
geom_bar() +
coord_flip() + #new addition here
ggtitle("Bar Chart of Participants' Gender") +
xlab("Participants' Gender") +
ylab("Count")
Making the bars horizontal actually makes the figure quite confusing. Let’s not do that. How about changing the colors of the bars?
ggplot(data=Study1, aes(x=pGender, fill=pGender)) + #new addition here
geom_bar() +
ggtitle("Bar Chart of Participants' Gender") +
xlab("Participants' Gender") +
ylab("Count")
Again, let’s change our theme to black and white as before, and also remove the legend of the figure.
ggplot(data=Study1, aes(x=pGender, fill=pGender)) +
geom_bar() +
ggtitle("Bar Chart of Participants' Gender") +
xlab("Participants' Gender") +
ylab("Count") +
theme_bw() + #remove the grey background color
theme(legend.position = "none") #remove the legends
In this dataset, we do not have anyone who chose not to report their gender (i.e., there are no NA values). This is quite rare. What if we did have NAs and wanted to remove them? We would filter them out while making the figure! That’s the beauty of the Tidyverse: you do not need two separate commands, one for filtering the NA values and one for creating the figure. You can merge them into one command. Here is how you do it.
#filter() comes from dplyr, which we already loaded above for the recode
library(dplyr)
#now let's remove all the NA values
ggplot(data=Study1 %>% filter(!is.na(pGender)), #filter goes here
aes(x=pGender, fill=pGender)) +
geom_bar() +
ggtitle("Bar Chart of Participants' Gender") +
xlab("Participants' Gender") +
ylab("Count") +
theme_bw() +
theme(legend.position = "none")
Perfect.
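Since Study1 has no missing genders, the filter above does not visibly change anything. Here is a minimal sketch on a made-up data frame (toy is my invention, with two NAs planted in it) that shows the filter doing its job.

```r
library(dplyr)
library(ggplot2)

# Toy data with two missing genders (made up for illustration)
toy <- data.frame(pGender = c("Male", "Female", NA, "Female", NA, "Male", "Female"))

kept <- toy %>% filter(!is.na(pGender))  # the same filter used inside ggplot() above
nrow(toy)   # number of rows before filtering
nrow(kept)  # number of rows after filtering

# The filtered data can go straight into ggplot() without being saved first
p <- ggplot(data = toy %>% filter(!is.na(pGender)), aes(x = pGender)) +
  geom_bar()
p
```

Without the filter, geom_bar() would draw an extra bar labelled NA.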
Box plots are great for presenting bivariate relationships (meaning two variables). However, one of the variables must be either nominal or ordinal (categorical) while the other must be interval (or continuous) level data.
Let’s try to graph the distribution of FDageend across pGender.
ggplot(data=Study1 %>% filter(!is.na(pGender)) %>% filter(!is.na(FDageend)),
aes(x=pGender, y=FDageend)) +
geom_boxplot()
Now let’s add the rest as we have been doing for the other types of graphs.
ggplot(data=Study1 %>% filter(!is.na(pGender)) %>% filter(!is.na(FDageend)),
aes(x=pGender, y=FDageend)) +
geom_boxplot() +
ggtitle("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces by Participants' Gender") +
xlab("Participants' Gender")+
ylab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces") +
theme_bw()
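To see what the boxes are actually drawing, here is a minimal sketch that computes the quartiles and median per group by hand and builds the matching box plot. It uses the built-in mtcars data as a stand-in (cyl as the categorical variable, mpg as the continuous one), not Study1.

```r
library(dplyr)
library(ggplot2)

cars <- mtcars %>% mutate(cyl = factor(cyl))  # make cyl categorical

# The box spans the lower to upper quartile; the line inside is the median
box_numbers <- cars %>%
  group_by(cyl) %>%
  summarise(
    lower_quartile = quantile(mpg, 0.25),
    median = median(mpg),
    upper_quartile = quantile(mpg, 0.75)
  )
box_numbers

box <- ggplot(data = cars, aes(x = cyl, y = mpg)) +
  geom_boxplot()
box
```

Comparing the table with the figure is a quick way to check that you are reading the boxes correctly.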
Scatterplots show relationships between two continuous variables. Personally, I like ggplot2 scatterplots so much better than base R scatterplots.
Let’s graph the relationship between FDageend and FDcomend.
ggplot(data=Study1,
aes(x=FDageend , y=FDcomend)) +
geom_point()
That looks like a positive correlation alright. Let’s add what we have learned into it per the usual.
ggplot(data=Study1,
aes(x=FDageend , y=FDcomend)) +
geom_point() +
xlab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces") +
ylab("Perceived Communal Goals Endorsement for Leaders with Female Dominant Faces") +
ggtitle("The Relationship between Perceived Agentic Goals Endorsement and Perceived Communal Goals Endorsement for Leaders with Female Dominant Faces") +
theme_bw()
The beauty of ggplot2 is that I can even add in another categorical variable (e.g., pGender) by giving each level of that variable (e.g., male vs. female) its own color. Here is how you do it.
ggplot(data=Study1,
aes(x=FDageend , y=FDcomend, color=pGender)) + #addition here
geom_point() +
xlab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces") +
ylab("Perceived Communal Goals Endorsement for Leaders with Female Dominant Faces") +
ggtitle("The Relationship between Perceived Agentic Goals Endorsement and Perceived Communal Goals Endorsement for Leaders with Female Dominant Faces") +
theme_bw()
Alright, that does not add much to this graph - let’s remove the participants’ gender colors. Finally, and most importantly, you can add in a best-fit line to your scatterplot. Here is how you do it.
ggplot(data=Study1,
aes(x=FDageend , y=FDcomend)) + #color=pGender removed
geom_point() +
xlab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces") +
ylab("Perceived Communal Goals Endorsement for Leaders with Female Dominant Faces") +
ggtitle("The Relationship between Perceived Agentic Goals Endorsement and Perceived Communal Goals Endorsement for Leaders with Female Dominant Faces") +
theme_bw() +
geom_smooth() #new addition here
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
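Not so linear… As the message above says, geom_smooth() used the loess method here, which draws a curve that bends to follow the data rather than a single straight line.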
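If you want a straight best-fit line instead, geom_smooth() accepts method = "lm". Here is a minimal sketch on the built-in mtcars data (a stand-in for Study1); se = FALSE is my choice to hide the confidence band and is optional.

```r
library(ggplot2)

# mtcars stands in for Study1 here
p_lm <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # straight least-squares line, no confidence band
  theme_bw()
p_lm
```

For the module's figure, the same change would mean swapping geom_smooth() for geom_smooth(method = "lm") in the last line of the code.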
The practice for module 7 is similar to the module 6 practice. However, I want you to complete this practice using the Tidyverse (ggplot2) only.
Download the Study 2 data from the same project. The datasets may be downloaded here.
1, Make a density plot and a histogram for the following variables:
MD_AgeAff
MT_AgeAff
FD_AgeAff
FT_AgeAff
MD_ComAff
MT_ComAff
FD_ComAff
FT_ComAff
2, Create a barplot for the following variables:
age
pgender
race
3a, Create a scatterplot to visualize the relationship between these variables:
MD_AgeAff and MD_ComAff
MT_AgeAff and MT_ComAff
FD_AgeAff and FD_ComAff
FT_AgeAff and FT_ComAff
3b, For each of the scatterplots, color the points by participants’ race (“race” variable).
3c, Insert a best-fit line for each of the scatterplots.
4, Create a box-and-whisker plot to visualize the distribution of the MD_AgeAff variable across participants’ race (“race” variable).
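A hint for the first practice item: this module did not show a density plot in ggplot2. Here is a minimal sketch using geom_density() on simulated scores (made-up numbers, not the Study 2 data); swap in the Study 2 data and one of the variables listed above.

```r
library(ggplot2)

set.seed(2)
# Simulated scores; replace with a Study 2 variable such as MD_AgeAff
toy <- data.frame(score = rnorm(150, mean = 4, sd = 1))

density_plot <- ggplot(data = toy, aes(x = score)) +
  geom_density() +  # smooth curve instead of bars
  theme_bw()
density_plot
```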
Congratulations, you have finished all the modules in Data Visualization. This is a crash course on basic Data Visualization. You probably have seen more elaborate figures before. They can all be made in R. With the tools I have provided you, you have the foundation to learn how to make any figure you want and add any features you want. Here is a document with a lot of extra guidelines for data visualization in R. Go wild.
Hehman, E., & Xie, S. Y. (2021). Doing Better Data Visualization. Advances in Methods and Practices in Psychological Science, 4(4), 25152459211045334. https://doi.org/10.1177/25152459211045334
Joshi, M. P., Lloyd, E. P., Diekman, A. B., & Hugenberg, K. (2023). In the Face of Opportunities: Facial Structures of Scientists Shape Expectations of STEM Environments. Personality & Social Psychology Bulletin, 49(5), 673–691. https://doi.org/10.1177/01461672221077801