Welcome back to Rawlex. In the last module, we went through different ways of visualizing your data using base R commands. In this module, we will learn data visualization using ggplot2 (part of the Tidyverse). Then, you will be able to determine for yourself which ways best fit your goals and coding style.

In this module, we will use the same data as in the last module, from Joshi et al. (2023). Here is the paper. As a refresher, this paper examines how leaders’ facial structures affect people’s perceptions of the opportunities available when working with such leaders. You can download their data, available for open access, here.

Goals for this module

By the end of this module, you will:

1, Be able to make basic univariate figures in ggplot2.

2, Be able to customize your figures using ggplot2.

3, Be able to create bivariate visualizations using ggplot2.

Introduction to ggplot2

The ggplot2 package was designed to create professional-looking figures in R. This package uses slightly different syntax than the base R plotting functions. ggplot2 figures can be more elaborate and flexible than those created using base R, because you have more control over the aesthetics of your figure. Let’s get right to it.

Let’s start with the core ggplot2 command

ggplot(data=dataname,           # data to be used
       aes(x=xvar, y=yvar))     # variables for the x and y axes

That’s the core of ggplot2. Anything that comes after, such as changing the color scheme, can be added using a + (this will be demonstrated later).

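Here are the commands for the most basic figures covered in this module: geom_histogram() for histograms, geom_bar() for bar charts, geom_boxplot() for boxplots, and geom_point() for scatterplots.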

As you can see, most of these commands are somewhat intuitive. You do not have to do much memorization here. And, as usual, practice makes perfect. The more you do it, the more you will fall in love with ggplot2 (hopefully).
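
Since this module covers four figure types, here is a minimal sketch of how each one pairs the core ggplot() call with a single geom. It uses R’s built-in mtcars data rather than Study1, purely as a stand-in so that it runs on its own, and the object names (cars, p_hist, and so on) are just illustrative.

```r
library(ggplot2)

# mtcars ships with R; it stands in for Study1 here
cars <- mtcars
cars$cyl <- factor(cars$cyl)  # treat number of cylinders as a categorical variable

# Histogram: one continuous variable
p_hist <- ggplot(data = cars, aes(x = mpg)) + geom_histogram(bins = 10)

# Bar chart: one categorical variable
p_bar <- ggplot(data = cars, aes(x = cyl)) + geom_bar()

# Boxplot: categorical x, continuous y
p_box <- ggplot(data = cars, aes(x = cyl, y = mpg)) + geom_boxplot()

# Scatterplot: two continuous variables
p_point <- ggplot(data = cars, aes(x = wt, y = mpg)) + geom_point()
```

Typing the name of any of these objects (for example, p_hist) in the console draws the figure.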

To begin, let’s load in Study1 as we did in the last module along with the package.

setwd("C:/Users/linht/Downloads/RBlog/Module6/") # don't forget to change this to your own directory

library(haven)
## Warning: package 'haven' was built under R version 4.2.3
Study1 <- read_sav("study1a_dataset.sav")

# Check it out!
head(Study1)
## # A tibble: 6 × 22
##   Block      age pGender race    education polictical MTageend FTageend MDageend
##   <dbl+lb> <dbl> <dbl+l> <dbl+l> <dbl+lbl> <dbl+lbl>     <dbl>    <dbl>    <dbl>
## 1 1 [bloc…    21 1 [Mal… 1 [Whi… 2 [Assoc… 4 [Somewh…     3.75     3.88     4.11
## 2 2 [bloc…    21 1 [Mal… 7 [His… 2 [Assoc… 4 [Somewh…     4        4        4.62
## 3 2 [bloc…    21 2 [Fem… 1 [Whi… 2 [Assoc… 3 [Modera…     5        5.38     5.5 
## 4 2 [bloc…    22 2 [Fem… 1 [Whi… 1 [High … 2 [Somewh…     4.75     3.88     6.12
## 5 2 [bloc…    22 2 [Fem… 4 [Mul… 2 [Assoc… 3 [Modera…     4.88     4.5      4.25
## 6 2 [bloc…    22 1 [Mal… 1 [Whi… 1 [High … 4 [Somewh…     5.5      6.75     6.25
## # ℹ 13 more variables: FDageend <dbl>, MTcomend <dbl>, FTcomend <dbl>,
## #   MDcomend <dbl>, FDcomend <dbl>, MTageopp <dbl>, FTageopp <dbl>,
## #   MDageopp <dbl>, FDageopp <dbl>, MTcomopp <dbl>, FTcomopp <dbl>,
## #   MDcomopp <dbl>, FDcomopp <dbl>
names(Study1)
##  [1] "Block"      "age"        "pGender"    "race"       "education" 
##  [6] "polictical" "MTageend"   "FTageend"   "MDageend"   "FDageend"  
## [11] "MTcomend"   "FTcomend"   "MDcomend"   "FDcomend"   "MTageopp"  
## [16] "FTageopp"   "MDageopp"   "FDageopp"   "MTcomopp"   "FTcomopp"  
## [21] "MDcomopp"   "FDcomopp"
#load in the ggplot2 package
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3

Histograms

As we learned in the previous module, histograms are great univariate displays for interval/continuous variables. They show the distribution of the variable (i.e. the values of the variable and the number of observations at each value). On the horizontal, or x, axis, there are ranges of values of the variable. On the vertical, or y, axis, there is the frequency, or the number of observations of that variable that fall into the x-axis range. Histograms can give you a sense of the mean, median, mode, and skewness of a variable.

Let’s try to make a histogram for the FDageend variable (for a refresher on what the variable names mean, visit the last module, module 6).

ggplot(data=Study1, 
       aes(x=FDageend)) + 
          geom_histogram() #try to see how this resembles the core command of ggplot2 as shown above
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Like last module, this first draft is somewhat rough. Let’s try adding a title for the whole histogram and a title for the x-axis.

ggplot(data=Study1, 
       aes(x=FDageend)) + 
  geom_histogram() +
  ggtitle("Distribution of the Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces", subtitle= "Study1") +
   xlab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Next, we can change how wide or narrow each bar of the histogram is. The default is bins=30. Let’s reduce the number of bins to 20, which makes each bar a little wider.

ggplot(data=Study1, 
       aes(x=FDageend)) + 
  geom_histogram(bins=20) + #new addition here
  ggtitle("Distribution of the Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces", subtitle= "Study1") +
   xlab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces")
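
The message printed earlier suggests binwidth as another way to control the bars. As a small sketch (again with the built-in mtcars data standing in for Study1, and a width of 2.5 chosen arbitrarily), binwidth sets how wide each bar is in the units of the variable, whereas bins sets how many bars there are.

```r
library(ggplot2)

# Each bar spans 2.5 miles per gallon; ggplot2 works out how many bars are needed
p_width <- ggplot(data = mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2.5)

# layer_data() returns the values ggplot2 computed for a layer,
# including the left (xmin) and right (xmax) edge of every bar
bars <- layer_data(p_width, 1)
bars$xmax - bars$xmin
```

Use whichever of the two arguments is easier to reason about for your variable; you only need one of them.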

Ok. But the color makes it a little hard to distinguish between bars, right? Let’s change the color of the bars and the outline color of the bars.

ggplot(data=Study1, 
       aes(x=FDageend)) + 
  geom_histogram(bins=20, color="black", fill="white") + #new addition here
  ggtitle("Distribution of the Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces", subtitle= "Study1") +
   xlab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces")

Finally, let’s change the background of the figure from grey to white.

ggplot(data=Study1, 
       aes(x=FDageend)) + 
  geom_histogram(bins=20, color="white", fill="grey") + 
  theme_bw() + #new addition here
  ggtitle("Distribution of the Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces", subtitle= "Study1") +
   xlab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces")

And that’s it. The code may look overwhelming, but it is simply the core command with a series of aesthetic commands added one at a time (yes, aesthetics matter!).
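
The practice at the end of this module also asks for density plots, which follow the same pattern with geom_density() in place of geom_histogram(). Here is a minimal sketch, using the built-in mtcars data as a stand-in for Study1 (the labels are made up for the example).

```r
library(ggplot2)

# Same recipe as the histogram, with a different geom
p_density <- ggplot(data = mtcars, aes(x = mpg)) +
  geom_density(color = "black", fill = "grey") +
  theme_bw() +
  ggtitle("Density of Miles per Gallon") +
  xlab("Miles per Gallon")

p_density
```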

Bar chart

A great way to show the distribution of a nominal variable is to use a bar chart. Let’s use one to see the distribution of participants’ gender (pGender).

#First, let's recode
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Study1 <- Study1 %>%
  mutate(pGender = case_when(
    pGender == 1 ~ "Male",
    pGender == 2 ~ "Female",
    pGender == 3 ~ "Transgender",
    pGender == 5 ~ "Rather not say",
    TRUE ~ NA_character_  # Default value if none of the conditions are met
  ))

#ok, to the plot
ggplot(data=Study1, 
       aes(x=pGender)) + 
  geom_bar()
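
Because read_sav() keeps the SPSS value labels, the haven package offers as_factor() as a possible alternative to recoding by hand. The sketch below uses a small made-up labelled vector (toy_gender, with hypothetical codes and labels) rather than the real pGender column, so the labels in your own data may differ.

```r
library(haven)

# A made-up labelled vector that mimics how read_sav() stores a coded variable
toy_gender <- labelled(c(1, 2, 2, 1),
                       labels = c(Male = 1, Female = 2))

# as_factor() swaps each numeric code for its text label
toy_factor <- as_factor(toy_gender)
levels(toy_factor)
```

If you go this route, apply as_factor() to the column as it comes out of read_sav(), before any manual recoding.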

Now, let’s add some labels (titles for the figure, label for x and y axes).

ggplot(data=Study1, aes(x=pGender)) + 
  geom_bar() + 
  ggtitle("Bar Chart of Participants' Gender") + 
  xlab("Participants' Gender") + 
  ylab("Count")
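
The same labels can also be set in one place with labs(), which some people find easier to read than separate ggtitle(), xlab(), and ylab() calls. Here is a sketch using the built-in mtcars data as a stand-in for Study1 (the label text is made up).

```r
library(ggplot2)

p_labs <- ggplot(data = mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  # title, x, and y replace ggtitle(), xlab(), and ylab()
  labs(title = "Bar Chart of Cylinders",
       x = "Number of Cylinders",
       y = "Count")

p_labs
```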

Let’s flip it horizontally, just to see how it looks.

ggplot(data=Study1, aes(x=pGender)) + 
  geom_bar() + 
  coord_flip() + #new addition here
  ggtitle("Bar Chart of Participants' Gender") + 
  xlab("Participants' Gender") + 
  ylab("Count")

Making the bars horizontal actually makes the figure quite confusing. Let’s not do that. How about changing the colors of the bars?

ggplot(data=Study1, aes(x=pGender, fill=pGender)) + #new addition here
  geom_bar() + 
  ggtitle("Bar Chart of Participants' Gender") + 
  xlab("Participants' Gender") + 
  ylab("Count")

Again, let’s change our theme to black and white as before, and also remove the legend of the figure.

ggplot(data=Study1, aes(x=pGender, fill=pGender)) + 
  geom_bar() + 
  ggtitle("Bar Chart of Participants' Gender") + 
  xlab("Participants' Gender") + 
  ylab("Count") +
  theme_bw() + #remove the grey background color
  theme(legend.position = "none") #remove the legends

In this dataset, no one declined to report their gender (i.e., there are no NA values). This is quite rare. What if your data do contain NAs and you want to remove them? You can filter them out while you are making the figure! That’s the beauty of the Tidyverse. You do not have to write two different commands, one for filtering the NA values and one for creating the figure. You can merge them into one command. Here is how you do it.

#to filter, we need a specific Tidyverse package. Remember which one? It is dplyr, which we already loaded above.
library(dplyr)
#now let's remove all the NA values
ggplot(data=Study1 %>% filter(!is.na(pGender)), #filter goes here
       aes(x=pGender, fill=pGender)) + 
  geom_bar() + 
  ggtitle("Bar Chart of Participants' Gender") + 
  xlab("Participants' Gender") + 
  ylab("Count") +
  theme_bw() + 
  theme(legend.position = "none") 

Perfect.

Boxplots

Box plots are great for presenting bivariate relationships (meaning relationships between two variables). However, one of the variables must be nominal or ordinal (categorical), while the other must be interval-level (continuous) data.

Let’s try to graph the distribution of FDageend across pGender.

ggplot(data=Study1 %>% filter(!is.na(pGender)) %>% filter(!is.na(FDageend)), 
       aes(x=pGender, y=FDageend)) + 
  geom_boxplot()

Now let’s add the rest as we have been doing for the other types of graphs.

ggplot(data=Study1 %>% filter(!is.na(pGender)) %>% filter(!is.na(FDageend)), 
       aes(x=pGender, y=FDageend)) + 
  geom_boxplot() +
  ggtitle("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces by Participants' Gender") + 
  xlab("Participants' Gender")+
  ylab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces") +
  theme_bw()
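
The two chained filter() calls in the boxplot code can also be written as a single call, since filter() keeps only the rows that satisfy every condition separated by commas. Here is a small sketch with a made-up data frame (toy, with hypothetical columns group and score).

```r
library(dplyr)

# Made-up data with one missing value in each column
toy <- data.frame(group = c("a", "b", NA, "a"),
                  score = c(1.5, NA, 3.0, 4.5))

# Two filters chained with the pipe
chained <- toy %>% filter(!is.na(group)) %>% filter(!is.na(score))

# One filter with both conditions; a row must pass both to be kept
combined <- toy %>% filter(!is.na(group), !is.na(score))

# Compare the two results
identical(chained, combined)
```

Either form works inside ggplot(data = ...); the single call is just a little shorter.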

Scatterplots

Scatterplots show relationships between two continuous variables. Personally, I like ggplot2 scatterplots so much better than base R scatterplots.

Let’s graph the relationship between FDageend and FDcomend.

ggplot(data=Study1, 
       aes(x=FDageend , y=FDcomend)) + 
  geom_point()

That looks like a positive correlation, all right. As usual, let’s add what we have learned so far.

ggplot(data=Study1, 
       aes(x=FDageend , y=FDcomend)) + 
  geom_point() +
  xlab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces") +
  ylab("Perceived Communal Goals Endorsement for Leaders with Female Dominant Faces") +
  ggtitle("The Relationship between Perceived Agentic Goals Endorsement and Perceived Communal Goals Endorsement for Leaders with Female Dominant Faces") +
  theme_bw()

The beauty of ggplot2 is that I can even add in another categorical variable (e.g., pGender) by giving each level of that variable (e.g., male vs. female) its own color. Here is how you do it.

ggplot(data=Study1, 
       aes(x=FDageend , y=FDcomend, color=pGender)) + #addition here
  geom_point() +
  xlab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces") +
  ylab("Perceived Communal Goals Endorsement for Leaders with Female Dominant Faces") +
  ggtitle("The Relationship between Perceived Agentic Goals Endorsement and Perceived Communal Goals Endorsement for Leaders with Female Dominant Faces") +
  theme_bw()

Alright, that does not add much to this graph, so let’s remove the participants’ gender colors. Finally, and most importantly, you can add a fitted line to your scatterplot using geom_smooth(). Here is how you do it.

ggplot(data=Study1, 
       aes(x=FDageend , y=FDcomend)) + #color=pGender removed here
  geom_point() +
  xlab("Perceived Agentic Goals Endorsement for Leaders with Female Dominant Faces") +
  ylab("Perceived Communal Goals Endorsement for Leaders with Female Dominant Faces") +
  ggtitle("The Relationship between Perceived Agentic Goals Endorsement and Perceived Communal Goals Endorsement for Leaders with Female Dominant Faces") +
  theme_bw() +
  geom_smooth() #new addition here
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

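Not so linear… Note that the curve drawn here is a loess smoother (see the message above), which follows the data locally instead of forcing a straight line.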
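
If you want a straight best-fit line instead of the loess curve reported in the message above, one option is to ask geom_smooth() for a linear model with method = "lm". Here is a minimal sketch, using the built-in mtcars data as a stand-in for Study1.

```r
library(ggplot2)

p_line <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # method = "lm" fits a straight line; se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  theme_bw()

p_line
```

Leaving se at its default of TRUE adds a shaded confidence band around the line, which is often worth keeping in a final figure.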

Practice makes perfect

The practice for this module (module 7) is similar to the module 6 practice. However, I want you to complete this practice using the Tidyverse (ggplot2) only.

Download the data for Study 2 from the same project. The datasets may be downloaded here.

1, Make a density plot and a histogram for the following variables:

MD_AgeAff

MT_AgeAff

FD_AgeAff

FT_AgeAff

MD_ComAff

MT_ComAff

FD_ComAff

FT_ComAff

2, Create a barplot for the following variables:

age

pgender

race

3a, Create a scatterplot to visualize the relationship between these variables:

MD_AgeAff and MD_ComAff

MT_AgeAff and MT_ComAff

FD_AgeAff and FD_ComAff

FT_AgeAff and FT_ComAff

3b, For each of the scatterplots, color the points by participants’ race (“race” variable).

3c, Insert a best-fit line for each of the scatterplots.

4, Create a Box and Whisker Plot to visualize the distribution of the MD_AgeAff variable across participants’ race (“race” variable)

Concluding words

Congratulations, you have finished all the modules in Data Visualization. This is a crash course on basic Data Visualization. You probably have seen more elaborate figures before. They can all be made in R. With the tools I have provided you, you have the foundation to learn how to make any figure you want and add any features you want. Here is a document with a lot of extra guidelines for data visualization in R. Go wild.

References

Hehman, E., & Xie, S. Y. (2021). Doing Better Data Visualization. Advances in Methods and Practices in Psychological Science, 4(4), 25152459211045334. https://doi.org/10.1177/25152459211045334

Joshi, M. P., Lloyd, E. P., Diekman, A. B., & Hugenberg, K. (2023). In the Face of Opportunities: Facial Structures of Scientists Shape Expectations of STEM Environments. Personality & Social Psychology Bulletin, 49(5), 673–691. https://doi.org/10.1177/01461672221077801