Module 6: Data Visualization I

Welcome back to Rawlex. Congratulations on finishing the modules on the basics of data. Now, we are moving on to the next “chapter” of modules: data visualization. This chapter consists of 2 modules, one using base R and one using ggplot2 (part of the Tidyverse), which is a reprise of our common theme: there are many ways to do the same thing, some better than others. In this module, we will explore how to visualize our data using base R.

In this module, the data we will be using come from Joshi et al. (2023). If you have time, I highly recommend reading this paper. In short, it examines how leaders’ facial structures affect people’s perceptions of the opportunities available when working with such leaders. As we dig deeper into some of their studies, I will give more context. You can download their data, available for open access, here.

Goals for this module

By the end of this module, you will

1, Be able to make basic univariate figures in base R.

2, Be able to customize your figures using base R.

3, Be able to create bivariate visualizations using base R.

General guidelines for making figures in Base R

Base R is a fast and easy way to make simple figures. More complicated (and more aesthetically pleasing) figures are usually easier to make in ggplot2, which is part of the Tidyverse.

In this section, I will provide some guidelines for your data visualizations. They definitely do not apply to every dataset, but I find it helpful to think about these considerations whenever I visualize my data.

Clean data make clean figures: We did not spend the last 3 modules learning data management for nothing. Before you can visualize your data, you have to clean them. Figures can only be as beautiful and informative as the data used to create them are clean and comprehensible.
Always add a descriptive title and labels: Remember, your readers do not know your data as well as you do. In fact, you are creating data visualizations to familiarize your readers with your data. That is why adding a title and labels helps make your figures more comprehensible and organized.
Plots should be easy to read and understand; don’t do too much in one figure: There is a right visualization plan for every dataset. Think about which plan makes sense and how much you can put into one figure. The best figures are condensed: they present a lot of information in an organized and comprehensible way. When you cannot think of a way to make a condensed figure, it is best to split it into several smaller ones. Do not stuff too much information into one figure.
When using colors, consider the accessibility of your figures: I find it best to make my figures monochromatic (one color, different shades).
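As a minimal sketch of the “one color, different shades” idea, the block below builds a small palette with base R’s colorRampPalette() and uses it in a barplot. The counts and group names are made up for illustration and do not come from the Joshi et al. (2023) data.

```r
# Hypothetical counts for four groups (made up for illustration)
counts <- c(A = 12, B = 30, C = 21, D = 8)

# Four shades running from a dark blue to a light blue
shades <- colorRampPalette(c("navy", "lightblue"))(length(counts))

barplot(counts, col = shades,
        main = "Example Barplot with Shades of One Color",
        ylab = "Frequency")
```

Any two colors can be given to colorRampPalette(); keeping both ends within one hue keeps the figure monochromatic.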

There is no hard-and-fast rule for data visualization, but there sure are wrong ways to visualize your data. For a full list of recommendations on data visualization in R, read this well-written paper by Hehman and Xie (2021).

Univariate (one variable) display

The first group of plots we will cover is univariate plots. One of these is the density plot. It is a graphical estimate of the probability density function (PDF, not the pdf file format we usually know) of a continuous random variable. It is a way to visualize the distribution of data as a smooth curve. Density plots are particularly useful for understanding the underlying distribution of a dataset.

Density plots can be helpful for several purposes:

- Visualizing the distribution of a dataset and identifying its central tendency and variability.
- Understanding the shape of the data and identifying any potential skewness or multimodality.
- Comparing the distribution of different groups or variables in a dataset.
- Assessing data patterns and underlying trends.
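To illustrate the point about comparing groups, here is a small sketch that overlays two density curves in one figure. The two groups are simulated (they are not taken from the study), and the names group_a and group_b are placeholders.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated ratings for two hypothetical groups (not from Joshi et al.)
group_a <- rnorm(200, mean = 4, sd = 1)
group_b <- rnorm(200, mean = 5, sd = 1.2)

dens_a <- density(group_a)
dens_b <- density(group_b)

# Set the axis limits so that both curves fit in the same plot
plot(dens_a,
     xlim = range(dens_a$x, dens_b$x),
     ylim = c(0, max(dens_a$y, dens_b$y)),
     main = "Density of Simulated Ratings by Group",
     xlab = "Rating")
lines(dens_b, lty = 2)  # dashed line for the second group
legend("topright", legend = c("Group A", "Group B"), lty = c(1, 2))
```

Different line types (rather than different colors) keep the two curves distinguishable in grayscale.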

Let’s demonstrate with Study 1A from Joshi et al. (2023). You can download the Study 1A dataset from their OSF link.

setwd("C:/Users/linht/Downloads/RBlog/Module6/") # don't forget to change this to your own directory

library(haven)

## Warning: package 'haven' was built under R version 4.2.3

Study1 <- read_sav("C:/Users/linht/Downloads/RBlog/Module6/study1a_dataset.sav")

# Check it out!
head(Study1)

## # A tibble: 6 × 22
##   Block      age pGender race    education polictical MTageend FTageend MDageend
##   <dbl+lb> <dbl> <dbl+l> <dbl+l> <dbl+lbl> <dbl+lbl>     <dbl>    <dbl>    <dbl>
## 1 1 [bloc…    21 1 [Mal… 1 [Whi… 2 [Assoc… 4 [Somewh…     3.75     3.88     4.11
## 2 2 [bloc…    21 1 [Mal… 7 [His… 2 [Assoc… 4 [Somewh…     4        4        4.62
## 3 2 [bloc…    21 2 [Fem… 1 [Whi… 2 [Assoc… 3 [Modera…     5        5.38     5.5 
## 4 2 [bloc…    22 2 [Fem… 1 [Whi… 1 [High … 2 [Somewh…     4.75     3.88     6.12
## 5 2 [bloc…    22 2 [Fem… 4 [Mul… 2 [Assoc… 3 [Modera…     4.88     4.5      4.25
## 6 2 [bloc…    22 1 [Mal… 1 [Whi… 1 [High … 4 [Somewh…     5.5      6.75     6.25
## # ℹ 13 more variables: FDageend <dbl>, MTcomend <dbl>, FTcomend <dbl>,
## #   MDcomend <dbl>, FDcomend <dbl>, MTageopp <dbl>, FTageopp <dbl>,
## #   MDageopp <dbl>, FDageopp <dbl>, MTcomopp <dbl>, FTcomopp <dbl>,
## #   MDcomopp <dbl>, FDcomopp <dbl>

names(Study1)

##  [1] "Block"      "age"        "pGender"    "race"       "education" 
##  [6] "polictical" "MTageend"   "FTageend"   "MDageend"   "FDageend"  
## [11] "MTcomend"   "FTcomend"   "MDcomend"   "FDcomend"   "MTageopp"  
## [16] "FTageopp"   "MDageopp"   "FDageopp"   "MTcomopp"   "FTcomopp"  
## [21] "MDcomopp"   "FDcomopp"

Let’s get ourselves acquainted with the data. In this study, participants are presented with either a male (M) face or a female (F) face. These faces can be manipulated to become more dominant (D) or trustworthy (T). Thus, you will see that some of the variable names begin with “MT”, “FT”, “MD”, or “FD” to denote the conditions. “-age-” denotes agentic values (e.g., attaining power, gaining financial rewards), whereas “-com-” denotes communal values (e.g., helping others, helping society). Finally, “-end” denotes endorsement (i.e., how much participants think the person with the face presented would endorse the values), whereas “-opp” denotes opportunities (i.e., how much participants think the person with the face presented would provide the opportunities to fulfill these values). Now, can you tell what the “MTcomopp” variable means?
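If you want to check your reading of the naming scheme, the sketch below spells it out in code. The helper decode_varname() is hypothetical (it is not part of the dataset or of any package); it simply looks up each piece of a variable name using the conventions described above.

```r
# Hypothetical helper: translate a variable name such as "FDageend" into words
decode_varname <- function(name) {
  face  <- c(M = "male", F = "female")[substr(name, 1, 1)]
  trait <- c(T = "trustworthy", D = "dominant")[substr(name, 2, 2)]
  value <- c(age = "agentic", com = "communal")[substr(name, 3, 5)]
  type  <- c(end = "endorsement", opp = "opportunities")[substr(name, 6, 8)]
  paste(face, trait, "face:", value, type)
}

decode_varname("FDageend")
# "female dominant face: agentic endorsement"
```

Try it on “MTcomopp” after you have made your own guess.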

Here is how you create a density plot for the “MTcomopp” variable:

plot(density(Study1$MTcomopp, na.rm = TRUE)) # don't forget na.rm = TRUE to drop the NA values

Wow, that’s quite a normal distribution (-ish!). However, I don’t quite like the current title. We can change the title using the “main” option in the base plot function. Remember to use quotation marks.

plot(density(Study1$MTcomopp, na.rm = TRUE), main = "Density of the Perceived Communal Opportunities for Male Trustworthy Faces")

Ah, much better. Ok, but the scale only goes from 1 to 7, can we show that on the figure? Of course.

plot(density(Study1$MTcomopp, na.rm = TRUE), main="Density of the Perceived Communal Opportunities for Male Trustworthy Faces", xlim=c(1, 7))

Not so “normal” anymore, is it? Pay attention to these considerations as you graph out your data.
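One detail worth knowing: xlim only changes the visible window of the plot. By default, density() also evaluates the smooth curve a little beyond the smallest and largest observed values, and its “from” and “to” arguments control that range. The sketch below shows the difference on simulated scores bounded between 1 and 7 (these are not the real MTcomopp values).

```r
set.seed(1)
# Simulated scale scores bounded between 1 and 7 (not the real data)
scores <- pmin(pmax(rnorm(150, mean = 5, sd = 1.2), 1), 7)

d_default <- density(scores)                    # curve extends past the data
d_bounded <- density(scores, from = 1, to = 7)  # curve evaluated on 1 to 7 only

range(d_default$x)  # wider than the range of the scores
range(d_bounded$x)  # exactly 1 and 7

plot(d_bounded, main = "Density of Simulated Scores, Evaluated from 1 to 7",
     xlab = "Score")
```

Whether to show the tails outside the scale is a judgment call; the point is to know that the curve is a smoothed estimate, not the raw data.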

Density plots are great for continuous variables. However, they are less informative for a variable with discrete values (i.e., only a finite number of values). In this case, we would use a histogram. A histogram, like the density plot, shows how often the values of a variable occur in your observations. Unlike a density plot, a histogram is not a continuous line. Because the variable itself is categorical/discrete, it makes sense to have a plot with set “bins” or sections. In fact, our MTcomopp variable would be better represented in a histogram, because participants are presented with a Likert scale of 1-7, with no intervals in between.

hist(Study1$MTcomopp)

Ew, that does not look too good. Let’s first change the title, as we did in the density plot, and add a label to the x-axis:

hist(Study1$MTcomopp, main="Histogram of the Perceived Communal Opportunities for Male Trustworthy Faces", xlab="Perceived Communal Opportunities")

Better. Now, let’s set the breaks between the bins so that the histogram is more representative of our measure:

hist(Study1$MTcomopp, main="Histogram of the Perceived Communal Opportunities for Male Trustworthy Faces", xlab="Perceived Communal Opportunities", 
     breaks=seq(1,7, by=1)) #breaks sets the bin edges: here, one edge at every whole number from 1 to 7

Perfect.
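A small caveat on “breaks”: with edges at 1, 2, ..., 7, hist() makes six bins, and responses of exactly 1 and exactly 2 share the first one. If a variable holds only whole-number responses, one option is to put the edges halfway between the scale points so that every response gets its own bin. The sketch below uses simulated whole-number responses (not the real data).

```r
set.seed(7)
# Simulated whole-number responses on a 1-7 scale (not the real data)
responses <- sample(1:7, size = 200, replace = TRUE,
                    prob = c(0.02, 0.05, 0.10, 0.20, 0.30, 0.23, 0.10))

# Edges at 0.5, 1.5, ..., 7.5 put each whole number in the middle of its own bin
h <- hist(responses, breaks = seq(0.5, 7.5, by = 1),
          main = "Histogram of Simulated 1-7 Responses",
          xlab = "Response")

h$counts                                # one count per scale point
table(factor(responses, levels = 1:7))  # the same tallies, for comparison
```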

Histograms are only useful when R recognizes the object as “numeric”. What about categorical variables whose values are text labels rather than numbers? For those, we need the bar plot. The key to using barplot() effectively is to include the table() command, which tallies the number of observations that take each value of the variable.

Let’s say we want to make a barplot of the participants’ gender (pGender):

#First, let's recode
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.2.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Study1 <- Study1 %>%
  mutate(pGender = case_when(
    pGender == 1 ~ "Male",
    pGender == 2 ~ "Female",
    pGender == 3 ~ "Transgender",
    pGender == 5 ~ "Rather not say",
    TRUE ~ NA_character_  # Default value if none of the conditions are met
  ))

#ok, to the plot
barplot(table(Study1$pGender))

Amazing! But what if you don’t want to recode the variable in a separate step (although I would not recommend this)? Well, you can relabel the categories as you draft up your graph.

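#Changing the category labels
#This assumes pGender still holds the original codes 1 and 2 (i.e., before the recode above)
barplot(table(Study1$pGender),
        names.arg = c("Male", "Female"), cex.names = .85)

#note that names.arg needs one label per bar and is applied left to right, in the order of the categories in table(); cex.names controls the size of the label text.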
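Because the bars follow the order in which table() tallies the categories, it can help to check that order before supplying labels. The sketch below uses a short made-up vector of gender labels (not the real pGender column) and shows one way to set the order yourself with factor().

```r
# Made-up gender labels for illustration (not the real pGender column)
gender <- c("Male", "Female", "Female", "Male", "Female")

counts <- table(gender)
names(counts)  # "Female" "Male": text values are tallied in alphabetical order

# To choose the order yourself, convert to a factor with explicit levels
gender_f <- factor(gender, levels = c("Male", "Female"))
counts_f <- table(gender_f)
names(counts_f)  # "Male" "Female"

barplot(counts_f, main = "Barplot of Made-Up Gender Labels", ylab = "Frequency")
```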

Now let’s add the graph title and a title for our y-axis.

barplot(table(Study1$pGender), main="Barplot of Participants' Gender", ylab="Frequency")

Finally, some aesthetic considerations. You can change the color and direction of the bar plot.

# Changing colors
barplot(table(Study1$pGender), main="Barplot of Participants' Gender", ylab="Frequency", col = "aquamarine4")

# Changing direction
barplot(table(Study1$pGender), main="Barplot of Participants' Gender", ylab="Frequency", horiz = TRUE)

Bivariate (two variables) display

What if we want to visualize the relationship between 2 variables? This is very common in almost any data analysis project.

The most basic plot for a bivariate display is a scatterplot. Scatterplots are used to show relationships between two continuous variables. Let’s examine the relationship between MTageopp and MTcomopp. In other words, we are examining the relationship between the perceived agentic opportunities and the perceived communal opportunities for a male leader with a trustworthy face.

plot(x=Study1$MTageopp, y=Study1$MTcomopp)

As usual, this is a rough first draft of a visualization. Let’s add in the titles.

plot(x=Study1$MTageopp, y=Study1$MTcomopp, xlab = "Perceived agentic affordances", ylab="Perceived communal affordances", 
     main="Scatterplot between Perceived Agentic Affordances and Perceived Communal Affordances of Leaders with Male Trustworthy Faces")

Finally, aesthetic things. You can change the color of the dots using “col” and the shape of the dots using “pch”. Play around and find out.

plot(x=Study1$MTageopp, y=Study1$MTcomopp, xlab = "Perceived agentic affordances", ylab="Perceived communal affordances", 
     main="Scatterplot between Perceived Agentic Affordances and Perceived Communal Affordances of Leaders with Male Trustworthy Faces",
     pch=16, col='pink')

#yes, pink is my favorite color.
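Two optional additions can make a scatterplot of rating-scale data easier to read: jitter(), which nudges overlapping points apart, and abline(), which can draw a fitted regression line on top of the points. The sketch below uses simulated 1-7 ratings (not the real MTageopp and MTcomopp values); the variables x and y are placeholders.

```r
set.seed(123)
# Simulated 1-7 ratings (not the real data); y is built to correlate with x
x <- sample(1:7, 100, replace = TRUE)
y <- pmin(pmax(x + sample(-2:2, 100, replace = TRUE), 1), 7)

# jitter() adds a little random noise so overlapping points become visible
plot(jitter(x), jitter(y), pch = 16, col = "pink",
     xlab = "Simulated x rating", ylab = "Simulated y rating",
     main = "Scatterplot of Simulated Ratings with a Fitted Line")

fit <- lm(y ~ x)           # simple linear regression of y on x
abline(fit, col = "navy")  # draw the fitted line on the existing plot
cor(x, y)                  # Pearson correlation between the two ratings
```

The line is fitted to the original values, not the jittered ones, so the noise only affects how the points are drawn.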

What happens if our x variable is not a continuous variable? Well, scatterplots start to look a bit strange. In this case, we would use a box-and-whisker plot to show the relationship between a continuous y variable (dependent variable) and a categorical x variable (independent variable).

Let’s try to see how ratings of perceived agentic opportunities offered by leaders with male trustworthy faces (MTageopp) differ by participants’ gender (pGender).

boxplot(MTageopp~pGender, data=Study1)

Not so different across participants’ gender, is it? We also see an outlier among the female participants (the dot). As usual, let’s add titles to this rough visualization.

boxplot(MTageopp~pGender, data=Study1, xlab = "Participants' Gender", 
        ylab="Perceived Agentic Affordances", main="Box and Whisker Plot for Perceived Agentic Affordances")

Better. Finally, let’s consider some aesthetic options. We can change the fill color of the boxes using “col” and the color of their outlines using “border” (I know, such creative names).

boxplot(MTageopp~pGender, data=Study1, xlab = "Participants' Gender", 
        ylab="Perceived Agentic Affordances", main="Box and Whisker Plot for Perceived Agentic Affordances", col="lightblue", border = "navy")

Look at that!
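If you are curious which observations end up as dots in a boxplot, base R’s boxplot.stats() returns the numbers behind the picture. By default, values lying more than 1.5 times the length of the box beyond the box are flagged as outliers. The sketch below uses simulated ratings with one extreme value added on purpose (not the real data).

```r
set.seed(99)
# Simulated ratings with one extreme value added on purpose (not the real data)
ratings <- c(rnorm(50, mean = 5, sd = 0.5), 1)

stats <- boxplot.stats(ratings)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values drawn as individual points

boxplot(ratings, main = "Boxplot of Simulated Ratings", ylab = "Rating")
```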

Practice makes perfect

Download the data for Study 2 in the same project. Their datasets may be downloaded here.

1, Make a density plot and a histogram for the following variables:

MD_AgeAff

MT_AgeAff

FD_AgeAff

FT_AgeAff

MD_ComAff

MT_ComAff

FD_ComAff

FT_ComAff

2, Create a barplot for the following variables:

age

pgender

race

3, Create a scatterplot to visualize the relationship between these variables:

MD_AgeAff and MD_ComAff

MT_AgeAff and MT_ComAff

FD_AgeAff and FD_ComAff

FT_AgeAff and FT_ComAff

4, Create a Box and Whisker Plot to visualize the distribution of the MD_AgeAff variable across participants’ race (“race” variable)
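When an exercise asks for the same plot for several variables, a loop can save some typing. The sketch below is one possible approach, not the only one: it uses a simulated stand-in data frame (Study2_sim is a made-up name, and only the column names are borrowed from the exercise) and par(mfrow = ...) to draw several plots in one figure.

```r
set.seed(2023)
# Simulated stand-in for the Study 2 data; only the column names come from the exercise
vars <- c("MD_AgeAff", "MT_AgeAff", "FD_AgeAff", "FT_AgeAff")
Study2_sim <- as.data.frame(
  sapply(vars, function(v) round(runif(100, min = 1, max = 7), 2))
)

old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid of plots; keep the old settings
for (v in vars) {
  plot(density(Study2_sim[[v]], na.rm = TRUE),
       main = paste("Density of", v), xlab = v)
}
par(old_par)  # restore the original plotting layout
```

With the real data, you would replace Study2_sim with the data frame you read in, and swap plot(density(...)) for hist() to get the histograms.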

Concluding words

In this module, you learned how to visualize your data using the commands in base R. As you can see, they are pretty straightforward and the resulting visualizations are not too bad. In the next module, we will learn how to create beautiful visualizations using ggplot2 (Tidyverse). Get excited.

References

Hehman, E., & Xie, S. Y. (2021). Doing Better Data Visualization. Advances in Methods and Practices in Psychological Science, 4(4), 25152459211045334. https://doi.org/10.1177/25152459211045334

Joshi, M. P., Lloyd, E. P., Diekman, A. B., & Hugenberg, K. (2023). In the Face of Opportunities: Facial Structures of Scientists Shape Expectations of STEM Environments. Personality & Social Psychology Bulletin, 49(5), 673–691. https://doi.org/10.1177/01461672221077801