Module 5: Descriptive Statistics

Welcome back to Rawlex. You have finished learning all the prerequisites in R that you need before conducting data analysis. So far, we have learned multiple ways to load and manage our data. Now, we will start learning about data analysis, beginning with descriptive statistics. It is always good to know your data and be able to describe it to your readers. Thus, in this module, I will show you multiple ways to get different descriptive statistics from your data. Once again, it is your call to pick the way that best suits you and your goals. Let’s get started.

In this module, the data we will be using comes from Joshi & Diekman (2022). If you have time, I highly recommend you read this paper. All in all, this paper examines the effects of the presence of women in leadership roles. As we dig deeper into some of their studies, I will give more context. You can download their data, available for open access, here.

Goals for this module

By the end of this module, you will

1, Review the basics of descriptive statistics.

2, Be able to get descriptive statistics from your data using R.

3, Learn how to present the descriptive statistics using R.

Descriptive Statistics - A review

Descriptive statistics, in a nutshell, are the statistics that summarize a distribution (i.e., your data). In this module, we will be using these descriptive statistics: central tendency (i.e., mean, median), variation (i.e., standard deviation), range, and skewness. Let’s review what each of these means.

Central tendency tells us where the data tends to be. There are two statistics we will be looking at: (1) The mean refers to the arithmetic average of a set of numerical values. (2) The median refers to the observation in the middle of an ordered dataset.

The variation tells us how spread out (or not) the data is. A measure of variation we will be looking at is standard deviation.

The range tells us the difference between the smallest (min) and largest (max) observations in our data.

Skewness tells us about the (a)symmetry of our distribution.
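To make these definitions concrete, here is a minimal sketch on a made-up vector of five scores. The values are invented for illustration and do not come from Joshi & Diekman's data.

```r
# Toy data: five made-up scores (not from the study)
scores <- c(1, 2, 3, 4, 10)

mean(scores)         # arithmetic average: (1 + 2 + 3 + 4 + 10) / 5 = 4
median(scores)       # middle value of the ordered data: 3
sd(scores)           # sample standard deviation (divides by n - 1)
range(scores)        # smallest and largest values: 1 and 10
diff(range(scores))  # the range as a single number: 10 - 1 = 9
```

Notice that the single large score (10) pulls the mean above the median. That is a hint that the distribution is skewed to the right.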

I understand this review is sparse, but it is only meant as a quick refresher; our focus is on R. Should you want to learn more about descriptive statistics, consult this source.

Pulling descriptive statistics out using base R

Great. Now that we are caught up on the different descriptive statistics, let’s learn how to get them using R. For the demonstration, I will be using Study 1 from Joshi & Diekman (2022). In this study, the researchers examine whether seeing female leaders in STEM (compared to male leaders) increases perceivers’ trust in the organizations and their perceived opportunities offered by those organizations.

Let’s load in the data.

Stop: Try loading in the data yourself.

setwd("C:/Users/linht/Downloads/RBlog/Module5/")

library(haven)

## Warning: package 'haven' was built under R version 4.2.3

Study1 <- read_sav("C:/Users/linht/Downloads/RBlog/Module5/study1labadvisor_mturkdata.sav")

# Check it out!
View(Study1)

The most basic of R’s descriptive statistics commands is the summary command. Summary can be used on an entire dataset or on a single variable in that dataset.

# Let's try summarizing the whole dataset

summary(Study1)

##     filter_$     Pgender           age           FageAff         MageAff     
##  Min.   :1    Min.   :1.000   Min.   :21.00   Min.   :3.000   Min.   :3.000  
##  1st Qu.:1    1st Qu.:1.000   1st Qu.:29.00   1st Qu.:4.672   1st Qu.:4.641  
##  Median :1    Median :1.000   Median :33.00   Median :5.469   Median :5.438  
##  Mean   :1    Mean   :1.496   Mean   :37.26   Mean   :5.379   Mean   :5.341  
##  3rd Qu.:1    3rd Qu.:2.000   3rd Qu.:42.00   3rd Qu.:6.141   3rd Qu.:6.000  
##  Max.   :1    Max.   :2.000   Max.   :72.00   Max.   :7.000   Max.   :7.000  
##  NA's   :23   NA's   :19      NA's   :19      NA's   :6       NA's   :8      
##     FcomAff         McomAff          Ftrust          Mtrust     
##  Min.   :3.000   Min.   :3.000   Min.   :2.850   Min.   :2.800  
##  1st Qu.:4.500   1st Qu.:4.453   1st Qu.:4.650   1st Qu.:4.467  
##  Median :5.344   Median :5.312   Median :5.500   Median :5.267  
##  Mean   :5.300   Mean   :5.197   Mean   :5.349   Mean   :5.154  
##  3rd Qu.:6.062   3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:6.000  
##  Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :7.000  
##  NA's   :6       NA's   :8       NA's   :7       NA's   :9      
##     Mtrust2         Ftrust2     
##  Min.   :3.000   Min.   :2.750  
##  1st Qu.:4.400   1st Qu.:4.875  
##  Median :5.333   Median :5.625  
##  Mean   :5.198   Mean   :5.448  
##  3rd Qu.:6.000   3rd Qu.:6.000  
##  Max.   :7.000   Max.   :7.000  
##  NA's   :9       NA's   :7

Wow, okay, that’s a lot. Something we can examine is the variable “age,” which denotes participants’ age. The mean is 37.26 and the median is 33. Age ranges from 21 to 72 (talk about a representative sample). What if we want to examine only the variable “age”?

# summary for a single variable 

summary(Study1$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   21.00   29.00   33.00   37.26   42.00   72.00      19

As you can see, the summary command does not give us all the statistics we usually need. So, we can get the specific statistics using different commands.

Let’s start with central tendency (mean and median) and variation (sd).

#To get the mean: 

mean(Study1$age, na.rm=TRUE)

## [1] 37.26087

#na.rm is an argument that tells R to remove (rm) all NA values in the age variable before calculating. Some participants declined to answer the age question, so their value is NA (not available). If we leave the NAs in, the mean itself comes back as NA, so we need to remove them.

#To get the median

median(Study1$age, na.rm=TRUE)

## [1] 33

#To get the standard deviation

sd(Study1$age, na.rm=TRUE)

## [1] 11.45938

#Pretty straightforward, right?
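If you want to see for yourself what na.rm does in the calls above, here is a small sketch with a made-up vector of ages (again, invented values, not the study data):

```r
# Toy data: four made-up ages, one of them missing
ages <- c(21, 33, NA, 42)

mean(ages)                # NA: one missing value makes the whole result missing
mean(ages, na.rm = TRUE)  # 32: the average of 21, 33, and 42
sum(is.na(ages))          # 1: how many values were left out
```

The same argument works for median() and sd().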

Skewness, unlike the other measures, is not available in base R. We need to use the moments package.

#If you haven't downloaded the moments package yet, install it first. Here's a refresher on how to install packages.
#install.packages("moments")

library(moments)

#Then the rest is pretty intuitive

skewness(Study1$age, na.rm = TRUE)

## [1] 1.092112
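If you are curious what is going on under the hood, here is a rough base R sketch. It assumes the moment-based definition of skewness (the third central moment divided by the second central moment raised to the power 1.5), which, to my understanding, is what the moments package computes. The function name skew_by_hand is something I made up for this example.

```r
# Hypothetical helper (not part of any package): moment-based skewness
skew_by_hand <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  deviations <- x - mean(x)
  m2 <- mean(deviations^2)  # second central moment
  m3 <- mean(deviations^3)  # third central moment
  m3 / m2^(3 / 2)
}

right_skewed <- c(1, 2, 3, 4, 10)  # long tail to the right
symmetric    <- c(1, 2, 3, 4, 5)   # mirror image around 3

skew_by_hand(right_skewed)  # positive
skew_by_hand(symmetric)     # 0
```

A positive value, like the 1.09 we got for age, means the tail of the distribution stretches toward the larger values.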

What if you have a categorical variable such as participants’ gender (i.e., Pgender)? In this case, the table command is the better option. The code below shows how to use this command:

#First, let's recode. I use the Tidyverse here, but you can also try base R commands if you prefer.

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.2.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Study1$Pgender <- as.character(Study1$Pgender) #I am making the Pgender variable a character (string) variable instead of a numeric variable

Study1 <- Study1 %>% mutate(Pgender = recode(Pgender, 
                                             "1" = "male", 
                                             "2" = "female", 
                                             "3" = "transgender", 
                                             "4" = "rather not say"))

table(Study1$Pgender)

## 
## female   male 
##     57     58

Great, so we see we have 57 participants identifying as female and 58 participants identifying as male (gender-balanced indeed).
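Two things are worth knowing here. First, the recode can also be done in base R with factor(). Second, table() leaves out missing values by default, which is why the counts above add up to 115 even though the dataset has 134 rows. Here is a small sketch on made-up codes (not the study data):

```r
# Toy data: made-up gender codes, with one missing answer
gender_code <- c(1, 2, 2, 1, NA, 2)

# Base R recode: factor() maps each code to a label
gender <- factor(gender_code, levels = c(1, 2), labels = c("male", "female"))

table(gender)                   # counts, leaving out the NA
table(gender, useNA = "ifany")  # counts, with the NA shown as its own category
prop.table(table(gender))       # proportions among the non-missing answers
```

Adding useNA = "ifany" is a quick way to check how many participants did not answer.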

Pulling descriptive statistics out using R Tidyverse

As we learned in modules 3 and 4, there are multiple ways to do the same thing in R. In the previous section, we learned base R commands. Now, I will show you how to do the same thing using Tidyverse.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ stringr 1.5.0
## ✔ tidyr   1.3.0     ✔ forcats 0.5.2
## ✔ readr   2.1.3

## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## Warning: package 'purrr' was built under R version 4.2.3

## Warning: package 'stringr' was built under R version 4.2.3

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

#mean

Study1 %>% summarise(mean(age, na.rm=TRUE))

## # A tibble: 1 × 1
##   `mean(age, na.rm = TRUE)`
##                       <dbl>
## 1                      37.3

#median

Study1 %>% summarise(median(age, na.rm=TRUE))

## # A tibble: 1 × 1
##   `median(age, na.rm = TRUE)`
##                         <dbl>
## 1                          33

#standard deviation

Study1 %>% summarise(sd(age, na.rm=TRUE))

## # A tibble: 1 × 1
##   `sd(age, na.rm = TRUE)`
##                     <dbl>
## 1                    11.5

#what about all of them?

Study1$age <- as.numeric(Study1$age)
Study1 %>% summarize_at(vars(age), list(mean = ~mean(., na.rm=TRUE), median = ~median(., na.rm=TRUE), sd = ~sd(., na.rm=TRUE), length = length))

## # A tibble: 1 × 4
##    mean median    sd length
##   <dbl>  <dbl> <dbl>  <int>
## 1  37.3     33  11.5    134
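One caution about the length column in the output above: length counts every row, including participants whose age is missing. If you want the number of valid answers instead, you can count the non-missing values yourself. Here is a sketch with plain summarise() and column names of my own choosing, on made-up data:

```r
library(dplyr)

# Toy data: made-up ages, one missing
toy <- data.frame(age = c(21, 33, NA, 42))

res <- toy %>%
  summarise(
    mean_age   = mean(age, na.rm = TRUE),
    median_age = median(age, na.rm = TRUE),
    sd_age     = sd(age, na.rm = TRUE),
    n_rows     = n(),               # all rows, including the missing one
    n_valid    = sum(!is.na(age))   # rows that actually have an age
  )
res
```

Naming each statistic also gives you tidier column headers than the automatic ones.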

Now let’s try it with multiple variables.

Study1 %>% summarize_at(vars(Ftrust, Mtrust), list(mean = ~mean(., na.rm=TRUE), median = ~median(., na.rm=TRUE), sd = ~sd(., na.rm=TRUE), length = length))

## # A tibble: 1 × 8
##   Ftrust_mean Mtrust_mean Ftrust_median Mtrust_median Ftrust_sd Mtrust_sd
##         <dbl>       <dbl>         <dbl>         <dbl>     <dbl>     <dbl>
## 1        5.35        5.15           5.5          5.27     0.953     0.963
## # ℹ 2 more variables: Ftrust_length <int>, Mtrust_length <int>
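In current versions of dplyr, across() inside summarise() is the recommended replacement for the summarize_at family. Here is a minimal sketch of the same idea on made-up data; the values are invented, and only the column names mirror the study.

```r
library(dplyr)

# Toy data: made-up trust scores with a few missing values
toy <- data.frame(
  Ftrust = c(5, 6, NA, 7),
  Mtrust = c(4, 5, 6, NA)
)

res <- toy %>%
  summarise(across(
    c(Ftrust, Mtrust),
    list(
      mean   = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE),
      sd     = ~ sd(.x, na.rm = TRUE)
    )
  ))
res
```

The columns are named variable_statistic (e.g., Ftrust_mean), and across() lists all the statistics for one variable before moving on to the next.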

Beautiful. We can also use the tidyverse to get the means by group. This uses a very useful command: group_by. group_by is a way to tell R that you want summary statistics computed separately for each level of a grouping variable. For example, you can get summary statistics of the Ftrust and Mtrust variables grouped by the gender of the participants themselves:

Study1 %>% group_by(Pgender) %>% summarize_at(vars(Mtrust, Ftrust), list(mean = ~mean(., na.rm=TRUE), median = ~median(., na.rm=TRUE), sd = ~sd(., na.rm=TRUE), length = length))

## # A tibble: 3 × 9
##   Pgender Mtrust_mean Ftrust_mean Mtrust_median Ftrust_median Mtrust_sd
##   <chr>         <dbl>       <dbl>         <dbl>         <dbl>     <dbl>
## 1 female         5.27        5.48          5.27          5.75     0.908
## 2 male           5.13        5.25          5.2           5.25     0.975
## 3 <NA>           4.65        5.18          4.9           5.45     1.12 
## # ℹ 3 more variables: Ftrust_sd <dbl>, Mtrust_length <int>, Ftrust_length <int>
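The third row in the output above (the one labelled NA) holds the participants whose gender was not recorded. If you would rather leave them out, you can filter them before grouping. Here is a sketch on made-up data:

```r
library(dplyr)

# Toy data: made-up scores, one participant with no recorded gender
toy <- data.frame(
  Pgender = c("female", "female", "male", "male", NA),
  Ftrust  = c(6, 5, 4, 6, 3)
)

res <- toy %>%
  filter(!is.na(Pgender)) %>%   # drop participants with no recorded gender
  group_by(Pgender) %>%
  summarise(
    mean_Ftrust = mean(Ftrust, na.rm = TRUE),
    n           = n()           # number of rows in each group
  )
res
```

Whether to keep or drop that NA group is a judgment call; just make sure you report which one you did.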

Displaying summary statistics using the modelsummary package

In your work, you might want to present a table of summary statistics. However, good tables are time-consuming to make. R makes it easy. First, you will need to install and load the modelsummary library (Arel-Bundock, 2022).

#remember to install the package first
#install.packages("modelsummary")

library(modelsummary)

## Warning: package 'modelsummary' was built under R version 4.2.3

#Now we get the summary statistics
modelsummary::datasummary_skim(Study1, output='markdown')

## Warning in datasummary_skim_numeric(data, output = output, fmt = fmt, histogram
## = histogram, : The histogram argument is only supported for (a) output types
## "default", "html", or "kableExtra"; (b) writing to file paths with extensions
## ".html", ".jpg", or ".png"; and (c) Rmarkdown or knitr documents compiled to PDF
## or HTML. Use `histogram=FALSE` to silence this warning.

	Unique (#)	Missing (%)	Mean	SD	Min	Median	Max
attention check	2	17	1.0	0.0	1.0	1.0	1.0
age	44	14	37.3	11.5	21.0	33.0	72.0
FageAff	51	4	5.4	1.0	3.0	5.5	7.0
MageAff	59	6	5.3	0.9	3.0	5.4	7.0
FcomAff	57	4	5.3	1.0	3.0	5.3	7.0
McomAff	57	6	5.2	0.9	3.0	5.3	7.0
Ftrust	63	5	5.3	1.0	2.8	5.5	7.0
Mtrust	75	7	5.2	1.0	2.8	5.3	7.0
Mtrust2	42	7	5.2	1.0	3.0	5.3	7.0
Ftrust2	30	5	5.4	1.0	2.8	5.6	7.0

#Now that's a beautiful summary table. Here, I set the output format to markdown, but you can also make the output "html" or "latex" - your preference.

#What if we want categorical variables only?
datasummary_skim(Study1, type="categorical")

Pgender	N	%
female	57	42.5
male	58	43.3
NA	19	14.2

#What if we want numerical variables only?
datasummary_skim(Study1, type="numeric")

## Warning in datasummary_skim_numeric(data, output = output, fmt = fmt, histogram
## = histogram, : The histogram argument is only supported for (a) output types
## "default", "html", or "kableExtra"; (b) writing to file paths with extensions
## ".html", ".jpg", or ".png"; and (c) Rmarkdown or knitr documents compiled to PDF
## or HTML. Use `histogram=FALSE` to silence this warning.

	Unique (#)	Missing (%)	Mean	SD	Min	Median	Max
attention check	2	17	1.0	0.0	1.0	1.0	1.0
age	44	14	37.3	11.5	21.0	33.0	72.0
FageAff	51	4	5.4	1.0	3.0	5.5	7.0
MageAff	59	6	5.3	0.9	3.0	5.4	7.0
FcomAff	57	4	5.3	1.0	3.0	5.3	7.0
McomAff	57	6	5.2	0.9	3.0	5.3	7.0
Ftrust	63	5	5.3	1.0	2.8	5.5	7.0
Mtrust	75	7	5.2	1.0	2.8	5.3	7.0
Mtrust2	42	7	5.2	1.0	3.0	5.3	7.0
Ftrust2	30	5	5.4	1.0	2.8	5.6	7.0

With this package, you can also get summary statistics by another variable. For example, you can get the Ftrust and Mtrust statistics by participants’ gender using:

#Just Ftrust by participants' gender
datasummary((Ftrust)~Pgender*(mean+sd)*Arguments(na.rm = TRUE), data=Study1)

	female / mean	female / sd	male / mean	male / sd
Ftrust	5.48	0.89	5.25	0.97

#Both Ftrust and Mtrust by participants' gender
datasummary((Ftrust+Mtrust)~Pgender*(mean+sd)*Arguments(na.rm = TRUE), data=Study1)

	female / mean	female / sd	male / mean	male / sd
Ftrust	5.48	0.89	5.25	0.97
Mtrust	5.27	0.91	5.13	0.98
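Another way to handle missing values in datasummary, if you find the Arguments() syntax hard to remember, is to write small helper functions that already include na.rm = TRUE. The sketch below uses made-up data, and the helper names mean_na and sd_na are my own inventions. I also ask for the output as a data frame, assuming the "data.frame" output type is available in your version of modelsummary.

```r
library(modelsummary)

# Toy data: made-up trust scores, one missing
toy <- data.frame(
  Pgender = factor(c("female", "female", "female", "male", "male", "male")),
  Ftrust  = c(6, 5, NA, 4, 6, 5)
)

# Hypothetical helpers (names made up here) that always ignore missing values
mean_na <- function(x) mean(x, na.rm = TRUE)
sd_na   <- function(x) sd(x, na.rm = TRUE)

out <- datasummary(Ftrust ~ Pgender * (mean_na + sd_na),
                   data = toy, output = "data.frame")
out
```

The trade-off is that the helper names show up in the column headers, so pick names you would not mind seeing in a table.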

Okay, but the variable names are hard to understand if you do not know what the variables mean. We can give the rows more descriptive labels using this code:

datasummary(('Organizational Trust with Female Leaders Present'=Ftrust)+ ('Organizational Trust with Male Leaders Present'=Mtrust)~
              Pgender*(mean+sd)*Arguments(na.rm = TRUE), data=Study1)

	female / mean	female / sd	male / mean	male / sd
"Organizational Trust with Female Leaders Present"	5.48	0.89	5.25	0.97
"Organizational Trust with Male Leaders Present"	5.27	0.91	5.13	0.98

Bonus: Another feature of this library is the ability to get correlation matrices for your data. This is quite simple using the datasummary_correlation command.

datasummary_correlation(Study1)

## Warning in stats::cor(x, use = "pairwise.complete.obs", method = method): the
## standard deviation is zero

	attention check	age	FageAff	MageAff	FcomAff	McomAff	Ftrust	Mtrust	Mtrust2	Ftrust2
attention check	1	.	.	.	.	.	.	.	.	.
age		1	.	.	.	.	.	.	.	.
FageAff		.07	1	.	.	.	.	.	.	.
MageAff		.11	.91	1	.	.	.	.	.	.
FcomAff		.05	.92	.87	1	.	.	.	.	.
McomAff		.05	.80	.85	.87	1	.	.	.	.
Ftrust		.14	.86	.81	.88	.73	1	.	.	.
Mtrust		.07	.77	.78	.79	.87	.80	1	.	.
Mtrust2		.12	.72	.75	.74	.80	.78	.95	1	.
Ftrust2		.15	.80	.76	.82	.66	.97	.74	.74	1
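About that warning: "the standard deviation is zero" most likely comes from the attention check column, where every value is 1 (its SD is 0.0 in the summary table earlier), so no correlation can be computed for it. Here is a base R sketch, on made-up data, of how you could drop constant columns before correlating:

```r
# Toy data: one constant column and two columns that vary
toy <- data.frame(
  check  = c(1, 1, 1, 1, 1),    # constant, like an attention check everyone passed
  Ftrust = c(1, 2, 3, 4, 5),
  Mtrust = c(2, 4, NA, 8, 10)
)

# Keep only the columns that actually vary
varies <- sapply(toy, function(col) sd(col, na.rm = TRUE) > 0)
toy_varying <- toy[, varies, drop = FALSE]

# Pairwise deletion: each correlation uses the rows where both variables are present
cor_matrix <- cor(toy_varying, use = "pairwise.complete.obs")
cor_matrix
```

You could then pass the trimmed data frame to datasummary_correlation instead of the full dataset.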

Practice makes perfect

Download the data for Study 2 from the same project. The datasets may be downloaded here. In Study 2, the researchers examine whether seeing female leaders in a technological company (compared to male leaders) increases perceivers’ trust in the organizations and their perceived opportunities offered by those organizations.

1, Find the mean, median, sd, and skewness of the following variables: age, f_ageaff, m_ageaff, f_comaff, m_comaff

2, Find the number of women and men participating in the study (i.e., Pgender)

Concluding words

Now you know how to describe your data in different ways using R. With that, we have finished all the modules concerning the basics of data. Congratulations! Next, we will learn about data visualization.

References

Arel-Bundock, V. (2022). modelsummary: Data and Model Summaries in R. Journal of Statistical Software, 103(1), 1–23. https://www.jstatsoft.org/article/view/v103i01

Joshi, M. P., & Diekman, A. B. (2022). My Fair Lady? Inferring Organizational Trust From the Mere Presence of Women in Leadership Roles. Personality and Social Psychology Bulletin, 48(8), 1220–1237. https://doi.org/10.1177/01461672211035957