Welcome back to Rawlex. You have now covered all the R prerequisites you need before conducting data analysis. So far, we have learned multiple ways to load and manage our data. Now, we will start learning about data analysis, beginning with descriptive statistics. It is always good to know your data and be able to describe it to your readers. Thus, in this module, I will show you multiple ways to get different descriptive statistics from your data. Once again, it is your call to pick the way that best suits you and your goals. Let’s get started.
In this module, the data we will be using comes from Joshi & Diekman (2022). If you have time, I highly recommend you read this paper. In short, the paper examines the effects of the presence of women in leadership roles. As we dig deeper into some of their studies, I will give more context. You can download their data, available for open access, here.
By the end of this module, you will
1, Review the basics of descriptive statistics.
2, Be able to get descriptive statistics from your data using R.
3, Learn how to present the descriptive statistics using R.
Descriptive statistics, in a nutshell, are the statistics that summarize a distribution (i.e., your data). In this module, we will be using these descriptive statistics: central tendency (i.e., mean, median), variation (i.e., standard deviation), range, and skewness. Let’s review what each of these means.
Central tendency tells us where the data tends to be. There are 2 statistics we will be looking at: (1) The mean refers to the arithmetic average of a set of numerical values. (2) The median refers to the observation in the middle of an ordered dataset.
The variation tells us how spread out (or not) the data is. A measure of variation we will be looking at is standard deviation.
The range tells us the difference between the smallest (min) and largest (max) observations in our data.
Skewness tells us about the (a)symmetry of our distribution.
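To make these definitions concrete, here is a minimal sketch on a small made-up vector (the numbers are hypothetical, not from the study). It uses only base R, and the skewness is computed by hand with the moment-based formula, which is my assumption about the definition we want here.

```r
# Toy data, made up for illustration (not from the study)
x <- c(2, 4, 4, 5, 10)

x_mean   <- mean(x)         # arithmetic average: 25 / 5 = 5
x_median <- median(x)       # middle value of the ordered data: 4
x_sd     <- sd(x)           # sample standard deviation: sqrt(36 / 4) = 3
x_range  <- diff(range(x))  # largest minus smallest: 10 - 2 = 8

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
m2 <- mean((x - x_mean)^2)
m3 <- mean((x - x_mean)^3)
x_skew <- m3 / m2^(3 / 2)   # positive here: the long tail is on the right
```

Notice that the single large value (10) pulls the mean above the median, which is the typical signature of a right-skewed distribution.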
I understand this review is sparse, but our focus is on R. Should you want to learn more about descriptive statistics, consult this source.
Great. Now that we are caught up on different descriptive statistics, let’s learn how to get them using R. For the demonstration, I will be using Study 1 from Joshi & Diekman (2022). In this study, the researchers examine whether seeing female leaders in STEM (compared to male leaders) increases perceivers’ trust in the organizations and their perceptions of the opportunities offered by those organizations.
Let’s load in the data.
Stop: Try loading in the data yourself.
setwd("C:/Users/linht/Downloads/RBlog/Module5/")
library(haven)
## Warning: package 'haven' was built under R version 4.2.3
Study1 <- read_sav("C:/Users/linht/Downloads/RBlog/Module5/study1labadvisor_mturkdata.sav")
# Check it out!
View(Study1)
The most basic of R’s descriptive statistics commands is the summary command. Summary can be used on an entire dataset or on a single variable in that dataset.
# Let's try summarizing the whole dataset
summary(Study1)
## filter_$ Pgender age FageAff MageAff
## Min. :1 Min. :1.000 Min. :21.00 Min. :3.000 Min. :3.000
## 1st Qu.:1 1st Qu.:1.000 1st Qu.:29.00 1st Qu.:4.672 1st Qu.:4.641
## Median :1 Median :1.000 Median :33.00 Median :5.469 Median :5.438
## Mean :1 Mean :1.496 Mean :37.26 Mean :5.379 Mean :5.341
## 3rd Qu.:1 3rd Qu.:2.000 3rd Qu.:42.00 3rd Qu.:6.141 3rd Qu.:6.000
## Max. :1 Max. :2.000 Max. :72.00 Max. :7.000 Max. :7.000
## NA's :23 NA's :19 NA's :19 NA's :6 NA's :8
## FcomAff McomAff Ftrust Mtrust
## Min. :3.000 Min. :3.000 Min. :2.850 Min. :2.800
## 1st Qu.:4.500 1st Qu.:4.453 1st Qu.:4.650 1st Qu.:4.467
## Median :5.344 Median :5.312 Median :5.500 Median :5.267
## Mean :5.300 Mean :5.197 Mean :5.349 Mean :5.154
## 3rd Qu.:6.062 3rd Qu.:6.000 3rd Qu.:6.000 3rd Qu.:6.000
## Max. :7.000 Max. :7.000 Max. :7.000 Max. :7.000
## NA's :6 NA's :8 NA's :7 NA's :9
## Mtrust2 Ftrust2
## Min. :3.000 Min. :2.750
## 1st Qu.:4.400 1st Qu.:4.875
## Median :5.333 Median :5.625
## Mean :5.198 Mean :5.448
## 3rd Qu.:6.000 3rd Qu.:6.000
## Max. :7.000 Max. :7.000
## NA's :9 NA's :7
Wow, okay, that’s a lot. Something we can examine is the variable “age,” which denotes participants’ age. The mean is 37.26 and the median is 33. Age ranges from 21 to 72 (talk about a representative sample). What if we want to examine only the variable “age”?
# summary for a single variable
summary(Study1$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 21.00 29.00 33.00 37.26 42.00 72.00 19
As you can see, the summary command does not give us all the statistics we usually need. So, we can get the specific statistics using different commands.
Let’s start with central tendency (mean and median) and variation (sd).
#To get the mean:
mean(Study1$age, na.rm=TRUE)
## [1] 37.26087
#na.rm is an argument that tells R to remove (rm) all NA values in the age variable before computing. Some participants declined to answer the age question, so their value is NA (not available). Without na.rm=TRUE, the mean itself would come back as NA.
#To get the median
median(Study1$age, na.rm=TRUE)
## [1] 33
#To get the standard deviation
sd(Study1$age, na.rm=TRUE)
## [1] 11.45938
#Pretty straightforward, right?
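If you want to see what na.rm actually changes, here is a small sketch on a made-up vector (the ages are hypothetical). It also shows how to get the range, which uses the same argument.

```r
# Hypothetical ages; NA stands for a participant who declined to answer
ages <- c(21, 33, NA, 42)

mean_with_na <- mean(ages)                # NA: one missing value hides the answer
mean_age     <- mean(ages, na.rm = TRUE)  # (21 + 33 + 42) / 3 = 32
n_missing    <- sum(is.na(ages))          # how many values were dropped: 1

# range() returns the smallest and largest values
age_range <- range(ages, na.rm = TRUE)    # 21 42
```

Counting the missing values alongside the mean is a good habit, because it tells your readers how many observations the statistic is actually based on.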
Skewness is different from the other measures: base R does not have a command for it, so we need to use the moments package.
#If you haven't downloaded the moments package yet, install it first. Here's a refresher on how to install packages.
#install.packages("moments")
library(moments)
#Then the rest is pretty intuitive
skewness(Study1$age, na.rm = TRUE)
## [1] 1.092112
What if you have a categorical variable such as participants’ gender (i.e., Pgender)? In this case, the table command is the better option. The code below shows how to use this command:
#First, let's recode. I use Tidyverse here but if you prefer base R commands, you can also try.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Study1$Pgender <- as.character(Study1$Pgender) #I am making the gender variable a character variable instead of a numeric variable so that I can recode its values
Study1 <- Study1 %>% mutate(Pgender = recode(Pgender,
"1" = "male",
"2" = "female",
"3" = "transgender",
"4" = "rather not say"))
table(Study1$Pgender)
##
## female male
## 57 58
Great, so we see we have 57 participants identifying as female and 58 participants identifying as male (gender-balanced indeed).
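If you prefer base R for the recoding step, one option is factor(). The sketch below uses made-up codes (not the study data) and also shows two extras: useNA, which makes table report missing values instead of silently dropping them, and prop.table, which turns counts into proportions.

```r
# Hypothetical gender codes; NA = did not answer
gender_code <- c(1, 2, 2, 1, NA, 2)

# Attach labels to the numeric codes
gender <- factor(gender_code,
                 levels = c(1, 2, 3, 4),
                 labels = c("male", "female", "transgender", "rather not say"))

# droplevels() removes categories nobody chose; useNA adds a count for NAs
counts <- table(droplevels(gender), useNA = "ifany")

# Proportions among participants who answered
props <- prop.table(table(droplevels(gender)))
```

By default, table leaves out NAs, so the counts it prints can add up to less than your sample size. That is worth checking before you report them.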
As we learned in modules 3 and 4, there are multiple ways to do the same thing in R. In the previous section, we learned base R commands. Now, I will show you how to do the same thing using Tidyverse.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ stringr 1.5.0
## ✔ tidyr 1.3.0 ✔ forcats 0.5.2
## ✔ readr 2.1.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
#mean
Study1 %>% summarise(mean(age, na.rm=TRUE))
## # A tibble: 1 × 1
## `mean(age, na.rm = TRUE)`
## <dbl>
## 1 37.3
#median
Study1 %>% summarise(median(age, na.rm=TRUE))
## # A tibble: 1 × 1
## `median(age, na.rm = TRUE)`
## <dbl>
## 1 33
#standard deviation
Study1 %>% summarise(sd(age, na.rm=TRUE))
## # A tibble: 1 × 1
## `sd(age, na.rm = TRUE)`
## <dbl>
## 1 11.5
#what about all of them?
Study1$age <- as.numeric(Study1$age)
#funs() was deprecated in dplyr 0.8.0, so we pass a named list of functions instead
Study1 %>% summarize_at(vars(age), list(mean = ~mean(., na.rm=TRUE), median = ~median(., na.rm=TRUE), sd = ~sd(., na.rm=TRUE), length = length))
## # A tibble: 1 × 4
## mean median sd length
## <dbl> <dbl> <dbl> <int>
## 1 37.3 33 11.5 134
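One thing to keep in mind about the length column: it counts every row, including the 19 participants whose age is NA. If you want the number of valid responses, you can count them yourself. Here is a minimal sketch on a made-up tibble (the data and the column names mean_age, n_rows, and n_valid are my own choices), which also shows how to name the summary columns directly in summarise.

```r
library(dplyr)

# Hypothetical data; NA = declined to answer
toy <- tibble(age = c(21, 33, NA, 42))

age_stats <- toy %>%
  summarise(
    mean_age   = mean(age, na.rm = TRUE),
    median_age = median(age, na.rm = TRUE),
    sd_age     = sd(age, na.rm = TRUE),
    n_rows     = n(),               # all rows, missing or not
    n_valid    = sum(!is.na(age))   # rows with an actual age
  )

age_stats
```

Naming each statistic on the left of the equals sign gives you readable column names instead of names like `mean(age, na.rm = TRUE)`.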
Now let’s try it with multiple variables.
Study1 %>% summarize_at(vars(Ftrust, Mtrust), list(mean = ~mean(., na.rm=TRUE), median = ~median(., na.rm=TRUE), sd = ~sd(., na.rm=TRUE), length = length))
## # A tibble: 1 × 8
## Ftrust_mean Mtrust_mean Ftrust_median Mtrust_median Ftrust_sd Mtrust_sd
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5.35 5.15 5.5 5.27 0.953 0.963
## # ℹ 2 more variables: Ftrust_length <int>, Mtrust_length <int>
Beautiful. We can also use the tidyverse to get the means by groups. This uses a very useful command: group_by. group_by is a way to tell R that you want summary statistics by a certain group. For example, you can get summary statistics of the Ftrust and Mtrust variables grouped by the gender of the participants themselves:
Study1 %>% group_by(Pgender) %>% summarize_at(vars(Mtrust, Ftrust), list(mean = ~mean(., na.rm=TRUE), median = ~median(., na.rm=TRUE), sd = ~sd(., na.rm=TRUE), length = length))
## # A tibble: 3 × 9
## Pgender Mtrust_mean Ftrust_mean Mtrust_median Ftrust_median Mtrust_sd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 female 5.27 5.48 5.27 5.75 0.908
## 2 male 5.13 5.25 5.2 5.25 0.975
## 3 <NA> 4.65 5.18 4.9 5.45 1.12
## # ℹ 3 more variables: Ftrust_sd <dbl>, Mtrust_length <int>, Ftrust_length <int>
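In newer versions of dplyr, the same kind of grouped summary can also be written with across() inside summarise(). Below is a minimal sketch on a made-up tibble; the column names group, trust_f, and trust_m are hypothetical stand-ins, not variables from the study.

```r
library(dplyr)

# Hypothetical data: two groups, two trust ratings, one missing value
toy <- tibble(
  group   = c("a", "a", "b", "b", "b"),
  trust_f = c(5, 6, 4, NA, 6),
  trust_m = c(4, 5, 5, 5, 3)
)

by_group <- toy %>%
  group_by(group) %>%
  summarise(
    # apply every function in the list to every selected column
    across(c(trust_f, trust_m),
           list(mean = ~ mean(.x, na.rm = TRUE),
                sd   = ~ sd(.x, na.rm = TRUE)),
           .names = "{.col}_{.fn}"),
    n = n()  # group size, counting rows with missing values too
  )

by_group
```

The .names argument controls how the new columns are labelled, so here you get trust_f_mean, trust_f_sd, and so on. Pick whichever style you find easier to read; the statistics are the same.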
In your work, you might want to present a table of summary statistics. However, good tables are time-consuming to make. R makes it easy. First, you will need to install and load the modelsummary library (Arel-Bundock, 2022).
#remember to install the package first
#install.packages("modelsummary")
library(modelsummary)
## Warning: package 'modelsummary' was built under R version 4.2.3
#Now we get the summary statistics
modelsummary::datasummary_skim(Study1, output='markdown')
## Warning in datasummary_skim_numeric(data, output = output, fmt = fmt, histogram
## = histogram, : The histogram argument is only supported for (a) output types
## "default", "html", or "kableExtra"; (b) writing to file paths with extensions
## ".html", ".jpg", or ".png"; and (c) Rmarkdown or knitr documents compiled to PDF
## or HTML. Use `histogram=FALSE` to silence this warning.
| Unique (#) | Missing (%) | Mean | SD | Min | Median | Max |
---|---|---|---|---|---|---|---|
attention check | 2 | 17 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
age | 44 | 14 | 37.3 | 11.5 | 21.0 | 33.0 | 72.0 |
FageAff | 51 | 4 | 5.4 | 1.0 | 3.0 | 5.5 | 7.0 |
MageAff | 59 | 6 | 5.3 | 0.9 | 3.0 | 5.4 | 7.0 |
FcomAff | 57 | 4 | 5.3 | 1.0 | 3.0 | 5.3 | 7.0 |
McomAff | 57 | 6 | 5.2 | 0.9 | 3.0 | 5.3 | 7.0 |
Ftrust | 63 | 5 | 5.3 | 1.0 | 2.8 | 5.5 | 7.0 |
Mtrust | 75 | 7 | 5.2 | 1.0 | 2.8 | 5.3 | 7.0 |
Mtrust2 | 42 | 7 | 5.2 | 1.0 | 3.0 | 5.3 | 7.0 |
Ftrust2 | 30 | 5 | 5.4 | 1.0 | 2.8 | 5.6 | 7.0 |
#Now that's a beautiful summary table. Here, I set the output to markdown, but you can also make the output "html" or "latex" - your preference.
#What if we want categorical variables only?
datasummary_skim(Study1, type="categorical")
Pgender | N | % |
---|---|---|
female | 57 | 42.5 |
male | 58 | 43.3 |
NA | 19 | 14.2 |
#What if we want numerical variables only?
datasummary_skim(Study1, type="numeric")
## Warning in datasummary_skim_numeric(data, output = output, fmt = fmt, histogram
## = histogram, : The histogram argument is only supported for (a) output types
## "default", "html", or "kableExtra"; (b) writing to file paths with extensions
## ".html", ".jpg", or ".png"; and (c) Rmarkdown or knitr documents compiled to PDF
## or HTML. Use `histogram=FALSE` to silence this warning.
| Unique (#) | Missing (%) | Mean | SD | Min | Median | Max |
---|---|---|---|---|---|---|---|
attention check | 2 | 17 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
age | 44 | 14 | 37.3 | 11.5 | 21.0 | 33.0 | 72.0 |
FageAff | 51 | 4 | 5.4 | 1.0 | 3.0 | 5.5 | 7.0 |
MageAff | 59 | 6 | 5.3 | 0.9 | 3.0 | 5.4 | 7.0 |
FcomAff | 57 | 4 | 5.3 | 1.0 | 3.0 | 5.3 | 7.0 |
McomAff | 57 | 6 | 5.2 | 0.9 | 3.0 | 5.3 | 7.0 |
Ftrust | 63 | 5 | 5.3 | 1.0 | 2.8 | 5.5 | 7.0 |
Mtrust | 75 | 7 | 5.2 | 1.0 | 2.8 | 5.3 | 7.0 |
Mtrust2 | 42 | 7 | 5.2 | 1.0 | 3.0 | 5.3 | 7.0 |
Ftrust2 | 30 | 5 | 5.4 | 1.0 | 2.8 | 5.6 | 7.0 |
With this package, you can also get summary statistics by another variable. For example, you can get the Ftrust and Mtrust statistics by participants’ gender using:
#Just Ftrust by participants' gender
datasummary((Ftrust)~Pgender*(mean+sd)*Arguments(na.rm = TRUE), data=Study1)
| female / mean | female / sd | male / mean | male / sd |
---|---|---|---|---|
Ftrust | 5.48 | 0.89 | 5.25 | 0.97 |
#Both Ftrust and Mtrust by participants' gender
datasummary((Ftrust+Mtrust)~Pgender*(mean+sd)*Arguments(na.rm = TRUE), data=Study1)
| female / mean | female / sd | male / mean | male / sd |
---|---|---|---|---|
Ftrust | 5.48 | 0.89 | 5.25 | 0.97 |
Mtrust | 5.27 | 0.91 | 5.13 | 0.98 |
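If the formula looks cryptic, it may help to try it on a dataset that ships with R. The sketch below uses mtcars instead of the study data. As I understand the package documentation, modelsummary also provides capitalized helpers such as Mean and SD that drop missing values by default, so Arguments(na.rm = TRUE) is not needed with them; treat that as an assumption to check against your installed version.

```r
library(modelsummary)

# mtcars ships with R; make the grouping variable a factor first
dat <- mtcars
dat$cyl <- factor(dat$cyl)

# Rows go on the left of ~, columns on the right.
# cyl * (Mean + SD) means: for each level of cyl, show the mean and the SD.
tab <- datasummary(mpg + hp ~ cyl * (Mean + SD),
                   data = dat,
                   output = "data.frame")

tab
```

Setting output to "data.frame" returns the table as a regular data frame, which is handy when you want to inspect or post-process the numbers before formatting them.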
Okay, but the variable names are hard to interpret if you do not know what the variables mean. We can give the rows more descriptive labels using this code:
datasummary(('Organizational Trust with Female Leaders Present'=Ftrust)+ ('Organizational Trust with Male Leaders Present'=Mtrust)~
Pgender*(mean+sd)*Arguments(na.rm = TRUE), data=Study1)
| female / mean | female / sd | male / mean | male / sd |
---|---|---|---|---|
"Organizational Trust with Female Leaders Present" | 5.48 | 0.89 | 5.25 | 0.97 |
"Organizational Trust with Male Leaders Present" | 5.27 | 0.91 | 5.13 | 0.98 |
Bonus: Another feature of this library is the ability to get correlation matrices for your data. This is quite simple using the datasummary_correlation command.
datasummary_correlation(Study1)
## Warning in stats::cor(x, use = "pairwise.complete.obs", method = method): the
## standard deviation is zero
| attention check | age | FageAff | MageAff | FcomAff | McomAff | Ftrust | Mtrust | Mtrust2 | Ftrust2 |
---|---|---|---|---|---|---|---|---|---|---|
attention check | 1 | . | . | . | . | . | . | . | . | . |
age | | 1 | . | . | . | . | . | . | . | . |
FageAff | | .07 | 1 | . | . | . | . | . | . | . |
MageAff | | .11 | .91 | 1 | . | . | . | . | . | . |
FcomAff | | .05 | .92 | .87 | 1 | . | . | . | . | . |
McomAff | | .05 | .80 | .85 | .87 | 1 | . | . | . | . |
Ftrust | | .14 | .86 | .81 | .88 | .73 | 1 | . | . | . |
Mtrust | | .07 | .77 | .78 | .79 | .87 | .80 | 1 | . | . |
Mtrust2 | | .12 | .72 | .75 | .74 | .80 | .78 | .95 | 1 | . |
Ftrust2 | | .15 | .80 | .76 | .82 | .66 | .97 | .74 | .74 | 1 |
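The warning about a zero standard deviation comes from the attention check variable: every participant has the same value (1), and a correlation cannot be computed with a constant. One way to avoid this, sketched below in base R on made-up data (the column names check, a, and b are hypothetical), is to drop constant columns before computing the correlations.

```r
# Hypothetical data: "check" is constant, like an attention check everyone passed
dat <- data.frame(
  check = c(1, 1, 1, 1, 1),
  a     = c(1, 2, 3, 4, 5),
  b     = c(2, 4, 5, 4, 5)
)

# Flag the columns whose standard deviation is zero
is_constant <- sapply(dat, function(col) sd(col, na.rm = TRUE) == 0)

# Correlate only the columns that actually vary
cor_mat <- cor(dat[, !is_constant], use = "pairwise.complete.obs")
round(cor_mat, 2)
```

The use = "pairwise.complete.obs" option is the same one that appears in the warning message above: each correlation is based on all participants who have values on both variables.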
Download the data for Study 2 from the same project. The datasets may be downloaded here. In Study 2, the researchers examine whether seeing female leaders in a technology company (compared to male leaders) increases perceivers’ trust in the organizations and their perceptions of the opportunities offered by those organizations.
1, Find the mean, median, sd, and skewness of the following variables: age, f_ageaff, m_ageaff, f_comaff, m_comaff
2, Find the number of women and men participating in the study (i.e., Pgender)
Now you know how to describe your data in different ways using R. With that, we have finished all the modules concerning the basics of data. Congratulations! Next, we will learn about data visualization.
Arel-Bundock, V. (2022). modelsummary: Data and Model Summaries in R. Journal of Statistical Software, 103(1), 1–23. https://www.jstatsoft.org/article/view/v103i01
Joshi, M. P., & Diekman, A. B. (2022). My Fair Lady? Inferring Organizational Trust From the Mere Presence of Women in Leadership Roles. Personality and Social Psychology Bulletin, 48(8), 1220–1237. https://doi.org/10.1177/01461672211035957