Welcome back to Rawlex. Before you begin this module, I want you to marvel at how far you have come in learning R with me. So far, we have learned how to manage, manipulate, and visualize our data using a variety of packages. You have also learned how to describe your data. At this point, you have the most basic tools to kickstart your R journey. This blog definitely cannot cover every command you need, nor can it help you troubleshoot every error you get. However, with your current R knowledge, you can Google anything you need to do and understand the code you find. Some reliable sites to look for in your search results: StackOverflow, StackExchange, R Bloggers, GitHub. A pro tip from my lab mate, Andrew White (find him here): ask ChatGPT. It can generate R code and troubleshoot errors for you.

With that said, we will start learning about data analysis in this module (get excited, woohoo!). Specifically, we will learn about correlation and regression.

For this module, we will be using data from Chen et al. (2022), find the paper here. With nationally representative samples from both the U.S. and Taiwan, the researchers demonstrated how knowledge of COVID-19 and trust in the government significantly predicted mask-wearing behaviors during the pandemic. We will be analyzing their U.S. data, accessible here.

Goals for this module

At the end of this module, you will

1. Learn how to estimate regression models in R

2. Understand how to interpret R regression results

3. Be able to present R regression results

4. Check for violations of regression assumptions

Alright, let’s start learning by loading in our U.S. data.

# Setting our working directory. Don't forget to change the path to your own working directory; this one is mine.
setwd("/Users/leoascenzi/Downloads/RBlog/Module8")

# Load in the data
library(readxl)
US_Data <- read_excel("US Data.xlsx")
## New names:
## • `Q85` -> `Q85...101`
## • `Q85` -> `Q85...123`
# Removing the first row because it contains our variable codes, not the values for the variables. I haven't talked about this, but it is a good example of a moment where you would Google it or ask ChatGPT
US_Data <- US_Data[-1, ]
head(US_Data)
## # A tibble: 6 × 125
##   UserLanguage Q29   Q27_1 Q27_2 Q27_3 Q27_4 Q27_5 Q27_6 Q27_7 Q27_8 Q27_9 Q24_1
##   <chr>        <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 EN           Yes   97    94    84    90    89    82    89    91    96    Some…
## 2 EN           Yes   100   100   0     100   100   48    51    0     0     Some…
## 3 EN           Yes   100   70    0     100   100   100   3     <NA>  0     Some…
## 4 EN           Yes   72    92    57    95    84    95    57    28    14    Stro…
## 5 EN           Yes   63    48    24    66    29    48    59    100   49    Stro…
## 6 EN           Yes   100   100   1     100   100   58    1     1     1     Some…
## # ℹ 113 more variables: Q24_2 <chr>, Q24_3 <chr>, Q24_4 <chr>, Q40 <chr>,
## #   Q70 <chr>, Q71 <chr>, Q72 <chr>, Q69 <chr>, Q89 <chr>, Q35 <chr>,
## #   Q36 <chr>, Q42 <chr>, Q37 <chr>, Q56 <chr>, Q89_1 <chr>, Q89_2 <chr>,
## #   Q89_3 <chr>, Q89_4 <chr>, Q89_6 <chr>, Q30_1 <chr>, Q30_2 <chr>,
## #   Q30_3 <chr>, Q30_4 <chr>, Q30_5 <chr>, Q30_6 <chr>, Q30_7 <chr>,
## #   Q30_8 <chr>, Q76_1 <chr>, Q76_2 <chr>, Q76_3 <chr>, Q76_4 <chr>,
## #   Q76_5 <chr>, Q76_6 <chr>, Q49_1 <chr>, Q49_2 <chr>, Q49_3 <chr>, …

Correlation
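
As a quick sketch before we get to regression: the base R functions cor() and cor.test() will give you a correlation coefficient and a significance test for it. For example, we can check whether the two mask-wearing items (Q27_1 and Q27_2, which we will combine into a single variable later) move together. These items appear to be 0-100 ratings stored as text in the raw export, so we convert them to numeric first.

# Convert the two mask-wearing items to numeric (they are stored as character in the raw data)
x <- as.numeric(US_Data$Q27_1)
y <- as.numeric(US_Data$Q27_2)

cor(x, y, use = "complete.obs")   # Pearson correlation, dropping incomplete pairs
cor.test(x, y)                    # the same correlation, with a significance test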

Regression

Regression analysis is used to model and examine the relationship between a dependent variable (also called the response or outcome variable) and one or more independent variables (also called predictors or explanatory variables). The main goal of regression is to understand how the changes in the independent variables influence the dependent variable. Regression analysis provides the equation of a line or a curve that best fits the data and represents the relationship between the variables.

There are many different types of regression analysis; you could take a whole course on regression techniques alone. In the interest of keeping this module compact, organized, and comprehensible, I will equip you with the most basic types of regression (bivariate and multiple). For other regression techniques, the foundation I provide here will let you Google your way through and explore on your own.

Bivariate Regression

The workhorse of data analysis is linear regression. Linear regression is used as a hypothesis test when you have a continuous dependent variable. Sometimes you simply want to know the relationship between two variables, where the dependent variable is continuous. In that case, you can use bivariate (meaning two variables) regression.

Recall that linear regression takes the data and fits a linear model by estimating the regression parameters. In other words, linear regression takes the input (the X variable(s)) and the output (the Y variable) and estimates the following equation:

\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i \]

where \(\beta_0\) is the estimated Y intercept, \(\beta_1\) is the estimated effect of X on Y, and \(\epsilon_i\) is the estimated error (also called the residual).

Pay attention to this equation. This is our foundation. Make sure you spend time to understand what each element in this equation is. Later on, all of our subsequent model equations will be building on this basic equation. If you understand this equation, the rest will be easy to understand, too.

Estimating a bivariate regression model in R uses the lm command. The structure of the command is lm(y~x, data=DataName). Notice that this formula always has the y (or dependent variable) before the ~ and the x variable (or independent variable) afterward.

First, let’s examine if there is a relationship between COVID-19 perceived threat (‘threat’ is the predictor variable) and mask-wearing behaviors (‘mask’ is the outcome variable). Our equation becomes

\[ mask_i = \beta_0 + \beta_1*threat_i + \epsilon_i \]

If the equation conceptually makes sense, you are in a good spot.

Stop and Answer: What do \(\beta_0\), \(\beta_1\), and \(\epsilon_i\) mean in this specific equation?

Remember, the structure of the command is lm(y~x, data=DataName). Let’s apply it to this equation.

# Creating the variables of interest by calculating the mean of the corresponding items

# Recode the perceived threat variables

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
US_Data$Q24_1 <- recode(US_Data$Q24_1,
                        "Strongly disagree" = "1",
                        "Somewhat disagree" = "2",
                        "Neither agree nor disagree" = "3",
                        "Somewhat agree" = "4",
                        "Strongly agree" = "5")

US_Data$Q24_2 <- recode(US_Data$Q24_2,
                        "Strongly disagree" = "1",
                        "Somewhat disagree" = "2",
                        "Neither agree nor disagree" = "3",
                        "Somewhat agree" = "4",
                        "Strongly agree" = "5")

US_Data$Q24_3 <- recode(US_Data$Q24_3,
                        "Strongly disagree" = "1",
                        "Somewhat disagree" = "2",
                        "Neither agree nor disagree" = "3",
                        "Somewhat agree" = "4",
                        "Strongly agree" = "5")

US_Data$Q24_1 <- as.numeric(US_Data$Q24_1) #making the variables numeric instead of characters
US_Data$Q24_2 <- as.numeric(US_Data$Q24_2)
US_Data$Q24_3 <- as.numeric(US_Data$Q24_3)
US_Data$threat <- rowMeans(US_Data[, c("Q24_1", "Q24_2", "Q24_3")], na.rm = TRUE) #remember this from Module 3?

US_Data$Q27_1 <- as.numeric(US_Data$Q27_1)
US_Data$Q27_2 <- as.numeric(US_Data$Q27_2)
US_Data$mask <- rowMeans(US_Data[, c("Q27_1", "Q27_2")], na.rm = TRUE)

# To the regression model. I'm calling it bi_reg for bivariate regression

# Note: lm() automatically drops rows with missing values (see "observations deleted due to missingness" in the output)
bi_reg <- lm(mask~threat, data=US_Data)
summary(bi_reg)
## 
## Call:
## lm(formula = mask ~ threat, data = US_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.434 -16.910   1.102  16.740  50.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.3746     3.4472  12.002   <2e-16 ***
## threat        8.5119     0.9118   9.335   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.58 on 519 degrees of freedom
##   (39 observations deleted due to missingness)
## Multiple R-squared:  0.1438, Adjusted R-squared:  0.1421 
## F-statistic: 87.14 on 1 and 519 DF,  p-value: < 2.2e-16

The “Estimate” column of that output gives the estimated \(\beta\) coefficients, and the “Std. Error” column gives their standard errors. The third column is the result of the t-test for statistical significance. The last column is the p-value, the measure of statistical significance most commonly used for regression coefficients. The p-value tells you the probability of finding your results (or something more extreme) if the null hypothesis were true. In other words, how likely are your results due to random chance? If the answer is “not likely”, we can conclude that this is a systematic relationship. The convention is that p-values less than .05 are considered “statistically significant”. In our regression, we can see that there is a statistically significant and positive relationship between mask-wearing behavior and perceived threat of COVID-19 (makes sense).

The other information R gives you is the Residual Standard Error and the number of degrees of freedom (the number of observations used in the model minus the number of parameters estimated). R also provides measures of fit: the Multiple R-squared and Adjusted R-squared scores tell us how much of the variation in Y is explained by our model (about 14% here, not great but not nothing). The Adjusted R-squared penalizes the model for each parameter added; we usually use this measure when judging how well our model fits.
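
By the way, if you ever want to pull these numbers out of the model object instead of reading them off the printout (for example, to reuse them in a table or a plot), base R has accessor functions for that. A quick optional sketch:

# The coefficient table (estimates, SEs, t values, p-values) as a matrix
coef(summary(bi_reg))

# 95% confidence intervals for the coefficients
confint(bi_reg)

# R-squared and adjusted R-squared
summary(bi_reg)$r.squared
summary(bi_reg)$adj.r.squared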

Based on this output, our equation becomes

\[ mask_i = 41.37 + 8.51*threat_i \] A one unit increase in perceived threat of COVID-19 is associated with an 8.51 unit increase in mask-wearing behavior.
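
One way to convince yourself of this interpretation is to ask the model for predictions at two hypothetical threat values that are one unit apart; the predictions should differ by about 8.51 (the values 3 and 4 below are arbitrary).

# Predicted mask-wearing at threat = 3 and threat = 4 (hypothetical values)
predict(bi_reg, newdata = data.frame(threat = c(3, 4)))
# The two predictions differ by the threat coefficient, about 8.51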

Multiple Regression

Phew, that was a lot. If you understood all of that, the following sections will be much easier to comprehend. What if we want to control for confounding variables? For example, what if age also influences mask-wearing behaviors? We control for a variable by including it in our regression model. In terms of our generic regression equation, we simply add the new variable to the equation:

\[ mask_i = \beta_0 + \beta_1*threat_i + \beta_2*age_i + \epsilon_i \]

That looks a little bit more complicated, but conceptually, it is not different from what we have been doing so far.

\(\beta_1\) is the estimated effect of perceived threat of COVID-19 on mask-wearing behaviors holding age constant.

\(\beta_2\) is the estimated effect of age on mask-wearing behaviors holding perceived COVID-19 threat constant.

\(\beta_0\) is the estimated intercept, and \(\epsilon_i\) is the estimated error (also called the residual).

Not too bad, right? Let’s run the model. In R, we use the same lm command; the only difference is that we add the additional variables with the + symbol.

US_Data$age <- as.numeric(US_Data$Q19) #age variable

# Finally, to the regression model
mul_reg <- lm(mask~threat+age, data=US_Data)

summary(mul_reg)
## 
## Call:
## lm(formula = mask ~ threat + age, data = US_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.733 -16.683   1.317  16.934  49.901 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.157e+01  3.540e+00  11.743   <2e-16 ***
## threat       8.534e+00  9.337e-01   9.139   <2e-16 ***
## age         -1.424e-08  7.099e-09  -2.005   0.0455 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.58 on 497 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.1566, Adjusted R-squared:  0.1532 
## F-statistic: 46.15 on 2 and 497 DF,  p-value: < 2.2e-16

The way to interpret this output is the same as above. Note that both perceived threat and age are statistically significant predictors of mask-wearing behaviors in this model (age only barely, at p = .046).

Stop and try: What are our \(\beta_0\), \(\beta_1\), and \(\beta_2\)? Which ones are significant?

Remember, our interpretation of beta values is a little different from bivariate regression.

“A one unit increase in perceived threat of COVID-19 is associated with an 8.53 unit increase in mask-wearing behaviors, holding participants’ age constant.”

See the difference?

Regression with Categorical Predictor Variables (X)

What if we want to control for the education level and income level variables, too? It’s harder to conceptualize what a one unit increase in education level means (is the “increase” from finishing high school to finishing a 2-year college the same as the increase from a 2-year college to a 4-year college?).

Therefore, we have to tell R to treat income and education level as categorical variables with the as.factor() command. Our model, now with all the controls, becomes

\[ mask_i = \beta_0 + \beta_1*threat_i + \beta_2*age_i + \beta_3*education_i + \beta_4*income_i + \epsilon_i \] \(\beta_3\) is the estimated effect of education on mask-wearing behaviors holding the other 3 variables constant.

\(\beta_4\) is the estimated effect of income on mask-wearing behaviors holding the other 3 variables constant.

The rest you already know.

The following code shows how to use it.

# Recode education variable
US_Data$edu <- recode(US_Data$Q84,
                        "Less than high school" = "1",
                        "High school graduate" = "2",
                        "Some college" = "3",
                        "2 year degree" = "4",
                        "4 year degree" = "5",
                      "Masters degree" = "6",
                      "Doctorate" = "7",
                      "Professional degree" = "8")
US_Data$edu <- as.numeric(US_Data$edu)

# Recode income variable
US_Data$income <- recode(US_Data$Q85...123,
                            "Less than $10,000" = "1",
                        "$10,000 - $19,999" = "2",
                        "$20,000 - $29,999" = "3",
                        "$30,000 - $39,999" = "4",
                        "$40,000 - $49,999" = "5",
                        "$50,000 - $59,999" = "6",
                      "$60,000 - $69,999" = "7",
                      "$70,000 - $79,999" = "8",
                      "$80,000 - $89,999" = "9",
                      "$90,000 - $99,999" = "10",
                      "$100,000 - $149,999" = "11",
                      "More than $150,000" = "12")
US_Data$income <- as.numeric(US_Data$income)

# Let's put it all together
mul_reg_withcatIV <- lm(mask~threat+age+as.factor(income)+as.factor(edu), data=US_Data)

summary(mul_reg_withcatIV)
## 
## Call:
## lm(formula = mask ~ threat + age + as.factor(income) + as.factor(edu), 
##     data = US_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.305 -16.230   1.797  16.582  57.402 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.375e+01  7.052e+00   4.786 2.27e-06 ***
## threat               8.849e+00  9.559e-01   9.257  < 2e-16 ***
## age                 -1.207e-08  7.164e-09  -1.685   0.0926 .  
## as.factor(income)2  -2.727e+00  4.787e+00  -0.570   0.5691    
## as.factor(income)3   2.081e+00  4.364e+00   0.477   0.6337    
## as.factor(income)4   2.811e+00  4.639e+00   0.606   0.5448    
## as.factor(income)5   2.456e+00  4.780e+00   0.514   0.6077    
## as.factor(income)6   8.862e+00  5.166e+00   1.715   0.0869 .  
## as.factor(income)7   2.281e+00  6.381e+00   0.357   0.7209    
## as.factor(income)8  -8.852e-01  5.513e+00  -0.161   0.8725    
## as.factor(income)9  -3.421e+00  7.402e+00  -0.462   0.6442    
## as.factor(income)10  4.286e+00  5.914e+00   0.725   0.4690    
## as.factor(income)11 -5.853e+00  4.914e+00  -1.191   0.2342    
## as.factor(income)12 -3.260e+00  5.617e+00  -0.580   0.5620    
## as.factor(edu)2      2.993e+00  6.180e+00   0.484   0.6284    
## as.factor(edu)3      1.134e+01  6.308e+00   1.798   0.0729 .  
## as.factor(edu)4      4.422e+00  6.914e+00   0.640   0.5227    
## as.factor(edu)5      6.906e+00  6.444e+00   1.072   0.2844    
## as.factor(edu)6      5.776e+00  6.904e+00   0.837   0.4032    
## as.factor(edu)7      6.512e+00  8.368e+00   0.778   0.4368    
## as.factor(edu)8      5.097e+00  9.987e+00   0.510   0.6100    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.5 on 479 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.1929, Adjusted R-squared:  0.1592 
## F-statistic: 5.723 on 20 and 479 DF,  p-value: 1.944e-13

Now this output is a little harder to interpret, isn’t it? You are getting more and more information now. But worry not. You will notice that income value “1” and edu value “1” are missing. That is because they are set as the reference level of each variable by default. This means every coefficient you see for the other levels of that variable (e.g., edu value “2”) is a comparison to the reference level (i.e., edu value “1”). For example, participants who are high school graduates (value 2) score 2.993 units higher in mask-wearing behaviors than those whose education level is less than high school (value 1), although this difference is not statistically significant.
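
If you are ever unsure which category R is treating as the reference, you can check the factor levels yourself; the first level listed is the reference by default.

# The first level returned is the default reference category
levels(as.factor(US_Data$edu))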

What if you want to change the reference level? Perhaps you do not want participants with less than a high school education to be the reference category of our model. You can tell R which level to use as the reference with the relevel command, using one of two strategies.

#Let's make 4-year college graduate as the reference group
#Strategy 1
model4<-lm(mask~threat+age+as.factor(income)+relevel(as.factor(edu), ref="5"), data=US_Data)

summary(model4)
## 
## Call:
## lm(formula = mask ~ threat + age + as.factor(income) + relevel(as.factor(edu), 
##     ref = "5"), data = US_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.305 -16.230   1.797  16.582  57.402 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          4.066e+01  5.337e+00   7.618 1.39e-13 ***
## threat                               8.849e+00  9.559e-01   9.257  < 2e-16 ***
## age                                 -1.207e-08  7.164e-09  -1.685   0.0926 .  
## as.factor(income)2                  -2.727e+00  4.787e+00  -0.570   0.5691    
## as.factor(income)3                   2.081e+00  4.364e+00   0.477   0.6337    
## as.factor(income)4                   2.811e+00  4.639e+00   0.606   0.5448    
## as.factor(income)5                   2.456e+00  4.780e+00   0.514   0.6077    
## as.factor(income)6                   8.862e+00  5.166e+00   1.715   0.0869 .  
## as.factor(income)7                   2.281e+00  6.381e+00   0.357   0.7209    
## as.factor(income)8                  -8.852e-01  5.513e+00  -0.161   0.8725    
## as.factor(income)9                  -3.421e+00  7.402e+00  -0.462   0.6442    
## as.factor(income)10                  4.286e+00  5.914e+00   0.725   0.4690    
## as.factor(income)11                 -5.853e+00  4.914e+00  -1.191   0.2342    
## as.factor(income)12                 -3.260e+00  5.617e+00  -0.580   0.5620    
## relevel(as.factor(edu), ref = "5")1 -6.906e+00  6.444e+00  -1.072   0.2844    
## relevel(as.factor(edu), ref = "5")2 -3.914e+00  3.339e+00  -1.172   0.2418    
## relevel(as.factor(edu), ref = "5")3  4.434e+00  3.373e+00   1.314   0.1894    
## relevel(as.factor(edu), ref = "5")4 -2.484e+00  4.202e+00  -0.591   0.5547    
## relevel(as.factor(edu), ref = "5")6 -1.130e+00  3.804e+00  -0.297   0.7665    
## relevel(as.factor(edu), ref = "5")7 -3.942e-01  6.007e+00  -0.066   0.9477    
## relevel(as.factor(edu), ref = "5")8 -1.809e+00  8.198e+00  -0.221   0.8254    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.5 on 479 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.1929, Adjusted R-squared:  0.1592 
## F-statistic: 5.723 on 20 and 479 DF,  p-value: 1.944e-13
#Strategy 2

US_Data<-within(US_Data, edu<-relevel(as.factor(edu), ref= "5"))

model5<-lm(mask~threat+age+as.factor(income)+as.factor(edu), data=US_Data)

summary(model5)
## 
## Call:
## lm(formula = mask ~ threat + age + as.factor(income) + as.factor(edu), 
##     data = US_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.305 -16.230   1.797  16.582  57.402 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4.066e+01  5.337e+00   7.618 1.39e-13 ***
## threat               8.849e+00  9.559e-01   9.257  < 2e-16 ***
## age                 -1.207e-08  7.164e-09  -1.685   0.0926 .  
## as.factor(income)2  -2.727e+00  4.787e+00  -0.570   0.5691    
## as.factor(income)3   2.081e+00  4.364e+00   0.477   0.6337    
## as.factor(income)4   2.811e+00  4.639e+00   0.606   0.5448    
## as.factor(income)5   2.456e+00  4.780e+00   0.514   0.6077    
## as.factor(income)6   8.862e+00  5.166e+00   1.715   0.0869 .  
## as.factor(income)7   2.281e+00  6.381e+00   0.357   0.7209    
## as.factor(income)8  -8.852e-01  5.513e+00  -0.161   0.8725    
## as.factor(income)9  -3.421e+00  7.402e+00  -0.462   0.6442    
## as.factor(income)10  4.286e+00  5.914e+00   0.725   0.4690    
## as.factor(income)11 -5.853e+00  4.914e+00  -1.191   0.2342    
## as.factor(income)12 -3.260e+00  5.617e+00  -0.580   0.5620    
## as.factor(edu)1     -6.906e+00  6.444e+00  -1.072   0.2844    
## as.factor(edu)2     -3.914e+00  3.339e+00  -1.172   0.2418    
## as.factor(edu)3      4.434e+00  3.373e+00   1.314   0.1894    
## as.factor(edu)4     -2.484e+00  4.202e+00  -0.591   0.5547    
## as.factor(edu)6     -1.130e+00  3.804e+00  -0.297   0.7665    
## as.factor(edu)7     -3.942e-01  6.007e+00  -0.066   0.9477    
## as.factor(edu)8     -1.809e+00  8.198e+00  -0.221   0.8254    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.5 on 479 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.1929, Adjusted R-squared:  0.1592 
## F-statistic: 5.723 on 20 and 479 DF,  p-value: 1.944e-13

Works like a charm.

Interactive Regression Models

Occasionally, we hypothesize that the effect of one predictor variable depends on the value of another predictor variable. In this case, we need an interactive model, which is a model that includes an interaction term. An interaction term is created by multiplying two predictor variables together.

For example, we could hypothesize that the effect of perceived COVID-19 threat depends on the education or age of the participants.
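
One quick note on syntax before we run anything: in an R formula, x1*x2 is shorthand for the two main effects plus their interaction (x1 + x2 + x1:x2). Here is a small illustration with variables we already have (this particular model is just for demonstration):

# These two calls fit exactly the same model:
# threat*age expands to threat + age + threat:age
fit_a <- lm(mask ~ threat * age, data = US_Data)
fit_b <- lm(mask ~ threat + age + threat:age, data = US_Data)
all.equal(coef(fit_a), coef(fit_b))   # should be TRUE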

Interactive Models with Categorical Variables

Let’s first run a model where we hypothesize that the effect of age on mask-wearing behaviors depends on the education level of the participants (an age-by-education interaction).

interactive.model1<-lm(mask~threat+age+as.factor(income)+edu+age*edu, data=US_Data)

summary(interactive.model1)
## 
## Call:
## lm(formula = mask ~ threat + age + as.factor(income) + edu + 
##     age * edu, data = US_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.603 -15.776   2.304  16.531  54.205 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.919e+01  5.367e+00   7.301 1.22e-12 ***
## threat               9.163e+00  9.606e-01   9.539  < 2e-16 ***
## age                 -1.174e-08  7.137e-09  -1.645  0.10072    
## as.factor(income)2  -2.616e+00  4.800e+00  -0.545  0.58602    
## as.factor(income)3   2.854e+00  4.373e+00   0.653  0.51428    
## as.factor(income)4   4.026e+00  4.686e+00   0.859  0.39071    
## as.factor(income)5   2.381e+00  4.783e+00   0.498  0.61875    
## as.factor(income)6   9.428e+00  5.162e+00   1.827  0.06840 .  
## as.factor(income)7   1.895e+00  6.392e+00   0.297  0.76697    
## as.factor(income)8  -5.985e-01  5.507e+00  -0.109  0.91350    
## as.factor(income)9  -3.189e+00  7.447e+00  -0.428  0.66874    
## as.factor(income)10  4.821e+00  5.918e+00   0.815  0.41563    
## as.factor(income)11 -5.590e+00  4.990e+00  -1.120  0.26318    
## as.factor(income)12 -2.940e+00  5.689e+00  -0.517  0.60551    
## edu1                 2.738e+01  1.425e+01   1.921  0.05534 .  
## edu2                -3.992e+00  3.348e+00  -1.192  0.23373    
## edu3                 2.230e+00  7.139e+00   0.312  0.75490    
## edu4                -2.350e+00  1.038e+01  -0.226  0.82101    
## edu6                 3.473e+00  1.087e+01   0.319  0.74955    
## edu7                -9.339e+00  2.414e+01  -0.387  0.69907    
## edu8                -3.251e+01  2.309e+01  -1.408  0.15982    
## age:edu1            -9.071e-01  3.344e-01  -2.713  0.00692 ** 
## age:edu2                    NA         NA      NA       NA    
## age:edu3             4.444e-02  1.271e-01   0.350  0.72680    
## age:edu4            -3.541e-03  1.915e-01  -0.018  0.98525    
## age:edu6            -1.042e-01  2.188e-01  -0.476  0.63411    
## age:edu7             2.051e-01  5.396e-01   0.380  0.70398    
## age:edu8             6.770e-01  4.774e-01   1.418  0.15684    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.41 on 473 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.2093, Adjusted R-squared:  0.1658 
## F-statistic: 4.815 on 26 and 473 DF,  p-value: 5.839e-13

This output is not too different from before, except for the final few interaction terms. We see that the age:edu1 coefficient is significant, meaning the effect of age in the lowest education group (edu = 1) differs significantly from the effect of age in our reference group (edu = 5). That’s something to consider.

Interactive Models with Continuous Variables

What if we want to predict mask-wearing behaviors from both perceived threat of COVID-19 and perceived reality of COVID-19? What if we also hypothesize that the effect of perceived threat of COVID-19 depends on the perceived reality of COVID-19 (two continuous variables)?

Same code as we have always learned.

# Calculating the perceived reality of COVID-19
US_Data$Q109 <- recode(US_Data$Q109,
                        "Strongly disagree" = "1",
                        "Somewhat disagree" = "2",
                        "Neither agree nor disagree" = "3",
                        "Somewhat agree" = "4",
                        "Strongly agree" = "5")

US_Data$Q110 <- recode(US_Data$Q110,
                        "Strongly disagree" = "1",
                        "Somewhat disagree" = "2",
                        "Neither agree nor disagree" = "3",
                        "Somewhat agree" = "4",
                        "Strongly agree" = "5")

US_Data$Q111 <- recode(US_Data$Q111,
                        "Strongly disagree" = "1",
                        "Somewhat disagree" = "2",
                        "Neither agree nor disagree" = "3",
                        "Somewhat agree" = "4",
                        "Strongly agree" = "5")
US_Data$Q109 <- as.numeric(US_Data$Q109)
## Warning: NAs introduced by coercion
US_Data$Q110 <- as.numeric(US_Data$Q110)
## Warning: NAs introduced by coercion
US_Data$Q111 <- as.numeric(US_Data$Q111)
## Warning: NAs introduced by coercion
US_Data$reality <- rowMeans(US_Data[, c("Q109", "Q110", "Q111")], na.rm = TRUE)

# To the regression model. I'm calling it interactive.model2

# Note: lm() automatically drops rows with missing values
interactive.model2 <- lm(mask~threat+reality+threat*reality, data=US_Data)
summary(interactive.model2)
## 
## Call:
## lm(formula = mask ~ threat + reality + threat * reality, data = US_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -83.061 -16.915   0.735  17.688  52.950 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     26.2475     7.7253   3.398 0.000738 ***
## threat          12.6563     2.1072   6.006 3.83e-09 ***
## reality          5.5835     2.6999   2.068 0.039192 *  
## threat:reality  -1.5103     0.6981  -2.163 0.031018 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.83 on 465 degrees of freedom
##   (91 observations deleted due to missingness)
## Multiple R-squared:  0.1554, Adjusted R-squared:   0.15 
## F-statistic: 28.52 on 3 and 465 DF,  p-value: < 2.2e-16

We see that perceived threat and perceived reality of COVID-19 do interact with one another. However, with these interactive models, interpretation becomes much harder. Thus, you need to review your theoretical foundations and pick the best model for your research question.
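
To make the interpretation a bit more concrete, you can compute the estimated slope of threat at a few different levels of reality directly from the coefficients above; it is just arithmetic on the output:

# Slope of threat at a given level of reality:
#   beta_threat + beta_interaction * reality
b <- coef(interactive.model2)
reality_values <- c(1, 3, 5)   # low, middle, and high perceived reality
b["threat"] + b["threat:reality"] * reality_values
# e.g., at reality = 5 the slope of threat is roughly 12.66 - 1.51*5, about 5.1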

Displaying Results with Tables

After you estimate a linear model, R has some cool features to let you convert those results into a professional-looking table. You can use a package here called stargazer (Hlavac, 2018). To install the package, use the same code that we have used previously, remembering to use the library command to call up the package.

install.packages("stargazer")
library(stargazer)

Here is a demonstration:

stargazer(mul_reg, type="html", out="MultipleRegression.doc", 
          covariate.labels = c("Perceived Threat", "Age"))

Here is what it looks like:

stargazer output

Let’s try putting all the models we ran together.

stargazer(bi_reg, mul_reg, mul_reg_withcatIV, model4, model5, type="html",
          out="Regression.doc", 
          covariate.labels = c("Perceived Threat", "Age", "Income", "Education"))

Since the output is a little large, I will not include a screenshot here; try running it and see for yourself.

Another package that makes nice tables in R is sjPlot. The following code shows how to create these tables. You will need to install a few packages for this to work.

install.packages("sjPlot")
install.packages("sjmisc")
install.packages("sjlabelled")

library(sjPlot)
library(sjmisc)
library(sjlabelled)

Now let’s try it!

tab_model(bi_reg, mul_reg, pred.labels = 
            c("Intercept", "Threat", "Age"), 
          dv.labels = c("Bivariate Regression", 
                        "Multiple Regression"), 
          string.pred = "Predictors", string.ci = "C.I. (95%)", 
          string.p="P-Value", show.ci=FALSE)

Beautiful! Here is the output you will see in the Viewer pane in RStudio:

sjPlot output

Regression Assumptions

Linear regression makes several assumptions about the data when you estimate the model. I won’t go through every assumption here; see your textbook for the full list. Note: X refers to your predictor variables.

In this blog, I will define some of the assumption violations and demonstrate how to diagnose each issue.
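
Before we walk through the individual checks, it is worth knowing that calling plot() on an lm object gives you a standard panel of diagnostic plots (residuals vs. fitted, a normal Q-Q plot, a scale-location plot, and residuals vs. leverage) in one go. This is optional; the sections below cover each check separately.

# Base R's built-in regression diagnostics, arranged in a 2x2 grid
par(mfrow = c(2, 2))
plot(mul_reg)
par(mfrow = c(1, 1))   # reset the plotting layout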

Normality of the Residuals

The residuals (or the errors) must be normally distributed in order for us to make inferences from our results. To check this, we first need to extract the residuals from R. Remember that residuals are defined as:

\[ residual = y_i - \hat{y}_i \]

Where \(y_i\) is the actual value of y for observation i and \(\hat{y}_i\) is the predicted value of Y for observation i based on your regression equation.

To get R to calculate the residuals for us, we use the following code:

residual<-mul_reg$residuals

# visualize it
plot(density(residual), main="Density Plot of Residuals")

hist(residual, main="Histogram of Residuals", xlab="Residuals")

The way to interpret these plots is to look at their shape. Do they look normal-ish? Not quite… You should also consider the number of observations (about 500 here). The larger the number of observations, the more clearly normal we would expect the residuals to look.

If you want a more precise diagnostic, you can look at the Q-Q plot. This plot is designed to check whether the quantiles of the residuals match the quantiles of the normal distribution. The code is as follows:

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
qqPlot(mul_reg)

## [1]  96 468

To interpret this figure, you are looking for most of the points to be on, or as close as possible to, the line. The plot also includes a margin of error (the dashed lines). As long as the vast majority of your residuals fall inside the dashed lines, you have approximately normally distributed residuals. The command also prints the row numbers of the observations with the largest absolute residuals (96 and 468 here).
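
If you would also like a formal test to go with the plots, the Shapiro-Wilk test in base R tests the null hypothesis that the residuals are normally distributed, so a small p-value suggests non-normality. This is an optional extra check.

# Shapiro-Wilk normality test on the residuals (null: residuals are normal)
shapiro.test(residuals(mul_reg))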

Autocorrelation

Autocorrelation occurs when the residual of one observation is correlated with the residual of the next observation. Our regression assumes that the residuals are independent of each other; when autocorrelation is present, they are not.

The Durbin-Watson test is designed to look for serial autocorrelation. It is a hypothesis test, just like a t-test, so it is important to understand the null and alternative hypotheses being tested here. In the Durbin-Watson test:

H0 = No serial autocorrelation

Ha = Serial autocorrelation

This means that we do not want to reject the null hypothesis. If we get a statistically significant p-value (less than .05), then we reject the null hypothesis that there is no serial autocorrelation. To run this diagnostic in R, you first need to install the lmtest package.

#install.packages("lmtest")
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
dwtest(mul_reg)
## 
##  Durbin-Watson test
## 
## data:  mul_reg
## DW = 1.8185, p-value = 0.02093
## alternative hypothesis: true autocorrelation is greater than 0

Unfortunately, it is significant. However, these are not time series data, so we had no prior expectation of serial autocorrelation. As a result, we should use another diagnostic to be sure.

The other autocorrelation diagnostic we can use is the Breusch-Godfrey test. The null and alternative hypotheses remain the same as for the Durbin-Watson test. The difference is that the Durbin-Watson test can only be used with nonstochastic regressors and can only test for first-order autocorrelation; with neither restriction, the Breusch-Godfrey test is more general. To conduct the test in R:

bgtest(mul_reg)
## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  mul_reg
## LM test = 4.0444, df = 1, p-value = 0.04432

Still significant. It seems like our regression model is violating this assumption.

Heteroscedasticity

Heteroscedasticity simply means non-constant error variance. Our regression model assumes homoscedasticity, meaning constant error variance. When heteroscedasticity is present, the standard errors in your regression model are estimated incorrectly, which can make your hypothesis tests misleading.

The best way to check for heteroscedasticity in R is to plot the residuals against the fitted values of the model. You are looking for a “trumpet” (fanning-out) shape, which would indicate non-constant variance. Below is an example of how to diagnose our model from above for heteroscedasticity:

fitted<-fitted.values(mul_reg)
plot(x=fitted, y=residual, pch=16, xlab="Fitted Values", ylab="Residuals", main="Fitted Values v. Residuals")

That looks pretty bad: even though it is not a clean “trumpet” shape, the residuals do not form a random cloud either. It seems like we are also violating this assumption.
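
If you prefer a formal test over eyeballing the plot, the Breusch-Pagan test from the lmtest package (which we already loaded for the autocorrelation tests) tests the null hypothesis of constant error variance; a small p-value points toward heteroscedasticity.

# Breusch-Pagan test (null: homoscedasticity, i.e., constant error variance)
bptest(mul_reg)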

Multicollinearity

Another assumption of multiple regression is no multicollinearity. Multicollinearity happens when two or more of the independent variables are closely related. Perfect multicollinearity happens when one of the independent variables is an exact linear function of another independent variable.

The most common causes of multicollinearity are a small sample and high correlation between independent variables. The first indication that you have multicollinearity is a high \(R^2\) value but few statistically significant coefficients. If this occurs, you should check for multicollinearity by calculating the variance inflation factor (VIF). This check essentially estimates a set of models in which each X variable takes a turn as the dependent variable, with the other X variables as the independent variables. The VIF takes the \(R^2\) from each of these models and calculates:

\[ VIF_j = \frac{1}{(1-R^2_j)} \] where \(VIF_j\) is the VIF for the model run with \(x_j\) as the dependent variable and the other x variables as the independent variables, and \(R^2_j\) is the \(R^2\) from that same model. This gives you a VIF score for each x variable in your model. R can estimate the VIF for you with one line of code from the car library:

vif(mul_reg)
##   threat      age 
## 1.010724 1.010724
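
To see where those numbers come from, you can compute a VIF by hand: regress one predictor on the other(s), take the \(R^2\), and plug it into the formula above. The result should be close to what vif() reports (small differences can come from which rows are dropped for missing data).

# VIF for 'threat' by hand: regress threat on the other predictor
aux <- lm(threat ~ age, data = US_Data)
r2 <- summary(aux)$r.squared
1 / (1 - r2)   # should be close to the vif() value for threat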

If any x gives a VIF>5, then you have high enough multicollinearity to be concerning. If this does occur, the best fix is to make sure you do not have two measures of the same concept. If that is not the case, could you use a different measure of something? Are all your variables necessary?

Concluding words

Congratulations on finishing the Correlation and Regression module. I acknowledge that this module feels like a jump from what we have done so far, and it is. The shift from descriptive statistics to data analysis is usually difficult for learners. If all of this has yet to make sense to you, I suggest reviewing the previous modules, taking a break, and then coming back. Once it clicks, you will find regression a powerful tool for your research designs, and now you know how to use R to run regressions!