Welcome back to Rawlex. Congratulations on finishing your introduction to R. Now you have R and RStudio at your disposal. We will continue our journey of learning R by understanding how you can load data into R and clean them before we conduct analyses on them.

Goals for this module

By the end of this module, you will

  1. Be familiar with the most common types of data files.
  2. Understand how to load different types of data into R.
  3. Be able to check whether your data loaded into R correctly and troubleshoot if they did not.

Data types

As you probably know, data come in many file types. Here are some of the most common ones that you will encounter.

  1. .csv

CSV stands for “Comma-Separated Values.” It is a simple and widely used file format for storing tabular data, such as data exported from spreadsheets or databases. CSV files are easy to read, write, and manipulate. R loves .csv files.

  2. .txt

TXT is short for “text” and refers to a plain text file format. Unlike other file formats that may contain special formatting, images, or other multimedia elements, TXT files only contain unformatted text.

  3. .sav (and .spss)

SAV (and/or SPSS) is a file extension used to indicate data files created by IBM SPSS Statistics software. The .sav file format is specific to SPSS, a widely used statistical software package for data analysis, data management, and data visualization. If you are a Psychology major like myself, you will see this a lot.

  4. .dta

DTA is a file extension used to indicate data files created by Stata, a popular statistical software package used for data analysis and statistical modeling. Many fields use Stata, such as Economics, Education, and Public Health.

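  5. .xls and .xlsx

XLS and XLSX are both file extensions used to indicate Microsoft Excel spreadsheet files, but they represent different file formats used by different versions of Excel: .xls is the older binary format (Excel 97 to 2003), while .xlsx is the XML-based format that has been the default since Excel 2007.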

  6. .RData

The .RData file extension is used to indicate data files created and saved in R. In R, .RData files are used to store objects such as data frames, variables, and other R-specific data structures. This is less common than the other data types, but with the increasing prevalence of R, .RData is becoming more common, too.

Reflection Question: What are the most frequent data types in your research? What data type does your current research project use?

Loading Data into R

Great. Now, you have familiarized yourself with the different data types. The following steps detail how you can load your data into R.

  1. Identify your file format (now you understand why we need to know what our data type is)
  2. Set your working directory
  3. Install/Load package needed to load data (if applicable)
  4. Load the data into R, creating an object
  5. Check to make sure the data were loaded correctly

You have probably finished the first step, identifying your file format, if you answered the reflection question at the end of the last section. Now, we set our working directory.

In R, the “working directory” refers to the directory or folder location on your computer where R will look for files and save files. You should set your working directory to the folder that you designate for your project to better organize your materials.

To get the current working directory, we can use this code:

getwd()
## [1] "C:/Users/linht/Downloads/RBlog"

You will see your default working directory. In this document, you see the output for my working directory. Yours may be different, and that’s ok. Should you want to change your working directory, you would use the following code.

setwd("path_to_your_working_directory")

For example, if your working directory is /Users/rawlex/Documents/R, then your code will be setwd("/Users/rawlex/Documents/R"). You can find your file path in the file folder (in Finder for Mac and My Computer for PC). Sometimes the path symbols disagree: Windows shows paths with \, but inside an R string a single \ starts an escape sequence, so write / (or a doubled \\) instead. It is recommended that you follow the symbols in the output when you run the command getwd().
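Here is a minimal sketch of checking and changing the working directory in one go. The folder name my_project is made up for illustration, and it lives inside tempdir() only so that the code runs on any computer; in practice you would use your own project folder.

```r
# Remember the current working directory so it can be restored afterwards.
old_wd <- getwd()

# Hypothetical project folder; tempdir() is used so the sketch runs anywhere.
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

# Point R at the project folder and confirm the change took effect.
setwd(project_dir)
new_wd <- getwd()
print(new_wd)

# List the files R can now see without a full path (none in a new folder).
visible_files <- list.files()
print(visible_files)

# Go back to where we started.
setwd(old_wd)
```

Running list.files() right after setwd() is a quick way to confirm that the file you plan to load is really in that folder.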

Amazing! Now you have finished the first two steps. On to steps 3 and 4: installing the required packages and loading your data. Here, I will demonstrate how to load the common data types. You should also save your data to an object as you load it by using yourdataname <-

Try to use descriptive names so that you remember which object contains which data. This will be demonstrated soon.

If your file is in .csv, you would use the following code:

Dataframe_name <- read.csv("filename.csv")

For example, if your file name is Study1Stereotypes, your code will be Study1 <- read.csv("Study1Stereotypes.csv")
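If you do not have a .csv file at hand, the sketch below creates a tiny one and reads it back, so you can see read.csv() working from start to finish. The file name example_study.csv, the object names, and all values are invented for illustration.

```r
# Build a tiny example data frame (made-up values, for illustration only).
example_data <- data.frame(
  id    = 1:3,
  group = c("control", "treatment", "control"),
  score = c(4.5, 3.2, 5.0)
)

# Write it to a temporary .csv file so there is something to read back.
csv_path <- file.path(tempdir(), "example_study.csv")
write.csv(example_data, csv_path, row.names = FALSE)

# Load the file into R and store it in a descriptively named object.
ExampleStudy <- read.csv(csv_path)

print(ExampleStudy)
```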

If your file is in .txt, you would use the following code:

Dataframename <- read.table("filename.txt", sep=" ")
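Text files do not always separate their columns with spaces, and many have a header row. The sketch below assumes a tab-separated file with column names on the first line; the file name and values are made up, and your own file may need a different sep.

```r
# Write a small tab-separated text file with a header row (made-up values).
txt_path <- file.path(tempdir(), "example_study.txt")
writeLines(
  c("id\tgroup\tscore",
    "1\tcontrol\t4.5",
    "2\ttreatment\t3.2"),
  txt_path
)

# header = TRUE uses the first line as column names;
# sep = "\t" splits columns on tabs instead of spaces.
ExampleText <- read.table(txt_path, header = TRUE, sep = "\t")

print(ExampleText)
```

If the columns come out merged into one, the separator is the first thing to check.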

If your file is in .sav or .spss, you would need the R package haven (remember R packages from the last module?). Here is the code you would use:

install.packages("haven")  # only needed once
library(haven)
Data_frame_name <- read_sav("filename.sav")

For Stata files (.dta), it is a little more complicated. If your file was saved in Stata 13 or later, you would need to use the R package readstata13. Here is the code you would use:

install.packages("readstata13")  # only needed once
library(readstata13)
Data_frame_name <- read.dta13("filename.dta")

If your file was saved in a version earlier than Stata 13, you would need to use the R package foreign. Here is the code you would use:

install.packages("foreign")  # only needed once
library(foreign)
Data_frame_name <- read.dta("filename.dta")
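The two remaining file types from the list, .RData and .xls/.xlsx, can be sketched as follows. This is one possible approach, not the only one: .RData files are handled by base R, while for Excel files I am assuming the readxl package, which is not mentioned above. The object names and values are made up, and the Excel part only runs if readxl is installed.

```r
# --- .RData: save an object, remove it, then load it back (base R only) ---
survey_scores <- data.frame(id = 1:2, score = c(10, 12))  # made-up values
rdata_path <- file.path(tempdir(), "example.RData")
save(survey_scores, file = rdata_path)
rm(survey_scores)

# load() restores objects under their original names and
# returns the names of the objects it restored.
restored_names <- load(rdata_path)
print(restored_names)

# --- .xls / .xlsx: read with the readxl package, if it is installed ---
# install.packages("readxl")  # only needed once
if (requireNamespace("readxl", quietly = TRUE)) {
  # readxl_example() returns the path to a sample workbook shipped with readxl.
  excel_path <- readxl::readxl_example("datasets.xlsx")
  ExcelData <- readxl::read_excel(excel_path)  # reads the first sheet by default
  print(head(ExcelData))
}
```

Note that load() does not need the <- assignment to bring the data in: the objects reappear under the names they were saved with.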

Final checks

Congratulations, you are almost there. Now comes the most important step of them all: checking that your data loaded correctly. Do NOT skip this step, because if your data did not load correctly, all your subsequent analyses may also be incorrect.

Here are some ways to check:

  1. Open the dataset itself and check:

View(Data_frame_name)

  2. Check that you have all the variables in the dataset (no more, no less):

names(Data_frame_name)

  3. Check the first few rows of your dataset:

head(Data_frame_name)
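Beyond the three checks above, a few more base R functions can help, sketched here on a small stand-in data frame with made-up values. With your own data, you would skip the first step and run the checks on the object you loaded.

```r
# Stand-in for a freshly loaded dataset (made-up values, one missing score).
Data_frame_name <- data.frame(
  id    = 1:4,
  group = c("a", "b", "a", "b"),
  score = c(2.5, NA, 3.1, 4.0)
)

# Number of rows and columns: compare with what you expect from the source file.
data_dims <- dim(Data_frame_name)
print(data_dims)

# Type of each column: numbers read as text are a common sign of a loading problem.
str(Data_frame_name)

# Count of missing values per column.
missing_per_column <- colSums(is.na(Data_frame_name))
print(missing_per_column)
```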

Stop and try: Great job reading and absorbing thus far. I would like you to stop and try (1) creating a working directory and (2) loading a file into R right now. It could be any file you want. If it is the data file you are intending to work with, all the better. I just want you to practice what you have learned thus far.

Alright, what happens if your R gives you errors? Although I am not physically there with you to read your output and decipher where the error lies, I can give you a checklist to troubleshoot common errors.
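As a rough sketch of what some of these checks can look like in code: an error such as "cannot open file ... No such file or directory" usually means the file name or the working directory is off, and "could not find function" usually means the package was not loaded with library(). The helper check_file() below is hypothetical, written only for this illustration, and it covers just the first kind of problem.

```r
# Hypothetical helper: reports likely reasons a file cannot be read.
check_file <- function(file_name) {
  problems <- character(0)

  if (!file.exists(file_name)) {
    problems <- c(problems, paste0(
      "File not found in ", getwd(),
      ". Check the spelling, the extension, and the working directory."
    ))
  } else if (file.size(file_name) == 0) {
    problems <- c(problems, "File exists but is empty.")
  }

  problems
}

# A file name that does not exist, to show the message.
print(check_file("no_such_file_for_this_demo.csv"))

# Listing the .csv files R can actually see often reveals a typo in the name.
print(list.files(pattern = "\\.csv$"))
```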

I hope these questions are enough for your data to be loaded in successfully in R.

A full example

Alright, now, let’s practice loading data into R with a real dataset. You can access and download the dataset racecoding_bothratersL1.csv in the project “Race and Gender Representation in Social-Psychology Department Photographs” here

Let’s follow our five steps:

  1. Identify your file format

The file is a .csv file.

  2. Set your working directory

This is my own working directory, but you should set it to your own working directory when you load your dataset.

setwd("C:/Users/linht/Downloads/RBlog")

  3. Install/Load package needed to load data (if applicable)

To load a .csv file, we do not need any extra packages.

  4. Load the data into R, creating an object

RaceCodes <- read.csv("racecoding_bothratersL1.csv")

  5. Check to make sure the data were loaded correctly

View(RaceCodes)
names(RaceCodes)
##  [1] "schoolname"     "schoolid"       "schooln"        "lnschooln"     
##  [5] "lnschoolg10"    "CodeuidApplied" "faceid"         "faceida"       
##  [9] "hisp1"          "female1"        "male1"          "ainaan1"       
## [13] "easian1"        "indian1"        "black1"         "pi1"           
## [17] "white1"         "nhwhite"        "hwhite"         "racea"         
## [21] "raceb"          "gender"         "hisp2"          "female2"       
## [25] "male2"          "indian2"        "black2"         "easian2"       
## [29] "pi2"            "white2"         "racetotal"      "sextotal"
head(RaceCodes)
##            schoolname schoolid schooln lnschooln lnschoolg10 CodeuidApplied
## 1 American University        1       8  2.079442     0.90309           True
## 2 American University        1       8  2.079442     0.90309           True
## 3 American University        1       8  2.079442     0.90309           True
## 4 American University        1       8  2.079442     0.90309           True
## 5 American University        1       8  2.079442     0.90309           True
## 6 American University        1       8  2.079442     0.90309           True
##   faceid faceida hisp1 female1 male1 ainaan1 easian1 indian1 black1 pi1 white1
## 1   1001       1     1       1     0       0       0       0      0   0      1
## 2   1002       2     0       1     0       0       0       0      0   0      1
## 3   1003       3     0       0     1       0       0       0      1   0      0
## 4   1004       4     0       0     1       0       0       0      1   0      0
## 5   1005       5     0       0     1       0       1       0      0   0      0
## 6   1006       6     0       1     0       0       0       0      1   0      0
##   nhwhite hwhite racea raceb gender hisp2 female2 male2 indian2 black2 easian2
## 1       0      1     1     3      0     0       1     0       0      0       0
## 2       1      0     1     2      0     0       1     0       0      0       0
## 3       0      0     4     6      1     0       1     0       0      1       0
## 4       0      0     4     6      1     0       0     1       0      1       0
## 5       0      0     2     4      1     0       0     1       0      0       1
## 6       0      0     4     6      0     0       1     0       0      1       0
##   pi2 white2 racetotal sextotal
## 1   0      1         1        1
## 2   0      1         1        1
## 3   0      0         1        1
## 4   0      0         1        1
## 5   0      0         1        1
## 6   0      0         1        1

Perfect.
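If you know which variables a dataset should contain, the "no more, no less" check can also be done in code rather than by eye. The sketch below uses a hypothetical helper, compare_columns(), and a small stand-in data frame that borrows three column names from the output above (the values are made up); with the real file you would pass RaceCodes instead.

```r
# Hypothetical helper: compares a data frame's columns with an expected set.
compare_columns <- function(loaded, expected) {
  list(
    missing    = setdiff(expected, names(loaded)),  # expected but not loaded
    unexpected = setdiff(names(loaded), expected)   # loaded but not expected
  )
}

# Stand-in with three of the column names shown in the output above
# (values are made up); replace with RaceCodes after loading the real file.
stand_in <- data.frame(schoolname = "Example", schoolid = 1, faceid = 1001)

result <- compare_columns(stand_in, c("schoolname", "schoolid", "gender"))
print(result)
```

Two empty results mean the loaded columns match your expectations exactly.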

Practice makes perfect

Download the two datasets from the ICPSR. Load them both into R.

Concluding words

Congratulations on finishing this module. In this session, you have learned about the different data types, how to load your data into R, and how to troubleshoot possible errors. In the real world, the data you get are often messy, overwhelming in volume, and sometimes missing values. That is why, before we conduct our analyses, we need to clean them. That is exactly what we will learn in the next module. Stay tuned.