Welcome back to Rawlex. Congratulations on finishing your introduction to R. Now you have R and RStudio at your disposal. We will continue our journey of learning R by understanding how to load data into R so that we can clean them before we conduct analyses on them.
By the end of this module, you will:
1, Be familiar with the typical types of data.
2, Understand how to load different types of data into R.
3, Be able to check whether your data loaded into R correctly and troubleshoot if they did not.
As you probably know, data come in many types. Here are some of the most common data types that you will encounter.
CSV stands for “Comma-Separated Values.” It is a simple and widely used file format for storing tabular data, such as spreadsheets or databases. CSV files are easy to read, write, and manipulate. R loves .csv.
TXT is short for “text” and refers to a plain text file format. Unlike other file formats that may contain special formatting, images, or other multimedia elements, TXT files only contain unformatted text.
SAV (and/or SPSS) is a file extension used to indicate data files created by IBM SPSS Statistics software. The .sav file format is specific to SPSS, a widely used statistical software package for data analysis, data management, and data visualization. If you are a Psychology major like myself, you will see this a lot.
DTA is a file extension used to indicate data files created by Stata, a popular statistical software package used for data analysis and statistical modeling. Many fields use Stata, such as Economics, Education, and Public Health.
XLS and XLSX are both file extensions used to indicate Microsoft Excel spreadsheet files, but they represent different file formats used by different versions of Excel.
The .RData file extension is used to indicate data files created and saved in R. In R, .RData files are used to store objects, data frames, variables, and other R-specific data structures. This is less common than other data types, but with the increasing prevalence of R, .RData is becoming more common, too.
Reflection Question: What are the most frequent data types in your research? What data type does your current research project use?
Great. Now, you have familiarized yourself with the different data types. The following steps detail how you can load your data into R.
You have probably already finished the first step, identifying your file format, if you answered the reflection question at the end of the last section. Now, we set our working directory.
In R, the “working directory” refers to the directory or folder location on your computer where R will look for files and save files. You should set your working directory to the folder that you designate for your project to better organize your materials.
To get the current working directory, use this code:
getwd()
## [1] "C:/Users/linht/Downloads/RBlog"
You will see your default working directory. In this document, you will see my output for my working directory. Your working directory may be different, and that’s ok. Should you want to change your working directory, you would use the following code:
setwd("path_to_your_working_directory")
For example, if your working directory is /Users/rawlex/Documents/R, then your code will be setwd("/Users/rawlex/Documents/R"). You can find your file path in the file folder (in Finder for Mac and My Computer for PC). Sometimes the separator symbols disagree: Windows displays paths with \, but inside R code a single \ is read as an escape character, so write / (or a doubled \\) instead. It is recommended that you follow the symbols in the output when you run the command getwd().
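As a minimal sketch, here is one way to switch the working directory and then switch back. The folder used is R's temporary directory, chosen only so that the example runs on any computer; project_dir is a placeholder name, and you would replace its value with your own project folder.

```r
# Remember where R is currently looking for files
old_wd <- getwd()

# Stand-in project folder (assumption: replace with your own path).
# Forward slashes work on Windows, Mac, and Linux alike.
project_dir <- tempdir()

# Only switch if the folder really exists
if (dir.exists(project_dir)) {
  setwd(project_dir)
}

# Confirm the change, then list the files R can now see
new_wd <- getwd()
print(new_wd)
print(list.files())

# Return to the original folder
setwd(old_wd)
```

Checking dir.exists() first avoids the error that setwd() raises when the path is mistyped.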
Amazing! Now you have finished the first two steps. On to steps 3 and 4: installing the required packages and loading in your data. Here, I will demonstrate how to load in the common data types. You should also save your data as you load it in by assigning it to a name with yourdataname <-
Try using descriptive names so that you remember which data file contains what. This will be demonstrated soon.
If your file is in .csv, you would use the following code:
Dataframe_name <- read.csv("filename.csv")
For example, if your file name is Study1Stereotypes, your code will be Study1 <- read.csv("Study1Stereotypes.csv")
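If you do not have a .csv file at hand, the sketch below creates a small one and reads it back in. The values and the file name Study1Example.csv are made up for illustration, and the file is written to R's temporary folder so that nothing in your project is touched.

```r
# Build a tiny example table (made-up values, for illustration only)
example_data <- data.frame(
  participant = 1:3,
  condition = c("control", "treatment", "control"),
  score = c(4.5, 3.2, 5.0)
)

# Save it as a .csv file in a temporary folder
csv_path <- file.path(tempdir(), "Study1Example.csv")
write.csv(example_data, csv_path, row.names = FALSE)

# Load it back in, giving the data frame a descriptive name
Study1Example <- read.csv(csv_path)

print(Study1Example)
```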
If your file is in .txt, you would use the following code:
Dataframename <- read.table("filename.txt", sep=" ")
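One thing to watch for with .txt files is whether the first line holds the column names. The sketch below writes a small space-separated file (made-up values, hypothetical file name) and reads it twice to show the difference that header = TRUE makes.

```r
# Write a small space-separated text file to a temporary folder
txt_path <- file.path(tempdir(), "example.txt")
writeLines(c("id age group", "1 23 A", "2 31 B", "3 27 A"), txt_path)

# Without header = TRUE, the first line is treated as a row of data
no_header <- read.table(txt_path, sep = " ")
print(names(no_header))
print(nrow(no_header))

# With header = TRUE, the first line supplies the column names
with_header <- read.table(txt_path, sep = " ", header = TRUE)
print(names(with_header))
print(nrow(with_header))
```

If your file separates values with tabs rather than spaces, sep = "\t" is the usual choice.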
If your file is in .sav or .spss, you will need the R package haven (remember R packages from the last module?). Here is the code you would use:
install.packages("haven")
library(haven)
Data_frame_name <- read_sav("filename.sav")
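A package only needs to be installed once, while library() has to be run in every new R session. As a hedged sketch, the helper below (is_installed is a name invented for this example) checks whether a package is already available, so that install.packages() can be skipped when it is.

```r
# Hypothetical helper: TRUE if a package is already installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this check succeeds
print(is_installed("stats"))

# The same pattern for haven: install only when missing, then load.
# These lines are commented out so the example runs without internet access.
# if (!is_installed("haven")) install.packages("haven")
# library(haven)
```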
For Stata files (.dta), it is a little more complicated. If your file was saved in Stata 13 or later, you will need the R package readstata13. Here is the code you would use:
install.packages("readstata13")
library(readstata13)
Data_frame_name <- read.dta13("filename.dta")
If your file was saved in a version of Stata earlier than 13, you will need the R package foreign. Here is the code you would use:
install.packages("foreign")
library(foreign)
Data_frame_name <- read.dta("filename.dta")
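The .RData format described earlier needs no extra package. As a minimal sketch using only base R, the example below saves a small made-up data frame to a temporary .RData file and loads it back; the object and file names are placeholders.

```r
# A small made-up data frame
scores <- data.frame(id = 1:3, score = c(10, 12, 9))

# Save it to an .RData file in a temporary folder
rdata_path <- file.path(tempdir(), "example.RData")
save(scores, file = rdata_path)

# Remove it from the workspace so we can see load() bring it back
rm(scores)

# load() restores objects under their original names and
# returns those names as a character vector
loaded_names <- load(rdata_path)
print(loaded_names)
print(head(scores))
```

Unlike read.csv(), load() is not assigned to a new name with <-, because the objects come back under the names they were saved with.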
Congratulations, you are almost there. Now comes the most important step of them all: checking whether your data loaded in correctly. Do NOT skip this step, because if your data did not load in correctly, all your subsequent analyses may also be incorrect.
Here are some ways to check:
1, Open the dataset itself and check:
View(Data_frame_name)
2, Check that you have all the variables in the dataset (no more, no less):
names(Data_frame_name)
3, Check the first few rows of your dataset:
head(Data_frame_name)
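A few more base R checks can complement the three above. This sketch uses a small made-up data frame in place of your own data and looks at its size, its column types, and how many values are missing in each column.

```r
# Made-up data standing in for a freshly loaded dataset
Data_frame_name <- data.frame(
  id = 1:4,
  age = c(21, 35, NA, 42),
  group = c("A", "B", "A", "B")
)

# Number of rows and columns: do they match what you expect?
print(dim(Data_frame_name))

# Type of each column: were numbers read as numbers?
str(Data_frame_name)

# Count of missing values in each column
missing_per_column <- colSums(is.na(Data_frame_name))
print(missing_per_column)
```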
Stop and try: Great job reading and absorbing thus far. I would like you to stop and try (1) setting a working directory and (2) loading a file into R right now. It could be any file you want. If it is the data file you are intending to work with, all the better. I just want you to practice what you have learned thus far.
Alright, what happens if your R gives you errors? Although I am not physically there with you to read your output and decipher where the error lies, I can give you a checklist to troubleshoot common errors.
I hope these questions are enough to get your data loaded into R successfully.
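As a sketch, two checks that often help when a file will not load are confirming that the file is where R is looking and reading the error message calmly instead of letting it stop your script. The file name below is hypothetical and deliberately does not exist.

```r
# Hypothetical file name that is not in the working directory
file_name <- "a_file_that_is_not_here.csv"

# Is the file where R is looking?
found <- file.exists(file_name)
print(found)

# Which .csv files are actually in the working directory?
print(list.files(pattern = "\\.csv$"))

# Try to load the file; on failure, report the message and return NULL
result <- tryCatch(
  suppressWarnings(read.csv(file_name)),
  error = function(e) {
    message("Could not load file: ", conditionMessage(e))
    NULL
  }
)
print(is.null(result))
```

When file.exists() returns FALSE, the usual suspects are a misspelled file name, a missing extension, or a working directory that points to a different folder.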
Alright, now, let’s practice loading data into R with a real dataset. You can access and download the dataset racecoding_bothratersL1.csv in the project “Race and Gender Representation in Social-Psychology Department Photographs” here
Let’s follow our five steps:
The file is a .csv file
This is my own working directory, but you should set it to your own working directory when you load in your dataset.
setwd("C:/Users/linht/Downloads/RBlog")
To load in a .csv file, we do not need any extra packages.
RaceCodes <- read.csv("racecoding_bothratersL1.csv")
View(RaceCodes)
names(RaceCodes)
## [1] "schoolname" "schoolid" "schooln" "lnschooln"
## [5] "lnschoolg10" "CodeuidApplied" "faceid" "faceida"
## [9] "hisp1" "female1" "male1" "ainaan1"
## [13] "easian1" "indian1" "black1" "pi1"
## [17] "white1" "nhwhite" "hwhite" "racea"
## [21] "raceb" "gender" "hisp2" "female2"
## [25] "male2" "indian2" "black2" "easian2"
## [29] "pi2" "white2" "racetotal" "sextotal"
head(RaceCodes)
##            schoolname schoolid schooln lnschooln lnschoolg10 CodeuidApplied
## 1 American University        1       8  2.079442     0.90309           True
## 2 American University        1       8  2.079442     0.90309           True
## 3 American University        1       8  2.079442     0.90309           True
## 4 American University        1       8  2.079442     0.90309           True
## 5 American University        1       8  2.079442     0.90309           True
## 6 American University        1       8  2.079442     0.90309           True
##   faceid faceida hisp1 female1 male1 ainaan1 easian1 indian1 black1 pi1 white1
## 1   1001       1     1       1     0       0       0       0      0   0      1
## 2   1002       2     0       1     0       0       0       0      0   0      1
## 3   1003       3     0       0     1       0       0       0      1   0      0
## 4   1004       4     0       0     1       0       0       0      1   0      0
## 5   1005       5     0       0     1       0       1       0      0   0      0
## 6   1006       6     0       1     0       0       0       0      1   0      0
##   nhwhite hwhite racea raceb gender hisp2 female2 male2 indian2 black2 easian2
## 1       0      1     1     3      0     0       1     0       0      0       0
## 2       1      0     1     2      0     0       1     0       0      0       0
## 3       0      0     4     6      1     0       1     0       0      1       0
## 4       0      0     4     6      1     0       0     1       0      1       0
## 5       0      0     2     4      1     0       0     1       0      0       1
## 6       0      0     4     6      0     0       1     0       0      1       0
##   pi2 white2 racetotal sextotal
## 1   0      1         1        1
## 2   0      1         1        1
## 3   0      0         1        1
## 4   0      0         1        1
## 5   0      0         1        1
## 6   0      0         1        1
Perfect.
Download the two datasets from the ICPSR. Load them both into R.
Congratulations on finishing this module. In this session, you have learned about the different data types, how to load your data into R, and how to troubleshoot possible errors. In the real world, the data you get are often messy, overwhelming in size, and sometimes missing values. That is why, before we conduct our analysis, we will need to clean them. That is exactly what we will learn in the next module. Stay tuned.