All the data used in this course are available from here. I have also placed a copy of the data in our repository.
Getting data into R
There are many ways of getting data into R, and this can cause a lot of confusion for R newbies trying to get started. The previous lesson already showed how to enter data manually. R also has its own data format called .RData.
.RData - R’s internal data format
You can read and write your data to an .RData format in a couple of ways. To illustrate, we will use the iris dataset.
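As a minimal sketch of one of those ways, the base functions save() and load() write and restore named objects. The object name my_iris and the temporary file path are only illustrative choices, not anything required by R.

```r
# Copy the built-in iris data so we have an object of our own to save
my_iris <- iris

# Write the object to a temporary .RData file (path is illustrative)
rdata_path <- file.path(tempdir(), "iris.RData")
save(my_iris, file = rdata_path)

# Remove the object, then restore it from disk under its original name
rm(my_iris)
load(rdata_path)

head(my_iris)
```

Note that load() restores objects under the names they were saved with, so it can silently overwrite an object of the same name in your workspace.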
Other formats
Whether you are trying to scrape a web page or load data from SPSS, SAS, or a csv file, there is a package to help you get it into R. Hadley Wickham has been behind a cohesive effort and philosophy of data and R programming called the tidyverse. Within this collection of packages is the ability to load most kinds of data. NOTE: these packages will load data in the form of a tibble.
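Since tibbles come up repeatedly below, here is a small sketch of how a tibble relates to a traditional data.frame. It assumes the tibble package is installed, and the object names are only illustrative.

```r
library(tibble)

# Convert the built-in iris data.frame to a tibble
iris_tbl <- as_tibble(iris)
class(iris_tbl)   # a tibble is still a data.frame underneath

# Convert back if a function insists on a plain data.frame
iris_df <- as.data.frame(iris_tbl)
class(iris_df)
```

Most functions that accept a data.frame will also accept a tibble, so the conversion back is usually only needed for older code.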
To install these packages, you will first need to install the devtools package and then install from Hadley’s GitHub repository.
Please note that I am only going to cover the files that you are most likely going to encounter. There are a ton of different files out there and if you need alternate file types, check out the foreign package, rio, or Hadley’s page.
CSV and TSV
Some of the most basic types of files (and those that I use most frequently) are flat files like csv (comma separated values) or text files (space or tab separated). The best package for reading them (in my unqualified opinion) is readr.
To look at the requirements and default options, pull up the help menu.
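A quick sketch of how that might look for read_csv(), assuming readr is installed. The args() call is just a compact way to see the arguments and their defaults without opening the help page.

```r
library(readr)

# Open the help page for read_csv
?read_csv

# Print the argument list and default values
args(read_csv)
```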
There are a couple of things that are nice about this package compared with base R:
comment: a string used to identify comments, so that commented text is skipped
trim_ws (strip white space): removes leading and trailing white space from each field (THE BANE OF MANY STRINGS)
To open a csv file, indicate the path to the file. Again NOTE that this will be loaded as a tibble and not a traditional data.frame.
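Here is a minimal sketch using read_csv() with the comment and white space options. To keep it self-contained, it first writes a small made-up csv file to a temporary path; the file contents and column names are purely illustrative.

```r
library(readr)

# Create a small example file (contents are made up for illustration)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "# this line is a comment and should be skipped",
  "id,name",
  "1,  alice  ",
  "2,bob"
), csv_path)

# Read it: skip lines starting with '#', trim padding around fields,
# and declare the column types (i = integer, c = character)
people <- read_csv(csv_path, comment = "#", trim_ws = TRUE, col_types = "ic")

class(people)
people
```

For your own data, replace csv_path with the path to your file. Leaving out col_types lets readr guess the column types.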
To open a tsv file, indicate the path to the file. Again NOTE that this will be loaded as a tibble and not a traditional data.frame.
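The tab separated case looks almost identical with read_tsv(). Again, this sketch writes a small made-up file to a temporary path first, so the contents and names are only illustrative.

```r
library(readr)

# Create a small tab separated example file (contents are made up)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "id\tname",
  "1\talice",
  "2\tbob"
), tsv_path)

# Read it, declaring the column types (i = integer, c = character)
people_tsv <- read_tsv(tsv_path, col_types = "ic")

people_tsv
```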
xls or xlsx
Unfortunately, people like to store data in Excel files, despite many problems like those pointed out in this study. Fortunately, the readxl package can read them.
To look at the requirements and default options, pull up the help menu.
To open an Excel file, indicate the path to the file. Again NOTE that this will be loaded as a tibble and not a traditional data.frame.
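As a sketch, the example below reads the datasets.xlsx workbook that ships with readxl, so no file of your own is needed. With your own data, you would pass the path to your workbook instead.

```r
library(readxl)

# Help page for the main reading function
# ?read_excel

# Path to an example workbook bundled with readxl
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheets, then read the first one (the default)
excel_sheets(xlsx_path)
iris_xl <- read_excel(xlsx_path)

# Read a specific sheet by name
mtcars_xl <- read_excel(xlsx_path, sheet = "mtcars")

class(iris_xl)
```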
SAS, SPSS, or Stata
There is always the chance that you get handed data from one of the ‘other’ stat programs and need to load it into R. Luckily there is the haven package, which can also write data in these formats.
To look at the requirements and default options, pull up the help menu.
SAS files
To open a SAS file (SAS7BDAT + SAS7BCAT formats), indicate the path to the file. Again NOTE that this will be loaded as a tibble and not a traditional data.frame.
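A minimal sketch with read_sas(), using the iris.sas7bdat example file that is bundled with haven. For your own data, replace sas_path with the path to your file.

```r
library(haven)

# Help page for the SAS reader
# ?read_sas

# Path to an example SAS file bundled with haven
sas_path <- system.file("examples", "iris.sas7bdat", package = "haven")

iris_sas <- read_sas(sas_path)

class(iris_sas)
head(iris_sas)
```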
SPSS files
To open an SPSS file (.sav), indicate the path to the file. Again NOTE that this will be loaded as a tibble and not a traditional data.frame.
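A sketch of reading and writing an SPSS file with haven, using the iris.sav example file bundled with the package. The output path is a temporary file chosen only for illustration.

```r
library(haven)

# Path to an example SPSS file bundled with haven
sav_path <- system.file("examples", "iris.sav", package = "haven")

iris_sav <- read_sav(sav_path)
class(iris_sav)

# haven can also write the data back out in SPSS format
sav_out <- tempfile(fileext = ".sav")
write_sav(iris_sav, sav_out)
```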
Stata files
To open a Stata file (Stata 13 and 14), indicate the path to the file. Again NOTE that this will be loaded as a tibble and not a traditional data.frame.
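A sketch of a round trip with Stata files, using the iris.dta example file bundled with haven. The version argument of write_dta() selects the Stata file version to write; the temporary output path is only illustrative.

```r
library(haven)

# Path to an example Stata file bundled with haven
dta_path <- system.file("examples", "iris.dta", package = "haven")

iris_dta <- read_dta(dta_path)
class(iris_dta)

# Write the data back out as a Stata 14 file, then read it in again
dta_out <- tempfile(fileext = ".dta")
write_dta(iris_dta, dta_out, version = 14)
iris_dta2 <- read_dta(dta_out)
```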
Reading data from a github repository
Text files (csv, tab)
To load text files from a git repository, you will need the RCurl package. (Note that this is not part of the tidyverse.)
On the GitHub repository that you would like to download data from, find the button marked “Raw” and click on it. This opens the raw text file; copy its URL and paste it into your code.
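One way this might look is sketched below: RCurl's getURL() fetches the raw text, and base read.csv() parses it. The helper name read_github_csv and the URL are hypothetical placeholders, so substitute the raw URL you copied. This assumes RCurl is installed.

```r
# Placeholder only: paste the raw URL you copied from GitHub here
raw_url <- "https://raw.githubusercontent.com/<user>/<repo>/<branch>/<file>.csv"

# Hypothetical helper: download the raw text, then parse it as csv
read_github_csv <- function(url) {
  txt <- RCurl::getURL(url)
  read.csv(text = txt, stringsAsFactors = FALSE)
}

# Uncomment once raw_url points at a real file
# dat <- read_github_csv(raw_url)
```

For a tab separated file, read.delim(text = txt) could be swapped in for read.csv().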
RData files
To load .RData files from a git repository, you will need the repmis package. Note that repmis is not part of the tidyverse and contains some other miscellaneous functions that could be helpful.
On the GitHub repository that you would like to download data from, find the button marked “Raw”, right-click on it, and copy the link. If you click it instead, your browser will download the file.
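A rough sketch of how the copied link might be used with repmis, via its source_data() function. The helper name load_github_rdata and the URL are hypothetical placeholders, and I am assuming source_data() recognises the file type from the .RData extension, so check ?source_data before relying on this.

```r
# Placeholder only: paste the link you copied from the "Raw" button here
rdata_url <- "https://github.com/<user>/<repo>/raw/<branch>/<file>.RData"

# Hypothetical helper around repmis::source_data()
# (assumes the repmis package is installed)
load_github_rdata <- function(url) {
  repmis::source_data(url)
}

# Uncomment once rdata_url points at a real file
# load_github_rdata(rdata_url)
```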