All the data that we use (and will be used in this course) are available from here. I have also placed a copy of this data in our repository.
Getting data into R
There are a lot of ways of getting data into R and this can add to a lot of confusion for R newbies trying to get started in R. We have already shown that there are ways to manually enter data in the previous lesson. R does have its own data format called .Rdata
.
.RData
- R’s internal data format
You can read and write your data to an .RData
format in a couple of ways. To illustrate, we will use the iris dataset.
data(iris) # load the internal data set
head(iris) # take a look at it
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
newiris<- iris # create a new object called newiris
# To save this as an .Rdata set we need to specify the data.frame and then the path to save it
save(newiris, file = "/Users/cchizinski2/Documents/SNR_R_Group/master/data/newiris.RData")
# First lets remove newiris from the environment
rm(newiris, iris)
ls()
## [1] "KnitPost" "theme_map"
## [3] "theme_map_presentation" "theme_mine"
## [5] "theme_presentation" "theme_transparent"
# To load an .Rdata file you
load(file = "/Users/cchizinski2/Documents/SNR_R_Group/master/data/newiris.RData")
head(newiris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Other formats
Whether you are trying to scrape a webpage, load from SPSS or SAS, or csv there is a package trying to help you get it into R. Hadley has been behind a cohesive effort and philosophy of data and R programming called the tidyverse
. Within these collection of packages are the abilities to load most kinds of data. NOTE: these packages will load data in the form of a tibble
To install these packages, you will first need to install the package devtools
and then install from Hadley’s github repository.
Please note that I am only going to cover the files that you are most likely going to encounter. There are a ton of different files out there and if you need alternate file types, check out the foreign
package, rio
, or Hadley’s page.
install.packages("devtools")
library(devtools)
install_github("hadley/tidyverse", force = TRUE)
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
CSV and TSV
One of the most basic types of files (and those that I use most frequently) are flat files like csv (comma seperated values or text files (space or tab seperated files). The best (in my unqualified opinion) is the readr
package.
To look at the requirements and default options pull up the help menu
#library(readr) # if you have not loaded tidyverse
?read_csv # note this is different from read.csv in base R
There is a couple of things that are nice with this package over the base:
- comment: a string to identify comments
- strip white space: removes leading and trailing white space (THE BANE OF MANY STRINGS)
To open a csv file, indicate the path to the file. Again NOTE that this will be loaded as a tibble
and not a traditional data.frame.
#Land crabs on Christmas Island, relationship to burrow density
land_crabs<-read_csv("/Users/cchizinski2/Documents/SNR_R_Group/master/data/ExperimentalDesignData/chpt5/green.csv")
## Parsed with column specification:
## cols(
## SITE = col_character(),
## QUADNUM = col_integer(),
## TOTMASS = col_double(),
## BURROWS = col_integer()
## )
head(land_crabs)
## # A tibble: 6 × 4
## SITE QUADNUM TOTMASS BURROWS
## <chr> <int> <dbl> <int>
## 1 DS 1 2.15 39
## 2 DS 2 2.27 38
## 3 DS 3 4.31 61
## 4 DS 4 2.58 79
## 5 DS 5 3.23 35
## 6 DS 6 1.83 39
# to convert to a data.frame use
land_crabs.df<-as.data.frame(land_crabs)
head(land_crabs.df)
## SITE QUADNUM TOTMASS BURROWS
## 1 DS 1 2.15 39
## 2 DS 2 2.27 38
## 3 DS 3 4.31 61
## 4 DS 4 2.58 79
## 5 DS 5 3.23 35
## 6 DS 6 1.83 39
#library(readr) # if you have not loaded tidyverse
?read_tsv # note this is different from read.csv in base R
To open a tsv file, indicate the path to the file. Again NOTE that this will be loaded as a tibble
and not a traditional data.frame.
#Land crabs on Christmas Island, relationship to burrow density
land_crabs<-read_tsv("/Users/cchizinski2/Documents/SNR_R_Group/master/data/ExperimentalDesignData/chpt5/green_txt.txt")
## Warning: Missing column names filled in: 'X5' [5]
## Parsed with column specification:
## cols(
## SITE = col_character(),
## QUADNUM = col_integer(),
## TOTMASS = col_double(),
## BURROWS = col_integer(),
## X5 = col_character()
## )
head(land_crabs)
## # A tibble: 6 × 5
## SITE QUADNUM TOTMASS BURROWS X5
## <chr> <int> <dbl> <int> <chr>
## 1 DS 1 2.15 39 <NA>
## 2 DS 2 2.27 38 <NA>
## 3 DS 3 4.31 61 <NA>
## 4 DS 4 2.58 79 <NA>
## 5 DS 5 3.23 35 <NA>
## 6 DS 6 1.83 39 <NA>
# to convert to a data.frame use
land_crabs.df<-as.data.frame(land_crabs)
head(land_crabs.df)
## SITE QUADNUM TOTMASS BURROWS X5
## 1 DS 1 2.15 39 <NA>
## 2 DS 2 2.27 38 <NA>
## 3 DS 3 4.31 61 <NA>
## 4 DS 4 2.58 79 <NA>
## 5 DS 5 3.23 35 <NA>
## 6 DS 6 1.83 39 <NA>
xls or xlsx
Unfortunately, people like to store data in excel files, despite many problems like those pointed out in this study. However, there is the readxl
package.
To look at the requirements and default options pull up the help menu
library(readxl) # if you have not loaded tidyverse
?read_excel
To open a excel file, indicate the path to the file. Again NOTE that this will be loaded as a tibble
and not a traditional data.frame.
#Land crabs on Christmas Island, relationship to burrow density
land_crabs<-read_excel("/Users/cchizinski2/Documents/SNR_R_Group/master/data/ExperimentalDesignData/chpt5/green.xls")
head(land_crabs)
## # A tibble: 6 × 4
## SITE QUADNUM TOTMASS BURROWS
## <chr> <dbl> <dbl> <dbl>
## 1 DS 1 2.15 39
## 2 DS 2 2.27 38
## 3 DS 3 4.31 61
## 4 DS 4 2.58 79
## 5 DS 5 3.23 35
## 6 DS 6 1.83 39
# You can also specify the sheet you would like to input
land_crabs2<-read_excel("/Users/cchizinski2/Documents/SNR_R_Group/master/data/ExperimentalDesignData/chpt5/green.xls", sheet = "Sheet2")
# or
land_crabs2<-read_excel("/Users/cchizinski2/Documents/SNR_R_Group/master/data/ExperimentalDesignData/chpt5/green.xls", sheet = 2)
# and specify NAs for something different than blank cells
land_crabs2<-read_excel("/Users/cchizinski2/Documents/SNR_R_Group/master/data/ExperimentalDesignData/chpt5/green.xls", sheet = "Sheet2", na = "NA")
land_crabs2
## # A tibble: 18 × 4
## SITE QUADNUM TOTMASS BURROWS
## <chr> <dbl> <dbl> <dbl>
## 1 DS 1 2.15 39
## 2 DS 2 2.27 38
## 3 DS 3 4.31 61
## 4 DS 4 2.58 79
## 5 DS 5 3.23 35
## 6 DS 6 1.83 39
## 7 DS 7 1.54 NA
## 8 DS 8 2.00 28
## 9 LS 1 4.36 38
## 10 LS 2 4.01 37
## 11 LS 3 3.33 NA
## 12 LS 4 2.63 18
## 13 LS 5 4.46 41
## 14 LS 6 3.96 33
## 15 LS 7 4.18 40
## 16 LS 8 4.21 29
## 17 LS 9 2.54 25
## 18 LS 10 4.29 38
SAS, SPSS, or Stata
There is always the chance that you get handed data from one of the ‘other’ stat programs and need to load it into R. In addition, it can write data in these formats as well. Luckily there is the haven
package.
To look at the requirements and default options pull up the help menu
library(haven) # if you have not loaded tidyverse
?read_sas #SAS
?read_sav #SPSS
?read_dta #Stata
SAS files
To open a SAS file (SAS7BDAT + SAS7BCAT formats), indicate the path to the file. Again NOTE that this will be loaded as a tibble
and not a traditional data.frame.
# Iris data set
iris_sas<-read_sas("/Users/cchizinski2/Documents/SNR_R_Group/master/data/iris.sas7bdat")
head(iris_sas)
## # A tibble: 6 × 5
## Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
SPSS files
To open a SPSS file (.sav
), indicate the path to the file. Again NOTE that this will be loaded as a tibble
and not a traditional data.frame.
# Iris data set
iris_spss<-read_sav("/Users/cchizinski2/Documents/SNR_R_Group/master/data/iris.sav")
head(iris_spss)
## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <S3: labelled>
## 1 5.1 3.5 1.4 0.2 1
## 2 4.9 3.0 1.4 0.2 1
## 3 4.7 3.2 1.3 0.2 1
## 4 4.6 3.1 1.5 0.2 1
## 5 5.0 3.6 1.4 0.2 1
## 6 5.4 3.9 1.7 0.4 1
Stata files
To open a Stata file (Stata 13 and 14), indicate the path to the file. Again NOTE that this will be loaded as a tibble
and not a traditional data.frame.
# Iris data set
iris_stata<-read_stata("/Users/cchizinski2/Documents/SNR_R_Group/master/data/iris.dta")
head(iris_stata)
## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <S3: labelled>
## 1 5.1 3.5 1.4 0.2 1
## 2 4.9 3.0 1.4 0.2 1
## 3 4.7 3.2 1.3 0.2 1
## 4 4.6 3.1 1.5 0.2 1
## 5 5.0 3.6 1.4 0.2 1
## 6 5.4 3.9 1.7 0.4 1
Reading data from a github repository
Text files (csv, tab)
To load text files from a git repository you will need the RCurl
package. Note this is not part of the tidyverse
)
On the github repository that you would like to download data from , find the button marked “Raw” and click on it. This is the raw text file and you will need to copy the URL to past this following the code below.
library(RCurl)
## Loading required package: bitops
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
library(readr)
land_crabs<-read_csv(getURL("https://raw.githubusercontent.com/chrischizinski/SNR_R_Group/master/data/ExperimentalDesignData/chpt5/green.csv"))
head(land_crabs)
## # A tibble: 6 × 4
## SITE QUADNUM TOTMASS BURROWS
## <chr> <int> <dbl> <int>
## 1 DS 1 2.15 39
## 2 DS 2 2.27 38
## 3 DS 3 4.31 61
## 4 DS 4 2.58 79
## 5 DS 5 3.23 35
## 6 DS 6 1.83 39
RData files
To load .RData
files from a git repository, you will need the repmis package
. Note that repmis
is not part of the tidyverse
and contains some other miscellaneous functions that could be helpful.
On the github repository that you would like to download data from, find the button marked “Raw” and right-click on it, and copy link. If you click it, it will download the file.
library(repmis)
source_data("https://github.com/chrischizinski/SNR_R_Group/blob/master/data/iris_from_git.RData?raw=true")
## Downloading data from: https://github.com/chrischizinski/SNR_R_Group/blob/master/data/iris_from_git.RData?raw=true
## SHA-1 hash of the downloaded data file is:
## fe14b424cf7065a4574f5657153db5d25c6e2057
## [1] "iris_from_git"
head(iris_from_git)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa