Very often when we are working with datasets, particularly databases, we often want to combine multiple datasets. Traditionally, in R, this was done with the function merge(). Since the development of the tidyverse there has been improvements on the types and speed of joins. The joins available in the dplyr() package follow those in SQL type joins.

We will explore these types of joins using datasets from the our github repository

library(RCurl)
library(tidyverse)

publishers <- read_csv(getURL("https://raw.githubusercontent.com/chrischizinski/OFWIM_2016/master/data/publisher.csv"))

publishers
## # A tibble: 3 × 2
## publisher yr_founded
## <chr> <int>
## 1 DC 1934
## 2 Marvel 1939
## 3 Image 1992
superheroes <- read_csv(getURL("https://raw.githubusercontent.com/chrischizinski/OFWIM_2016/master/data/superheroes.csv"))

superheroes
## # A tibble: 7 × 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Magneto bad male Marvel
## 2 Storm good female Marvel
## 3 Mystique bad female Marvel
## 4 Batman good male DC
## 5 Joker bad male DC
## 6 Catwoman bad female DC
## 7 Hellboy good male Dark Horse Comics

Inner join

Mutating join

inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

An inner join of superheroes with publisher would return a dataset of superheros that match those in publisher

super_ij <- inner_join(superheroes, publishers, by = "publisher")
super_ij
## # A tibble: 6 × 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934

Semi join

Filtering join

semi_join(x, y): Return all rows from x where there are matching values in y, keeping just columns from x.

A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

super_sj <- semi_join(superheroes, publishers, by = "publisher")
super_sj
## # A tibble: 6 × 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Batman good male DC
## 2 Joker bad male DC
## 3 Catwoman bad female DC
## 4 Magneto bad male Marvel
## 5 Storm good female Marvel
## 6 Mystique bad female Marvel

Notice that we loose ‘Hellboy’ because publisher was ‘Dark Horse Comics’ that did not exist in publisher. This differs from the inner join because it only retains the columns in superheroes, which is why it is a filtering join.

Left join

Mutating join

left_join(x, y): Return all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

The left join keeps all the data from the first specified dataset (x) and the columns from the second dataset (y). Where there are no matches an NA will be returned.

super_lj <- left_join(superheroes, publishers, by = "publisher")
super_lj
## # A tibble: 7 × 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
## 7 Hellboy good male Dark Horse Comics NA

Notice that we loose ‘Hellboy’ because publisher was ‘Dark Horse Comics’ that did not exist in publisher. This differs from the inner join because it only retains the columns in superheroes, which is why it is a filtering join.

Right join

Mutating join

right_join(x, y): Returns all observations in y. It’s equivalent to left_join(y, x), but the columns will be ordered differently.

The right join keeps all the data from the first specified dataset (x) and the columns from the second dataset (y). Where there are no matches an NA will be returned.

super_rj <- right_join(superheroes, publishers, by = "publisher")
super_rj
## # A tibble: 7 × 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Batman good male DC 1934
## 2 Joker bad male DC 1934
## 3 Catwoman bad female DC 1934
## 4 Magneto bad male Marvel 1939
## 5 Storm good female Marvel 1939
## 6 Mystique bad female Marvel 1939
## 7 <NA> <NA> <NA> Image 1992

Anti join

Filtering join

anti_join(x, y): Returns all rows in x that do not have a match in y.

Returns only the rows that do not have matches.

super_aj <- anti_join(superheroes, publishers, by = "publisher")
super_aj
## # A tibble: 1 × 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Hellboy good male Dark Horse Comics

Full join

Mutating join

full_join(x, y): Returns all observations in x and y.

The full join keeps all the data from the first specified dataset (x) and the second dataset (y). Where there are no matches an NA will be returned.

super_fj <- full_join(superheroes, publishers, by = "publisher")
super_fj
## # A tibble: 8 × 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
## 7 Hellboy good male Dark Horse Comics NA
## 8 <NA> <NA> <NA> Image 1992