Joins

Very often when we are working with datasets, particularly databases, we often want to combine multiple datasets. Traditionally, in R, this was done with the function merge(). Since the development of the tidyverse there has been improvements on the types and speed of joins. The joins available in the dplyr() package follow those in SQL type joins.

We will explore these types of joins using datasets from the our github repository

library(RCurl)
library(tidyverse)

publishers <- read_csv(getURL("https://raw.githubusercontent.com/chrischizinski/OFWIM_2016/master/data/publisher.csv"))

publishers

## # A tibble: 3 × 2
##   publisher yr_founded
##       <chr>      <int>
## 1        DC       1934
## 2    Marvel       1939
## 3     Image       1992

superheroes <- read_csv(getURL("https://raw.githubusercontent.com/chrischizinski/OFWIM_2016/master/data/superheroes.csv"))

superheroes

## # A tibble: 7 × 4
##       name alignment gender         publisher
##      <chr>     <chr>  <chr>             <chr>
## 1  Magneto       bad   male            Marvel
## 2    Storm      good female            Marvel
## 3 Mystique       bad female            Marvel
## 4   Batman      good   male                DC
## 5    Joker       bad   male                DC
## 6 Catwoman       bad female                DC
## 7  Hellboy      good   male Dark Horse Comics

Inner join

Mutating join

inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

An inner join of superheroes with publisher would return a dataset of superheros that match those in publisher

super_ij <- inner_join(superheroes, publishers, by = "publisher")
super_ij

## # A tibble: 6 × 5
##       name alignment gender publisher yr_founded
##      <chr>     <chr>  <chr>     <chr>      <int>
## 1  Magneto       bad   male    Marvel       1939
## 2    Storm      good female    Marvel       1939
## 3 Mystique       bad female    Marvel       1939
## 4   Batman      good   male        DC       1934
## 5    Joker       bad   male        DC       1934
## 6 Catwoman       bad female        DC       1934

Semi join

Filtering join

semi_join(x, y): Return all rows from x where there are matching values in y, keeping just columns from x.

A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

super_sj <- semi_join(superheroes, publishers, by = "publisher")
super_sj

## # A tibble: 6 × 4
##       name alignment gender publisher
##      <chr>     <chr>  <chr>     <chr>
## 1   Batman      good   male        DC
## 2    Joker       bad   male        DC
## 3 Catwoman       bad female        DC
## 4  Magneto       bad   male    Marvel
## 5    Storm      good female    Marvel
## 6 Mystique       bad female    Marvel

Notice that we loose ‘Hellboy’ because publisher was ‘Dark Horse Comics’ that did not exist in publisher. This differs from the inner join because it only retains the columns in superheroes, which is why it is a filtering join.

Left join

Mutating join

left_join(x, y): Return all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

The left join keeps all the data from the first specified dataset (x) and the columns from the second dataset (y). Where there are no matches an NA will be returned.

super_lj <- left_join(superheroes, publishers, by = "publisher")
super_lj

## # A tibble: 7 × 5
##       name alignment gender         publisher yr_founded
##      <chr>     <chr>  <chr>             <chr>      <int>
## 1  Magneto       bad   male            Marvel       1939
## 2    Storm      good female            Marvel       1939
## 3 Mystique       bad female            Marvel       1939
## 4   Batman      good   male                DC       1934
## 5    Joker       bad   male                DC       1934
## 6 Catwoman       bad female                DC       1934
## 7  Hellboy      good   male Dark Horse Comics         NA

Right join

Mutating join

right_join(x, y): Returns all observations in y. It’s equivalent to left_join(y, x), but the columns will be ordered differently.

The right join keeps all the data from the first specified dataset (x) and the columns from the second dataset (y). Where there are no matches an NA will be returned.

super_rj <- right_join(superheroes, publishers, by = "publisher")
super_rj

## # A tibble: 7 × 5
##       name alignment gender publisher yr_founded
##      <chr>     <chr>  <chr>     <chr>      <int>
## 1   Batman      good   male        DC       1934
## 2    Joker       bad   male        DC       1934
## 3 Catwoman       bad female        DC       1934
## 4  Magneto       bad   male    Marvel       1939
## 5    Storm      good female    Marvel       1939
## 6 Mystique       bad female    Marvel       1939
## 7     <NA>      <NA>   <NA>     Image       1992

Anti join

Filtering join

anti_join(x, y): Returns all rows in x that do not have a match in y.

Returns only the rows that do not have matches.

super_aj <- anti_join(superheroes, publishers, by = "publisher")
super_aj

## # A tibble: 1 × 4
##      name alignment gender         publisher
##     <chr>     <chr>  <chr>             <chr>
## 1 Hellboy      good   male Dark Horse Comics

Full join

Mutating join

full_join(x, y): Returns all observations in x and y.

The full join keeps all the data from the first specified dataset (x) and the second dataset (y). Where there are no matches an NA will be returned.

super_fj <- full_join(superheroes, publishers, by = "publisher")
super_fj

## # A tibble: 8 × 5
##       name alignment gender         publisher yr_founded
##      <chr>     <chr>  <chr>             <chr>      <int>
## 1  Magneto       bad   male            Marvel       1939
## 2    Storm      good female            Marvel       1939
## 3 Mystique       bad female            Marvel       1939
## 4   Batman      good   male                DC       1934
## 5    Joker       bad   male                DC       1934
## 6 Catwoman       bad female                DC       1934
## 7  Hellboy      good   male Dark Horse Comics         NA
## 8     <NA>      <NA>   <NA>             Image       1992