Very often when we are working with datasets, particularly databases, we often want to combine multiple datasets. Traditionally, in R, this was done with the function merge(). Since the development of the tidyverse there has been improvements on the types and speed of joins. The joins available in the dplyr() package follow those in SQL type joins.
We will explore these types of joins using datasets from the our github repository
Inner join
Mutating join
inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
An inner join of superheroes with publisher would return a dataset of superheros that match those in publisher
Semi join
Filtering join
semi_join(x, y): Return all rows from x where there are matching values in y, keeping just columns from x.
A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
Notice that we loose ‘Hellboy’ because publisher was ‘Dark Horse Comics’ that did not exist in publisher. This differs from the inner join because it only retains the columns in superheroes, which is why it is a filtering join.
Left join
Mutating join
left_join(x, y): Return all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
The left join keeps all the data from the first specified dataset (x) and the columns from the second dataset (y). Where there are no matches an NA will be returned.
Notice that we loose ‘Hellboy’ because publisher was ‘Dark Horse Comics’ that did not exist in publisher. This differs from the inner join because it only retains the columns in superheroes, which is why it is a filtering join.
Right join
Mutating join
right_join(x, y): Returns all observations in y. It’s equivalent to left_join(y, x), but the columns will be ordered differently.
The right join keeps all the data from the first specified dataset (x) and the columns from the second dataset (y). Where there are no matches an NA will be returned.
Anti join
Filtering join
anti_join(x, y): Returns all rows in x that do not have a match in y.
Returns only the rows that do not have matches.
Full join
Mutating join
full_join(x, y): Returns all observations in x and y.
The full join keeps all the data from the first specified dataset (x) and the second dataset (y). Where there are no matches an NA will be returned.