Load the libraries we will use today. The functions used below come from vegan (vegdist(), decostand()) and cluster (daisy()).
Similarity and distances
To illustrate the concepts of similarity and distance, let's envision a data matrix with 4 sites and 2 species.
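The code that builds this matrix is not shown here. A minimal sketch follows, with counts inferred from the outputs printed later in this section (an assumption, not the original chunk); the object name `hyp_data` and the column names are taken from those later calls and outputs.

```r
# Hypothetical reconstruction: counts inferred from the printed outputs,
# not copied from the original code chunk.
hyp_data <- matrix(c(1, 1, 6, 9,    # SpeciesA at sites 1-4
                     9, 8, 6, 1),   # SpeciesB at sites 1-4
                   ncol = 2,
                   dimnames = list(NULL, c("SpeciesA", "SpeciesB")))
hyp_data
```

Each row is a site and each column is a species, which is the layout that the distance functions used below expect.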
Let's plot these in 2 dimensions to show the relationships.
How can we quantify that distance? One of the simplest methods is the Euclidean distance
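As a rough sketch of what such a function could look like, the snippet below computes the Euclidean distance between two sites. The name `euclid_dist` and the two example sites are hypothetical and may differ from the function written in the original session.

```r
# Euclidean distance between two sites, each given as a numeric vector
# of species values (hypothetical helper, not the original function).
euclid_dist <- function(a, b) {
  sqrt(sum((a - b)^2))
}

site1 <- c(SpeciesA = 1, SpeciesB = 9)  # assumed example values
site2 <- c(SpeciesA = 1, SpeciesB = 8)
euclid_dist(site1, site2)  # 1
```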
The problem with the function we wrote is that it cannot easily calculate all of the pairwise distances at once.
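One way around this, sketched here with base R's `dist()` rather than a hand-written function, is to compute every pairwise distance in a single call. The matrix values are inferred from the outputs printed in this section and should be treated as an assumption.

```r
# Assumed data: counts inferred from the printed outputs in this section.
hyp_data <- matrix(c(1, 1, 6, 9, 9, 8, 6, 1), ncol = 2,
                   dimnames = list(NULL, c("SpeciesA", "SpeciesB")))

# dist() returns the lower triangle of all pairwise row distances.
all_dists <- dist(hyp_data, method = "euclidean")
all_dists
```

The result is a `dist` object; `as.matrix(all_dists)` converts it to a full square matrix if that is easier to index.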
Common distance measures
Approximately 30 similarity or distance measures are in common use (Legendre and Legendre 2012).
The choice of which distance you are going to use depends on the data type and the type of analysis you will do.
Euclidean distance
\[ ED_{jk} = \sqrt{\sum_{i = 1}^p (x_{ij} - x_{ik})^2} \]
where \(x_{ij}\) and \(x_{ik}\) are the values of variable \(i\) at sites \(j\) and \(k\), and \(p\) is the number of variables.
- Most appealing measure because it has true “metric” properties
- Column standardization removes potential issues with scale
- Can be applied to data of any scale
- Used in eigenvector ordinations (e.g., PCA)
- Assumes that variables are not correlated
- Emphasizes outliers
- Loses sensitivity with heterogeneous data
- Distances are not proportional
vegdist(hyp_data, method = "euclidean")
1 2 3
2 1.000000
3 5.830952 5.385165
4 11.313708 10.630146 5.830952
City block (Manhattan) distance
- Most ecologically meaningful dissimilarities are Manhattan-type measures
- Gives less weight to outliers than Euclidean
- Retains sensitivity with heterogeneous data
- Distances are not proportional
vegdist(hyp_data, method = "manhattan")
1 2 3
2 1
3 8 7
4 16 15 8
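To make the calculation concrete, here is a small base R sketch of the Manhattan distance as a sum of absolute differences. The helper name `manhattan_dist` is hypothetical, and the matrix values are inferred from the printed outputs (an assumption).

```r
# Assumed data: counts inferred from the printed outputs in this section.
hyp_data <- matrix(c(1, 1, 6, 9, 9, 8, 6, 1), ncol = 2,
                   dimnames = list(NULL, c("SpeciesA", "SpeciesB")))

# Manhattan distance: sum of absolute differences across species.
manhattan_dist <- function(a, b) sum(abs(a - b))

manhattan_dist(hyp_data[1, ], hyp_data[4, ])  # 16
dist(hyp_data, method = "manhattan")          # all pairs at once
```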
Proportional distances
- Manhattan distances expressed as a proportion of the maximum distance
- Two communities with nothing in common have a distance of 1
vegdist(hyp_data, method = "manhattan")/max_dist
1 2 3
2 0.0625
3 0.5000 0.4375
4 1.0000 0.9375 0.5000
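The definition of `max_dist` is not shown above. The printed values are consistent with it being the largest Manhattan distance in the matrix (16), so the sketch below assumes that; the data values are likewise inferred from the printed outputs.

```r
# Assumed data: counts inferred from the printed outputs in this section.
hyp_data <- matrix(c(1, 1, 6, 9, 9, 8, 6, 1), ncol = 2,
                   dimnames = list(NULL, c("SpeciesA", "SpeciesB")))

man <- dist(hyp_data, method = "manhattan")

# Assumption: max_dist is the largest observed Manhattan distance.
max_dist <- max(man)

man / max_dist
```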
Sorensen or Bray-Curtis distance
- Percent dissimilarity
- Commonly used with species abundance but it can be used with data of any scale
- Gives less weight to outliers than Euclidean
- Retains sensitivity with heterogeneous data
- Max when no species are in common
- NOT metric and cannot be used with DA, PCA, or CCA
vegdist(hyp_data, method = "bray")
1 2 3
2 0.05263158
3 0.36363636 0.33333333
4 0.80000000 0.78947368 0.36363636
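As an illustration of the "percent dissimilarity" idea, the sketch below computes Bray-Curtis by hand as the summed absolute differences divided by the summed abundances of both sites. The helper name `bray_curtis` is hypothetical, and the data values are inferred from the printed outputs.

```r
# Assumed data: counts inferred from the printed outputs in this section.
hyp_data <- matrix(c(1, 1, 6, 9, 9, 8, 6, 1), ncol = 2,
                   dimnames = list(NULL, c("SpeciesA", "SpeciesB")))

# Bray-Curtis: sum of absolute differences over the total abundance
# of the two sites being compared.
bray_curtis <- function(a, b) sum(abs(a - b)) / sum(a + b)

bray_curtis(hyp_data[1, ], hyp_data[2, ])  # 1/19
bray_curtis(hyp_data[1, ], hyp_data[4, ])  # 16/20
```

The numerator is the Manhattan distance, which is why Bray-Curtis behaves like a Manhattan distance rescaled by how much was counted at the two sites.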
Other proportional distances exist and differ in how they weight the dissimilarity. Two examples are Jaccard and Kulczynski:
vegdist(hyp_data, method = "jaccard")
1 2 3
2 0.1000000
3 0.5333333 0.5000000
4 0.8888889 0.8823529 0.5333333
vegdist(hyp_data, method = "kulczynski")
1 2 3
2 0.0500000
3 0.3583333 0.3194444
4 0.8000000 0.7888889 0.3583333
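To show how these two measures weight the same information differently, here is a hand-rolled sketch. It uses the quantitative Jaccard form 2B/(1 + B), where B is Bray-Curtis, and the Kulczynski form based on shared minimum abundances. The helper names are hypothetical and the data values are inferred from the printed outputs.

```r
# Assumed data: counts inferred from the printed outputs in this section.
hyp_data <- matrix(c(1, 1, 6, 9, 9, 8, 6, 1), ncol = 2,
                   dimnames = list(NULL, c("SpeciesA", "SpeciesB")))

bray_curtis <- function(a, b) sum(abs(a - b)) / sum(a + b)

# Quantitative Jaccard expressed through Bray-Curtis (B): 2B / (1 + B).
jaccard_q <- function(a, b) {
  B <- bray_curtis(a, b)
  2 * B / (1 + B)
}

# Kulczynski: one minus the mean of the shared abundance expressed
# as a fraction of each site's own total.
kulczynski <- function(a, b) {
  shared <- sum(pmin(a, b))
  1 - 0.5 * (shared / sum(a) + shared / sum(b))
}

jaccard_q(hyp_data[1, ], hyp_data[2, ])
kulczynski(hyp_data[1, ], hyp_data[2, ])
```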
Euclidean distances based on species profiles
Chord distance
- Conceptually similar to Euclidean, but the data are row-normalized first
- Useful for species abundance data because it removes differences in total abundance
- Gives low weights to variables with low counts and many zeros
decostand(hyp_data, method = "normalize")
SpeciesA SpeciesB
[1,] 0.1104315 0.9938837
[2,] 0.1240347 0.9922779
[3,] 0.7071068 0.7071068
[4,] 0.9938837 0.1104315
attr(,"decostand")
[1] "normalize"
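The output above shows only the row normalization. A sketch of the remaining step, taking Euclidean distances on the normalized rows, is given below in base R; the data values are inferred from the printed outputs and the object names are hypothetical.

```r
# Assumed data: counts inferred from the printed outputs in this section.
hyp_data <- matrix(c(1, 1, 6, 9, 9, 8, 6, 1), ncol = 2,
                   dimnames = list(NULL, c("SpeciesA", "SpeciesB")))

# Scale each row to unit length (sum of squares equal to 1).
row_norm <- hyp_data / sqrt(rowSums(hyp_data^2))

# Chord distance: Euclidean distance between the unit-length rows.
chord <- dist(row_norm, method = "euclidean")
chord
```

Because every row now has length 1, the chord distance cannot exceed sqrt(2), whatever the original totals were.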
Chi-square distances
- Euclidean distances after completing a chi-square standardization
- Distance used in correspondence analysis (CA) and canonical correspondence analysis (CCA)
vegdist(decostand(hyp_data, method = "chi.square"), method = "euclidean")
1 2 3
2 0.02255336
3 0.81192099 0.78936762
4 1.62384197 1.60128861 0.81192099
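For readers who want to see what the standardization does, the sketch below reproduces it by hand: each value is divided by its row total and by the square root of its column total, then multiplied by the square root of the grand total. This form reproduces the values printed above for the assumed data, which are inferred from the printed outputs; object names are hypothetical.

```r
# Assumed data: counts inferred from the printed outputs in this section.
hyp_data <- matrix(c(1, 1, 6, 9, 9, 8, 6, 1), ncol = 2,
                   dimnames = list(NULL, c("SpeciesA", "SpeciesB")))

# Row profiles: each value as a proportion of its site total.
profiles <- hyp_data / rowSums(hyp_data)

# Divide each column by the square root of its column total,
# then rescale by the square root of the grand total.
chi_std <- sqrt(sum(hyp_data)) *
  sweep(profiles, 2, sqrt(colSums(hyp_data)), "/")

chi_dist <- dist(chi_std, method = "euclidean")
chi_dist
```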
Species profile distance
- Euclidean distances on relative abundance
- Variables with higher values and fewer zeros contribute more to the distance
vegdist(decostand(hyp_data, method = "total", MARGIN = 1), method = "euclidean")
1 2 3
2 0.01571348
3 0.56568542 0.54997194
4 1.13137085 1.11565737 0.56568542
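The same result can be sketched in base R by dividing each row by its total before taking Euclidean distances. The data values are inferred from the printed outputs (an assumption) and the object names are hypothetical.

```r
# Assumed data: counts inferred from the printed outputs in this section.
hyp_data <- matrix(c(1, 1, 6, 9, 9, 8, 6, 1), ncol = 2,
                   dimnames = list(NULL, c("SpeciesA", "SpeciesB")))

# Relative abundances: each row sums to 1.
profiles <- hyp_data / rowSums(hyp_data)

profile_dist <- dist(profiles, method = "euclidean")
profile_dist
```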
Hellinger distance
- Euclidean distance after Hellinger standardization
- Gives low weights to variables with low counts and many zeros
vegdist(decostand(hyp_data, method = "hellinger"), method = "euclidean")
1 2 3
2 0.01808611
3 0.45950584 0.44188477
4 0.89442719 0.87821391 0.45950584
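The Hellinger standardization can be sketched by hand as the square root of the relative abundances, followed by Euclidean distances. The data values are inferred from the printed outputs (an assumption) and the object names are hypothetical.

```r
# Assumed data: counts inferred from the printed outputs in this section.
hyp_data <- matrix(c(1, 1, 6, 9, 9, 8, 6, 1), ncol = 2,
                   dimnames = list(NULL, c("SpeciesA", "SpeciesB")))

# Hellinger standardization: square root of the row proportions.
hell_std <- sqrt(hyp_data / rowSums(hyp_data))

hell_dist <- dist(hell_std, method = "euclidean")
hell_dist
```

The square root compresses large proportions more than small ones, which is how it differs from the species profile distance.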
Distances on binary data
pa_data
[,1] [,2] [,3] [,4]
[1,] 0 1 1 1
[2,] 1 1 0 0
[3,] 1 0 1 0
[4,] 0 1 0 1
vegdist(pa_data, binary = TRUE, method = "jaccard")
1 2 3
2 0.7500000
3 0.7500000 0.6666667
4 0.3333333 0.6666667 1.0000000
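For presence/absence data the Jaccard dissimilarity can be worked out by counting: one minus the number of species shared by both sites divided by the number present at either site. The sketch below uses the `pa_data` values printed above; the helper name `jaccard_pa` is hypothetical.

```r
# Presence/absence matrix as printed above (rows = sites).
pa_data <- matrix(c(0, 1, 1, 1,
                    1, 1, 0, 0,
                    1, 0, 1, 0,
                    0, 1, 0, 1), nrow = 4, byrow = TRUE)

# Binary Jaccard: 1 - (species in both) / (species in at least one).
jaccard_pa <- function(a, b) {
  shared <- sum(a == 1 & b == 1)
  either <- sum(a == 1 | b == 1)
  1 - shared / either
}

jaccard_pa(pa_data[1, ], pa_data[4, ])  # 1/3
jaccard_pa(pa_data[3, ], pa_data[4, ])  # 1, no shared species
```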
Binomial
- Null hypothesis: the two communities are equal
vegdist(pa_data, binary = TRUE, method = "binomial")
1 2 3
2 2.0794415
3 2.0794415 1.3862944
4 0.6931472 1.3862944 2.7725887
Raup
- Probabilistic index based on presence/absence data
- Non-metric
vegdist(pa_data, binary = TRUE, method = "raup")
1 2 3
2 1.0000000
3 1.0000000 0.8333333
4 0.5000000 0.8333333 1.0000000
Categorical and mixed data
Gower's distance
- For each variable, a distance metric suited to that data type is applied and scaled to between 0 and 1
- Then a linear combination of those distances, using user-specified weights (most simply an average), is calculated to create the final distance matrix
- Quantitative = range-normalized Manhattan distance
- Ordinal = variable is first ranked, then Manhattan with an adjustment for ties
- Nominal = variables of k categories are first converted into k binary columns and then a Dice coefficient is used
as.matrix(daisy(df, metric = "gower", type = list(asym = c(2,3))))
1 2 3 4 5 6 7
1 0.0000000 0.5749818 0.2292435 0.6173067 0.6699906 0.3176680 0.7372010
2 0.5749818 0.0000000 0.5776599 0.3290653 0.3113710 0.7376149 0.3421992
3 0.2292435 0.5776599 0.0000000 0.6155740 0.6217465 0.2546933 0.6187765
4 0.6173067 0.3290653 0.6155740 0.0000000 0.1165022 0.4421957 0.2865609
5 0.6699906 0.3113710 0.6217465 0.1165022 0.0000000 0.4654749 0.3493142
6 0.3176680 0.7376149 0.2546933 0.4421957 0.4654749 0.0000000 0.6120648
7 0.7372010 0.3421992 0.6187765 0.2865609 0.3493142 0.6120648 0.0000000
8 0.5794467 0.5822714 0.4104376 0.4621400 0.4719496 0.4484787 0.2979355
9 0.4471714 0.6888198 0.6018221 0.5264212 0.6171620 0.5137954 0.4431129
8 9
1 0.5794467 0.4471714
2 0.5822714 0.6888198
3 0.4104376 0.6018221
4 0.4621400 0.5264212
5 0.4719496 0.6171620
6 0.4484787 0.5137954
7 0.2979355 0.4431129
8 0.0000000 0.2853586
9 0.2853586 0.0000000
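To illustrate the per-variable scaling and averaging described in the bullets, here is a simplified sketch for one quantitative and one nominal variable with equal weights. The data frame `toy` and the helper `gower_pair` are hypothetical and unrelated to the `df` used above; ordinal and asymmetric binary variables are not handled.

```r
# Hypothetical mixed data (not the df used above).
toy <- data.frame(height = c(10, 20, 40),
                  colour = factor(c("red", "red", "blue")))

# Gower distance between rows i and j, equal weights:
# - quantitative: absolute difference divided by the variable's range
# - nominal: 0 if the categories match, 1 otherwise
gower_pair <- function(i, j, data) {
  d_height <- abs(data$height[i] - data$height[j]) / diff(range(data$height))
  d_colour <- as.numeric(data$colour[i] != data$colour[j])
  mean(c(d_height, d_colour))
}

gower_pair(1, 2, toy)  # (10/30 + 0) / 2
gower_pair(1, 3, toy)  # (30/30 + 1) / 2
```

For a nominal variable, the 0/1 mismatch used here gives the same value as applying a Dice coefficient to the k binary columns, since two rows either share their single category or share none.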
