Load the libraries we will use today
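
The chunk that loads the libraries is not shown here. Judging only from the functions called later in this section, at least vegan and cluster are needed; any plotting or data-handling packages the original session loaded are unknown, so the list below is an assumed minimum.

```r
# Assumed minimum set of packages, inferred from the function calls below
library(vegan)    # vegdist(), decostand()
library(cluster)  # daisy()
```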

Similarity and distances

To illustrate the concepts of similarity and distance, let's envision a data matrix with 4 sites and 2 species
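
The matrix itself is not shown in this section. The values below are inferred from the distance matrices printed later (they reproduce the Euclidean and Manhattan distances shown there), so treat them as a reconstruction rather than the original chunk.

```r
# Reconstructed example data: 4 sites (rows) x 2 species (columns).
# Values are inferred from the outputs printed later in this section.
hyp_data <- cbind(SpeciesA = c(1, 1, 6, 9),
                  SpeciesB = c(9, 8, 6, 1))
hyp_data
```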

Let's plot these in 2 dimensions to show the relationships
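
The plot is not reproduced here. One possible base R version is sketched below, with each site as a point in species space; the hyp_data values are inferred from the outputs later in this section, and the axis labels and limits are arbitrary choices.

```r
# Reconstructed example data (an assumption, see lead-in)
hyp_data <- cbind(SpeciesA = c(1, 1, 6, 9),
                  SpeciesB = c(9, 8, 6, 1))

# Each site is one point; the axes are the abundances of the two species
plot(hyp_data[, "SpeciesA"], hyp_data[, "SpeciesB"],
     xlab = "Species A abundance", ylab = "Species B abundance",
     xlim = c(0, 10), ylim = c(0, 10), pch = 19)

# Label the points so distances between sites can be read off the plot
text(hyp_data[, "SpeciesA"], hyp_data[, "SpeciesB"],
     labels = paste("Site", 1:4), pos = 3)
```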

How can we quantify that distance? One of the simplest methods is the Euclidean distance
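
A minimal sketch of such a function, handling one pair of sites at a time. The name euclidean_distance and the hyp_data values (inferred from the outputs later in this section) are assumptions, not the original code.

```r
# Reconstructed example data (an assumption, see lead-in)
hyp_data <- cbind(SpeciesA = c(1, 1, 6, 9),
                  SpeciesB = c(9, 8, 6, 1))

# Straight-line distance between two sites: square the per-species
# differences, add them up, then take the square root
euclidean_distance <- function(a, b) {
  sqrt(sum((a - b)^2))
}

euclidean_distance(hyp_data[1, ], hyp_data[2, ])  # sqrt(0^2 + 1^2) = 1
euclidean_distance(hyp_data[1, ], hyp_data[4, ])  # sqrt(8^2 + 8^2), about 11.31
```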

The problem with the function we wrote is that it handles only one pair of sites at a time, so it cannot easily calculate all of the pairwise distances.
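
Base R's dist() is one way around this: it computes every pairwise distance in a single call and returns the lower triangle. The hyp_data values below are a reconstruction inferred from the outputs later in this section.

```r
# Reconstructed example data (an assumption, see lead-in)
hyp_data <- cbind(SpeciesA = c(1, 1, 6, 9),
                  SpeciesB = c(9, 8, 6, 1))

# All pairwise Euclidean distances between rows (sites)
all_dist <- dist(hyp_data, method = "euclidean")
all_dist

# Convert to a full square matrix when individual entries are needed
as.matrix(all_dist)[1, 4]
```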

Common distance measures

Approximately 30 similarity or distance measures are in common use (Legendre and Legendre 2012).

The choice of which distance you are going to use depends on the data type and the type of analysis you will do.

Euclidean distance

\[ ED_{jk} = \sqrt{\sum_{i = 1}^p (x_{ij} - x_{ik})^2} \]

  • Most appealing measure because it has true “metric” properties
  • Column standardization can be used to remove potential issues with scale
  • Can be applied to data of any scale
  • Used in eigenvector ordinations (e.g., PCA)
  • Assumes that variables are not correlated
  • Emphasizes outliers
  • Loses sensitivity with heterogeneous data
  • Distances are not proportional

vegdist(hyp_data, method = "euclidean")
          1         2         3
2  1.000000                    
3  5.830952  5.385165          
4 11.313708 10.630146  5.830952

City block (Manhattan) distance

  • Most ecologically meaningful dissimilarities are Manhattan types
  • Less weight to outliers compared to Euclidean
  • Retains sensitivity with heterogeneous data
  • Distances are not proportional
vegdist(hyp_data, method = "manhattan")
   1  2  3
2  1      
3  8  7   
4 16 15  8
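
For comparison with the Euclidean case, the Manhattan distance simply adds up the absolute per-species differences. A small sketch, assuming the reconstructed hyp_data (inferred from the outputs in this section) and a hypothetical helper name manhattan_distance:

```r
# Reconstructed example data (an assumption, see lead-in)
hyp_data <- cbind(SpeciesA = c(1, 1, 6, 9),
                  SpeciesB = c(9, 8, 6, 1))

# Sum of absolute differences across species for one pair of sites
manhattan_distance <- function(a, b) {
  sum(abs(a - b))
}

manhattan_distance(hyp_data[1, ], hyp_data[4, ])  # |1 - 9| + |9 - 1| = 16

# Base R equivalent for all pairs of sites
man <- dist(hyp_data, method = "manhattan")
man
```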

Proportional distances

  • Manhattan distances expressed as a proportion of the maximum distance
  • Two communities with nothing in common have a distance of 1
vegdist(hyp_data, method = "manhattan")/max_dist
       1      2      3
2 0.0625              
3 0.5000 0.4375       
4 1.0000 0.9375 0.5000
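
The definition of max_dist is not shown. The output above is consistent with it being the largest Manhattan distance in the matrix (16), which is the assumption made in this sketch; hyp_data is likewise reconstructed from the printed outputs.

```r
# Reconstructed example data (an assumption, see lead-in)
hyp_data <- cbind(SpeciesA = c(1, 1, 6, 9),
                  SpeciesB = c(9, 8, 6, 1))

man <- dist(hyp_data, method = "manhattan")

# Assumed definition: the largest observed Manhattan distance
max_dist <- max(man)

# Rescale so the most dissimilar pair of sites has distance 1
prop_dist <- man / max_dist
prop_dist
```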

Sorensen or Bray-Curtis distance

  • Percent dissimilarity
  • Commonly used with species abundance but it can be used with data of any scale
  • Gives less weight to outliers than Euclidean
  • Retains sensitivity with heterogeneous data
  • Max when no species are in common
  • NOT metric and cannot be used with DA, PCA, or CCA
vegdist(hyp_data, method = "bray")
           1          2          3
2 0.05263158                      
3 0.36363636 0.33333333           
4 0.80000000 0.78947368 0.36363636
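
To make "percent dissimilarity" concrete, the Bray-Curtis value for a pair of sites is the summed absolute difference divided by the summed abundance of both sites. A base R sketch, assuming the reconstructed hyp_data and a hypothetical helper name bray_curtis:

```r
# Reconstructed example data (an assumption, see lead-in)
hyp_data <- cbind(SpeciesA = c(1, 1, 6, 9),
                  SpeciesB = c(9, 8, 6, 1))

# Summed absolute differences over the total abundance of both sites
bray_curtis <- function(a, b) {
  sum(abs(a - b)) / sum(a + b)
}

# Fill a square matrix pair by pair, then keep the lower triangle
n <- nrow(hyp_data)
bc <- matrix(0, n, n)
for (j in seq_len(n)) {
  for (k in seq_len(n)) {
    bc[j, k] <- bray_curtis(hyp_data[j, ], hyp_data[k, ])
  }
}
as.dist(bc)
```

For sites 1 and 2 this gives 1 / 19, which matches the first entry of the output above.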

Other proportional distances exist and differ in how they weight the dissimilarity. Two examples are:

  • Jaccard's distance
vegdist(hyp_data, method = "jaccard")
          1         2         3
2 0.1000000                    
3 0.5333333 0.5000000          
4 0.8888889 0.8823529 0.5333333
  • Kulczynski distance
vegdist(hyp_data, method = "kulczynski")
          1         2         3
2 0.0500000                    
3 0.3583333 0.3194444          
4 0.8000000 0.7888889 0.3583333
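
The difference in weighting can be seen by writing both measures out. In this sketch the quantitative Jaccard distance is computed from Bray-Curtis as 2B / (1 + B), and Kulczynski averages the shared abundance relative to each site's total. The helper names and the reconstructed hyp_data are assumptions.

```r
# Reconstructed example data (an assumption, see lead-in)
hyp_data <- cbind(SpeciesA = c(1, 1, 6, 9),
                  SpeciesB = c(9, 8, 6, 1))

bray_curtis <- function(a, b) sum(abs(a - b)) / sum(a + b)

# Quantitative Jaccard, expressed through Bray-Curtis (B): 2B / (1 + B)
jaccard <- function(a, b) {
  bc <- bray_curtis(a, b)
  2 * bc / (1 + bc)
}

# Kulczynski: 1 minus the mean of shared abundance over each site's total
kulczynski <- function(a, b) {
  shared <- sum(pmin(a, b))
  1 - 0.5 * (shared / sum(a) + shared / sum(b))
}

jaccard(hyp_data[1, ], hyp_data[2, ])     # 0.1
kulczynski(hyp_data[1, ], hyp_data[2, ])  # 0.05
```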

Euclidean distances based on species profiles

Chord distance

  • Similar conceptually to Euclidean, but data are row-normalized
  • Useful in species abundance because it removes differences in total abundance
  • Gives low weights to variables with low counts and many zeros
decostand(hyp_data, method = "normalize")
      SpeciesA  SpeciesB
[1,] 0.1104315 0.9938837
[2,] 0.1240347 0.9922779
[3,] 0.7071068 0.7071068
[4,] 0.9938837 0.1104315
attr(,"decostand")
[1] "normalize"
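
The output above shows only the row normalization. The chord distance itself is the Euclidean distance between those normalized rows, which can be sketched in base R as follows (hyp_data is reconstructed from the printed outputs, so it is an assumption):

```r
# Reconstructed example data (an assumption, see lead-in)
hyp_data <- cbind(SpeciesA = c(1, 1, 6, 9),
                  SpeciesB = c(9, 8, 6, 1))

# Scale each row (site) to unit length
row_norm <- hyp_data / sqrt(rowSums(hyp_data^2))

# Chord distance = Euclidean distance on the unit-length rows;
# with non-negative abundances it cannot exceed sqrt(2)
chord <- dist(row_norm)
chord
```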

Chi-square distances

  • Euclidean distances after completing a chi-square standardization
  • Distance used in correspondence analysis (CA) and canonical correspondence analysis (CCA)
vegdist(decostand(hyp_data, method = "chi.square"), method = "euclidean")
           1          2          3
2 0.02255336                      
3 0.81192099 0.78936762           
4 1.62384197 1.60128861 0.81192099
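
The chi-square standardization can be written out by hand: divide each row by its total, divide each column by the square root of its total, and multiply by the square root of the grand total. This base R sketch assumes the reconstructed hyp_data; with those values it reproduces the distances printed above.

```r
# Reconstructed example data (an assumption, see lead-in)
hyp_data <- cbind(SpeciesA = c(1, 1, 6, 9),
                  SpeciesB = c(9, 8, 6, 1))

# Row profiles: relative abundance within each site
profiles <- hyp_data / rowSums(hyp_data)

# Down-weight abundant species and rescale by the grand total
chi_std <- sqrt(sum(hyp_data)) *
  sweep(profiles, 2, sqrt(colSums(hyp_data)), "/")

# Euclidean distance on the standardized data
chi_dist <- dist(chi_std)
chi_dist
```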

Species profile distance

  • Euclidean distances on relative abundance
  • Variables with higher values and fewer zeros contribute more to the distance
vegdist(decostand(hyp_data, method = "total", MARGIN = 1), method = "euclidean")
           1          2          3
2 0.01571348                      
3 0.56568542 0.54997194           
4 1.13137085 1.11565737 0.56568542
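
Written out in base R, the species profile distance is the Euclidean distance between rows after dividing each row by its total. A sketch, assuming the reconstructed hyp_data:

```r
# Reconstructed example data (an assumption, see lead-in)
hyp_data <- cbind(SpeciesA = c(1, 1, 6, 9),
                  SpeciesB = c(9, 8, 6, 1))

# Relative abundance of each species within each site (rows sum to 1)
profiles <- hyp_data / rowSums(hyp_data)

# Euclidean distance between the site profiles
profile_dist <- dist(profiles)
profile_dist
```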

Hellinger distance

  • Euclidean distance on Hellinger-standardized data
  • Gives low weights to variables with low counts and many zeros
vegdist(decostand(hyp_data, method = "hellinger"), method = "euclidean")
           1          2          3
2 0.01808611                      
3 0.45950584 0.44188477           
4 0.89442719 0.87821391 0.45950584
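
The Hellinger standardization is the square root of the relative abundances, so the distance can be sketched in one line of base R (again assuming the reconstructed hyp_data):

```r
# Reconstructed example data (an assumption, see lead-in)
hyp_data <- cbind(SpeciesA = c(1, 1, 6, 9),
                  SpeciesB = c(9, 8, 6, 1))

# Square root of the row profiles, then Euclidean distance
hellinger_std <- sqrt(hyp_data / rowSums(hyp_data))
hellinger_dist <- dist(hellinger_std)
hellinger_dist
```

The square root is what shrinks the influence of the most abundant species relative to the plain species profile distance.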

Distances on binary data

pa_data
     [,1] [,2] [,3] [,4]
[1,]    0    1    1    1
[2,]    1    1    0    0
[3,]    1    0    1    0
[4,]    0    1    0    1
vegdist(pa_data, binary = TRUE, method = "jaccard")
          1         2         3
2 0.7500000                    
3 0.7500000 0.6666667          
4 0.3333333 0.6666667 1.0000000
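
For presence/absence data, the Jaccard distance is one minus the number of shared species divided by the number of species found in either site. A base R sketch using the pa_data values printed above (the helper name binary_jaccard is hypothetical):

```r
# Presence/absence matrix as printed above: rows are sites
pa_data <- rbind(c(0, 1, 1, 1),
                 c(1, 1, 0, 0),
                 c(1, 0, 1, 0),
                 c(0, 1, 0, 1))

binary_jaccard <- function(a, b) {
  shared <- sum(a == 1 & b == 1)  # species present in both sites
  either <- sum(a == 1 | b == 1)  # species present in at least one site
  1 - shared / either
}

binary_jaccard(pa_data[1, ], pa_data[4, ])  # 1 - 2/3: two of three species shared
binary_jaccard(pa_data[3, ], pa_data[4, ])  # 1: no species in common
```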

Binomial

  • Null hypothesis: the two communities are equal
vegdist(pa_data, binary = TRUE, method = "binomial")
          1         2         3
2 2.0794415                    
3 2.0794415 1.3862944          
4 0.6931472 1.3862944 2.7725887

Raup

  • Probabilistic index based on presence/absence data
  • Non-metric
vegdist(pa_data, binary = TRUE, method = "raup")
          1         2         3
2 1.0000000                    
3 1.0000000 0.8333333          
4 0.5000000 0.8333333 1.0000000

Categorical and mixed data

Gower's distance

  • For each variable, a distance measure suited to that data type is used and scaled to between 0 and 1
  • Then a linear combination of these per-variable distances, using user-specified weights (most simply an average), is calculated to create the final distance matrix
    • quantitative = range-normalized Manhattan distance
    • ordinal = variable is first ranked, then Manhattan with adjustment for ties
    • nominal = variables of k categories are first converted into k binary columns and then a Dice coefficient is used
as.matrix(daisy(df, metric = "gower", type = list(asym = c(2,3))))
          1         2         3         4         5         6         7
1 0.0000000 0.5749818 0.2292435 0.6173067 0.6699906 0.3176680 0.7372010
2 0.5749818 0.0000000 0.5776599 0.3290653 0.3113710 0.7376149 0.3421992
3 0.2292435 0.5776599 0.0000000 0.6155740 0.6217465 0.2546933 0.6187765
4 0.6173067 0.3290653 0.6155740 0.0000000 0.1165022 0.4421957 0.2865609
5 0.6699906 0.3113710 0.6217465 0.1165022 0.0000000 0.4654749 0.3493142
6 0.3176680 0.7376149 0.2546933 0.4421957 0.4654749 0.0000000 0.6120648
7 0.7372010 0.3421992 0.6187765 0.2865609 0.3493142 0.6120648 0.0000000
8 0.5794467 0.5822714 0.4104376 0.4621400 0.4719496 0.4484787 0.2979355
9 0.4471714 0.6888198 0.6018221 0.5264212 0.6171620 0.5137954 0.4431129
          8         9
1 0.5794467 0.4471714
2 0.5822714 0.6888198
3 0.4104376 0.6018221
4 0.4621400 0.5264212
5 0.4719496 0.6171620
6 0.4484787 0.5137954
7 0.2979355 0.4431129
8 0.0000000 0.2853586
9 0.2853586 0.0000000
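
The df passed to daisy() is not shown, so the calculation cannot be reproduced here. As an illustration of the idea only, the sketch below computes a Gower distance by hand for a made-up data frame with one quantitative and one nominal variable, using equal weights. The data, column names, and helper name are all hypothetical.

```r
# Hypothetical toy data: one quantitative and one nominal variable
toy <- data.frame(height  = c(10, 20, 40),
                  habitat = factor(c("forest", "forest", "meadow")))

gower_pair <- function(i, j, data) {
  # Quantitative: absolute difference divided by the variable's range
  d_height <- abs(data$height[i] - data$height[j]) / diff(range(data$height))
  # Nominal: 0 if the categories match, 1 if they differ
  d_habitat <- as.numeric(data$habitat[i] != data$habitat[j])
  # Equal weights: the final distance is the plain average
  mean(c(d_height, d_habitat))
}

gower_pair(1, 2, toy)  # (10/30 + 0) / 2 = 1/6
gower_pair(1, 3, toy)  # (30/30 + 1) / 2 = 1
```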
