library(tidyverse)
library(vegan)
library(cluster)
library(factoextra)
library(fpc)

Cluster analysis

The background

  • Cluster analysis is a broad group of multivariate techniques to identify homogenous groups
    • maximizes between group variation and minimizing within group variation
    • outcome: reduction of observations into fewer groups
    • often used in data mining or exploratory approaches
    • works best when there are inherent discontinuities in the data
      • if the data is continuous, ordination techniques may be preferred
      • ordination may force groups that do not exist
  • Occurs in two basic steps:
    1. measure of similarity betewen observations is specified
    2. Using this distance (and a clustering rule) observations are grouped based on either a hierarchical or partitioning technique
    3. Once a new cluster is formed, distances between clusters are based on single linkage (minimum distance), complete linkage (maximum method), or average linkage
  • Hiercharchical techniques are useful because they can reveal relationships in a nested fashion (i.e., phylogenetic tree)
    • not efficient for large data sets (> 500 obs)
  • Unlike hierarchical, partitioning does not require dissimilarity matrices

  • Partitioning methods follow four iterative steps:
    1. randomly assign cluster centroids
    2. classify clusters based on the closest centroid
    3. recaculate the centroid after each observation is added
    4. repeat steps1-3 until within cluster variation is minimized
  • Limitations of clustering
    • exploratory or hypothesis generating tool
    • Be considerate of using mixed data types. Gower’s distance should not be used in hierarchical analysis
    • Assumes distance measures follow a normal or multinomial distribution
    • clustering variables are appropriate for group separation
    • Can be influenced by scale and units
    • visual classifications are selective

Now on to the doing

We are going to use non-ecological data in this excersize to illustrate the different types of data that can be incorporated into this type of analysis

data("USArrests")
glimpse(USArrests)
Observations: 50
Variables: 4
$ Murder   <dbl> 13.2, 10.0, 8.1, 8.8, 9.0, 7.9, 3.3, 5.9, 15.4, 17.4, 5.3, 2.6, 10...
$ Assault  <int> 236, 263, 294, 190, 276, 204, 110, 238, 335, 211, 46, 120, 249, 11...
$ UrbanPop <int> 58, 48, 80, 50, 91, 78, 77, 72, 80, 60, 83, 54, 83, 65, 57, 66, 52...
$ Rape     <dbl> 21.2, 44.5, 31.0, 19.5, 40.6, 38.7, 11.1, 15.8, 31.9, 25.8, 20.2, ...

Lets scale the data

USArrests %>% 
  scale() -> arrest.scale
head(arrest.scale)
               Murder   Assault   UrbanPop         Rape
Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
Arizona    0.07163341 1.4788032  0.9989801  1.042878388
Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144  1.7589234  2.067820292
Colorado   0.02571456 0.3988593  0.8608085  1.864967207

lets convert this to a distance matrix using the factoextra::get_dist() function.

arrest.scale %>% 
  get_dist(upper = TRUE, diag = TRUE) -> arrest.dist

Visualizing the distance matrix

arrest.dist.df <- as.data.frame(as.matrix(arrest.dist))
arrest.dist.df$row <- rownames(arrest.dist.df)
arrest.dist.df %>% 
  gather(col, value, -row) -> arrest_long
ggplot(data = arrest_long) +
  geom_raster(aes(x = col,y=row, fill = value)) +
  coord_equal(expand = F) +
  scale_fill_gradient2( low = "red", mid = "white", high = "blue", midpoint = 3) +
theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())

LS0tCnRpdGxlOiAiQXBwbGllZCBNdWx0aXZhcmlhdGU6ICBCcmVha2luZyBtdWx0aXZhcmlhdGUgZGF0YSBpbnRvIGdyb3Vwcy4gUGFydCAxLiIKb3V0cHV0OiBodG1sX25vdGVib29rCmVkaXRvcl9vcHRpb25zOiAKICBjaHVua19vdXRwdXRfdHlwZTogaW5saW5lCi0tLQoKYGBge3Igc2V0dXAsIGluY2x1ZGU9RkFMU0V9CmtuaXRyOjpvcHRzX2NodW5rJHNldChlY2hvID0gVFJVRSkKYGBgCgpgYGB7ciBtZXNzYWdlID0gRkFMU0V9CmxpYnJhcnkodGlkeXZlcnNlKQpsaWJyYXJ5KHZlZ2FuKQpsaWJyYXJ5KGNsdXN0ZXIpCmxpYnJhcnkoZmFjdG9leHRyYSkKbGlicmFyeShmcGMpCmBgYAoKIyMgQ2x1c3RlciBhbmFseXNpcwoKIyMjIFRoZSBiYWNrZ3JvdW5kCgotIENsdXN0ZXIgYW5hbHlzaXMgaXMgYSBicm9hZCBncm91cCBvZiBtdWx0aXZhcmlhdGUgdGVjaG5pcXVlcyB0byBpZGVudGlmeSBob21vZ2Vub3VzIGdyb3VwcwogICAgLSBtYXhpbWl6ZXMgYmV0d2VlbiBncm91cCB2YXJpYXRpb24gYW5kIG1pbmltaXppbmcgd2l0aGluIGdyb3VwIHZhcmlhdGlvbiAKICAgIC0gb3V0Y29tZTogIHJlZHVjdGlvbiBvZiBvYnNlcnZhdGlvbnMgaW50byBmZXdlciBncm91cHMKICAgIC0gb2Z0ZW4gdXNlZCBpbiBkYXRhIG1pbmluZyBvciBleHBsb3JhdG9yeSBhcHByb2FjaGVzCiAgICAtIHdvcmtzIGJlc3Qgd2hlbiB0aGVyZSBhcmUgaW5oZXJlbnQgZGlzY29udGludWl0aWVzIGluIHRoZSBkYXRhCiAgICAgICAgIC0gIGlmIHRoZSBkYXRhIGlzIGNvbnRpbnVvdXMsIG9yZGluYXRpb24gdGVjaG5pcXVlcyBtYXkgYmUgcHJlZmVycmVkCiAgICAgICAgIC0gb3JkaW5hdGlvbiBtYXkgZm9yY2UgZ3JvdXBzIHRoYXQgZG8gbm90IGV4aXN0IAogICAgICAgICAKLSBPY2N1cnMgaW4gdHdvIGJhc2ljIHN0ZXBzOgogICAgMS4gbWVhc3VyZSBvZiBzaW1pbGFyaXR5IGJldGV3ZW4gb2JzZXJ2YXRpb25zIGlzIHNwZWNpZmllZAogICAgMi4gVXNpbmcgdGhpcyBkaXN0YW5jZSAoYW5kIGEgY2x1c3RlcmluZyBydWxlKSBvYnNlcnZhdGlvbnMgYXJlIGdyb3VwZWQgYmFzZWQgb24gZWl0aGVyIGEgaGllcmFyY2hpY2FsIG9yIHBhcnRpdGlvbmluZyB0ZWNobmlxdWUKICAgIDMuICBPbmNlIGEgbmV3IGNsdXN0ZXIgaXMgZm9ybWVkLCBkaXN0YW5jZXMgYmV0d2VlbiBjbHVzdGVycyBhcmUgYmFzZWQgb24gc2luZ2xlIGxpbmthZ2UgKG1pbmltdW0gZGlzdGFuY2UpLCBjb21wbGV0ZSBsaW5rYWdlIChtYXhpbXVtIG1ldGhvZCksIG9yIGF2ZXJhZ2UgbGlua2FnZQoKLSBIaWVyY2hhcmNoaWNhbCB0ZWNobmlxdWVzIGFyZSB1c2VmdWwgYmVjYXVzZSB0aGV5IGNhbiByZXZlYWwgcmVsYXRpb25zaGlwcyBpbiBhIG5lc3RlZCBmYXNoaW9uIChpLmUuLCBwaHlsb2dlbmV0aWMgdHJlZSkKICAgIC0gbm90IGVmZmljaWVudCBmb3IgbGFyZ2UgZGF0YSBzZXRzICg+IDUwMCBvYnMpCiAgICAKLSBVbmxpa2UgaGllcmFyY2hpY2FsLCBwYXJ0aXRpb25pbmcgZG9lcyBub3QgcmVxdWlyZSBkaXNzaW1pbGFyaXR5IG1hdHJpY2VzCgotIFBhcnRpdGlvbmluZyBtZXRob2RzIGZvbGxvdyBmb3VyIGl0ZXJhdGl2ZSBzdGVwczoKICAgIDEuIHJhbmRvbWx5IGFzc2lnbiBjbHVzdGVyIGNlbnRyb2lkcwogICAgMi4gY2xhc3NpZnkgY2x1c3RlcnMgYmFzZWQgb24gdGhlIGNsb3Nlc3QgY2VudHJvaWQKICAgIDMuIHJlY2FjdWxhdGUgdGhlIGNlbnRyb2lkIGFmdGVyIGVhY2ggb2JzZXJ2YXRpb24gaXMgYWRkZWQKICAgIDQuIHJlcGVhdCBzdGVwczEtMyB1bnRpbCB3aXRoaW4gY2x1c3RlciB2YXJpYXRpb24gaXMgbWluaW1pemVkCgotIExpbWl0YXRpb25zIG9mIGNsdXN0ZXJpbmcKICAgIC0gZXhwbG9yYXRvcnkgb3IgaHlwb3RoZXNpcyBnZW5lcmF0aW5nIHRvb2wKICAgIC0gQmUgY29uc2lkZXJhdGUgb2YgdXNpbmcgbWl4ZWQgZGF0YSB0eXBlcy4gIEdvd2VyJ3MgZGlzdGFuY2Ugc2hvdWxkIG5vdCBiZSB1c2VkIGluIGhpZXJhcmNoaWNhbCBhbmFseXNpcwogICAgLSBBc3N1bWVzIGRpc3RhbmNlIG1lYXN1cmVzIGZvbGxvdyBhIG5vcm1hbCBvciBtdWx0aW5vbWlhbCBkaXN0cmlidXRpb24KICAgIC0gY2x1c3RlcmluZyB2YXJpYWJsZXMgYXJlIGFwcHJvcHJpYXRlIGZvciBncm91cCBzZXBhcmF0aW9uCiAgICAtIENhbiBiZSBpbmZsdWVuY2VkIGJ5IHNjYWxlIGFuZCB1bml0cwogICAgLSB2aXN1YWwgY2xhc3NpZmljYXRpb25zIGFyZSBzZWxlY3RpdmUKCiMjIyBOb3cgb24gdG8gdGhlIGRvaW5nCgpXZSBhcmUgZ29pbmcgdG8gdXNlIG5vbi1lY29sb2dpY2FsIGRhdGEgaW4gdGhpcyBleGNlcnNpemUgdG8gaWxsdXN0cmF0ZSB0aGUgZGlmZmVyZW50IHR5cGVzIG9mIGRhdGEgdGhhdCBjYW4gYmUgaW5jb3Jwb3JhdGVkIGludG8gdGhpcyB0eXBlIG9mIGFuYWx5c2lzCgpgYGB7cn0KZGF0YSgiVVNBcnJlc3RzIikKZ2xpbXBzZShVU0FycmVzdHMpCmBgYAoKTGV0cyBzY2FsZSB0aGUgZGF0YSAKYGBge3J9ClVTQXJyZXN0cyAlPiUgCiAgc2NhbGUoKSAtPiBhcnJlc3Quc2NhbGUKCmhlYWQoYXJyZXN0LnNjYWxlKQpgYGAKCmxldHMgY29udmVydCB0aGlzIHRvIGEgZGlzdGFuY2UgbWF0cml4IHVzaW5nIHRoZSBgZmFjdG9leHRyYTo6Z2V0X2Rpc3QoKWAgZnVuY3Rpb24uCgpgYGB7cn0KYXJyZXN0LnNjYWxlICU+JSAKICBnZXRfZGlzdCh1cHBlciA9IFRSVUUsIGRpYWcgPSBUUlVFKSAtPiBhcnJlc3QuZGlzdApgYGAKClZpc3VhbGl6aW5nIHRoZSBkaXN0YW5jZSBtYXRyaXgKCmBgYHtyfQoKYXJyZXN0LmRpc3QuZGYgPC0gYXMuZGF0YS5mcmFtZShhcy5tYXRyaXgoYXJyZXN0LmRpc3QpKQphcnJlc3QuZGlzdC5kZiRyb3cgPC0gcm93bmFtZXMoYXJyZXN0LmRpc3QuZGYpCgphcnJlc3QuZGlzdC5kZiAlPiUgCiAgZ2F0aGVyKGNvbCwgdmFsdWUsIC1yb3cpIC0+IGFycmVzdF9sb25nCmBgYAoKYGBge3J9CmdncGxvdChkYXRhID0gYXJyZXN0X2xvbmcpICsKICBnZW9tX3Jhc3RlcihhZXMoeCA9IGNvbCx5PXJvdywgZmlsbCA9IHZhbHVlKSkgKwogIGNvb3JkX2VxdWFsKGV4cGFuZCA9IEYpICsKICBzY2FsZV9maWxsX2dyYWRpZW50MiggbG93ID0gInJlZCIsIG1pZCA9ICJ3aGl0ZSIsIGhpZ2ggPSAiYmx1ZSIsIG1pZHBvaW50ID0gMykgKwp0aGVtZV9jbGFzc2ljKCkgKwogIHRoZW1lKGF4aXMudGV4dC54ID0gZWxlbWVudF90ZXh0KGFuZ2xlID0gOTAsIGhqdXN0ID0gMSksCiAgICAgICAgYXhpcy50aXRsZS54ID0gZWxlbWVudF9ibGFuaygpLAogICAgICAgIGF4aXMudGl0bGUueSA9IGVsZW1lbnRfYmxhbmsoKSkKYGBgCgogCg==