% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{beta and gamma diversity clustering} %\VignettePackage{natto} \documentclass[a4paper,10pt,twocolumn]{article} \usepackage{amsmath} \usepackage{amssymb} \usepackage{ucs} \usepackage[utf8x]{inputenc} \usepackage[T1]{fontenc} \usepackage[round]{natbib} \renewcommand{\cite}{\citep} \author{Jari Oksanen} \title{Cluster Analysis based on beta or gamma Diversity} \begin{document} \bibliographystyle{jss} \newcommand{\code}[1]{{\tt #1}} \newcommand{\pkg}[1]{\textbf{#1}} \newcommand{\fun}[1]{\texttt{#1}} \newcommand{\R}{\textsf{R}} %% <>= library("knitr") opts_chunk$set(fig.lp = "fig:", size = "small", out.width = "1.1\\linewidth", fig.align = "center", fig.show = "hold") require(vegan) require(natto) @ \maketitle \tableofcontents \section{Introduction} The \pkg{natto} package includes function \code{diverclust} that introduces a potentially new method of beta and gamma diversity clustering. The idea of this new method is obvious and I have long assumed that such a method must have been invented previously. However, I have failed to find such a method, and therefore I added the function in \pkg{natto}. This document describes the method as implemented in \pkg{natto} for two purposes: to explain what was done in \pkg{natto} and to help finding its predecessors. \section{The Method} \subsection{The Stages of Clustering} The main steps are: \begin{enumerate} \item Initially estimate diversity of all sampling units (\S\ref{sec:diversity}). \item First evaluate the \emph{increase} in diversity when you pool together a pair of sampling units (optionally with equalization, \S\ref{sec:equalize}). The increase in diversity in pooling is conventionally called \emph{beta diversity}. \item Select the smallest of pooled beta diversity values and make the pair a cluster. \item Pool the sampling units of the new merged cluster by summing up abundance values (optionally with equalization \S\ref{sec:equalize}), and re-evaluate its beta diversities with all other units. \item Go back to step 3 until all sampling units are merged. \end{enumerate} \subsection{Measurement of Diversity} \label{sec:diversity} Diversity indices are based on proportional abundances $p_j$ of for species $j = 1 \ldots S$, where $S$ is the number of species. The species proportions are normally found from community matrix $\{x_{ij}\}$ by dividing by the sum of row sums $p_j = x_{ij}/\sum_{i=1}^N x_{ij}$. The \code{diverclust} function uses \pkg{vegan} function \code{renyi} to evaluate any Rényi diversity or the corresponding Hill number. R{\'e}nyi diversity of order $a$ is \cite{Hill73number}: \begin{equation} \label{eq:renyi} H_a = \frac{1}{1-a} \log \sum_{j=1}^S p_j^a \,, \end{equation} and the corresponding Hill number is $N_a = \exp(H_a)$. Many common diversity indices are special cases of Hill numbers: $N_0 = S$ is the number of species, $N_1 = \exp(H')$ is exponent of the Shannon diversity, $N_2 = 1/\sum_{j=1}^S p_j^2$ is the Simpson diversity, and $N_\infty = 1/(\max p_j)$ is the Berger-Parker index. The corresponding R\'{e}nyi diversities are $H_0 = \log(S)$, $H_1 = H'$, $H_2 = - \log(\sum p_j^2)$, and $H_\infty = -\log(\max p_j)$. The beta diversity is defined as an increase of diversity when pooling sampling units. Pooling is performed by summing up abundance values of species, and the beta diversity $\beta$ for a cluster of $M$ sampling units is defined as a difference of the pooled gamma diversity and mean of alpha diversities $\beta = \gamma - \bar \alpha$: \begin{equation} \label{eq:partition} \beta_a = \underbrace{H_a \left(\sum_{i=1}^M x_{ij} \right)}_{\gamma} - \underbrace{\frac{1}{M} \sum_{i=1}^M H_a (x_{ij})}_{\bar \alpha}\,, \end{equation} Alternatively we can use Hill numbers $N_a$ in place of $H_a$. The equation implies additive partitioning of diversity. However, Rényi diversities are on $\log$ scale which translates a multiplicative model into an additive model. If you use Hill numbers in eq.~\ref{eq:partition}, the model is truly additive (see the two forms in eq.~\ref{eq:dN0}). \subsection{Equalizing Observations} \label{sec:equalize} The total pooled diversity ($\gamma$) in eq.~\ref{eq:partition} includes both the within-unit diversity ($\alpha$) and between-unit differences ($\beta$). This requires that pooled diversity really is additive: it maintains the within-unit diversity and adds between-unit differences. This seems to be the case with Rényi diversities, but not necessarily for the Hill numbers beyond $N_0 = S$ (species richness). However, if the sampling units are equalized, the additivity is greatly improved. Equalization scales units with different total abundances to a more similar magnitude for pooling, but does not influence their alpha diversities. The suggested equalization is dependent on the scale $a$ of the Rényi diversity, and all values for a unit are divided by weight \begin{equation} \label{eq:equalize} w_i = \left( \sum_{j=1}^S x_{ij}^a \right)^{1/a}\,. \end{equation} For $a=1$ (Shannon diversity) $w$ scales to unit sum, for $a=2$ (Simpson diversity) $w$ scales to unit sum of squares (norm), and for $a=\infty$ (Berger-Parker diversity) $w$ is the maximum for each sampling unit. No equalization is needed for $a=0$ (species richness). \subsection{Clustering Based on gamma Diversity} If we omit the $\bar \alpha$ term for average alpha diversity in eq.~\ref{eq:partition}, we can base the clustering on total or gamma diversity. It may be difficult to see what would be the utility of such a clustering, but perhaps it could be used to form low-diversity classes that differ from each other. \subsection{Implementation} Function \code{diverclust} implements beta and gamma diversity clustering as described above. The following options can be used to modify its behaviour: \begin{itemize} \item The scale $a$ of Rényi diversity (eq.~\ref{eq:renyi}) can be given. \item The equalization of eq.~\ref{eq:equalize} can be turned off or on. \item beta or gamma diversity clustering can be selected. \item Either Rényi diversities or Hill numbers can be selected. \end{itemize} The following example performs diversity clustering with defaults: it uses beta diversity based on Rényi index $H_1$ (Shannon diversity) and equalizes sample plots, and displays the dendrogram (Fig.~\ref{fig:renyi1}). <>= data(BCI) ## row names by most abundant species colnames(BCI) <- make.cepnames(colnames(BCI)) dom <- colnames(BCI)[apply(BCI, 1, which.max)] rownames(BCI) <- paste0(dom, 1:50) ## diversity clustering cl <- diverclust(BCI, trace=FALSE) plot(cl, hang = -1, cex=0.8) @ \section{Other Methods} One reason for writing this vignette was to find out if the \code{diverclust} function re-invents an existing method. Information Analysis is based on very similar reasoning as the diversity clustering \cite{WilliamsEa66, LanceWilliams66}. To compare it against \code{diverclust}, it was implemented as function \code{infoclust} in the \pkg{natto} package. \subsection{Clustering Based on Information Analysis} Information analysis works on binary community data matrix. The information content $I$ of a cluster of $M$ sampling units is defined as \cite{WilliamsEa66, LanceWilliams66}: \begin{multline} \label{eq:information} I = S M \log M \\ - \sum_{j=1}^S Mq_j \log Mq_j + M(1-q_j) \log M(1-q_j)\,, \end{multline} where $q_j$ is the relative frequency of species $j$ in the $M$ sampling units of the cluster. The contribution to $I$ is 0 for species absent from the cluster ($q=0$) and for species present on every unit in the cluster ($q=1$), and it is maximal for species with frequency $q=0.5$. The clustering minimizes the information criterion $I$ and therefore it tries to produce clusters where species are either absent or nearly absent, or constant or nearly so. This often produces clusters that are easy to interpret in floristic terms. The clusters are formed similarly as in \code{diverclust}: two sampling units or clusters are pooled, the species frequencies are recalculated, and the information content with the pooled group and all other groups are re-evaluated. However, the group selected for merging is not not the one with lowest $I$, but the group that gives the lowest increase $\Delta I$. The information content of single sampling units is $I=0$, but formed clusters have positive values of $I$, and the information values of the members of the cluster are subtracted from the value of the cluster, and the difference $\Delta I$ is used as the criterion of clustering. The merge still happens at the level of pooled new unit $I$, and the clusters are not formed in the order of their merge heights. This conflicts with \R{} conventions, and the \code{infoclust} function updates the merge table to correspond to the merge heights. \citet{WilliamsEa66} and \citet{LanceWilliams66} describe their method very briefly, and the current implementation is \pkg{natto} is based on the worked example of \citet{Legendres}. The use is similar as for \code{diverclust}: <>= cli <- infoclust(BCI) plot(cli, hang=-1, cex=0.8) @ The clusters are often very compact (Fig.~\ref{fig:infoclust}), and results differ from the diversity clustering. Fig.~\ref{fig:renyi1} was based on quantitative data, but clusterings are often very different with binary data as well. In this case, diversity clustering of binary data with Simpson index ($N_2$) and equalization gave most similar results to information clustering (Fig.~\ref{fig:tanglegram}). <>= library(dendextend) # tanglegram cl2 <- diverclust(decostand(BCI, "pa"), renyi=2, hill=TRUE, equalize=TRUE, trace=FALSE) tanglegram(untangle(as.dendrogram(cli), as.dendrogram(cl2), method="step2side"), main_left="Information", main_right="Rényi 2") @ \subsection{Dissimilarity Indices} The diversity clustering methods can also be expressed as dissimilarity measures for two pooled sampling units. Such dissimilarities do not produce the diversity clustering, because dissimilarities between clusters cannot be found from dissimilarities between sampling units, but the original data must be pooled by clusters, and dissimilarites re-evaluated from the updated community data. However, it may be instructive to compare these indices with common dissimilarity measures. In this chapter we see how the diversity measures can be expressed as dissimilarities with binary data. The \pkg{natto} package includes function \code{diverdist} that evaluates the pairwise diversity dissimilarities, and is used in the first step of diversity clustering to select first merged sampling units. The diversity clustering and Rényi diversities are principally designed for quantitative data. For presence-absence dissimilarity indices we analyse binary data. Binary data defines a community with maximum equitability, so that all Rényi indices will be $N = \log S$ and Hill numbers $H = S$ irrespective of the Rényi scale. Average alpha diversity with Rényi ($\bar \alpha_H$) or Hill ($\bar \alpha_N$) indices for two sampling units eachs with $S_1$ and $S_2$ species is \begin{align} \label{eq:alphaH} \bar \alpha_H &= \tfrac{1}{2}(\log S_1 + \log S_2)\\ \bar \alpha_N &= \tfrac{1}{2}(S_1 + S_2)\,. \end{align} In two sampling units of $S_1$ and $S_2$ species and $S_{1 \cap 2}$ shared species, there will be $S_1 + S_2 - 2S_{1 \cap 2}$ species that occur only one of the units and each at proportion $p = \frac{1}{S_1 + S_2}$ and $S_{1 \cap 2}$ species that occur in both sampling units at proportion $p = \frac{2}{S_1 + S_2}$. The general equation (\ref{eq:partition}) of distance based on Rényi beta diversity between two binary sampling units is: \begin{multline} d(H_a) = \frac{1}{1-a} [\log(S_1+S_2 + (2^a - 2) S_{1 \cap 2}) \\ - a \log (S_1+S_2)] - \bar \alpha_H \end{multline} and the corresponding formula for Hill numbers is: \begin{equation} d(N_a) = \left[\frac{S_1 + S_2 + (2^a-2) S_{1 \cap 2}}{(S_1 + S_2)^a} \right]^\frac{1}{1-a} - \bar \alpha_N \,. \end{equation} Some special cases simplify into more compact forms: \begin{align} d(H_0) &= \log(S_1 + S_2 - S_{1 \cap 2}) - \bar \alpha_H \nonumber\\ &= \log \left(\frac{S_1+S_2- S_{1 \cap 2}}{\sqrt{S_1 S_2}} \right) \\ \label{eq:dN0} d(N_0) &= \tfrac{1}{2}(S_1 + S_2 - 2 S_{1 \cap 2}) \nonumber \\ &= \bar \alpha_N - S_{1 \cap 2} \\ \label{eq:dH1} d(H_1) &= \log(S_1 + S_2) - \frac{S_{1 \cap 2} \log 4}{S_1+S_2} - \bar \alpha_H\\ d(H_2) &= 2 \log (S_1+S_2) \nonumber \\ & - \log(S_1 + S_2 + 2 S_{1 \cap 2}) - \bar \alpha_H\\ d(N_2) &= \frac{(S_1 + S_2)^2}{S_1 + S_2 + 2 S_{1 \cap 2}} - \bar \alpha_N\\ d(H_\infty) &= \log(S_1 + S_2) - \log 2 - \bar \alpha_H\\ d(N_\infty) &= 0 \,. \end{align} These may be regarded as ``new'' dissimilarity measures, although there hardly is a deficit of dissimilarity indices. Simplest case is $d(N_0)$ which is only half of the number of non-shared species, or half of the squared Euclidean distance between binary vectors. Most importantly, these are special cases for binary data and two pooled sampling units. Although the indices can be calculated for binary data, they really are intended for quantitative data. More importantly, these indices are only used for comparing a pair of unmerged sites, and they do not apply to comparisons involving clusters. The implicit dissimilarity measure in information analysis is \citep{WilliamsEa66,LanceWilliams66}: \begin{equation} d(I) = (S_1 + S_2 - 2 S_{1 \cap 2}) \log(4) \,. \end{equation} This is $d(N_0)$ with different multiplier. However, the similarity to $d(N_0)$ disappears when more than two sampling units are compared. Dissimilarities based on diversity or information have been sometimes suggested. \citet{KoleffEa03} suggest the following that they ascribe to \citet{Routledge84}: \begin{multline} \label{eq:routledge} d(I) = \log(S_1 + S_2) - \frac{S_{1 \cap 2} \log 4}{S_1 + S_2} \\ - \frac{S_1}{S_1 + S_2} \log S_1 - \frac{S_2}{S_1+S_2} \log S_2\,. \end{multline} The formulation was based on \citet{KoleffEa03} who adapted it to binary data, and it was further rearranged to emphasize its resemblance to our $d(H_1)$ (eq.~\ref{eq:dH1}). The only difference is that $d(H_1)$ uses unweighted average alpha diversity $\bar \alpha_H$ (eq.~\ref{eq:alphaH}), whereas eq.~\ref{eq:routledge} weights sampling units by their species richness values. Using equalized pooling of binary data produces dissimilarities that are even more similar to this index. \bibliography{diverclust} \end{document}