% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{beta and gamma diversity clustering}
%\VignettePackage{natto}

\documentclass[a4paper,10pt,twocolumn]{article}

\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{ucs}
\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}

\usepackage[round]{natbib}
\renewcommand{\cite}{\citep}

\author{Jari Oksanen}
\title{Cluster Analysis based on beta or gamma Diversity}
\begin{document}
\bibliographystyle{jss}

\newcommand{\code}[1]{{\tt #1}}
\newcommand{\pkg}[1]{\textbf{#1}}
\newcommand{\fun}[1]{\texttt{#1}}
\newcommand{\R}{\textsf{R}}
%%
<<prelim, echo = FALSE, results = "hide", message=FALSE>>=
library("knitr")
opts_chunk$set(fig.lp = "fig:", size = "small", out.width = "1.1\\linewidth",
               fig.align = "center", fig.show = "hold")
require(vegan)
require(natto)
@

\maketitle
\tableofcontents

\section{Introduction}

The \pkg{natto} package includes function \code{diverclust} that
introduces a potentially new method of beta and gamma diversity
clustering.  The idea of this new method is obvious and I have long
assumed that such a method must have been invented previously.
However, I have failed to find such a method, and therefore I added
the function in \pkg{natto}.  This document describes the method as
implemented in \pkg{natto} for two purposes: to explain what was done
in \pkg{natto} and to help finding its predecessors.

\section{The Method}

\subsection{The Stages of Clustering}

The main steps are:
\begin{enumerate}
\item Initially estimate diversity of all sampling units
  (\S\ref{sec:diversity}).
\item First evaluate the \emph{increase} in diversity when you
  pool together a pair of sampling units (optionally with equalization,
  \S\ref{sec:equalize}). The increase in diversity in pooling is
  conventionally called \emph{beta diversity}.
\item Select the smallest of pooled beta diversity values and make the
  pair a cluster.
\item Pool the sampling units of the new merged cluster by summing up
  abundance values (optionally with equalization
  \S\ref{sec:equalize}), and re-evaluate its beta diversities with
  all other units.
\item Go back to step 3 until all sampling units are merged.
\end{enumerate}

\subsection{Measurement of Diversity}
\label{sec:diversity}

Diversity indices are based on proportional abundances $p_j$ of for
species $j = 1 \ldots S$, where $S$ is the number of species. The
species proportions are normally found from community matrix
$\{x_{ij}\}$ by dividing by the sum of row sums
$p_j = x_{ij}/\sum_{i=1}^N x_{ij}$.  The \code{diverclust} function
uses \pkg{vegan} function \code{renyi} to evaluate any Rényi diversity
or the corresponding Hill number.  R{\'e}nyi diversity of order $a$ is
\cite{Hill73number}:
\begin{equation}
  \label{eq:renyi}
H_a = \frac{1}{1-a} \log \sum_{j=1}^S p_j^a \,,
\end{equation}
and the corresponding Hill number is $N_a = \exp(H_a)$.  Many common
diversity indices are special cases of Hill numbers: $N_0 = S$ is the
number of species, $N_1 = \exp(H')$ is exponent of the Shannon
diversity, $N_2 = 1/\sum_{j=1}^S p_j^2$ is the Simpson diversity, and
$N_\infty = 1/(\max p_j)$ is the Berger-Parker index. The
corresponding R\'{e}nyi diversities are $H_0 = \log(S)$, $H_1 = H'$,
$H_2 = - \log(\sum p_j^2)$, and $H_\infty = -\log(\max p_j)$.

The beta diversity is defined as an increase of diversity when pooling
sampling units. Pooling is performed by summing up abundance values of
species, and the beta diversity $\beta$ for a cluster of $M$ sampling
units is defined as a difference of the pooled gamma diversity and
mean of alpha diversities $\beta = \gamma - \bar \alpha$:
\begin{equation}
  \label{eq:partition}
  \beta_a = \underbrace{H_a \left(\sum_{i=1}^M x_{ij}
    \right)}_{\gamma} - \underbrace{\frac{1}{M} \sum_{i=1}^M H_a
    (x_{ij})}_{\bar \alpha}\,,
\end{equation}
Alternatively we can use Hill numbers $N_a$ in place of $H_a$. The
equation implies additive partitioning of diversity. However, Rényi
diversities are on $\log$ scale which translates a multiplicative
model into an additive model. If you use Hill numbers in
eq.~\ref{eq:partition}, the model is truly additive
(see the two forms in eq.~\ref{eq:dN0}).

\subsection{Equalizing Observations}
\label{sec:equalize}

The total pooled diversity ($\gamma$) in eq.~\ref{eq:partition}
includes both the within-unit diversity ($\alpha$) and between-unit
differences ($\beta$). This requires that pooled diversity really is
additive: it maintains the within-unit diversity and adds between-unit
differences. This seems to be the case with Rényi diversities, but not
necessarily for the Hill numbers beyond $N_0 = S$ (species
richness). However, if the sampling units are equalized, the
additivity is greatly improved. Equalization scales units with
different total abundances to a more similar magnitude for pooling,
but does not influence their alpha diversities. The suggested
equalization is dependent on the scale $a$ of the Rényi diversity, and
all values for a unit are divided by weight
\begin{equation}
  \label{eq:equalize}
  w_i = \left( \sum_{j=1}^S x_{ij}^a \right)^{1/a}\,.
\end{equation}
For $a=1$ (Shannon diversity) $w$ scales to unit sum, for $a=2$
(Simpson diversity) $w$ scales to unit sum of squares (norm), and for
$a=\infty$ (Berger-Parker diversity) $w$ is the maximum for each
sampling unit. No equalization is needed for $a=0$ (species
richness).

\subsection{Clustering Based on gamma Diversity}

If we omit the $\bar \alpha$ term for average alpha diversity in
eq.~\ref{eq:partition}, we can base the clustering on total or gamma
diversity. It may be difficult to see what would be the utility of such a
clustering, but perhaps it could be used to form low-diversity classes
that differ from each other.

\subsection{Implementation}

Function \code{diverclust} implements beta and gamma diversity
clustering as described above. The following options can be used to
modify its behaviour:
\begin{itemize}
  \item The scale $a$ of Rényi diversity (eq.~\ref{eq:renyi}) can be given.
  \item The equalization of eq.~\ref{eq:equalize} can be turned off or on.
  \item beta or gamma diversity clustering can be selected.
  \item Either Rényi diversities or Hill numbers can be selected.
\end{itemize}

The following example performs diversity clustering with defaults: it
uses beta diversity based on Rényi index $H_1$ (Shannon diversity) and
equalizes sample plots, and displays the dendrogram
(Fig.~\ref{fig:renyi1}).

<<renyi1,fig.cap="Diversity clustering of Barro Colorado Island forests based on $H_1$. The sampling units are named by their most abundant tree species.">>=
data(BCI)
## row names by most abundant species
colnames(BCI) <- make.cepnames(colnames(BCI))
dom <- colnames(BCI)[apply(BCI, 1, which.max)]
rownames(BCI) <- paste0(dom, 1:50)
## diversity clustering
cl <- diverclust(BCI, trace=FALSE)
plot(cl, hang = -1, cex=0.8)
@

\section{Other Methods}

One reason for writing this vignette was to find out if the
\code{diverclust} function re-invents an existing method. Information
Analysis is based on very similar reasoning as the diversity
clustering \cite{WilliamsEa66, LanceWilliams66}. To compare it against
\code{diverclust}, it was implemented as function \code{infoclust} in
the \pkg{natto} package.

\subsection{Clustering Based on Information Analysis}

Information analysis works on binary community data matrix. The
information content $I$ of a cluster of $M$ sampling units is defined
as \cite{WilliamsEa66, LanceWilliams66}:
\begin{multline}
  \label{eq:information}
  I = S M \log M \\ - \sum_{j=1}^S Mq_j \log Mq_j + M(1-q_j) \log M(1-q_j)\,,
\end{multline}
where $q_j$ is the relative frequency of species $j$ in the $M$
sampling units of the cluster. The contribution to $I$ is 0 for
species absent from the cluster ($q=0$) and for species present on
every unit in the cluster ($q=1$), and it is maximal for species with
frequency $q=0.5$. The clustering minimizes the information criterion
$I$ and therefore it tries to produce clusters where species are
either absent or nearly absent, or constant or nearly so. This often
produces clusters that are easy to interpret in floristic terms.

The clusters are formed similarly as in \code{diverclust}: two
sampling units or clusters are pooled, the species frequencies are
recalculated, and the information content with the pooled group and
all other groups are re-evaluated. However, the group selected for
merging is not not the one with lowest $I$, but the group that gives
the lowest increase $\Delta I$. The information content of single
sampling units is $I=0$, but formed clusters have positive values of
$I$, and the information values of the members of the cluster are
subtracted from the value of the cluster, and the difference
$\Delta I$ is used as the criterion of clustering. The merge still
happens at the level of pooled new unit $I$, and the clusters are not
formed in the order of their merge heights. This conflicts with \R{}
conventions, and the \code{infoclust} function updates the merge table
to correspond to the merge heights.

\citet{WilliamsEa66} and \citet{LanceWilliams66} describe their method
very briefly, and the current implementation is \pkg{natto} is based
on the worked example of \citet{Legendres}. The use is similar as for

\code{diverclust}:
<<infoclust,fig.cap="Information Analysis Clustering of the Barro Colorado Island forests.">>=
cli <- infoclust(BCI)
plot(cli, hang=-1, cex=0.8)
@

The clusters are often very compact (Fig.~\ref{fig:infoclust}), and
results differ from the diversity clustering. Fig.~\ref{fig:renyi1}
was based on quantitative data, but clusterings are often very
different with binary data as well. In this case, diversity clustering
of binary data with Simpson index ($N_2$) and equalization gave most
similar results to information clustering (Fig.~\ref{fig:tanglegram}).

<<tanglegram,fig.cap="Information analysis (Fig.~\\ref{fig:infoclust}) against diversity clustering of Simpson Index (Hill number $N_2$) of binary data with equalization of sample plots. The dendrograms were rearranged to minimize entanglement.",out.width="\\linewidth",message=FALSE>>=
library(dendextend) # tanglegram
cl2 <- diverclust(decostand(BCI, "pa"),
   renyi=2, hill=TRUE, equalize=TRUE, trace=FALSE)
tanglegram(untangle(as.dendrogram(cli),
   as.dendrogram(cl2), method="step2side"),
   main_left="Information", main_right="Rényi 2")
@

\subsection{Dissimilarity Indices}

The diversity clustering methods can also be expressed as
dissimilarity measures for two pooled sampling units. Such
dissimilarities do not produce the diversity clustering, because
dissimilarities between clusters cannot be found from dissimilarities
between sampling units, but the original data must be pooled by
clusters, and dissimilarites re-evaluated from the updated community
data. However, it may be instructive to compare these indices with
common dissimilarity measures. In this chapter we see how the
diversity measures can be expressed as dissimilarities with binary
data.

The \pkg{natto} package includes function \code{diverdist} that
evaluates the pairwise diversity dissimilarities, and is used in the
first step of diversity clustering to select first merged sampling
units.

The diversity clustering and Rényi diversities are principally
designed for quantitative data. For presence-absence dissimilarity
indices we analyse binary data.  Binary data defines a community with
maximum equitability, so that all Rényi indices will be $N = \log S$
and Hill numbers $H = S$ irrespective of the Rényi scale.  Average
alpha diversity with Rényi ($\bar \alpha_H$) or Hill ($\bar \alpha_N$)
indices for two sampling units eachs with $S_1$ and $S_2$ species is
\begin{align}
  \label{eq:alphaH}
  \bar \alpha_H &= \tfrac{1}{2}(\log S_1 + \log S_2)\\
  \bar \alpha_N &= \tfrac{1}{2}(S_1 + S_2)\,.
\end{align}

In two sampling units of $S_1$ and $S_2$ species and $S_{1 \cap 2}$ shared
species, there will be $S_1 + S_2 - 2S_{1 \cap 2}$ species that occur only one of
the units and each at proportion $p = \frac{1}{S_1 + S_2}$ and
$S_{1 \cap 2}$ species that occur in both sampling units at proportion
$p = \frac{2}{S_1 + S_2}$.  The general equation (\ref{eq:partition})
of distance based on Rényi beta diversity between two binary sampling
units is:
\begin{multline}
  d(H_a) = \frac{1}{1-a} [\log(S_1+S_2 + (2^a - 2) S_{1 \cap 2}) \\
            - a \log (S_1+S_2)] - \bar \alpha_H
\end{multline}
and the corresponding formula for Hill numbers is:
\begin{equation}
d(N_a) = \left[\frac{S_1 + S_2 + (2^a-2) S_{1 \cap 2}}{(S_1 + S_2)^a} \right]^\frac{1}{1-a}
- \bar \alpha_N \,.
\end{equation}

Some special cases simplify into more compact forms:

\begin{align}
  d(H_0) &= \log(S_1 + S_2 - S_{1 \cap 2}) - \bar \alpha_H \nonumber\\
         &= \log \left(\frac{S_1+S_2- S_{1 \cap 2}}{\sqrt{S_1 S_2}} \right) \\
  \label{eq:dN0}
  d(N_0) &= \tfrac{1}{2}(S_1 + S_2 - 2 S_{1 \cap 2}) \nonumber \\
           &= \bar \alpha_N - S_{1 \cap 2} \\
  \label{eq:dH1}
  d(H_1) &= \log(S_1 + S_2) - \frac{S_{1 \cap 2} \log 4}{S_1+S_2} - \bar \alpha_H\\
  d(H_2) &= 2 \log (S_1+S_2) \nonumber \\
         & - \log(S_1 + S_2 + 2 S_{1 \cap 2}) - \bar \alpha_H\\
  d(N_2) &= \frac{(S_1 + S_2)^2}{S_1 + S_2 + 2 S_{1 \cap 2}}  - \bar \alpha_N\\
  d(H_\infty) &= \log(S_1 + S_2) - \log 2 - \bar \alpha_H\\
  d(N_\infty) &= 0 \,.
\end{align}

These may be regarded as ``new'' dissimilarity measures, although
there hardly is a deficit of dissimilarity indices. Simplest case is
$d(N_0)$ which is only half of the number of non-shared species, or
half of the squared Euclidean distance between binary vectors. Most
importantly, these are special cases for binary data and two pooled
sampling units. Although the indices can be calculated for binary
data, they really are intended for quantitative data. More
importantly, these indices are only used for comparing a pair of
unmerged sites, and they do not apply to comparisons involving
clusters.

The implicit dissimilarity measure in information analysis is
\citep{WilliamsEa66,LanceWilliams66}:
\begin{equation}
  d(I) = (S_1 + S_2 - 2 S_{1 \cap 2}) \log(4) \,.
\end{equation}
This is $d(N_0)$ with different multiplier. However, the similarity to
$d(N_0)$ disappears when more than two sampling units are compared.

Dissimilarities based on diversity or information have been sometimes
suggested. \citet{KoleffEa03} suggest the following that they ascribe
to \citet{Routledge84}:
\begin{multline}
  \label{eq:routledge}
  d(I) = \log(S_1 + S_2) - \frac{S_{1 \cap 2} \log 4}{S_1 + S_2} \\
  - \frac{S_1}{S_1 + S_2} \log S_1 - \frac{S_2}{S_1+S_2} \log S_2\,.
\end{multline}
The formulation was based on \citet{KoleffEa03} who adapted it to
binary data, and it was further rearranged to emphasize its
resemblance to our $d(H_1)$ (eq.~\ref{eq:dH1}). The only difference is
that $d(H_1)$ uses unweighted average alpha diversity $\bar \alpha_H$
(eq.~\ref{eq:alphaH}), whereas eq.~\ref{eq:routledge} weights sampling
units by their species richness values. Using equalized pooling of
binary data produces dissimilarities that are even more similar to
this index.

\bibliography{diverclust}

\end{document}