%\documentclass[fleqn, letter, 10pt]{article}
%\documentclass[article]{jss}
\documentclass[nojss]{jss}
%\usepackage[round,longnamesfirst]{natbib}
%\usepackage[left=1.5in,top=1.5in,right=1.5in,bottom=1.5in,nohead]{geometry}
%\usepackage{graphicx,keyval,thumbpdf,url}
%\usepackage{hyperref}
%\usepackage{Sweave}
%\SweaveOpts{strip.white=TRUE, eps=FALSE}
%\AtBeginDocument{\setkeys{Gin}{width=0.6\textwidth}}

\usepackage[utf8]{inputenc}
\usepackage[english]{babel}

%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{amsmath}
\usepackage{amsfonts}


%\newcommand{\strong}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\newcommand{\class}[1]{\mbox{\textsf{#1}}}
\newcommand{\func}[1]{\mbox{\texttt{#1()}}}
%\newcommand{\code}[1]{\mbox{\texttt{#1}}}
%\newcommand{\pkg}[1]{\strong{#1}}
%\newcommand{\samp}[1]{`\mbox{\texttt{#1}}'}
%\newcommand{\proglang}[1]{\textsf{#1}}
\newcommand{\set}[1]{\mathcal{#1}}
\newcommand{\vect}[1]{\mathbf{#1}}
\newcommand{\mat}[1]{\mathbf{#1}}
%\newcommand{\sQuote}[1]{`{#1}'}
%\newcommand{\dQuote}[1]{``{#1}''}
\newcommand\R{{\mathbb{R}}}

%\DeclareMathOperator*{\argmin}{argmin}
%\DeclareMathOperator*{\argmax}{argmax}

%\setlength{\parindent}{0mm}
%\setlength{\parskip}{3mm plus2mm minus2mm}

%% \VignetteIndexEntry{Extensible Markov Model for data stream clustering}

\hyphenation{Clu-Stream}

\author{Michael Hahsler\\Southern Methodist University \And
    Margaret H. Dunham\\Southern Methodist University}
\title{\pkg{rEMM}: Extensible Markov Model for
Data Stream Clustering in \proglang{R}}
%% for pretty printing and a nice hypersummary also set:
\Plainauthor{Michael Hahsler, Margaret H. Dunham} %% comma-separated
\Plaintitle{rEMM: Extensible Markov Model for
Data Stream Clustering in R} %% without formatting
\Shorttitle{Extensible Markov Model for Data Stream Clustering} %% a short title (if necessary)

%% an abstract and keywords
\Abstract{
Clustering streams
of continuously arriving data has become an important application
of data mining in recent years and efficient algorithms have been proposed
by several researchers. However, clustering alone neglects the fact that
data in a data stream is not only characterized by the proximity
of data points which is used by clustering, but also by a temporal
component. The Extensible Markov Model (EMM) adds the temporal component
to data stream clustering by superimposing a dynamically adapting
Markov Chain. In this paper we introduce the implementation of the
\proglang{R}~extension package~\pkg{rEMM} which implements EMM
and we discuss some
examples and applications.
}
\Keywords{data mining, data streams, clustering, Markov chain}
\Plainkeywords{data mining, data streams, clustering, Markov chain} %% without formatting

%% at least one keyword must be supplied

%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{13}
%% \Issue{9}
%% \Month{September}
%% \Year{2004}
%% \Submitdate{2004-09-29}
%% \Acceptdate{2004-09-29}

%% The address of (at least) one author should be given
%% in the following format:
\Address{
    Michael Hahsler\\
        Computer Science\\
        Lyle School of Engineering\\
        Southern Methodist University\\
        P.O. Box 750122 \\
        Dallas, TX 75275-0122\\
        E-mail: \email{mhahsler@lyle.smu.edu}\\
        URL: \url{http://lyle.smu.edu/~mhahsler}

    Margaret H. Dunham\\
        Computer Science and Engineering\\
        Lyle School of Engineering\\
        Southern Methodist University\\
        P.O. Box 750122 \\
        Dallas, TX 75275-0122\\
        E-mail: \email{mhd@lyle.smu.edu}\\
        URL: \url{http://lyle.smu.edu/~mhd}
}
%% It is also possible to add a telephone and fax number
%% before the e-mail in the following format:
%% Telephone: +43/1/31336-5053
%% Fax: +43/1/31336-734

%% for those who use Sweave please include the following line (with % symbols):
    %% need no \usepackage{Sweave.sty}

    %% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



\begin{document}

%\title{rEMM: Extensible Markov Model for Data Stream Clustering in R}
%\author{Michael Hahsler and Margaret H. Dunham}
%\date{}

%\maketitle
\sloppy

%\abstract{}

<<echo=FALSE>>=
options(width = 70,
  prompt = "R> ",
  digits = 4)
### for sampling
set.seed(1234)
@


\section{Introduction}
Clustering data streams~\citep{stream_clust:Guha:2000}
has become an important field in recent years.
A data stream is an ordered and potentially infinite
sequence of data points
$\langle \vect{y}_1, \vect{y}_2, \vect{y}_3, \ldots\rangle$.
Such streams of constantly arriving
data are generated by many types of applications and include web click-stream
data, computer network monitoring data, telecommunication connection data,
readings from sensor nets, stock quotes, etc. An important property of data
streams for clustering is that data streams often produce massive amounts of
data which have to be processed in (or close to) real time since it is
impractical to permanently store the data (transient data).  This leads to the
following requirements:

\begin{itemize}
\item The data stream can only be processed in a single pass or scan
and typically only in the order of arrival.
\item Only a minimal amount of data can be retained and the clusters have to
be represented in an extremely concise way.
\item Data stream characteristics may change over time
(e.g., clusters move, merge, disappear or
new clusters may appear).
\end{itemize}

Many algorithms for data stream clustering have been proposed recently.
For example,
\cite{stream_clust:O'Callaghan:2002}~\citep[see also][]{stream_clust:Guha:2003}
study the $k$-medians problem. Their algorithm
called {\em STREAM} divides the data stream into pieces,
clusters each piece individually and then iteratively reclusters the resulting
centers to obtain a final clustering.
\cite{stream_clust:Aggarwal:2003} present {\em CluStream} which uses
micro-clusters (an extension of cluster feature vectors used
by BIRCH~\citep{stream_clust:Zhang:1996}).
Micro-clusters can be deleted and merged and permanently stored at
different points in time to allow to create final
clusterings (recluster micro-clusters with $k$-means) for different time frames.
Even though CluStream allows clusters to evolve over time, the ordering
of the arriving data points in the stream is lost.
\cite{stream_clust:Kriegel:2003} and \cite{stream_clust:Tasoulis:2007} present
variants of the density based method
{\em OPTICS}~\citep{stream_clust:Ankerst:1999} suitable for streaming data.
\cite{stream_clust:Aggarwal:2004} introduce {\em HPStream} which finds
clusters that are well defined in different subsets of the dimensions
of the data. The set of dimensions for each cluster can evolve over time
and a fading function is used to discount the influence of older data points
by fading the entire cluster structure.
\cite{stream_clust:Cao:2006} introduce {\em DenStream} which maintains
micro-clusters in real time and uses a variant of
GDBSCAN~\citep{stream_clust:Sander:1998} to produce a final clustering
for users.
\cite{stream_clust:Tasoulis:2006} present {\em WSTREAM,} which uses
kernel density estimation to find rectangular windows to represent clusters.
The windows can move, contract, expand and be merged over time.
More recent density-based data stream clustering algorithms are
{\em D-Stream}~\citep{stream_clust:Tu:2009} and
{\em MR-Stream}~\citep{stream_clust:Wan:2009}.
{\em D-Stream} uses an online
component to map each data point into a predefined grid and then uses an
offline component to cluster the grid based on density.
{\em MR-Stream} facilitates the discovery of clusters
at multiple resolutions by using a
grid of cells that can dynamically be sub-divided into more cells using a tree
data structure.
%\cite{stream_clust:Aggarwal:2008} Uncertain Data
%\cite{stream_clust:Aggarwal:2009} Massive Data

All approaches center on finding clusters of data points based on some notion
of proximity, but neglect the temporal structure of the data stream which
might be crucial to understanding the underlying processes.
For example, for intrusion detection a user might change from
behavior A to behavior B, both represented by clusters labeled non-suspicious
behavior, but the transition form A to B might be extremely unusual and give
away an intrusion event.
The Extensible Markov Model~(EMM) originally developed by
\cite{emm:Dunham:2004} provides a
technique to add temporal information in form of an evolving Markov Chain~(MC)
to data stream clustering algorithms.  Clusters correspond to states in the
Markov Chain and transitions represent the temporal information in the data.
EMM was successfully applied to rare event and intrusion
detection~\citep{emm:Meng:2006c, emm:Isaksson:2006, emm:Meng:2006a}, web usage
mining~\citep{emm:Lu:2006}, and identifying emerging events and developing
trends~\citep{emm:Meng:2006, emm:Meng:2006b}.
In this paper we describe an implementation of EMM in the
extension package~\pkg{rEMM} for the
\proglang{R}~environment for statistical computing~\citep{R:R:2005}.

Although the traditional Markov Chain is an excellent modeling technique for a
static set of temporal data, it can not be applied directly to stream data. As
the content of stream data is not known apriori, the requirement of a
fixed transition matrix is too restrictive. The dynamic nature of EMM
resolves this problem.
Although there have been a few other approaches to the use of dynamic Markov
chains ~\citep{emm:cormack:1987, emm:ostendorf:1997, emm:goldberg:1999}, none
of the others provide the complete flexibility needed by stream clustering to
create, merge, and delete clusters.

This paper is organized as follows. In the next section we
introduce the concept of EMM and show that all operations needed for
adding EMM to data stream clustering algorithms can be performed efficiently.
Section~\ref{sec:clustering} introduces the simple data stream clustering
algorithm implemented in \pkg{rEMM}. In Section~\ref{sec:implementation}
we discuss implementation details of the package.
Sections~\ref{sec:examples} and \ref{sec:applications} provide
examples for the package's functionality and apply EMM
to analyzing river flow data and to genetic sequences.
We conclude with Section~\ref{sec:conclusion}.

A previous version of this paper was published in the Journal of Statistical
Software~\citep{Hahsler+Dunham:2010}.

\section{Extensible Markov model}
\label{sec:EMM}\nopagebreak
The Extensible Markov Model (EMM) can be understood as an evolving
Markov Chain (MC) which at each point in time
represents a regular time-homogeneous MC which is updated when new data is
available. In the following we will restrict the discussion to first order EMM
but, as for a regular MC, it is straight forward to extend EMM to higher order
models~\citep{misc:Kijima:1997}.

\paragraph{Markov Chain.}
A (first order) discrete parameter Markov Chain~\citep{misc:Parzen:1999} is a
special case of a Markov Process in discrete time and with a discrete state
space.
It is characterized by a sequence
$\langle X_1, X_2, \dots\rangle$
of random variables $X_t$
with $t$ being the time index.
All random variables have the same domain
$\mathrm{dom}(X_t) = S =\{s_1, s_2, \dots, s_K\}$, a
set called the state space. The Markov
property states that the next state is only dependent on the current state.
Formally,

\begin{equation}
P(X_{t+1} = s \; | \; X_t =s_t, \dots, X_1=s_1) = P(X_{t+1} = s \; | \; X_t =s_t)
\end{equation}

where $s, s_t \in S$.
For simplicity we use for transition probabilities the notation
$$a_{ij} = P(X_{t+1} = s_j \; | \; X_t =s_i)$$
where it is appropriate.  Time-homogeneous MC can
be represented by a graph with the states as vertices and the edges labeled
with transition probabilities. Another representation is as
a $K \times K$ transition
matrix $\mat{A}$ containing the transition
probabilities from each state to all other states.

\begin{equation}
\mat{A} =
\begin{pmatrix}
a_{11}&a_{12}&\dots&a_{1K}\\
a_{21}&a_{22}&\dots&a_{2K}\\
\vdots&\vdots&\ddots&\vdots\\
a_{K1}&a_{K2}&\dots&a_{KK}\\
\end {pmatrix}
\end{equation}

MCs are very useful to keep track of temporal information using the Markov
Property as a relaxation.  With a MC it is easy to forecast the probability of
future states. For example the probability to get from a
given state to any other
state in $n$ time steps is given by the matrix $\mat{A}^n$. With an MC it is
also easy to calculate the probability of a new sequence
of length $t$ as the product of transition probabilities:
\begin{equation}
P(X_t=s_t, X_{t-1}=s_{t-1} \dots, X_1=s_1) =
P(X_1=s_1) \prod_{i=1}^{t-1}{P(X_{i+1} = s_{i+1} \; |  \; X_{i} =s_i)}
\end{equation}

The probabilities of a Markov Chain can be directly estimated from
data using the maximum likelihood method by
\begin{equation}
a_{ij} = c_{ij}/n_i,
\label{eqn:estim}
\end{equation}
where $c_{ij}$ is the
observed count of transitions from $s_i$ to $s_j$ in the data
and $n_i=\sum_{k=1}^K {c_{ik}}$, the sum of all outgoing transitions from $s_i$.

\paragraph{Stream Data and Markov Chains.}
Data streams typically contain dimensions with continuous data and/or have
discrete dimensions with a large number of domain
values~\citep{stream_clust:Aggarwal:2009}.
In addition, the data may continue to arrive resulting in a
possibly infinite number of
observations. Therefore data points have to be mapped onto a manageable
number of states. This mapping is done online as data arrives using
data stream clustering where each cluster (or
micro-cluster) is represented by a state in the MC.
Because of this one-to-one relationship we use cluster and state
for EMM often as synonyms.

%% MFH: I don't understand
%%Due to the fact that the
%%stream data may arrive from many differ sources, the evolving EMM we use for
%%stream data is a multivariate MC.
The transition count
information is obtained during the clustering process by using an
additional data structure efficiently representing the MC transitions.
Since it only uses information (assignment of a data point to a
cluster) which is created by the clustering algorithm anyway, the computational
overhead is minimal.  When the clustering algorithm creates, merges or deletes
clusters, the corresponding states in the MC are also created,
merged or deleted resulting in the evolving MC.
Note that $K$, the size of the
set of clusters and of states $S$ is not
fixed for EMMs and will change over time.

In the following we look at the additional data structures and the operations
on these structure which are necessary to extend an existing data stream
clustering algorithm for EMM.

\paragraph{Data Structures for the EMM.}
Typically algorithms for data stream clustering use a very compact
representation for each cluster consisting of a description of the center
and how many data points were assigned to the cluster so far.
Some algorithms also keep summary information of the dispersion of the
data points assigned to each cluster.
Since the cluster also represents a state in the EMM we need to
add a data structure to store the outgoing edges and their counts.
For each cluster $i$ representing state $s_i$ we need to store a
transition count vector~$\vect{c}_{i}$.
All transition counts in an EMM can be seen as a
transition $K \times K$ count matrix~$\mat{C}$ composed
of all transition count vectors.
It is easy to
calculate the estimated transition probability matrix from the
transition count matrix (see Equation~(\ref{eqn:estim})).
%by
%\begin{equation}
%$\mat{A} = \mat{C} / \vect{n},$
%\end{equation}
%where $\vect{n} = (n_1, n_2,\dots,n_K)$ is the vector of row sums of $\mat{C}$.
Note that $n_i$ in Equation~(\ref{eqn:estim}) normally is the same as the
number of data points assigned to cluster $i$ maintained by
the clustering algorithm. If we manipulate the
clustering using certain operations, e.g., by deleting clusters or
fading the cluster structure (see below),
the values of $n_i$ calculated from $\mat{C}$ will diverge from the
number of assigned data points maintained by the clustering algorithm.
However, this is desirable since it ensures that the probabilities
calculated for the transition probability matrix $\mat{A}$
stay consistent and keep adding up to unity.

%For data with an existing temporal structure,
%EMM will produce a transition count matrix
%$\mat{C}$ with many of the $K^2$ entries containing a count of zero.  Storing
%the transition count matrix in a sparse data structure
%helps to reduce space requirements significantly, however, using a
%full matrix increases processing speed.

For EMM we also need to keep track of the current state
$\gamma \in \{\epsilon, 1,2,\dots,K\}$ which is either
no state~($\epsilon$; before the first data point has arrived)
or the index of one of the $K$ states.
%\marginpar{explain $\epsilon$}
We store the transitions from $\epsilon$ to the first
state in form of an initial transition count vector
$\vect{c^{\epsilon}}$ of length $K$. Note that the superscript
is used to indicate that this is the special count vector
from $\epsilon$ to all existing states.
The initial transition probability vector is calculated by
$\vect{p^{\epsilon}} = \vect{c^{\epsilon}}/ \sum_{k=1}^K{c^\epsilon_ k}.$
For a single continuous data stream,
only one of the elements of $\vect{p^{\epsilon}}$
is one and all others are zero. However, if we have a data stream that
naturally should be split into several sequences (e.g., a sequence for
each day for stock exchange data),
$\vect{p^{\epsilon}}$ is the probability of each state
to be the first state
in a sequence (see also the genetic sequence analysis application in
Section~\ref{sec:genetic_sequence_analysis}).

%start new text
Thus in addition to the current state $\gamma$ there are only two data
structures needed by EMM:
the transition count matrix, $\mat{C}$, and
and the initial transition count vector, $\vect{c^\epsilon}$.
These are only related to maintaining the transition information.
No additional data is needed for the clusters themselves.
%end new text

\paragraph{EMM Clustering Operations.}

We now define how the operations typically performed by data stream clustering
algorithms on (micro-)clusters can be mirrored for the EMM.

\begin{description}
\item[Adding a data point to an existing cluster.]
When a data point is added to an existing cluster $i$, the EMM has to
update the transition count from the current state $\gamma$
to the new state $s_i$ by
setting $c_{\gamma i} = c_{\gamma i} +1$.
Finally the current state is set to the new state by $\gamma=i$.

\item[Creating a new cluster.]
This operation increases the number of clusters/states from $K$ to $K+1$
by adding a new \mbox{(micro-)cluster.}
To store the transition counts from/to this new cluster,
we enlarge the transition count matrix $\mat{C}$ by a row and a
column which are initialized to zero.
%If the matrix is stored in a sparse data structure,
%this modification incurs only minimal overhead.

\item[Deleting clusters.]
When a cluster $i$ (typically an outlier cluster) is deleted by
the clustering algorithm, all we need to do is to remove
the row $i$ and column $i$ in the transition count matrix $\mat{C}$.
This deletes the corresponding state $s_i$ and reduces $K$ to $K-1$.

\item[Merging clusters.]
When two clusters~$i$ and $j$ are merged into a new cluster $m$,
we need to:

\begin{enumerate}
\item Create new state $s_m$ in $\mat{C}$ (see creating a new cluster above).
\item Compute the outgoing edges for $s_m$  by $c_{mk} = c_{ik} +c_{jk},\;k=1,2,\ldots K$.
\item Compute the incoming edges for $s_m$ by $c_{km} = c_{ki} + c_{kj},\;k=1,2,\ldots K$.
\item Delete columns and rows for the old states $s_i$ and $s_j$
    from $\mat{C}$ (see deleting clusters above).
\end{enumerate}

It is straight forward to extend the merge operation to an arbitrary number of
clusters at a time.  Merging states also covers reclustering which is done by
many data stream clustering algorithm to create a final clustering for the
user/application.


\item[Splitting clusters.]
Splitting micro-clusters is typically not implemented in
data stream clustering algorithms since the individual data points are not
stored and therefore it is not clear how to create two new meaningful clusters.
When clusters are ``split'' by algorithms like BIRCH, it typically
only means that
one or several micro-clusters are assigned to a different cluster
of micro-clusters. This case
does not affect the EMM, since the states are attached to the micro-clusters
and thus will move with them to the new cluster.

However, if splitting cluster $i$ into two new clusters $n$ and $m$
is necessary, we replace $s_i$ by the two states, $s_n$ and $s_m$, with equal
incoming and
outgoing transition probabilities by splitting the counts
between $s_n$ and $s_m$ proportional to $n_n$ and $n_m$:
\begin{align*}
c_{nk}& = n_n(c_{ik}/n_i),\;k=1,2,\ldots K\\
c_{kn}& = n_n(c_{ki}/n_i),\;k=1,2,\ldots K\\
c_{mk}& = n_m(c_{ik}/n_i),\;k=1,2,\ldots K\\
c_{km}& = n_m(c_{ki}/n_i),\;k=1,2,\ldots K\\
\end{align*}
After the split we delete $s_i$.

\item[Fading the cluster structure.]
Clusterings and EMMs adapt to changes in data over time. New data points
influence the clusters and  transition probabilities. However, to enable the
EMM to learn the temporal structure, it also has to forget old data.
Fading the cluster structure is
for example used by HPStream~\citep{stream_clust:Aggarwal:2004}.
Fading is achieved by reducing
the weight of old observations in the data stream over time. We use
a decay rate $\lambda \ge 0$ to specify the weight over time. We
define the weight
for data that is $t$ timesteps in the past by the following strictly decreasing
function:
\begin{equation}
w_t = 2^{-\lambda t}.
\end{equation}

Since data points are not stored, the weighting has to be performed on the
transition counts. This is easy since
the weight defined above is multiplicative:
\begin{equation}
w_t = \prod_{i=1}^t{2^{-\lambda}}
\end{equation}
and thus can be applied iteratively.
This property allows us to fade all transition counts in the EMM by
\begin{align*}
\mat{C}_{t+1} & = 2^{-\lambda}\; \mat{C}_{t} \quad \text{and} \\
\vect{c}^\epsilon_{t+1} & = 2^{-\lambda}\; \vect{c}^\epsilon_{t}
\end{align*}
each time step resulting in a compounded fading effect.
The exact time of fading is decided
by the clustering algorithm.
Fading can be used before each new data point is added, or at other regular
intervals appropriate for the application.
\end{description}

The discussed operations cover all cases typically needed to incorporate
EMM into existing data stream clustering algorithms.  For example,
BIRCH~\citep{stream_clust:Zhang:1996},
CluStream~\citep{stream_clust:Aggarwal:2003},
DenStream~\citep{stream_clust:Cao:2006} or
WSTREAM~\citep{stream_clust:Tasoulis:2006} can be extended to maintain temporal
information in form of an EMM.

Next we introduce the simple data stream clustering
algorithm called threshold nearest neighbor clustering algorithm
implemented in \pkg{rEMM}.

\section{Threshold Nearest Neighbor clustering algorithm}
\label{sec:clustering}\nopagebreak

%start new text
Although the EMM concept can be built on top of any stream clustering algorithm
that uses exclusively the operations described above,
we discuss here only a simple algorithm
used in our initial \proglang{R}~implementation. The clustering algorithm
applies a variation of the
Nearest Neighbor (NN) algorithm which instead of always placing a new
observation in the closest existing cluster creates a new cluster if no
existing cluster is near enough. To specify what near enough means, a threshold
value must be provided. We call this algorithm threshold NN (tNN).
%end new text
The clusters produced by tNN can be considered micro-clusters
which can be merged later on in an optional reclustering phase.
To represent (micro-)clusters, we use the following information:

\begin{itemize}
\item Cluster centers
\item Number of data points assigned to the cluster
\end{itemize}


In Euclidean space we use centroids as cluster centers since
they can be easily incrementally updated as new data points are added to
the cluster by
$$\vect{z}_{t+1} = n/(n+1) \vect{z}_t + 1/(n+1) \vect{y}$$
where $\vect{z}_t$ is the old centroid for a cluster containing
$n$ points, $\vect{y}$ is the new data point and $\vect{z}_{t+1}$ is
the updated centroid for $n+1$ data points
(see, e.g., BIRCH by \cite{stream_clust:Zhang:1996}).
Finding canonical centroids
in non-Euclidean space typically has no closed form and is a
computationally expensive optimization problem which needs access to all
data points belonging
to the cluster~\citep{misc:Leisch:2006}.
Since we do not store the data points for our clusters,
even exact medoids cannot be found and we have to resort to
{\em fixed pseudo medoids} or {\em moving pseudo centroids.}
We define fixed pseudo medoids
as the first data point which creates a
new cluster. The idea is that since we use a fixed
threshold around the center, points will be added around the initial data point
which makes it a reasonable center possibly close to the real medoid.
As an alternative approach, if we have at least a linear space, we define
moving pseudo centroids as the first data point and then,
to approximate the adjustment, we apply a simple updating scheme that moves
a pseudo centroid towards each new data point that is assigned to its
cluster:
$$\vect {z}_{t+1} = (1-\alpha) \vect{z}_t + \alpha \vect{y}$$
where $\alpha$ controls how
much the pseudo centroid moves in the direction of the new data point.
Typically we use $\alpha = \frac{1}{n+1}$ which results in
an approximation of the centroid that is equal to
adjustments made for centroids in Euclidean space.

Note, that we do not store the sums and sum of squares of observations
like BIRCH~\citep{stream_clust:Zhang:1996} and similar micro-cluster based
algorithms since this only helps with calculating measures
meaningful in Euclidean space and the clustering algorithm here is
intended to be independent
from the chosen proximity measure.

%For the EMM we store a data structure for the transition counts
%and a vector of initial counts
%One of the clusters is the current cluster.
%\marginpar{Initial probabilities and $\epsilon$}

Algorithm to add a new data point to a clustering:
\begin{enumerate}
\item Compute dissimilarities between the new data point and the $k$
    centers.
\item Find the closest cluster with a dissimilarity smaller than the threshold.
\item If such a cluster exists then assign the new point to the cluster
and adjust the cluster center.
\item Otherwise create a new cluster for the point.
\end{enumerate}

To observe memory limitations, clusters with very low counts (outliers)
can be removed or close clusters can be merged during clustering.

%After the data point
%\item Update EMM (the transition count from the current cluster to the new
%    cluster.
%\item Let the current cluster be the new cluster.

The clustering produces a set of micro-clusters.
These micro-clusters can be directly used for an application or
they can be reclustered to create a final clustering to present
to a user or to be used by an application.
For reclustering, the micro-cluster centers are
treated as data points and clustered by
an arbitrary algorithm (hierarchical clustering, $k$-means,
$k$-medoids, etc.). This choice of clustering algorithm gives the user
the flexibility to accommodate
apriori knowledge about the data and the shape of expected clusters.
For example for spherical clusters $k$-means or $k$-medoids can be used
and if clusters of arbitrary shape are expected, hierarchical clustering
with single linkage make sense. Reclustering micro-clusters results in
merging the corresponding states in the MC.


\section{Implementation details}
\label{sec:implementation}\nopagebreak

Package~\pkg{rEMM} implements the simple data stream
clustering algorithm threshold
NN~(tNN) described above with an added temporal EMM layer.
The package uses the \proglang{S4} class system and
builds on the infrastructure provided by the
packages~\pkg{proxy}~\citep{R:proxy:2009} for dissimilarity computation,
\pkg{cluster}~\citep{R:cluster:2009} for clustering,
%\pkg{graph}~\citep{R:graph:2009} to represent and manipulate the Markov chain
%as a graph,
and \pkg{Rgraphviz}~\citep{R:Rgraphviz:2009} for one of the
visualization options.

The central class in the package is \class{EMM} which contains
two classes, class~\class{tNN} which contains all information
pertaining to the clustering and class~\class{TRACDS}
(short for temporal relationship among clusters for data streams) for the
temporal aspects. Figure~\ref{fig:rEMM_classes} shows the UML class
diagram~\citep{misc:Fowler:2004}.
The advantage of
separating the classes is that
for future development
it is easier to replace the clustering algorithm
or perform changes on the temporal layer without breaking the whole system.

\begin{figure}
\centering
\includegraphics[width=4cm]{rEMM_class}
\caption{UML class diagram for \class{EMM}}
\label{fig:rEMM_classes}
\end{figure}


Class~\class{tNN} contains the clustering information
used by threshold NN:
\begin{itemize}
\item Used dissimilarity measure
\item Dissimilarity threshold for micro-clusters
\item An indicator if (pseudo) centroids or pseudo medoids are used
\item The cluster centers as a $K \times d$ matrix containing
the centers ($d$-dimensional
vectors) for the $K$ clusters currently used. Note that $K$ changes
over time when clusters are added or deleted.
\item The cluster count vector $\vect{n}= (n_1,n_2,\ldots,n_K$) with the number of data points
currently assigned to each cluster.
\end{itemize}

Class~\class{TRACDS} contains exclusively temporal information:
\begin{itemize}
%\item The Markov Chain as an object of class \class{graphNEL}. The
%directed graph is represented by nodes and an edge
%list (see package~\pkg{graph}) and represents
%the transition count matrix~$\mat{C}$ in a sparse format which can
%efficiently be manipulated.
%\item Initial transition count vector $\vect{c^\epsilon}$.
\item The Markov Chain is represented by an object of the internal class
\class{SimpleMC} which allows for fast manipulation of
the transition count matrix~$\mat{C}$. It also stores the
initial transition count
vector $\vect{c^\epsilon}$.
\item Current state~$\gamma$ as a state index. \code{NA} represents
no state~($\epsilon$).
\end{itemize}

To improve performance
we emulate pass-by-reference semantics for
objects of the classes~\class{EMM}, \class{TRACDS} and \class{tNN}.
This is achieved by storing all variable information
in the objects inside of an environment which lets to the objects
without copying the whole object. For example,  for
an object of class \class{EMM} called \code{emm} and some new data
\begin{center}
\code{build(emm, newdata)}
\end{center}
will change the information inside the \code{emm} object even though
the function's result is not assigned back to \code{emm}.
Note that this means that \class{EMM}, \class{TRACDS} and \class{tNN} objects
have to be explicitely copied with the provided \func{copy} method if
a true (deep) copy is needed.

An \class{EMM}~object is created by function \func{EMM} which
initializes an empty clustering with a temporal layer.
Several methods are defined
for either classe~\class{tNN} or \class{TRACDS}. Only methods which need
clustering and temporal information together (e.g., building a new EMM or
plotting an EMM) are directly defined for \class{EMM}.  Since \class{EMM}
contains \class{tNN} and \class{TRACDS}, all methods can directly be used
for \class{EMM} objects. The reason of separation is flexibility for
future development.

The temporal layer information from class \class{TRACDS} can be accessed using
\begin{itemize}
\item \func{nstates} (number of states),
\item \func{states} (names of states),
\item \func{current\_state} (get current state),
\item \func{transition} (access count or probability of a certain transition),
\item \func{transition\_matrix} (compute a transition count or probability
    matrix),
\item \func{initial\_transition} (get initial transition count vector).
\end{itemize}

To access information about the clustering from class \class{tNN},
we provide the functions
\begin{itemize}
\item \func{nclusters} (number of clusters),
\item \func{clusters} (names of clusters),
\item \func{cluster\_counts} (number of observations assigned to each cluster),
\item \func{cluster\_centers} (centroids/medoids of clusters).
\end{itemize}

For convenience, a method \func{size} is provides for \class{EMM} which
uses \func{nclusters} in \class{tNN} to return the number of clusters/states
in the model.

Clustering and building the EMM is integrated in the
function~\func{build}. It adds new data points by first
clustering and then updating the MC structure. For convenience,
\func{build} can be called with several data points as a matrix, however,
internally the data points (rows) are processed sequentially.
%Currently \proglang{R} has no infrastructure to manage data stream sources.
%Once such an infrastructure becomes available we will add a
%function~\func{update} which will retrieve all new data points from the
%stream and updates the model.

%To speed up
%processing, \func{build} directly manipulates the \class{graphNEL}
%data structure
%without using the functions provided in \pkg{graph}.
%This removes the overhead of copying the whole graph data structure for each
%manipulation.

To process multiple sequences, \func{reset} is provided. It sets the
current state to no state~($\gamma=\epsilon$). The next observation
will start a new sequence and the initial transition count vector will
be updated. For convenience, a row of all \code{NAs} in a sequence of
data points supplied to \func{build} as a matrix also works as a reset.

\pkg{rEMM} implements cluster structure fading by two  mechanisms. First,
\func{build} has a decay rate parameter~\code{lambda}.
If this parameter is set, \func{build} automatically
fades all counts before a new data point is added. The second
mechanism is to explicitly call the function~\func{fade} whenever
fading is needed. This has the advantage that the overhead of manipulating
all counts in the EMM can be reduced and that fading can be used in a more
flexible manner. For example, if the data points are arriving at an irregular
rate, \func{fade} could be called at regular time intervals
(e.g., every second).

To manipulate states/clusters and transitions, \pkg{rEMM} offers a wide array
of functions.  \func{remove\_clusters} and \func{remove\_transitions} remove
user specified states/clusters or transitions from the model.  To find rare
clusters or transitions  with a count below a specified threshold
\func{rare\_clusters} and \func{rare\_transitions} can be used.  \func{prune}
combines finding rare clusters or transitions and removing them into a
convenience function.
For some applications transitions from a state to itself might not be
interesting. These transitions can be removed by using
\func{remove\_selftransitions}. The last manipulation function is
\func{merge\_clusters} which combines several clusters/states into a single
cluster/state.

As described above, the threshold NN data stream clustering algorithm can
use an optional reclustering phase to combine micro-clusters into a
final clustering.
For reclustering we provide several wrapper functions for popular
clustering methods in \pkg{rEMM}:
\func{recluster\_hclust} for hierarchical clustering,
\func{recluster\_kmeans} for $k$-means and
\func{recluster\_pam} for $k$-medoids.
However, it is easy to use any other clustering method. All that is needed is
a vector with the cluster assignments for each state/cluster. This vector
can be supplied to \func{merge\_clusters} with \code{clustering=TRUE}
to create a reclustered EMM. Optionally new centers
calculated by the clustering algorithm
can also be supplied to \func{merge\_clusters} as
the parameter \code{new\_center}.
%\marginpar{var threshold}

Predicting a future state and calculating
the probability of a new sequence are implemented as \func{predict}
and \func{score}, respectively.

The helper function \func{find\_clusters} returns the
cluster/state sequence for given data points. The matching can be
nearest neighbor or exact.
Nearest neighbor always returns a matching cluster, while exact will return no
cluster (\code{NA}) if a data point does not fall within the threshold of any
cluster.

Finally, \func{plot} implements several visualization methods
for class~\class{EMM}.

In the next section we give some examples of how
to use \pkg{rEMM} in practice.

\section{Examples}
\label{sec:examples}\nopagebreak

\subsection{Basic usage}


First, we load the package and a simple data set called {\em EMMTraffic,} which
comes with the package and was used by~\cite{emm:Dunham:2004} to illustrate
EMMs. Each of the 12 observations in this hypothetical data set is a
vector of seven values obtained from sensors located at specific points on
roads. Each sensor collects a count of the number of vehicles which have
crossed this sensor in the preceding time interval.

<<>>=
library("rEMM")
data(EMMTraffic)
EMMTraffic
@

We use \func{EMM} to create a new EMM object using extended Jaccard as
proximity measure and a dissimilarity threshold of 0.2.
For the extended Jaccard measure pseudo medoids are automatically chosen
(use \code{centroids = TRUE} in \func{EMM} to use pseudo centroids).
Then we build a model
using the EMMTraffic data set. Note that \func{build} takes the whole
data set at once, but this is only for convenience. Internally the data
points are processed
as a data stream, strictly one after the other in a single pass.
<<>>=
emm <- EMM(threshold = 0.2, measure = "eJaccard")
build(emm, EMMTraffic)
size(emm)
ntransitions(emm)
@

Note that we do not need to assign the result of \func{build} back to
\code{emm}. The information in \code{emm} is changed by \func{build} inside the
object since class \class{EMM} emulates pass-by-reference semantics.

The resulting EMM has \Sexpr{size(emm)} states.
The number of data points represented by each cluster can be accessed
via \func{cluster\_counts}.
<<>>=
cluster_counts(emm)
@

Cluster~2 has with a count of three the most assigned data points.
The cluster centers can be inspected using \func{cluster\_centers}.
<<>>=
cluster_centers(emm)
@

\func{plot} for \class{EMM} objects provides several visualization methods.
For example, the default method is as a graph using \pkg{igraph}. We use
here the method "graph" which uses \pkg{Rgraphviz}, a package which
has to be installed separately from the Bioconduictor
project\footnote{\url{http://www.bioconductor.org/}}.

<<Traffic_graph, fig=TRUE, include=FALSE>>=
plot(emm, method = "graph")
@


The resulting graph is presented in Figure~\ref{fig:Traffic_graph}. In this
representation the vertex size and the arrow width code for the number of
observations represented by each state and the transition counts, i.e.,
more popular clusters and transitions are more prominently displayed.


\begin{figure}
\centering
\includegraphics[width=.5\linewidth]{rEMM-Traffic_graph}
\caption{Graph representation of an EMM for the EMMTraffic data set.}
\label{fig:Traffic_graph}
\end{figure}


The current transition probability matrix of the EMM
can be calculated using
\func{transition\_matrix}.
<<>>=
transition_matrix(emm)
@

Alternatively we can get also get the raw transition count matrix.
<<>>=
transition_matrix(emm, type = "counts")
@

%Log odds are calculated as $ln(a/(1/n))$ where $a$ is the probability of
%the transition and $n$ is the number of states in the EMM.
%$1/n$ is the probability of a transition under the null model
%which assumes that the transition probability from each state
%to each other state (including staying in the same state) is the same, i.e.,
%the null model has a transition matrix with all entries equal to $1/n$.
Individual transition probabilities or counts can be obtained
more efficiently via \func{transition}.

<<>>=
transition(emm, "2", "1", type = "probability")
@


Using the EMM model, we can predict a future cluster given a current cluster
For example, we can predict the most likely cluster two time steps away from
cluster~2.
<<>>=
predict(emm, n = 2, current = "2")
@

\func{predict} with \code{probabilities = TRUE} produced the
probability distribution over all clusters.
<<>>=
predict(emm,
  n = 2,
  current = "2",
  probabilities = TRUE)
@

In this example cluster~4 was predicted since it has the highest probability.
If several clusters have the same probability the tie is randomly broken.

\subsection{Manipulating EMMs}
EMMs can be manipulated by removing clusters or transitions and
by merging clusters.
Figure~\ref{fig:Traffic_man}(a) shows
again the EMM for the EMMTraffic data set created above.
We can remove a cluster
with \func{remove\_clusters}. For example, we remove cluster~3 and
display the resulting EMM in Fig~\ref{fig:Traffic_man}(b).

<<Traffic_r3, fig=TRUE, include=FALSE>>=
emm_3removed <- remove_clusters(emm, "3")
plot(emm_3removed, method = "graph")
@

Note that a copy of \code{emm} is explicitely created
in order have two independent copies in memory.

Removing transitions is done with \func{remove\_transitions}.
In the following example we remove the transition from cluster~5 to cluster~2
from the original EMM for EMMTraffic in Figure~\ref{fig:Traffic_man}(a).
The resulting graph is shown in Fig~\ref{fig:Traffic_man}(c).

<<Traffic_rt52, fig=TRUE, include=FALSE>>=
emm_52removed <- remove_transitions(emm, "5", "2")
plot(emm_52removed, method = "graph")
@

Here a reference of a copy of \code{emm} is passed on to
\func{remove\_transitions} and then assigned to \code{emm\_52removed}.

Clusters can be merged using \func{merge\_clusters}. Here we merge
clusters~2 and 5 into a combined cluster.
The combined cluster automatically gets the name of the first cluster in the
merge vector. The resulting EMM is shown in Fig~\ref{fig:Traffic_man}(d).


<<Traffic_m25, fig=TRUE, include=FALSE>>=
emm_25merged <- merge_clusters(emm, c("2", "5"))
plot(emm_25merged, method = "graph")
@

\begin{figure}[t]
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=.7\linewidth]{rEMM-Traffic_graph}
\\(a)
\end{minipage}
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=.7\linewidth]{rEMM-Traffic_r3}
\\(b)
\end{minipage}
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=.7\linewidth]{rEMM-Traffic_rt52}
\\(c)
\end{minipage}
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=.7\linewidth]{rEMM-Traffic_m25}
\\(d)
\end{minipage}
\caption{Graph representation for an EMM for the EMMTraffic data set.
(a) shows the original EMM, in (b) cluster~3 is removed, in
(c) the transition from cluster~5 to cluster~2 is removed, and in
(d) clusters~2 and 5 are merged.}
\label{fig:Traffic_man}
\end{figure}

Note that a transition from the combined cluster~2 to itself is created which
represents the transition from cluster~5 to cluster~2 in the original EMM.

\subsection{Using cluster structure fading and pruning}

EMMs can adapt to changes in data over time. This is achieved by fading
the cluster structure using a decay rate.
To show the effect,
we train an EMM on the EMMTraffic data with a rather high decay
rate of $\lambda=1$. Since the weight is calculated by $w_t = 2^{-\lambda t}$,
the observations are weighted $1, \frac{1}{2}, \frac{1}{4},\dots$.

<<Traffic_l, fig=TRUE, include=FALSE>>=
emm_fading <- EMM(threshold = 0.2,
  measure = "eJaccard",
  lambda = 1)
build(emm_fading, EMMTraffic)
plot(emm_fading, method = "graph")
@

The resulting graph is shown in Figure~\ref{fig:Traffic_learning}(b).
The clusters which were created earlier on (clusters with lower index number)
are smaller (represent a lower weighted number of observations)
compared to the original
EMM without fading
displayed in Figure~\ref{fig:Traffic_learning}(a).

Over time clusters in an EMM can become obsolete and no new observations
are assigned to them. Similarly transitions might become obsolete over time.
To simplify the model and improve efficiency, such obsolete clusters and
transitions can be pruned. For the example here, we prune all clusters
which have a weighted count of less than $0.1$ and
show the resulting model in Figure~\ref{fig:Traffic_learning}(c).

<<Traffic_lp, fig=TRUE, include=FALSE>>=
emm_fading_pruned <- prune(
  emm_fading,
  count_threshold = 0.1,
  clusters = TRUE,
  transitions = TRUE
)
plot(emm_fading_pruned, method = "graph")
@


\begin{figure}[t]
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=.7\linewidth]{rEMM-Traffic_graph}
\\(a)
\end{minipage}
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=.7\linewidth]{rEMM-Traffic_l}
\\(b)
\end{minipage}
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=.7\linewidth]{rEMM-Traffic_lp}
\\(c)
\end{minipage}
\caption{Graph representation of an EMM for the EMMTraffic data set.
(a) shows the original EMM. (b) shows an EMM with a learning rate of
$\lambda=1$.
(c) EMM with learning rate after pruning with a count threshold of $0.1$.}
\label{fig:Traffic_learning}
\end{figure}

\subsection{Visualization options}

We use a simulated data set called {\em EMMsim} which is included in~\pkg{rEMM}.
The data contains four
well separated clusters
in $\mathbb{R}^2$. Each cluster is
represented by a bivariate normally distributed random variable $X_i
\sim N_2(\vect{\mu}, \mat{\Sigma})$. $\vect{\mu}$ are the
coordinates of the mean of the
distribution and $\mat{\Sigma}$ is the covariance matrix.


The temporal structure of the data is modeled by the fixed sequence
$\langle 1,2,1,3,4\rangle$
through the four clusters
which is
repeated 40 times (200 data points) for the training data set
and 5 times (25 data points) for the
test data.

<<>>=
data("EMMsim")
@

Since the data set is in $2$-dimensional space,
we can directly visualize the data set as a scatter plot
(see Figure~\ref{fig:sim_data}). We overlayed the test sequence
to show the temporal structure, the points in the
test data are numbered and lines
connect the points in sequential order.

<<sim_data, fig=TRUE, include=FALSE>>=
plot(EMMsim_train, col = "gray", pch = EMMsim_sequence_train)
lines(EMMsim_test, col = "gray")
points(EMMsim_test, col = "red", pch = 5)
text(EMMsim_test, labels = 1:nrow(EMMsim_test), pos = 3)
@

\begin{figure}
\centering
%\begin{minipage}[b]{.49\linewidth}
%\centering
\includegraphics[width=.5\linewidth]{rEMM-sim_data}
%\end{minipage}
\caption{Simulated data set with four clusters. The
points of the
test data set are plotted in red
and the temporal structure is depicted by
sequential numbers and lines between the data points.}%
\label{fig:sim_data}
\end{figure}

We create an EMM by clustering using Euclidean distance and a threshold of
0.1.

<<sim_graph, fig=TRUE, include=FALSE>>=
emm <- EMM(threshold = 0.1, measure = "euclidean")
build(emm, EMMsim_train)
plot(emm)
@

The default EMM visualized as a graph
using package~\pkg{igraph}
is shown in Figure~\ref{fig:sim_graph}(a).

<<sim_graphviz, fig=TRUE, include=FALSE>>=
plot(emm, method = "graph")
@

Using method graph the same graph is rendered using the Graphviz library
(if package~\pkg{Rgraphviz} is installed).
This visualization is shown in Figure~\ref{fig:sim_graph}(b).
In both graph-based visualizations
the positions of the vertices of the graph (states/clusters)
are solely chosen to optimize the
layout which results in a not very informative visualization.
The next visualization method
uses the relative position of the clusters
to represent the proximity between the cluster centers.

<<sim_MDS, fig=TRUE, include=FALSE>>=
plot(emm, method = "MDS")
@

This results in the visualization in Figure~\ref{fig:sim_graph}(c) which shows
the same EMM graph but the relative position of the clusters/states
is determined
by the proximity of the cluster centers.
If the data space has more than two dimensions or a non-Euclidean
distance measure is used, the position of the states
will be determined using multidimensional
scaling~\citep[MDS, ][]{misc:Cox:2001} to
preserve the proximity information between clusters
in the two dimensional representation as much as possible.
Since the data set in this example is already in two-dimensional
Euclidean space, the original coordinates of the centers are
directly used.
The size of the clusters and the width of
the arrows represent again cluster and transition counts.

We can also project the points in the data set
into $2$-dimensional space and then add the centers of the clusters
(see Figure~\ref{fig:sim_graph}(d)).

<<sim_MDS2, fig=TRUE, include=FALSE>>=
plot(emm, method = "MDS", data = EMMsim_train)
@


\begin{figure}[tp]
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=.9\linewidth]{rEMM-sim_graph}
\\(a)
\end{minipage}
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=.9\linewidth]{rEMM-sim_graphviz}
\\(b)
\end{minipage}
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=.9\linewidth]{rEMM-sim_MDS}
\\(c)
\end{minipage}
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=.9\linewidth]{rEMM-sim_MDS2}
\\(d)
\end{minipage}
\caption{Visualization of the EMM for the simulated data.
(a) and (b) As a simple graph.
(c) A graph using vertex placement to represent dissimilarities.
(d) Projection of state centers onto the simulated data.
}%
\label{fig:sim_graph}
\end{figure}

The simple graph representations in Figures~\ref{fig:sim_graph}(a) and (b) show a rather
complicated graph for the EMM. However,
Figure~\ref{fig:sim_graph}(c) with the vertices positioned
to represent similarities between cluster centers shows more structure.
The clusters clearly fall into four groups.
The projection of the cluster centers onto the data set in
Figure~\ref{fig:sim_graph}(d) shows that the four groups represent the four
clusters in the data where the larger clusters are split into
several micro-clusters. We will introduce reclustering to simplify
the structure in a later section.

%\marginpar{add plot with state\_counts and trans\_counts}

\subsection{Scoring new sequences}

To score a new sequence with a model we have to perform several steps:
\begin{enumerate}
\item Assign all data point in the sequence to state in the model
\item Evaluate the how well the sequence of states matches the model
\end{enumerate}

For the first step we need to define an assignment function $s(i)$
which retuns the state for the $i^\textrm{th}$ data point in
the new sequence.
There are several assignment methods possible. We can use exactly the same
assignemnt procedure as during clustering.
This is the default method called \code{"exact"} where a data point is assigned
to its nearest state only if it falls within the threshold around the
state's center.
Otherwise, no assignment (a missing value) is made. Alternatively, we can
assign a data point to its nearest neighbor (\code{"nn"}), i.e., the
closest state no matter how far it is away. A third option is to use
nerest neighbor assignment with a weighting scheme (\code{"weighted"}).

The assignment function transforms the new sequence of data points
into a sequence of states. To evaluate how well the sequence
fits the model stored in the transition matrix $A$ can be done several ways.
The likelihood of the model given the new sequence is given by:

\begin{equation}
S_\mathrm{likelihood} = \prod_{i=1}^{l-1}{a_{s(i),s(i+1)}}
\end{equation}

Note that for a sequence of
length $l$ we have $l-1$ transitions. The so-called
average log-loss is another option.

\begin{equation}
S_\mathrm{log\_loss} = -\frac{1}{l-1} \sum_{i=1}^{l-1}{\mathrm{log}_2(a_{s(i),s(i+1)})}
\end{equation}

The average log-loss is the number of bits needed to encode the new
sequence given the model.

Scores similar to the likelihood and the average log-loss
can be defined as:
\begin{align}
S_\mathrm{product} &= \sqrt[l-1]{\prod_{i=1}^{l-1}{a_{s(i),s(i+1)}}} \\
S_\mathrm{log\_sum} &= \frac{1}{l-1} \sum_{i=1}^{l-1}{\mathrm{log}(a_{s(i),s(i+1)})} \\
S_\mathrm{sum} &= \frac{1}{l-1} \sum_{i=1}^{l-1}{a_{s(i),s(i+1)}}
\end{align}

For weighted nearest neighbor assignment the
weights are used in the following way:

\begin{align}
S_\mathrm{product}^\mathrm{weighted} &= \sqrt[l-1]{\prod_{i=1}^{l-1}{w_i\, a_{s(i),s(i+1)}}} \\
S_\mathrm{log\_sum}^\mathrm{weighted} &= \frac{1}{l-1} \sum_{i=1}^{l-1}{\mathrm{log}(w_i\,   a_{s(i),s(i+1)})} \\
S_\mathrm{sum}^\mathrm{weighted} &= \frac{1}{l-1} \sum_{i=1}^{l-1}{w_i\,   a_{s(i),s(i+1)}}
\end{align}
where
$s(i)$ represents the state the  $i$-th data point in the new sequence is assigned to, and
$$w_i = \mathrm{simil}(x_i,s(i))\,\mathrm{simil}(x_{i+1},s(i+1))$$ with

$$\mathrm{simil}(x,s) =  1- \frac{1}{1+e^{-\frac{\mathrm{d}(x, s)/t -\mu}{\sigma}}}$$

where $\mathrm{d}()$ is the distance function and $t$ is the threshold used
for building the model. $\mu=1.5$ and $\sigma=.2$ are parameters of the
logistic distribution used for transformation. Figure~\ref{fig:simil}
visualizes the relationship between the distance of data point and cluster
and the resulting similarity weight.

<<simil, fig=TRUE, echo=FALSE, include=FALSE>>=
x <- seq(0, 5, length.out = 50)
plot(
  x,
  rEMM:::.simil_weight(x, 1),
  type = "l",
  xlab = "d(x,s)/t",
  ylab = "simil(x,s)"
)
@

\begin{figure}[tbp]
\centering
\includegraphics[width=.5\linewidth]{rEMM-simil}
\caption{Similarity weight calculation given the distance and the used threshold $t$.}
\label{fig:simil}
\end{figure}

Another very rough score can be obtained by just counting the number of
transitions in the new sequence which are also present (supported) in the
model.

\begin{equation}
S_\mathrm{supported\_transitions} = \frac{1}{l-1} \sum_{i=1}^{l-1}{\mathrm{I}(a_{s(i),s(i+1)})}
\end{equation}
where $\mathrm{I}(v)$ is indicator function which is 0 for $v=0$ and $1$
otherwise. Supported transactions can also be used with
weighted nearest neighbors.
\begin{equation}
S^{\mathrm{weighted}}_\mathrm{supported\_transitions} = \frac{1}{l-1} \sum_{i=1}^{l-1}{w_i\mathrm{I}(a_{s(i),s(i+1)})}
\end{equation}


To take the initial
transition probability also into account is straight forward to
add the initial probability $a_{\epsilon,s(1)}$ to the equations above.

As an example, we
calculate how well the test data
fits the EMM created
for the EMMsim data in the section above. The test data is
supplied together with the training set in \pkg{rEMM}.
<<>>=
score(emm, EMMsim_test, method = "log_loss")
score(emm, EMMsim_test, method = "likelihood")
score(emm, EMMsim_test, method = "product")
score(emm, EMMsim_test, method = "sum")
score(emm, EMMsim_test, method = "supported_transitions")
@

%\marginpar{Maybe introduce log\_odds for comparing EMMs of different size?}

Even though the test data was generated using exactly the same model as
the training data,
the likelihood and the normalized product,
produce a score of 0 while the average log-loss is infinity. The normalized sum is
also very low. Only the supported transition count is high.
To analyze the problem we can look at the transition
table for the test sequence.
The transition table is computed by \func{transition\_table}.

%\marginpar{Problem with plus\_one and age!}
<<>>=
transition_table(emm, EMMsim_test)
@

The low score is caused by data points that do not fall within the threshold
for any cluster (\code{<NA>} above) and by missing transitions in the matching
sequence of clusters (counts and probabilities of zero above).  These missing
transitions are the result of the fragmentation of the real clusters into
many micro-clusters (see Figures~\ref{fig:sim_graph}(b) and (c)). Suppose we
have two clusters called cluster~A and cluster~B and after an observation
in cluster A always an observation in cluster B follows. If now cluster A
and cluster B are represented by many micro-clusters each, it is
likely that we find a pair of micro-clusters (one in A and one in B) for
which we did not see a transition yet and thus will have a
transition count/probability of zero.

We can use different matching strategies to deal with this problem.
Implemented options are nearest neighbor (\code{match\_cluster="nn"}) and
weighted nearest neighbor (\code{match\_cluster="weighted"}).

<<>>=
score(emm, EMMsim_test, method="product", match_cluster="nn")
score(emm, EMMsim_test, method="product", match_cluster="weighted")
@

Note that the weighted score is slightly smaller since the weight
penalizes the data points which are farther away from the cluster centers.
Another option is to use
a number which is a multiplicator for the threshold in which
the data point has to fall to be assigned to the cluster. For example with
\code{match\_cluster = 2}, the data point has to fall within two times the threshold
around the cluster's center.
Here a new data point is assigned to the closest cluster even if it falls
outside the threshold (or up to 10\% outside for the second example).

<<>>=
score(emm, EMMsim_test, method = "supported_transitions", match_cluster = 1.1)
@

The problem with missing transitions can be reduced by
starting with a prior distribution of transition probabilities (\code{prior = TRUE}
is the default setting). We implement the
simple case where we start with a uniform transition probability distribution,
i.e., if no data is
available we assume that the transitions to all other states are equally
likely. This can be done by
using a uniform prior, giving each transition an initial count
of one~\citep{Jaynes:2003}.

Using nearest neighbor and uniform initial counts of one produces the
following scores.
<<>>=
methods <- c("product", "sum", "log_loss", "likelihood")
sapply(
  methods,
  FUN = function(m)
    score(emm, EMMsim_test, method = m, match = "weighted")
)
@

Since we only have micro-clusters, the scores are still extremely small.
To get a better model, we will recluster the states in the following section.

\subsection{Reclustering states}

For this example, we use the EMM created in the previous section for the EMMsim
data set.  For reclustering, we use here hierarchical clustering with
average linkage.
To find the best number of clusters $k$, we  create clustered EMMs for
different values of $k$ and then score the resulting models using the test
data.

We use \func{recluster\_hclust} to create a list of clustered
EMMs for $k=2,3,\dots,10$.
Any recluster function in \pkg{rEMM} returns with the resulting EMM information
about the clustering as the attribute \code{cluster\_info}.
Here we plot the dendrogram which is shown in Fig~\ref{fig:sim_hc}.

<<sim_hc, fig=TRUE, include=FALSE>>=
k <- 2:10
emmc <- recluster_hclust(emm, k = k, method = "average")
plot(attr(emmc, "cluster_info")$dendrogram)
@

\begin{figure}[tbp]
\centering
\includegraphics[width=.5\linewidth]{rEMM-sim_hc}
\caption{Dendrogram for clustering state centers of the EMM build from
simulated data.}
\label{fig:sim_hc}
\end{figure}


<<>>=
sc <- sapply(emmc, score, EMMsim_test, "log_likelihood")
names(sc) <- k
sc
@

To find the best performing of these nested models we use the log-likelihood.
The best performing model has a score of \Sexpr{round(max(sc),3)}
for a $k$ of \Sexpr{names(sc[which.max(sc)])}.
This model is depicted in Figure~\ref{fig:sim_optc}(a).
Since the four groups in the data set are well separated, reclustering finds
the original structure (see Figure~\ref{fig:sim_optc}(b)) with all points
assigned to the correct state.


<<sim_optc_graph, fig=TRUE, include=FALSE, echo=FALSE>>=
plot(emmc[[which.max(sc)]], method = "MDS")
@
<<sim_optc_MDS, fig=TRUE, include=FALSE, echo=FALSE>>=
plot(emmc[[which.max(sc)]], method = "MDS", data = EMMsim_train)
@

\begin{figure}[tbp]
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=\linewidth]{rEMM-sim_optc_graph}\\
(a)
\end{minipage}
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=\linewidth]{rEMM-sim_optc_MDS}\\
(b)
\end{minipage}
\caption{Best performing final clustering for EMM with a $k$ of
\Sexpr{names(sc[which.max(sc)])}.}
\label{fig:sim_optc}
\end{figure}


\section{Applications}
\label{sec:applications}\nopagebreak

\subsection{Analyzing river flow data}
The \pkg{rEMM} package also contains a data set called {\em Derwent} which was
originally used by~\cite{emm:Dunham:2004}.  It consists of river flow readings
(measured in $m^3$ per second) from four catchments of the river Derwent and
two of its main tributaries in northern England.  The data was collected daily
for roughly 5 years (1918 observations) from November~1,~1971 to
January~31,~1977.
The catchments are Long Bridge, Matlock Bath, Chat Sworth, What
Stand Well, Ashford (river Wye) and Wind Field Park (river Amber).

The data set is interesting since it contains annual changes of
river levels and also some special flooding events.

<<>>=
data(Derwent)
summary(Derwent)
@

From the summary we see that the gauged flows vary among catchments
significantly (from 0.143 to 14.238). The influence of differences in averages
flows can be removed by scaling the data before building the EMM.  From the
summary we also see that for the Ashford and Wind Field Park catchments a
significant amount of observations is not available.  EMM deals with these
missing values by using only the non-missing dimensions of the observations for
the proximity calculations (see package~\pkg{proxy} for details).

<<Derwent1, fig=TRUE, include=FALSE>>=
plot(Derwent[, 1],
  type = "l",
  ylab = "Gauged flow",
  main = colnames(Derwent)[1])
@
\begin{figure}
\centering
\includegraphics[width=.5\linewidth]{rEMM-Derwent1}
\caption{Gauged flow (in $m^3/s$) of the river Derwent at the
Long Bridge catchment.}
\label{fig:Derwent1}
\end{figure}

In Figure~\ref{fig:Derwent1} we can see the annual flow pattern for the Long
Bridge catchment with higher flows in September to March and lower flows in the
summer months. The first year seems to have more variability in the summer
months and the second year has an unusual event (around the index of 600 in
Figure~\ref{fig:Derwent1}) with a flow above $100 m^3/s$ which can be classified
as flooding.

We build an EMM from the (centered and) scaled river data using Euclidean
distance between the vectors containing the flows from the six catchments and
experimentally found a distance threshold of 3 (just above the
3rd quartile of the distance distribution between all scaled observations)
to give useful results.

<<Derwent_cluster_counts, fig=TRUE, include=FALSE, width=10>>=
Derwent_scaled <- scale(Derwent)
emm <- EMM(measure = "euclidean", threshold = 3)
build(emm, Derwent_scaled)
cluster_counts(emm)
cluster_centers(emm)
plot(emm, method = "cluster_counts", log = "y")
@
\begin{figure}
\centering
\includegraphics[width=.8\linewidth]{rEMM-Derwent_cluster_counts}
\caption{Distribution of state counts of the EMM for the Derwent data.}
\label{fig:Derwent_cluster_counts}
\end{figure}

The resulting EMM has \Sexpr{size(emm)} clusters/states. In
Figure~\ref{fig:Derwent_cluster_counts} shows that the cluster counts have
a very skewed distribution with clusters~1 and 2 representing most observations
and clusters~5, 9, 10, 11, 12, 17, 19 and 20 being extremely rare.

<<Derwent_EMM1, fig=TRUE, include=FALSE>>=
plot(emm, method = "MDS")
@
\begin{figure}
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=\linewidth]{rEMM-Derwent_EMM1}
\\(a)
\end{minipage}
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=\linewidth]{rEMM-Derwent_EMM2}
\\(b)
\end{minipage}
\caption{Cluster centers of the EMM for the Derwent data set projected
on $2$-dimensional space.
(a) shows the full EMM and (b) shows a pruned EMM (only the most
frequently used states)}
\label{fig:Derwent_EMM}
\end{figure}

The projection of the cluster centers into $2$-dimensional space in
Figure~\ref{fig:Derwent_EMM}(a) reveals that all but
clusters~11 and 12
are placed closely together.

Next we look at frequent clusters and transitions.
We define rare here as all
clusters/transitions that represent less than 0.5\% of the observations.
On average this translates into less than two daily observation per year.
We calculate a count threshold,
use \func{prune} to remove rare clusters.

<<Derwent_EMM2, fig=TRUE, include=FALSE>>=
rare_threshold <- sum(cluster_counts(emm)) * 0.005
rare_threshold
plot(prune(emm, rare_threshold), method = "MDS")
@

The pruned model depicted in Figure~\ref{fig:Derwent_EMM}(b) shows that 5
clusters represent approximately 99.5\% of the river's behavior.
All five clusters come from the lower half of the large group of clusters in
Figure~\ref{fig:Derwent_EMM}(a).
Clusters~1 and 2 are the most frequently occurring clusters and
the wide bidirectional
arrow connecting them means that observing transitions between these
two clusters are common.
%<<>>=
%rare <- names(which(cluster_counts(emm)<rare_threshold))
%rare
%@
To analyze the meaning of the two outlier clusters (11 and 12)
identified in Figure~\ref{fig:Derwent_EMM}(a) above, we plot the flows at a
catchment and mark the observations for these states.

<<Derwent2, fig=TRUE, include=FALSE>>=
catchment <- 1
plot(Derwent[, catchment],
  type = "l",
  ylab = "Gauged flows",
  main = colnames(Derwent)[catchment])
state_sequence <- find_clusters(emm, Derwent_scaled)

mark_states <-
  function(states,
    state_sequence,
    ys,
    col = 0,
    label = NULL,
    ...) {
    x <- which(state_sequence %in% states)
    points(x, ys[x], col = col, ...)
    if (!is.null(label))
      text(x, ys[x], label, pos = 4, col = col)
  }

mark_states("11", state_sequence, Derwent[, catchment], col = "blue", label = "11")
mark_states("12", state_sequence, Derwent[, catchment], col = "red",  label = "12")
@
\begin{figure}
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=\linewidth]{rEMM-Derwent2}
\\(a)
\end{minipage}
\begin{minipage}[b]{.49\linewidth}
\centering
\includegraphics[width=\linewidth]{rEMM-Derwent3}
\\(b)
\end{minipage}
\caption{Gauged flow (in $m^3/s$) at (a) the
Long Bridge catchment and (b) the Amber at the
Wind Field Park catchment. Outliers (states~11 and 12) are marked.}
\label{fig:Derwent2}
\end{figure}

In Figure~\ref{fig:Derwent2}(a) we see that cluster~12 has a
river flow in excess of
$100 m^3/s$ which only happened once in the observation period. Cluster~11
seems to be a regular observation with medium flow around $20 m^3/s$ and it
needs more analysis to find out why this cluster is also an outlier directly
leading to cluster~12.

<<Derwent3, fig=TRUE, include=FALSE>>=
catchment <- 6
plot(Derwent[, catchment],
  type = "l",
  ylab = "Gauged flow",
  main = colnames(Derwent)[catchment])

mark_states("11", state_sequence, Derwent[, catchment], col = "blue", label = "11")
mark_states("12", state_sequence, Derwent[, catchment], col = "red",  label = "12")
@

The catchment at Wind Field Park is at the Amber river which is a tributary of
the Derwent and we see in Figure~\ref{fig:Derwent2}(b) that the day before the
flood occurs, the flow shoots up to $4 m^3/s$ which is caught by cluster~11.
The
temporal structure clearly indicated that a flood is imminent the next day.

%Next we look at the probability of going from state~1 (normal condition)
%to state~12 (flooding) in $n=1,2,\dots,10$ days.
%
%<<>>=
%n <- 1:10
%probs <- sapply(n, FUN = function(n) predict(emm, n=n,
%current="1", probabilities=TRUE))[12,]
%names(probs) <- n
%probs
%@
%
%From the probabilities we see that we cannot go directly from 1 to 12.
%Only after two steps we can get to state~12 and the probability is then
%between 0.0005 and 0.0007.
%
%<<fig=TRUE>>=
%hc <- cluster_states(emm)
%plot(hc)
%@
%<<fig=TRUE>>=
% emm20 <- merge_clusters(emm, cutree(hc, h=20), clustering=TRUE)
% state_centers(emm20)
% plot(remove_selftransitions(emm20))
%@
%<<fig=TRUE>>=
%plot(remove_selftransitions(emm20), method = "graph")
%@

\subsection{Genetic sequence analysis}
\label{sec:genetic_sequence_analysis}\nopagebreak
The \pkg{rEMM} package also contains examples for 16S ribosomal RNA (rRNA)
sequences for the two phylogenetic classes, Alphaproteobacteria and Mollicutes.
16S rRNA is a component of the ribosomal subunit 30S and is regularly
used for phylogenetic studies~\citep[e.g., see][]{rna:Wang:2007}.
Typically alignment heuristics like BLAST~\citep{rna:Altschul:1990}
or a Hidden Markov Model (HMM)~\citep[e.g.,][]{rna:Hughey:1996} are used
for evaluating the similarity between two or more sequences. However,
these procedures are computationally very expensive.

An alternative approach is to describe the structure in terms of the occurrence
frequency of so called $n$-words, subsequences of length $n$. Counting the
occurrences of the $4^n$ (there are four different nucleotide types)
$n$-words is straight forward
and computing similarities between frequency profiles if very
efficient. Because no alignment is computed, such methods are called
alignment-free~\citep{rna:Vinga:2003}.
%A direct application of this
%approach is cd-hit, a method that finds identical subsequences by
%first evaluating the number of matching $n$-word frequencies and only
%if the

\pkg{rEMM} contains preprocessed sequence data for 30 16S sequences of the
phylogenetic class Mollicutes.  The sequences were preprocessed by cutting them
into windows of length 100 nucleotides without overlap and then for
each window the occurrence of triplets of nucleotides was counted resulting in
$4^3=64$ counts per window. Each window will be used as an observation to build
the EMM.  The counts for the 30 sequences are organized as a matrix and
sequences are separated by rows of \code{NA} resulting in resetting the EMM
during the build process.

\cite{rna:Vinga:2003} review dissimilarity measures used for alignment-free
methods. The most commonly used measures are Euclidean distance, $d^2$ distance
(a weighted Euclidean distance), Mahalanobis distance and Kullback-Leibler
discrepancy~(KLD). Since \cite{rna:Wu:2001} find in their experiments that
KLD provides good results while it still can
be computed as fast as Euclidean distance, it is also used here.
Since KLD becomes $-\infty$ for counts of zero, we
add one to all counts which conceptually means that we start building
the EMM with a prior that all triplets have the equal occurrence
probability~\cite[see][]{rna:Wu:2001}.
We use an experimentally found threshold of 0.1.
<<>>=
data("16S")

emm <- EMM(threshold = 0.1, "Kullback")
build(emm, Mollicutes16S + 1)
@

<<Mollicutes_graph, fig=TRUE, include=FALSE>>=
plot(emm, method = "graph")
it <- initial_transition(emm)
it[it > 0]
@

\begin{figure}
\centering
\includegraphics[width=\linewidth]{rEMM-Mollicutes_graph}
\caption{An EMM representing 16S sequences from the class Mollicutes
represented as a graph.}
\label{fig:Mollicutes_graph}
\end{figure}

The graph representation of the EMM is shown in
Figure~\ref{fig:Mollicutes_graph}. Note that each cluster/state
in the EMM corresponds
to one or more windows of the rRNA sequence (the size of the cluster indicates
the number of windows). The initial transition probabilities show that most
sequences start the first count window in cluster~1.  Several interesting
observations can be made from this representation.
\begin{itemize}
\item There exists a path through the graph using only the largest clusters
	and widest arrows
    which represents the most common sequence of windows.
\item There are several places in the EMM where almost
    all sequences converge (e.g., 4 and 14)
\item There are places with high variability where many possible
    parallel paths exist (e.g., 7, 27, 20, 35, 33, 28, 65, 71)
\item The window composition changes over the whole sequences since there are
    no edges going back or skipping states on the way down.
\end{itemize}

In general it is interesting that the graph has no loops since
\cite{rna:Deschavanne:1999} found in their study using Chaos Game
Representation that the variability along genomes and among genomes is low.
However, they looked at longer sequences and we look here at the micro
structure of a very short sequence.
These observations merit closer analysis by biologists.

\section{Conclusion}
\label{sec:conclusion}\nopagebreak

Temporal structure in data streams is ignored by current data stream clustering
algorithms. A temporal EMM layer can be used to retain such structure.
In this paper we showed that a temporal EMM layer can be added to any data
stream clustering algorithm which works with dynamically creating, deleting
and merging clusters.
As an example, we implemented in \pkg{rEMM} a simple data stream
clustering algorithm and the temporal EMM layer and demonstrated its usefulness
with two applications.

Future work will include extending popular data stream clustering algorithms
with EMM, incorporate higher order models and add support for
reading data directly from data stream sources.

\section*{Acknowledgments}
The authors would like to thank the anonymous reviewers for their valuable
comments.

Part of the research presented in this paper was supported in part by
the National Science Foundation under Grant No. IIS-0948893 and
the National Human Genome Research Institute under Grant No. R21HG005912.

%\bibliographystyle{abbrvnat}
\bibliography{rEMM}
\end{document}
