
Clustering of Time Series Subsequences Is Meaningless (2005) [pdf] - silverpikezero
http://www.cs.ucr.edu/~eamonn/meaningless.pdf
======
daemonk
So the heart of the problem, from what I can gather, is the sliding window
approach to generating subsequences. This approach necessarily generates
"redundantly similar" windows, i.e. each window will be most similar to its
closest neighboring windows, since adjacent windows overlap in all but one
point (an overlap of window length - 1).

The sliding window approach is not adding random noise to the clustering; it
is adding redundantly similar data, resulting in almost perfectly
sine-wave-like cluster centers.
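
A minimal sketch of that effect in Python, assuming numpy and scikit-learn
are available (the random-walk dataset and clustering setup here are
illustrative, not the paper's exact experiment):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    t = np.cumsum(rng.standard_normal(1024))  # a random walk series

    # Sliding windows: row p is t[p : p+32], so neighbors overlap in 31 points.
    S = np.lib.stride_tricks.sliding_window_view(t, 32)
    # z-normalize each window before clustering
    S = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)

    centers = KMeans(n_clusters=3, n_init=10).fit(S).cluster_centers_
    # Plotting the rows of `centers` typically shows the smooth,
    # sine-wave-like shapes described above, whatever the input looks like.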

~~~
daemonk
_Update to my original post_

I am coming from a genomics background, so when I read about sliding window on
time series data, I think about sliding window approach on regions of genome
sequences.

The ultimate point of doing this clustering is to find repetitive sub-
sequences in the series. This is also something we do commonly in the genomics
field (repeatmasker/repeatmodeler software for example).

Something we can look at in genomes is "k-mer coverage". A k-mer is just a
length-k sub-sequence over the alphabet (A,T,G,C), analogous to a length-k
sub-sequence of a time series.

By scanning the genome, we can tally up how many times a k-mer appears in the
genome. With this tally, we can determine the k-mer coverage for a region of
the genome, which gives us an idea of how many times we see this same region
in the rest of the genome.
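
A minimal sketch of that tally in Python (the toy sequence is just for
illustration):

    from collections import Counter

    def kmer_counts(genome: str, k: int) -> Counter:
        # Tally every k-mer by sliding a length-k window over the sequence.
        return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))

    counts = kmer_counts("ATGCGATGCGAT", k=4)
    # The k-mer coverage of the region starting at position i is
    # counts[genome[i:i + k]].
    print(counts.most_common(3))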

Maybe this approach can be adapted somehow? The only problem is that in
genomics we have four discrete classes (A,G,T,C), which makes finding exact
matches relatively easy. In time series we have continuous data, which makes
matches a lot tougher to find and to define.

~~~
brudgers
Disclaimer: _I am not defending or denying the claims in the paper. I have a
philosophical interest in time._ [1]

The difference, I think, is that the series in a genome is lexicographic
rather than temporal: i.e. which end of a particular strand you read from is
irrelevant. A time series, on the other hand, has an inherent ordering. By
analogy, a time series is a directed graph; a genome is undirected.

That is, the algorithm for finding a subsequence in a time series can be used
for finding a subsequence in a genome, but the meaning of a subsequence is
different. It's:

    A - G - A - C

versus

    C -> A -> G -> A

Time series data contains implication and causality. Any claim about time
series data is implicitly a claim about causality: e.g. we might claim a time
series appears to be random. We don't really talk about a genome's nucleotide
sequence being random, because the causality (other than the trivial case of
analogues) lies outside the sequence - the reason for

    T - A - C

is assumed to lie outside of T - A - C.

[1]: [http://plato.stanford.edu/entries/kant-hume-causality/#TimDetAnaExpUniNat](http://plato.stanford.edu/entries/kant-hume-causality/#TimDetAnaExpUniNat)

~~~
datahipster
This is pretty much the main idea behind another paper that Eamonn wrote [0].
The idea is that you can normalize and discretize time series data into a
lexicographic representation, then perform all sorts of interesting analyses
on it. For example, searching for common subsequences can be done via regular
expressions.

[0]
[http://www.cs.ucr.edu/~eamonn/iSAX.pdf](http://www.cs.ucr.edu/~eamonn/iSAX.pdf)
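
For illustration, a rough sketch of the idea, assuming numpy and scipy are
available (this is not the iSAX paper's implementation; the segment count,
alphabet size, and toy input are assumptions): z-normalize, reduce with a
piecewise aggregate approximation, then map each segment mean to a symbol
using N(0,1) breakpoints.

    import numpy as np
    from scipy.stats import norm

    def sax_word(ts, n_segments=8, alphabet="abcd"):
        ts = (ts - ts.mean()) / ts.std()  # z-normalize
        # Piecewise Aggregate Approximation: mean of each equal-length segment
        paa = ts.reshape(n_segments, -1).mean(axis=1)
        # Breakpoints dividing N(0,1) into equiprobable regions
        breakpoints = norm.ppf(np.linspace(0, 1, len(alphabet) + 1)[1:-1])
        return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

    word = sax_word(np.sin(np.linspace(0, 4 * np.pi, 64)))
    # Identical subsequences map to identical words, so repeated motifs can
    # be found with plain string search or regular expressions.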

------
humanarity
"The cluster centers found by STS clustering on any particular run of k-means
on stock market dataset are not significantly more similar to each other than
they are to cluster centers taken from random walk data! In other words, if we
were asked to perform clustering on a particular stock market dataset, we
could reuse an old clustering obtained from random walk data, and no one could
tell the difference."

"As the sliding window passes by, the datapoint first appears as the rightmost
value in the window, then it goes on to appear exactly once in every possible
location within the sliding window. So the t_i datapoint contribution to the
overall shape is the same everywhere..."

"Another way to look at it is that every value v_i in the mean vector, 1 ≤ i ≤
w, is computed by averaging essentially every value in the original time
series; more precisely, from t_i to t_{m-w+i}. So for a time series of m = 1024
and w = 32, the first value in the mean vector is the average of t[1..993];
the second value is the average of t[2..994], and so forth. Again, the only
datapoints not being included in every computation are the ones at the very
beginning and at the very end, and their effects are negligible
asymptotically."
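
A quick numerical check of the quoted argument, assuming numpy (same m = 1024
and w = 32 as in the quote):

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.cumsum(rng.standard_normal(1024))
    S = np.lib.stride_tricks.sliding_window_view(t, 32)  # shape (993, 32)

    # Entry i of the mean vector averages t[i : 993 + i], i.e. nearly all
    # of t, so the vector is almost flat compared to t itself.
    mean_vector = S.mean(axis=0)
    print(np.ptp(mean_vector), "vs spread of t:", np.ptp(t))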

~~~
mathgenius
I wonder if the same effect applies to an exponential sliding window, i.e.
one that weights more "recent" samples higher than older ones, so that it is
effectively an infinite window rather than a hard cutoff.

~~~
humanarity
Hmm, good question. I'd say this would simply distribute the contribution of
each window member along an exponential (rather than constant) trend. The
terms in the mean vector would then be polynomials of the window members. Even
if you lined up the weight of t_(x^n) with that of t_(x^m), it seems to me the
other t_i, filtered at exponential distances, wouldn't line up. You could
probably arrange to split on these nonidentical term weights, and maybe this
could work as a (somewhat involved) method of feature selection that didn't
degenerate into identical contributions. Interesting idea! - and maybe there
are a few papers about it somewhere :)
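
One way to make the idea concrete, assuming a truncated exponential weighting
is an acceptable stand-in for the infinite window:

    import numpy as np

    def exp_weighted_windows(t, w, lam=0.9):
        # Replace the hard cutoff with exponentially decaying weights:
        # the newest point in each window gets weight 1, then lam, lam^2, ...
        S = np.lib.stride_tricks.sliding_window_view(t, w)
        weights = lam ** np.arange(w - 1, -1, -1)  # oldest -> newest
        return S * weights

    t = np.cumsum(np.random.default_rng(0).standard_normal(1024))
    Sw = exp_weighted_windows(t, w=32)
    # Each datapoint now contributes a different weight at each window
    # position, instead of the identical contribution described above.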

------
GFK_of_xmaspast
Eamonn Keogh is a major advocate of Dynamic Time Warping and this should be
kept in mind while reading this paper.

(he does have interesting stuff here:
[http://www.cs.ucr.edu/~eamonn/selected_publications.htm](http://www.cs.ucr.edu/~eamonn/selected_publications.htm))

~~~
gmarsh
Exactly the comment I was looking for - I've done a bit of work classifying
time series using 1-NN with DTW as the distance metric, and it works quite
well.

I was under the impression this work was better known, though. It's
surprising not to see it mentioned, considering how long DTW has been around.
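
For reference, a bare-bones sketch of the classic dynamic-programming DTW
distance and its use in 1-NN (no warping-window constraint or lower bounds,
which practical implementations add for speed):

    import numpy as np

    def dtw(a, b):
        # O(len(a) * len(b)) dynamic-programming DTW distance
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    def classify_1nn(query, train_series, train_labels):
        # 1-NN: label of the training series closest to the query under DTW
        dists = [dtw(query, s) for s in train_series]
        return train_labels[int(np.argmin(dists))]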

------
kilbuz
From the abstract: "forced to obey a certain constraint that is pathologically
unlikely to be satisfied by any dataset"

Like assuming your data come from a normal distribution? Box guides us: 'all
models are wrong, but some are useful.'

~~~
pvnick
We found the same issue with non-normal distributions in our time series
datasets when using SAX (developed by this paper's authors; it assumes
normality for dimensionality reduction) and addressed it by using quantiles in
the piecewise aggregate approximation step. The quantile breakpoints behaved
much better than the "normal" breakpoints.
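
A sketch of that swap, assuming numpy and scipy (the lognormal data is just a
stand-in for a skewed, non-normal distribution):

    import numpy as np
    from scipy.stats import norm

    a = 4  # alphabet size
    probs = np.linspace(0, 1, a + 1)[1:-1]

    # SAX's standard breakpoints assume z-normalized data is N(0, 1)
    normal_breakpoints = norm.ppf(probs)

    # Empirical alternative: quantiles of the data itself, so each symbol
    # stays equiprobable even when the distribution is skewed or heavy-tailed
    data = np.random.default_rng(0).lognormal(size=10_000)
    quantile_breakpoints = np.quantile(data, probs)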

------
baldeagle
When I last looked into time series clustering, it seemed like other
statistical techniques worked better for predicting failure modes. I still
think there is merit in motif recognition and rules built around that concept,
but I haven't seen much of interest outside of EKGs.

------
amelius
From the abstract:

> While this constraint can be intuitively demonstrated with a simple
> illustration and is simple to prove, it has never appeared in the
> literature.

Ok, I would love to see a simple illustration here :)

------
engi_nerd
What about cluster analysis of whole time series (which are of differing
lengths)? Is that meaningful?

------
eveningcoffee
I think that it should read "Clustering of Time Series Subsequences With
K-means Is Meaningless".

~~~
JD557
The article actually claims that the problem is not related to the
clustering algorithm, but to the sliding window approach.

------
p4wnc6
I think one way to summarize it is: by constructing feature vector components
that are based purely on the mere _position_ within some data, you introduce
correlation and causality effects between the components of your feature
vector.

For example, think about using regression instead of clustering. Your design
matrix is still going to be S from the paper, where the p-th row of S is the
p-th "iteration" of the window (as it is slid from the start to the end of
the data).

So column 1 of the p=1 window will trivially be perfectly correlated with (in
fact, identical to) column 2 of the p=0 window and column 0 of the p=2 window,
at least for a number of rows, until that data point has slid out of the
window. So in some sense, past observations (earlier columns in S) will
"cause" later observations (later columns in S) due to the sliding mechanism.
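
A tiny numpy check of that identity - S[p, c] == t[p + c], so the same
datapoint runs along the anti-diagonals of S:

    import numpy as np

    t = np.arange(10.0)
    S = np.lib.stride_tricks.sliding_window_view(t, 4)  # S[p, c] == t[p + c]

    # The same datapoint t[2] appears on one anti-diagonal of S:
    assert S[0, 2] == S[1, 1] == S[2, 0] == t[2]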

In regression this is a well-explored problem: multicollinearity, and also
conditional effect sizes when causal relationships are not separated into
separate equations or corrected by adding dummy variables for conditional
levels of one covariate given others.

It's not at all surprising that something similar would happen when
clustering.

Sadly, it's also not that surprising that the entire research community
doesn't care too much and is happy to crank out papers with this issue anyway.
It's similar to data mining for statistical significance or other problems,
and indeed machine learning is not at all a silver bullet to get around those
problems.

One approach that might help reduce the problem is to randomly sub-sample
windows. It would be a lot of work, and probably computationally costly, but
you could in principle devise a sampling scheme where you jointly sample
_all_ of the sub-windows at once, in such a way as to ensure certain bounds on
the total amount of overlap between them, probably using some sort of
Metropolis-like scheme. It's not clear to me whether this is worthwhile, and
I tend to prefer the idea of quantizing the signal in some way to make it a
lexicographic problem instead.
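
A much simpler greedy stand-in for such a scheme, just to make the overlap
bound concrete (a Metropolis-like version would sample all starts jointly
rather than greedily):

    import numpy as np

    def sample_windows(m, w, n_samples, max_overlap, seed=0):
        # Keep a random window start only if it overlaps every window kept
        # so far by at most max_overlap points (overlap = max(0, w - gap)).
        rng = np.random.default_rng(seed)
        starts = []
        for s in rng.permutation(m - w + 1):
            if all(w - abs(int(s) - kept) <= max_overlap for kept in starts):
                starts.append(int(s))
            if len(starts) == n_samples:
                break
        return starts

    starts = sample_windows(m=1024, w=32, n_samples=20, max_overlap=8)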

Also note that the problem is _a lot_ less worrisome in supervised learning,
like training a decision tree or SVM on overlapping sub-windows. The reason is
that you (presumably) have control over the labels. So you can ensure that two
highly correlated sub-samples (like window p and window p+1) get the same
label most of the time when it matters (e.g. when you're right on top of the
interesting phenomenon you want to classify). _Crucially_, the reason this
helps is that you can ensure not only that two sample windows which differ by
only a slight amount of sliding will get the same label, but also that those
windows will get the same label as other instances of the same phenomenon
elsewhere in the signal, perhaps far away in some other sub-window nowhere
near it. It means you are clamping the positive training samples to reflect
the signal properties you want, rather than hoping the clustering will
intrinsically pick that as the notion of similarity it uses when classifying
samples (which, as the paper points out, you have no reason to expect it
will).

But it will still suffer somewhat from subjective choices about when, exactly,
the window has slid far enough away from the labeled phenomenon to have
"rolled off" enough that the label should flip to the other class (in a binary
case). What's bad about this is that however this gets codified into the
training set, it needs to be taken into account as part of the experimental
design and accounted for when reporting anything like statistical significance
results later on. But because this choice, which is effectively a "roll-off"
parameter, is not written down or made an explicit part of the model anywhere,
most people just ignore it, or don't think about the way it might affect the
result.

