
Significant Pattern Mining for Time Series - cbock90
https://christian.bock.ml/posts/significant_shapelets/
======
bra-ket
related: Matrix Profiles for time series
[https://www.cs.ucr.edu/~eamonn/MatrixProfile.html](https://www.cs.ucr.edu/~eamonn/MatrixProfile.html)

~~~
uoaei
See Stumpy for a handy library to get this working quickly (written in
Python):
[https://github.com/TDAmeritrade/stumpy](https://github.com/TDAmeritrade/stumpy)
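
For anyone wondering what a matrix profile actually computes, here is a brute-force sketch in plain Python. It is not STUMPY's API or its algorithm (the library is far faster); the function names, the toy series, and the exclusion-zone width of m // 4 are all assumptions made for illustration.

```python
import math

def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in window]

def naive_matrix_profile(series, m):
    """Brute-force O(n^2 * m) matrix profile (illustrative only).

    For every length-m window, return the z-normalized Euclidean distance
    to its nearest neighbor and that neighbor's start index, skipping
    trivial matches that overlap the window too closely.
    """
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    exclusion = max(1, m // 4)  # assumed exclusion-zone width
    profile, indices = [], []
    for i, a in enumerate(windows):
        best_dist, best_j = math.inf, -1
        for j, b in enumerate(windows):
            if abs(i - j) <= exclusion:
                continue
            dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
            if dist < best_dist:
                best_dist, best_j = dist, j
        profile.append(best_dist)
        indices.append(best_j)
    return profile, indices

# Toy series: the motif (1, 3, 2, 5) appears at positions 0 and 8.
series = [1, 3, 2, 5, 0, 0, 1, 0, 1, 3, 2, 5, 0, 1, 0, 0]
profile, indices = naive_matrix_profile(series, m=4)
```

Low values in `profile` mark repeated patterns (here positions 0 and 8 point at each other with distance 0), while high values mark subsequences that have no close match anywhere else.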

~~~
seanlaw
Hi all, I am the creator of STUMPY and wanted to thank you for your interest.
Please feel free to post questions on our Github issues and we'll try to
assist where we can.

------
amai
Don’t forget: "Clustering of Time Series Subsequences is Meaningless":
[https://www.cs.ucr.edu/~eamonn/meaningless.pdf](https://www.cs.ucr.edu/~eamonn/meaningless.pdf)

~~~
Topolomancer
But this is not about clustering. It's about figuring out to what extent a
certain subclass of features, namely the 'shapelets', are statistically
significantly associated with a pre-defined binary outcome.
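
To make "significantly associated" concrete, here is a rough sketch for a single candidate shapelet: mark each series by whether the shapelet occurs within a distance threshold, build a 2x2 contingency table against the binary label, and test it. Fisher's exact test is my choice for the sketch, not necessarily the test used in the post, and the function names, threshold, and toy data are made up. It also leaves out the multiple-testing correction that becomes essential once many candidates are tested.

```python
import math

def min_distance(shapelet, series):
    """Smallest Euclidean distance between the shapelet and any window of the series."""
    m = len(shapelet)
    return min(
        math.sqrt(sum((s - x) ** 2 for s, x in zip(shapelet, series[i:i + m])))
        for i in range(len(series) - m + 1)
    )

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))

def shapelet_p_value(shapelet, dataset, threshold):
    """Test whether 'shapelet occurs within threshold' is associated with the label."""
    table = [[0, 0], [0, 0]]  # rows: present / absent, columns: label 1 / label 0
    for series, label in dataset:
        present = min_distance(shapelet, series) <= threshold
        table[0 if present else 1][0 if label == 1 else 1] += 1
    (a, b), (c, d) = table
    return fisher_exact_two_sided(a, b, c, d)

# Hypothetical toy data: label 1 series contain a spike, label 0 series do not.
dataset = [
    ([0, 0, 5, 0, 0, 0], 1),
    ([0, 0, 0, 5, 0, 0], 1),
    ([0, 5, 0, 0, 0, 0], 1),
    ([0, 0, 0, 0, 5, 0], 1),
    ([0, 0, 0, 0, 0, 0], 0),
    ([0, 1, 0, 0, 1, 0], 0),
    ([1, 0, 0, 0, 0, 0], 0),
    ([0, 0, 1, 0, 0, 0], 0),
]
p = shapelet_p_value([0, 5, 0], dataset, threshold=1.0)
```

With four spiky cases and four flat controls the table is perfectly separated, so p = 2/70, roughly 0.029. That is about as small as it can get with only eight samples.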

The paper you mentioned is interesting, though, because it shows an issue that
many algorithms are prone to: if the number of samples/features gets too
large, at some point, you are only comparing _means_.

(We are working on a paper to show the issues of this when it comes to time
series classification.)

------
valyala
Where to store time series data for further analysis? It is possible to use
Prometheus for this - see
[https://medium.com/@valyala/analyzing-prometheus-data-with-e...](https://medium.com/@valyala/analyzing-prometheus-data-with-external-tools-5f3e5e147639)

------
graycat
The math in their description of the data has an error: they need to state
that the T_i (T with a subscript i), for i = 0, 1, 2, ..., n, are distinct.

More standard would be a function d: {0, 1, ..., n} --> R^{1 x m} x {0, 1}.

~~~
Topolomancer
Seems to be standard terminology for time series classification to me, to be
honest. I think the approach would also work if there are duplicates in the
data. Although the estimate would be overly optimistic, right?

~~~
graycat
With their notation they have not specified that the T's are unique. So, a
first fix would be just to state that the T's are distinct. And it would
help to be explicit that i = 0, 1, 2, ... corresponds to increasing time.
Moreover, is the data equally spaced in time? Likely, yes, and in that case,
clearly say so.

~~~
jmmcd
No, i indexes the patient, not time. (T_0, y_0) is one patient's entire time
series.
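
A small sketch of that layout, in case it helps. The values and type names are invented; the only point is that i picks a patient and each T_i is a whole series of m measurements.

```python
from typing import List, Tuple

TimeSeries = List[float]          # one patient's m measurements
Labeled = Tuple[TimeSeries, int]  # (T_i, y_i), with y_i in {0, 1}

# Hypothetical toy values; i runs over patients, not over time steps.
dataset: List[Labeled] = [
    ([36.6, 36.9, 38.2, 39.0], 1),  # (T_0, y_0)
    ([36.5, 36.6, 36.4, 36.7], 0),  # (T_1, y_1)
    ([36.6, 36.9, 38.2, 39.0], 1),  # (T_2, y_2): same values as T_0
]

def d(i: int) -> Labeled:
    """The function view from upthread: patient index -> (series, label)."""
    return dataset[i]
```

Because the data is indexed by patient, two patients with identical measurements (T_0 and T_2 here) are still separate entries.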

------
module0000
This sure reads and looks like technical analysis indicators for time series
data.

It's useful though - example: the 5-day MA of disk errors rising above the
15-day MA == likely failure
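
A minimal sketch of that crossover check in plain Python. The error counts are made up, and treating a crossover as a failure warning is the heuristic described above, not a validated predictor.

```python
def moving_average(values, window):
    """Trailing simple moving average; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

def crossover_days(values, short=5, long=15):
    """Indices where the short MA moves from at-or-below to above the long MA."""
    short_ma = moving_average(values, short)
    long_ma = moving_average(values, long)
    days = []
    for i in range(1, len(values)):
        if long_ma[i - 1] is None:
            continue  # not enough history for the long window yet
        was_above = short_ma[i - 1] > long_ma[i - 1]
        is_above = short_ma[i] > long_ma[i]
        if is_above and not was_above:
            days.append(i)
    return days

# Hypothetical daily disk error counts: quiet for 20 days, then climbing.
errors = [1] * 20 + [2, 4, 7, 11, 16]
print(crossover_days(errors))  # prints [20]
```

The short average reacts to the first uptick on day 20 while the long one lags behind, which is exactly what makes the crossover usable as an early signal.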

