
Mining Time-series with Trillions of Points: Dynamic Time Warping at scale - datascientist
http://practicalquant.blogspot.com/2012/10/mining-time-series-with-trillions-of.html
======
davedx
I love this stuff. Applied Computer Science. I wish there were more articles
like this on Hacker News and fewer of the patent-war and gadget articles.

~~~
robbles
Agreed, this kind of content is the best of HN in my opinion, along with
programming / ops "war stories" like this one:
<http://news.ycombinator.com/item?id=4709438>

------
peripetylabs
The "time warp" aspect is very interesting.

There is another algorithm for computing similarity between time series, which
could still be very useful: _cross-correlation_ (closely related to the
"coherence" measure in signal processing). Cross-correlation can be computed
in O(n log n) time via the FFT, so it's well suited to big data applications.

An illustration:

[http://en.wikipedia.org/wiki/Coherence_(signal_processing)#E...](http://en.wikipedia.org/wiki/Coherence_\(signal_processing\)#Example)

~~~
phalina
Yeah, I was thinking the same thing as I read this. If you compute the cross-
correlation of two signals by Fourier-transforming both, multiplying one
spectrum by the complex conjugate of the other, and inverse-transforming, you
get the correlation of the signals at every offset at once. Maybe it's hard to
calculate the FFT of a dataset that doesn't fit in one machine's memory. Been
a while since I studied the FFT.
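For concreteness, here's a minimal NumPy sketch of that FFT route (the function name and padding choices are mine, not from the article): transform both signals, multiply one spectrum by the conjugate of the other, and inverse-transform.

```python
import numpy as np

def xcorr_fft(a, b):
    """Linear cross-correlation of two 1-D signals via the FFT, O(n log n).

    Returns c where c[k] = sum_i a[i] * b[i - k]; negative lags wrap
    around to the end of the array, following NumPy's FFT conventions.
    """
    n = len(a) + len(b) - 1            # zero-pad to avoid circular wrap-around
    A = np.fft.rfft(a, n)
    B = np.fft.rfft(b, n)
    # Correlation = multiplication by the complex conjugate in frequency space
    return np.fft.irfft(A * np.conj(B), n)

# Recover the offset between a signal and a delayed copy of it:
b = np.zeros(64); b[5] = 1.0           # impulse at t = 5
a = np.zeros(64); a[12] = 1.0          # same impulse, delayed to t = 12
lag = int(np.argmax(xcorr_fft(a, b)))  # lag == 7
```

The single multiply in the frequency domain is what buys the O(n log n): a naive sliding-dot-product correlation over all offsets would be O(n^2).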

~~~
aufreak3
Searching using cross-correlation is effectively the same as searching using
the Euclidean distance metric, I think, and they have a fast implementation
for that as well.

DTW, however, can answer more interesting questions. For example, if you have
two performances of a song captured in MIDI, the timing of each note played
can vary a little, tempo can fluctuate and sometimes extra notes can also be
introduced. DTW can help find the best mapping between two such performances.
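To make the contrast concrete, here is a bare-bones sketch of the classic O(nm) DTW recurrence (not the heavily optimized search the article describes, which is what makes trillion-point scale feasible). Unlike Euclidean distance, it tolerates sequences of different lengths and local timing shifts:

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic time warping distance between 1-D sequences.

    D[i][j] = cost of the best warping path aligning x[:i] with y[:j].
    Each step may repeat ("stretch") an element of either sequence.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # stretch y
                                 D[i, j - 1],      # stretch x
                                 D[i - 1, j - 1])  # advance both
    return D[n, m]

# An extra repeated note costs nothing under warping:
dtw_distance([1, 2, 3], [1, 2, 2, 3])   # 0.0, while Euclidean
                                        # distance isn't even defined
                                        # for unequal lengths
```

This is exactly the MIDI scenario above: the duplicated `2` is absorbed by the warping path rather than throwing off every subsequent comparison.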

------
tomkinstinch
This reminds me of the sequence alignment algorithms we use in bioinformatics
to score and rank nucleic acid or amino acid sequences by similarity [ex. 1].

It's cool to see it extended to data beyond discrete base/aa values.

The "time warping" aspect also kinda reminds me of the methods used in remote
sensing for comparing spectral signatures (SCM/SAM) [ex. 2].

I'd be interested in learning where/how this is being used for problems in
finance or economics.

1.
[http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorith...](http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm)

2.
ftp://geo.arc.nasa.gov/pub/stevek/Spectral%20Correlation/Osmar_1_carvalho__web.pdf
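The family resemblance is easy to see in code: Smith-Waterman fills a dynamic-programming table much like DTW does, just with discrete match/mismatch/gap scores instead of pointwise distances. A bare-bones sketch (the scoring parameters here are illustrative defaults, not from any particular tool):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b.

    H[i][j] = best score of an alignment ending at a[i-1], b[j-1];
    the floor of 0 is what makes the alignment local.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                       # restart alignment
                          H[i - 1][j - 1] + s,     # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,       # gap in b
                          H[i][j - 1] + gap)       # gap in a
            best = max(best, H[i][j])
    return best

smith_waterman("ACGT", "TACGTA")  # 8: the shared ACGT scores 4 * match
```

Where DTW's recurrence charges a continuous per-step distance, this one charges discrete penalties, which is why DTW generalizes so naturally to real-valued signals.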

------
scotty79
What is that Gun/NoGun problem mentioned here
[http://alumni.cs.ucr.edu/~lexiangy/Shapelet/kdd2009shapelet....](http://alumni.cs.ucr.edu/~lexiangy/Shapelet/kdd2009shapelet.pdf)
?

------
dhm116
I didn't have a ton of time to search, but did they mention what kind of
software they used? I'm curious what database(s) and schema(ish) techniques
were used to hold all of that time series data?

~~~
peripetylabs
From the video, it looks to me like they used MATLAB.

There are a few open-source databases of physiological signals like the ECG
used in the video. One is PhysioBank:

<http://www.physionet.org/physiobank/>

