Hacker News new | comments | show | ask | jobs | submit login
Mining Time-series with Trillions of Points: Dynamic Time Warping at scale (practicalquant.blogspot.com)
80 points by datascientist 1852 days ago | hide | past | web | favorite | 11 comments



I love this stuff. Applied Computer Science. I wish there were more articles like this on Hacker News and less of the patent-wars and gadget articles.


Agreed, this kind of content is the best of HN in my opinion, along with programming / ops "war stories" like this one: http://news.ycombinator.com/item?id=4709438


The "time warp" aspect is very interesting.

There is another algorithm for computing similarity between time series, which could still be very useful: cross-correlation (also called "coherence" in signals processing). Cross-correlation is O(n log n), so it's perfect for big data applications.

An illustration:

http://en.wikipedia.org/wiki/Coherence_(signal_processing)#E...


Yeah, I was thinking the same thing as I read this. If you calculate the cross correlation of two signals by performing a Fourier transform and then multiplying them, it gives you the correlation of the signals at different offsets as well. Maybe it's hard to calculate the FFT of a dataset that doesn't fit in one machine's memory. Been a while since I studied the FFT.


Searching using cross correlation is effectively the same as searching using the euclidean distance metric I think, for which also they have a fast implementation.

DTW, however, can answer more interesting questions. For example, if you have two performances of a song captured in MIDI, the timing of each note played can vary a little, tempo can fluctuate and sometimes extra notes can also be introduced. DTW can help find the best mapping between two such performances.


This reminds me of the sequence alignment algorithms we use in bioinformatics to find scores to rank nucleic acid or amino acid sequences by similarity[ex. 1].

It's cool to see it extended to data beyond discrete base/aa values.

The "time warping" aspect also kinda reminds me of the methods used in remote sensing for comparing spectral signatures (SCM/SAM) [ex. 2].

I'd be interested in learning where/how this is being used for problems in finance or economics.

1. http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorith...

2. ftp://geo.arc.nasa.gov/pub/stevek/Spectral%20Correlation/Osmar_1_carvalho__web.pdf


What is that Gun/NoGun problem mentioned here http://alumni.cs.ucr.edu/~lexiangy/Shapelet/kdd2009shapelet.... ?


I didn't have a ton of time to search, but did they mention what kind of software they used? I'm curious what database(s) and schema(ish) techniques were used to hold all of that time series data?


From the video, it looks to me like they used MATLAB.

There are a few open-source databases of physiological signals like the ECG used in the video. One is PhysioBank:

http://www.physionet.org/physiobank/


Here is a link to their C++ code: http://www.cs.ucr.edu/~eamonn/trillion.zip


Take a look at the iSAX stuff.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: