Hacker News new | past | comments | ask | show | jobs | submit login
Downsampling Time Series for Visual Representation [pdf] (2013) (skemman.is)
139 points by todsacerdoti 32 days ago | hide | past | favorite | 32 comments

Interesting to see this pop up here! I was looking for a solution to plot terabytes of time series during my MSc Thesis at CERN and ended up adding support for this to a Grafana fork. Fun times.

Details here: https://masterscrat.github.io/documents/Florian_Laurent_CERN...

This looks cool, I especially like the "intuitive" longest line algorithm.

~6 years ago I was trying to build a dashboard that displayed real-time measurements from a beehive. The sensors would take temperature, weight, humidity, etc. Back then I used simplify.js [0] which uses two simplification algorithms in two passes. The more compute intensive one is the Ramer Douglas Peucker algorithm [1]. One issue I had with streaming data is that new data points could change the past line, at least with my naive implementation. I'd love to see a real-time / streaming time series simplification algorithm where the past points don't appear to change wildly despite continuing to downsample.

[0] https://github.com/mourner/simplify-js

[1] https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93...

Some (often desirable) properties of RDP for timeseries is that is preserves details at extreme events and does not shift peaks. Very useful for e.g. seismic data. I have compressed tons of data with 1% subsampling with hardly any information loss

This is a form of compression that is made to preserve chart visual looks, if I understand it right. There is no analysis on how it bias the data with the down sample.

My instinct is the algorithm looks too specialized for this one task (which is good sometimes, if used for the right task).

As a thought could you not use a FFT to the frequency domain and remove the high frequency data. And then retransform it back to the time domain.

Fourier transforms are used all over the place, and FFT libs are usually well optimized and can even be hardware accelerated.

Without checking it, the cut off frequency would be determined by the Nyquist–Shannon sampling theorem. And should be dependent of the width of the graph in pixels.

So if the graph is resized, then recompute the new down sampled timeseries from the new Nyquist–Shannon sampling limit of the new width.

A sliding DFT can also be used for streaming data for realtime.

You've just described in very basic terms how linear downsampling works (filter -> decimate). It's orders of magnitude more expensive than what is described in the paper, and a crappy sinc interpolator (equivalent to zeroing FFT bins) is not going to be much improvement over the nonlinear approaches described - either objectively in terms of signal degradation due to ripple effects or subjectively, as the paper has explored.

Specifically, this is trying to downsample a chart while only using points from the original time series. Something like an FFT would generate a new series of points which would most likely not be in the original dataset, though they might visually represent them well.

Why not just use a low-pass linear-phase FIR filter instead of FFT/iFFT? Same result but achieved in time-domain, somewhat more performant.

That depends on the filter order. Evaluating a FIR is O(N M^2) where N is the number of data points and M is the size of the FIR. Doing the same with an FFT is O(N Mlog(M)).

Neither is particularly "performant." FFTs and FIRs are big guns!

Would FFT even be appropriate for nonstationary time series? Wavelets seem more appropriate for such time series (not sure about how peformant they would be though, the R package runs pretty slow even for small datasets)

A demo for "Largest-Triangle-Three-Buckets" algorithm https://www.base.is/flot/

And the plugin used in the demo: https://github.com/sveinn-steinarsson/flot-downsample

Why reduce it to a single line when you can paint hulls around your curve? For coarser intervals you can also use candle stick charts.

Does anybody know an approach for downsampling time series when you need to compare two downsampled series? Let's say:

1) You measure millions of (time, count) events, covering 24hrs.

2) You downsample this somehow to save space.

3) You start measuring (time, count) events again, covering 7 hours. You expect the data to match up with the previous measurements, and you'd like to plot differences.

Is there a way to trim the downsampled 24hr set so that it matches the downsampled 7hr set?

Motivation: https://github.com/a-b-street/abstreet/issues/85

I've seen this called window alignment. You can either pick a default and stick to it everywhere, e.g. truncate to the nearest minute or Nth-minute, but this only works for divisors of an hour for example so that it's easy to align different sources at any starting point.

Resampling is pretty effective; linear interpolation is usually good enough because the data is already downsampled, presumably with an averaging or aggregating step that can be interpolated over. The mean-value theorem is your friend if you're looking at derivatives of time series data.

The type of metric also matters; gauges can be simply interpolated. Counters make more sense to re-bucket into the chosen sampling period and offset, and for your specific question what you can do is switch from downsampled data to "realtime" data exactly at the bucket boundary so that you avoid the half-full bucket problem.

A good way to treat counter timeseries is as bundles of updates (deltas) that arrive at a certain cadence. That makes the transition from different cadences smooth by resampling to avoid overlapping updates, e.g. if you have a downsampled bucket counting events from 1:00 to 1:10 then resample other timeseries so that buckets begin exactly at 1:10 or end exactly at 1:00 instead of trying to let them overlap.

I haven't had nearly enough caffeine today to process this properly, but thank you -- it partly makes sense, and it's plenty to go off of later!

After re-reading your questions 1 and 2 I think interpolation basically answers them both. You want smooth and stable values for your counters at any time t, and the way to get that is define an interpolation function from (timeseries, absolute time) to value, or recreate a sufficient range of higher resolution timeseries values that can be graphed directly.

Linear interpolation will give you a line chart. Curve-fitting methods will smooth the charts out, but since you are working with counters you can put some constraints on the interpolation.

Since each bucket represents a change over a period of time it represents an area under the curve for the same interval which should match any graphic representation. https://stackoverflow.com/questions/34983360/create-a-piecew... is basically the question of how to interpolate between data points while preserving that area, which might be overkill for your application but is possible.

Interpolation is what I have seen used at industrial Plant I work at to trend high frequency data onto a HMI for example.

Reinsch Spline is commonly used for this.

Do you have as many events in both samples?

If not, even if you measure the same thing, the variance may differ - you can start applying the law of large numbers since you mentionned "millions" and just consider a normal distribution, but for rare events you need something different (Poisson)

Are the series static? If not (ex: temperature) you will have to model on time. As the link mentions people crossing intersection, I assume more people will at 9am than at 2am (7 hours ago). Likewise if one series is a weekday and the other a weekend, you may have surprises!

You have to consider other things (ex: day of the week).

I would recommend avoiding the problem altogether: fit something (KDE, ARIMA...) on your timeseries, and plot the fit with the 95% CI from X=0 to X=24

Do the same for the other series, from X=0 to X=7 and see if they are at least in the same CI

Then compute the correlation of the 2 fits (just to have a reference number) and use that as a metric.

To be more specific, I'm measuring throughput along road segments in a deterministic traffic simulation. Differences would occur when the user modifies the road network. The modifications often have no effect on most roads, but some wind up with more or less traffic than usual at a certain time.

I'm interested in showing differences like "during morning rush hour, there was more traffic here, but it was about the same the rest of the day", so I'm not sure a single correlation as a metric would be best. Ideally I'd plot two line plots and call out differences over time.

There is going to be complex behavior (adapting to traffic by routing around it) that a simulation will poorly predict

Also, you will have correlation between adjacent segments of the network- and the idea of extracting "rush hours" brings its own set of problems.

Consult with a statistician familiar with geographical models. I'm sorry I can't help much more than that. It's very different from finance.

Surprises from emergent behavior are some of the most satisfying things about working on this.

I'm not trying to extract rush hour patterns or anything quantitatively, just show two timeseries plots to the user and have them figure out what's going on.

Thanks for the ideas! Lots of new things to research. Stats isn't my background at all.

You probably aren't going to like this, but have you considered having the user select some roads and intersections in advance, and then only keep track of those roads? That should make the problem with storage manageable. Don't know if there's a better way.

Another thought: I see you're basically using histograms for visualization, you should try switching to displaying estimates of probability density functions. The estimation is done with "kernel density estimation" (KDE), and it's basically a much improved histogram. You will probably want to enable the user to select the KDE bandwidth (which is one positive real number), AFAIK different bandwidths may show patterns that exist on different time scales. In addition to the KDE, you may also want to display the estimate of the cumulative distribution function: the empirical distribution function (eCDF).

The eCDF is a really simple concept, KDE also isn't too bad. See here for KDE and eCDF intuition: https://mathisonian.github.io/kde/ (the former requires a desktop computer for viewing) https://en.wikipedia.org/wiki/Empirical_distribution_functio...

Implementing KDE isn't straightforward, several completely different approaches to it exist, but worst case you should be able to find a library for that.

Also, I should probably point you towards Vega and Vega Lite, it seems like they may be of use: https://vega.github.io/vega-lite/

Vega (and Vega Lite) are for interactive data visualization in the browser, I don't know if that suits your game, but they can give you the estimates that describe probability distributions from samples that I wrote about above.

You're basically calculating a histogram anyway so the most sensible thing to do with would be to calculate both histograms with the same buckets (simplest is just the sum every 5 minutes or so).

At any rate you want do to the same thing to both datasets, and if it's counts of some random event then the only thing you can really compare is the sum-total over the same period.

This is basically the approach I'm doing now, with granularity being 1 hour. The problem is that when you're in the middle of an hour, the bucket for the live sim only accounts for half of the hour, but the stored bucket from the full day's data represents the entire hour. I could assume the bucket accumulates its count linearly through the hour, and that'd be much better than my current approach, but it still feels a bit weak.

5 minute buckets would make this comparison problem easier, but it's too much data to store.


Here's a video of the problem with the current approach, which is storing a count per hour. No sliding windows; 7:59am and 8:01am are different buckets. Green is "less than usual", red is "more than usual." Every time the hour resets, lots of green appears, and slowly fades.

As it is you still have the problem that you're not comparing like with like, you can't compare the total for an hour with the total for part of an hour. So you'd basically need to restrict your visualization to only show the image at the moments when you do have all the data. I'd strongly suggest biting the bullet and just storing the data in 5 minute buckets if at all possible, then you can have both sides in the same format and compare them easily. 5 minute buckets is only 12 times the amount of data you currently need, which might be a reasonable trade-off.

If you can't or don't want to do that then you'll need to pull some tricks. In essence you want to compare two 'rates' against each other right? Now technically the 5-minute bucket idea is just a crude approximation, what you actually want is the average rate at any point in time a histogram is just a crude method to determine this rate. This means you might get away with it if you try to calculate this rate using 2 different methods on both sides.

For the presampled data you could:

- Interpolate the hourly data (keep in mind that the hourly totals run 30 minutes behind)

- Fit a function to the data rather than just sampling it (i.e. if its periodic you could just use the low frequency part of its Fourier series)

Now I presume the data coming in is coming in online (i.e. you can't just wait for a full sample and compare it afterwards, otherwise just do the same thing as above). This means you'll need to do something like:

- Use a sliding window (again keep in mind that this delays the signal)

- Use an online low-pass filter.

- Fit a function to the partial data (tricky and might be unstable)

Hope this helps.

You'll have to store more information, you just need to decide what is more space efficient. A possible approach would be to store both the bucket total, as well as the components to an equation that can roughly model the rate of change within the histogram. If you find an equation that generally models observed rates dependent on time `t`, then you can interpolate an approximate value for any point in time.

You can get sliding windows out of this as well, by using the equations for adjacent periods. You'll incur a runtime cost for this of course, but relatively small.

Do you have some criteria for the error?

Not particularly. The data comes from a deterministic traffic simulation, and I just want to show people roughly how their changes to a road network affect throughput along road segments. I'm interested in showing differences like "during morning rush hour, there was more traffic here, but it was about the same the rest of the day". "About the same" is fuzzy.

I think stock candle charts do the best job.

All you need to do is divide your time series into time segments and then take the first data point (open), the last data point (close), the highest point (high) and the lowest (low). Based on that data you can get a really good understand of the data that scales well based on what time frame or time segment you choose. As you zoom in or out, you don’t miss a lot of the important points of the data.

Great tool I've always found this a useful for "downsampling" by shape constraints as to not crush the DOM/Canvas for high performance frontend charting


Here's another demo featuring the methods discussed in the paper https://d3fc.io/examples/sample/

Neat. I've noticed that a few stock market apps/sites definitely use different methods to do this, and with dramatically different results.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact