Details here: https://masterscrat.github.io/documents/Florian_Laurent_CERN...
~6 years ago I was trying to build a dashboard that displayed real-time measurements from a beehive. The sensors would take temperature, weight, humidity, etc. Back then I used simplify.js, which runs two simplification algorithms in two passes. The more compute-intensive one is the Ramer–Douglas–Peucker algorithm. One issue I had with streaming data is that new data points could change the past line, at least with my naive implementation. I'd love to see a real-time / streaming time series simplification algorithm where the past points don't appear to change wildly despite continuing to downsample.
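For reference, a minimal (non-streaming) Ramer–Douglas–Peucker pass might look like this sketch; simplify.js pairs something like it with a cheaper radial-distance pass, and this is not its actual implementation:

```python
import math

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: keep points whose perpendicular distance
    from the chord joining the segment endpoints exceeds epsilon."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy) or 1e-12
    # Find the interior point farthest from the chord.
    dmax, imax = 0.0, 0
    for i, (x, y) in enumerate(points[1:-1], start=1):
        d = abs(dy * (x - x1) - dx * (y - y1)) / norm
        if d > dmax:
            dmax, imax = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]
    # Recurse on both halves, sharing the split point.
    left = rdp(points[:imax + 1], epsilon)
    right = rdp(points[imax:], epsilon)
    return left[:-1] + right
```

The recursion is exactly why streaming is awkward: a new endpoint changes the chord, which can change which past points survive.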
My instinct is the algorithm looks too specialized for this one task (which is good sometimes, if used for the right task).
As a thought, could you not use an FFT to move to the frequency domain, remove the high-frequency data, and then transform back to the time domain?
Fourier transforms are used all over the place, and FFT libs are usually well optimized and can even be hardware accelerated.
Without checking it, the cutoff frequency would be determined by the Nyquist–Shannon sampling theorem, and should depend on the width of the graph in pixels.
So if the graph is resized, recompute the downsampled timeseries from the Nyquist–Shannon limit of the new width.
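A sketch of that idea with NumPy, assuming uniform sampling (the function name and the final decimation step are my own):

```python
import numpy as np

def fft_downsample(samples, pixel_width):
    """Low-pass a uniformly sampled series by zeroing FFT bins above what
    pixel_width pixels can display, then transform back. Per Nyquist,
    pixel_width pixels can show at most pixel_width/2 cycles."""
    spectrum = np.fft.rfft(samples)
    cutoff = pixel_width // 2          # highest displayable frequency bin
    spectrum[cutoff + 1:] = 0          # drop everything above the cutoff
    smoothed = np.fft.irfft(spectrum, n=len(samples))
    # Now that the signal is band-limited, decimate to one sample per pixel.
    idx = np.linspace(0, len(samples) - 1, pixel_width).round().astype(int)
    return smoothed[idx]
```

On resize you would just call this again with the new pixel width, as suggested above.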
A sliding DFT can also be used for real-time streaming data.
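For a single bin k over a window of N samples, the sliding-DFT recurrence is O(1) per incoming sample instead of re-running a full FFT; a minimal sketch:

```python
import cmath

class SlidingDFT:
    """Sliding DFT of one frequency bin k over a window of length N.
    Each update costs O(1) instead of an O(N log N) FFT."""
    def __init__(self, k, N):
        self.twiddle = cmath.exp(2j * cmath.pi * k / N)
        self.window = [0.0] * N   # assumes the stream starts with N zeros
        self.bin = 0 + 0j

    def update(self, sample):
        oldest = self.window.pop(0)
        self.window.append(sample)
        # Classic sliding-DFT recurrence: subtract the sample leaving the
        # window, add the one entering, then rotate by one bin's phase.
        self.bin = (self.bin - oldest + sample) * self.twiddle
        return self.bin
```

After N updates the accumulator equals the DFT coefficient of the current window (in practice you'd damp the recurrence slightly for numerical stability).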
Neither is particularly "performant." FFTs and FIRs are big guns!
And the plugin used in the demo:
1) You measure millions of (time, count) events, covering 24hrs.
2) You downsample this somehow to save space.
3) You start measuring (time, count) events again, covering 7 hours. You expect the data to match up with the previous measurements, and you'd like to plot differences.
Is there a way to trim the downsampled 24hr set so that it matches the downsampled 7hr set?
Resampling is pretty effective; linear interpolation is usually good enough because the data is already downsampled, presumably with an averaging or aggregating step that can be interpolated over. The mean-value theorem is your friend if you're looking at derivatives of time series data.
The type of metric also matters; gauges can be simply interpolated. Counters make more sense to re-bucket into the chosen sampling period and offset, and for your specific question what you can do is switch from downsampled data to "realtime" data exactly at the bucket boundary so that you avoid the half-full bucket problem.
A good way to treat counter timeseries is as bundles of updates (deltas) that arrive at a certain cadence. That makes the transition between different cadences smooth: resample so that updates don't overlap, e.g. if you have a downsampled bucket counting events from 1:00 to 1:10, then resample the other timeseries so its buckets begin exactly at 1:10 or end exactly at 1:00 instead of letting them overlap.
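A sketch of that re-bucketing, assuming counter updates arrive as (timestamp, delta) events and no single update straddles a boundary:

```python
def rebucket(updates, bucket_width, origin=0):
    """Re-bucket counter updates (timestamp, delta) into buckets whose
    boundaries are aligned to `origin` (e.g. origin=0 with
    bucket_width=3600 aligns to hour marks).
    Returns {bucket_start: summed_delta}."""
    buckets = {}
    for ts, delta in updates:
        start = origin + ((ts - origin) // bucket_width) * bucket_width
        buckets[start] = buckets.get(start, 0) + delta
    return buckets
```

Switching from downsampled to "realtime" data exactly at one of these boundaries is what avoids the half-full-bucket problem mentioned above.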
Linear interpolation will give you a line chart. Curve-fitting methods will smooth the charts out, but since you are working with counters you can put some constraints on the interpolation.
Since each bucket represents a change over a period of time it represents an area under the curve for the same interval which should match any graphic representation. https://stackoverflow.com/questions/34983360/create-a-piecew... is basically the question of how to interpolate between data points while preserving that area, which might be overkill for your application but is possible.
Reinsch Spline is commonly used for this.
If not, even if you measure the same thing, the variance may differ. You can start applying the law of large numbers since you mentioned "millions" and just consider a normal distribution, but for rare events you need something different (Poisson).
Are the series static? If not (ex: temperature) you will have to model on time. As the link mentions people crossing an intersection, I assume more people will cross at 9am than at 2am (7 hours earlier). Likewise if one series is a weekday and the other a weekend, you may have surprises!
You have to consider other things (ex: day of the week).
I would recommend avoiding the problem altogether: fit something (KDE, ARIMA, ...) to your timeseries, and plot the fit with the 95% CI from X=0 to X=24.
Do the same for the other series, from X=0 to X=7 and see if they are at least in the same CI
Then compute the correlation of the 2 fits (just to have a reference number) and use that as a metric.
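A crude stand-in for the fit-plus-CI idea, using a per-hour Poisson approximation for counts instead of a real KDE/ARIMA fit (the function names are mine):

```python
import math

def hourly_ci(counts, z=1.96):
    """95% CI per hour, treating each hour's count as Poisson so the
    standard deviation is sqrt(count). Returns one (lo, hi) per hour."""
    return [(c - z * math.sqrt(c), c + z * math.sqrt(c)) for c in counts]

def within_ci(reference_counts, new_counts):
    """Check the first len(new_counts) hours of the reference series
    against the new series' confidence intervals."""
    cis = hourly_ci(new_counts)
    return [lo <= r <= hi for r, (lo, hi) in zip(reference_counts, cis)]
```

The per-hour True/False list supports exactly the "different during rush hour, same the rest of the day" call-outs, which a single correlation number would hide.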
I'm interested in showing differences like "during morning rush hour, there was more traffic here, but it was about the same the rest of the day", so I'm not sure a single correlation as a metric would be best. Ideally I'd plot two line plots and call out differences over time.
Also, you will have correlation between adjacent segments of the network- and the idea of extracting "rush hours" brings its own set of problems.
Consult with a statistician familiar with geographical models. I'm sorry I can't help much more than that. It's very different from finance.
I'm not trying to extract rush hour patterns or anything quantitatively, just show two timeseries plots to the user and have them figure out what's going on.
Thanks for the ideas! Lots of new things to research. Stats isn't my background at all.
Another thought: I see you're basically using histograms for visualization, you should try switching to displaying estimates of probability density functions. The estimation is done with "kernel density estimation" (KDE), and it's basically a much improved histogram. You will probably want to enable the user to select the KDE bandwidth (which is one positive real number), AFAIK different bandwidths may show patterns that exist on different time scales. In addition to the KDE, you may also want to display the estimate of the cumulative distribution function: the empirical distribution function (eCDF).
The eCDF is a really simple concept, KDE also isn't too bad. See here for KDE and eCDF intuition: https://mathisonian.github.io/kde/ (the former requires a desktop computer for viewing) https://en.wikipedia.org/wiki/Empirical_distribution_functio...
Implementing KDE isn't straightforward (several completely different approaches to it exist), but worst case you should be able to find a library for it.
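With SciPy, for example, both estimates fit in a few lines (the function name and grid here are illustrative; `bw_method` is the user-selectable bandwidth mentioned above):

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_and_ecdf(samples, grid, bandwidth=None):
    """Estimate the PDF with a Gaussian KDE (bandwidth is the smoothing
    parameter; None uses Scott's rule) and the CDF with the eCDF,
    both evaluated on `grid`."""
    kde = gaussian_kde(samples, bw_method=bandwidth)
    pdf = kde(grid)
    # eCDF: fraction of samples <= each grid point.
    sorted_samples = np.sort(samples)
    ecdf = np.searchsorted(sorted_samples, grid, side="right") / len(samples)
    return pdf, ecdf
```

Exposing `bandwidth` in the UI lets the user sweep between coarse and fine time-scale patterns, as suggested above.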
Also, I should probably point you towards Vega and Vega Lite, it seems like they may be of use: https://vega.github.io/vega-lite/
Vega (and Vega Lite) are for interactive data visualization in the browser, I don't know if that suits your game, but they can give you the estimates that describe probability distributions from samples that I wrote about above.
At any rate you want to do the same thing to both datasets, and if it's counts of some random event then the only thing you can really compare is the sum-total over the same period.
5 minute buckets would make this comparison problem easier, but it's too much data to store.
Here's a video of the problem with the current approach, which is storing a count per hour. No sliding windows; 7:59am and 8:01am are different buckets. Green is "less than usual", red is "more than usual." Every time the hour resets, lots of green appears, and slowly fades.
If you can't or don't want to do that then you'll need to pull some tricks. In essence you want to compare two 'rates' against each other, right? Now technically the 5-minute bucket idea is just a crude approximation: what you actually want is the average rate at any point in time, and a histogram is just a crude method to determine that rate. This means you might get away with it if you calculate this rate using two different methods on the two sides.
For the presampled data you could:
- Interpolate the hourly data (keep in mind that the hourly totals run 30 minutes behind)
- Fit a function to the data rather than just sampling it (i.e. if it's periodic you could just use the low-frequency part of its Fourier series)
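The first option might be sketched like this, anchoring each hourly total at its midpoint to account for the "30 minutes behind" caveat (assumes epoch-second timestamps; the function name is mine):

```python
import numpy as np

def hourly_rate_at(t, hour_starts, hour_totals):
    """Estimate the instantaneous rate (events/sec) at time t by linearly
    interpolating hourly totals. Each total is an average over its hour,
    so it is anchored at the hour's midpoint -- the latest complete
    bucket really tells you about the rate ~30 minutes ago."""
    midpoints = np.asarray(hour_starts, dtype=float) + 1800.0
    rates = np.asarray(hour_totals, dtype=float) / 3600.0
    return np.interp(t, midpoints, rates)
```

For periodic data you'd swap the linear interpolation for the low-frequency Fourier fit instead.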
Now I presume the data is coming in online (i.e. you can't just wait for a full sample and compare it afterwards; otherwise just do the same thing as above). This means you'll need to do something like:
- Use a sliding window (again keep in mind that this delays the signal)
- Use an online low-pass filter.
- Fit a function to the partial data (tricky and might be unstable)
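The online low-pass option can be as simple as a single-pole IIR (exponential moving average); a minimal sketch:

```python
class OnlineLowPass:
    """Single-pole IIR low-pass (exponential moving average):
    y += alpha * (x - y). O(1) per sample, no buffer, but note it
    lags the true signal, just like the sliding window does."""
    def __init__(self, alpha):
        self.alpha = alpha   # 0 < alpha <= 1; smaller = heavier smoothing
        self.y = None

    def update(self, x):
        self.y = x if self.y is None else self.y + self.alpha * (x - self.y)
        return self.y
```

Matching `alpha` to the hourly bucket width on the other side keeps the two smoothed rates roughly comparable.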
Hope this helps.
You can get sliding windows out of this as well, by using the equations for adjacent periods. You'll incur a runtime cost for this of course, but a relatively small one.
All you need to do is divide your time series into time segments and then take the first data point (open), the last data point (close), the highest point (high) and the lowest (low). Based on that you can get a really good understanding of the data, and it scales well with whatever time frame or segment size you choose. As you zoom in or out, you don't miss a lot of the important points of the data.
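A minimal sketch of that OHLC bucketing, assuming (time, value) points sorted by time:

```python
def ohlc(points, bucket_width):
    """Downsample sorted (time, value) points into one
    [open, high, low, close] record per time bucket."""
    buckets = {}
    for t, v in points:
        key = (t // bucket_width) * bucket_width
        if key not in buckets:
            buckets[key] = [v, v, v, v]   # open, high, low, close
        else:
            b = buckets[key]
            b[1] = max(b[1], v)           # high
            b[2] = min(b[2], v)           # low
            b[3] = v                      # close keeps tracking the latest
        # open (b[0]) stays the first value seen in the bucket
    return buckets
```

Because high and low survive regardless of where they fall in the bucket, spikes don't vanish when you zoom out, which plain decimation can't guarantee.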