Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: Fast Rolling Quantiles for Python (github.com/marmarelis)
79 points by myrl 47 days ago | hide | past | favorite | 10 comments

This is pretty cool. The title would be a bit more descriptive if it were “Fast Rolling Quantile Filters for Python”, since the high-pass/low-pass filter functionality seems to be the focus.

The README mentions it uses binary heaps - if you’re willing to accept some (bounded) approximation, then it should be possible to reduce memory usage and somewhat reduce runtime by using a sketching data structure like Dunning’s t-digest: https://github.com/tdunning/t-digest/blob/main/docs/t-digest....

There is an open source Python implementation, although I haven’t used it and can’t vouch for its quality: https://github.com/CamDavidsonPilon/tdigest

You can also just get/use exact answers with the `accumulation_tree` package that Python `tdigest` impl uses, though for large windows a B-tree is faster than a red-black ranked/instrumented binary tree. Either tree also scales better than heap pairs to multiple quantiles per moving window (like 5%, 50%, 95%). Python's `blist` package may not have the exact API you need to build from, but it is very close. (`blist` also does not optimize for the "histogramming case" where there can be many duplicate values, IIRC. Instead each value gets its own independent slot in the data structure.)

Thanks, I'll be sure to check these out! It's a really neat idea behind that paper.

Might be worth mentioning: `bottleneck` package, implemented in C, with various moving window functions - https://bottleneck.readthedocs.io/en/latest/reference.html#m...

Awesome! Rolling quantiles were a major bottleneck in my work a couple years ago, so much so that I ended up removing them and using approximations instead. It'll be good to try this if I ever go back to that code.

What's the advantage of this over Pandas' standard rolling quantiles?

import numpy as np

import pandas as pd

s_input = pd.Series(np.random.randn(1000))

s_p10 = s_input.rolling(10).quantile(0.1)

Wouldn’t it make more sense to speed up the current APIs used by lots of programs over introducing a new API?

Perhaps, but as far as I'm aware, none of them are as flexible as my interface (to support e.g. streaming, pipelining.) I have not looked at all of pandas' time-series functionality that closely.

I see, it's interesting that ypu wrote it all in C, pandas in Python

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact