But for this algorithm, you need to know the total length ("m") to set the thres...

cb321 · 2024-05-17T20:18:32 1715977112

Besides the ideas from istjohn, empath-nirvana, and rcarmo, you can also just "flip the script": solve for epsilon and report that as 1-delta confidence interval for the worst case data distribution as here: https://news.ycombinator.com/item?id=40388878

Best case error is of course zero, but if you look at my output then you will see as I did that the worst case is a very conservative bound (i.e. 15X bigger than what might "tend to happen". That matters a lot for "space usage" since the error =~ 1/sqrt(space) implying you need a lot more space for lower errors. 15^2 = 225X more space. Space optimization is usually well attended for this kind of problem. And, hey, maybe you know something about the input data distribution?

So, in addition to the worst case bound, average case errors under various distributional scenarios would be very interesting. Or even better "measuring as you go" enough distributional meta data to get a tighter error bound. That latter starts to sound like it's one of Knuth's Hard Questions Which if You Solve He'll Sign your PhD Thesis territory, though. Maybe a starting point would be some kind of online entropy(distribution) estimation, perhaps inspired by https://arxiv.org/abs/2105.07408 . And sure, maybe you need to bound the error ahead of time instead of inspecting it at any point in the stream.

istjohn · 2024-05-17T19:23:22 1715973802

You would want to calculate the threshold by choosing your target epsilon and delta and an 'm' equal to the largest conceivable size of the stream. Fortunately, the threshold increases with log(m), so it's inexpensive to anticipate several orders of magnitude more data than necessary. If you wanted, you could work backwards to calculate the actual 'epsilon' and 'delta' values for the actual 'm' of the stream after the fact.

empath-nirvana · 2024-05-17T18:58:48 1715972328

You actually don't need to do that part in the algorithm. If you don't know the length of the list, you can just choose a threshold that seems reasonable and calculate the margin of error after you're done processing. (or i guess at whatever checkpoints you want if it's continuous)

In this example, they have the length of the list and choose the threshold to give them a desired margin of error.

rcarmo · 2024-05-17T20:12:42 1715976762

See https://news.ycombinator.com/item?id=40390192