
Show HN: Efficient data loading using zeta distributions - fed135
https://github.com/fed135/ha-store
======
glifchits
I would like to see more on the zeta distribution aspect of this, but it
doesn't appear anywhere obvious in this repo.

~~~
vijay_nair
A few things I learned after a bit of research:

• the key to efficiency here seems to be “caching”, more specifically their
caching strategy

• traditionally, caching on the web is done by assuming resource access
follows the Zipf Distribution[1]

• Zeta Distributions are basically Zipf Distributions[2] so you can
effectively re-word the title as “Efficient data loading using caching” (zipf
= “caching” & zeta = zipf => zeta = “caching”)

• It’s important to note that Zipf/Zeta don’t model extremes very well, so
there’s potential for outliers causing costly cache misses. Monitor your logs!
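
To make the caching point concrete, here is a minimal simulation sketch. It is not ha-store's actual strategy; it assumes a plain LRU cache and requests drawn from Zipf weights 1/rank^s, and the function name and parameters are made up for illustration:

```python
import random
from collections import OrderedDict


def simulate_lru_hit_rate(n_items, cache_size, n_requests, s=1.0, seed=0):
    """Hit rate of an LRU cache when item ranks are requested with Zipf weights.

    Hypothetical helper for illustration only; not taken from ha-store.
    """
    rng = random.Random(seed)
    ranks = list(range(1, n_items + 1))
    # Zipf weights: the item at rank r is requested proportionally to 1 / r**s.
    weights = [1.0 / r ** s for r in ranks]
    requests = rng.choices(ranks, weights=weights, k=n_requests)

    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / n_requests


if __name__ == "__main__":
    for size in (10, 100, 500):
        rate = simulate_lru_hit_rate(1000, size, 50_000)
        print(f"cache_size={size}: hit rate {rate:.2f}")
```

Running it with different cache sizes shows how the hit rate grows as more of the head of the distribution fits in the cache, and how the long tail keeps producing misses.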

---

Further reading:

• https://pdfs.semanticscholar.org/337e/4b7f57ccbb7485950b93da9c5bb4ec4dc9ad.pdf (1999)

• https://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-distribution/

• https://en.wikipedia.org/wiki/Zipf%27s_law

• https://www.springer.com/in/book/9781402080494

---

[1] The distribution follows a power law, so the most popular resource is
accessed disproportionately more often than the second most popular one, and
so on.

The classic example is word frequency, modeled as 1/n: the second most popular
word occurs 50% as often as the most popular word (1/2), the third most
popular word occurs 33% as often (1/3), and so on, giving a power-law falloff
with a long tail. It thus makes sense to cache the most popular items first,
since a small number of them accounts for a disproportionate share of
accesses. How large that share is depends on the total number of items and on
the exponent. This is similar to the Pareto distribution (20% of the things
deliver 80% of the result).
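
As a rough check of how much traffic the head of a 1/n distribution covers, here is a small sketch. The function name is hypothetical, and the finite catalog of n items is an assumption (the thread does not give a size):

```python
def zipf_top_k_share(k, n, s=1.0):
    """Share of accesses going to the k most popular of n items.

    Assumes the item at rank r is accessed proportionally to 1 / r**s.
    """
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    weights = [1.0 / rank ** s for rank in range(1, n + 1)]
    return sum(weights[:k]) / sum(weights)


if __name__ == "__main__":
    for n in (100, 1_000, 100_000):
        print(f"n={n}: top 10 share = {zipf_top_k_share(10, n):.2f}")
```

With s = 1 the top 10 items cover about 56% of accesses when there are 100 items and roughly 39% when there are 1,000, so the payoff of a fixed-size cache shrinks as the catalog grows.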

[2] Rigorously speaking, the zeta distribution is the Zipf distribution
extended to infinitely many ranks, which requires an exponent greater than 1.
In practice they are similar enough that people use the terms interchangeably.
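
For reference, the two distributions can be written side by side. These are the standard textbook forms, with s the exponent and N the number of ranked items (symbols chosen here, not taken from the repo):

```latex
% Zipf: finite support k = 1, ..., N, normalized by a generalized harmonic number
P_{\text{Zipf}}(X = k) = \frac{k^{-s}}{H_{N,s}},
\qquad H_{N,s} = \sum_{n=1}^{N} n^{-s}

% Zeta: infinite support k = 1, 2, ..., normalized by the Riemann zeta function
P_{\text{zeta}}(X = k) = \frac{k^{-s}}{\zeta(s)},
\qquad \zeta(s) = \sum_{n=1}^{\infty} n^{-s}, \quad s > 1
```

Note that the 1/n word-frequency example corresponds to s = 1, where the infinite sum diverges, so it only makes sense in the finite Zipf form.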

~~~
fed135
Damn, it's like I don't even need to write the paper at all :) Great research
work, it does capture the idea of the project.

~~~
vijay_nair
Given that this is HN, my comment probably came across as a disappointment and
most people were already aware of the surface level details of caching.

Hope I kept them at least mildly entertained while waiting for the real deal
to drop : )

~~~
fed135
Here's the recently updated Wiki page, it's not super in-depth, but please let
me know what you think :)

