
View Counting at Reddit - strzalek
https://redditblog.com/2017/05/24/view-counting-at-reddit/
======
haburka
I love the article on hyperloglog! It is really quite good to read even if
you're not interested in algorithms. I always liked number theory and I think
that it's very interesting that you can guess how many uniques there are by
counting how long your longest run of zeroes in a hash is.

I suppose this could be broken by injecting in a unique visitor id that would
hash to something with an absurd amount of zeroes? That's assuming that the
user has control over their user id and that I'm understanding the algorithm
correctly.

~~~
lucasschm
You are correct, but HyperLogLog has many buckets counting the longest run of
zeros in order to avoid the problem of outliers. I recently studied these
probabilistic algorithms and did a notebook with code and plots to show their
performance: [https://github.com/lucasschmidtc/Probabilistic-
Algorithms/bl...](https://github.com/lucasschmidtc/Probabilistic-
Algorithms/blob/master/Probabilistic%20Algorithms.ipynb)
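
To make the bucket idea above concrete, here is a minimal HyperLogLog sketch. It is an illustration only, not the notebook's code or Reddit's implementation: the class name `TinyHLL`, the choice of SHA-1, and the default of 4096 buckets are all assumptions made for the example. Each item is hashed, the top bits pick a bucket, and the bucket remembers the longest run of leading zeros seen in the remaining bits, so a single crafted id can inflate only one bucket.

```python
import hashlib
import math


def _hash64(item: str) -> int:
    """Map an item to a 64-bit integer (SHA-1 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")


class TinyHLL:
    """Illustrative HyperLogLog with 2**p buckets (hypothetical helper class)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                      # bits of the hash used to choose a bucket
        self.m = 1 << p                 # number of buckets (registers)
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = _hash64(item)
        bucket = h >> (64 - self.p)               # top p bits choose the bucket
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # rank = number of leading zeros in `rest`, plus one
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[bucket]:
            self.registers[bucket] = rank

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # small-range correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw
```

Adding the same id twice leaves the registers unchanged, which is why repeat views by one visitor do not move the estimate.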

~~~
snowcrshd
Thanks for sharing that!

Just skimmed through it and seems pretty interesting. I'll read it more in
depth later.

~~~
lucasschm
No problem. If there are mistakes or a segment is not clear, let me know.

~~~
meetapoorvgupta
Thanks for the write up, Lucas. It was very intuitive and I learnt a lot.

I noticed that you used 5000 buckets to store the frequency of 7000 non-unique
words in the section on 'Counting Bloom Filters'. How is that better than
using 7000 buckets and a uniformly distributed hash function, which would
maintain frequencies perfectly? We would be using fewer buckets by an order of
magnitude in a real-world implementation to save memory.

~~~
lucasschm
Yeah, I should have given more thought to that number. Updated the example for
N=300. Thanks
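
For readers following the bucket-count question, this is a rough sketch of a counting filter with a configurable number of buckets. It is not the notebook's code: the class name `CountingFilter`, the salted-MD5 hashing, and the parameter names are assumptions for illustration. The estimate for a word is the minimum of its counters, so collisions can only push an estimate up, never down, and fewer buckets means more overcounting.

```python
import hashlib
from typing import Iterator


class CountingFilter:
    """Approximate frequency table: words share counters (hypothetical helper)."""

    def __init__(self, num_buckets: int, num_hashes: int = 3) -> None:
        self.counts = [0] * num_buckets
        self.num_hashes = num_hashes

    def _buckets(self, word: str) -> Iterator[int]:
        # Derive several hash functions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.md5(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.counts)

    def add(self, word: str) -> None:
        for b in self._buckets(word):
            self.counts[b] += 1

    def estimate(self, word: str) -> int:
        # The least-collided counter is the tightest upper bound on the true count.
        return min(self.counts[b] for b in self._buckets(word))
```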

------
nyar
"We want to better communicate the scale of Reddit to our users."

If that's true why did they hide vote numbers on comments and posts? It used
to say "xxx upvotes xxx downvotes" now it just gives a number and hides that.

~~~
jonknee
It's to deter bots. The numbers weren't previously accurate, they were fuzzed
(also to deter bots).

[https://www.reddit.com/wiki/faq#wiki_how_is_a_submission.27s...](https://www.reddit.com/wiki/faq#wiki_how_is_a_submission.27s_score_determined.3F)

~~~
ma2rten
I don't quite see the connection. How exactly does this deter bots?

~~~
Klathmon
It's difficult to see if their votes are counting, allowing Reddit to
silently-ignore their votes without them knowing.

~~~
jliptzin
Can't you just delay updating the count by some random number of
minutes/hours?

~~~
nowarninglabel
That would be easy to test, though, if you wanted to know whether your bot was
effective: just post to unpopular subreddits, have the bot vote on those
submissions, then check back the next day. If the votes were not counted, then
your bot is being ignored and you'd move on to changing your IP address,
building your next bot, or the like.

~~~
problems
There's no reason this exact same method won't work given their current
practice though.

------
mxmxm
Counting views/impressions in combination with Apache Kafka sounds like the
ideal use case for a stream processor like Apache Flink. It supports very
large state which can be managed off-heap. This should enable you to count the
exact number of unique views in real time with exactly once semantics. Here is
a blog post on large scale counting with more details. It also includes a
comparison with other streaming technologies like Samza and Spark:
[https://data-artisans.com/blog/counting-in-streams-a-
hierarc...](https://data-artisans.com/blog/counting-in-streams-a-hierarchy-of-
needs)

Also check out this blog post by a Twitter engineer on counting ad
impressions: [https://data-artisans.com/blog/extending-the-yahoo-
streaming...](https://data-artisans.com/blog/extending-the-yahoo-streaming-
benchmark)

------
noamhacker
How do you test a system like this for accuracy? Is this done by simulating
millions of unique requests?

~~~
andreareina
The algorithm's accuracy is known. From the wiki[1]:

    The HyperLogLog algorithm is able to estimate
    cardinalities of > 10^9 with a typical error rate of 2%

[1]
[https://en.wikipedia.org/wiki/HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog)
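
As a quick check on where such error figures come from: the standard error of HyperLogLog is approximately 1.04 / sqrt(m), where m is the number of registers. The sketch below just evaluates that formula; the function names are made up for this example, and the 16384-register figure is the one Redis documents for its implementation.

```python
import math


def hll_standard_error(num_registers: int) -> float:
    """Approximate relative standard error of HyperLogLog with m registers."""
    return 1.04 / math.sqrt(num_registers)


def registers_for_error(target: float) -> int:
    """Smallest power-of-two register count whose standard error is <= target."""
    m = 16
    while hll_standard_error(m) > target:
        m *= 2
    return m


print(hll_standard_error(16384))   # 0.008125, i.e. about 0.81%
print(registers_for_error(0.02))   # 4096
```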

~~~
federicoponzi
But what about the implementation accuracy? :)

~~~
zeroxfe
Tests against both historical and synthetic datasets.

------
alzaeem
So how do they determine whether a user has viewed a post already? I would
think that unique counting is accomplished using the hyperloglog counter, but
the article says that this decision is made by the Nazar system, which doesn't
use the hyperloglog counter in Redis.

~~~
jimmaswell
Why can't they just associate a list of viewed posts with each user, or list
of users that viewed a post with each post, and check that? I don't get why
this needs any consideration.

~~~
sethammons
They addressed your second point in the article. On a popular post, you would
be storing several megabytes of data to capture/relate each unique user that
visited. That gets expensive at scale. HLL takes that down to a few kilobytes,
less than 1% of the original size.

For your first suggestion, you would have to do a very expensive look up. You
couldn't cache it effectively due to the requirement of near real time stats.
You could improve look up time using columnar storage, but the performance and
memory usage will be nowhere near as nice as with HLL.

Problems are harder at scale.
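
The size comparison above can be worked through as back-of-the-envelope arithmetic. The inputs here are assumptions chosen for illustration (one million unique viewers, 8-byte user ids, and roughly 12 KB for one HyperLogLog, which is the upper bound Redis documents per key); the real figures depend on the post and the implementation.

```python
def exact_set_bytes(unique_users: int, bytes_per_id: int = 8) -> int:
    """Raw payload of storing every viewer id, ignoring container overhead."""
    return unique_users * bytes_per_id


HLL_BYTES = 12 * 1024   # assumed fixed size of one HyperLogLog, about 12 KB

exact = exact_set_bytes(1_000_000)   # 8,000,000 bytes, about 8 MB
ratio = HLL_BYTES / exact            # fraction of the exact set's size

print(exact, HLL_BYTES, f"{ratio:.4%}")
```

Under these assumptions the HLL is well under 1% of the exact set, and unlike the set it does not grow as more viewers arrive.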

~~~
eropple
I've had a "phases of computing" article percolating for a while to this end.
Problems aren't just harder at scale, but they actively _change their
observable properties_ because of the stressors involved and where they crop
up.

------
stoicking
Given how much simpler it is to count total views than unique user views, why
is it more valuable to count unique user views?

~~~
jonathanbull
From a Reddit engineer:

"This was a product decision. Currently view counts are purely cosmetic, but
we did not want to rule out the possibility of them being used in ranking in
the future. As such, building in some degree of abuse protection made sense
(e.g. someone can't just sit on a page refreshing to make the view number go
up). I am fully expecting us to tweak this time window (and the duplication
heuristics in general) in future, especially as the way that users interact
with content will change as Reddit evolves."

[https://www.reddit.com/r/programming/comments/6da6n9/comment...](https://www.reddit.com/r/programming/comments/6da6n9/comment/di12fd2?st=J37U6ZXI&sh=1b5c2a56)

------
tudorconstantin
Wouldn't it have been easier to simply increment a counter for each visit and
then set a short-lived cookie in the browser for that post? And put the spam
detection system before the counter increment?

~~~
bognition
How do you concurrently update a counter?

~~~
BillinghamJ
Redis writes are atomic - you just use the increment function

~~~
bognition
Writes are atomic in redis because redis is single threaded. So you are
bounded by how fast redis can write. If you try to write any faster than redis
can handle, you'll get queueing or errors.

~~~
btmorex
Run enough redis servers to handle the load. Choose a server by hashing a user
id. Total = sum of counts from all servers.
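
A minimal sketch of that sharding scheme, with plain dicts standing in for the Redis servers so it runs without any infrastructure. The class and method names are hypothetical; in a real deployment each `incr` would be an atomic INCR on the chosen server.

```python
import hashlib
from typing import Dict, List


class ShardedCounter:
    """Per-post counters spread across shards chosen by hashing the user id."""

    def __init__(self, num_shards: int) -> None:
        # Each dict is a stand-in for one independent Redis server.
        self.shards: List[Dict[str, int]] = [dict() for _ in range(num_shards)]

    def _shard_for(self, user_id: str) -> Dict[str, int]:
        digest = hashlib.sha1(user_id.encode("utf-8")).digest()
        return self.shards[int.from_bytes(digest[:8], "big") % len(self.shards)]

    def incr(self, post_id: str, user_id: str) -> None:
        shard = self._shard_for(user_id)
        shard[post_id] = shard.get(post_id, 0) + 1

    def total(self, post_id: str) -> int:
        # Total = sum of the per-shard counts.
        return sum(shard.get(post_id, 0) for shard in self.shards)
```

Because a given user always lands on the same shard, any per-user deduplication could also be done locally on that shard before the sum is taken.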

------
tsukaisute
Weird thing I have been seeing on Reddit is comment upvotes being off-by-one
periodically on page refreshes. Reload, you get 3. Reload again, you get 4.
Again, you get 3. Seems like a replication issue?

~~~
kondor6c
I believe they are using cassandra to store the upvotes

~~~
sverhagen
Just curious if this is a stab at Cassandra, or whether use of Cassandra would
automatically imply eventual consistency or something else that would appear
in this way?

~~~
ketralnis
Cassandra as it's often used can imply eventual consistency (e.g. counter
incrs with CL.ONE) but "eventual" in this case would be in the range of tens
of ms

That said, reddit's upvote counters in particular are stored in Postgres, not
Cassandra

------
theomega
Very interesting article, thanks for publishing.

I have two related questions: 1\. I assume the process which reads from
Cassandra and puts it back to Redis is parallelized, if not distributed. How
do you ensure correctness? Implementing 2PC seems like extreme overhead. Or do
you lock in Redis? 2\. What database is used to actually store the view counts?
Cassandra's counters are afaik not very reliable...

~~~
kchandra
1\. Redis is atomic, so we use the SETNX operation to ensure that only one
write succeeds.

2\. We have HLLs in Redis, so we just issue a PFCOUNT and store the result of
that in Cassandra as an integer value. We don't use counters in Cassandra.
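
To illustrate why SETNX (set if not exists) is enough to keep parallel workers from clobbering each other, here is a small sketch. It uses a dict-backed stand-in rather than a real Redis client so that it runs anywhere; `InMemoryStore`, `restore_counter`, and the `views:` key prefix are invented for the example and are not Reddit's code.

```python
from typing import Dict, Optional


class InMemoryStore:
    """Dict-backed stand-in that mimics the semantics of Redis SETNX."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def setnx(self, key: str, value: bytes) -> bool:
        # Write only if the key is absent; report whether this call wrote.
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


def restore_counter(store: InMemoryStore, post_id: str, saved_hll: bytes) -> bool:
    """Try to put a saved counter back; returns True only for the winning worker."""
    return store.setnx(f"views:{post_id}", saved_hll)


store = InMemoryStore()
first = restore_counter(store, "abc", b"copy-from-worker-1")
second = restore_counter(store, "abc", b"copy-from-worker-2")
print(first, second, store.get("views:abc"))
```

The second worker's write is rejected, so whichever copy arrives first stays in place and later view events are applied to that one counter.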

------
ronalbarbaren
Thanks, Reddit guys. I hope an engineer at YouTube will post a similar article.
Still curious how YouTube counts views.

------
hellbanner
Slightly OT; but I wish reddit would use traditional forum style replies to
push threads up, instead of the positive feedback loop of votes with opinions
that agree with majority getting upvotes giving views which give
proportionally more upvotes

~~~
rjaco31
It _might_ be conceivable on smaller subreddits, but on the big ones it would
basically just drown everything into a sea of low-quality threads. I got the
feeling that the vast majority of threads never ever make it to the "front
page" of their subreddit.

~~~
sotojuan
> but on the big ones it would basically just drown everything into a sea of
> low-quality threads

The big subreddits' comment sections are largely low quality anyway.

Traditional forums have their downsides (anyone remember super nested quote
trains?), but I still find them superior to upvote + nested replies.

The best forum UI for me, though, are imageboards. Too bad they are associated
with a less than popular community.

------
federicoponzi
Probably noob question, but:

>> Nazar will then alter the event, adding a Boolean flag indicating whether
or not it should be counted, before sending the event back to Kafka.

Why don't they just discard it instead of putting the event back into Kafka?

~~~
bashtoni
I suspect they archive events into S3 or similar for later analysis/training.
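
A sketch of the flag-instead-of-discard pattern being discussed. Everything here is assumed for illustration: the `ViewEvent` fields, the `should_count` name, and the 10-minute repeat window are hypothetical and not taken from the article. The point is only that every event survives with an annotation, so a downstream counter can filter on the flag while an archive keeps the full stream.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ViewEvent:
    user_id: str
    post_id: str
    timestamp: float                      # seconds
    should_count: Optional[bool] = None   # filled in by the annotator


def annotate(events: List[ViewEvent], window_seconds: float = 600.0) -> List[ViewEvent]:
    """Flag repeat views inside the window instead of dropping them."""
    last_counted: Dict[Tuple[str, str], float] = {}
    out: List[ViewEvent] = []
    for ev in events:
        key = (ev.user_id, ev.post_id)
        prev = last_counted.get(key)
        fresh = prev is None or ev.timestamp - prev >= window_seconds
        if fresh:
            last_counted[key] = ev.timestamp
        out.append(replace(ev, should_count=fresh))   # event is kept either way
    return out
```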

------
golergka
A beautiful example of how a feature that seems so easy to an end user can be
complex at scale.

------
fiatjaf
At [https://trackingco.de/](https://trackingco.de/) we store events on Redis
and compile them daily into a reduced string format, storing these on CouchDB.

------
ugh123
Forgive my ignorance, but isn't this what Google Analytics is for?

~~~
PetahNZ
Google Analytics is not accurate (it's sampled) or real-time (48-hour
turnaround).

~~~
raquo
^ For big sites like reddit, which is why you don't typically run into this
when using GA on your personal blog

------
qrbLPHiKpiux
Not applied to /r/the_donald however.

~~~
hexane360
Are you talking about the "impressions"/subscribers incident? Because that was
a mislabeled field that affected almost every other sub more than T_d.

[https://www.reddit.com/r/help/comments/62naj4/can_someone_ex...](https://www.reddit.com/r/help/comments/62naj4/can_someone_explain_why_there_is_such_a/dfnvegl/)

[https://www.reddit.com/r/SubredditDrama/comments/62nw33/rthe...](https://www.reddit.com/r/SubredditDrama/comments/62nw33/rthe_donald_thinks_it_has_discovered_evidence/dfo220k/)

