I suppose this could be broken by injecting a unique visitor id that hashes to something with an absurd number of leading zeroes? That's assuming that the user has control over their user id and that I'm understanding the algorithm correctly.
Just skimmed through it and seems pretty interesting.
I'll read it more in depth later.
I noticed that you used 5000 buckets to store the frequency of 7000 non-unique words in the section on 'Counting Bloom Filters'. How is that better than using 7000 buckets and a uniformly distributed hash function, which would maintain frequencies perfectly? In a real-world implementation we would use an order of magnitude fewer buckets to save memory.
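For reference, a minimal counting Bloom filter sketch (illustrative, not the article's code): k hash functions map each word into m counters, and taking the minimum over the k counters keeps collisions from corrupting frequencies too badly even when m is much smaller than the vocabulary.

```python
import hashlib

class CountingBloomFilter:
    """Minimal counting Bloom filter: k hash functions over m counters.
    Lookups can over-estimate a frequency (hash collisions) but never
    under-estimate it."""

    def __init__(self, m: int = 5000, k: int = 3):
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _indexes(self, item: str):
        # Derive k independent indexes by salting the hash input.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for idx in self._indexes(item):
            self.counters[idx] += 1

    def count(self, item: str) -> int:
        # The true frequency is at most the minimum of the k counters.
        return min(self.counters[idx] for idx in self._indexes(item))
```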
The usual trick for preventing folks _maliciously_ sending in outliers is to use a hash with a secret key such as SipHash so that folks on the outside can't trivially figure out what inputs will lead to hashes with a lot of leading zeroes.
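A minimal sketch of the idea in Python, using HMAC as the keyed hash (SipHash itself isn't in the standard library, but the principle is the same): an attacker who doesn't know the key can't predict which inputs hash to many leading zeroes.

```python
import hashlib
import hmac
import os

SECRET_KEY = os.urandom(16)  # server-side secret; never exposed to clients

def keyed_hash(visitor_id: str) -> int:
    """Hash a visitor id with a secret key so outsiders can't craft
    ids that produce pathological hash values."""
    digest = hmac.new(SECRET_KEY, visitor_id.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")

def leading_zeros(h: int, bits: int = 64) -> int:
    """Count leading zero bits -- the quantity HLL cares about."""
    return bits - h.bit_length()
```

With a random per-deployment key, precomputing ids with long zero runs is no easier than brute force.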
HyperLogLog is different because you have the entire population (not just a sample), and it's a multiset (the same element can appear more than once). Getting the size of a (non-multi) set is easy, you just keep a counter and increment it for each element; it only takes enough memory to maintain the counter. Counting the distinct members of a multiset takes a lot more memory because you have to remember whether you've already seen a particular element or not.
The tl;dr is that the German Tank Problem is about making an estimate of size when you have imperfect information, and HyperLogLog gives you an estimate when you have perfect information, but it's too expensive to make an exact calculation.
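A toy version of the estimator, to make that concrete (simplified: real HLL adds small- and large-range corrections on top of this):

```python
import hashlib
import math

def hll_estimate(items, b: int = 10) -> float:
    """Toy HyperLogLog: m = 2^b registers, each remembering the maximum
    leading-zero run seen among the hashes routed to it."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - b)                # first b bits pick a register
        rest = h & ((1 << (64 - b)) - 1)   # remaining bits supply the zero run
        rank = (64 - b) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    # Normalized harmonic mean of 2^register across all registers.
    alpha = 0.7213 / (1 + 1.079 / m)
    return alpha * m * m / sum(2.0 ** -r for r in registers)
```

Feeding the same element twice leaves the registers unchanged, which is why duplicates in the multiset don't inflate the estimate.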
If that's true why did they hide vote numbers on comments and posts? It used to say "xxx upvotes xxx downvotes" now it just gives a number and hides that.
Reddit is definitely not a paragon of anti-bot engineering, I think most of those skills exist in adtech.
You could slowly remove the downvotes, but attentive people refreshing the page will see the number shrinking and become suspicious, because real users never suddenly do a mass retraction of their votes.
You could slowly add a bunch of upvotes, but then people will wonder why this comment with a consistent 12/207 popularity ratio for the past hour suddenly overcame that ratio and became the most popular comment in the thread. People will suspect a coordinated raid took place.
Both approaches raise too much suspicion. The safest approach is to turn off the ability to view the separate upvote/downvote values altogether and use a simple easing function to artificially increase the comment's total score over time. When no one can see the upvote/downvote ratio or the volume of vote activity over time, they lose the ability to judge whether manipulation is taking place.
You need an excuse for the change, so don't forget to also come up with a spurious narrative about how it was supposedly done to fight bots.
It's way easier to say "Hey, we're giving you XYZ traffic, give us ABC dollars," when you have the figures in front of you rather than just upvote/downvote numbers.
Also check out this blog post by a Twitter engineer on counting ad impressions: https://data-artisans.com/blog/extending-the-yahoo-streaming...
> The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical error rate of 2%
For your first suggestion, you would have to do a very expensive lookup. You couldn't cache it effectively due to the requirement of near-real-time stats. You could improve lookup time using columnar storage, but the performance and memory usage would be nowhere near as nice as with HLL.
Problems are harder at scale.
Viewing a single thread could require five hundred associations.
"This was a product decision. Currently view counts are purely cosmetic, but we did not want to rule out the possibility of them being used in ranking in the future. As such, building in some degree of abuse protection made sense (e.g. someone can't just sit on a page refreshing to make the view number go up). I am fully expecting us to tweak this time window (and the duplication heuristics in general) in future, especially as the way that users interact with content will change as Reddit evolves."
I assume most of the visits are from normal users, not spammers. If a normal user already has the cookie for the post set, it's a page refresh, so don't increase the counter.
Identifying sophisticated spammers accurately is more complicated though. You can't rely on any client side info (user agent, cookies, browser history, screen resolution, OS, etc) because they can all be modified. You can't rely on IP address either, because there are public hotspots used by genuine users also. I think their spam detector is more complicated than this and they have to use it for HLL also.
So, for the genuine users, a counter increased based on the cookie mechanism would've worked just fine.
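A minimal sketch of that cookie check (hypothetical Flask handler, with an in-memory dict standing in for the real counter store):

```python
from flask import Flask, request, make_response

app = Flask(__name__)
view_counts = {}  # post_id -> count; stand-in for a real counter store

@app.route("/post/<post_id>")
def view_post(post_id):
    cookie_name = f"viewed_{post_id}"
    resp = make_response(f"post {post_id}")
    if cookie_name not in request.cookies:
        # First visit from this browser: count it and set the cookie.
        view_counts[post_id] = view_counts.get(post_id, 0) + 1
        resp.set_cookie(cookie_name, "1", max_age=3600)  # 1-hour window
    return resp
```

Per the caveat above about client-side info being modifiable, this only deduplicates honest browsers: a bot just clears its cookies.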
That's not a safe assumption. If spam prevention controls aren't working well enough, bots start to outnumber humans very quickly.
Indeed, one of the main reasons for this architecture is to measure the difference between what this pipeline counts and what the cookie method gives.
The primary motivation on their part for using HLL is given in the intro.
>In order to maintain an exact count in real time we would need to know whether or not a specific user visited the post before. To know that information, we would need to store the set of users who had previously visited each post, and then check that set every time we processed a new view on a post. A naive implementation of this solution would be to store the unique user set as a hash table in memory, with the post ID as the key.
>This approach works well for less trafficked posts, but is very difficult to scale once a post becomes popular and the number of viewers rapidly increases. Several popular posts have over one million unique viewers! On posts like these, it becomes extremely taxing on both memory and CPU to store all the IDs and do frequent lookups into the set to see if someone has already visited before.
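The naive approach the post describes is basically this, which is why memory grows linearly with unique viewers:

```python
# Naive exact counting: one set of user ids per post.
seen_by_post: dict[str, set[str]] = {}

def record_view(post_id: str, user_id: str) -> None:
    seen_by_post.setdefault(post_id, set()).add(user_id)

def view_count(post_id: str) -> int:
    return len(seen_by_post.get(post_id, set()))
```

At a million unique viewers per popular post, each set holds a million ids. Redis's HLL implementation, for comparison, caps at 12 KB per key with a standard error of about 0.81%.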
I think they want to break it into different services for (whatever) reasons.
Running a counter across a sharded in-memory cache (like Redis) for something like post views should be ok, but they might have some weird rule-sets for spam detection that are slow.
How many posts can a user visit in, let's say, an hour?
They said they want to avoid increasing the views when a user refreshes the page within a short interval.
This is done on purpose, to prevent bots from calculating exact post/comment scores.
They always fuzz equally relative to the actual score.
So, instead of monitoring once from another browser, you monitor from 100 browsers, and calculate the average before and after your bot votes.
Fuzzing the votes like this is cheap for reddit, but makes it more expensive for the spammer.
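The averaging attack described above is easy to sketch (hypothetical fuzzing model, zero-mean noise added to the true score; Reddit's actual scheme is not public):

```python
import random

def fuzzed_score(true_score: int, spread: int = 5) -> int:
    # Hypothetical model: displayed score = true score + zero-mean noise.
    return true_score + random.randint(-spread, spread)

true_score = 120
samples = [fuzzed_score(true_score) for _ in range(100)]  # 100 "browsers"
estimate = sum(samples) / len(samples)
print(estimate)  # converges on 120 as the sample count grows
```

Each sample needs its own logged-out session and IP to look independent, which is exactly the cost the fuzzing imposes on the spammer.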
The best spam filter is tricked by a hand-crafted, personalized email that is well targeted. However, the cost of these is insane compared to mailing the same thing to hundreds or thousands of users. So all any spam filter ever does is make spamming less lucrative for the spammer, and fuzzing does that.
That said, reddit's upvote counters in particular are stored in Postgres, not Cassandra.
I have two related questions:
1. I assume the process which reads from Cassandra and puts it back into Redis is parallelized, if not even distributed. How do you ensure correctness? Implementing 2PC seems like extreme overhead. Or do you lock in Redis?
2. What database is used to actually store the view counts? Cassandra's counters are afaik not very reliable...
2. We have HLLs in Redis, so we just issue a PFCOUNT and store the result of that in Cassandra as an integer value. We don't use counters in Cassandra.
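The Redis side of that looks roughly like this (redis-py, with hypothetical key names; the Cassandra write is elided):

```python
import redis

r = redis.Redis()

def record_view(post_id: str, visitor_id: str) -> None:
    # PFADD inserts the visitor into the post's HLL (creating it if absent).
    r.pfadd(f"views:{post_id}", visitor_id)

def estimated_views(post_id: str) -> int:
    # PFCOUNT returns the HLL's cardinality estimate as an integer --
    # this is the value that then gets persisted to Cassandra.
    return r.pfcount(f"views:{post_id}")
```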
As they are now, reddit and HN require you to be there when something is "current", and that's ultimately not that much better than TV. Yeah, sometimes you can get something out of it, but it's nowhere near as useful or deep. Even worse, there is a tendency for semi-related or unrelated stuff to come out when something slightly similar is discussed, instead of putting things exactly where they "belong", where they add to a useful corpus. [which is exactly what I'm doing right now, and am doing too often.. but I honestly would prefer the alternative, which doesn't exist yet]
Just think about it, we have practically infinite storage, there's certainly plenty of expertise in reach of HN -- yet discussions get constantly restrained because our attention is limited. But it's limited because all we get are these flat lists, X items per page, next to no means to curate or organize anything. Imagine how much worse it would be without people who sometimes link to old but very relevant and insightful stories or individual comments. I'm very grateful to them, but to take them for granted, to rely on them for "structure" is totally stone age to me. Give us tools, give us transparency, let us figure out things even if they might be hard - stop trying to hand hold people you might underestimate! We don't need you as much as you need us.
To be honest, I think that a change in behavior needs to come from the website, top down, rather than allowing the community to opt in to an alternate scheme that will never hit the popularity threshold necessary to work. Like a sibling mentioned, Reddit has "newest", but because it's not on by default, no one bothers to use it.
I don't think that it's our attention spans that cause our browsing behaviors on aggregator sites - I think it's purely derived from the UI/UX of the site, and that if the site kept interesting conversations up longer, people would participate and have more than the surface-level reflexive remarks we see here. This is not just a hypothesis - we can see the alternate behavior borne out on forum sites. I think it's a huge shame that HN has opted for this model (especially when it wants to be "the better news aggregator"). Think how many great conversations never happened here because the algorithm prematurely killed the thread - due to lack of upvotes or something else.
While sdrothrock is correct that these sites don't have much incentive to produce quality discussion, I do agree with you that there's value in seeing what gets votes as opposed to what gathers replies (not that the two are mutually exclusive, and each is plagued with its own problems). I wonder if it's worth trying a system that marries the two elements. Take a conventional BBS, for example. They often have a reputation system that in the best case encourages quality contributions and avoids posts only consisting of "good post, I agree." However, they can't be used to tell from the index what discussions are worth reading. You could try highlighting the top-voted comments beneath or side by side with their respective links to the OP. Then you can infer what kind of discussion is taking place. Is it a bunch of metajokes or did someone just unload a lot of expert knowledge?
For news aggregators to facilitate enduring discussion is trickier as the nature of news is to remain current. Maybe if a link would otherwise decay but discussion is ongoing, the news entry on the index could take a backseat to its most 'active' threads. Indented new lines underneath the post that link to those threads, "The conversation is still going. Click to expand" or something like that.
 Perhaps highlighting only a number of posts proportional to the total number of comments or only the posts above a minimum score that is proportional to the total number of comments (just so you don't get highlighted for being one of the first responders). It's also probably important to not be able to vote on the post from a preview, so people have to click into the thread and hopefully read it and possibly contribute.
One thing I don't think people have addressed is that that's the goal for them; it's not necessarily to be "useful" to the end user. If they can make users feel like they have to be on reddit 24/7 to stay "current" and not miss anything, then they will. Facebook does the same thing.
The big subreddits' comment sections are largely low quality anyway.
Traditional forums have their downsides (anyone remember super nested quote trains?), but I still find them superior to upvote + nested replies.
The best forum UI for me, though, is imageboards. Too bad they are associated with a less than popular community.
>> Nazar will then alter the event, adding a Boolean flag indicating whether or not it should be counted, before sending the event back to Kafka.
Why don't they just discard it instead of putting the event back to Kafka?
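For illustration, the annotate-and-forward pattern reads roughly like this (kafka-python, with hypothetical topic and field names, not Reddit's code):

```python
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("raw_views", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def should_count(event: dict) -> bool:
    """Stand-in for Nazar's rules (rate limits, suspicious patterns, ...)."""
    return True

for msg in consumer:
    event = json.loads(msg.value)
    # Annotate rather than drop: downstream consumers decide what the
    # flag means, and uncounted events remain visible in the stream.
    event["should_count"] = should_count(event)
    producer.send("annotated_views", json.dumps(event).encode())
```

One plausible reason to keep the event rather than discard it: other consumers may want the full stream, and flagged events can still be audited or re-scored later if the counting rules change.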