*ProfDrMorph*:
So that means all posts in all subreddits (when browsing
'hot') are sorted this way:
1. all posts with more upvotes than downvotes, with the order determined by age (newer posts are preferred) and popularity
2. all posts with the same number of up- and downvotes, in whatever order the database returns them
3. all posts with fewer upvotes than downvotes, with the order determined by age (older posts are preferred) and popularity (posts with a lot more downvotes are preferred)
Because that's what the _hot() function implies if the
sorting algorithm uses it as a 'key'.
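For reference, here is a minimal Python sketch of the function in question, modelled on the open-sourced Pyrex `_hot()` quoted around this thread (`score` is assumed here to be the simple net score):

    from math import log10

    def score(ups, downs):
        return ups - downs

    def _hot(ups, downs, date_seconds):
        # date_seconds: seconds between the post's submission time and
        # reddit's epoch offset (1134028003 in the real code)
        s = score(ups, downs)
        order = log10(max(abs(s), 1))
        sign = 1 if s > 0 else -1 if s < 0 else 0
        # Because `sign` multiplies the age term, positive posts get a huge
        # positive age bonus, zero-score posts all sit at exactly 0, and
        # negative posts get a huge negative age penalty: the three tiers
        # described above.
        return round(order + sign * date_seconds / 45000, 7)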
*ketralnis*:
Yes, that's accurate.
I'm glad you found that thread and got to the top with it. I hate trying to dive into ongoing conversations and change people's minds when the hivemind has already made its decision. This comes up every 6 months or so, always with some sensational title like this.
It feels a little weird quoting myself, but I also said:
> The thing is, the two most important pages are the front page (or a subreddit's own hot page) and the new page. The new page is sorted by date ignoring hotness, and if something has a negative score it's not going to show up on the front/hot page anyway. The two other main opportunities to get popular (rising and the organic box) don't really use hotness either.
> So when it comes down to it, what happens below 0 is pretty moot. Smoothness around the real life dates and scores on the site is more important than smoothness around 0, where we don't really have listings that will display it anyway.
In summary, there don't exist listings in which the discontinuities at 0 really matter
That's begging the question. Posts with negative scores are completely banned from the front/hot page because of this bug/feature/discontinuity. You can't justify it with itself.
What you can say is that you want posts to disappear from the hot page as soon as they go to -1, in which case I'll say that it's more than a little weird for the first voter to hold so much power.
No, it wouldn't. Do you really think you're the first person to try to game the system like that? You know who else tries that? Spammers who think that downvoting every submission but their own actually works.
But for the particular problem, it would be a solution, provided Reddit did not have any mechanisms in place to prevent that exact thing. And it would be stupid to assume you do not.
I have no interest in doing anything like that; I browse Reddit a lot.
PS: Totally unrelated to this, but please look into the API returning a ton of HTTP 503/504 gateway timeouts. It's been happening to me across several servers in different regions of the world.
> Totally unrelated to this, but please look into the API returning a ton of HTTP 503/504 gateway timeouts. It's been happening to me across several servers in different regions of the world.
Ironically, that's probably the rate limiting blocking you. Are you hitting the API more often than once every 30 seconds?
I have a cron job that looks at comments in a thread about every 2-5 minutes. At most it should be making a call to find a subreddit and then a call to find the comments.
> [The bug is okay because] if something has a negative score it's not going to show up on the front/hot page anyway.
I...what? This is just wrong. If it were not for the bug, then many posts with a negative score would show up on hot pages. Due to the bug, many posts which would otherwise show up do not. The bug is changing how things are working, and it is doing so in a way which has clear impacts on hot pages.
> In summary, there don't exist listings in which the discontinuities at 0 really matter
To the extent this is true, it is true only because there is a bug in the code that hides posts with negative points from the hot pages. What you are saying is that it doesn't matter if posts with negative points are shown on hot, because posts with negative points are not shown on hot.
...I hesitate to even ask this, but, well: Do you actually understand the bug, and the impact it has on how posts are sorted? Because your attempts at explaining it make no sense. The reason negative posts don't show up on the front page is because of this bug. That's why the bug has some (slight) importance; it is not the reason why the bug has no impact. It does have an impact.
> The new page is sorted by date ignoring hotness, and if something has a negative score it's not going to show up on the front/hot page anyway.
The key word there is "anyway". We're discussing the code that makes them not show up on the hot pages, so the word "anyway" makes no sense. If your theory were true, I would expect him to say: "This code is important because if something has a negative score then we don't want it to show up on the hot page", but he doesn't; rather he treats negative articles not showing up as a law of nature. Paraphrasing, he says "This code is unimportant because it doesn't do anything, because the articles wouldn't have shown up anyway." But it does do something, and they would have shown up.
My theory is that the confusion comes from him saying "front/hot" page. Everything he said is true about the front page, and the front and hot pages use the same algorithm. But the same algorithm applied to different data yields very different results, and everything he said is blatantly false when discussing the hot page of small, low-traffic subreddits.
In short, I think he's trying to say "hey, negative articles won't show up on the default front page no matter what, so what are you talking about?". And the response is "yes, but it has a huge impact everywhere". (Notably: The hot page of small subreddits, as well as some customized front pages and multireddits.) You can't conflate front/hot the way he does, because the bug is ONLY shown when _hot() is called on a low-traffic source.
> What I got from this, is that articles under 0 MUST NOT show up on the Hot page.
And the counterexample is given above: a not-very-active subreddit, one new post in the last day, with a score of -1 after one vote. Are you absolutely sure that this post MUST NOT show up on the Hot page?
You might argue that it is not a software bug but a mistake or odd choice in the design thought process. The way that it seems to fall accidentally and obscurely from the implementation details argues strongly against this.
> In summary, there don't exist listings in which the discontinuities at 0 really matter
This is not true.
I am active in a small (local-area) subreddit, and sometimes a post there totally disappears from the "hot" listing. Not only from page 1; it cannot be found on pages 2 or 3 either.
When there is a recent post with 4 upvotes at the top of the hot list, a recent post with -1 votes would still deserve to rank higher than a week-old post with a small positive vote score, don't you think?
I was really wondering whether the mods were removing posts that quickly got 2 downvotes. But this bug explains my observations.
But there is! It's called "Not-that-active subreddits".
A post at -1 from yesterday, in a subreddit with only 1 post a day, is completely deserving of being on the front page of that sub (top 30) posts. If all negative ones are banished from the top... that's very bad.
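To put rough numbers on how complete that banishment is, here is a sketch using the formula as quoted elsewhere in the thread (the timestamps and the `hot_typo` name are invented for illustration):

    from math import log10

    REDDIT_EPOCH_OFFSET = 1134028003  # constant from the hot formula

    def hot_typo(s, posted_at):
        order = log10(max(abs(s), 1))
        sign = 1 if s > 0 else -1 if s < 0 else 0
        seconds = posted_at - REDDIT_EPOCH_OFFSET
        return round(order + sign * seconds / 45000, 7)

    now = 1386000000  # an arbitrary moment in late 2013
    day, week = 86400, 604800

    print(hot_typo(-1, now - day))   # about -5597: below every positive post ever
    print(hot_typo(+4, now - week))  # about +5586: comfortably on the page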
By definition, not many people use any particular minimal-activity subreddit. But there are a lot of subreddits, so it's entirely possible that many people use at least one minimal-activity subreddit.
The problem occurs on low-traffic subs. The front-page of a low-traffic sub will often have the 1s listed right on the front - new stuff ready for upvoting.
All it takes is one downvote to kick those things off the front. These low-traffic subs don't have enough users to have their own Knights of New constantly patrolling the /new view, so the Hot page is pretty much it.
Basically, it becomes a job of a moderator to constantly check /new to rescue anything that suffered a downvote infanticide.
And it has never occurred to the reddit developers to document the behaviour in the code base? Regardless of whether there’s a bug in the implementation, that is one poorly-written piece of code.
The best justification anyone can come up with for not fixing the bug is an excuse that it probably isn't that important? Huh? Even if it only affects 1% of users, why not fix it? It's only 2 characters, and there have apparently been pull requests for ready-made fixes already. Why not just fix it, make Reddit a tiny bit better, and not have people complaining about it every 6 months or whatever.
How does it make any sense to multiply the date by the sign of the score? It's true that it only matters when the score is negative, and so it doesn't really affect much in practice, but the code is clearly a typo.
Looking at that code I find it really hard to believe that is the intended formula. It's clearly an attempt at logarithmic magnitude in either direction with a time bonus added in. The fact that the typo usually doesn't affect things much is almost guaranteed - if it made results obviously bananas it would have been found sooner.
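Concretely, the two-character difference in question looks like this (calling the second form "intended" is the thread's inference, not something reddit has confirmed):

    def hot_typo(order, sign, seconds):
        # What the Pyrex code computes: `sign` flips the *age* term, so one
        # net downvote hurls a post below every zero/positive post of all time.
        return round(order + sign * seconds / 45000, 7)

    def hot_intended(order, sign, seconds):
        # The reading the thread suspects was intended: `sign` flips only the
        # *magnitude* term, and age keeps counting forward for every post.
        return round(sign * order + seconds / 45000, 7)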
I doubt ketralnis is missing the issue. Instead, let's assume reddit's devs know what's what and think of a why.
My guess is they want to bury spam fast. Spam in old posts will have been deleted, and thus old posts are more trustworthy. Voting brigades are a smaller risk than the constant flood of spam.
I believe they have other, closed-source, anti-spam measures. This might be the reason, but if it is, 'tis a silly reason. If I understand correctly, it means that if 75% of people like a post and 25% dislike it, the post has a 25% chance of never taking off, simply because the first vote came up negative. Overall, this measure is effectively lowering the quality of all content on the site.
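A quick Monte Carlo sketch of that dynamic (all numbers hypothetical): the chance that the running score ever dips below zero is actually a bit worse than the 25% first-vote chance, since any early run of downvotes also buries the post; the gambler's-ruin value for a 75/25 split is (1-p)/p = 1/3.

    import random

    def ever_negative(p_up=0.75, votes=200):
        # Follow a post's running score; under the buggy sort it is
        # effectively buried the first time the score goes below zero.
        s = 0
        for _ in range(votes):
            s += 1 if random.random() < p_up else -1
            if s < 0:
                return True
        return False

    trials = 100000
    print(sum(ever_negative() for _ in range(trials)) / trials)  # roughly 1/3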
No, spam doesn't have anything to do with it. And actually, older (high-scoring, anyway) links are more likely to have spam comments because the moderators aren't hanging around there anymore.
Maybe you misread; the quote shows it is the intended behavior. They don't care about the number of votes, only whether it is more up, equal, or more down. After that they care about age more. The article wants them to base it more on the number of votes. You can tell from the parts about returning in whatever order the DB wants that they are going for speed and scalability as well. So more-downvoted posts being ordered a certain way may not even matter to them, as long as those posts are off the page.
The least hot posts, according to reddit, are the ones people fight over, while the hottest posts are the ones that approach unanimity. What's wrong with that?
It adds to the hive mind effect you already have with online communities. This algorithm explains frustrations I've had in the past. In 4 years on the site, I've probably submitted 20 times, but the realization that I'd have to try submitting multiple times to get a fair shot at discussion makes it unpalatable for a user that only submits on occasion. Sure, any system can be gamed, but Reddit favors unity over diversity and karma whores over Regular Joe.
So, a massively, massively popular site that makes its business by ranking the user-generated content on it by importance... is wrong.
> Maybe there is no moral. Reddit screwed up.
...or maybe, they know what they're doing.
Maybe not. ...but when you supply a bugfix, the onus is on the submitter to demonstrate that 1) the fix fixes the problem and 2) that it doesn't break anything else.
It would appear that no effort has been made at (2), to demonstrate that the proposed change would not have an adverse effect on other high-vote rankings.
To be fair, it would have been nice to see the pull request response (https://github.com/reddit/reddit/pull/583) mention that an alternative algorithm choice would have to be demonstrably better in a large scale analysis before they would even dream of changing their core ranking algorithm, but it's not unfair for them to take that stance.
It's like asking Google to change their page rank algorithm because you don't like it.
Google changes PageRank all the time, so clearly they aren't terrified of changing it. Presumably they have ways, such as testing methods, to mitigate the risks. If Reddit doesn't, then they have a much bigger long-term problem on their hands than this one glitch.
> the onus is on the submitter to demonstrate that 1) the fix fixes the problem and 2) that it doesn't break anything else.
No. The onus is on Reddit's test suite which, ostensibly, would cover voting (one of the core features/functionality of the site!) to demonstrate this. Or are you suggesting that he didn't run the full build?
tl;dr: Posts whose net score ever becomes negative essentially vanish permanently due to a quirk in the algorithm. So an attacker can disappear posts he doesn't like by constantly watching the "New" page and downvoting them as soon as they appear.
Maybe not the /all/new page, but I pretty much only visit /new on the subs I frequent. It's a much better way of using smaller subs, from my perspective. Lots of things never make it fully to the front page of a smaller sub. The drawback is that posts which get caught in a filter appear sorted by their submission date and not the mod-approval date, so you can miss things on /new that might make it to /hot.
This is the right answer. I do this too, because when everything is <10 votes with a few outliers, /new looks close to /hot, but there are more submissions. For subs where the front page changes not hourly, but weekly, it's the only way to get a little freshness.
Lots of people visit /new. They're called "knights of new". The total number of these people is probably far lower than the front page (where _hot is used), but it's how every submission gets its start.
Lots do. There are at least two reasons I can think of:
1. If you are a regular to a sub, new will act like an RSS feed. You know what to read and where to stop.
2. If you believe in reddit karma, your comments have a better chance of being recognised. By the time a post hits the front page, it is more or less a comment muddle.
... And the fix seems so minor, and the bug reporters are providing a patch for it, and they spent more energy arguing against the patch than including it. Classic example of developers not wanting to admit that a bug is a bug, even when fixing it would be easier than arguing not to.
"I found a recent post in a fairly inactive subreddit and downvoted it, bringing its total vote score negative. Sure enough, that post not only dropped off the first page (a first page which contained month-old submissions), but it was effectively banished from the “Hot” ranking entirely. I felt bad and removed my downvote, but that post never really recovered...
While testing, I noticed a number of odd phenomena surrounding Reddit’s vote scores. Scores would often fluctuate each time I refreshed the page, even on old posts in low-activity subreddits. I suspect they have something more going on, perhaps at the infrastructure level: a load balancer, or caching issues."
This is partially due to vote fuzzing. More to the point, votes go into a queue and the removal of the downvote might not cancel out the previous action for some time.
As a result, this suggested flaw will supposedly let somebody snipe puffins from the new page of a small birdwatching subreddit before they ever get a fair shake. I think anybody who attempted this sort of manipulation at scale would find it an ineffective strategy; there have been (and probably constantly are) attempts to game Reddit before, and this seems like an excellent honeypot.
Outside a narrow set of circumstances and a very small time window the flaw disappears, and if you try to abuse it you'll stick out like a sore thumb.
The true horror expressed in the OP is that the ordering of posts in the purgatory is not strictly logical - the post ranked 10042 should really be ranked 10041. Gasp. Twitch.
This is a very lovable brand of OCD to my eyes. :)
I think you'd be surprised how much the whole process works on a knife edge. It's perfectly plausible that something that ends up on the front page gets only a handful of upvotes in the first twenty minutes, or half an hour. Moreover, once an article has been submitted, you're not supposed to resubmit it, and many moderators will remove duplicates.
I'm actually involved in moderating a fairly large subreddit, and we have periodic waves of neo-Nazi posters gaming the subreddit, and they are surprisingly effective at altering the general mood. You can also see some genuinely shocking opinions as top posts on r/worldnews. These are subreddits with hundreds of thousands of daily visitors. If reddit is operating a system which can easily be gamed, it matters a lot.
In this case, with enough proxy accounts and a modicum of programming experience, you could anonymously suppress stories you don't like, with some ease. Do you not think that matters?
You're essentially describing the equivalent of online fascism; neo-Nazi downvoting brigades sound suspiciously close to meatspace Greece at present.
Any system that mimics democracy, even with active moderators, will succumb to a large enough minority of troublemakers. If they really are a marginalized group that does not represent a significant percentage of the community, then, even with all the tricks and manually operated puppet accounts and all the real-world parallels, they will remain marginalized. If things turn dark that easily, one sadly suspects it has more to do with a flaw in the algorithm of the people than in the system.
As for programmatically doing what you claim, that hasn't been demonstrated. I'm pretty sure spammers have even more incentive and resources, and yet the volume of spam is still manageable.
It just shows the old principle that small, organised groups can impose preferences on a disorganised majority. It's a predictable phenomenon in collective decision-making systems. If I understand the economists correctly, it can't really be "solved".
The bigger danger is that it makes the whole community subconsciously downvote happy, because sometimes it's more effective to tune the site to what you want by downvoting things you don't like than by upvoting things you like.
If people are downvoting everything that doesn't fit their expectations, it creates a lot of cultural inertia.
(Eh, that's the worst impact I can come up with, and it's probably still not too big a deal.)
After thinking about it a bit, I think I disagree. Your fixed algorithm would change the meaning of controversy to depend too much on the number of votes (popularity), and too little on the fact that there is an even debate on either side of the article's subject. A post split 500/500 is certainly more controversial than one split 5000/750; the latter has just been seen by more people.
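For what it's worth, a balance-weighted score captures that intuition. The sketch below is of the shape reddit eventually adopted, as I understand it, written from memory rather than quoted from their code:

    def controversy(ups, downs):
        # No controversy unless both sides actually voted.
        if ups <= 0 or downs <= 0:
            return 0
        magnitude = ups + downs
        # balance is 1.0 for an even split, approaching 0 for a landslide.
        balance = downs / ups if ups > downs else ups / downs
        return magnitude ** balance

    print(controversy(500, 500))   # 1000.0: an even split is maximally weighted
    print(controversy(5000, 750))  # about 3.7: more voters, far less controversy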
The question (both here and above the reddit comment I linked to) was effectively, "Isn't it pointless to spend time working on the controversy sort, because very few redditors ever sort by it?", and my reply is effectively, "No, it isn't pointless, because if it worked well, it could be used to surface controversial posts on the default view."
What? Take a moment to convince yourself that sorting by sqrt(f(up,down)) will produce the same ranking as sorting by just f(up,down) (provided that f(up,down) is non-negative, as this particular function is).
Consider a set of points in the X/Y plane. If you want to find the one closest to the origin, you don't have to find min(sqrt(x^2+y^2)), only min(x^2+y^2), which is much cheaper.
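A quick self-contained check of that claim, for illustration:

    import random

    points = [(random.uniform(-10, 10), random.uniform(-10, 10)) for _ in range(1000)]

    # sqrt is strictly increasing on [0, inf), so both keys induce the same order.
    by_squared  = sorted(points, key=lambda p: p[0] ** 2 + p[1] ** 2)
    by_distance = sorted(points, key=lambda p: (p[0] ** 2 + p[1] ** 2) ** 0.5)
    assert by_squared == by_distance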
I really, really, really wish there was a website that broke down code like this into explained text. I can grok a lot of code regardless of language somewhat intuitively, because there is so much crossover, but I still often have issues when it comes to breaking down complex and unique segments.
This would really help the learning process, but I appreciate how time-intensive it is.
Have a look at literate programming (I think that's what it's called). It's not exactly what you're looking for (in that the project must be done that way from the beginning) but you'll find it interesting I think!
That is what it's called. Here is a site [1] with some examples. Donald Knuth developed the term and concept. It seems fairly popular in the Haskell community, and to a lesser extent the Scheme/Racket communities.
Yes, that would be a great place to find more like this. CoffeeScript has added support for literate programming, so that might be a good place to go for modern, web-oriented code samples.
This is a good example of where something that is fundamentally flawed becomes accepted and popular and then a huge amount of effort goes into rationalizing it.
Which goes to show you that things are the way they are not because that's the way things should be, but just because that's the way things are. Which is a very stupid way to run things, but that is the way our 'society' works.
1. If newer material has already attracted the same number of negative votes in a shorter period than another post did over a longer period, the first is worse. Push it down.
2. If people suddenly started hating something very much, that might mean the content is hot and attracts a lot of attention. So pull it up.
"thinking out of the... emm, where is my box???"
Imagine two submissions, submitted 5 seconds apart. Each receives two downvotes. `seconds` is larger for the newer submission, but because of the negative `sign`, the newer submission is actually rated lower than the older submission.
Imagine two more submissions, submitted at exactly the same time. One receives 10 downvotes, the other 5 downvotes. `seconds` is the same for both, `sign` is -1 for both, but `order` is higher for the -10 submission. So it actually ranks higher than the -5 submission, even though people hate it twice as much.
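Plugging both thought experiments into a sketch of the formula (the timestamps and the `hot_typo` name are arbitrary):

    from math import log10

    def hot_typo(s, seconds):
        order = log10(max(abs(s), 1))
        sign = 1 if s > 0 else -1 if s < 0 else 0
        return round(order + sign * seconds / 45000, 7)

    # Two posts 5 seconds apart, both at -2: the newer one rates *lower*.
    print(hot_typo(-2, 1000000), hot_typo(-2, 1000005))

    # Two simultaneous posts at -10 and -5: the -10 post rates *higher*,
    # because the larger `order` is added back against the negative age term.
    print(hot_typo(-10, 1000000), hot_typo(-5, 1000000))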
This got downvoted but I think it is essentially the right answer. I don't see how Reddit's success is based on the amazing efficacy of its algorithm. For example, I don't think fixing this bug and using the fixed code to launch a competing site would beat Reddit.
Its success is based on attracting an engaged audience, who participated heavily, in turn attracting a larger audience, whose participation further attracted even more people... etc.
The algorithm may have mattered very early, in the beginning, when it was first attracting people who were evaluating it for the first time. But even then, I think that the content that Reddit's staff continuously posted was a bigger factor than the algorithm.
And of course, if you would like more articles written by me and an extremely high signal-to-noise ratio (because I post so rarely...), consider subscribing: http://technotes.iangreenleaf.com. RSS is not dead, dammit.
One argument for the proposition that this behavior was intended: if the purpose of `sign` was just to apply the sign to `order`, then the separate variable was entirely unnecessary, and they could have just done something like this:
    order = log(max(abs(s), 1)) * ((s) / max(abs(s), 1))
I'd prefer to give them the benefit of the doubt, especially given their previous responses.
That's not a very good argument - they could just as easily have calculated sign as ((s) / max(abs(s), 1)) if this was intentional, so the fact they didn't probably just means whoever wrote the code didn't think of that trick.
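For what it's worth, the trick does check out for integer scores:

    def sign_via_division(s):
        # For |s| >= 1 this is s/|s| = +/-1; for s == 0 the clamp gives 0/1 = 0.
        # (It breaks for fractional scores with 0 < |s| < 1, but reddit's net
        # scores are whole numbers.)
        return s / max(abs(s), 1)

    assert all(sign_via_division(s) == (s > 0) - (s < 0) for s in range(-100, 101))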
I've been very interested in this problem. I ran a community for a few months that wound up being quite popular (40k uniques a day before I closed it down). My attempt to address this problem was to have a min and max time on the front page, but, most interestingly, to count the number of responses as well. Even if an opinion wasn't popular, if it got a response out of people it would stay around longer instead of dropping off quickly.
I prioritised community engagement over the community's quality of content. This turned out to be a slightly more effective way of ranking content.
Interestingly enough, the HN ranking algorithm takes a bit of an opposite stance, punishing posts that generate too much discussion in proportion to upvotes. Both approaches are valid, depending on what your goals for the community are.
HN's system would rather quickly derank a post that is potentially inflammatory and keep good content on top rather than using comments as a heuristic for community involvement.
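The widely cited shape of HN's ranker is `(points - 1) / (age_hours + 2)^1.8`, with various penalties layered on top; the comment penalty sketched below is a guess at the mechanism described above, not HN's actual code:

    def hn_rank(points, comments, age_hours, gravity=1.8):
        # Widely cited base formula: votes decay against age.
        base = (points - 1) / (age_hours + 2) ** gravity
        # Hypothetical penalty: once a story draws more comments than votes,
        # damp its rank in proportion (too much heat, not enough light).
        if comments > points > 0:
            base *= points / comments
        return base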
    s = score(ups, downs)
    order = log10(max(abs(s), 1))
and the poster says that "order will always be positive". But that isn't true. It is the logarithm of a number in (0,1], and so is negative or zero. Since we cut the value off at 1 I assume that the score function does something to the votes beyond (ups - downs), scaling the value in a way that makes the logarithm of the score interesting.
`ups` and `downs` are whole numbers, so `abs(s)` will usually be >= 1, like 2654 or something. The log is there to reduce the influence of additional votes after a certain level of popularity. See footnote ^2: http://technotes.iangreenleaf.com/posts/2013-12-09-reddits-e...
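A few values make the dampening obvious:

    from math import log10

    for s in (1, 10, 100, 1000, 10000):
        print(s, log10(max(abs(s), 1)))
    # 1 -> 0.0, 10 -> 1.0, 100 -> 2.0, ...: each additional point of `order`
    # costs ten times as many net votes, so votes 11 through 100 together move
    # the score no more than votes 1 through 10 did.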
The blog sketches out a corner case that maybe isn't handled well, but posts with net negative votes probably aren't "hot", and I'm pretty sure they have mechanisms in there to make sure that bad voters are at least eventually ignored.
I don't think anyone really cares how submissions are ranked once they are obviously net negative. The concern is for the first few votes; a submission should not be effectively discarded because the first person to look at it did not like it.
> And notably, they are sorted oldest first, just as I predicted.
This bit is actually misstated. Those posts all have a comparison value of 0 (assuming score is simplistic), and are not affected by the oldest-first ranking of negative submissions. The ordering here is likely insertion order, which just happens to be the same as oldest-first.
Just goes to prove you only need to get the parts users care about right. Treatment of some negative score posts just isn't too important and may even help remove spam ASAP at the cost of some good posts. If they had sacrificed some other aspect of the site to get this right, they probably would have been worse off.
The first downvote pushes the submission down below every positive submission ever, not just below the last 12 hours' worth. It can only recover if upvoted from other views.
Good enough to be plausible to a casual user is not good enough to prevent the algorithm from being damaging. For a lot of people, reddit is their only source of news, if that can be easily gamed, it has serious consequences. Developers have responsibilities beyond increasing traffic.
The allegation there, from what I can see, was that someone had a bot controlling five accounts. That's enough to impact even the corrected version of this ranking, and as such is only moderately relevant.
Nope. This is just survivorship bias. You look at one successful product and you propose a rule that technical implementation doesn't matter. You need to look at all the dead startups which had a good product but not a good implementation (e.g., Friendster and other Facebook competitors which were ahead of the game but couldn't scale).
In reality, Reddit is successful by pure chance. In its initial days it was pretty much a barren wasteland for fringe people. Most people had written it off as another me-too without much differentiation, and Digg was the place to be. Then Digg screwed up, people wanted an alternative, and suddenly Reddit was the overnight lord of link submission, evolving into a discussion forum.
http://www.reddit.com/r/programming/comments/td4tz/reddits_a...
And a quotation for those not wanting to click: