
Reddit’s empire is founded on a flawed algorithm
http://technotes.iangreenleaf.com/posts/2013-12-09-reddits-empire-is-built-on-a-flawed-algorithm.html
======
aston
There's a Reddit thread between various users and ketralnis (one of the more
senior coders Reddit ever had) explaining the code here:

[http://www.reddit.com/r/programming/comments/td4tz/reddits_a...](http://www.reddit.com/r/programming/comments/td4tz/reddits_actual_story_ranking_algorithm_explained/)

And a quotation for those not wanting to click:

    
    
      *ProfDrMorph* 2 points 1 year ago
      So that means all posts in all subreddits (when browsing
      'hot') are sorted this way:
      1. all posts with more upvotes than downvotes with the
      order determined by age (newer posts are preferred) and
      popularity
      2. all posts with the same number of up- and downvotes in
      whatever order the database returns them
      3. all posts with less upvotes than downvotes with the
      order determined by age (older posts are preferred) and
      popularity (posts with a lot more downvotes are preferred)
      Because that's what the _hot() function implies if the
      sorting algorithm uses it as a 'key'.
    
      *ketralnis* 2 points 1 year ago
      Yes that's accurate
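
The `_hot()` function they're agreeing about is short. Here's a sketch in Python of the ranking as described in the article (constant names and values taken from the open-sourced r2 code; treat it as an illustration of the behavior being discussed, not the production implementation):

```python
from math import log10

def hot(ups: int, downs: int, date_seconds: float) -> float:
    """Sketch of the hot rank under discussion: order + sign * age term."""
    s = ups - downs                        # net score
    order = log10(max(abs(s), 1))          # popularity, log-dampened
    sign = 1 if s > 0 else (-1 if s < 0 else 0)
    seconds = date_seconds - 1134028003    # age relative to Reddit's epoch
    return round(order + sign * seconds / 45000, 7)

t = 1386000000.0  # some submission time
# 1. Positive posts: newer is preferred.
assert hot(10, 2, t + 3600) > hot(10, 2, t)
# 2. Zero-score posts: rank is 0 no matter how old they are.
assert hot(3, 3, t) == hot(3, 3, t + 3600) == 0.0
# 3. Negative posts: *older* is preferred.
assert hot(2, 10, t) > hot(2, 10, t + 3600)
```

The three assertions correspond exactly to ProfDrMorph's three tiers above.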

~~~
ketralnis
I'm glad you found that thread and got to the top with it; I hate trying to
dive into ongoing conversations and change people's minds when the
hivemind has already made its decision. This comes up every 6 months or so,
always with some sensational title like this.

It feels a little weird quoting myself, but I also said:

> The thing is, the two most important pages are the front page (or a
> subreddit's own hot page) and the new page. The new page is sorted by date
> ignoring hotness, and if something has a negative score it's not going to
> show up on the front/hot page anyway. The two other main opportunities to
> get popular (rising and the organic box) don't really use hotness either.

> So when it comes down to it, what happens below 0 is pretty moot. Smoothness
> around the real life dates and scores on the site is more important than
> smoothness around 0, where we don't really have listings that will display
> it anyway.

In summary, there don't exist listings in which the discontinuities at 0
really matter.

~~~
Camillo
That's begging the question. Posts with negative scores are completely banned
from the front/hot page _because_ of this bug/feature/discontinuity. You can't
justify it with itself.

What you can say is that you _want_ posts to disappear from the hot page as
soon as they go to -1, in which case I'll say that it's more than a little
weird for the first voter to hold so much power.

~~~
waylandsmithers
Maybe some civil disobedience will get their attention. What about a script
that automatically downvotes every new post?

~~~
aroch
That's a good way to get banned

~~~
flexd
A lot of accounts + a lot of different IPs would get around that.

~~~
jedberg
No, it wouldn't. Do you really think you're the first person to try and game
the system like that? You know who else tries that? Spammers who think that
down voting every submission but their own actually works.

~~~
flexd
I am not, and I have no plans to do that.

But for the particular problem, it would be a solution, provided Reddit did
not have any mechanics in place to prevent that exact thing. And it would be
stupid to assume you do not.

I have no interest in doing anything like it, I browse Reddit a lot.

PS: Totally unrelated to this, but please look into the API returning a ton of
HTTP 503/504 gateway timeouts. It's been happening to me across several
servers in different regions of the world.

~~~
jedberg
> Totally unrelated to this, but please look into the API returning a ton of
> HTTP 503/504 gateway timeouts. It's been happening to me across several
> servers in different regions of the world.

Ironically, that's probably the rate limiting blocking you. Are you hitting
the API more often than once every 30 seconds?

~~~
flexd
I have a cron job that looks at comments in a thread about every 2-5 minutes.
At most it should be making a call to find a subreddit and then a call to find
the comments.

------
shadowmint
So, a massively, massively popular site that makes its business by ranking the
user-generated content on it by importance... is wrong.

> Maybe there is no moral. Reddit screwed up.

...or maybe, they know what they're doing.

Maybe not. ...but when you supply a bugfix, the onus is on the submitter to
demonstrate that 1) the fix fixes the problem and 2) _that it doesn't break
anything else_.

It would appear that no effort has been made at (2), to demonstrate that the
proposed change would not have an adverse effect on other high-vote rankings.

To be fair, it would have been nice to see the pull request response
([https://github.com/reddit/reddit/pull/583](https://github.com/reddit/reddit/pull/583))
mention that an alternative algorithm choice would have to be demonstrably
better in a large scale analysis before they would even _dream_ of changing
their core ranking algorithm, but it's not unfair for them to take that
stance.

It's like asking Google to change their page rank algorithm because you don't
like it.

~~~
lectrick
> the onus is on the submitter to demonstrate that 1) the fix fixes the
> problem and 2) that it doesn't break anything else.

_No._ The onus is on _Reddit's test suite_ which, ostensibly, would cover
voting (one of the core features of the site!) to demonstrate
this. Or are you suggesting that he didn't run the full build?

~~~
shadowmint
Did /r2/tests suddenly get a hundred new awesome tests in the last 6 months?
It's pretty... bare. Or it was last time I looked.

Pull request -> should include tests if relevant ones don't already exist.

~~~
lectrick
> It's pretty... bare. Or it was last time I looked.

Well then, that's running with scissors as far as any modern legit open source
project is concerned. That should be the VERY FIRST thing rectified.

------
randomwalker
tl;dr: Posts whose net score ever becomes negative essentially vanish
permanently due to a quirk in the algorithm. So an attacker can disappear
posts he doesn't like by constantly watching the "New" page and downvoting
them as soon as they appear.

~~~
raldi
No, they just vanish from the _hot_ page, and it's not necessarily permanent.
They're still visible on /new, which allows for upvotes-of-resurrection.

~~~
yen223
But as the author pointed out, nobody visits /new.

~~~
atwebb
Maybe not the /all/new but I pretty much only visit /new on the subs I
frequent. It's a much better way of using smaller subs from my perspective.
Lots of things never make it fully to the front page of a smaller sub. The
drawback is that submissions which get caught in a filter appear to be sorted
by their submission date and not the mod-approval date, so you can miss things
on /new that might make it to /hot.

~~~
tripzilch
> Maybe not the /all/new but I pretty much only visit /new on the subs I
> frequent. It's a much better way of using smaller subs from my perspective.

And one of the reasons might be that sorting doesn't work quite properly on
smaller subreddits?

~~~
Semaphor
Even if this feature didn't exist, something I find interesting but most
people don't care about is still easier to find on new.

~~~
baudehlo
Not just easier, but actually only possible to find on /new.

------
recuter
So the gist of this is:

"I found a recent post in a fairly inactive subreddit and downvoted it,
bringing its total vote score negative. Sure enough, that post not only
dropped off the first page (a first page which contained month-old
submissions), but it was effectively banished from the “Hot” ranking entirely.
I felt bad and removed my downvote, but that post never really recovered...

While testing, I noticed a number of odd phenomena surrounding Reddit’s vote
scores. Scores would often fluctuate each time I refreshed the page, even on
old posts in low-activity subreddits. I suspect they have something more going
on, perhaps at the infrastructure level – a load balancer, perhaps, or caching
issues."

This is partially due to vote fuzzing. More to the point, votes go into a
queue and the removal of the downvote might not cancel out the previous action
for some time.

As a result, this suggested flaw will supposedly let somebody successfully
snipe puffins from the new page of a small birdwatching subreddit before
they ever get a fair shake. I think anybody who attempted this sort of
manipulation would find it an ineffective strategy; there have
been (and probably constantly are) attempts to game Reddit before, and this
seems like an excellent honeypot.

Beyond the narrow set of circumstances during a very small time window the
flaw disappears, yet if you try to abuse this you'll stick out like a sore
thumb.

The true horror expressed in the OP is that the ordering of posts in the
purgatory is not strictly logical - the post ranked 10042 should really be
ranked 10041. Gasp. Twitch.

This is a very lovable brand of OCD to my eyes. :)

~~~
Sous-Vir
I think you'd be surprised how much the whole process balances on a knife
edge. It's perfectly plausible that something that ends up on the front page
gets only a handful of upvotes in the first twenty minutes, or half an hour.
Moreover, once an article has been submitted, you're not supposed to resubmit
it, and many moderators will remove duplicates.

I'm actually involved in moderating a fairly large subreddit, and we have
periodic waves of neo-Nazi posters gaming the subreddit, and they are
surprisingly effective at altering the general mood. You can also see some
genuinely shocking opinions as top posts on r/worldnews. These are subreddits
with hundreds of thousands of daily visitors. If reddit is operating a system
which can easily be gamed, it matters a lot.

In this case, with enough proxy accounts and a modicum of programming
experience, you could anonymously suppress stories you don't like, with some
ease. Do you not think that matters?

~~~
recuter
You're essentially describing the equivalent of online fascism; neo-Nazi
downvoting brigades sound suspiciously close to present-day meatspace Greece.

Any system that mimics democracy, even with active moderators, will succumb to
a large enough minority of trouble makers. If they really are a _marginalized_
group that does not represent a _significant percentage_ of the community -
even with all the tricks and manual puppet accounts and all the real world
parallels - they will remain marginalized. If things turn dark that easily one
sadly suspects it has more to do with the flaw in the algorithm of the people
rather than the system.

As for programmatically doing what you claim, that hasn't been demonstrated.
I'm pretty sure spammers have even more incentive and resources, and yet the
volume of spam is still manageable.

~~~
jacques_chester
It just shows the old principle that small, organised groups can impose
preferences on a disorganised majority. It's a predictable phenomenon in
collective decision-making systems. If I understand the economists correctly,
it can't really be "solved".

------
raldi
The real flawed reddit algorithm is "controversy". It's basically:

    
    
      SORT ABS(ups - downs) ASCENDING
    

...which means something with 1000 upvotes and 999 downvotes will be
considered _less_ controversial than something with 2 upvotes and 2 downvotes.

A much better algorithm for controversy would be:

    
    
      SORT MIN(ups, downs) DESCENDING
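
Spelled out in Python (helper names are hypothetical, not from the codebase), the two rules order the example below oppositely:

```python
def controversy_abs(ups: int, downs: int) -> int:
    # SORT ABS(ups - downs) ASCENDING: smaller value = more controversial
    return abs(ups - downs)

def controversy_min(ups: int, downs: int) -> int:
    # SORT MIN(ups, downs) DESCENDING: larger value = more controversial
    return min(ups, downs)

# 1000 up / 999 down vs. 2 up / 2 down:
assert controversy_abs(1000, 999) > controversy_abs(2, 2)  # 1 vs 0: ranked *less* controversial
assert controversy_min(1000, 999) > controversy_min(2, 2)  # 999 vs 2: ranked *more* controversial
```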

~~~
rm999
Am I looking at the wrong code? I see

    
    
      SORT (ups + downs) / max(1, ABS(ups - downs))
    

which would give a way higher score to the 1000/999 case than 2/2 - 1999 vs 4.

[https://github.com/reddit/reddit/blob/master/r2/r2/lib/db/_s...](https://github.com/reddit/reddit/blob/master/r2/r2/lib/db/_sorts.pyx)

~~~
raldi
Hmm, perhaps I slightly misremember, but even that algorithm still scores
+2/-2 as more controversial (result=4) than +1000/-500 (result=3).

~~~
icedog
And that's right. It's a controversy measure without taking popularity into
consideration. That can be applied to the final score separately...

------
AndrewKemendo
I really really really wish there was a website that broke down code like this
into explained text. I can grok a lot of code regardless of language somewhat
intuitively because there is so much crossover - but I still have issues often
when it comes to breaking down complex and unique segments.

This would really help the learning process but I appreciate how time
intensive it is.

~~~
girvo
Have a look at literate programming (I think that's what it's called). It's
not exactly what you're looking for (in that the project must be done that way
from the beginning) but you'll find it interesting I think!

~~~
Jtsummers
That is what it's called. A site [1] has some examples. Donald Knuth
developed the term and the concept. It seems fairly popular in the Haskell
community, and to a lesser extent the Scheme/Racket communities.

[1]
[http://en.literateprograms.org/LiteratePrograms:Welcome](http://en.literateprograms.org/LiteratePrograms:Welcome)

------
ilaksh
This is a good example of where something that is fundamentally flawed becomes
accepted and popular and then a huge amount of effort goes into rationalizing
it.

Which goes to show you that things are the way they are not because that's the
way things should be, but just because that's the way things are. Which is a
very stupid way to run things, but that is the way our 'society' works.

------
maskoliunas
If I was a salesman, I would explain that easily:

1\. If newer material has already attracted the same number of negative votes
in a shorter period than another item did over a longer one, the first is
worse. Push it down.

2\. If people suddenly started hating something very much, that might mean the
content is hot and attracting a lot of attention. So pull it up.

"thinking out of the... emm, where is my box???"

Imagine two submissions, submitted 5 seconds apart. Each receives two
downvotes. _seconds_ is larger for the newer submission, but because of a
negative sign, the newer submission is actually rated lower than the older
submission.

Imagine two more submissions, submitted at exactly the same time. One receives
10 downvotes, the other 5 downvotes. _seconds_ is the same for both, _sign_ is
-1 for both, but _order_ is higher for the -10 submission. So it actually ranks
higher than the -5 submission, even though people hate it twice as much.
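
Both scenarios fall straight out of the formula. A minimal check, assuming the ranking form discussed in the article (order plus a signed age term; helper signature is mine):

```python
from math import log10

def hot(score: int, seconds: float) -> float:
    # sketch of the ranking under discussion: order + sign * seconds/45000
    order = log10(max(abs(score), 1))
    sign = 1 if score > 0 else (-1 if score < 0 else 0)
    return round(order + sign * seconds / 45000, 7)

t = 1_000_000.0
# Two submissions 5 seconds apart, each at -2: the newer one ranks lower.
assert hot(-2, t + 5) < hot(-2, t)
# Same age, -10 vs -5: the more-downvoted post ranks higher.
assert hot(-10, t) > hot(-5, t)
```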

------
jbigelow76
Reddit's algorithm is its community, the rest is just math.

~~~
snowwrestler
This got downvoted but I think it is essentially the right answer. I don't see
how Reddit's success is based on the amazing efficacy of its algorithm. For
example, I don't think fixing this bug and using the fixed code to launch a
competing site would beat Reddit.

Its success is based on attracting an engaged audience, who participated
heavily, in turn attracting a larger audience, whose participation further
attracted even more people... etc.

The algorithm may have mattered very early, in the beginning, when it was
first attracting people who were evaluating it for the first time. But even
then, I think that the content that Reddit's staff continuously posted was a
bigger factor than the algorithm.

------
youngian
Author here. I posted a quick follow-up with some corrections and other items
of interest that came out of the discussion:
[http://technotes.iangreenleaf.com/posts/2013-12-10-the-reddi...](http://technotes.iangreenleaf.com/posts/2013-12-10-the-reddit-algorithm-a-recap.html).

And of course, if you would like more articles written by me and an extremely
high signal-to-noise ratio (because I post so rarely...), consider
subscribing:
[http://technotes.iangreenleaf.com](http://technotes.iangreenleaf.com). RSS is
not dead, dammit.

------
dionidium
An argument for the proposition that this behavior was intended is that if the
purpose of _sign_ was to get the sign for _order_ , then it was actually
entirely unnecessary and they could have just done something like this:

    
    
      order = log(max(abs(s), 1)) * ((s) / max(abs(s), 1))
    

I'd prefer the benefit of the doubt, especially given their previous
responses.
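
For what it's worth, the two expressions do agree on integer scores; a quick sanity check (helper names are mine, not from the codebase):

```python
def sign_branch(s: int) -> int:
    # the branching version, as in the actual code
    return 1 if s > 0 else (-1 if s < 0 else 0)

def sign_trick(s: int) -> float:
    # the branch-free trick: s / max(abs(s), 1)
    return s / max(abs(s), 1)

for s in (-42, -1, 0, 1, 42):
    assert sign_trick(s) == sign_branch(s)
```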

~~~
makomk
That's not a very good argument - they could just as easily have calculated
sign as ((s) / max(abs(s), 1)) if this was intentional, so the fact they
didn't probably just means whoever wrote the code didn't think of that trick.

~~~
dionidium
That's a good point. Of course, I'm reasoning backward to justify something
that's already there, so it started its life as a bad argument :)

------
pippy
I've been very interested in this problem. I ran a community for a few months
that wound up being quite popular (40k uniques a day before I closed it down).
My attempt to address this problem was to have a min + max time but, most
interestingly, to count the number of _posts_ as well. Even if an opinion
wasn't popular, if it got a response out of people it would stay around longer
instead of dropping off quickly.

I prioritised community engagement over the community's quality of content.
This turned out to be a slightly more effective way of ranking content.

~~~
mafuyu
Interestingly enough, the HN ranking algorithm takes a bit of an opposite
stance, punishing posts that generate too much discussion in proportion to
upvotes. Both approaches are valid, depending on what your goals for the
community are.

HN's system would rather quickly derank a post that is potentially
inflammatory and keep good content on top rather than using comments as a
heuristic for community involvement.

------
quokka
I don't understand. The definition of _order_ is

    
    
      s = score(ups, downs)
      order = log10(max(abs(s), 1))
    

and the poster says that "order will always be positive". But that isn't true.
It is the logarithm of a number in (0,1], and so is negative or zero. Since we
cut the value off at 1 I assume that the _score_ function does something to
the votes beyond (ups - downs), scaling the value in a way that makes the
logarithm of the score interesting.

~~~
youngian
`ups` and `downs` are whole numbers, so `abs(s)` will usually be >= 1, like
2654 or something. The log is there to reduce the influence of additional
votes after a certain level of popularity. See footnote ^2:
[http://technotes.iangreenleaf.com/posts/2013-12-09-reddits-e...](http://technotes.iangreenleaf.com/posts/2013-12-09-reddits-empire-is-built-on-a-flawed-algorithm.html#fn2)

~~~
quokka
Oh, good lord. The definition uses max, but I was thinking of min. Face palm.

------
fleitz
It's theoretically broken, in practice it works quite well.

~~~
maxerickson
Is it even true that it is theoretically broken?

The blog sketches out a corner case that maybe isn't handled well, but posts
with net negative votes probably aren't "hot", and I'm pretty sure they have
mechanisms in there to make sure that bad voters are at least eventually
ignored.

~~~
DrStalker
I don't think anyone really cares how submissions are ranked once they are
obviously net negative. The concern is the first few votes; a submission
should not be effectively discarded because the first person to look at it did
not like it.

------
sesqu
> And notably, they are sorted oldest first, just as I predicted.

This bit is actually misstated. Those posts all have a comparison value of 0
(assuming _score_ is simplistic), and are not affected by the oldest-first
ranking of negative submissions. The ordering here is likely insertion order,
which just happens to be the same as oldest-first.

------
lnanek2
Just goes to prove you only need to get the parts users care about right.
Treatment of some negative score posts just isn't too important and may even
help remove spam ASAP at the cost of some good posts. If they had sacrificed
some other aspect of the site to get this right, they probably would have been
worse off.

------
rhizome
Wait, so the point is that negative vote-score articles don't show up in
"Hot?" Seems reasonable.

~~~
DrStalker
More that if the first vote a submission gets is negative it can't recover and
vanishes.

~~~
rhizome
I didn't get from the writeup that `sign` was never re-evaluated; can you
elaborate?

~~~
sesqu
The first downvote pushes the submission down below every positive submission
ever, not just below the last 12 hours' worth. It can only recover if upvoted
from other views.

~~~
rhizome
So, in the minimum case, two upvotes? Like any post? Seems there would have to
be an appearance cutoff _some_where.

------
joseph_cooney
Seems about right vis a vis the need for algorithms to be correct. Good enough
is, by definition, good enough.

~~~
Brakenshire
Good enough to be plausible to a casual user is not good enough to prevent the
algorithm from being damaging. For a lot of people, Reddit is their only
source of news; if that can be easily gamed, it has serious consequences.
Developers have responsibilities beyond increasing traffic.

------
interstitial
So relevant it hurts:
[https://news.ycombinator.com/item?id=5927904](https://news.ycombinator.com/item?id=5927904)

~~~
sesqu
The allegation there, from what I can see, was that someone had a bot
controlling five accounts. That's enough to impact even the corrected version
of this ranking, and as such it's only moderately relevant.

------
kylelibra
If the same post is at the top of reddit and hackernews at the same time I'm
pretty sure the internet implodes and life as we know it comes to an end.

------
woah
Maybe this flaw is, through some strange and circuitous social mechanism, the
precise thing that has made Reddit so popular in the first place.

------
davidgerard
I think the key point is: Reddit works well enough that they don't, and quite
possibly shouldn't, care.

------
Semaphor
For those of you who are never on Reddit: both ketralnis and raldi, posting
here, are former Reddit admins.

------
arca_vorago
The real flaw with reddit is moderator abuse.

------
petepete
If it's stupid, but it works, it ain't stupid.

------
frozenport
This is a feature that promotes controversial posts.

------
amerika_blog
Reddit is designed for SEO gaming.

The point is to have your bots/friends downvote everything but your
submission.

It works every time.

------
benihana
> _Maybe it’s that a good technical implementation is a distant second to a
> good product_

This is what computer scientists should take away from this.

~~~
sytelus
Nope. This is just survivorship bias. You look at one successful product and
propose a rule that technical implementation doesn't matter. You need to
look at all the dead startups which had a good product but not a good
implementation (e.g. Friendster and other FB competitors which were ahead of
the game but couldn't scale).

In reality, Reddit is successful by pure chance. In its initial days it was
pretty much a barren wasteland for fringe people. Most people had written it
off as another me-too without much differentiation, and Digg was _the_ place
to be. Then Digg screwed up, people wanted an alternative, and suddenly Reddit
was the overnight lord of link submission, evolving into discussion forums.

