
Reddit overhauls upvote algorithm to thwart cheaters - gk1
https://techcrunch.com/2016/12/06/reddit-overhauls-upvote-algorithm-to-thwart-cheaters-and-show-the-sites-true-scale/
======
rm999
It's pretty clear to me that Reddit has always lacked a solid experienced
science-driven data person. Someone who can pinpoint an issue, formulate the
issue into a product-driven concrete problem, come up with hypotheses on how
to fix the problem, then use their vast wealth of data and users to test
and/or implement them.

It's not just the ad hoc unintuitive "solution" they came up with to mask
votes (which most people agreed was a user-antagonist step in the wrong
direction). It's a lot of things: their inept response to spam/vote
brigading/other abuse; their lack of personalized content discovery (state of
the art from 10-15 years ago would probably suffice). I could go on. As a
product-driven data scientist who has been on reddit since the early days, I'm
constantly struck by how little they do with the amazing data they have.

~~~
clamprecht
Redditor of 11 years here. I don't want another filter bubble that only shows
me what it thinks I want to read (like Facebook). I want to see what other
people are seeing, and not just like-minded people either. For that I can find
subreddits and subscribe to them.

~~~
djsumdog
But it's already like that. They removed the API that allowed RES to show you
individual up/down votes per-post. They've banned tons of subs for varying
levels of controversy. They removed their warrant canary. Their CEO was
caught changing comments and they weren't asked to resign; got away with just
an apology?

I realize that Voat is a cesspool, but anonymity -- real anonymity -- tends to
let you see peoples' real dark opinions (hence all the chans .. and why Google
requires real profiles for YouTube comments now).

I think Reddit filters a lot more than people realize; more than just the
banned subreddits and edited comments. It's been a filter bubble for quite
some time.

In any case, I've given up on Reddit .. and Voat. I guess my big addiction now
is Hacker News. :-P

~~~
pmarreck
Serious question- Do you think there's room for a new discussion site which
embraced all these lessons that other discussion sites seem to perpetually be
ignoring (perhaps due to simple entrenchment)?

Because I have oft floated the idea of starting just such a site.

Anyway, I somehow (as an avid Reddit user) missed the news that they lost
their warrant canary (probably because I unsubbed from almost all the default
subs due to low signal/noise ratio):
[https://www.reddit.com/r/worldnews/comments/4ct1kz/reddit_de...](https://www.reddit.com/r/worldnews/comments/4ct1kz/reddit_deletes_surveillance_warrant_canary_in/)

But in googling it, I did notice an interesting site related to warrant
canaries: [https://canarywatch.org/](https://canarywatch.org/) EDIT: crap, no
longer being updated:
[https://www.eff.org/deeplinks/2016/05/canary-watch-one-year-...](https://www.eff.org/deeplinks/2016/05/canary-watch-one-year-later)

~~~
jerf
"Do you think there's room for a new discussion site which embraced all these
lessons that other discussion sites seem to perpetually be ignoring (perhaps
due to simple entrenchment)?"

There's a _ton_ of discussion sites. It's not "reddit or HN or nothing"; it's
a question of which of the bajillion forums, blogs, link aggregators, email
lists, or other things you're interested in joining. There's room for all
kinds of more of them, too. There's so many choices it's more a discovery
problem than an availability problem.

However, if you go into it with the hopes of being the next thing that is the
_size_ of Reddit, statistically, you're going to be disappointed.

It isn't even clear to me that this is a very good goal, either. Running a
site the size of Reddit is an enormous, enormous hassle. The controversies
currently associated with it are merely specific instantiations
of the general fact that you can count on being vilified by a good 20%+ of the
community once you reach that size, that every change will be met with a huge
outpouring of hatred straight into your contact box, and so on. Goodness help
the poor person trying to start The Next Reddit (TM) with the naive idea that
it is a good idea to be very available to the community... all that does is
add to the inevitable flames some fresh flames when you have to renege on that
promise too. Unless you've got a solid plan to monetize enough to make it
worthwhile, fast enough for it to be worthwhile, while still somehow
organically growing without that getting in the way, I'd suggest shooting for
starting a community for $X, for some value of $X of interest to you, rather
than trying to be The Next Reddit (TM).

~~~
pmarreck
fantastic points, and spot-on about aiming for a solid community (if small)
over "the next Reddit"

------
jedberg
Since I can't really say more than this: there's a lot of misinformation and
misunderstanding in these comments. Don't believe everything you read on the
internet and don't believe the conspiracy theories that this has anything to
do with censorship or suppressing political (or any other) discourse.

It's really just making the math less complicated.

~~~
HCIdivision17
The post where they describe what they're doing sure makes it look like that.
From what was described, this is a massive refactoring and recalculation
effort to pay down years of accumulated technical debt. Occam's Razor is
probably good enough: they had _tons_ of complicated rules in the system that
goobered up their calculations, and they needed to get the large amount of
ineffectual cruft out at some point.

There's even a clear note that the /top stuff should go back to normal after
the recalc is finished and it's back up to speed.

Frankly, I was a bit shocked by the scope of it. Seems like a huge and
terrifying update.

~~~
jonlucc
I'm not sure if there's a commonly accepted way to handle the logistics of
this, but I'd love to see a blog post from them about how they manage this on
a site with so many active users at all times.

I guess you start with backing up the production DB to an offline location.
The fact that votes can be placed on old posts makes it difficult, I'd assume.

~~~
HCIdivision17
I'll see if I can find it, but there was a comment in an AMA on the topic
where someone asked if they have a test server and they said no. Not because
they're goofy noobs at the business, either, but apparently it's just too
impractical. A lot of the testing really needs the load to be properly
exercised, and simulating that is sufficiently hard to probably not be worth
the effort.

So this update? Updating on Production, baby!

Edit: I bet they _do_ have test servers, but nothing full scale. Which really
may emphasize your point more: I would love to know how they tested it and
merged it in when there's only so much they can do offline!

------
minimaxir
This is mostly another fix of bad design that was added in Reddit's early
years, but people have started to notice the aggressive vote fuzzing as it was
calibrated for a much smaller userbase.

At the least, this will break statistical analysis of Reddit data for a while
since the public datasets will not have their scores updated, which I in
particular am not happy about. :p

~~~
notatoad
Complaints like this are the reason companies are reluctant to release public
datasets. They don't have any obligation to release data, but they do. It's a
gift. If they have to consider "how will this change affect the consumers of
our public data releases?" every time they make a change, they're going to
stop releasing public datasets.

~~~
minimaxir
For clarity, the Reddit datasets are not released by Reddit itself, but
scraped through the API. (More context/examples of what I do with the data:
[http://minimaxir.com/2015/10/reddit-bigquery/](http://minimaxir.com/2015/10/reddit-bigquery/) )

~~~
vosper
Does this mean you (or I if I want to do some analytics on Reddit data) will
need to completely rescrape the site after scores are recomputed?

~~~
minimaxir
If you wanted to compare raw scores for submissions before the change to those
after the changes, yes.

Otherwise, it shouldn't matter.

------
pselbert
Evan Miller did some very interesting analysis of the historical Reddit hot
formula. It doesn't address cheating, but it does identify the details of the
algorithm and some inherent faults.
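
For readers who have not seen it, here is a minimal sketch of the hot formula
as it is commonly quoted from Reddit's open-sourced ranking code. The function
name, type hints, and the demo values are my own assumptions for illustration;
this is the historical formula, not necessarily what runs in production today.

```python
from datetime import datetime, timedelta, timezone
from math import log10

# Offset in seconds used by the open-sourced code (a date in December 2005).
REDDIT_EPOCH = 1134028003


def hot(ups: int, downs: int, posted: datetime) -> float:
    """Historical hot rank: log-scaled net votes plus a linear time bonus."""
    score = ups - downs
    # Each 10x increase in net votes adds one point.
    order = log10(max(abs(score), 1))
    # +1, 0 or -1 depending on whether the post is net positive.
    sign = (score > 0) - (score < 0)
    # Each 45000 seconds (12.5 hours) of recency also adds one point.
    seconds = posted.timestamp() - REDDIT_EPOCH
    return round(sign * order + seconds / 45000, 7)


if __name__ == "__main__":
    t0 = datetime(2016, 12, 6, tzinfo=timezone.utc)  # hypothetical post time
    later = t0 + timedelta(seconds=45000)
    # A post 12.5 hours newer needs roughly 10x fewer votes to rank the same.
    print(hot(100, 0, t0), hot(10, 0, later))
```

The key property is that votes count logarithmically while age counts
linearly, so submission time dominates the ranking fairly quickly.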

[http://www.evanmiller.org/deriving-the-reddit-formula.html](http://www.evanmiller.org/deriving-the-reddit-formula.html)

There is also the followup about ranking news items with upvotes:

[http://www.evanmiller.org/ranking-news-items-with-upvotes.ht...](http://www.evanmiller.org/ranking-news-items-with-upvotes.html)

------
jordigh
Can someone explain to me the relationship between their public source code
and what's actually running on their servers? I thought their voting
algorithms were all public, but there are no commits here indicating any
voting changes.

[https://github.com/reddit/reddit](https://github.com/reddit/reddit)

~~~
rjbrock
Their anti-spammer code is the only code that is not published as open source

~~~
talmand
Gotta keep the secret blacklist secret.

~~~
robwilliams
It would be trivial to test whether a site is blacklisted, that's not why the
anti-spam code is private.

~~~
talmand
Why does it seem like people are assuming negativity in my post? I do think
the secret blacklist should be kept secret.

------
gohrt
Silly that the announcements give so much attention to scores going up (due to
a change in scale) and not to the content of the change. It would be trivial
to keep the scale magnitude the same by multiplying by a factor ("old average
frontpage post" / "new average frontpage post") or applying a logarithmic
transform.
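
A rough sketch of what either option could look like. The function names and
the averages (5,000 before, 50,000 after) are made-up numbers for
illustration, not figures from Reddit.

```python
import math


def rescale_linear(new_score: float, old_avg: float, new_avg: float) -> float:
    """Multiply by old_avg / new_avg so typical front-page scores keep their old magnitude."""
    return new_score * old_avg / new_avg


def rescale_log(new_score: float) -> float:
    """Compress large scores with a signed log10 transform."""
    return math.copysign(math.log10(1 + abs(new_score)), new_score)


if __name__ == "__main__":
    OLD_AVG = 5_000    # hypothetical average front-page score before the change
    NEW_AVG = 50_000   # hypothetical average front-page score after the change
    for raw in (500, 50_000, 120_000):
        print(raw, rescale_linear(raw, OLD_AVG, NEW_AVG), rescale_log(raw))
```

Either transform keeps the ordering of posts intact; it only changes the
number that gets displayed.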

Makes me think that they are intentionally kicking up dust around the scale
change, to draw attention away from the substantive changes to
weighting/ranking.

[https://www.reddit.com/r/modnews/comments/5goxk4/upcoming_ch...](https://www.reddit.com/r/modnews/comments/5goxk4/upcoming_change_to_vote_scores/)

[https://www.reddit.com/r/announcements/comments/5gvd6b/score...](https://www.reddit.com/r/announcements/comments/5gvd6b/scores_on_posts_are_about_to_start_going_up/)

~~~
notatoad
Neither of those links seems to indicate any changes other than the big one.
What exactly are they trying to bury?

------
EJTH
I wonder how they are pulling this off. I guess it takes stuff like account
age into consideration. But I fear it is more based on what sub it originates
from, context of the post etc.

It will be their go-to excuse for why content is censored off the front page
now. It has been problematic to excuse away the censorship lately, so this is
a nice catch-all excuse they just made here.

~~~
colonelxc
They didn't "just make" anything. They've had vote fuzzing, shadow banning and
more from early days.

As both the techcrunch post and the actual reddit post say, the rules around
it were complicated and hard to reason about. They've basically refactored and
come up with a simpler set of rules.

~~~
chrismarlow9
I even recall a vulnerability write up where someone figured out that by using
the "Show me posts with more than X upvotes" feature and a counter you could
determine what a post's real score was.
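
I don't have the write-up at hand, so this is only a sketch of the general
idea, not the actual exploit. It assumes a hypothetical oracle
`appears_above(x)` that returns True when the post is still listed under
"more than x upvotes", and it uses a binary search instead of stepping a
counter up one at a time.

```python
from typing import Callable, Tuple


def recover_score(
    appears_above: Callable[[int], bool],
    lo: int = 0,
    hi: int = 1_000_000,
) -> Tuple[int, int]:
    """Return (score, queries), assuming lo <= score <= hi.

    appears_above(x) is a hypothetical stand-in for the search filter:
    it must return True exactly when the real score is greater than x.
    """
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if appears_above(mid):  # real score > mid, so search the upper half
            lo = mid + 1
        else:                   # real score <= mid, so search the lower half
            hi = mid
    return lo, queries


if __name__ == "__main__":
    hidden = 4321  # made-up "real" score that the displayed number fuzzes
    score, queries = recover_score(lambda x: hidden > x)
    print(score, queries)
```

With a range of a million possible scores this needs at most 20 filter
queries, which is why fuzzing the displayed number alone doesn't hide much if
a threshold filter works on the real value.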

------
fabian2k
One aspect I don't understand is why Reddit allows new accounts to vote from
the beginning. You don't even need to verify an email address, you can just
start voting (unless they're doing something sneaky and ignore those votes).

I'm comparing this to Stack Overflow, where you need 10 reputation to vote.
This is a pretty trivial barrier, but it does mean that you need to put more
effort into each sock puppet to be able to commit vote fraud. You also can't
have "invisible" socks all that easily, you generally need to post something
to get the upvotes for your socks, which results in more obvious patterns that
users, moderators and automated systems can detect.

Fighting vote fraud when it is that easy to create new socks that can vote
seems like a pretty hard problem to me, especially on the scale of Reddit.

~~~
scott_ni
On the other hand, requiring reputation could incentivize spamming and
worthless posts. For example, if you already have an automated voting bot, why
not make it upvote your new bot accounts to transfer "rep" and allow more
upvotes?

~~~
fabian2k
That is one of the things I meant by "more obvious patterns". That works of
course, but you also create connections in the other direction now as well for
your vote fraud.

There is another effect that probably doesn't apply to Reddit, but on Stack
Overflow it can be pretty noticeable if you upvote crappy posts by your socks
too much, and someone will investigate.

------
polysaturate
Does it really thwart cheaters' actions or just stop them from having a clear
indication of their efforts?

~~~
landr0id
This change isn't to thwart cheaters... not sure where the article got that.
It's a side effect of legacy code that was used to thwart cheaters that hasn't
worked very well since Reddit has grown.

Some additional info about it here:
[https://reddit.com/r/modnews/comments/5goxk4/upcoming_change...](https://reddit.com/r/modnews/comments/5goxk4/upcoming_change_to_vote_scores/)

------
komali2
I don't understand the idea of "preventing a cheater from seeing votes." The
only way I can think to "cheat" reddit votes is to just throw a bunch of bots
at a post. Presumably, one knows how many bots one has. If the idea is you
don't know which bots are shadowbanned and which aren't, it would be trivial
to just have your own private subreddit that the bots post in regularly, and
see whose posts become visible to other accounts.

------
xnull2guest
It would be nice to have a clearer understanding of how the algorithm works,
and how it purports to stop vote manipulation from state actors, like the
capabilities of the GCHQ and NSA from the Snowden documents or the
astroturfing contractor centers for the DoD (Earnest Voice, etc).

I'm thinking maybe the algorithm changes slightly alter some brigading from
communities, but probably can't handle state-sponsored manipulation.

------
kbenson
> Admin KeyserSosa

Is it me, or is it weird to see what are likely names chosen when someone is
adolescent or at least interacting in a non-professional capacity making their
way into professional contexts, especially when they refer to another person,
real or fictional (even if slightly changed)?

Are the corporate executives of the next generation going to go by the aliases
the likes of BillGaytes and BarackOsama?

~~~
matt4077
It's mostly you.

Or maybe it's me. I hate nothing more than the complaint that something is
"unprofessional". It's a completely meaningless concept. If there's something
actually wrong in how someone is acting, it's completely ok to call them out.
Example:

"'BillGaytes' and 'BarackOsama' are insulting to Bill Gates, Barack Obama, and
the gay community. It inflicts harm on people without justification, and I
expect you to change these aliases"

Note how none of that applies to "KeyserSosa".

~~~
kbenson
I think I just chose some poor examples and used an unfortunate set of
words, because that's not really along the lines of what I was talking about.

I actually wasn't focusing on the names being insulting, and wasn't meaning
BillGaytes to have any gay commentary, I just thought of a famous name, and
then an alternate spelling that sounded the same, and didn't think of that
affiliation at all at the time (I noticed a few minutes later when re-reading
it, and almost edited it to point out that it wasn't my intention).
BarackOsama I thought of just because Osama sounds like Obama, and then it
seemed like something someone might use _because_ of the affiliation so seemed
a more realistic choice, but I didn't really intend for the focus to be on the
names having negative connotations to be part of my point.

Really, it's more that it feels weird to have someone go by someone else's
name, or some other artist's creation, and that's only a little alleviated by
an alternate spelling. It's not about being "professional", I think, but about
it being public. It's like the person is laying some claim to that other
person's identity or creation through the name, and while in a limited context
that feels acceptable, it feels weird to me when it's more public, as in a
news post, or expressed to millions (or billions!) of people through that
person's prominence.

I'm using the word "weird" because I can't really pin down what's causing my
feelings about this. That may point towards it being some misfiring emotional
attachment to a concept that doesn't really have a rational explanation, or
there may be a rational objection that I'm having trouble formulating but my
gut is catching. I'm not sure, which is why I'm trying to explore the idea
here, where I might get some input to solidify my thoughts.

------
brilliantcode
The real urgent problem is sockpuppeteers using _purchased_ accounts which
makes it difficult to see who is legit or not.

They can influence and censor people they don't like. You see a lot of this
going on over at /r/ethereum, where certain submissions receive a high number of
upvotes but little to no comment activity apart from obvious sock puppetry
creating fake conversations. Comments pointing this out are quickly downvoted
to oblivion or flagged.

The anxiety from Reddit is obviously clear. Their user base premium is losing
its value from an investor's point of view. The mindless banter you see on
YouTube is what makes up Reddit's audience, with occasional anecdotal
experiences that show up.

Pornhub has a ton of users, but are they as valuable as Facebook users?

I feel like there's going to be a market correction in the quantity over
quality type of social network websites who've ramped up userbase but have not
extracted tangible value apart from delivering shaky metrics that advertisers
are now beginning to question.

It seems like the only ad platform that stood the test of time is Adwords but as a
small advertiser I'm turned off by the prospect of paying $1+ _per click_.

------
twblalock
Does anyone believe the vote fuzzing ever really worked?

It seems to me that people who wanted to mass upvote/downvote a post would do
so despite the vote fuzzing.

~~~
Terr_
I thought a large part of fuzzing and vote-detail hiding was to deny
bots/users any immediate/unambiguous feedback about whether they'd been
detected and had their influence secretly neutered.

------
toxican
It doesn't help that the CEO admitted to editing Trump supporter comments. I
mean I couldn't care less about those people, but that _is_ censorship and
suppression and I can't blame people for viewing anything reddit does from now
on with a bit of skepticism. But obviously it's not that bad because they've
still chosen to make reddit their home.

~~~
talmand
Why the downvotes? The CEO did admit to committing such anti-user actions.

~~~
minimaxir
At _best_ , it's off-topic.

~~~
talmand
How is it off-topic in this thread?

~~~
jakebasile
It isn't. It's a relevant demonstration of how/why some users are very
distrustful of Reddit's actions.

------
mark242
I still don't understand why, after having grown to a sufficient size, Reddit
doesn't just enforce two-factor at account creation time. Tie a user to a
phone number at the very least.

~~~
binarymax
Because having anonymity on public forums is important.

~~~
smrtinsert
These days it seems like an illusion with respect to government oversight, and
having a username is a simple way to anonymize yourself against your peers.

I'm all for it since it seems like it would reduce trolling.

------
drops
Upboat algorithms should be the least of their worries right now.

