

Using HyperLogLog to detect voting rings (2013) - sanj
http://opensourceconnections.com/blog/2013/10/14/detecting-reddit-voting-rings-using-hyperloglog-counters/

======
ecesena
If you use this technique alone, you allow an attacker to "DoS" an honest user
by creating n fake accounts and upvoting all of that user's posts, thus
increasing his global_total_upvotes with only a small increase in his
estimated_unique_upvotes.
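The attack can be made concrete with a toy sketch. A plain Python set stands in for the HyperLogLog counter, and the ring-score formula and all names here are illustrative assumptions, not the article's actual code:

```python
# Toy sketch of the metric under discussion: compare an author's total
# upvotes against the estimated number of *unique* upvoters. A plain set
# stands in for the HyperLogLog counter; names are illustrative.

class AuthorStats:
    def __init__(self):
        self.global_total_upvotes = 0
        self.unique_upvoters = set()  # an HLL would approximate len() of this

    def record_upvote(self, voter_id):
        self.global_total_upvotes += 1
        self.unique_upvoters.add(voter_id)

    @property
    def estimated_unique_upvotes(self):
        return len(self.unique_upvoters)

    def ring_score(self):
        # Many votes per unique voter looks like a ring.
        return self.global_total_upvotes / max(1, self.estimated_unique_upvotes)

# An honest author: 100 distinct readers, one vote each -> score 1.0.
victim = AuthorStats()
for v in range(100):
    victim.record_upvote("reader%d" % v)

# The "DoS": 10 fake accounts each upvote all 50 of the victim's posts.
# Totals jump by 500 while uniques grow by only 10.
for fake in range(10):
    for post in range(50):
        victim.record_upvote("sock%d" % fake)

print(victim.ring_score())  # 600 / 110, about 5.45
```

The honest user's score is inflated by the attacker, even though the victim did nothing.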

~~~
Cakez0r
They could ban only the accounts issuing the votes and revoke the fake votes
when those accounts are deleted.

~~~
ecesena
Yes, but then HyperLogLog is no longer the sole technique.

------
gingerlime
I'm not sure I fully "get" it. Wouldn't more popular posters (like PG here on
HN) be immediately considered in a voting ring, because the same people upvote
their posts regularly?

Or an even more basic/stupid question: what exactly is a "voting ring"? Does
it mean people having some kind of pact to always vote for each other?

~~~
tlb
It's a spectrum; there's no black-and-white definition. On one end there are
posts that get votes from people who visit the front/new page, vote for
multiple things and make comments that get upvoted, and on the other end there
are posts that get direct visits from accounts that only ever view and vote
for posts from that one author. In the middle, there's a weighting function
with empirically tuned parameters.

While there are a few clever voting rings, the vast majority are just one guy
with many sockpuppet accounts upvoting self-serving crap.

HN doesn't need anything as fancy as HyperLogLog. Most voting rings are
apparent within the first 20 votes so elementary data structures work fine.
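That "elementary data structures" point can be sketched with plain sets; the overlap signal and thresholds here are made up for illustration, not HN's actual method:

```python
# Hedged sketch: for a small number of votes, plain sets are enough to spot
# suspicious overlap among the early voters on an author's posts.

def early_voter_overlap(posts, first_n=20):
    """posts: list of vote lists (voter ids in arrival order).
    Returns the Jaccard similarity of the first two posts' early voters."""
    early = [set(votes[:first_n]) for votes in posts]
    if len(early) < 2:
        return 0.0
    a, b = early[0], early[1]
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Organic posts share a few front-page regulars; a sockpuppet ring reuses
# exactly the same accounts every time.
organic = [["u%d" % i for i in range(20)], ["u%d" % i for i in range(15, 35)]]
ring = [["sock%d" % i for i in range(12)], ["sock%d" % i for i in range(12)]]

print(early_voter_overlap(organic))  # 5/35, about 0.14
print(early_voter_overlap(ring))     # 1.0
```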

(I know how the HN system works and anecdotally how Reddit's used to work, but
I'm being a bit generic here to avoid revealing that One Weird Trick to Get to
the Top of HN.)

~~~
cyphunk
If a ring introduces noise (meaning they attempt to avoid the "not clever"
trap you describe), how does this affect the measurement? I assume that
introducing noise runs somewhat counter to the greed and objective of the
ring leader. But now the objective would change to merely introducing enough
noise to put the metric on this ring into question.

That said, this is of course better than a model that requires no noise.
Forcing noise from the edges reduces the set of entries requiring human
review, and human review might still reveal, with sufficient weight, the
existence of a _possible_ ring. But this is an evolution to the cat-and-mouse
stage. Hence, rings could still survive if clever?

With that question (ring survival with measured noise) set aside, it seems
that if the cat-and-mouse game were to begin in earnest, reputation-dependent
models would become a driver for mapping user "types" (social, political,
behavioral models) far more accurately than the current driver of advertising
(Google wanting to increase the price of ads).

------
IgorPartola
So wouldn't this be trivially defeated by not always posting as the same user?
If you control N users, then to promote a story:

1. Pick U out of N. Submit the story as U.

2. Pick [U1, U2...UX] out of N, and upvote the story enough to carry it to
the front page. Let the random front-page upvotes do the rest.

3. ...

4. Profit.

Also, why not write a bot that randomly upvotes stories for your N controlled
users (presumably, they are not real people)?
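The rotation steps above can be sketched as follows; the `promote` helper and all account names and counts are hypothetical:

```python
import random

# Toy sketch of the rotation scheme above. Step comments match the
# numbered list; everything here is illustrative.

def promote(accounts, votes_needed=8, rng=random):
    pool = list(accounts)
    submitter = rng.choice(pool)               # 1. pick U out of N, submit as U
    pool.remove(submitter)
    upvoters = rng.sample(pool, votes_needed)  # 2. pick [U1..UX] out of N, upvote
    return submitter, upvoters                 # 3./4. organic votes do the rest

accounts = ["acct%d" % i for i in range(20)]
submitter, upvoters = promote(accounts)
```

Because the submitter rotates, a per-author unique-upvoter counter sees a different "author" each time, which is the weakness being pointed at here.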

------
elihu
I don't think this would work very well on Reddit unless you applied some sort
of correction factor to account for the size of the subreddits that the user
posts to. For instance, someone who posts regularly to /r/luthier would have
far fewer unique users upvoting their posts than, say, /r/pics, which is a
much larger community.
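A minimal sketch of that correction, judging the unique-upvoter count relative to community size rather than in absolute terms; the subscriber numbers are invented:

```python
# Hedged sketch of the community-size correction suggested above.
# All numbers here are invented for illustration.

def uniques_vs_community(unique_upvoters, community_size):
    """Fraction of the community that upvoted the author."""
    return unique_upvoters / community_size

# 40 unique upvoters is a lot in a 5k-subscriber subreddit and negligible
# in a 5M-subscriber one; a raw threshold would treat them alike.
small = uniques_vs_community(40, 5_000)      # 0.008
large = uniques_vs_community(40, 5_000_000)  # 8e-06
```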

------
nostromo
I doubt this would get you far in practice. Many spammers are sophisticated
enough to know that they should disguise their accounts.

For example: if you buy likes on Facebook, the likebots won't just like your
page and other spammers' pages, but will also like many innocuous pages as
well (Coke, Obama, "Facebook Needs a Dislike Button!1!", etc.)

~~~
jhugg
I think this would still work for that scenario. To beat it, you'd need
like-bots that only liked/upvoted one or two sponsored things and then never
acted again. An algorithm like this would at least prevent you from re-using
like-bots.

------
meatcar
Another solution with similar properties (approximate votes, etc.) would
involve a Bloom filter on each post: the vote counter would only be
incremented when a user is _not_ present in the Bloom filter, and the user is
immediately added to it.
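That gating could look something like this sketch; the bit-array size, hash count, and class are illustrative, not a production design:

```python
import hashlib

# Sketch of the idea above: a per-post Bloom filter gating the vote counter.
# A vote only increments the count if the voter is (probably) not already
# in the filter. Bloom filters allow false positives, so a few legitimate
# votes can be dropped, but a voter is never counted twice.

class BloomVoteCounter:
    def __init__(self, m_bits=1024, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0          # bit array packed into an int
        self.count = 0

    def _positions(self, voter_id):
        for i in range(self.k):
            h = hashlib.sha256(("%d:%s" % (i, voter_id)).encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def vote(self, voter_id):
        pos = list(self._positions(voter_id))
        if all((self.bits >> p) & 1 for p in pos):
            return False       # probably already voted; don't count
        for p in pos:
            self.bits |= 1 << p
        self.count += 1
        return True

post = BloomVoteCounter()
post.vote("alice")
post.vote("alice")   # second vote from alice is ignored
post.vote("bob")
print(post.count)    # 2 (barring a vanishingly unlikely false positive)
```

Unlike a HyperLogLog per author, this needs one filter per post, but it enforces the dedup at vote time rather than estimating it afterwards.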

------
emmelaich
The linked JavaScript demo of HyperLogLog is broken. I found it here:

    http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

