

Fighting spam with BotMaker - kolev
https://blog.twitter.com/2014/fighting-spam-with-botmaker

======
gingerlime
What I'm missing the most is any concrete information on false positives. They
state that they want near-zero false positives, which is great, but what's the
actual rate? And how is it measured?

It's easy to reduce spam by X% if you don't care about increasing your
false-positive rate.
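
To illustrate that trade-off, here is a minimal sketch with made-up scores and
labels (nothing here comes from Twitter; `rates`, the scores, and the
thresholds are all hypothetical). Lowering the threshold catches more spam but
also flags more legitimate posts:

```python
def rates(scores, is_spam, threshold):
    """Return (spam_caught_rate, false_positive_rate) for a score threshold."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and y for f, y in zip(flagged, is_spam))          # spam, flagged
    fp = sum(f and not y for f, y in zip(flagged, is_spam))      # legit, flagged
    fn = sum(not f and y for f, y in zip(flagged, is_spam))      # spam, missed
    tn = sum(not f and not y for f, y in zip(flagged, is_spam))  # legit, passed
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical classifier scores and hand-labelled ground truth.
scores = [0.95, 0.9, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
is_spam = [True, True, True, False, True, False, False, False]

strict = rates(scores, is_spam, threshold=0.8)   # (0.5, 0.0)
loose = rates(scores, is_spam, threshold=0.35)   # (1.0, 0.25)
print(strict, loose)
```

Note that both numbers depend on having ground-truth labels, which is exactly
why a way for users to report wrongly flagged content matters.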

I'm also curious what ways users have to flag false positives on Twitter. It's
easy to 'block or report', but is there a spam box to inspect and mark things
that were wrongly classified? If there isn't an _easy_ way, then it's going to
be much harder to even measure the false-positive rate, let alone reduce it.

~~~
jqueryin
I can attest myself that Twitter has plenty of false positives. I use a custom
URL shortener, and on several occasions I have been flagged for spam.

What I'd really like to see is the option to submit a report for the false
positive. I'm sure the data mining is picking up on not only my initial
submission, but also reattempts with altered content to try and bypass their
greedy algos.

Needless to say it's a bit of a bummer when you curate a tweet only to be
denied sending. I think they have it out for .CO url shorteners :)

~~~
QuantumGood
What reason were you given in the spam flag notification?

~~~
jqueryin
I haven't had one trigger in a week or two so I can't recall the exact
response. I've had it happen on separate accounts using url shorteners as well
as when attempting to send DMs containing shortened urls.

If it's of any use, I use Tweetdeck.

------
struct
"Key reduction in a spam metric". Stupid question: if they know what spam is
well enough to chart it, then why not just use that to fight it?

~~~
pornel
They're fighting spam in realtime, but they can analyze their accuracy later.

One example: if you have a rule "sending the same URL 100 times in a row is
spamming" then you'll _let 99 spams through_ before you identify that they
were spams.
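
A rough sketch of that example, assuming a hypothetical `filter_stream` helper
and a made-up URL (this is not how Twitter's rules are actually written): a
purely real-time threshold rule delivers everything until the run length
reaches the threshold.

```python
def filter_stream(urls, threshold=100):
    """Split a stream of posted URLs into (delivered, blocked).

    A post is blocked once the same URL has appeared `threshold` times
    in a row, counting the current post.
    """
    delivered, blocked = [], []
    last_url, run = None, 0
    for url in urls:
        # Extend the current run, or start a new one if the URL changed.
        run = run + 1 if url == last_url else 1
        last_url = url
        if run >= threshold:
            blocked.append(url)
        else:
            delivered.append(url)
    return delivered, blocked


delivered, blocked = filter_stream(["http://spam.example/x"] * 150)
print(len(delivered), len(blocked))  # 99 51
```

Going back over the delivered posts afterwards is how you could both clean up
the first 99 and measure how much got through.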

------
jonaldomo
I was hoping to hear more about what is considered spam. High scores being
posted from a game through the API at a high level? What about mass favoriting
tweets to get more followers? Or are we just talking about 'Sex pills, free
rolexes, getting girls' links to malware sites?

------
HnHandle
It would be interesting if they had shared some more details on the rule
engine/framework. How does it compare with Drools (apart from probably being
faster)?

------
llasram
I wonder how these rules came to be known as "bots" within Twitter. It has a
nice symmetry to it though: use bots to fight bots.

------
brianpetro_
TLDR

Twitter built a DSL for writing spam detection algorithms.

------
junto
To publish a 'how we did it' article like this suggests to me that the Twitter
engineering team is either justifiably confident in its spam-fighting creation
(because they believe it is infallible), or naively overconfident, in which
case publicizing how this works will come back to bite them in the butt.

Only time will tell, I guess. From the article it would appear that they aren't
doing anything magically different with regard to classifying Twitter spam,
but they have found a way to deal with the volume of classification tasks in a
timely manner. It also gives them a way to quickly respond to new types of
spam attacks.

Very interesting. They should consider opening it up as an Akismet competitor.
The difference between a blog comment and a twitter post is negligible.

~~~
possibilistic

> publicizing how this works will come back to bite them in the butt

I didn't see any rules or heuristics published. Merely that they employ a
multi-stage filter (as any engineer could imagine), and that they codify their
ruleset with a human-readable DSL (which is kind of interesting, but also kind
of weird).
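
For what it's worth, a multi-stage filter with declarative rules can be
sketched in a few lines. Everything below is hypothetical (the `Rule` class,
rule names, event fields, and thresholds are my own illustration, not
anything published in the article): cheap rules run inline on the write path,
and costlier ones run later from a backlog.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # predicate over an event
    action: str                        # what to do when the condition holds


# Stage 1: cheap checks that run synchronously, before the write is accepted.
SYNC_RULES = [
    Rule("too_many_links", lambda e: e["text"].count("http") > 3, "deny"),
]
# Stage 2: costlier checks that run afterwards, off the request path.
ASYNC_RULES = [
    Rule("repeated_text", lambda e: e["duplicates"] >= 5, "flag"),
]


def first_match(rules, event):
    """Return the first rule whose condition holds, or None."""
    return next((r for r in rules if r.condition(event)), None)


def handle_write(event, backlog):
    """Inline stage: deny immediately or accept and queue for later review."""
    hit = first_match(SYNC_RULES, event)
    if hit:
        return hit.action
    backlog.append(event)
    return "accept"


def drain(backlog):
    """Deferred stage: re-examine accepted events with the slower rules."""
    results = []
    while backlog:
        event = backlog.popleft()
        hit = first_match(ASYNC_RULES, event)
        if hit:
            results.append((event["id"], hit.action))
    return results


backlog = deque()
events = [
    {"id": 1, "text": "http://a http://b http://c http://d", "duplicates": 0},
    {"id": 2, "text": "hello", "duplicates": 7},
    {"id": 3, "text": "hi", "duplicates": 0},
]
decisions = [handle_write(e, backlog) for e in events]
flagged = drain(backlog)
print(decisions, flagged)
```

Keeping each rule as a condition/action pair is what makes a readable DSL
possible, since new rules are data rather than new code paths.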

    
    
> The difference between a blog comment and a twitter post is negligible.

The difference in textual content between a tweet and a blog comment of
similar length is perhaps negligible, though there is often no such length
constraint in place for blogs. Comment spam strategies are free to vary string
length to optimize for evasion, click-through, proliferation, and other
criteria. I would also argue there is a different demographic distribution
between Twitter and blogs with regard to readership and participation.

That said, Twitter just outlined a number of reasons why their case is
special. They are a high-availability, high-volume, low-latency service.
Twitter's spam solution was designed to handle their very special set of
constraints. I think a multi-stage filter complete with asynchronous post-
processing jobs would be a bit much for your average Wordpress blog. People
just starting out with (say, PHP) probably can't fathom a multiprocess
deployment architecture. Not to say they couldn't, but the journey is a long
one.

~~~
benaiah
> I think a multi-stage filter complete with asynchronous post-processing jobs
> would be a bit much for your average Wordpress blog. People just starting
> out with (say, PHP) probably can't fathom a multiprocess deployment
> architecture. Not to say they couldn't, but the journey is a long one.

That's why he suggested an Akismet competitor. In case you haven't used it,
Akismet is a SaaS solution that filters your comments on their servers, not on
your own, so you don't have to worry about the architecture or deployment.

~~~
possibilistic
My mistake. Thanks for the correction. :)

