- Okay, you want to detect bots. Well, "good" bots usually have a user agent string like, "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)." So, let's just block those.
- Wait, what are "those?" There is no normalized way of a user agent saying, "if this flag is set, I'm a bot." You literally would have to substring match on user agent.
- Okay, let's just substring match on the word "bot." Wait, then you miss user agents like, "FeedBurner/1.0 (http://www.FeedBurner.com)." Obviously some sort of FeedBurner bot, but it doesn't have "bot" or "crawler" or "spider" or any other term in there.
- How about we just make a "blacklist" of these known bots, look up every user agent, and compare against the blacklist? So now every single request to your site has to do a substring match against every single term in this list. Depending on your site's implementation, this is probably not trivial to do without taking some sort of performance hit.
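The blacklist lookup being described can be sketched in a few lines. The marker list and function name below are illustrative, not a real blacklist:

```python
# Naive approach: substring-match every request's User-Agent against a
# blacklist of known bot markers. The list here is illustrative, not exhaustive.
BOT_MARKERS = [
    "googlebot", "bingbot", "slurp", "baiduspider",
    "feedburner", "crawler", "spider", "bot",
]

def looks_like_known_bot(user_agent: str) -> bool:
    # Every request pays for a scan over every term in the list.
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

print(looks_like_known_bot(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(looks_like_known_bot("FeedBurner/1.0 (http://www.FeedBurner.com)"))         # True
print(looks_like_known_bot(
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1"))  # False
```

Note the last case: a bot spoofing a browser UA sails right through, which is exactly the weakness discussed next.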
- Also, you haven't even addressed the fact that user agents are specified by the clients, so it's trivial to make a bot that identifies itself as "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1." No blacklist is going to catch that guy.
- This is smarter than just matching substrings, but this means you may not catch bots until after the fact. So if you have any sort of business where people pay you per click, and they expect those clicks not to be bots, then you need some way to say, "okay, I think I sent you 100 clicks, but let me check if they were all legit, so don't take this number as holy until 24 hours have passed." This is one of the reasons why products like Google AdWords don't have real-time reporting.
I'm seriously only scratching the surface here. That being said, I'm not saying, "this is a hard problem, cut Facebook some slack." If they're indeed letting in this volume of non-legit traffic, for a company with their resources, there is pretty much no excuse.
Even if you don't have the talent to preemptively flag and invalidate bot traffic, you can still invest in the resources to have a good customer experience and someone that can pick up a phone and say, "yeah, please don't worry about those 50,000 clicks it looks like we sent you, it's going to take us a while but we'll make sure you don't have to pay for that and we'll do everything we can to prevent this from happening again." In my opinion this is Facebook's critical mistake. You can have infallible technology, or you can have a decent customer service experience. Not having either, unfortunately, leads to experiences exactly like what the OP had.
Too often I've heard "how could XYZ not have done this, fixed that?" and then seen that same party sign on to fix this "easy" problem and get themselves in a world of hurt.
* Accounts consisting of an unusually high proportion of ad-clicking activity compared to their other activities are likely bots.
* New accounts with no history but high ad clicks are likely bots.
Before a click is counted, check the source account's human rank (activity-to-clicks ratio). If it's below a certain threshold (tuned over time), then don't bill your customer for it.
The algorithms to determine human rank can start fairly simple and become more complicated and accurate over time.
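A toy version of that simple starting point, assuming an account exposes basic activity counters; the field names and threshold below are made up:

```python
# Sketch of the "human rank" idea: score an account by its ratio of general
# activity to ad clicks, and refuse to bill clicks from accounts below a
# tunable threshold. Field names and the threshold value are assumptions.
def human_rank(account: dict) -> float:
    # Posts, comments, photos, etc. vs. ad clicks; +1 avoids division by zero.
    return account["other_actions"] / (account["ad_clicks"] + 1)

def is_billable_click(account: dict, threshold: float = 5.0) -> bool:
    # New accounts with no history but high ad clicks fall below the threshold.
    return human_rank(account) >= threshold

fresh_clicker = {"other_actions": 2, "ad_clicks": 40}
normal_user   = {"other_actions": 300, "ad_clicks": 4}
print(is_billable_click(fresh_clicker))  # False
print(is_billable_click(normal_user))    # True
```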
In terms of their current sophistication I would definitely call this an easy problem from a tech standpoint. Scaling to millions of concurrent users? Hard. Attracting and keeping millions of users? Hard.
Then they make some bundled crapware browser plugin that gets installed by legit Facebook users and clicks ads at that interval plus some random delay, possibly doing the ad clicks in the background so that the user does not know.
Take a minute to think about the effort a bot account has to go to in order to even come close to gaming a system that tracks their behaviour.
Let's assume that in order to be deemed human you need at least a few characteristics like "account older than 1 day", "friends from other non-suspicious accounts", "a photo or at least comments on a friend's photos", and a "certain amount of click activity".
Now let's take that criteria and make it difficult for a bot to even know if they're being flagged as a bot where the only indication is a reduced bill at the end of the day / week / month to the advertiser. So bot makers would have to pay to test and would at best receive delayed feedback on their success.
Isn't this starting to seem extremely difficult, and generally unscalable from a bot perspective? It doesn't have to be impossible, there just needs to be lower hanging fruit, of which there are plenty should FB implement a system like this.
And should this fairly simple solution still have a few holes that a very elite few bot coders are able to game, you've still got tons of data and time to use to tune your algos.
It just doesn't make sense that one of their advertisers is being charged for 80% bot clicks, until you start to question Facebook's motivations.
The bots will not be able to get accurate statistics without paying, but they might be able to get "good enough" statistics. For example, is the system moving in a direction we like.
If you do need to pay, a good bot herder probably has enough stolen credit cards to use.
Yes, there are certainly mitigation techniques. But like other spam/bot related problems it becomes an arms race.
You only need 1 good bot programmer to release his bot online for thousands of budding script kiddies to take advantage of.
So basically we've narrowed down the potential bot makers to someone who:
1) Has a way to hijack thousands of Facebook users' browsers, and is more interested in spiking some company's ad buys than other profitable activities they could be doing.
2) Has access to stolen credit cards so they can spend the money required to debug and test their bot's humanity against the defensive algorithms. Spending money in order to indirectly make money because some company is paying them to spike a competitor's ad buys. Obviously this is going to cut into their margins, so the amount they spend should be less than they're being paid.
3) Has the patience and dedication to analyze the bills when they come in to determine how much of their activity is considered botlike and then tweak their behaviour accordingly.
4) Thousands of budding script kiddies are also being paid by companies to spike their competitors' ad buys.
5) Rides a unicorn?
Don't forget there also needs to be companies willing to pay someone to make these bots who don't mind how incredibly difficult it would be to measure their success. In fact, if there are companies willing to pay for this, then you're going to have a whole bunch of scammers who simply claim they're able to do it that the companies will have to filter out.
We're incredibly far into the realm of unprofitability and edge-casery, don't you think? An alien spaceship is going to land in my backyard before this happens.
You are forgetting that there are growing numbers of programmers in developing nations for whom doing this sort of thing can be an attractive way to earn a living (or at least they are willing to try it). Look on any "hire a freelancer" type sites and you will find hundreds of postings for tasks that sound awfully like scamware development (usually at a couple of dollars an hour).
Browser hijacking is not so uncommon either, I've installed legit software in the past that has come bundled with some crapware that injects extra adverts into web-pages and other things.
The analysis would not be too difficult either: simply set a "click rate" and adjust it downwards until you are being charged > $0. Does Facebook offer any method to get free trial credits? If so, this could be abused, heavily.
The economics are roughly the same as with spam, if you do enough of it your "hit rate" doesn't matter so much.
Real human social activity is a lot more random than real informational content in blogs - so fb's problem seems harder.
Also, facebook banning accounts is a much stronger action than google just lowering the pagerank of suspected splogs.
So overall I think fb's problem is harder than a currently google-unsolved problem.
The detection part aside, it's a simple fix. I just wonder why FB hasn't already. (Is it because it'd hurt their revenue when they need it the most?)
Now of course I'm simplifying this to a nearly absurd degree, but I do think there are reasons for Facebook to want to ignore these issues and that even a minimal step up in bot-prevention could go a long way when it comes to their advertisers.
But on the other hand, since valid human behavior includes much more senseless, random, and repetitive actions (compared to blogging), a well-done bot would be harder to detect.
Like I said, I absolutely believe there will be bots that are incredibly sophisticated that might be difficult to detect, but I still think more could be done to stop the less sophisticated ones.
However, even if you assume it is a hard problem, that's more reason that they should have listened (and responded!!) when a company identified this problem with some hard data to back it up.
Sure, it's a big data problem, but I can imagine that Facebook has solved these types of scenarios many times over.
Moreover, Amazon is always buying new IP subnets.
Large numbers of real-looking fake accounts should be hard to maintain.
DeviceAtlas and WURFL are both products that I have used that attempt to have complete coverage of all UA strings in use in the wild, even the ones that attempt to spoof legitimate UA strings.
While their focus is on device detection and the properties of these devices for mobile content adaptation, knowing which connections are from bots is just as important if you are going to alter what content you serve up based on UA.
These products actively seek out new UAs and add them to their rule sets in a way that I can only assume will be more accurate over time than a homegrown solution.
FWIW I know that FB use at least one of these products for their device detection, so if you want to match what FB thinks (knows!?) this is where you would start.
This definitely takes non-trivial effort, but it's also definitely doable if you have a serious interest in perpetrating click fraud. While Facebook hypothetically could fight this form of click fraud, it'd likely be difficult and the method would be fragile/easily circumvented. Like most things, PPC advertising can only function in a world primarily populated with positive actors.
1. Make the split uneven. For example, for every dollar an advertiser pays, only 30 cents goes to the page owner who received the "Likes".
2. Lobby to make these kinds of activity illegal. They are, after all, spamming activities.
3. These kinds of fake account farms will usually have very few connections to the outside world. That is, accounts that do not obey the "six degrees of separation" should be flagged for human review.
4. Once reviewed, enlist law enforcement help to track down the owners of all the accounts in these farms.
Painstaking work, but I guess the point is that advantage can lie with Facebook as opposed to the spammer in this game. I am not sure about the ethical implications though.
Maybe this is why Google wants so badly to succeed in the social space: extra spam-fighting abilities.
I couldn't find any resources on how Facebook handles this.
Ignore this if FB already does it.
What's the grey area look like? What's the chance of falsely identifying a real user as a fake one?
Rhetorical questions. I'm not sure FB could even answer them accurately.
Instead of that, we'll watch for patterns of use in which high volumes of traffic come from unrecognized non-end-user network space.
"Unrecognized" means that it's not one of your usual search vendors: Google, Yahoo, Bing, Baidu, Yandex, Bender, AddThis, Facebook, etc.
"Non-end-user network space" means a netblock which is clearly not a residential or commercial/business networking provider.
For this you'll mostly want to eyeball, though when we saw very high volumes of search coming at us from ovh.net (a French low-cost hosting provider), it was pretty obvious what the problem was.
Use an IP-to-ASN/CIDR lookup (such as the Routeviews project: http://www.routeviews.org/) together with a local ASN-to-name lookup. One source of the latter is http://bgp.potaroo.net/cidr/autnums.html
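As a rough sketch of how such a lookup feeds the "unrecognized non-end-user network space" check: the tiny hand-made table below is an illustrative stand-in for a real IP-to-ASN feed, and the netblocks and volume limit are assumptions, not authoritative data:

```python
import ipaddress

# netblock -> (owner, is_recognized_search_vendor); illustrative entries only.
# In practice this table comes from an IP-to-ASN feed such as Routeviews.
NETBLOCK_TABLE = [
    (ipaddress.ip_network("66.249.64.0/19"), ("googlebot", True)),
    (ipaddress.ip_network("94.23.0.0/16"),   ("low-cost-hosting", False)),
]

def classify_source(ip_str: str):
    ip = ipaddress.ip_address(ip_str)
    for net, (owner, recognized) in NETBLOCK_TABLE:
        if ip in net:
            return owner, recognized
    return "unknown", False

def is_suspicious_volume(ip_str: str, requests_per_hour: int, limit: int = 1000) -> bool:
    _, recognized = classify_source(ip_str)
    # High volume from an unrecognized or non-end-user netblock is the red flag.
    return requests_per_hour > limit and not recognized

print(is_suspicious_volume("94.23.1.2", 5000))    # hosting block, high volume -> True
print(is_suspicious_volume("66.249.64.5", 5000))  # recognized crawler -> False
```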
It's not perfect for a variety of reasons, but I found it to be a useful input for a similar bot detection problem. That said, this was a long time ago so I'd need to re-run some experiments to see if the hypothesis remained valid.
For those not familiar with OS fingerprinting, it's a method of looking at the details of a connection and determining which operating system is most likely to have created packets with those options and flags. In BSD, it's built into pf.
Just that if your bot runs from, say, Windows 7 and it spoofs an IE8 user-agent header, or even runs by directly automating IE itself, how do you detect it then? Both of those scenarios are not unlikely at all.
I can't speak to Java classes, but they'd need to be dealing with raw packets which isn't an option in a typical stack.
As in your case, it's nearly impossible to detect click fraud in real time as the clicks stream in. We batch-processed the clicks throughout the day and did the billing settlement at the end of the day; it's easier to detect click fraud over a period of time. Obvious bots, like the search engine crawlers, were filtered out.
For click fraud, it's impossible to tell whether a one-off click is a genuine user click or a fraudulent one, so we tried to catch repeat offenders. We looked for duplicate clicks (same IP, same agent, other client attributes) against certain targets (same product, product categories, merchants, etc.) over various periods of time (1 minute, 5 minutes, 30 minutes, hours, etc.). We also built historic click stats on the targets so we could check the current click pattern against the historic stats (yesterday, a week ago, a month ago, a year ago).
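A minimal sketch of that duplicate-click batch check; the click record shape, window size, and threshold below are illustrative assumptions:

```python
from collections import defaultdict

def flag_duplicate_clicks(clicks, window_seconds=300, max_per_window=3):
    """clicks: list of dicts with 'ts' (epoch seconds), 'ip', 'ua', 'target'.
    Returns the clicks belonging to any (ip, ua, target) group that clicked
    more than max_per_window times inside one time window."""
    buckets = defaultdict(list)
    for c in clicks:
        key = (c["ip"], c["ua"], c["target"], c["ts"] // window_seconds)
        buckets[key].append(c)
    flagged = []
    for group in buckets.values():
        if len(group) > max_per_window:
            flagged.extend(group)
    return flagged

# Five clicks from one client on one product within five minutes: all flagged.
clicks = [{"ts": t, "ip": "1.2.3.4", "ua": "X", "target": "p1"} for t in range(0, 50, 10)]
print(len(flag_duplicate_clicks(clicks)))  # 5
```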
We ran quick checks every 1 and 5 minutes to detect unusual click patterns. These were mainly for detecting DoS attacks against our site or against certain products, and we actually had a real-time reaction mechanism to deal with them: slow down the requests, throw up a CAPTCHA, redirect to a generic page, or block the IP at the routers.
Every couple of hours, we batch-processed the clicks so far and ran the exhaustive checks to flag suspicious clicks. At the end of the day, definite fraud clicks were thrown out, and moderately suspicious clicks were grouped and reviewed by humans. It's a fine balance deciding whether to flag the less suspicious clicks, since there's money involved.
The most sophisticated click fraud I've seen involved a distributed botnet of hundreds of IPs all over the world doing slow clicks throughout the day. We found out because it was targeting one merchant's product category and causing unusual click stats compared to the historic pattern.
What you are getting at, however, is very important to ad-driven internet business models. You make the empathetic deduction that if you are there for one purpose, you are impervious to advertising.
Now, if you are still with me after that statement, then there are larger implications of online advertising than are actually being measured. For instance, if the intent of the view mattered, would it be worth more? I think Google has answered that question rather clearly.
Measuring "why someone is visiting a URL" (and thus generating an ad-view) is a larger technical problem than could be accomplished with some fancy algorithms. That is, unless the service knew all of your browsing history...
In any case, you raise a good point about the intent of the page-view and how that should be "credited" as advertising. To do this however is to challenge existing 'fundamental' theories.
Also, there is "brand marketing" and "performance marketing." I'm not sure they have classic definitions, but generally brand marketing is based on "mindshare" and tough to quantify. This is basically any TV commercial -- Budweiser doesn't expect you to jump off your couch the second you see a Bud Light commercial, but maybe a week later when you're in the supermarket, you spot that new brand of beer and recall the commercial with Jay-Z rapping and grinding with some models, and think, "eh, maybe I'll try Bud Light Platinum, I will purchase this and consume it later because I like the idea of rapping with models," even at a subconscious level.
Performance marketing is all about ROI. You spend $100 marketing dollars, and you want to make at least $X back. Typically $X = 100 in this example, although in some cases you may want to adjust profit margins or even run at negative profit (for example, if you want to maximize revenue even at an unprofitable rate). Almost any advertiser on the internet has a performance marketing mentality, mainly because it's much easier to actually measure performance. This is why advertisers like the OP are frustrated with Facebook, because paying for bot traffic makes it extremely hard to hit their performance marketing goals.
You are right though, in that this is a huge gray area. To give two extremes, let's say you run an ecommerce site but you'd also like to monetize with ads, so you reach a deal with a shopping comparison site to display their products in your pages' footer. If you buy AdWords traffic to your site, and those visitors scroll down and click on the shopping comparison ads, the comparison site is probably fine with paying you for that traffic, although you are probably paying something like $2/click yourself. If instead you make a page with JUST the shopping comparison ads, and buy traffic from someone who basically infects computers with trojans and overwrites their Google search results to point at your site (say at $0.01 per click), then you'll get clicks on the shopping comparison ads and make lots of money; except the shopping comparison site will realize these clicks don't lead to any actual sales for their merchants (this is extremely poor-quality traffic, since the users who end up there have basically no intent) and will kick you off. Somewhere in between advertising for "first page Google AdWords" traffic and "trojan/botnet" traffic is basically every other form of internet advertising, and how gray that area is, is typically completely defined by your business partners.
Advertising and marketing are generally very intentionally tailored. You're not trying to reach any human, you're trying to reach a human in your target demographic.
Skewing results (and wasting resources) by paying people who are entirely outside your marketing effort to click through links (or take other similar resource-costing / data-skewing actions) is not the desired result.
See, the essence here is not to come up with a great algorithm. This is not a programming contest. But to create a system where everyone (publisher, advertiser, ad-network and customer) wins.
In this very early stage, the success of Facebook's Ad network should be measured only by one metric - ROI of Advertisers. And that number has to be consistently better than Google's.
Why's that? This isn't a zero sum game. You don't advertise only on Google or only on Facebook; as much as your budget allows, you advertise on every channel you can achieve a positive ROI on.
For any advertiser looking for a direct response, like a signup or purchase, I doubt anything Facebook can do would make the ROI better than advertising on search. No amount of demographic and interest targeting will make advertising to people as effective as advertising to intent. You can pay Google to send you people while they're in the process of looking for your product; Facebook doesn't do that.
A combination of tech expertise and marketing savvy are needed to reduce this problem. More analytic tools are available for Google Adwords than are available for Facebook ads.
Rather than trying to explicitly filter bot traffic, they should try to quantify the value of particular "likes". For instance, a user who demonstrably clicked through and purchased something is much more valuable than one who "likes" 10,000 companies and has no friends, whether or not they're a bot.
And heck; a bot that has thousands of followers who all buy lots of things is valuable too. Maybe someone created a particularly effective meme-generator bot, and people who follow that bot are likely to buy ironic T-shirts. Whether or not a user is a bot isn't all that relevant; it's whether that user's activity correlates with you making sales.
???! Excuse me? Are you programmers? Efficient substring matching is a solved problem. How many entries in that blacklist are you looking at then? Can't be more than a few thousand to catch 99.9% of the ones where a blacklist would work.
If done right it's easily faster than
It's also definitely faster than
> some sort of system that analyzes those clients and finds trends (for example, if they originate from a certain IP range)
> This is smarter than just matching substrings
No. It's smart to grab the 99.9% of bots with a blacklist of substrings, most importantly because it has a very very low false positive rate (unlike checking for JS support) because any human user that goes through the trouble of masking their UA as a known bot certainly knows to expect to get blocked here and there.
After that you can use more expensive checks, but at least you can bail out early on the "properly behaving" bots with a really cheap string-matching check. (Seriously, why do you think that's an expensive operation? Ever check how many MB/s GNU grep can process? That's just an example of how fast string matching can be, not a suggestion to use grep in your project. Your blacklist is (relatively) fixed, so you can use a trie or a Bloom filter; it'll be faster than Apache parsing your HTTP headers.)
I mean it's a good (and cheap) first pass to throw out quite a number of true positives.
I'm sure facebook engineers could come up with something way more sophisticated.
Each iteration of training that produces improved behavior can then be used to generate a new, more comprehensive training set. When I wrote a bayes engine to classify content, my first training set was based on a corpus that I produced entirely heuristically (not using bayes). I manually tuned it (about 500 pieces of content) and from that produced a new training set of about 5000 content items.
Eyeballing it for a couple of months initially, manually retraining periodically, spot checking daily as well as watching overall stats daily for any anomalous changes in observed bot/non bot ratios will work wonders.
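The bootstrapping loop described above might look something like this toy naive-Bayes sketch over user-agent tokens; the seed data, tokenization, and smoothing are all invented for illustration:

```python
from collections import Counter
import math

def train(samples):
    """samples: list of (tokens, is_bot) pairs, e.g. seed-labeled by a heuristic."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for tokens, is_bot in samples:
        counts[is_bot].update(tokens)
        totals[is_bot] += 1
    return counts, totals

def predict(model, tokens):
    counts, totals = model
    n = sum(totals.values())
    scores = {}
    for label in (True, False):
        score = math.log((totals[label] + 1) / (n + 2))  # smoothed class prior
        denom = sum(counts[label].values()) + 2
        for tok in tokens:
            score += math.log((counts[label][tok] + 1) / denom)  # smoothed likelihood
        scores[label] = score
    return scores[True] > scores[False]

# Round 1: a tiny seed set labeled by a simple substring heuristic. Each later
# round would relabel a larger corpus with the current model and retrain.
seed = [
    (["googlebot", "compatible"], True),
    (["crawler", "http"], True),
    (["firefox", "windows"], False),
    (["safari", "macintosh"], False),
]
model = train(seed)
print(predict(model, ["googlebot", "http"]))  # True
print(predict(model, ["firefox", "gecko"]))   # False
```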
Build a finite state machine which only accepts the terms in the blacklist. That should be a one-time operation.
Then feed each request into the FSM and see if you get a match. Execution time is linear in the length of the request, regardless of the number of terms in the blacklist.
There's no need to program like it's 1985 any more.
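One cheap way to approximate that build-once, scan-linearly behavior is a regex alternation compiled a single time. Note the caveat: Python's built-in re engine is a backtracking matcher rather than a true DFA, so for a strict linear-time guarantee you'd reach for an Aho-Corasick implementation or an RE2-style engine, but for a fixed alternation of literals the idea is the same:

```python
import re

# Blacklist of fixed literal terms, compiled ONCE into a single pattern
# (the one-time FSM-build step); the terms here are illustrative.
BLACKLIST = ["googlebot", "bingbot", "feedburner", "crawler", "spider"]
BOT_PATTERN = re.compile("|".join(map(re.escape, BLACKLIST)))

def matches_blacklist(user_agent: str) -> bool:
    # One scan over the request's UA, regardless of blacklist size.
    return BOT_PATTERN.search(user_agent.lower()) is not None

print(matches_blacklist("FeedBurner/1.0 (http://www.FeedBurner.com)"))   # True
print(matches_blacklist("Mozilla/5.0 (Windows NT 6.1) Firefox/14.0.1"))  # False
```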
This. This is the problem -- pay-per-click is completely and utterly broken, and has been ever since people figured out how to use bots effectively. For one, PPC is simply not compatible with real-time reporting, but there are myriad other problems with it.
And honestly, I don't know if it's even really a problem worth trying to solve, since there are better models out there already. "CPA" ads are by no means perfect, but they obviate almost all of the problems you mention simply by their definition: the advertiser does not pay unless whoever clicked on the ad does something "interesting". Usually, this can mean anything from signing up, to buying something, to simply generating a sales lead, but the important thing is that "interesting" is defined by the advertisers themselves. Not by, say, Facebook.
Full disclosure: I'm an engineer at a startup that does, among other things, CPA-based ad targeting.
 For those of us who don't work in ads, http://en.wikipedia.org/wiki/Cost_per_action has a decent explanation.
Separate out and distinguish possible humans from possible bots using a set of known humans and known bots.
Assign a human likelihood probability to each user and pay out the click based on the probability.
E.g., this user is 90% likely human; therefore the cost of the 20-cent ad click is 18 cents.
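That probability-weighted billing could be sketched as follows; the scoring function here is a placeholder for whatever classifier actually assigns the human-likelihood probability, and the field names are made up:

```python
def human_probability(account: dict) -> float:
    # Placeholder heuristic: activity-to-clicks ratio capped at 1.0.
    # In practice this would come from a trained classifier.
    ratio = account["other_actions"] / (account["ad_clicks"] + 1)
    return min(ratio / 10.0, 1.0)

def billable_amount(click_price_cents: int, account: dict) -> float:
    # Pay out the click in proportion to the estimated human likelihood.
    return click_price_cents * human_probability(account)

user = {"other_actions": 90, "ad_clicks": 9}   # ratio 9.0 -> p = 0.9
print(billable_amount(20, user))  # 18.0 cents, matching the example above
```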
This doesn't have to wait on Facebook, somebody could build this as a third-party app and then send refund requests to Facebook on clicks that are likely bots.
Go create a really, really smart bot and you will instantly know the limitations bots have. It's pretty trivial, with enough data and honeypots, to separate them from actual people.
Keeping those lists maintained would be a bit of fairly constant effort, unless you could come up with a self-training mechanism, or a self-validating mechanism.
Killing Google's ability to crawl your site is pretty much as bad as or worse than blocking bots. Though you should also be able to ID bad guys by noting which IPs send spoofed user-agents to, say, some honeypot links specifically disallowed for that user-agent in your robots.txt file.
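A toy version of that honeypot check, assuming you can scan an access log of (ip, user-agent, path) tuples; the honeypot path and log shape are made up:

```python
# robots.txt served to everyone would contain something like:
#   User-agent: Googlebot
#   Disallow: /honeypot-for-googlebot/

def spoofed_crawler_ips(access_log):
    """access_log: iterable of (ip, user_agent, path) tuples.
    Flags IPs claiming to be Googlebot that fetch a path the real Googlebot,
    which honors robots.txt, would never request."""
    flagged = set()
    for ip, ua, path in access_log:
        if path.startswith("/honeypot-for-googlebot/") and "googlebot" in ua.lower():
            flagged.add(ip)
    return flagged

log = [
    ("10.0.0.5", "Mozilla/5.0 (compatible; Googlebot/2.1)", "/honeypot-for-googlebot/x"),
    ("10.0.0.9", "Mozilla/5.0 (compatible; Googlebot/2.1)", "/products/1"),
]
print(spoofed_crawler_ips(log))  # {'10.0.0.5'}
```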