
My startup is essentially an advertising aggregator (pooling traffic from a variety of publishers and routing it to advertisers) and dealing with things like bot detection is a HUGE chunk of what we work on, technology-wise. Let me try and give you an idea of how deep the rabbit hole can go.

- Okay, you want to detect bots. Well, "good" bots usually have a user agent string like, "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)." So, let's just block those.

- Wait, what are "those?" There is no normalized way of a user agent saying, "if this flag is set, I'm a bot." You literally would have to substring match on user agent.

- Okay, let's just substring match on the word "bot." Wait, then you miss user agents like, "FeedBurner/1.0 (http://www.FeedBurner.com)." Obviously some sort of FeedBurner bot, but it doesn't have "bot" or "crawler" or "spider" or any other term in there.

- How about we just make a "blacklist" of these known bots, look up every user agent, and compare against the blacklist? So now every single request to your site has to do a substring match against every single term in this list. Depending on your site's implementation, this is probably not trivial to do without taking some sort of performance hit.
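That naive blacklist check might look like the following sketch (the terms are illustrative; a real list runs to hundreds of entries, which is where the per-request performance hit comes from):

```python
# Naive bot filtering via a user-agent substring blacklist.
# These terms are illustrative; a production list is far longer.
UA_BLACKLIST = ["googlebot", "bingbot", "feedburner", "crawler", "spider"]

def looks_like_known_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    # Every single request pays for a scan over every blacklist term.
    return any(term in ua for term in UA_BLACKLIST)
```

Note that this already has to special-case entries like "feedburner" that contain no bot-like keyword at all.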

- Also, you haven't even addressed the fact that user agents are specified by the clients, so it's trivial to make a bot that identifies itself as "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1." No blacklist is going to catch that guy.

- Okay, let's use something else to flag bot vs. non-bot. Say, let's see if the client can execute Javascript. If so, let's log information about those that can't execute Javascript, and then build some sort of system that analyzes those clients and finds trends (for example, if they originate from a certain IP range).
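A minimal sketch of that flagging approach, assuming a hypothetical beacon endpoint that embedded JavaScript calls back to (all names and the grace period are made up for illustration):

```python
import time

# Clicks whose JS beacon never fires are presumed unable to run JavaScript.
pending_clicks = {}   # click_id -> timestamp of the ad click
confirmed = set()     # click_ids whose JS beacon fired

def record_click(click_id):
    """Called when the ad click lands on the server."""
    pending_clicks[click_id] = time.time()

def record_beacon(click_id):
    """Called by the JavaScript beacon embedded in the landing page."""
    confirmed.add(click_id)

def suspect_clicks(grace_seconds=60):
    """Clicks older than the grace period with no beacon are suspects."""
    now = time.time()
    return [cid for cid, ts in pending_clicks.items()
            if cid not in confirmed and now - ts > grace_seconds]
```

The grace period is exactly why this approach is after-the-fact: you cannot call a click suspect until the window has expired.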

- This is smarter than just matching substrings, but this means you may not catch bots until after the fact. So if you have any sort of business where people pay you per click, and they expect those clicks not to be bots, then you need some way to say, "okay, I think I sent you 100 clicks, but let me check if they were all legit, so don't take this number as holy until 24 hours have passed." This is one of the reasons why products like Google AdWords don't have real-time reporting.

- And then when you get successful enough, someone is going to target your site with a very advanced bot that CAN seem like a legit user in most cases (i.e. it can run Javascript and answer CAPTCHAs), and spam-click the shit out of your site, and you're going to have a customer that's on the hook to you for thousands of dollars even though you didn't send them a single legit user. This will cause them to TOTALLY FREAK THE FUCK OUT, and if you aren't used to handling customers FREAKING THE FUCK OUT, you are going to have a business and technical mess on your hands. You will have a business mess because it will be very easy to conclude you did this maliciously, and you're now one Hacker News post away from having a customer drag your name through the mud; for the next several months, 7 out of the top 10 results on any Google search for your company's name will be that post and related ones. And you'll have a technical mess because your system is probably based on, you know, people actually paying you what you think they should, and if you have no concept of "issuing a credit" or "reverting what happened," then get ready for some late nights.

I'm seriously only scratching the surface here. That being said, I'm not saying, "this is a hard problem, cut Facebook some slack." If they're indeed letting in this volume of non-legit traffic, for a company with their resources, there is pretty much no excuse.

Even if you don't have the talent to preemptively flag and invalidate bot traffic, you can still invest in the resources to have a good customer experience and someone that can pick up a phone and say, "yeah, please don't worry about those 50,000 clicks it looks like we sent you, it's going to take us a while but we'll make sure you don't have to pay that and we'll do everything we can to prevent this from happening again." In my opinion this is Facebook's critical mistake. You can have infallible technology, or you can have a decent customer service experience. Not having either, unfortunately, leads to experiences exactly like what the OP had.

I understand that it is challenging for a third party like you to identify the bots. But Facebook has more than enough signal on each user clicking on an ad (friends, posts, location, 'non ad-click activity' to 'ad-click activity' ratio) to determine if that user is a bot. Looks like they are choosing to not do anything with this information. "FB is turning a blind eye to keep their revenue" seems to stand up to Occam's Razor.

Couldn't agree more. It should be almost trivial for them to track down bots and flag bad accounts. I definitely wouldn't classify it as one of their harder problems.

Not saying that's not the case, but you have no data or knowledge to back any of that up. I wouldn't assume a problem is easy or hard until I've got sufficient info on it.

Too often I've heard "how could XYZ not have done this, fixed that?" Then I've seen that same party sign on to fix this "easy" problem and get themselves into a world of hurt.

Well I do have at least some data to back that up. I know that they are a walled garden that keeps a ton of user data for every account (even the subset represented in Open Graph is significant). And it's with this knowledge that I make my conclusion.

* Accounts consisting of an unusually high proportion of ad-clicking activity compared to their other activities are likely bots.

* New accounts with no history but high ad clicks are likely bots.

Before a click is counted, check the source account's human rank (its activity-to-clicks ratio). If it's below a certain threshold (tuned over time), then don't bill your customer for it.

The algorithms to determine human rank can start fairly simple and become more complicated and accurate over time.
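A simple starting point for such a human rank could look like this sketch (the field names and threshold are assumptions for illustration, not anything Facebook actually uses):

```python
# Hypothetical "human rank": ratio of organic activity to ad clicks.
def human_rank(account):
    activity = account["posts"] + account["comments"] + account["photos"]
    clicks = max(account["ad_clicks"], 1)  # avoid division by zero
    return activity / clicks

def billable(account, threshold=5.0):
    """Only bill the advertiser for clicks from accounts above the threshold."""
    return human_rank(account) >= threshold
```

The threshold would be tuned over time against labeled data, as the comment suggests.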

In terms of their current sophistication I would definitely call this an easy problem from a tech standpoint. Scaling to millions of concurrent users? Hard. Attracting and keeping millions of users? Hard.

The bot masters then just have to figure out roughly what your activity/ad clicks rate is. They can do this over time by looking at statistics.

Then they make some bundled crapware browser plugin that gets installed by legit Facebook users and clicks ads at that interval plus some random time. Possibly do the ad clicks in the background so that the user does not know.

How exactly are they going to know they're being flagged as bots? When all you're doing is discounting those bot clicks to the advertiser it becomes far more difficult to test your limits as a bot (you would need to be a paying advertiser, just to analyze the human to bot threshold). It's similar to hell-banning, in that the user's generally unaware they've been hell-banned.

Take a minute to think about the effort a bot account has to go to in order to even come close to gaming a system that tracks their behaviour.

Let's assume that in order to be deemed human you need at least a few characteristics like "account older than 1 day", "friends from other non-suspicious accounts", "a photo or at least comments on a friend's photos", and a "certain amount of click activity".

Now let's take that criteria and make it difficult for a bot to even know if they're being flagged as a bot where the only indication is a reduced bill at the end of the day / week / month to the advertiser. So bot makers would have to pay to test and would at best receive delayed feedback on their success.

Isn't this starting to seem extremely difficult, and generally unscalable from a bot perspective? It doesn't have to be impossible, there just needs to be lower hanging fruit, of which there are plenty should FB implement a system like this.

And should this fairly simple solution still have a few holes that a very elite few bot coders are able to game, you've still got tons of data and time to use to tune your algos.

It just doesn't make sense that one of their advertisers is being charged for 80% bot clicks, until you start to question Facebook's motivations.

I agree. This is an easy problem, considering FB's resources. Only a logged in FB user can click on ads. Not random crawlers. That alone makes this problem 100 times easier than what Google has to face with invalid clicks. No excuses.

It depends whether the bots are creating accounts or hijacking existing ones (via browser plugins or whatnot).

The bots will not be able to get accurate statistics without paying, but they might be able to get "good enough" statistics. For example: is the system moving in a direction they like?

If you do need to pay, a good bot herder probably has enough stolen credit cards to use.

Yes, there are certainly mitigation techniques. But like other spam/bot related problems it becomes an arms race.

You only need 1 good bot programmer to release his bot online for thousands of budding script kiddies to take advantage of.

Hijacking browsers? We're getting pretty sophisticated and far-fetched. If you've hijacked someone's browser there are many better ways to make a profit than spiking a competitor's ad buys.

So basically we've narrowed down the potential bot makers to:

1) Has a way to hijack thousands of Facebook users' browsers, and is more interested in spiking some company's ad buys than other profitable activities they could be doing.

2) Has access to stolen credit cards so they can spend the money required to debug and test their bot's humanity against the defensive algorithms. Spending money in order to indirectly make money because some company is paying them to spike a competitor's ad buys. Obviously this is going to cut into their margins, so the amount they spend should be less than they're being paid.

3) Has the patience and dedication to analyze the bills when they come in to determine how much of their activity is considered botlike and then tweak their behaviour accordingly.

4) Thousands of budding script kiddies are also being paid by companies to spike their competitors' ad buys.

5) Rides a unicorn?

Don't forget there also need to be companies willing to pay someone to make these bots who don't mind how incredibly difficult it would be to measure its success. In fact, if there are companies willing to pay for this, then you're going to have a whole bunch of scammers who simply claim they're able to do it, whom the companies will have to filter out.

We're incredibly far into the realm of unprofitability and edge-casery don't you think? An alien spaceship is going to land in my backyard before this happens.

Not really, this is just off the top of my head and I'm sufficiently unfamiliar with bot programming and FB advertising that I'm sure that plenty of better solutions must exist.

You are forgetting that there are growing numbers of programmers in developing nations for whom doing this sort of thing can be an attractive way to earn a living (or at least they are willing to try it). Look on any "hire a freelancer" type sites and you will find hundreds of postings for tasks that sound awfully like scamware development (usually at a couple of dollars an hour).

Browser hijacking is not so uncommon either, I've installed legit software in the past that has come bundled with some crapware that injects extra adverts into web-pages and other things.

The analysis would not be too difficult either: simply set a "click rate" and adjust it downwards until you are being charged > $0. Does Facebook offer any method to get free trial credits? If so, this could be abused, heavily.

The economics are roughly the same as with spam, if you do enough of it your "hit rate" doesn't matter so much.

Where there is an affiliate program there will be bots. Insofar as Facebook creates a way for developers to monetize their apps through a revenue share on ad revenue, developers will have an incentive to falsify app usage and ad clicks.

Compare it to google's problem filtering out automatically generated splogs, though - it took them years to make progress against that.

Real human social activity is a lot more random than real informational content in blogs - so fb's problem seems harder.

Also, facebook banning accounts is a much stronger action than google just lowering the pagerank of suspected splogs.

So overall I think fb's problem is harder than a currently google-unsolved problem.

Facebook has a complete record of the user's every action since joining Facebook. That's much easier than coming across a random HTML page and deciding if it's real (whatever "real" means; YouTube comments are real but very low signal, whereas a database that runs itself could provide lots of signal).

He never said to ban the suspicious accounts, just don't charge companies for their clicks.

The detection part aside, it's a simple fix. I just wonder why FB hasn't already. (Is it because it'd hurt their revenue when they need it the most?)

But in the same respect, the fact that real human social activity is much more random is a reason why this should be easier. Bots are inherently pattern-based, and while there are bots that could maintain fake interactions in order to pass a wide-sweep analysis of low-socializing accounts with a high ad response rate, a first pass would do quite a bit against the more obvious bots, which are the ones that many of the people who had these problems noticed.

Now of course I'm simplifying this to a nearly absurd degree, but I do think there are reasons for Facebook to want to ignore these issues and that even a minimal step up in bot-prevention could go a long way when it comes to their advertisers.

I suppose a naive bot would not be "random" enough, and would obviously not be human.

But on the other hand, since valid human behavior includes much more senseless, random, and repetitive actions (compared to blogging), a well-done bot would be harder to detect.

Oh, absolutely! I have no doubts that Facebook will always be a step behind the bots, and I don't think anyone feels or even expects otherwise. That's just part of the social networking climate right now, and all major (and even many minor) social networking sites have to deal with it. However, in many cases that were pointed out in both the article and the comments here, these people weren't having their pages liked and ads clicked through by advanced bots, but instead by obviously shill accounts. Now of course there is always the option that these are legitimate accounts with weird use patterns, but regardless there are methods of doing at least a significant sweep and remove some of the more obvious bots or at least offering them a chance of legitimacy confirmation (even something similar to CAPTCHAs for non-standard account usage that messes with advertisers campaigns).

Like I said, I absolutely believe there will be bots that are incredibly sophisticated that might be difficult to detect, but I still think more could be done to stop the less sophisticated ones.

I come down on the side that it shouldn't be so difficult for Facebook to at least spot some of the 4 bots per human.

However, even if you assume it is a hard problem - that's all the more reason they should have listened (and responded!) when a company identified this problem with some hard data to back it up.

The data would be:

* IP addresses - one can assume a bot would only have a set of addresses they could use, barring botnets
* request patterns, i.e. did the bot request css/js, etc.
* request timeframes
* UA strings

Sure, it's a big data problem, but I can imagine that Facebook has solved these types of scenarios many times over.
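A toy version of combining those signals into a single score might look like this (weights, thresholds, and the bad-IP set are arbitrary placeholders, not real tuning):

```python
# Example bad-IP set; 203.0.113.0/24 is a documentation-only range.
KNOWN_BAD_IPS = {"203.0.113.7"}

def bot_score(click):
    """Combine several weak signals into one score; higher = more bot-like."""
    score = 0
    if click["ip"] in KNOWN_BAD_IPS:
        score += 3
    if not click["fetched_assets"]:      # never requested css/js
        score += 2
    if click["seconds_on_page"] < 2:     # bounced almost instantly
        score += 1
    if "bot" in click["user_agent"].lower():
        score += 3
    return score
```

Clicks above some cutoff would be held back from billing pending review, rather than hard-blocked.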

What if you start a new Amazon EC2 spot instance (netting you a new IP address), start up Chromium in headless mode (say, using Xvfb), navigate to the website of choice, use mouse automation to start clicking around, click the ad, spend 5 minutes clicking around in a semi-choreographed pattern on the advertisee's website, and then shut down the instance -- only to repeat?

Moreover, Amazon is always buying new IP subnets.

It sounds like you don't need to go to that much hassle currently, but even that rigmarole is simple enough to combat. The user account should be real, the usage real (comments, photos, messages back and forth) and the friends also real. False-positive spam IDs are OK; that will lower your revenue but won't constitute fraud against your customers. Put up a test for users you think are spamming; the test they already do of identifying photos of your friends would be a good one.

Large numbers of real looking fake accounts should be hard to keep up.

The user from that new IP won't have any real human like history - photos shared and commented on over time etc.

But why would you go through all that to click on ads? If I click on an ad for "Some Record Company" how does that make me money?

It doesn't always need to make you money, sometimes you might just want it to cost your competitor money.

I don't count clicks from any amazonaws or ec2 hostnames on my site.

Then pay Amazon for a list of their EC2 IPs, or obtain that information from a public source (e.g. RIPE, ARIN).
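A sketch of that filter, using placeholder netblocks from documentation-only ranges (AWS does publish its current address ranges, so a real implementation would load that data rather than hard-code it):

```python
import ipaddress

# Placeholder CIDRs standing in for published EC2 ranges.
EC2_NETBLOCKS = [ipaddress.ip_network(n) for n in
                 ["203.0.113.0/24", "198.51.100.0/24"]]

def from_cloud(ip_str):
    """True if the click's source IP falls in a known cloud netblock."""
    ip = ipaddress.ip_address(ip_str)
    return any(ip in net for net in EC2_NETBLOCKS)
```

With many netblocks, a radix/prefix tree beats this linear scan, but the idea is the same.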

This is a mostly solved problem, though. Or at least there are people and products out there that do this specifically for you, not with a focus on bots alone, as that is a bit of an edge case, but for mobile devices.

DeviceAtlas and WURFL are both products that I have used that attempt to have complete coverage of all UA strings in use in the wild, even the ones that attempt to spoof legitimate UA strings.

While their focus is on device detection and the properties of these devices for mobile content adaptation, knowing which connections are from bots is just as important if you are going to alter what content you serve up based on UA.

These products actively seek out new UAs and add them to their rule sets in a way that I can only assume will be more accurate over time than a home-grown solution.

FWIW I know that FB use at least one of these products for their device detection, so if you want to match what FB thinks (knows!?) this is where you would start.

It's just as easy to invest in several fake profiles that act just like real users. Pay people to work from home maintaining fake accounts for you, posting stuff that seems real, interacting with your other fake accounts and celebrity/business pages, etc. Then, run bots with these accounts, click around on a subset of good links randomly, and throw in a reasonable ratio of ad clicks per account, then move on to the next shill account.

This definitely takes non-trivial effort, but it's also definitely doable if you have a serious interest in perpetrating click fraud. While Facebook hypothetically could fight this form of click fraud, it'd likely be difficult and the method would be fragile/easily circumvented. Like most things, PPC advertising can only function in a world primarily populated with positive actors.

I am unsure if Facebook provides a service like AdSense to motivate this kind of behavior. It would be interesting to see how they could fight this type of behavior, though.

1. Make the split uneven. For example for every dollar advertiser pays, only 30 cents goes to the page owner who received the "Likes"

2. Lobby to make these kinds of activities illegal. They are, after all, spamming activities.

3. These kinds of fake account farms will usually have very few connections to the outside world. That is, accounts that do not obey the "Six Degrees of Separation" should be flagged for human review.

4. Once reviewed enlist law enforcement help to track down the owners of the all the accounts in these farms.

Painstaking work, but I guess the point is that advantage can lie with Facebook as opposed to the spammer in this game. I am not sure about the ethical implications though.

Maybe this is why Google wants so badly to succeed in the social space: extra spam-fighting abilities.

You're basically saying 'there is a point where the line between actual user and paid click-frauder becomes blurry'. Which is true, but that point is way beyond what is profitable, which is enough - you 'only' need to make your system hard enough to beat to make it not worth the effort.

Sure, I agree that your summary is correct. I don't really think the system I described is necessarily untenable, though. If you make 10k+/mo from ads I think it could be plenty profitable.

Google seems to handle click fraud pretty well: http://www.google.com/intl/en_ALL/ads/adtrafficquality/index...

I couldn't find any resources on how Facebook handles this.

It's much easier to guess a weak password than create a fake profile. As for everything else, I totally agree with you.

FB turning a blind eye towards a shady advertising practice because it generates revenue? See: Scamville http://techcrunch.com/2009/10/31/scamville-the-social-gaming...

Just as a thought - Facebook must be in a unique position to provide almost unbeatable CAPTCHAs - just put up images of a known friend and four / forty random pictures - the one who chooses correctly is likely the person logged in.

(ignore this if FB already does it)

They do in fact do that (or something like it) when you need to recover an account/password.

They also did this the first time I logged in from a foreign country.

I suspect even that signal is not trustworthy. Given that it's not 100%, what fraction of the 900+ million Facebook users are actually users? Twitter seemed to be totally plagued with that at one point anyway.

What's the grey area look like? What's the chance of falsely identifying a real user as a fake one?

Rhetorical question. Not sure FB could even answer them accurately.

We've found that identifying sources of traffic and patterns of usage is superior to simple user-agent detection, particularly since, as you've noted, user-agent spoofing is trivial, and some bot traffic is driven through user-based tools (say, script-driven MSIE bots).

Instead of that, we'll watch for patterns of use in which high volumes of traffic come from unrecognized non-end-user network space.

"Unrecognized" means that it's not one of your usual search vendors: Google, Yahoo, Bing, Baidu, Yandex, Bender, AddThis, Facebook, etc.

"Non-end-user network space" means a netblock which is clearly not a residential or commercial/business networking provider.

For this you'll mostly want to eyeball, though when we saw very high volumes of search coming at us from ovh.net (a French low-cost hosting provider), it was pretty obvious what the problem was.

We do this using an IP-to-ASN/CIDR lookup (such as the Routeviews project: http://www.routeviews.org/) and a local ASN -> name lookup. One source of the latter is http://bgp.potaroo.net/cidr/autnums.html
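In miniature, that lookup chain might look like the following (the prefix table is a toy stand-in for the Routeviews/potaroo data, which holds hundreds of thousands of prefixes and really wants a radix tree rather than a linear scan):

```python
import ipaddress

# Toy prefix -> (ASN, name) table; real data comes from Routeviews dumps
# plus an ASN-to-name mapping like bgp.potaroo.net/cidr/autnums.html.
PREFIX_TO_ASN = {
    ipaddress.ip_network("198.51.100.0/24"): (16276, "OVH"),
    ipaddress.ip_network("192.0.2.0/24"): (15169, "GOOGLE"),
}

def lookup_asn(ip_str):
    """Map a source IP to its originating ASN and network name, if known."""
    ip = ipaddress.ip_address(ip_str)
    for net, asn in PREFIX_TO_ASN.items():
        if ip in net:
            return asn
    return None
```

Once clicks are tagged with an ASN name, eyeballing "high volume from a hosting provider" becomes a one-line group-by.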

One as-yet-unmentioned technique is to adjust a bot score by taking an OS fingerprint, and comparing that to the listed user agent.

It's not perfect for a variety of reasons, but I found it to be a useful input for a similar bot detection problem. That said, this was a long time ago so I'd need to re-run some experiments to see if the hypothesis remained valid.

For those not familiar with OS fingerprinting, it's a method of looking at the details of a connection and determining which operating system is most likely to have created packets with those options and flags. In BSD, it's built into pf.

On the packet level ... hm that's pretty clever, definitely not trivial to fake, either.

Just that if your bot runs from, say, Windows 7 and it spoofs an IE8 user-agent header, or even runs by directly automating IE itself, how do you detect it then? Both of those scenarios are not unlikely at all.

If the bot spoofs a user-agent that is viable for the OS, then you need to rely on other signals. It only works as a component of a defense in depth strategy.
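The mismatch signal itself is simple once a fingerprinting tool like pf or p0f has labeled the connection's OS; a sketch (the substring-to-OS mapping is illustrative):

```python
# Compare the OS inferred from the TCP/IP fingerprint with the OS the
# User-Agent string claims. A disagreement raises the bot score.
UA_OS_HINTS = {"Windows NT": "windows", "Mac OS X": "macos", "Linux": "linux"}

def ua_claimed_os(user_agent):
    for hint, os_name in UA_OS_HINTS.items():
        if hint in user_agent:
            return os_name
    return "unknown"

def fingerprint_mismatch(fingerprint_os, user_agent):
    """True when the UA claims an OS that contradicts the packet fingerprint."""
    claimed = ua_claimed_os(user_agent)
    return claimed != "unknown" and claimed != fingerprint_os
```

As the comment notes, a bot that spoofs a UA consistent with its real OS sails past this check, so it only works as one input among several.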

Interesting. Any Linux equivalents or Java classes which offer similar capabilities? (We're using a Java-based application server / webserver.)

It works off of checking the SYN packet [0], which happens at the transport layer [1]. Once it gets to the application layer the information is not available unless the transport layer stores the information and provides a method to query a given connection.

[0] http://www.openbsd.org/faq/pf/filter.html#osfp

[1] http://en.wikipedia.org/wiki/OSI_model

I suspected something like this, but was hoping enough of the signature might survive for a daemon (or Java class acting as one) to take a peek.

The linux equivalent is p0f (http://freecode.com/projects/p0f)

I can't speak to Java classes, but they'd need to be dealing with raw packets which isn't an option in a typical stack.

This is good info. In my past life, I oversaw the web monetization pipeline as part of my job. Click fraud detection was part of it. It is a hard problem and there's no 100% foolproof solution.

As in your case, it's nearly impossible to detect click fraud in real time as the clicks stream in. We batch-processed the clicks throughout the day and did daily billing settlement at the end of the day. It's easier to detect click fraud over a period of time. Obvious bots were filtered out, like the search engine crawlers.

For click fraud, it's impossible to distinguish a one-off click as a user click or a fraud click. We tried to catch repeat offenders: we looked for duplicate clicks (same IP, same agent, other client attributes) against certain targets (same product, product categories, merchants, etc.) over periods of time (1-minute, 5-minute, 30-minute, hours, etc.). We also built historic click stats on the targets so we could check the average click pattern against historic stats (yesterday, a week ago, a month ago, a year ago).
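A batch version of that duplicate-click check might look like the following sketch (the window size and per-window cap are placeholders; real systems run several window lengths as described above):

```python
from collections import defaultdict

def flag_duplicates(clicks, window_seconds=300, max_per_window=3):
    """Group clicks by (ip, user_agent, target) and flag any group that
    exceeds the cap within a sliding time window."""
    buckets = defaultdict(list)
    for c in clicks:
        buckets[(c["ip"], c["ua"], c["target"])].append(c["ts"])
    flagged = set()
    for key, times in buckets.items():
        times.sort()
        for i in range(len(times)):
            # Count clicks from the same source inside the window starting here.
            j = i
            while j < len(times) and times[j] - times[i] <= window_seconds:
                j += 1
            if j - i > max_per_window:
                flagged.add(key)
                break
    return flagged
```

Flagged groups would feed the human-review queue rather than being discarded outright, given the money involved.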

We ran quick checks every 1 minute and 5 minutes to detect unusual click patterns. These were mainly for detecting DoS attacks against our site or against certain products. We actually had a realtime reaction mechanism to deal with it: slow down the requests, throw up a CAPTCHA, redirect to a generic page, or block IPs at the routers.

Every couple of hours, we batch processed the clicks so far and ran the exhaustive checks to flag suspicious clicks. At the end of the day, definite fraud clicks were thrown out, and moderately suspicious clicks were grouped and reviewed by humans. It's a fine balance to flag less suspicious clicks since there's money involved.

The most sophisticated click fraud I've seen involved a distributed botnet clicking from hundreds of IPs all over the world, doing slow clicks throughout the day. We found out because it was targeting one merchant's product category and causing unusual click stats compared to the historic pattern.

And then there are the "wet bots" - Amazon Turks and overseas humans (lowly) paid to use a regular browser and PC and manually click on stuff for whatever purpose (rig internet voting, drive fake advertising charges, steal data, skew hit counters - whatever).

Thank you for introducing me to the term "wet bots"!

Is an impression from a Mechanical Turk not a true impression? It's a real person, after all. If the premise is that marketing and advertising work through mere exposure, then the intent of the view shouldn't be measured.

What you are getting at, however, is very important to ad-driven internet business models. You make the empathetic deduction that if you are there for one purpose, you are impervious to advertising.

Now, if you are still with me after that statement, then there are larger implications of online advertising than are actually being measured. For instance, if the intent of the view mattered, would it be worth more? I think Google has answered that question rather clearly.

Measuring "why someone is visiting a URL" (and thus generating an ad-view) is a larger technical problem than could be accomplished with some fancy algorithms. That is, unless the service knew all of your browsing history...

In any case, you raise a good point about the intent of the page-view and how that should be "credited" as advertising. To do this however is to challenge existing 'fundamental' theories.

The problem with mechanical turk impressions/clicks is typically the country of origin, since they usually involve paying people in third-world countries a fraction of a cent which is still worth some nominal amount to them. Say you are an advertiser who runs an ecommerce site that only ships within the US. You don't want to pay for impressions/clicks from people in India because you can't actually sell to them, even if they are "real" people and not a script.

Also, there is "brand marketing" and "performance marketing." I'm not sure they have classic definitions, but generally brand marketing is based on "mindshare" and tough to quantify. This is basically any TV commercial -- Budweiser doesn't expect you to jump off your couch the second you see a Bud Light commercial, but maybe a week later when you're in the supermarket, you spot that new brand of beer and recall the commercial with Jay-Z rapping and grinding with some models, and think, "eh, maybe I'll try Bud Light Platinum, I will purchase this and consume it later because I like the idea of rapping with models," even at a subconscious level.

Performance marketing is all about ROI. You spend $100 marketing dollars, and you want to make at least $X back. Typically $X = 100 in this example, although in some cases you may want to adjust profit margins or even run at negative profit (for example, if you want to maximize revenue even at an unprofitable rate). Almost any advertiser on the internet has a performance marketing mentality, mainly because it's much easier to actually measure performance. This is why advertisers like the OP are frustrated with Facebook, because paying for bot traffic makes it extremely hard to hit their performance marketing goals.

You are right though, in that this is a huge gray area. To give two extremes, let's say you run an ecommerce site but you'd also like to monetize with ads. So you reach a deal with a shopping comparison site to display their products on your pages in your footer. If you buy AdWords to your site, and those visitors from AdWords scroll down and click on the shopping comparison ads, they are probably fine with paying you for that traffic, although you are probably paying something like $2/click. If you make a page with JUST the shopping comparison ads, and buy traffic from someone who basically infects computers with trojans and overwrites their Google search results to be your site (say at $0.01 per click), then you'll get clicks on the shopping comparison ads and make lots of money, except the shopping comparison site will realize these clicks don't lead to any actual sales for their merchants (since this is extremely poor quality traffic, as the users that end up there have basically no intent) and will kick you off. Somewhere in between advertising for "first page Google AdWords" traffic and "trojan/botnet" traffic is basically every other form of internet advertising, and how gray that area is, is typically completely defined by your business partners.

It's at best a gamed impression.

Advertising and marketing are generally very intentionally tailored. You're not trying to reach any human, you're trying to reach a human in your target demographic.

Skewing results (and wasting resources) by paying people who are entirely outside your marketing effort to click through links (or take other similar resource-costing / data-skewing actions) is not the desired result.

All that is really no excuse for not developing a foolproof system. And if you know that you don't have a great system to catch fraud, then you should simply leave some money on the table.

See, the essence here is not to come up with a great algorithm; this is not a programming contest. It is to create a system where everyone (publisher, advertiser, ad network, and customer) wins.

In this very early stage, the success of Facebook's Ad network should be measured only by one metric - ROI of Advertisers. And that number has to be consistently better than Google's.

> And that number has to be consistently better than Google's.

Why's that? This isn't a zero sum game. You don't advertise only on Google or only on Facebook; as much as your budget allows, you advertise on every channel you can achieve a positive ROI on.

For any advertiser looking for a direct response, like a signup or purchase, I doubt anything Facebook can do would make the ROI better than advertising on search. No amount of demographic and interest targeting will make advertising to people as effective as advertising to intent. You can pay Google to send you people while they're in the process of looking for your product; Facebook doesn't do that.

Who is the "you" that your post above mentions?

The advertiser.

ROI to the advertiser is a key metric. Smart marketers compare the LTV (lifetime value) of customers that come from specific campaigns. The ROI metric does not have to be better than Google's for an ad campaign on Facebook or Microsoft adCenter to be profitable. I have worked with clients to use web analytics and cohort analysis to optimize both the quality (higher conversion, better LTV) and quantity of leads. Sometimes Microsoft adCenter traffic converts better but does not have sufficient quantity compared to Google AdWords. Different ad platforms also have different levels of competition as well as different levels of fraud detection and click quality. I am surprised that Facebook does not detect and suspend obviously spammy user accounts, whether an account is bot-driven or outsourced to humans overseas. Google and Facebook have financial incentives not to filter out all low-quality ad traffic, because many advertisers don't have the expertise or time to closely examine traffic sources, like the company mentioned in this news story.

A combination of tech expertise and marketing savvy is needed to reduce this problem. More analytics tools are available for Google AdWords than for Facebook ads.

Another great way of detecting bots is to redirect the visitor through an SSL connection and sniff their cipher suite. By comparing the list of ciphers the bot presents with a list of known ciphers of major browsers, you can accurately detect a majority of bots. Many bot authors use the same few libraries which never change their cipher suite.
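The comparison described above can be sketched roughly like this. Assume you already capture the ordered cipher list from each client's TLS ClientHello (e.g. via your SSL terminator); the fingerprint values below are purely illustrative, not real browser fingerprints.

```python
# Sketch: compare a client's offered cipher-suite list against known
# browser fingerprints. Cipher IDs here are made-up placeholders.

KNOWN_BROWSER_FINGERPRINTS = {
    # browser label -> ordered tuple of cipher-suite IDs it offers
    "firefox_14": (0xC02B, 0xC02F, 0xC00A, 0xC009, 0x0033, 0x0039),
    "chrome_21":  (0xC02B, 0xC02F, 0x009E, 0xC00A, 0x0033),
}

def matches_claimed_browser(user_agent: str, offered_ciphers: tuple) -> bool:
    """Return True if the ordered cipher list matches a known fingerprint
    for the browser family the User-Agent claims to be."""
    ua = user_agent.lower()
    for label, fingerprint in KNOWN_BROWSER_FINGERPRINTS.items():
        family = label.split("_")[0]          # e.g. "firefox"
        if family in ua and offered_ciphers == fingerprint:
            return True
    return False

# A cURL or Ruby bot claiming to be Firefox typically presents its TLS
# library's default cipher ordering and fails the check:
bot_ciphers = (0x0039, 0x0038, 0x0035, 0x0033)
print(matches_claimed_browser("Mozilla/5.0 ... Firefox/14.0.1", bot_ciphers))  # False
```

The key design point is that the cipher list is set by the HTTP library's TLS stack, not by whatever User-Agent header the bot author chose, so the two can be cross-checked.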

This seems brilliant. Are there any downsides?

There is a slight performance hit from the overhead of creating SSL connections. However, using this method you can detect most cURL bots, practically all bots written in Ruby, and myriad other bots that are lying about their User-Agent. Combine it with other techniques for maximum effectiveness.

The big problem here isn't detecting bots. Sure, in this one company's case bots may be the problem; but in general, it's detecting behavior that you are paying for that doesn't make you any money. There may be real users who click "like" a lot; they may do it because they're paid, or they may do it because they're bored and trying to see how many "likes" they can collect. It doesn't really matter; none of them are very valuable to an advertiser.

Rather than trying to explicitly filter bot traffic, they should try to quantify the value of particular "likes". For instance, a user who demonstrably clicked through and purchased something is much more valuable than one who "likes" 10,000 companies and has no friends, whether or not they're a bot.

And heck; a bot that has thousands of followers who all buy lots of things is valuable too. Maybe someone created a particularly effective meme-generator bot, and people who follow that bot are likely to buy ironic T-shirts. Whether or not a user is a bot isn't all that relevant; it's whether that user's activity correlates with you making sales.

> How about we just make a "blacklist" of these known bots, look up every user agent, and compare against the blacklist? So now every single request to your site has to do a substring match against every single term in this list. Depending on your site's implementation, this is probably not trivial to do without taking some sort of performance hit.

???! Excuse me? Are you programmers? Efficient substring matching is a solved problem. How many entries in that blacklist are you looking at then? Can't be more than a few thousand to catch 99.9% of the ones where a blacklist would work.

If done right it's easily faster than

> see if the client can execute Javascript.

because that requires a whole extra request round-trip before detection. Also, it's no longer true that bots don't use Javascript; the libraries are freely available.

It's also definitely faster than

> some sort of system that analyzes those clients and finds trends (for example, if they originate from a certain IP range)

> This is smarter than just matching substrings

No. It's smart to grab the 99.9% of bots with a blacklist of substrings, most importantly because it has a very, very low false-positive rate (unlike checking for JS support): any human user who goes to the trouble of masking their UA as a known bot certainly expects to get blocked here and there.

After that you can use more expensive checks, but at least you can bail out early on the "properly behaving" bots with a really cheap string-matching check. (Seriously, why do you think that's an expensive operation? Ever check how many MB/s GNU grep can process? That's just an example of how fast string matching can be, not a suggestion to use grep in your project. Your blacklist is (relatively) fixed, so you can use a trie or a Bloom filter; it'll be faster than Apache parsing your HTTP headers.)

This is entirely honor system, though. All of the user-agent checking has to be layered with the other bot-removal techniques. Malicious users will always fake user agent strings.

Of course. But you also need to make sure you don't accidentally count the non-malicious bots that have known UA strings but just happen to crawl all over the place, including ads.

I mean it's a good (and cheap) first pass to throw out quite a number of true positives.

This is a great use case for Bayesian filtering, and I bet you could hook an existing Bayes engine up to this problem space, if not just write one using common recipes - I've done great things with the one described in "Programming Collective Intelligence" (http://shop.oreilly.com/product/9780596529321.do). Things like "googlebot", "certain IP range", "was observed clicking 100 links in < 1 minute", and "didn't have a referrer" all go into the hopper of observable "features", which are aggregated to produce a corpus of bot/non-bot server hits. Run it over a gig of log files to train it and you'll have a decent system which you can tune to favor false negatives vs. false positives.

I'm sure Facebook engineers could come up with something way more sophisticated.
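A minimal naive-Bayes sketch of the idea above, assuming each request has already been reduced to a set of boolean features like "no_referrer" or "fast_clicks" (the features and training data here are made up for illustration):

```python
import math
from collections import defaultdict

class NaiveBayesBotFilter:
    """Toy naive-Bayes classifier over boolean request features."""

    def __init__(self):
        self.counts = {"bot": defaultdict(int), "human": defaultdict(int)}
        self.totals = {"bot": 0, "human": 0}

    def train(self, features, label):
        self.totals[label] += 1
        for f in features:
            self.counts[label][f] += 1

    def prob_bot(self, features):
        # Log-space scoring with add-one smoothing to avoid zero probabilities.
        scores = {}
        n = sum(self.totals.values())
        for label in ("bot", "human"):
            score = math.log((self.totals[label] + 1) / (n + 2))
            for f in features:
                p = (self.counts[label][f] + 1) / (self.totals[label] + 2)
                score += math.log(p)
            scores[label] = score
        # Normalize the two log scores back into a probability.
        m = max(scores.values())
        exp = {k: math.exp(v - m) for k, v in scores.items()}
        return exp["bot"] / (exp["bot"] + exp["human"])

clf = NaiveBayesBotFilter()
clf.train({"no_referrer", "fast_clicks"}, "bot")
clf.train({"no_referrer", "fast_clicks"}, "bot")
clf.train({"has_referrer", "ran_js"}, "human")
clf.train({"has_referrer"}, "human")
print(clf.prob_bot({"no_referrer", "fast_clicks"}) > 0.5)  # True
```

In practice you'd train on far more data and tune the decision threshold to trade false positives against false negatives, exactly as the comment above suggests.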

Well yeah, getting a training data set, and then having your algo not be poisoned, seems like it is a challenge all around!

The algo can't be poisoned, if I understand the term correctly, if you only train it manually against datasets you trust.

Each iteration of training that produces improved behavior can then be used to generate a new, more comprehensive training set. When I wrote a bayes engine to classify content, my first training set was based on a corpus that I produced entirely heuristically (not using bayes). I manually tuned it (about 500 pieces of content) and from that produced a new training set of about 5000 content items.

Eyeballing it for a couple of months initially, manually retraining periodically, spot checking daily as well as watching overall stats daily for any anomalous changes in observed bot/non bot ratios will work wonders.

I'm in another field than advertising, but less than 5% of transactions require special handling, yet they represent a significant investment in error handling, procedures, accounting, and reversal mechanisms. Even then we are still manually dealing with things that represent 0.00001% of activity. Accounting correctly is always a deep rabbit hole; it comes down to how much you plan to write off inside your margins. :) Thanks for writing up this inside perspective of running this type of business.

> How about we just make a "blacklist" of these known bots, look up every user agent, and compare against the blacklist? So now every single request to your site has to do a substring match against every single term in this list. Depending on your site's implementation, this is probably not trivial to do without taking some sort of performance hit.

Build a finite state machine which only accepts the terms in the blacklist. That should be a one-time operation.

Then feed each request into the FSM and see if you get a match. Execution time is linear in the length of the request, regardless of the number of terms in the blacklist.

A perl regular expression will test for a match in less than a microsecond. Other scripting languages have a similar speed.

There's no need to program like it's 1985 any more.

Hey, you show me a performance problem, I show you a solution. Preferably one involving finite state machines.

> So if you have any sort of business where people pay you per click

This. This is the problem -- pay-per-click is completely and utterly broken, and has been ever since people figured out how to use bots effectively. For one, PPC is simply not compatible with real-time reporting, but there are myriad other problems with it.

And honestly, I don't know if it's even really a problem worth trying to solve, since there are better models out there already. "CPA" ads [1] are by no means perfect, but they obviate almost all of the problems you mention simply by their definition: the advertiser does not pay unless whoever clicked on the ad does something "interesting". Usually, this can mean anything from signing up, to buying something, to simply generating a sales lead, but the important thing is that "interesting" is defined by the advertiser themselves. Not by, say, Facebook.

Full disclosure: I'm an engineer at a startup that does, among other things, CPA-based ad targeting.

[1] For those of us who don't work in ads, http://en.wikipedia.org/wiki/Cost_per_action has a decent explanation.

This sounds like a problem that should be addressed with machine learning and AI rather than the wack-a-mole style of matching strings.

Separate out and distinguish possible humans from possible bots using a set of known humans and known bots.

Assign a human likelihood probability to each user and pay out the click based on the probability.

E.g., this user is 90% likely human, therefore the cost of the 20-cent ad click is 18 cents.

This doesn't have to wait on Facebook, somebody could build this as a third-party app and then send refund requests to Facebook on clicks that are likely bots.
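The probability-weighted payout described above is trivial to express; a sketch (prices and probabilities are illustrative):

```python
# Bill an ad click in proportion to the estimated human likelihood,
# per the 90%-human / 20-cent / 18-cent example above.

def billable_amount(click_price_cents: int, p_human: float) -> int:
    """Scale the click price by the probability the click was human."""
    if not 0.0 <= p_human <= 1.0:
        raise ValueError("p_human must be a probability")
    return round(click_price_cents * p_human)

print(billable_amount(20, 0.90))  # 18
```

The hard part, of course, is producing a trustworthy `p_human` in the first place; the billing arithmetic is the easy 1%.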

I don't agree with this at all, user-agents...really?

Go create a really, really smart bot and you will instantly see the limitations bots have. It's pretty trivial, with enough data and honeypots, to separate them from actual people.

Would it be useful to have a kind of open source database of "trusted" bots, just to get past that hurdle? I do realize there are many more issues apart from declared user agents.

Would "white listing" the dozen or so top agents that make up the vast majority of traffic work better?

You'd want to white-list both based on user-agent string and IP / CIDR origin.

Keeping those lists maintained would be a bit of fairly constant effort, unless you could come up with a self-training mechanism, or a self-validating mechanism.

Killing Google's ability to crawl your site is pretty much as bad or worse than blocking bots. Though you should also be able to ID bad guys by noting which IPs the spoofed user-agents are coming from to, say, some honey-pot links specifically disallowed for that user-agent in your robots.txt file.

In the general case it's important to let google crawl your site, but facebook would probably be fine without it.

Those are exactly the user agent strings the bot would spoof.

One quick way to detect bots is to look for a GET request for the robots.txt file.

Sort of explained already, but "bad" bots would never go look for that file. And a good bot probably already identifies itself in the request, so there's no need to look through robots.txt.

That's it. A "bad bot" would not check robots.txt, but a legitimate bot would. So looking for software that doesn't check robots.txt, combined with user-agent matching for good bots, would give you a good matching ratio.
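The heuristic in this exchange could be sketched like so, assuming a single-process, in-memory view of traffic (a real system would use shared state and expiry):

```python
# Flag clients whose User-Agent claims to be a well-known crawler
# but that never fetched /robots.txt from this IP.

KNOWN_CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp")
fetched_robots = set()   # IPs that have requested /robots.txt

def record_request(ip: str, path: str, user_agent: str) -> bool:
    """Return True if this request looks like a crawler impersonator."""
    if path == "/robots.txt":
        fetched_robots.add(ip)
        return False
    claims_crawler = any(t in user_agent.lower() for t in KNOWN_CRAWLER_TOKENS)
    return claims_crawler and ip not in fetched_robots

record_request("66.249.1.1", "/robots.txt", "Googlebot/2.1")
print(record_request("66.249.1.1", "/page", "Googlebot/2.1"))  # False: it checked robots.txt
print(record_request("10.0.0.5", "/page", "Googlebot/2.1"))    # True: never did
```

Note this only catches impersonators of good bots; it says nothing about bots faking a plain browser UA, which is why it has to be layered with the other checks in this thread.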

Could you just add a mouse-over event to each ad that sets an isHuman flag?

The bot could just fire the mouseover event.

That would break screenreaders, some touchscreen mobile devices, and probably other things I'm not thinking of.

IIRC, years ago, Google changed one parameter of its AdSense click URL based on the number of times the user hovered over the ad.
