While we were testing Facebook ads, we were also trying to get Facebook to let us change our name, because we're not Limited Pressing anymore. We contacted them on many occasions about this. Finally, we got a call from someone at Facebook. They said they would allow us to change our name. NICE! But only if we agreed to spend $2000 or more in advertising a month. That's correct. Facebook was holding our name hostage. So we did what any good hardcore kids would do. We cursed that piece of shit out! Damn we were so pissed. We still are. This is why we need to delete this page and move away from Facebook. They're scumbags and we just don't have the patience for scumbags.
Thanks to everyone who has supported this page and liked our posts. We really appreciate it. If you'd like to follow us on Twitter, where we don't get shaken down, you can do so here: http://twitter.com/limitedrun
That's actually an interesting proposal: 100% of the time when I click on an ad, I didn't mean to, and I try to close it before it has a chance to annoy me.
Probably doesn't make sense on a large scale, but it's something to consider.
Someone's grandmother mis-clicking and then mashing back is about as useful to the advertiser as a bot. Conversions are where it's at. Poor-quality traffic, bots, it's all the same: the advertiser quits paying.
They reinvented the access log? By hand? Are people really that detached from how servers work?
Anyway, it's pretty irrelevant whether they added logging in their app (one line of code?) or enabled their webserver's built in logging.
Since the user is on facebook they'd have facebook unblocked, but noscript still blocking everything else.
How would you discern a noscript user from a bot?
For example, target something at Reddit.com users and you'll see 90%+ block ads.
With Facebook's granular targeting you could very specifically end up with a target segment with a large proportion of NoScript-type plugin users. Without more details on their campaign I would not be so quick to place the blame purely on bots.
There are a few tricks to do that, described on the old ha.ckers.org weblog, or perhaps on the related sla.ckers.org forum. No idea if they still work, though...
You could always see if you could trigger NoScript's built-in XSS or frame hijacking protection :) Bots will not notice, and users get a scary warning message and believe your ad is trying to hack them. Not good for your conversion rates, but a pretty sure way to tell them apart.
Or they might have indeed reinvented it. Kids these days and their Heroku! Back in my day we'd process each HTTP request by hand. And we'd do it uphill both ways in the snow.
Most of the parsing required for the logfile analysis I describe is written in awk, and goes through roughly 1 million lines of logfile in about 75s on a commodity 2 GHz Intel Xeon.
Pre-caching of host lookups is done via xargs and a suitably high '-j' value. Running on a pre-sorted and de-duped list of IPs, this takes another few minutes. Lookups are read into hash tables for faster processing in the main awk parser (avoids system calls, DNS latency, and especially negative result caching failures).
It's fast and simple. Not a 100% solution, but quick to add bits to over time as new needs arise.
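For anyone who wants to play along at home, here's a rough Python equivalent of the pipeline described above. The log format (common log format) and the shape of the precached hostname table are assumptions on my part, since the actual awk script isn't shown:

```python
from collections import Counter

# Sketch of the pipeline above: count hits per client IP from
# common-log-format lines, then annotate counts with a precached
# IP -> hostname table (avoids per-line DNS latency, like the
# xargs pre-caching step described).

def hits_per_ip(log_lines):
    """Count requests per client IP (first whitespace-separated field)."""
    counts = Counter()
    for line in log_lines:
        ip = line.split(None, 1)[0]
        counts[ip] += 1
    return counts

def annotate(counts, hostname_cache):
    """Join hit counts against the precached hostname lookup table."""
    return {ip: (hostname_cache.get(ip, "unresolved"), n)
            for ip, n in counts.items()}

log = [
    '203.0.113.5 - - [30/Jul/2012:10:00:00] "GET / HTTP/1.1" 200 512',
    '203.0.113.5 - - [30/Jul/2012:10:00:01] "GET /a HTTP/1.1" 200 128',
    '198.51.100.7 - - [30/Jul/2012:10:00:02] "GET / HTTP/1.1" 200 512',
]
cache = {"203.0.113.5": "crawler-5.example.net"}  # hypothetical lookup table
print(annotate(hits_per_ip(log), cache))
```

Not as terse as the awk, but the structure (parse, count, join against a pre-resolved table) is the same.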
It was the proper addendum to the previous comment: kids today don't learn awk, because they have Perl, Ruby, and Python, which are supersets of the Unix command-line tool set. Awk was king in the years before Perl.
I never really learned to use cut, so I still use awk for one-liners that deal with columns in delimited-data text files.
Depends how far they went in profiling their visitors. The article clearly says they were tracking whether JS was enabled, so yes, you'd need to roll something or other by hand. GA relies solely on JS, for example.
And then compare in the log.
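The comparison itself is trivial once you have both logs. A sketch (the request IDs and field names here are made up; the idea is just to tag every served click server-side, fire a JS beacon with the same ID, and diff the two sets):

```python
# Estimate the fraction of clicks that never executed JavaScript by
# diffing server-side click logs against JS beacon logs. A click with
# no matching beacon either had JS disabled or came from a bot.

def no_js_fraction(click_ids, beacon_ids):
    """Fraction of logged clicks with no matching JS beacon."""
    clicks = set(click_ids)
    if not clicks:
        return 0.0
    return len(clicks - set(beacon_ids)) / len(clicks)

clicks = ["c1", "c2", "c3", "c4", "c5"]  # hypothetical server-side click log
beacons = ["c2"]                         # only one click ever ran JavaScript
print(no_js_fraction(clicks, beacons))   # 0.8, i.e. 80% looked bot-like
```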
I am not familiar with how things work on Facebook, but I would think that Facebook users do not have access to the logs that Facebook's servers write.
And so it seems that Facebook could be proven to be the new Groupon.
I would absolutely not be surprised if the hits were FB bots; however, I'd expect they wouldn't be so directly stupid as to have an ad-bot net in-house. Surely this is some service they outsource for plausible deniability.
Something they would have learned from their relationship handing all user data over to the NSA/CIA.
I'd like to make a prediction on how this could potentially play out:
1. FB ignores them. They have nothing to do with this issue and they know the bots are tainting traffic, but they can't/won't do anything about it because it boosts their revenues.
2. FB may or may not be involved in the ad-bots, but they will drop their $ contingency on ad revenue for the name change and pay these guys to shut them up. Their stock was improperly priced and has taken a beating. They don't want this coming to light in a big way, as it would have a negative effect on the perception of their business model. They have to be careful of the Streisand effect here too...
3. Many more people will reveal the same results with empirical testing and it will come out that FB earnings are overinflated by 80%. Their stock will drop to the predicted real value of <$10... maybe hit that actual projected number of $7.00. Zynga will BEG FB to purchase them, as they are now a penny stock and can't sustain themselves.
Also given FB users' aversion to ads this whole issue could be explained by users just clicking the back button. Ad supported sites have trained users to do three things very well: install adblock, habitually ignore ads, quickly click away from any interstitial or unexpected ads like on Hulu.
Another note on the "back button": I recently discovered the awesome power of the three-finger-swipe back gesture on the MacBook trackpad. So users can "click" back even faster than before. I can, and even before the swipe I was well-trained in the art of "Cmd + back arrow" :-)
Also, FB may not have much of a choice but to comply with those agencies, or they'd find themselves pissing off a lot of high-level people by refusing to comply with government agencies. But outsourcing or tolerating a network of bots to vastly boost revenue is probably considered fraud, no?
It's not what people brag about, it's the social graph itself that's useful. All the other things like photo-tagging are extra candy.
More info at http://www.wired.com/threatlevel/2012/03/ff_nsadatacenter/al... and http://nplusonemag.com/leave-your-cellphone-at-home . Sorry, it's late out here, or I'd copy/paste the relevant quotes for you. Look for the bits about data mining and machine learning, especially in the second article: everything is a signal, they have the computing power to process it all and apply Bayesian belief nets, and algorithms can pull intelligence about you from the local shape of your social graph that you don't even realize can be inferred from it (humans are not very good at reading complex info from big graphs much beyond simple "friend-of-a-friend" relationships).
And that they are doing this using experience from working with the NSA/CIA in order to cover up the fact that $800 million dollars of Facebook's yearly profits will disappear overnight once people realise the truth. And as such they are willing to pay out the ad-bot companies in order to hide this conspiracy.
I do not know if it applies to FB, but those who have, like, 100k bots have some power too.
http://imgur.com/Vvkbx (screenshot in Russian)
Edit2: proper link
What about the farking article? Where they proved that 80% of their clicks were from bots. Then several others confirmed they have the same experience?
What about those facts?
I also speculated on what FB is doing. Them giving user data to the NSA/CIA - that is a supposed fact, sure, but I am not the only person who believes this.
Am I extremely distrustful of Facebook? Of course! Maybe you have not followed their track record or the character of their founder.
> I also speculated on what FB is doing. Them giving user data to the NSA/CIA - that is a supposed fact, sure, but I am not the only person who believes this.
Great argument! A lot of people believe it, so it must be true! Are you for real?!
And the backdoors in Cisco's equipment that are a requirement by the federal government?
Maybe you think this shit is "conspiracy theory," and if you do, you're simply a naive fool.
Look around you at what governments are doing. The NSA has records on everything you do online. They had a system in 2005 which could trace communications between users to 6 degrees, automatically.
I know a couple of guys like you in my local activist community, who take a very hostile "I know the truth and you're all fools!" attitude, despite the fact that their audience is mostly very sympathetic to most of their assertions. We know spying on the Internet takes place; HN is full of cantankerous old Internet geeks who've seen, first-hand, plenty of examples of the state (whatever state you may choose, as it happens all over the world) behaving unethically on the Internet, spying on people and punishing people for things that shouldn't be crimes.
But, your paranoid approach is counter-productive. You might as well be working for the people you claim to be afraid of, for all the good you do (negative good; you're convincing people that the folks who believe the government is spying are all paranoid nutjobs who scream at anyone who has the gall to mention other possible explanations).
So, let's review:
1. Just because spying has taken place, and is currently taking place, and may even have the complicity of facebook in that spying, it does not mean that facebook is running a botnet to steal advertiser dollars. The simplest explanation is that facebook looks the other way while others run the botnets. facebook wins (a lot, as long as most advertisers don't know it's happening), botnet owner wins (a little), and the advertiser loses. But, there are other plausible explanations, including incompetence.
2. When you paint things as "Either you accept my theory in its entirety, or you're all idiots," you force people to choose a side. Nobody wants to be on the same side as an asshole, so you force them to choose the other side. You make people who may even agree with you (to a greater or lesser degree) begin to formulate plausible reasons why you're wrong about the crazier stuff you're spouting...further convincing themselves that you're entirely wrong. The best you can hope for is that people ignore you and don't have the chance to be inoculated against your ideas; having you as their first exposure to these concepts guarantees they will be less likely to believe them in the future, even if they come from a more credible source. Humans are funny creatures.
Thus, I would point out that there are some really naive and stupid people on HN that have zero grasp of effective argument, persuasion, and even basic logic.
You might be well-served by reading about non-violent communication: http://en.wikipedia.org/wiki/Nonviolent_Communication
Edit: Removed the word "schizophrenic" as it was an insensitive use of the term, and was counter-productive to making my point.
In my defense, schizophrenia runs in my family, and I'm very familiar with it...I don't think of it as an insult. But that's a local custom in my family that I shouldn't think follows in the rest of the world.
Really, now I am a paranoid schizophrenic?
Just because I make claims which are readily confirmed and were completely available in the media - even the EFF filed suit on the AT&T events...
Yet, for some reason, it is my responsibility to educate everyone every single time someone new comes along who hasn't been following these things closely.
Also, it's a strawman to focus on my tone rather than my content. You're trying really hard to be overly pedantic and, frankly, an asshole.
Thanks for the link on communication.
You obviously have a vested interest in seeing your "message" be received, so stop ignoring what people are telling you about your tone and change it. Figure out what it is that people do listen to, or else you have no one but yourself to blame when they take issue with your tone.
To wit: Your tone is defensive. Pejoratives and cursing undermine your expression. What you are effectively telling people is that your thoughts are not important enough to merit self-restraint. Detailed explanations following assertions are also helpful.
FWIW, I work in the performance marketing industry and, IMO, the most complicity facebook could be said to have in this issue is in not placing a sufficiently high priority on preventing bots from clicking on their ads. Even if it is a sizable project there, they represent such a large target that the difficulty of the job becomes much greater. I see no salacious story here, other than a company found itself unable to optimize a campaign into profitability, which I think says more about them than facebook. Ho hum, find a different traffic source and move on.
However, I will point out that your take on FB's complicity being limited by the daunting nature of the problem does not take into account their other actions, like the $24K-a-year ransom on the page name.
Everyone can argue in any direction they want - but neither me nor anyone else is really going to know until we get further down this path...
When I was talking about China hacking Lockheed as far back as 2005, everyone said I was nuts! (I freaking worked at Lockheed!)
There needs to be a better way to log and track this stuff so that we can point people to some sort of Tyranny Wiki.
I use NoScript. When I enable scripts for a visit, it's usually by enabling the domain itself and leaving third-party scripts disabled. Depending on your area this could be significant.
There's also the possibility of scripts being blocked by other elements not loading: for example, if the browser can't parallelise the requests for some reason and a preceding request can't be handled, the page might display well before the script is loaded, especially if it's loaded at the bottom of the markup.
$ curl -s http://pastebin.com/raw.php?i=HHeY5nP1 | md5sum
The background: a lot of the web you see today is automated, which is to say that 'the web' can exert influence in the 'real' world, and influencing the web can be amplified (or leveraged, as traders would say) with automation.
In the second example we have an ambitious entrepreneur who wants to get the word out on their new thing. They create over time a few hundred thousand 'fake' twitter accounts, they construct a community of fake accounts following each other. Then they write a simple program to have their 'influential' tweeters push out the word and their 'followers' pick it up and retweet it. If they have let the accounts lay there for a bit they pick up their own share of robot followers (from other people doing this) and our ambitious entrepreneur creates what looks like a real groundswell of interest in what ever thing they are trying to push.
There are many rewards to automating web activities, for many different reasons. As a service provider, it's a full-time job trying to ensure I can identify the fake ones (I don't want to send ads to robots; they don't buy anything and they make me look bad to my advertising channels). That 80% of the 'clicks' this person got on Facebook fail a simple 'real user' test does not surprise me.
I expect there to be a rise of landing pages which themselves require you to click somewhere else, with clients only paying out when a prospect clicks all the way through to some part of the web site. I know that some big names already do this: they bring up a page asking you to identify your location (which helps them bring you to the part of the site for your area), but they also don't pay for the click unless you make it through that page.
This is exactly what HBGary built. But it wasn't ads they wanted and it wasn't for money. They built sock puppet management apps for the DoD to influence public sentiment about various things the government wanted... except they got caught.
Generally I refer to any program which is using APIs intended for humans in order to achieve a result for the person who ran it, a 'bot'. Comes from folks who build scripted clients for computer games.
The more plausible reality is that there are just a lot of bots crawling FB fan pages, poorly programmed bots that will just follow every single link they come across. That's very much in line with my personal experience of FB fan pages.
But FB is certainly to blame for not detecting these as bots, and putting ads on the pages they serve them. Hopefully they'll take care of this issue soon.
This is very, very bad.
Yes, we don't need a conspiracy to explain stupidity. The stupidity is the conspiracy.
People aren't ignorant because they are handicapped or something. They are ignorant because that's the behaviour and excuse we reward. Just be an idiot, don't put in the intellectual effort, and you will be allowed to speak lies and screw everyone and everything.
From a corporate point of view, it's the golden ticket. Plausible deniability. We didn't know that our insurance helped finance weapon exports to dictators. We didn't know. We didn't know. Sorry this, sorry that. We're keeping the money, though.
I don't think that provable intent should have legal meaning, because all neglect is intentional, as is ignorance in a world where everyone has the choice to be well informed.
If I am writing a mechanism to charge for ads based on traffic, one of the first things to QA/write test cases for is fraudulent traffic.
The leap here is from Facebook being remotely competent (which is likely) to willfully ignoring checks on certain types of traffic because it benefits them financially.
My assertion is that this is the same class of bad behavior as running your own bots. Note the distinction in 'willful' ignorance.
Much like the multiple account issue, it's beneficial for Facebook to turn a blind eye. That is, unless it escalates into a PR issue.
If there is a bot, it's likely one of their competitors, though even that seems unlikely. One must be logged into FB to see ads, and I'm positive Facebook has some click fraud detection. They'd almost have to notice a number of accounts clicking the same ads over and over. I guess maybe if Facebook's CAPTCHA is weak enough and a bot was creating an account, setting up the proper interests to see the ad, clicking it, then repeating... Seems unlikely though.
Also none of this explains why they're taking down their Facebook page. One does not need to buy ads to have a Facebook page. You can even have it automatically update from Twitter. They could leave it up there and get free publicity from it and simply cancel their ads.
This whole thing seems like a sham for attention to me.
I think it pretty clearly explains why they're taking down their Facebook page. Just because you don't agree with them doesn't mean it's not explained. They're shutting down their Facebook page because (according to them) it's sitting around with outdated branding but they can't change the name unless they spend $2000/mo. Also because their user experience with Facebook has been crappy, and they don't want to have anything to do with them anymore. You know. Principle.
Could be a sham for attention, but the article pretty clearly explains why they're shutting the page down.
- Okay, you want to detect bots. Well, "good" bots usually have a user agent string like, "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)." So, let's just block those.
- Wait, what are "those?" There is no normalized way of a user agent saying, "if this flag is set, I'm a bot." You literally would have to substring match on user agent.
- Okay, let's just substring match on the word "bot." Wait, then you miss user agents like, "FeedBurner/1.0 (http://www.FeedBurner.com)." Obviously some sort of FeedBurner bot, but it doesn't have "bot" or "crawler" or "spider" or any other term in there.
- How about we just make a "blacklist" of these known bots, look up every user agent, and compare against the blacklist? So now every single request to your site has to do a substring match against every single term in this list. Depending on your site's implementation, this is probably not trivial to do without taking some sort of performance hit.
- Also, you haven't even addressed the fact that user agents are specified by the clients, so it's trivial to make a bot that identifies itself as "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1." No blacklist is going to catch that guy.
- This is smarter than just matching substrings, but this means you may not catch bots until after the fact. So if you have any sort of business where people pay you per click, and they expect those clicks not to be bots, then you need some way to say, "okay, I think I sent you 100 clicks, but let me check if they were all legit, so don't take this number as holy until 24 hours have passed." This is one of the reasons why products like Google AdWords don't have real-time reporting.
I'm seriously only scratching the surface here. That being said, I'm not saying, "this is a hard problem, cut Facebook some slack." If they're indeed letting in this volume of non-legit traffic, for a company with their resources, there is pretty much no excuse.
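To make the substring-blacklist step above concrete, here's a minimal sketch. The blacklist terms are just examples; compiling them into a single regex amortizes the per-request matching cost, but the spoofed-UA problem remains exactly as described:

```python
import re

# Naive UA-blacklist bot detection, and why it leaks: substring matching
# catches self-identified bots but not a bot spoofing a browser UA.

BLACKLIST = ["bot", "crawler", "spider", "feedburner"]  # example terms only
_pattern = re.compile("|".join(map(re.escape, BLACKLIST)), re.IGNORECASE)

def looks_like_bot(user_agent):
    """True if any blacklist term appears anywhere in the UA string."""
    return bool(_pattern.search(user_agent))

# Self-identified bots are caught:
assert looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
assert looks_like_bot("FeedBurner/1.0 (http://www.FeedBurner.com)")
# The spoofed UA sails right through; no blacklist catches it:
assert not looks_like_bot("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1")
```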
Even if you don't have the talent to preemptively flag and invalidate bot traffic, you can still invest in the resources to have a good customer experience and someone that can pick up a phone and say, "yeah, please don't worry about those 50,000 clicks it looks like we sent you, it's going to take us awhile but we'll make sure you don't have to pay that and we'll do everything we can to prevent this from happening again." In my opinion this is Facebook's critical mistake. You can have infallible technology, or you can have a decent customer service experience. Not having either, unfortunately, leads to experiences exactly like what the OP had.
Too often I've heard "how could XYZ not have done this, fixed that?" Then have that same party sign on board to fix this "easy" problem and get themselves in a world of hurt.
* Accounts whose activity consists of an unusually high proportion of ad clicking compared to their other activities are likely bots.
* New accounts with no history but high ad clicks are likely bots.
Before a click is counted, check the source account for its human rank (activity-to-clicks). If it's below a certain threshold (tuned over time), then don't bill your customer for it.
The algos to determine human rank can start fairly simple and become more complicated and accurate over time.
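A starting point really could be this simple. The field names and the threshold below are made up for illustration; the point is just the activity-to-clicks ratio gating the bill:

```python
# Sketch of the "human rank" idea: ratio of non-ad activity to ad clicks,
# with clicks billed only for accounts above a tunable threshold.

def human_rank(account):
    clicks = account["ad_clicks"]
    if clicks == 0:
        return float("inf")  # no ad clicks, nothing to bill anyway
    return account["other_actions"] / clicks

def billable(account, threshold=5.0):
    """Only bill the advertiser for clicks from plausibly human accounts."""
    return human_rank(account) >= threshold

human = {"other_actions": 400, "ad_clicks": 3}  # lots of normal activity
bot = {"other_actions": 2, "ad_clicks": 90}     # no history, mostly ad clicks
print(billable(human), billable(bot))  # True False
```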
In terms of their current sophistication I would definitely call this an easy problem from a tech standpoint. Scaling to millions of concurrent users? Hard. Attracting and keeping millions of users? Hard.
Then they make some bundled crapware browser plugin that gets installed by legit facebook users that will click ads at that interval + some random time. Possibly do the ad clicks in background so that the user does not know.
Take a minute to think about the effort a bot account has to go to in order to even come close to gaming a system that tracks their behaviour.
Let's assume that in order to be deemed human you need at least a few characteristics like "account older than 1 day", "friends from other non-suspicious accounts", "a photo or at least comments on a friends photos", and a "certain amount of click activity".
Now let's take that criteria and make it difficult for a bot to even know if they're being flagged as a bot where the only indication is a reduced bill at the end of the day / week / month to the advertiser. So bot makers would have to pay to test and would at best receive delayed feedback on their success.
Isn't this starting to seem extremely difficult, and generally unscalable from a bot perspective? It doesn't have to be impossible, there just needs to be lower hanging fruit, of which there are plenty should FB implement a system like this.
And should this fairly simple solution still have a few holes that a very elite few bot coders are able to game, you've still got tons of data and time to use to tune your algos.
It just doesn't make sense that one of their advertisers is being charged for 80% bot clicks, until you start to question Facebook's motivations.
The bots will not be able to get accurate statistics without paying, but they might be able to get "good enough" statistics. For example, is the system moving in a direction we like.
If you do need to pay, a good bot herder probably has enough stolen credit cards to use.
Yes, there are certainly mitigation techniques. But like other spam/bot related problems it becomes an arms race.
You only need 1 good bot programmer to release his bot online for thousands of budding script kiddies to take advantage of.
So basically we've narrowed down the potential bot makers to someone who:
1) Has a way to hijack thousands of Facebook users' browsers, and is more interested in spiking some company's ad buys than other profitable activities they could be doing.
2) Has access to stolen credit cards so they can spend the money required to debug and test their bot's humanity against the defensive algorithms. Spending money in order to indirectly make money, because some company is paying them to spike a competitor's ad buys. Obviously this is going to cut into their margins, so the amount they spend should be less than they're being paid.
3) Has the patience and dedication to analyze the bills when they come in to determine how much of their activity is considered botlike and then tweak their behaviour accordingly.
4) Thousands of budding script kiddies are also being paid by companies to spike their competitors' ad buys.
5) Rides a unicorn?
Don't forget there also need to be companies willing to pay someone to make these bots who don't mind how incredibly difficult it would be to measure their success. In fact, if there are companies willing to pay for this, then you're going to have a whole bunch of scammers who simply claim they're able to do it that the companies will have to filter out.
We're incredibly far into the realm of unprofitability and edge-casery don't you think? An alien spaceship is going to land in my backyard before this happens.
You are forgetting that there are growing numbers of programmers in developing nations for whom doing this sort of thing can be an attractive way to earn a living (or at least they are willing to try it). Look on any "hire a freelancer" type sites and you will find hundreds of postings for tasks that sound awfully like scamware development (usually at a couple of dollars an hour).
Browser hijacking is not so uncommon either, I've installed legit software in the past that has come bundled with some crapware that injects extra adverts into web-pages and other things.
The analysis would not be too difficult either: simply set a "click rate" and adjust it downwards until you are being charged > $0. Does Facebook offer any method to get free trial credits? If so, this could be abused, heavily.
The economics are roughly the same as with spam, if you do enough of it your "hit rate" doesn't matter so much.
Real human social activity is a lot more random than real informational content in blogs - so fb's problem seems harder.
Also, Facebook banning accounts is a much stronger action than Google just lowering the PageRank of suspected splogs.
So overall I think FB's problem is harder than a problem Google itself hasn't yet solved.
The detection part aside, it's a simple fix. I just wonder why FB hasn't done it already. (Is it because it'd hurt their revenue when they need it the most?)
Now of course I'm simplifying this to a nearly absurd degree, but I do think there are reasons for Facebook to want to ignore these issues and that even a minimal step up in bot-prevention could go a long way when it comes to their advertisers.
But on the other hand, since valid human behavior includes much more senseless, random, and repetitive actions (compared to blogging), a well-done bot would be harder to detect.
Like I said, I absolutely believe there will be bots that are incredibly sophisticated that might be difficult to detect, but I still think more could be done to stop the less sophisticated ones.
However, even if you assume it is a hard problem, that's all the more reason they should have listened (and responded!) when a company identified this problem with some hard data to back it up.
Sure, it's a big-data problem, but I can imagine that Facebook has solved these types of scenarios many times over.
Moreover, Amazon is always buying new IP subnets.
Large numbers of real-looking fake accounts should be hard to maintain.
DeviceAtlas and WURFL are both products that I have used that attempt to have complete coverage of all UA strings in use in the wild, even the ones that attempt to spoof legitimate UA strings.
While their focus is on device detection and the properties of these devices for mobile content adaptation, knowing which connections are from bots is just as important if you are going to alter what content you serve up based on UA.
These products actively seek out new UAs and add them to their rule sets in a way that I can only assume will be more accurate over time than a home-grown solution.
FWIW I know that FB use at least one of these products for their device detection, so if you want to match what FB thinks (knows!?) this is where you would start.
This definitely takes non-trivial effort, but it's also definitely doable if you have a serious interest in perpetrating click fraud. While Facebook hypothetically could fight this form of click fraud, it'd likely be difficult and the method would be fragile/easily circumvented. Like most things, PPC advertising can only function in a world primarily populated with positive actors.
1. Make the split uneven. For example for every dollar advertiser pays, only 30 cents goes to the page owner who received the "Likes"
2. Lobby to make these kinds of activity illegal. They are after all spamming activities
3. These kinds of fake-account farms will usually have very few connections to the outside world. That is, accounts that do not obey the "six degrees of separation" rule should be flagged for human review.
4. Once reviewed enlist law enforcement help to track down the owners of the all the accounts in these farms.
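Point 3 above boils down to a simple graph check: count how many friendship edges leave a candidate cluster, and flag clusters that are nearly disconnected from the rest of the graph. A sketch (the adjacency-list representation and the threshold are assumptions):

```python
# Flag account clusters with suspiciously few edges to the rest of the
# social graph, per point 3 above. Real farms would need fuzzier
# thresholds, but the structure of the check is the same.

def external_edges(graph, cluster):
    """Count friendship edges from cluster members to accounts outside it."""
    members = set(cluster)
    return sum(1 for a in members
                 for b in graph.get(a, ())
                 if b not in members)

def is_suspicious_farm(graph, cluster, min_external=2):
    return external_edges(graph, cluster) < min_external

# A tight farm of fake accounts that only friend each other:
graph = {
    "f1": ["f2", "f3"], "f2": ["f1", "f3"], "f3": ["f1", "f2"],
    "alice": ["bob", "f1"], "bob": ["alice"],
}
farm = ["f1", "f2", "f3"]
print(is_suspicious_farm(graph, farm))  # True: zero edges leave the farm
```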
Painstaking work, but I guess the point is that advantage can lie with Facebook as opposed to the spammer in this game. I am not sure about the ethical implications though.
Maybe this is why Google wants to succeed badly in the social space : extra spam fighting abilities.
I couldn't find any resources on how Facebook handles this.
Ignore this if FB already does it.
What's the grey area look like? What's the chance of falsely identifying a real user as a fake one?
Rhetorical questions. Not sure FB could even answer them accurately.
Instead of that, we'll watch for patterns of use in which high volumes of traffic come from unrecognized non-end-user network space.
"Unrecognized" means that it's not one of your usual search vendors: Google, Yahoo, Bing, Baidu, Yandex, Bender, AddThis, Facebook, etc.
"Non-end-user network space" means a netblock which is clearly not a residential or commercial/business networking provider.
For this you'll mostly want to eyeball, though when we saw very high volumes of search coming at us from ovh.net (a French low-cost hosting provider), it was pretty obvious what the problem was.
Using an IP to ASN/CIDR lookup (such as the Routeviews project: http://www.routeviews.org/) and a local ASN -> name lookup. One source of this is http://bgp.potaroo.net/cidr/autnums.html
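The lookup itself is just longest-prefix matching against a local CIDR-to-AS-name table built from data like the above. A sketch using the standard library (the prefixes and network names below are made-up stand-ins for Routeviews/potaroo data):

```python
import ipaddress

# Map a client IP to an origin network via longest-prefix match over a
# local CIDR -> AS-name table, to spot "non-end-user network space".

PREFIXES = [
    (ipaddress.ip_network("198.51.100.0/24"), "EXAMPLE-HOSTING (low-cost hosting)"),
    (ipaddress.ip_network("198.51.0.0/16"), "EXAMPLE-TRANSIT"),
    (ipaddress.ip_network("203.0.113.0/24"), "EXAMPLE-RESIDENTIAL-ISP"),
]

def origin(ip):
    """Return the AS name for the most specific prefix containing ip."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net, name in PREFIXES:
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, name)
    return best[1] if best else "unknown"

print(origin("198.51.100.9"))  # longest match wins: the /24 hosting block
```

High search volume resolving to a hosting provider instead of a residential ISP is exactly the ovh.net pattern described above.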
It's not perfect for a variety of reasons, but I found it to be a useful input for a similar bot detection problem. That said, this was a long time ago so I'd need to re-run some experiments to see if the hypothesis remained valid.
For those not familiar with OS fingerprinting, it's a method of looking at the details of a connection and determining which operating system is most likely to have created packets with those options and flags. In BSD, it's built into pf.
Just that if your bot runs from, say, Windows 7 and it spoofs an IE8 user-agent header, or even runs by directly automating IE itself, how do you detect it then? Neither of those scenarios is unlikely at all.
I can't speak to Java classes, but they'd need to be dealing with raw packets which isn't an option in a typical stack.
As in your case, it's nearly impossible to detect click fraud in real time as the clicks stream in. We batch-processed the clicks throughout the day and did daily billing settlement at the end of the day. It's easier to detect click fraud over a period of time. Obvious bots, like the search engine crawlers, are filtered out.
For click fraud, it's impossible to tell whether a one-off click is a genuine user click or a fraudulent one, so we tried to catch repeat offenders. We looked for duplicate clicks (same IP, same agent, other client attributes) against certain targets (same product, product categories, merchants, etc.) over various time windows (1 minute, 5 minutes, 30 minutes, hours, etc.). We also built historic click stats on the targets so we could check the current click pattern against historic stats (yesterday, a week ago, a month ago, a year ago).
We ran quick checks every 1 and 5 minutes to detect unusual click patterns. These were mainly for detecting DOS attacks against our site or against certain products. We actually had a realtime reaction mechanism to deal with them: slow down the requests, throw up a CAPTCHA, redirect to a generic page, or block the IP at the routers.
Every couple of hours, we batch-processed the clicks so far and ran the exhaustive checks to flag suspicious clicks. At the end of the day, definite fraud clicks were thrown out, and moderately suspicious clicks were grouped and reviewed by humans. Flagging the less suspicious clicks is a fine balance, since there's money involved.
The most sophisticated click fraud I've seen involved a distributed botnet clicking from hundreds of IPs all over the world, doing slow clicks throughout the day. We found out because it was targeting one merchant's product category and causing unusual click stats compared to the historic pattern.
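The duplicate-click grouping described above might look something like this. A batch-style Python sketch; the field names, window, and threshold are illustrative, not the original system's values:

```python
from collections import defaultdict

def find_duplicate_clicks(clicks, window_secs=300, threshold=3):
    """Group clicks by (ip, user_agent, target) and flag any group
    with more than `threshold` clicks inside a `window_secs` window.
    `clicks` is a list of dicts with 'ip', 'agent', 'target', and a
    numeric timestamp 'ts' (seconds)."""
    groups = defaultdict(list)
    for c in clicks:
        groups[(c["ip"], c["agent"], c["target"])].append(c["ts"])

    suspicious = []
    for key, times in groups.items():
        times.sort()
        for i in range(len(times)):
            # Count clicks in the window starting at times[i].
            j = i
            while j < len(times) and times[j] - times[i] <= window_secs:
                j += 1
            if j - i > threshold:
                suspicious.append(key)
                break
    return suspicious
```

A real pipeline would run this over hours of logs and then compare the flagged targets against the historic per-target stats the comment describes, rather than billing on raw counts.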
What you are getting at, however, is very important to ad-driven internet business models. You make the empathetic deduction that if you are there for one purpose, you are impervious to advertising.
Now, if you are still with me after that statement, then there are larger implications of online advertising than are actually being measured. For instance, if the intent of the view mattered, would it be worth more? I think Google has answered that question rather clearly.
Measuring "why someone is visiting a URL" (and thus generating an ad-view) is a larger technical problem than could be accomplished with some fancy algorithms. That is, unless the service knew all of your browsing history...
In any case, you raise a good point about the intent of the page-view and how that should be "credited" as advertising. To do this however is to challenge existing 'fundamental' theories.
Also, there is "brand marketing" and "performance marketing." I'm not sure they have classic definitions, but generally brand marketing is based on "mindshare" and tough to quantify. This is basically any TV commercial -- Budweiser doesn't expect you to jump off your couch the second you see a Bud Light commercial, but maybe a week later when you're in the supermarket, you spot that new brand of beer and recall the commercial with Jay-Z rapping and grinding with some models, and think, "eh, maybe I'll try Bud Light Platinum, I will purchase this and consume it later because I like the idea of rapping with models," even at a subconscious level.
Performance marketing is all about ROI. You spend $100 marketing dollars, and you want to make at least $X back. Typically $X = 100 in this example, although in some cases you may want to adjust profit margins or even run at negative profit (for example, if you want to maximize revenue even at an unprofitable rate). Almost any advertiser on the internet has a performance marketing mentality, mainly because it's much easier to actually measure performance. This is why advertisers like the OP are frustrated with Facebook, because paying for bot traffic makes it extremely hard to hit their performance marketing goals.
You are right, though, that this is a huge gray area. To give two extremes: let's say you run an ecommerce site but you'd also like to monetize with ads, so you reach a deal with a shopping comparison site to display their products in your pages' footer. If you buy AdWords to your site, and those visitors from AdWords scroll down and click on the shopping comparison ads, the comparison site is probably fine with paying you for that traffic, although you are probably paying something like $2/click yourself. If instead you make a page with JUST the shopping comparison ads, and buy traffic from someone who basically infects computers with trojans and overwrites their Google search results to point at your site (say at $0.01 per click), then you'll get clicks on the shopping comparison ads and make lots of money. Except the shopping comparison site will realize these clicks don't lead to any actual sales for their merchants (it's extremely poor quality traffic, since the users who end up there have basically no intent) and will kick you off. Somewhere between "first page Google AdWords" traffic and "trojan/botnet" traffic lies basically every other form of internet advertising, and just how gray that area is, is typically defined entirely by your business partners.
Advertising and marketing are generally very intentionally tailored. You're not trying to reach any human, you're trying to reach a human in your target demographic.
Skewing results (and wasting resources) by paying people who are entirely outside your marketing effort to click through links (or take other similar resource-costing / data-skewing actions) is not the desired result.
See, the essence here is not to come up with a great algorithm; this is not a programming contest. It's to create a system where everyone (publisher, advertiser, ad network, and customer) wins.
In this very early stage, the success of Facebook's Ad network should be measured only by one metric - ROI of Advertisers. And that number has to be consistently better than Google's.
Why's that? This isn't a zero sum game. You don't advertise only on Google or only on Facebook; as much as your budget allows, you advertise on every channel you can achieve a positive ROI on.
For any advertiser looking for a direct response, like a signup or purchase, I doubt anything Facebook can do would make the ROI better than advertising on search. No amount of demographic and interest targeting will make advertising to people as effective as advertising to intent. You can pay Google to send you people while they're in the process of looking for your product; Facebook doesn't do that.
A combination of tech expertise and marketing savvy are needed to reduce this problem. More analytic tools are available for Google Adwords than are available for Facebook ads.
Rather than trying to explicitly filter bot traffic, they should try to quantify the value of particular "likes". For instance, a user who demonstrably clicked through and purchased something is much more valuable than one who "likes" 10,000 companies and has no friends, whether or not they're a bot.
And heck; a bot that has thousands of followers who all buy lots of things is valuable too. Maybe someone created a particularly effective meme-generator bot, and people who follow that bot are likely to buy ironic T-shirts. Whether or not a user is a bot isn't all that relevant; it's whether that user's activity correlates with you making sales.
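One way to sketch that "value of a like" idea in Python. All field names and the average order value are hypothetical:

```python
def like_value(user, avg_order_value=25.0):
    """Rough 'value of a like' score: weight by how often this user's
    likes have historically led to purchases, regardless of whether
    the account is a bot. Field names and numbers are hypothetical."""
    if user["total_likes"] == 0:
        return 0.0
    conversion_rate = user["purchases_after_like"] / user["total_likes"]
    return conversion_rate * avg_order_value
```

Under this scoring, the friendless account that likes 10,000 pages and never buys is worth roughly nothing, and whether it's technically a bot never has to be decided.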
???! Excuse me? Are you programmers? Efficient substring matching is a solved problem. How many entries in that blacklist are you looking at then? Can't be more than a few thousand to catch 99.9% of the ones where a blacklist would work.
> some sort of system that analyzes those clients and finds trends (for example, if they originate from a certain IP range)
> This is smarter than just matching substrings
If done right, a blacklist is easily, and definitely, faster than that.
No. It's smart to grab the 99.9% of bots with a blacklist of substrings, most importantly because it has a very very low false positive rate (unlike checking for JS support) because any human user that goes through the trouble of masking their UA as a known bot certainly knows to expect to get blocked here and there.
After that you can use more expensive checks, but at least you can bail out early on the "properly behaving" bots with a really cheap string-matching check. (Seriously, why do you think that's an expensive operation? Ever check how many MB/s GNU grep can process? That's just an example of how fast string matching can be, not a suggestion to use grep in your project.) Your blacklist is (relatively) fixed, so you can use a trie or a Bloom filter; it'll be faster than Apache parsing your HTTP headers.
I mean it's a good (and cheap) first pass to throw out quite a number of true positives.
I'm sure facebook engineers could come up with something way more sophisticated.
Each iteration of training that produces improved behavior can then be used to generate a new, more comprehensive training set. When I wrote a bayes engine to classify content, my first training set was based on a corpus that I produced entirely heuristically (not using bayes). I manually tuned it (about 500 pieces of content) and from that produced a new training set of about 5000 content items.
Eyeballing it for a couple of months initially, manually retraining periodically, spot checking daily as well as watching overall stats daily for any anomalous changes in observed bot/non bot ratios will work wonders.
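A toy version of such a Bayes engine, just to make the shape concrete. Whitespace tokens and add-one smoothing over two classes; nothing like a production classifier, and the training strings below are invented:

```python
import math
from collections import Counter

class TinyBayes:
    """Minimal two-class naive Bayes over whitespace tokens."""
    def __init__(self):
        self.counts = {"bot": Counter(), "human": Counter()}
        self.totals = {"bot": 0, "human": 0}

    def train(self, label, text):
        toks = text.lower().split()
        self.counts[label].update(toks)
        self.totals[label] += len(toks)

    def classify(self, text):
        scores = {}
        for label in self.counts:
            # Sum of log-probabilities with add-one smoothing.
            vocab = len(self.counts[label]) + 1
            s = 0.0
            for tok in text.lower().split():
                s += math.log((self.counts[label][tok] + 1) /
                              (self.totals[label] + vocab))
            scores[label] = s
        return max(scores, key=scores.get)
```

The retraining loop the comment describes then falls out naturally: run the classifier over fresh traffic, hand-correct a sample, feed the corrections back in with train(), and watch the bot/non-bot ratio for anomalies.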
Build a finite state machine which only accepts the terms in the blacklist. That should be a one-time operation.
Then feed each request into the FSM and see if you get a match. Execution time is linear in the length of the request, regardless of the number of terms in the blacklist.
There's no need to program like it's 1985 any more.
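For illustration, here's the compile-once/match-many shape in Python. Python's re engine isn't guaranteed linear-time like a true finite-state automaton (e.g. Aho-Corasick), but the idea of building the matcher once and feeding requests through it is the same; the blacklist entries are a tiny invented sample:

```python
import re

# Tiny illustrative blacklist; a real one would hold thousands of
# known-bot user-agent substrings.
BOT_SUBSTRINGS = ["googlebot", "bingbot", "yandexbot", "crawler", "spider"]

# Build the matcher once -- conceptually the FSM construction step.
_BOT_RE = re.compile("|".join(map(re.escape, BOT_SUBSTRINGS)))

def is_known_bot(user_agent):
    """Cheap first-pass check against the user-agent header."""
    return _BOT_RE.search(user_agent.lower()) is not None
```

For guaranteed linear scanning regardless of blacklist size, an Aho-Corasick library or a hand-built trie automaton would be the closer match to the FSM described above.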
This. This is the problem -- pay-per-click is completely and utterly broken, and has been ever since people figured out how to use bots effectively. For one, PPC is simply not compatible with real-time reporting, but there are myriad other problems with it.
And honestly, I don't know if it's even really a problem worth trying to solve, since there are better models out there already. "CPA" ads  are by no means perfect, but they obviate almost all of the problems you mention simply by their definition: the advertiser does not pay unless whoever clicked on the ad does something "interesting". Usually, this can mean anything from signing up, to buying something, to simply generating a sales lead, but the important thing is that "interesting" is defined by the advertiser themselves. Not by, say, Facebook.
Full disclosure: I'm an engineer at a startup that does, among other things, CPA-based ad targeting.
 For those of us who don't work in ads, http://en.wikipedia.org/wiki/Cost_per_action has a decent explanation.
Separate out and distinguish possible humans from possible bots using a set of known humans and known bots.
Assign a human likelihood probability to each user and pay out the click based on the probability.
For example: this user is 90% likely human, therefore the cost of the 20-cent ad click is 18 cents.
This doesn't have to wait on Facebook, somebody could build this as a third-party app and then send refund requests to Facebook on clicks that are likely bots.
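The proposal is simple enough to sketch (hypothetical helper; amounts in cents):

```python
def billed_amount(cpc_cents, human_probability):
    """Pay out a click in proportion to the estimated probability
    that the clicker is human -- a sketch of the proposal above."""
    if not 0.0 <= human_probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return cpc_cents * human_probability
```

All the hard work hides in producing human_probability, of course; the billing arithmetic is the easy part.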
Go create a really, really smart bot and you will instantly know the limitations bots have. With enough data and honeypots, it's pretty trivial to separate them from actual people.
Keeping those lists maintained would be a bit of fairly constant effort, unless you could come up with a self-training mechanism, or a self-validating mechanism.
Killing Google's ability to crawl your site is pretty much as bad as, or worse than, the bots you're trying to block. Though you should also be able to ID bad guys by noting which IPs send spoofed user-agents to, say, honeypot links specifically disallowed for that user-agent in your robots.txt file.
More technical info and data would be nice!
Maybe, if they have any employees who follow HN, someone could shed more light on the technicals?
This whole thing seems like one big linkbait/PR play. It's a really big claim from a company that doesn't even have an about us page, has only 500 Facebook likes, and basically has 0 traffic.
I'd like to know if they have tried other paid traffic sources, or have concrete evidence. Inexperienced advertisers tend to spend $50-$100, expecting instant results. And when it inevitably doesn't work out, they blame anyone and everyone.
The second point seems way exaggerated. Facebook isn't "holding your name hostage" and being "scumbags". You didn't sign up with the right name from the beginning, and now you're complaining. You could have told the ad rep that you'll do $2,000 budget at some point, just change the url now. You could have also taken the 5 minutes to sign up for a Facebook Page with email@example.com and get the page you want. With only 500 facebook likes, the switching cost is low.
 http://siteanalytics.compete.com/limitedrun.com/ https://www.facebook.com/limitedpressing
I don't agree with the second point. Facebook isn't "holding your name hostage" and being "scumbags". You didn't sign up with the right name from the beginning, and now you're complaining. Instead of taking 5 minutes to sign up for a Facebook Page with firstname.lastname@example.org and get the page you want, you decided to sling mud with no real evidence. With only 500 facebook likes, the switching cost is low.
Is this similar or related to other ad-space inflation? I'm reminded of Google (and others)(1) sending out free $75 AdWords coupons, which easily let bid costs inflate since it's not my own real money being spent (thereby increasing the percentage of real cash being made on the ads)?
(1)Edited and clarified
The relevance of such a comparison? It's not a one-off, nor as uncommon as we think. It has nothing to do with pointing fingers.
Didn't mean to come across as off topic. To me, if someone upstream and in the past was doing this, chances are it might happen later in time downstream too, and there might be something to see in the forest instead of pointing at one tree or another.
I've used Google coupons, but none from anyone else that I can remember. I also spent on Google ads way before any other system, and remember the 10-cent clicks going up to $1.00 for not much more traffic.
>80% of $FBs revenue is made from bots! OLOL
but what I actually think is
>They're just pissed their campaign didn't work out like they'd hoped
BUT I also think that it's equally likely that $FB would, if possible, defraud the clueless. How deep are [client]'s pockets? How literate are they in working to strict ROI goals?
I think this is probably a relatively new phenomenon. Facebook went public. Sales managers need to hit targets, they're thinking quarterly and they've got stock.
People know people and the rest is history.
If this has got legs then it's the perfect storm for facebook. :)
For the username, everyone (including users) is allowed one name change.
Presumably, Facebook can bypass these seemingly arbitrary limitations, but will essentially charge you for the right to do so (by requiring you to have purchased a certain amount of advertising with them).
I'm with you that FB must have some restrictions on name changes, but it does not seem reasonable to charge money for it.
Imagine a page called "Sun Microsystems - Java" from before the company was bought by Oracle, with a few thousand fans. I'm sure a high percentage of the fans would not care about a name change to "Oracle - Java", for example.
Well, apparently there is an exception to this rule. Obviously there is huge potential for abuse when changing the name of a page that has plenty of users. However, allowing only the "money making" pages to do so is a bit of an unfair move, in my opinion. Reasonable, but unfair.
I wonder if this situation will escalate and FB will publish some explanation. Nevertheless, it is still hard for me to believe that a company of such value would fake clicks at all, or at least in a way that is easy to detect.
Anyways, I don't mind since I'm a small fish, but the longer this goes on, the more it will (hopefully, I might say) blow up in their face(book).
After that, it's unclear.
- Their ad is so awful that almost no people click it and a few bots make up 80% because the total is so low.
To me, something is off, whether with FB or these guys.
"I can confirm this. I used Facebook to advertise my page and went from ~2000 fans to over 6000 within several months. Awesome, right?
I thought it was weird that the new fans would never, EVER interact with the pages. So I started stalking their profiles. Guess what?
Bots/hijacked accounts/fake accounts. How do I know? Many of them have NO friends. Then I noticed something really scary...repeats. Actual pictures showing up more than once for new likes.
Very, very few of the accounts were from the USA.
I started casually messaging the accounts that looked suspicious. Not a single one with no friends responded. Of the ones that looked legitimate (although with lists of likes in the thousands), only a handful replied, and 100% of those people confirmed they found my page through affiliated sites, and not through Facebook itself.
I ended the ad program recently and the bots stopped.
tl;dr: Do not, under any circumstance, advertise with Facebook. They will feed you fake likes. If you want fake likes, use a site like fiverr to generate them, it's way cheaper and yields identical results."
The user movement maps we ended up with were straight lines of randomly spaced dots with sudden acute corners. A human could not replicate the pattern with a ruler and a pen pad, and we certainly didn't expect 90% of our human users going for that client's sites to be sitting around with rulers and pen pads drawing lines of randomly spaced dots. We checked the testing code and ran a dozen tests, and we couldn't figure out how a human could replicate it on any browsing device. Our conclusion is that somebody bothered to program a bot that would replicate mouse movement and we accidentally broke it by blocking programmatic access to the page, so it couldn't find a link to click and went crazy.
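A rough sketch of how one might detect that "straight lines with sudden corners" pattern from recorded cursor points. The thresholds are guesses for illustration, not values from the system described above:

```python
import math

def turn_angles(points):
    """Angles (degrees) between consecutive movement segments."""
    angles = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        a1 = math.atan2(y1 - y0, x1 - x0)
        a2 = math.atan2(y2 - y1, x2 - x1)
        d = abs(a2 - a1)
        angles.append(math.degrees(min(d, 2 * math.pi - d)))
    return angles

def looks_scripted(points, straight_deg=2.0, min_straight_frac=0.9):
    """Flag tracks that are almost entirely dead-straight segments --
    the 'ruler and pen pad' pattern. Human mouse movement tends to
    curve and jitter continuously."""
    angles = turn_angles(points)
    if not angles:
        return False
    straight = sum(1 for a in angles if a < straight_deg)
    return straight / len(angles) >= min_straight_frac
```

Real humans produce noisy, gently curving tracks, so nearly all their turn angles are small-but-nonzero and irregular; a bot interpolating between two targets produces exactly-zero angles punctuated by one sharp corner.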
How do you tell if it's a bot? Just by user-agent string or repeated slamming by IP address, or is there some other technique?
A combination of things. Off the top of my head:
* IP address ranges. Are they all from, e.g., EC2 instances?
* Are images and other content (e.g. CSS) linked from the page loaded? How long does the client take to do so? Are other pages being viewed in line with what you've seen from other users?
* What combinations of user agents are you seeing?
* What's in the headers being sent by the browser? Any foreign language support, or oddities when it comes to not accepting compressed content? Is the referer header always the same? Does it always vary?
* Do you see "normal" user behaviour that indicates activity such as clicking the back button, then clicking the ad again a few seconds later once their brain has kicked in? (nb: not sure if this is possible for Facebook with the way their site works).
* Are there any patterns in the page access times? Is there anything to indicate a cron job is kicking off once an hour at the same time?
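Several of those signals could be combined into a crude score, along these lines. All the weights and field names are hypothetical; a real system would tune them against labeled traffic:

```python
def bot_score(req):
    """Combine several bot signals into a crude additive score.
    `req` is a dict of precomputed per-client observations."""
    score = 0
    if req.get("from_hosting_netblock"):     # e.g. EC2 address ranges
        score += 3
    if not req.get("loaded_css_and_images"): # never fetched page assets
        score += 2
    if not req.get("accept_language"):       # no language headers at all
        score += 1
    if req.get("fixed_interval_hits"):       # cron-like timing pattern
        score += 2
    return score

def probably_bot(req, threshold=4):
    return bot_score(req) >= threshold
```

No single signal is conclusive (plenty of humans sit behind odd proxies), which is why summing weak evidence and thresholding tends to beat any one check.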
Generally speaking, in addition to the patterns you mentioned (invalid user-agent strings, rapid succession of same-IP access) there are also other patterns, such as failure to load referenced CSS or image files, or 100% failure on Captcha challenges.
"100% failure on Captcha challenges" -- are captchas generally deployed on click-through ads?
Very much so; after all, your browser does it without you having to tell it to. It all depends on whether the hassle is worth it for the bot writer.
Use http://phantomjs.org/ and the page will be loaded by WebKit.
Cloudflare has a mode (used in general for DDOS protection) which you can turn on and get stats on how many users pass a Captcha test.
The point was mostly that if it is not downloaded, then it's a bot (or any non-browser getter, at least), not that it's not a bot if it does.
Or maybe you were asking for a way to avoid bot detection ? :)
Although they stated they built their own analytics software, so there's also a chance that it wasn't working as they thought it was.
"So we did what any good developers would do. We built a page logger. Any time a page was loaded, we'd keep track of it. You know what we found? The 80% of clicks we were paying for were from bots. That's correct. Bots were loading pages and driving up our advertising costs"
I think they use a different definition of bot. Otherwise they wouldn't have to use a second logger ...
Finally, working from logs, I have managed to find 5643 total requests for the URL. If I filter for those that explicitly sent a referer from facebook.com/ajax, I see 4644; of course, many people have browsers that do not send referers due to various reasons.
Therefore, I feel safe concluding that, for my website, a year ago, I was not seeing "80% of ad clicks from bots". I present this evidence to help people trying to understand this issue, as a possible clue that the behavior is either user-specific or has changed in the last year.
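The referer filtering described above amounts to something like this (assuming combined-log-format lines with the referer field quoted):

```python
def count_facebook_referers(log_lines):
    """Count total requests vs. requests whose referer mentions
    facebook.com/ajax, mirroring the log check described above.
    A crude substring test, not a full log parser."""
    total = 0
    from_fb = 0
    for line in log_lines:
        total += 1
        if "facebook.com/ajax" in line:
            from_fb += 1
    return total, from_fb
```

Run over the filtered access log, the two counts give exactly the 4644-of-5643 style ratio quoted in the comment, with the gap attributable to browsers that strip referers.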