The Problem With Client-Side Analytics

lmkg · on Oct 21, 2011

...wut?

That's ridiculous. What evidence is there that there are groups of nefarious hackers out there spoofing analytics data on people's websites? I don't think there is a need for this solution because the problem doesn't exist. If I wanted to mess with someone's websites, there are much better ways than injecting some false data into their Google Analytics.

lamnk · on Oct 21, 2011

It's a problem of information asymmetry in the online (display) advertising industry, not a problem with hackers. Because advertisers don't know how many pageviews/visitors a website has, advertising agencies often have to make purchasing decisions based on numbers from Comscore, Quantcast, Google Analytics etc. Clearly, if the website owner can spoof his analytic data, he can sell his inventory at higher rates.

mpclark · on Oct 21, 2011

...but not for long, because his CTR will be down on the floor.

This feels to me like a clever solution for a problem that doesn't exist.

lamnk · on Oct 21, 2011

Sure, that's why I wrote "online (display) advertising", which often has its cost measured in CPM, not in CPC, so CTR doesn't matter. This shows advantages of CPC ads over traditional display advertising and partially explains the success of Google.

I do not have technical ability to judge whether his solution can really solve this problem, however the problem does exist, though not widespread, and who can solve it will have a great chance to earn a lot of money. AFAIK advertising agencies pretty much trust the data from the like of Comsore. It's not that they don't know the data is not really accurate, it's that they don't have anything else. How much does Comscore sell its packages? I don't know, but I'm sure it's not cheap and online advertisers/marketers still have to buy them.

earlyresort · on Oct 21, 2011

While I haven't seen any real, meaningful efforts at spoofing analytics data (I vaguely recall some grad student project), I've certainly heard of companies spoofing their own analytics data to appear bigger than they are to interested parties.

More than one unscrupulous publisher has gamed comScore and Neilsen to pump up their reach figures and get access to more attractive advertisers.

Of course, scammers aren't going to use this service. There's a market selling verification services to advertisers, but there's a ton of companies in this area already - from my limited vantage point, Double Verify looks like the market leader.

storborg · on Oct 21, 2011

I've often seen a different, but related effect of client side analytics where content thieves will "accidentally" spoof analytics data by simply copying a site verbatim, with the analytics tag included.

In this case it's usually relatively easy to filter out, because the analytics host will identify the fake requests as coming from pages served on a different domain. However, it is annoying, and a combination of content theft plus hijacked DNS could result in more sinister influences.

angelbob · on Oct 21, 2011

You don't think that, say, the same people who resent ads being shown and turn them off might think it was hilarious to send back bad analytics data to companies who are quietly profiting from same?

The same people, who, say, use a housemate's phone number for their Safeway card so that Safeway can't determine their shopping habits?

If there was an easy way to do it, many people would want to.

It's clear that there could be an easy way to do it if you put, like, two good hours of work into it.

earlyresort · on Oct 21, 2011

I co-founded and ran Pinch Media, a mobile application analytics company. We operated independently for around two years before selling. During those two years, I believe we got more bad PR than a typical analytics company. I certainly got my fair share of anonymous hate email.

During that time, we received exactly two easily-filterable attempts to spoof analytics traffic. Historically, anyway, this doesn't seem to be a real problem.

angelbob · on Oct 21, 2011

Fair enough. You're much more confident than I would be about knowing.

I work in the Analytics group at Ooyala, a video platform company -- among other things we handle all the analytics traffic for all videos on ESPN.com and its various subsidiaries. We get huge amounts of traffic, weird data mangling constantly, and a wide variety of bad responses -- and that's only the stuff that gets past and mostly checksums.

It would take a pretty significant spoofing attempt for us to even know.

earlyresort · on Oct 21, 2011

Nice - nothing but respect for Ooyala.

You're right - it's possible we got thousands of attempts to spoof traffic that we automatically dumped for being malformed. Like you, we also got a ton of weird data mangling and bad responses which we just ignored. Most were bad implementations or issues with the phones themselves, but some could've been spoofing attempts.

It's also possible that smaller attempts to spoof traffic went by absolutely undetected by both us and our clients. Like you, it'd take a pretty significant spoofing attempt for us to notice. That said, what's the point of an insignificant spoofing attempt?

angelbob · on Oct 21, 2011

The point of an insignificant spoofing attempt, if it isn't just ideological (and ignorable), would be to try to convince more people to do it (i.e. install the GreaseMonkey script).

Put it this way - say Richard Stallman suddenly decided that companies collecting analytics data and profiting by it (Ooyala doesn't sell it exactly, but we profit by it) was a bad thing and people needed to install a GreaseMonkey script, analogous to an ad blocker, that sent back bad data (wrong URLs, repeats, garbage, etc). That would be an individually-insignificant spoof which was potentially nasty in aggregate.

If designed well, it would also look an awful lot like a high level of background noise, but otherwise condition normal.

eps · on Oct 21, 2011

What evidence is there that there are groups of nefarious hackers intercepting my shopping session at ToysRUs online store? Why the hell should I be spending my hard earned milliamps on this SSL thing? Certainly if someone wanted to mess with me, they would just whack me on the head in the dark corner of the street.

ceejayoz · on Oct 21, 2011

Intercepting a shopping session gets you a credit card number. Spoofing analytics adds a pageview to a stat you don't even get to see.

huhtenberg · on Oct 21, 2011

You are missing the point. The probability of an exploit being ultra-low or hard to mount is not a valid justification for ignoring it altogether.

ceejayoz · on Oct 22, 2011

No, but it being a rather useless exploit is.

ismarc · on Oct 21, 2011

Our company has its own internal analytics system and, while their approach could technically work to prevent spoofing, there's other, simpler ways. The first is simple deduplication of received events. This will carve out a large portion of invalid requests, particularly if you have thresholds of time for how frequently a received event is considered valid. The second is to calculate the quartiles and outliers. This allows you to remove all but the most sophisticated spoofing and is good practice to remove ill-behaved browsers and filter out things like malware detection tools that duplicate browser requests if they haven't seen the site before. There's many operations you can do to determine the validity of data received, however who knows how much of this is actually done by analytics providers. We've built our own internal analytics system (and expose it to customers) because existing solutions weren't robust enough for our needs. The biggest lesson has been that trying to get higher than about 98% accuracy on delivered events actually lowered the accuracy of events and using calculations on the backend was more reliable, but requires specific knowledge of the type of events.

tptacek · on Oct 21, 2011

First, that's not a "digital signature", it's a MAC. It's the secret-suffix SHA1 MAC, to be precise.

Second, the secret-suffix SHA1 MAC isn't secure. Its insecurity is the reason we have HMAC.

This seems to me to be the kind of thing you'd want to get right if the whole value proposition of your solution was "verifying URLs with cryptography".

badclient · on Oct 21, 2011

1. Considering most client-side analytics are based on IP address, you will require a large number of IPs.

2. It should not be terribly hard to filter out known open proxies or sessions with a specific nefarious pattern.

Overall, I think this post addresses a problem that doesn't quite exist yet; and if/when it does, it can be addresses in many ways.

angelbob · on Oct 21, 2011

You wouldn't spoof it with an open proxy. This wouldn't be organized crime trying to screw up your A/B testing. This would be individual consumer advocates and reactionaries with a GreaseMonkey script that intentionally sent back wrong numbers or dupes.

And they'd be doing it because of a principle like "these companies don't tell us that they gather and make money off this consumer data. If they won't admit it up front, let's just not give the data to them."

Go ahead, tell me that won't happen at least a few times in the next 5-10 years.

webjunkie · on Oct 21, 2011

I also don't think this guy actually implemented the spoof he talks about..

bluesmoon · on Oct 21, 2011

We noticed this problem at Yahoo! (I worked on the web performance analytics). Approximately 2% (note, that's 2% of 200 million daily) of our beacons were "fake". Now there are two reasons for fake beacons.

1. (Most common) many small sites seem to really like the design of various Yahoo! pages, so they copy the code verbatim, and change the content, but they leave the beaconing code in there, so you end up with fake beacons.

2. (Less common) individuals trying to break the system. We would see various patterns including XSS attempts in the beacon variables, and also in the user agent string. We'd see absurd values (eg: load time of 1 week, or 20ms or -3s, or bandwidth of 4Tbps).

It's completely possible to stop all fake requests, provided you have control over the web servers that serve pages as well as the servers that receive beacons. It's costly though. Requiring you to not just sign part of the request, but also add a nonce to ensure that the request came from a server you control (avoid replays). Also throw in rate limiting for added effect (hey, if you're random sampling, then randomly dropping beacons works in your favour ;)).

It doesn't stop there though, post processing and statistical analysis of the data can take you further.

It gets harder when you're a service provider providing an analytics service to customers where you do not have access or control over their web servers.

At my new startup (lognormal.com) we try to mitigate the effect of fake beacons the best that we can.

posabsolute · on Oct 21, 2011

Well.. I can see that a problem for 0.5% of business's... maybe... I think he is over thinking this, most business do not need that kind of protection

There are better ways to "hack" a company that spoofing their websites analytic lol, people that got that large number of ips have better (worst) things to do than that..

Also how the f would you know they are ab testing something..

mdda · on Oct 21, 2011

Rather than signing requests for the (largish) javascript file (which would benefit most from being cached), it would make more sense for the signed-timestamp key to be passed as one parameter via the image grab. Or am I missing something?

nodirection · on Oct 21, 2011

Totally off topic but there is a bug in the PHP code example:

echo "<script src=\"http://example.com/analytics.js?ts=$ts&r=$r&ds=$ts\&...;

should be:

echo "<script src=\"http://example.com/analytics.js?ts=$ts&r=$r&ds=$ds\&...;

skeltoac · on Oct 22, 2011

In before solution waiting for a... oh, too late. It is a problem. However, signing resources means no HTTP caching of the most expensive resource we generate. That is not practical where I work. Guess the cache can be programmed to do the signing.

There are trade-offs just like every other CAPTCHA-class problem out there. Isn't that what you are after: an automated human detector?

ROFISH · on Oct 21, 2011

This is a solution looking for a problem.

I know my Google Analytics aren't 100% correct, but I don't think people are spoofing them. The differences lie more in people who click through faster than GA can load (which can be easily possible on those still on 56k), or have "privacy blockers" in their ad block to remove GA altogether.

youngtaff · on Oct 21, 2011

One of the problems with client side analytics is they don't give you the whole picture i.e 4xx and 5xx errors are missing from them