

The Problem With Client-Side Analytics - blahpro
http://spider.io/blog/2011/10/the-problem-with-client-side-analytics/

======
lmkg
...wut?

That's ridiculous. What evidence is there that there are groups of nefarious
hackers out there spoofing analytics data on people's websites? I don't think
there is a need for this solution because the problem doesn't exist. If I
wanted to mess with someone's websites, there are much better ways than
injecting some false data into their Google Analytics.

~~~
lamnk
It's a problem of information asymmetry in the online (display) advertising
industry, not a problem with hackers. Because advertisers don't know how many
pageviews/visitors a website has, advertising agencies often have to make
purchasing decisions based on numbers from Comscore, Quantcast, Google
Analytics, etc. Clearly, if the website owner can spoof his analytics data, he
can sell his inventory at higher rates.

~~~
mpclark
...but not for long, because his CTR will be down on the floor.

This feels to me like a clever solution for a problem that doesn't exist.

~~~
lamnk
Sure, that's why I wrote "online (display) advertising", which often has its
cost measured in CPM, not in CPC, so CTR doesn't matter. This shows the
advantages of CPC ads over traditional display advertising and partially
explains the success of Google.

I don't have the technical ability to judge whether his solution can really
solve this problem, but the problem does exist, though it's not widespread,
and whoever can solve it has a great chance to earn a lot of money. AFAIK
advertising agencies pretty much trust the data from the likes of Comscore.
It's not that they don't know the data is inaccurate; it's that they don't
have anything else. How much does Comscore charge for its packages? I don't
know, but I'm sure it's not cheap, and online advertisers/marketers still
have to buy them.

------
ismarc
Our company has its own internal analytics system and, while their approach
could technically work to prevent spoofing, there are other, simpler ways.
The first is simple deduplication of received events. This carves out a
large portion of invalid requests, particularly if you set time thresholds
for how frequently a received event is considered valid. The second is to
calculate the quartiles and outliers. This removes all but the most
sophisticated spoofing, and it's good practice anyway for weeding out
ill-behaved browsers and filtering out things like malware-detection tools
that duplicate browser requests if they haven't seen the site before. There
are many operations you can run to determine the validity of received data,
though who knows how much of this analytics providers actually do.

We built our own internal analytics system (and expose it to customers)
because existing solutions weren't robust enough for our needs. The biggest
lesson has been that trying to get above about 98% accuracy on delivered
events actually lowered the accuracy of events; doing the calculations on
the backend was more reliable, but it requires specific knowledge of the
type of events.
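
To make those two filters concrete, here's a rough sketch (not their actual
system; the event fields and thresholds are made up for illustration),
assuming each received event carries a client identifier, a timestamp and a
numeric measurement:

<?php
// Sketch only: two common validity filters for received analytics events.
// Assumes events look like ['client' => string, 'ts' => int, 'value' => float].

// 1. Deduplication with a time threshold: count an event from the same
//    client at most once per $windowSeconds.
function deduplicate(array $events, $windowSeconds = 30) {
    $lastSeen = array();
    $kept = array();
    foreach ($events as $e) {
        $c = $e['client'];
        if (!isset($lastSeen[$c]) || $e['ts'] - $lastSeen[$c] >= $windowSeconds) {
            $kept[] = $e;
            $lastSeen[$c] = $e['ts'];
        }
    }
    return $kept;
}

// 2. Quartile/outlier filtering: drop events whose value falls outside
//    [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
function removeOutliers(array $events) {
    $values = array_map(function ($e) { return $e['value']; }, $events);
    sort($values);
    $n = count($values);
    $quartile = function ($p) use ($values, $n) {
        $idx = $p * ($n - 1);
        $lo = (int) floor($idx);
        $hi = (int) ceil($idx);
        return $values[$lo] + ($values[$hi] - $values[$lo]) * ($idx - $lo);
    };
    $q1 = $quartile(0.25);
    $q3 = $quartile(0.75);
    $iqr = $q3 - $q1;
    return array_values(array_filter($events, function ($e) use ($q1, $q3, $iqr) {
        return $e['value'] >= $q1 - 1.5 * $iqr && $e['value'] <= $q3 + 1.5 * $iqr;
    }));
}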

------
tptacek
First, that's not a "digital signature", it's a MAC. It's the secret-suffix
SHA1 MAC, to be precise.

Second, the secret-suffix SHA1 MAC isn't secure. Its insecurity is the reason
we have HMAC.

This seems to me to be the kind of thing you'd want to get right if the whole
value proposition of your solution was "verifying URLs with cryptography".
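
For reference, a minimal sketch of the distinction in PHP (the variable and
parameter names here are illustrative, not taken from the post):

<?php
$secret      = 'shared-secret-key';        // illustrative placeholder
$data        = 'ts=1318000000&r=12345';    // URL parameters being authenticated
$receivedTag = isset($_GET['ds']) ? $_GET['ds'] : '';

// Secret-suffix SHA1 "MAC": sha1(message . secret). This is the weak
// construction referred to above, not a recommended way to authenticate data.
$weakTag = sha1($data . $secret);

// HMAC-SHA1, the standard keyed-hash MAC, via PHP's built-in hash_hmac().
$tag = hash_hmac('sha1', $data, $secret);

// Verification should also use a constant-time comparison.
$ok = hash_equals($tag, $receivedTag);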

------
badclient
1. Considering most client-side analytics are based on IP address, you
would need a large number of IPs.

2. It should not be terribly hard to filter out known open proxies or
sessions with a specific nefarious pattern.

Overall, I think this post addresses a problem that doesn't quite exist
yet; and if/when it does, it can be addressed in many ways.
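
For what it's worth, a toy sketch of the open-proxy filter in point 2 (the
file name and event fields are made up; assume $events is the list of
received hits):

<?php
// Drop hits whose source IP appears on a list of known open proxies.
$openProxies = array_flip(file('known_open_proxies.txt',
    FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));

$cleanEvents = array_values(array_filter($events, function ($e) use ($openProxies) {
    return !isset($openProxies[$e['ip']]);
}));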

~~~
angelbob
You wouldn't spoof it with an open proxy. This wouldn't be organized crime
trying to screw up your A/B testing. This would be individual consumer
advocates and reactionaries with a GreaseMonkey script that intentionally sent
back wrong numbers or dupes.

And they'd be doing it because of a principle like "these companies don't tell
us that they gather and make money off this consumer data. If they won't admit
it up front, let's just not give the data to them."

Go ahead, tell me that won't happen at least a few times in the next 5-10
years.

------
bluesmoon
We noticed this problem at Yahoo! (I worked on the web performance analytics).
Approximately 2% (note, that's 2% of 200 million daily) of our beacons were
"fake". Now there are two reasons for fake beacons.

1. (Most common) many small sites seem to really like the design of various
Yahoo! pages, so they copy the code verbatim and change the content, but
they leave the beaconing code in there, so you end up with fake beacons.

2. (Less common) individuals trying to break the system. We would see
various patterns, including XSS attempts in the beacon variables and in the
user agent string. We'd also see absurd values (e.g. a load time of 1 week,
or 20ms, or -3s, or bandwidth of 4Tbps).

It's completely possible to stop all fake requests, provided you have
control over the web servers that serve pages as well as the servers that
receive beacons. It's costly, though: you have to not just sign part of the
request, but also add a nonce to ensure that the request came from a server
you control (to avoid replays). Also throw in rate limiting for added effect
(hey, if you're random sampling, then randomly dropping beacons works in
your favour ;)).
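
A rough sketch of that flow, assuming the page servers and beacon servers
share a secret and something like Redis (via phpredis) as the nonce store;
the names are illustrative, not Yahoo!'s actual setup:

<?php
// Page server: issue a nonce and an HMAC over (timestamp, nonce).
$secret = getenv('BEACON_SECRET');               // shared with the beacon servers
$ts     = time();
$nonce  = bin2hex(openssl_random_pseudo_bytes(16));
$sig    = hash_hmac('sha256', $ts . ':' . $nonce, $secret);
// Embed ts, nonce and sig in the beacon URL emitted with the page.

// Beacon server: verify the signature, freshness, and that the nonce is unused.
function acceptBeacon($ts, $nonce, $sig, $secret, $redis) {
    if (!hash_equals(hash_hmac('sha256', $ts . ':' . $nonce, $secret), $sig)) {
        return false;                             // tampered or forged
    }
    if (abs(time() - (int) $ts) > 300) {
        return false;                             // stale request
    }
    // Atomically claim the nonce; a second use is a replay and gets rejected.
    if (!$redis->set('nonce:' . $nonce, 1, array('nx', 'ex' => 600))) {
        return false;
    }
    return true;                                  // rate limiting / sampling can go here
}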

It doesn't stop there, though: post-processing and statistical analysis of
the data can take you further.

It gets harder when you're a service provider offering analytics to
customers, where you do not have access to or control over their web
servers.

At my new startup (lognormal.com) we try to mitigate the effect of fake
beacons the best that we can.

------
posabsolute
Well... I can see that being a problem for 0.5% of businesses... maybe... I
think he is overthinking this; most businesses do not need that kind of
protection.

There are better ways to "hack" a company than spoofing their website
analytics, lol. People who have that large a number of IPs have better
(worse) things to do than that.

Also, how the f would you know they are A/B testing something?

------
mdda
Rather than signing requests for the (largish) JavaScript file (which would
benefit most from being cached), it would make more sense for the
signed-timestamp key to be passed as a parameter on the image grab. Or am I
missing something?
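
A sketch of that variant (assuming ds is the signature parameter, as in the
snippet quoted below, and a hypothetical beacon.gif endpoint): the script
URL stays static and cacheable, and the signed values ride along on the
image request instead:

<?php
// Static, cacheable script tag: no per-request query string.
echo "<script src=\"http://example.com/analytics.js\"></script>";

// Per-request signed values travel on the 1x1 image beacon instead.
$secret = 'shared-secret-key';                    // illustrative placeholder
$ts = time();
$r  = mt_rand();
$ds = hash_hmac('sha1', $ts . ':' . $r, $secret); // signature over timestamp + random value
echo "<img src=\"http://example.com/beacon.gif?ts=$ts&r=$r&ds=$ds\" width=\"1\" height=\"1\">";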

------
krisneuharth
Totally off topic but there is a bug in the PHP code example:

echo "<script
src=\"[http://example.com/analytics.js?ts=$ts&r=$r&ds=$ts\&...](http://example.com/analytics.js?ts=$ts&r=$r&ds=$ts\\></script>);

should be:

echo "<script
src=\"[http://example.com/analytics.js?ts=$ts&r=$r&ds=$ds\&...](http://example.com/analytics.js?ts=$ts&r=$r&ds=$ds\\></script>);

------
skeltoac
In before solution waiting for a... oh, too late. It _is_ a problem. However,
signing resources means no HTTP caching of the most expensive resource we
generate. That is not practical where I work. I guess the cache could be
programmed to do the signing.

There are trade-offs just like every other CAPTCHA-class problem out there.
Isn't that what you are after: an automated human detector?

------
ROFISH
This is a solution looking for a problem.

I know my Google Analytics aren't 100% correct, but I don't think people
are spoofing them. The differences come more from people who click through
faster than GA can load (which can easily happen for those still on 56k),
or who have "privacy blockers" in their ad blocker that remove GA
altogether.

------
youngtaff
One of the problems with client-side analytics is that they don't give you
the whole picture, e.g. 4xx and 5xx errors are missing from them.

