
Shadow traffic: site visits that are not captured by typical analytics providers - ahstilde
https://blog.parse.ly/post/9616/shadow-traffic-why-your-traffic-numbers-are-off-by-20/
======
ChuckMcM
Okay, the cynic in me wants to write "New Age web designers are stumped by
lack of analytics while still refusing to look at their HTTP server log data."

I remember when the _ONLY_ analytics were those you could derive by analyzing
your HTTP logs. Which have useful information in them. Things like the source
IP address (which can be geo-tagged), a bunch of HTTP headers (which are full
of information too), and a timestamp which tells you when each request came in.
Not to mention session cookies, which take zero JavaScript to implement.
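
For a sense of how little is involved, here's a minimal sketch in TypeScript.
It assumes the stock nginx/Apache "combined" log format; a custom log format
needs a different regex:

```typescript
// Minimal sketch: pull the fields mentioned above out of a standard
// combined-format access log line. Assumes the default nginx/Apache
// "combined" format; adjust the regex for custom formats.
const COMBINED =
  /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+ "([^"]*)" "([^"]*)"$/;

interface LogHit {
  ip: string;        // source address, can be geo-tagged offline
  timestamp: string; // when the request came in
  method: string;
  path: string;
  status: number;
  referrer: string;  // where the visitor came from
  userAgent: string;
}

function parseLine(line: string): LogHit | null {
  const m = COMBINED.exec(line);
  if (!m) return null;
  return {
    ip: m[1],
    timestamp: m[2],
    method: m[3],
    path: m[4],
    status: Number(m[5]),
    referrer: m[6],
    userAgent: m[7],
  };
}

console.log(parseLine(
  '203.0.113.7 - - [10/Aug/2020:13:55:36 +0000] "GET /post/1 HTTP/1.1" 200 5123 "https://news.ycombinator.com/" "Mozilla/5.0"'
));
```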

I've been retooling my site slowly to _only_ use these analytics (less the
cookies) because I value people's privacy while browsing as much as my own.
During the transition I've been comparing what I can pull out of the logs vs
what Google Analytics gives me. Sure, Google can do wonders, especially if
the person is coming from a browser where they are logged into Google. But, as
the article points out, they miss everyone running NoScript and/or other
privacy enhancers like Privacy Badger from the EFF.

I don't feel like I'm going to miss Google's added insights.

~~~
neonate
This is pretty funny, buried in the middle of the article:

> Option 2 – Server-Side Tracking

~~~
LaundroMat
The article talks about sending first-party analytics events to an analytics
provider from your own servers. So the server-side tracking the article refers
to is similar to the server-side tagging Google recently announced[1], not the
analysis of server logs.

[1] [https://developers.google.com/tag-manager/serverside](https://developers.google.com/tag-manager/serverside)

~~~
neonate
Ok, thanks for the correction.

------
DevX101
I recognized this a couple years ago at a startup I work with. Comparing
Google Analytics numbers to validated event logs, the numbers were off by
~20-30%. Surely there must be a quick workaround, I thought; there's no way
there's an entire multi-billion-dollar industry of third-party analytics
software giving bogus numbers to websites?! But that's indeed the case. I
immediately made it a top priority to build out an in-house analytics platform
where event logs were sent via the API and thus didn't get blocked.
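
The core of such a collector is embarrassingly small. A hypothetical sketch
(the /events path and payload shape are illustrative, not the actual API
described above):

```typescript
// Hypothetical sketch of a first-party event collector: events POSTed to
// your own domain land in an append-only log, so ad blockers have nothing
// third-party to block. The /events path and payload are illustrative.
import { createServer } from "http";
import { appendFileSync } from "fs";

const server = createServer((req, res) => {
  if (req.method === "POST" && req.url === "/events") {
    let body = "";
    req.on("data", (chunk) => (body += chunk));
    req.on("end", () => {
      // One JSON record per line; validate the payload before trusting it
      // in production.
      appendFileSync("events.log", JSON.stringify({
        receivedAt: new Date().toISOString(),
        userAgent: req.headers["user-agent"] ?? "",
        event: body,
      }) + "\n");
      res.writeHead(204).end();
    });
    return;
  }
  res.writeHead(404).end();
});

server.listen(8080);
```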

And for those saying relative direction is all that matters, I guarantee you
the behavior of users with an ad blocker installed is very different from that
of users who can't be bothered or don't know how to install one.

~~~
sharkweek
Let me tell you what was fun: trying to explain to a former boss why Facebook
ads showed one number for the amount of traffic sent, Google Analytics showed
a different number for traffic from those ads, and then our server logs showed
an entirely different number!

------
dddddaviddddd
I had an article on the front page of Hacker News last year that had about
17,000 real visits, as determined by analysing my server log files. I was also
using Google Analytics at the time, which told me I had 10,000 visitors (of
which only 7 were using Firefox!).

Obviously there's a gap between what trackers say and reality, bigger for some
demographics than for others.

~~~
ta17711771
> of which only 7 were using Firefox

Were _reporting_ using Firefox.

Also, not surprising: Firefox security leaves a lot to be desired.

~~~
marcinzm
>Also, not surprising, Firefox security leaves a lot to be desired.

Huh? Blocking Google Analytics tracking is a positive, not a negative, where
security is concerned.

~~~
vlovich123
He's talking about user agent spoofing.

~~~
marcinzm
Still not sure how that ties into Firefox having worse security. User agent
spoofing is a privacy feature, and a positive one.

~~~
vlovich123
If Firefox users are disproportionately lying about the user agent for privacy,
then counts are off. This would impact first-party telemetry as well. If
Firefox users are disproportionately running ad blockers, then they'll be
undercounted (it's also entirely possible that Chrome and GA have some kind of
arrangement where, even if you're using an ad blocker, Google is able to
correct the GA data it shows you for Chrome users).

------
ThePhysicist
Absolute numbers tend to be overrated in analytics. Often relative numbers,
like the number of conversions per tracked user, matter more. Also, if your
product is targeted at privacy-savvy individuals like developers, who often
use blockers, you might be better off using server-side tracking. That seems
to have become a lost art, though, especially since many sites use CDNs that
hide a lot of visits for cacheable content.

~~~
thekyle
If you use CloudFront (AWS), access logging is built in: you just tell it
which S3 bucket to dump the logs into and you get the raw HTTP requests with
timestamps. I personally use a service called S3stat which takes those dumps
and turns them into pretty graphs.
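
Parsing those dumps yourself is also straightforward if you don't want a
service. A rough sketch, assuming the file is already unzipped; the column
order is read from the "#Fields:" header rather than hard-coded:

```typescript
// Rough sketch: count requests per URI from a CloudFront standard log
// file (tab-separated, with a "#Fields:" header naming the columns).
import { readFileSync } from "fs";

function countByUri(path: string): Map<string, number> {
  const lines = readFileSync(path, "utf8").split("\n");
  let fields: string[] = [];
  const counts = new Map<string, number>();

  for (const line of lines) {
    if (line.startsWith("#Fields:")) {
      // Learn the column layout from the header itself.
      fields = line.replace("#Fields:", "").trim().split(/\s+/);
      continue;
    }
    if (!line || line.startsWith("#")) continue;
    const cols = line.split("\t");
    const uri = cols[fields.indexOf("cs-uri-stem")];
    if (uri) counts.set(uri, (counts.get(uri) ?? 0) + 1);
  }
  return counts;
}

// Hypothetical file name; CloudFront names log objects per distribution.
console.log(countByUri("E2EXAMPLE.2020-08-19-12.abcd1234.log"));
```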

------
ghgr
The other side of the coin is the inflated traffic statistics that include
all kinds of bots and crawlers with spoofed user agents; for low-traffic
sites like niche personal blogs with a custom domain, these can account for
90+% of the server-side logged visits.

How proud 15-year-old me was with my first .com domain, getting over 100
visitors per day. Little did I know that the actual number of visitors was
much, much lower than that.

------
choeger
You should use server-side tracking.

There is no reason to design your website in a way that makes your legitimate
analysis use cases depend on client-side computation.

If server-side tracking looks too complex for you, you might want to
reevaluate the balance of technical knowledge in your enterprise.

~~~
XCSme
What if your site is an SPA? You would not know, for example, the time spent
on the site, which pages are visited, where exactly users leave, or whether
there are client-side errors, right?

~~~
acdha
If you're using an SPA, you need to build that instrumentation in to match the
native behaviour, along with robust error handling using something like
[https://github.com/getsentry/sentry/](https://github.com/getsentry/sentry/),
so you can tell when your code is broken client-side, where you would
otherwise have no visibility.

This is much less likely to be blocked if you self-host it — breaking requests
to your server will break the app, whereas blocking common cross-site tracking
services is popular because there are few drawbacks for the user.
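
As a sketch of what "matching the native behaviour" can look like in practice
(the /beacon and /errors endpoints are made-up names for whatever your own
server exposes, not any particular product's API):

```typescript
// Rough sketch of SPA instrumentation: emit a first-party pageview beacon
// on every client-side route change, and report uncaught errors to your
// own endpoint. Paths are illustrative.
function beacon(path: string, data: object): void {
  // sendBeacon is fire-and-forget and survives page unloads.
  navigator.sendBeacon(path, JSON.stringify(data));
}

// Treat history-API navigations as pageviews.
const origPushState = history.pushState.bind(history);
history.pushState = (state, title, url) => {
  origPushState(state, title, url);
  beacon("/beacon", {
    type: "pageview",
    url: url ? String(url) : location.pathname,
    at: Date.now(),
  });
};
window.addEventListener("popstate", () => {
  beacon("/beacon", { type: "pageview", url: location.pathname, at: Date.now() });
});

// Surface client-side errors you would otherwise never see.
window.addEventListener("error", (e: ErrorEvent) => {
  beacon("/errors", { message: e.message, source: e.filename, line: e.lineno });
});
```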

~~~
XCSme
You can self-host sentry?

~~~
acdha
Yes – it's pretty easy to run the open source app in your favorite container
runtime:

[https://hub.docker.com/_/sentry/](https://hub.docker.com/_/sentry/)

~~~
XCSme
Wow, I didn't know that. I remember using it at my last company; we kept
receiving quota warnings, and their higher plans were really expensive.

------
jklinger410
This is pretty much a Parse.ly ad. I think the HN crowd is pretty aware that
ad blockers, VPNs, etc. can break analytics.

~~~
acdha
I think you're too quick to dismiss it. It's one thing to know that the
phenomenon exists and another to recognize that it's somewhere between 20% and
40% of your total traffic, especially unprompted. I've had many conversations
where people assumed their traffic numbers were accurate until this point was
raised, at which point everyone metaphorically slapped their foreheads and
realized that they had forgotten to take it into account.

~~~
throwaway287391
When does this actually matter, though? Isn't growth (or shrinkage) what you'd
normally really care about (e.g. this month we had 10% more DAUs than last
month)? I suppose if you changed something to attract tons of new users who
disproportionately use ad blockers (for example), this becomes an issue, as it
wouldn't show up in your metrics, but is that sort of thing common?

I suppose if nothing else it's good to know, so you can immediately beef up
your numbers by 20% in your slide deck for VCs.

~~~
acdha
What if you're doing anything that doesn't involve logins (public information,
advertising, etc.), where users never trigger something like account creation?

What if you're trying to get stats about people who don't convert or otherwise
give you a signal that they're using the site?

I've run into sites where things like signup or checkout are gated behind an
analytics tracker (Adobe used to recommend running theirs in a synchronous,
navigation-blocking mode), which meant that any problem with that service was
completely invisible unless users contacted you to complain.

I also remember people wondering why Firefox users stopped using their site
when they shipped the release which enabled tracking protection by default.

~~~
throwaway287391
Good points, thanks!

------
billyhoffman
I get wanting to create valuable thought leadership content, but this is the
worst kind of bad product marketing:

1- present a new concept to readers (shadow visitors)

2- show how this concept is scary and bad for your business (your analytics
are off by 20%!!!)

3- present 2 options, which by the way are free, but immediately shit all over
them (Server logs! But those are hard and complicated! Edge logs! But getting
those is hard!)

4- present your company's product as option 3, which, surprisingly, has no
downsides and isn't shit upon

5- profit

What disingenuous garbage. You should be ashamed, Parse.ly.

The right way to do this is to do steps 1 and 2. Then show in detail, as step
3, how to solve the problem with easy options, ideally with free and open
software. It's OK to show edge cases, corner cases, or just sheer scale issues
that make these options challenging.

The difference is that good product marketing pieces show people how to solve
the problem and offer a way to do that at scale or in an automated/hosted way,
so the customer doesn't have to deal with it.

If your product marketing content's message is "you are screwed unless you buy
our product", you are doing it wrong.

~~~
MauranKilom
Yeah, this article just reads super weird to anyone with even a basic
understanding of the field. It's very clearly targeted at managerial folk, and
even then it is as shallow in its arguments as it is transparent in its
motives.

------
indymike
Server-side tracking is useful for logging HTTP requests. Client-side tracking
is useful for logging user interactions. There used to be only a small
difference between server and client numbers, due to caches and user
settings... but modern apps (e.g. React, Vue, Angular) often load only one
page, and then all interaction is managed by client-side code, so often
client-side tracking is the only thing that works.

------
gentleman11
Obvious question: how do you filter out bot traffic with server-side logs?
What percent of visitors are bots, anyway?

~~~
bleepblorp
Most legitimate bots identify themselves with specific user agent strings.

Script kiddie attack bots are generally fairly obvious as they hammer away at
things like /wp-login.php for days on end regardless of what error codes the
server returns.

Most other bots are pretty evident just by looking at access patterns. Just
identify their IPs and drop them from your analytics.
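
A sketch of those heuristics over parsed log entries; the patterns and the
threshold are illustrative starting points, not a complete bot list:

```typescript
// Two passes: declared bots / probe paths first, then an access-pattern
// pass that flags IPs with implausibly many requests for manual review.
const DECLARED_BOT = /bot|crawl|spider|slurp|curl|wget/i;
const PROBE_PATHS = ["/wp-login.php", "/xmlrpc.php", "/.env", "/phpmyadmin"];

interface Hit {
  ip: string;
  path: string;
  userAgent: string;
}

function isLikelyBot(hit: Hit): boolean {
  if (DECLARED_BOT.test(hit.userAgent)) return true; // self-identified bots
  if (PROBE_PATHS.some((p) => hit.path.startsWith(p))) return true; // script kiddies
  return false;
}

function heavyHitters(hits: Hit[], threshold = 1000): Set<string> {
  const perIp = new Map<string, number>();
  for (const h of hits) perIp.set(h.ip, (perIp.get(h.ip) ?? 0) + 1);
  return new Set(
    [...perIp.entries()].filter(([, n]) => n > threshold).map(([ip]) => ip)
  );
}

// Usage: const bots = heavyHitters(hits);
//        const clean = hits.filter((h) => !isLikelyBot(h) && !bots.has(h.ip));
```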

------
thinkloop
Does hosting client-side tracking on your own domain circumvent all the
problems? How come that hasn't become the standard and killed 3rd-party
trackers? If it's a question of having to manage an analytics platform, can't
that still be deferred to a 3rd-party but through your own subdomain?
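
Something like this, I imagine (a sketch only; hostnames are made up, and a
real vendor endpoint would be HTTPS via the "https" module, likely with auth
headers):

```typescript
// Sketch of the "own subdomain" idea: a tiny reverse proxy you run at,
// say, analytics.example.com that forwards tracking beacons to the
// vendor's collector, so the browser only ever talks to a first-party
// host. Hostnames are hypothetical.
import { createServer, request } from "http";

const UPSTREAM = "collector.vendor.example"; // hypothetical vendor host

createServer((clientReq, clientRes) => {
  const upstream = request(
    {
      host: UPSTREAM,
      path: clientReq.url,
      method: clientReq.method,
      headers: { ...clientReq.headers, host: UPSTREAM },
    },
    (upstreamRes) => {
      clientRes.writeHead(upstreamRes.statusCode ?? 502, upstreamRes.headers);
      upstreamRes.pipe(clientRes);
    }
  );
  upstream.on("error", () => clientRes.writeHead(502).end());
  clientReq.pipe(upstream);
}).listen(8443);
```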

~~~
derivagral
My interpretation of how Google Analytics does this is that pulling in a
third-party dependency lets them keep a lot of control over versioning and
similar things. I assume they do this so that fixes for out-of-date or
vulnerable code (or extra tracking, sure) can be cleanly and quickly rolled
out across the network, rather than relying on clients to properly update
"ga.js" a few times a day.

------
paulchap
Funnily enough, I got a warning telling me uBlock prevented this page from
loading...

------
pkaye
What does parse.ly do differently to account for this discrepancy in the
analytics?

~~~
pixelmonkey
Good question. I explain some of the technical approaches we've taken with
customers in another comment here:

[https://news.ycombinator.com/item?id=24205803](https://news.ycombinator.com/item?id=24205803)

------
pixelmonkey
I'm one of Parse.ly's co-founders. This post was written by one of our product
managers about a project and investigation we've been doing for the past few
months. It first got on my team's radar when I posted this set of tweets back
in 2019:

[https://twitter.com/amontalenti/status/1165262620959617025](https://twitter.com/amontalenti/status/1165262620959617025)

Specifically: I noticed a huge difference between the metrics Parse.ly was
reporting for a post on my personal blog and the metrics being reported by the
blog's Cloudflare CDN (which caches the content).

Ironically enough, this traffic was all coming from HN and the post was itself
about modern JavaScript[1].

Since then, we've also been hearing from a lot of customers about various
scenarios where traffic is either under-counted or mis-counted. For example,
something that has been tripping us up lately is that our Twitter integration
relies (partially) upon the official t.co link shortener[2], and yet, due to
modern browser rules related to W3C Referrer Policy[3], the t.co link's path
segment is often not transmitted to the analytics provider, and thus the
source tweet for traffic cannot be easily ascertained.

I firmly believe in privacy and analytics without compromise[4], so the team
is trying to come up with ways to at least quantify shadow traffic at an
aggregate level and to ensure legitimate user privacy interests are honored,
while making sure those protections don't break legitimate privacy-safe
first-party analytics use cases.

As a developer, something that concerned me recently was realizing that
Sentry, the open source error tracking tool with a SaaS reporting frontend and
a JavaScript SDK[5], gets blocked in many conservative browser privacy setups.
Though the interest in user privacy is legitimate, I think we can all agree
it'd be better for site/app operators to know when certain browsers are
hitting JavaScript errors.

[1]: [https://news.ycombinator.com/item?id=20785616](https://news.ycombinator.com/item?id=20785616)

[2]: [https://help.twitter.com/en/using-twitter/url-shortener](https://help.twitter.com/en/using-twitter/url-shortener)

[3]: [https://www.w3.org/TR/referrer-policy/](https://www.w3.org/TR/referrer-policy/)

[4]: [https://blog.parse.ly/post/3394/analytics-privacy-without-compromise/](https://blog.parse.ly/post/3394/analytics-privacy-without-compromise/)

[5]: [https://sentry.io/for/javascript/](https://sentry.io/for/javascript/)

~~~
MauranKilom
So what do you actually (want to) do in regards to measuring shadow traffic?
The blog post tries to convince the reader they should care about shadow
traffic and then handwaves away existing solutions, but "we're developing a
solution" is as concrete as the post gets as to why I should turn to parse.ly.
Now you say "the team is trying to come up with ways to at least quantify
shadow traffic at an aggregate level", so it appears that you don't even have
a solution yet.

In light of that, presenting "existing analytics services like parse.ly" as
one of three solutions on "how to measure shadow traffic" seems borderline
disingenuous. If you can do it, why not say so plain and clear? If you can't
do it, why do you mention yourself as a solution? Or is it only other services
_like_ parse.ly that can do it?

It also rubs me the wrong way how both the blog post and your comment have an
undertone of "it would be better if users didn't have as much tracking
protection". Just take the framing of your last sentence as an example...

~~~
pixelmonkey
Actually, we have a few accidental "solutions" to this problem already in
production, and we are just trying to figure out which one strikes the right
balance between respecting privacy preferences and providing site/app owners
with visibility into shadow traffic.

Here they are:

\- Server-side proxy: The blog post I linked about The Intercept uses this
setup. Basically, a web server run by our customer captures all the traffic
inside their cloud or hosting environment. That traffic is then logged and
proxied to our data capture server via our server-side protocol, with some
data scrubbed before we receive it (e.g. the IP address removed).

\- First-party custom domain: We spin up a server and HTTPS certificate and
the customer points their own subdomain (via a DNS A or CNAME record) to that
server, which serves as a proxy. We originally built this facility to clarify
data ownership in a GDPR context -- where the customer is a data controller
and we are a data processor, so the controller owns the domain where data
ingest happens.

This and the prior solution have the side-benefit that Parse.ly couldn't do
any cross-site linking even if third-party cookies were enabled in the
browser. We never do this anyway, but both of these setups make it technically
impossible due to the browser rules around cookies and domains, which is a
nice security improvement. But it also raises other issues, like the fact that
the customer setup is more complex, with more moving parts.

\- API connection to CDN. This one is not actually productionized but was
merely prototyped. We'd pull basic per-day and per-page CDN server request
logs and compare them to our pageview counts to understand the delta, which is
likely mostly shadow traffic. The upside of this solution is that it might be
pretty easy to set up for customers; the downside is that we'd have to build
connectors for a lot of CDNs, and through market research we have learned that
larger customers might use multiple CDNs at once (believe it or not).

\- "Fallback" logging of blocked page loads. This one was also just
prototyped, but the idea is that some JavaScript code would detect whether
Parse.ly JavaScript SDK was blocked from loading, and if so, a basic privacy-
safe "this page's analytics were blocked" event would be sent to a domain
owned by the customer, perhaps one that ensured scrubbing of all details other
than the "fact" that a block event happened at a particular timestamp. We
actually prototyped this particular idea on our own marketing site because we
ran into issues with Marketo & Parse.ly data vs our server logs and even our
lead capture forms. (That is, situations where a lead was captured for someone
with "zero pageviews", because their session was shadow traffic but their form
fill nonetheless happened.)
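
To make that last idea concrete, here is roughly what the detection looks
like. This is a simplified sketch, not our production implementation; the
PARSELY global check and the endpoint URL are illustrative:

```typescript
// Simplified sketch of the fallback prototype: after the page loads,
// check whether the analytics SDK's global ever appeared; if not, send a
// single privacy-safe "blocked" ping to a customer-owned domain.
window.addEventListener("load", () => {
  const sdkLoaded = typeof (window as any).PARSELY !== "undefined";
  if (!sdkLoaded) {
    // Deliberately minimal payload: just the fact and the timestamp.
    navigator.sendBeacon(
      "https://analytics.customer-domain.example/blocked",
      JSON.stringify({ event: "analytics-blocked", at: new Date().toISOString() })
    );
  }
});
```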

Re: your comment that you sense an undertone of "it would be better if users
didn't have as much tracking protection": I have no such personal or
professional belief, and I can assure you it isn't a view held by our
company/team. I understand the motivation for tracking protection, and we even
suggest use of Mozilla Firefox's tracking prevention option in our privacy
policy.

But there's no doubt that it is leading to confusing data discrepancies for
site/app owners, and I think site owners have a right to a _basic_
understanding of how much of the traffic they are paying the hosting bills to
serve is actually perceptible to their observability/reporting, even if the
only detail they get about that visit is "the visit happened", similar to the
level of detail they get from server logs or CDN logs as a matter of course.

~~~
MauranKilom
Thank you for the detailed reply! I understand that the technical details
might not be of great interest to the audience you were primarily targeting
with the blog post, but I think the discussion on HN would be much enriched
(possibly including actionable-for-you ideas and suggestions) if these points
were mentioned in the submission itself.

> I have no such personal or professional belief, and I can assure it isn't a
> view held by our company/team

Fair enough; it might just be my interpretation, colored by the context of the
content being written by an analytics business. Although I'm still at a bit of
a loss as to what the implication is in the sentence I was referencing. Could
you spell that out for me?

~~~
pixelmonkey
Yea, I think the point of the post was to "name a thing", not necessarily to
wade into the tech. (Like I mentioned, it was written by a PM colleague who
has been discussing this issue with customers/prospects.)

That is, the aim was to introduce this idea of "shadow traffic", since most of
our customers/prospects don't even know it exists! (Whereas, for example, "bot
traffic", which can inflate analytics numbers, is a well-known problem.)

The sentence you are referencing is about Sentry error tracking, right? All I
was intending to say is that sometimes tracking protection throws the baby out
with the bathwater. The end user wants to avoid creepy ads and privacy leaks,
but instead they end up blocking error tracking tools, whose primary purpose
is to catch and fix frontend coding bugs. And when those same blocking rules
become browser defaults, you can end up in a situation where whole classes of
users don't have errors tracked/logged, merely because the site owner (quite
reasonably) chose to use a SaaS for error tracking/logging rather than, say,
rolling their own self-hosted system for that commodity use case.

