
Self-host analytics for better privacy and accuracy - FiloSottile
https://blog.filippo.io/self-host-analytics/
======
pixelmonkey
Piwik is a great project, but it tends not to work well for handling sites
with millions of events per day. Your MySQL table starts to burst at the seams
pretty quickly.

For big sites, you'll want that event data in GiBs of plain raw logs that you
can bulk-load into tools like BigQuery or Redshift for analysis.

My team has built/delivered a SaaS web content analytics platform for the past
few years called Parse.ly. We instrument pageview events (like Google
Analytics) automatically, and we also instrument time-on-page using heartbeat
events. We collect 50 billion monthly page events for over 600 top-traffic
sites, and display it all in a real-time dashboard.

To emphasize that our customers own the data they send us, we recently
launched a Raw Data Pipeline product:

[http://parse.ly/data-pipeline](http://parse.ly/data-pipeline)

Basically, we host a secure S3 bucket and Kinesis stream for customers, and
deliver their raw (enriched) event data there. From there, they typically load
it into their own BigQuery or Redshift instance, or they analyze pieces of it
directly with Python/R/Excel/etc.
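For a sense of what "analyze pieces of it directly with Python" can look like, here is a minimal sketch that tallies pageviews from newline-delimited JSON events. The field names (`action`, `url`, `visitor_id`) are invented for illustration and are not Parse.ly's actual schema; a real run would read the files delivered to the S3 bucket instead of an inline string.

```python
import json
from collections import Counter

# A few sample newline-delimited JSON events, standing in for a file
# fetched from the customer's S3 bucket (field names are hypothetical).
raw_events = """\
{"action": "pageview", "url": "/post/1", "visitor_id": "a"}
{"action": "heartbeat", "url": "/post/1", "visitor_id": "a"}
{"action": "pageview", "url": "/post/2", "visitor_id": "b"}
{"action": "pageview", "url": "/post/1", "visitor_id": "c"}
"""

def pageviews_per_url(lines):
    """Count pageview events per URL from raw NDJSON event lines."""
    counts = Counter()
    for line in lines.splitlines():
        event = json.loads(line)
        if event["action"] == "pageview":
            counts[event["url"]] += 1
    return counts

print(pageviews_per_url(raw_events))  # Counter({'/post/1': 2, '/post/2': 1})
```

The same few lines scale down nicely: point them at a day's worth of downloaded S3 objects and you have an ad-hoc report without touching Redshift.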

Our customers tell us this strikes the right balance among data ownership,
query flexibility, and hassle-free infrastructure.

~~~
educar
Is there any difference in data ownership between GA and Parse.ly?

~~~
pixelmonkey
Data ownership in GA is a "gray area" that becomes less gray if you pay
$150K/yr for "GA Premium".

Google has mixed incentives in running its free analytics service. It gets
web-wide analytics data, it uses data to help it sell more AdWords to
customers, and it integrates GA with other services, like their display
advertising products (DFP, etc.)

From a practical standpoint, you don't "own" analytics data when a) you can't
easily access it in raw form and b) the SaaS provider "leaks" your data to
dilute its value to you. We address (a) and (b) directly through our products
and public data privacy stance. See this blog post for our public view on
analytics data privacy:

[http://blog.parsely.com/post/3394/analytics-privacy-without-compromise/](http://blog.parsely.com/post/3394/analytics-privacy-without-compromise/)

~~~
lewisl9029
While I find your stance on privacy very refreshing for an analytics company,
hiding your pricing info behind a sales rep is a huge turnoff for me. If you
feel your pricing is reasonable for the service that you provide, I really
don't see why you can't just display it proudly on your site.

If SpaceX can afford to not hide their pricing behind a sales rep, so can you:
[http://www.spacex.com/about/capabilities](http://www.spacex.com/about/capabilities)

~~~
pixelmonkey
Whether to display pricing on the website is something we debated in the past,
and continue to debate. (Your comment may wake up the debate for me.)

Pricing for analytics services (in the marketplace) is all over the map.
Google picked $150K/year as the price for GA Premium because that's the low
end of an Adobe Analytics contract, who is the market leader. We're typically
cheaper than existing Adobe/GA contracts. Non-competitive "event analytics"
companies like MixPanel and Heap have variable per-event pricing that would
break the bank for the customers we serve. We have a bit of an aversion to
per-event pricing because it feels like "punishing customers for success".

Meanwhile, per seat pricing, though attractive on the surface and popular in
the SaaS segment, has several concerns in our space. First, we want customers
to feel free to hand out access to our platform: part of our value proposition
is democratizing access to analytics data. So we don't want "stingy seat
quotas" typical with tools like Salesforce. Second, for an analytics tool,
seats are a bit easier to "hack" for a pricing model -- though our dashboards
can be customized per user, a single shared account can access all the data.
Meanwhile, our costs don't scale with seats, but with site traffic/users
instead.

For these reasons, and more, we've settled on "tiered pricing". Roughly
speaking, our service is offered in three tiers. Each tier supports a larger
class of site (more monthly uniques), which also bestows more features (e.g.
more data retention in higher tiers). To work within the budget constraints of
some companies, we will discount tiers while removing cost-affecting features,
e.g. maybe you are in the highest class of site, but we disable API access and
limit data retention. Because this is a tad more complex than a pricing page
could express easily, and also because we think the value of the product comes
through best in a guided demo, we made the decision to hide pricing and
instead responsively provide demos on-demand.

So, the tl;dr is, pricing, and the display of it, is definitely something we
think about, and we have (IMO valid) reasons for not displaying pricing right
now, but you make a fair point: if Musk can price his rockets publicly, maybe
we can figure something out, too :)

~~~
Scirra_Tom
I avoid all services with prices hidden behind a sales rep when possible. I
always feel I'm not an astute negotiator, but the sales rep will be; ergo, he
will see me as a mug and take me for all he can get. If I do buy, even if I'm
happy, I'm always left wondering if I'm paying 2x as much as other customers.

------
_jomo
Another very simple alternative is goaccess [0], which is purely server-side.
It gathers quite a lot of information by parsing the server log. It doesn't do
all the JavaScript stuff to track every single click, but it gives you stats
on how many users visit your site, which parts, when, which sites or domains
they're coming from, and also how much bandwidth they use. It also shows which
status codes are coming from which paths, and a lot more. It supports various
output formats such as HTML or an interactive htop-like terminal application.
It's also being actively developed. I use it and find it very useful.

0: [https://www.goaccess.io/](https://www.goaccess.io/)
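To make the "purely server-side" point concrete, here is a toy sketch of the kind of information a log analyzer recovers from a combined-format access log: hits, status codes, and bandwidth per path. This is not GoAccess's implementation, just an illustration of what the log alone can tell you; the regex is a simplified version of the nginx/Apache "combined" format.

```python
import re
from collections import Counter

# Simplified regex for the nginx/Apache "combined" access log format.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

# Sample log lines standing in for /var/log/nginx/access.log.
sample_log = """\
1.2.3.4 - - [10/Apr/2016:13:55:36 +0000] "GET /blog/ HTTP/1.1" 200 5120
5.6.7.8 - - [10/Apr/2016:13:56:01 +0000] "GET /blog/ HTTP/1.1" 200 5120
5.6.7.8 - - [10/Apr/2016:13:56:02 +0000] "GET /missing HTTP/1.1" 404 120
"""

hits, bandwidth, statuses = Counter(), Counter(), Counter()
for line in sample_log.splitlines():
    m = LOG_RE.match(line)
    if not m:
        continue  # skip lines that don't parse
    hits[m["path"]] += 1
    statuses[m["status"]] += 1
    if m["bytes"] != "-":
        bandwidth[m["path"]] += int(m["bytes"])

print(hits)      # Counter({'/blog/': 2, '/missing': 1})
print(statuses)  # Counter({'200': 2, '404': 1})
```

Everything above comes from data the server already writes; no JavaScript tag, and nothing extra sent by the visitor's browser.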

~~~
herbst
GoAccess is really one of the greatest tools of this kind. I use it with GA
on all my servers.

------
shermozle
Piwik has the problem that it writes directly to MySQL as the activity
happens. If your database is down, you lose data. If you have a spike of
traffic above what your DB can handle in writes, you lose data.

Snowplow doesn't have this problem.

~~~
moehm
You can queue your writes in Redis, so they won't be lost if your database
goes down.

[https://piwik.org/faq/how-to/faq_19738/](https://piwik.org/faq/how-to/faq_19738/)
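The buffering pattern behind that FAQ can be pictured roughly like this. A plain deque stands in for the Redis list here so the sketch runs anywhere (Piwik's actual setup uses Redis and its own queue worker): the tracking endpoint only appends to the queue, and a separate worker drains it into MySQL whenever the database is reachable.

```python
from collections import deque

# A deque stands in for a Redis list (LPUSH/RPOP). The point is the
# pattern: tracking requests are enqueued immediately, and a worker
# drains the queue into the database whenever it is available.
queue = deque()

def track(request):
    """Fast path: enqueue the raw tracking request, never touch the DB."""
    queue.append(request)

def flush(db_available, write_to_db):
    """Worker: drain queued requests into the database when it is up."""
    written = 0
    while db_available and queue:
        write_to_db(queue.popleft())
        written += 1
    return written

track({"url": "/post/1"})   # arrives while the DB is down...
track({"url": "/post/2"})   # ...and simply waits in the queue
stored = []
print(flush(True, stored.append))  # 2 -- nothing lost to DB downtime
```

With this in place, a DB outage or a write spike just grows the queue instead of dropping events, which addresses the data-loss objection upthread.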

~~~
JohnBooty
Note to anybody reading this but not following the link: parent poster means
"queue your writes in Redis" and not "query."

~~~
moehm
I'm sorry, I am not a native speaker and it was already late.

------
Ileca
"Self-hosting analytics for better privacy and accuracy."

Then why use Google Fonts, which is listed by Disconnect.me (used by Firefox)
as a tracking domain? Isn't that paradoxical?

~~~
FiloSottile
Oh, hey, good point. Let me fix that.

EDIT: Done. Not exactly straightforward to download Google Fonts but there are
great helpers around. Got rid of CDNjs as well, since CloudFlare has HTTP/2
now. No 3rd parties left except GA, which will go in a day or two.

~~~
tombrossman
If you just want to have nicer webfonts without the 'phoning home to Google'
issue try Brick webfonts at [http://brick.im/](http://brick.im/).

If the goal is to get rid of all third-party dependencies you can still use
the Brick repository on GitHub to download the better looking fonts and self-
serve them, just as you are now with the Google served ones.

The only 'gotcha' I found with using Brick is that NoScript users won't see
the fonts and won't see the Brick URL in the menu for whitelisting. They will
need to inspect the page and learn that they must manually add brick.im. Not
exactly user-friendly but then again NoScript breaks lots of things and users
are used to it.

------
manigandham
Google Analytics has Measurement Protocol which allows for posting data
server-side.

All you need to do is proxy the data collection and then send it to them,
taking advantage of all the scalability and features they have.

[https://developers.google.com/analytics/devguides/collection...](https://developers.google.com/analytics/devguides/collection/protocol/v1/)
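A server-side Measurement Protocol v1 hit is just a form-encoded POST to GA's collect endpoint. Here is a minimal sketch that builds the payload for a pageview (the tracking ID and client ID below are placeholders); the actual send would be a plain HTTP POST of `body` to `GA_ENDPOINT`, which is what a first-party proxy would do after receiving the event from the browser.

```python
from urllib.parse import urlencode

GA_ENDPOINT = "https://www.google-analytics.com/collect"

def pageview_payload(tracking_id, client_id, page_path):
    """Build a Measurement Protocol v1 pageview hit body."""
    return urlencode({
        "v": "1",            # protocol version
        "tid": tracking_id,  # your GA property ID (placeholder below)
        "cid": client_id,    # anonymous client identifier
        "t": "pageview",     # hit type
        "dp": page_path,     # document path being viewed
    })

body = pageview_payload("UA-XXXXX-Y",
                        "35009a79-1a05-49d7-b876-2b884d0f825b",
                        "/self-host-analytics/")
print(GA_ENDPOINT, body)
# POSTing `body` to GA_ENDPOINT records the pageview server-side.
```

Because the hit originates from your server, adblock lists that match `google-analytics.com` in the browser never see it.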

------
viana007
The biggest problem with Piwik is that it is not scalable, plus the cost ($)
of the servers needed to store analytics data. I have seen many cases where
Piwik could not support a big volume of data.

~~~
rhizome
Why isn't it scalable?

~~~
shermozle
MySQL

~~~
rhizome
I see. What is the upper bound of "big volume of data" possible under MySQL +
Piwik?

~~~
shermozle
That's a difficult thing to answer. However, the more important problem is loss
of data whenever your DB isn't available due to downtime, upgrades, etc. It
depends how important data loss is for your use case. I'm a data completist
but I'm in therapy for it ;)

It's definitely worth playing with, and trivially easy to spin up. Other self-
hosted options aren't anything like as simple to get up and running.

~~~
rhizome
Well, but the point was MySQL + Piwik "does not scale" and that it's
"expensive" besides, which doesn't comport with my experience and sounds like
received wisdom.

------
mwexler
There are many, many self-hosting analytics tools, from your own big data
pipes to tools like the aforementioned piwik, as well as Open Web Analytics. I
like Snowplow
([http://snowplowanalytics.com/](http://snowplowanalytics.com/)), but it's
currently hard-coded to AWS.

Hosting your own analytics data can be great, but there are lots of ways to
get better accuracy and control over your data without having to host
everything. Still, if you can, it's great.

~~~
yummyfajitas
Snowplow is only loosely hard-coded to AWS. I'm using it, and breaking it free
is only a few hundred lines of code.

For example, rerouting Snowplow's Kinesis collector into Kafka is 114 lines,
and that includes logging, metrics, etc. - I basically just had to extend the
AbstractSink object in their Scala collector. Reading from Kafka is another
couple of hundred lines, similarly writing to files.

~~~
alexatkeplar
Thanks for sharing, yummyfajitas - expect official Kafka support for Snowplow
a little later this year [1] [2]; it's been long awaited! (Snowplow co-founder)

[1]
[https://github.com/snowplow/snowplow/milestones/Kafka%20%231](https://github.com/snowplow/snowplow/milestones/Kafka%20%231)
[2]
[https://github.com/snowplow/snowplow/milestones/Kafka%20%232](https://github.com/snowplow/snowplow/milestones/Kafka%20%232)

~~~
yummyfajitas
Nice. If you build some sort of native maxmind or other geotargeting into the
scala collector, that would also be cool.

(Not that it was difficult to roll my own - so far snowplow is perfect for my
needs - but obviously I'd rather use an official one.)

~~~
alexatkeplar
We do all enrichments like MaxMind, weather, arbitrary JavaScript, etc.
downstream of collection, in our enrichment phase - the list of configurable
enrichments is here: [https://github.com/snowplow/snowplow/wiki/Configurable-enrichments](https://github.com/snowplow/snowplow/wiki/Configurable-enrichments)

------
davidw
One potential problem: if you, say, test out self-hosted alongside GA, and get
different numbers, people are going to question the value of the self-hosted
thing.

~~~
soared
They should question both, but maybe GA more. GA is an approximation and
absolutely not completely correct.

~~~
davidw
Try selling that to a business guy or client or someone. Possible, but not
easy, especially if there's ever a large discrepancy.

~~~
soared
Haha oh I know. Tell a client, "Yeah, this is the data, it says this, BUT this
is also how it's wrong," and they always ignore the last part. It's data, that
stuff can't be wrong!

------
__jal
I've done my own analytics since before there were hosted services.

Still use an old-school, proprietary tool called Sawmill[1]. One very nice
aspect of it (out of many) is that it handles hundreds of other log formats
out of the box, so it can report sensibly on switches, email, firewall logs...
just about anything, with very little effort.

No connection, other than being a long-time happy customer.

[1] [http://sawmill.net/](http://sawmill.net/)

~~~
blacksmith_tb
I haven't used Sawmill in ages, but it wasn't bad, once upon a time. There's
also GoAccess, which is a snazzy ncurses logfile analyzer:

[https://www.goaccess.io/](https://www.goaccess.io/)

~~~
mattab
Piwik also handles dozens of log formats, check it out:
[http://piwik.org/log-analytics/](http://piwik.org/log-analytics/)

GitHub project: [https://github.com/piwik/piwik-log-analytics](https://github.com/piwik/piwik-log-analytics)

------
buremba
Shameless plug: we're also working on an on-premise custom analytics platform
that can be deployed to either Heroku or AWS with CloudFormation.
[https://github.com/rakam-io/rakam](https://github.com/rakam-io/rakam)

------
d0ugie
To mitigate security loss (Piwik is complex), run Piwik and serve its gif on
another machine.

~~~
ljoshua
Can you elaborate a bit more?

~~~
SwellJoe
I'd assume the prior poster is suggesting that running a third party tool on
your primary web server(s) would increase the surface area for attacks by some
(possibly not small) amount. e.g. if piwik is compromised, an attacker would
have some sort of user access on your web server(s), which is generally a bad
thing. I suppose some people also use the same database user for all web
applications, which would potentially be disastrous.

There are mitigations one can implement without going to that length (run
piwik under a different user than any of your other web applications, using
suexec, use a different database and user, etc.). At the least, putting Piwik
in a container or VM makes sense, if any data on your web server(s) is
critical or sensitive.

I suspect for very large deployments this would go without saying. But, for
users with only one web server, it might seem reasonable to drop it into the
same virtual host and run it all as the same user (and it's probably safe
enough to do so for many users, as long as they stay on top of updates). But,
any web application you run adds surface area for attackers. Might as well
isolate them as well as your skills and resources allow.

~~~
mattab
See also some of the official Piwik security tips:
[https://piwik.org/docs/how-to-secure-piwik/](https://piwik.org/docs/how-to-secure-piwik/)

------
Animats
If you're serving the pages yourself (not, as this site does, through
Cloudflare's caches), what does this tell you that isn't in the server logs?

~~~
FiloSottile
Unless you are tracking events, outlinks, time spent on page, or something
similar, nothing as far as data goes. But obviously server logs lack
statistics and aggregation. Piwik actually supports using the server logs as
its data source instead of the JavaScript tag.

------
tirus
For those of us who prefer not to be tracked by self-hosted Sandstorm apps,
two new uBlock rules (they're very rough, but I am no uBlock wizard):

||sandcats.io/embed.js$script

view.gif?page=$image

Note that you would have to change the filters if the scripts and tracking
pixel are renamed, of course, but this should catch a majority of the
push-button installs.

~~~
dragonne
Why do you object to self-hosted analytics? I understand blocking centralized
trackers (I do so myself), but self-hosted doesn't seem problematic in the
same way GA being present on half the pages on the Internet is.

It also strikes me as an unwinnable battle for all but the largest sites.

~~~
angry-hacker
Because OP is against all kind of tracking? And because he can...

~~~
Programmatic
I can't claim to speak for OP, but am also against most tracking. I would also
tend to think that being against first party tracking would be an unwinnable
battle. It also leaks less data than third-party tracking, since the third
party can see your activity across multiple sites whereas first party can only
see your activity on that site unless it's aggregated through a backend
service (another poster mentioned the ability to upload server logs to GA). No
matter what, they can see what you load from their site.

Getting to first-party hosting of more intrusive analytics (scroll location,
etc.), I think rather than disallowing certain scripts/URLs from running, you
have to get back to behavioral-based blocking. Doing that in an environment
where you allow any JS to execute seems tough, since a sandboxed component
that can update the page based on location can "talk" to another part that
reports back to the server.

If you don't like intrusive first party analytics, just stop all JS.

------
philliphaydon
Off topic a bit. But with GA being blocked: all these blockers work by
blocking URLs and domains at the time they are loaded, not requests made after
the requested resource has loaded. Couldn't you just proxy the URL to GA so
it's not blocked?

~~~
dest
By "requests made after load of the requested resource", you mean JS XHR
requests for example? I'd guess they are also filtered by adblockers, aren't
they?

------
daedalus_j
Not quite true though... Piwik, by default, includes EasyList, which has the
following block rule:

/piwik.$domain=~piwik.org,script

This will block your self-hosted piwik.js file, unless you perform some
redirection trickery.

~~~
FiloSottile
The default Piwik Sandstorm install has "piwik" nowhere in the URL.

~~~
tirus
Yeah, unfortunately it shows up as
[https://ls4an735rucvfa6ps6bb.filippo.sandcats.io/embed.js](https://ls4an735rucvfa6ps6bb.filippo.sandcats.io/embed.js)
- it's getting to the point where I need to start performing actual content
inspection.

~~~
aorth
I run Piwik on my server log files. This is my right as a website owner /
system admin. :)

------
aluhut
I would love to have Sandstorm on my RasPi but they don't support ARM.

It sounds like a great home intranet.

~~~
ocdtrekkie
Just notes on this for the curious:

1. Sandstorm doesn't support ARM currently because Sandstorm apps run native
Linux binaries, and every app would have to be compiled for each architecture.

2. I honestly think you'd be running pretty crippled trying to do Sandstorm
on a RasPi. It's a bit smaller scale than Sandstorm seems targeted for. Each
open Sandstorm grain commonly uses 100 MB of RAM or more (on top of the RAM
used by Linux and the Sandstorm server itself, of course), so with just a
couple of Sandstorm grains running simultaneously, you can max out a RasPi
pretty quickly.

~~~
MustardTiger
>Sandstorm doesn't support ARM currently because Sandstorm apps run native
Linux binaries, and every app would have to be compiled for each architecture.

That's true of any Linux distro providing binary packages. They all support
ARM anyway; it is trivially simple to compile packages. Even small projects
like OpenBSD compile tens of thousands of packages for a dozen arches.

~~~
kentonv
Yes but distros accomplish that by being highly opinionated on the build
process you use to build packages whereas Sandstorm tries to be unopinionated
on this point.

Sandstorm will support ARM someday but it's going to require a large
investment in tooling in order to be painless for developers.

~~~
MustardTiger
>Yes but distros accomplish that by being highly opinionated on the build
process you use to build packages whereas Sandstorm tries to be unopinionated
on this point.

How's that exactly? I looked it over and can't see any difference at all.

