
How we built analytics.usa.gov - taylorbuley
https://18f.gsa.gov/2015/03/19/how-we-built-analytics-usa-gov/
======
err4nt
This is truly beautiful! I have also been experimenting with building HTML
pages that load all content from a BaaS provider. I can host a single static
HTML page on S3 and include a login view and content editor view in the same
file to let authenticated users make changes to the content that loads into
the page for public users.

I was so impressed by your dashboard earlier; reading that you built it using
the same techniques (an empty page populated by JSON) makes me really happy!
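
For the curious, the pattern is tiny. A minimal sketch in plain JS, where the
endpoint, element id, and JSON shape are all hypothetical, not the actual
analytics.usa.gov ones:

```javascript
// Minimal sketch of the "empty page populated by JSON" pattern.
// The JSON shape here is made up for illustration.
function render(report) {
  // Turn a fetched JSON report into the text the page should show.
  return report.totals.visits + ' visits in the last 90 days';
}

// In the browser you'd fetch the JSON and inject the rendered result:
// fetch('/data/live/today.json')
//   .then(function (res) { return res.json(); })
//   .then(function (report) {
//     document.getElementById('total_visits').textContent = render(report);
//   });
```

The static HTML ships empty; all the numbers arrive as data after page load.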

------
gandalfu
Why will the US government send all their traffic stats to Google? Why not use
piwik? [http://piwik.org/](http://piwik.org/)

Am I being too paranoid? This is, after all, third-party software served by
the government.

~~~
mlissner
I think it's a reasonable question and I'm a big fan of piwik, but running it
at this kind of scale is very hard. We get about 10k hits/day on the piwik
instance we run and it's consistently taking more resources than the
application it's tracking.

~~~
bjelkeman-again
Are there any other viable open source traffic analysis tools than Piwik? I'd
hate to have to roll my own.

~~~
alexatkeplar
Snowplow (I'm a co-founder) can happily scale to billions of events per day:
[https://github.com/snowplow/snowplow](https://github.com/snowplow/snowplow)

~~~
e12e
Judging from the repo: "Collectors receive Snowplow events from trackers.
Currently we have three different event collectors, sinking events either to
Amazon S3 or Amazon Kinesis" (etc.) -- it's still not viable to self-host
Snowplow on your own hardware/internal cloud? Or is it possible, but you need
to run a full cloud? (I understand why one would want a setup that runs on
Amazon, if one uses Amazon, but when you host your own infrastructure, a
self-host option would be nice ... if viable.)

Without an option to self-host, Snowplow isn't really an alternative to Piwik.

~~~
alexatkeplar
Hey e12e! It's a great question. You are right - at the moment Snowplow is
still tied to the AWS cloud; we use a variety of AWS services which support
massively horizontal processing, including Elastic MapReduce, Kinesis and
Redshift. We are working on a Kafka+Samza version of Snowplow which we will
release later this year, most likely running on a Mesos cluster that you can
deploy where you want.

~~~
bjelkeman-again
We have to move away from US hosted services, so we have to wait for the
Kafka+Samza version if we go that route. Thanks!

------
smacktoward
_> This data comes from a unified Google Analytics profile_

I wish my tax dollars were not being spent to help a for-profit entity track
me across the Web.

~~~
wpietri
If you'd like to opt out, try this:

[https://tools.google.com/dlpage/gaoptout](https://tools.google.com/dlpage/gaoptout)

It says, "Available for Microsoft Internet Explorer 8-11, Google Chrome,
Mozilla Firefox, Apple Safari and Opera".

Personally, I prefer this to having my tax dollars spent on reinventing web
analytics. Especially given that it wouldn't be cheap; Google bought what is
now Google Analytics in 2005 for circa $30 million [1] and has put a lot into
improving it since. And now that I think about it, I'm not sure that tax
dollars are, net, being spent here. GA is very easy to install, and good
analytics typically help save money through better decisions and focusing
effort on what's actually being used.

[1] [http://searchenginewatch.com/sew/news/2062461/google-to-acquire-urchin-web-analytics-firm](http://searchenginewatch.com/sew/news/2062461/google-to-acquire-urchin-web-analytics-firm)

~~~
teacup50
Opt-out, really? We shouldn't have to opt-out of something that shouldn't be
happening in the first place.

The government shouldn't be inserting a 3rd-party -- Google -- into my
communications with them!

As for this:

> _... and good analytics typically help save money through better decisions
> and focusing effort on what's actually being used._

Pffft. That's only true for startups whose sole goal is maximizing
traffic/retention/whatever while absolutely minimizing the amount of
creativity they expend on moving those "key performance metrics".

~~~
nemothekid
> _The government shouldn't be inserting a 3rd-party -- Google -- into my
> communications with them!_

I think you'd be surprised to discover that the government has been
using 3rd parties for years to manage various services. Did you know the
government also uses Oracle databases to store your information?

~~~
coroutines
Similarly, we should stop caring about domestic surveillance by the gov
because the gov has been doing it for years and we've only learned of it now.
/s

Btw, using Google Analytics is a lot different than using an Oracle database -
Oracle isn't privy to what you're storing.

~~~
nemothekid
> _Oracle isn't privy to what you're storing._

Do you work at Oracle? Are you 100% sure this is true? Have you audited all
the hosted 3rd party solutions for the Government, and interviewed every
employee to ensure that they aren't privy to your data? Have you ensured that
all code written for the US govt has no cryptographic errors?

The claim that the US Govt. shouldn't use google analytics because Google can
read their data is hardly any different than arguing they shouldn't use
managed Oracle solutions because some Oracle exec has access to their data.
Where does it end? Should the US Govt. now spend and build their own analytics
infrastructure? Should the US Govt. write their own databases? Print their own
silicon? And what about everyone else? If a US company isn't trusted by the
US Govt to keep data secure, what about insurance companies? Or healthcare
companies? Should they, too, be printing their own silicon? How is this a
solution?

If the problem is privacy, the solution is to create processes that respect
privacy. Sitting in the dark because electricity might burn your house down is
not a solution.

And God forbid that the Govt coerces Google to share data on _how people are
using its own servers, data it is already making public_.

------
orasis
Dang. I wistfully hoped this would be drill-down, real-time analytics for the
U.S. economy and government operations.

------
vijayr
_Testers often assumed any three-digit number was actually short for a six-
digit number._

This is interesting; I haven't come across this before. What is the best way
to display big numbers? 550K or 550,000? 5M or 5,000,000?
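
Not an answer on which form is better, but both are one-liners in modern
JavaScript, so a dashboard can easily offer either:

```javascript
// Two common ways to render a big number in JavaScript.
var n = 550000;

// Full form with thousands separators: "550,000"
var full = n.toLocaleString('en-US');

// Compact form via Intl.NumberFormat (ES2020): "550K"
var compact = new Intl.NumberFormat('en-US', { notation: 'compact' }).format(n);
```

(Exact output depends on the locale you pass in.)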

------
diminoten
Just a technical question, but are '90daysago' and 'yesterday' a common way
of representing time in the Node world, or are they just trying to show a
generic case?

Also I know there's some turmoil in the Node.js/IO.js fork, what kind of
practical implications does this have on projects like these?

~~~
tobz
There's at least one other language (PHP, [1]) that accepts these natural
language versions of times. It's not quite the same syntax, but I don't think
it's necessarily uncommon for some libraries to support these sorts of
queries, for lack of a better word. On the flip side, there are definitely a
lot of libraries that support displaying times this way (like moment.js, [2]).

[1]
[http://php.net/manual/en/function.strtotime.php](http://php.net/manual/en/function.strtotime.php)
[2] [http://momentjs.com/](http://momentjs.com/)

~~~
emmelaich
The GNU versions of _find_ and _date_ accept natural language for times. For
_find_ in the --newermt and similar predicates and for _date_ in the the
--date option.
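
A quick demo of both (GNU coreutils/findutils; the BSD versions that ship
with macOS behave differently):

```shell
# GNU date: natural-language specs via the --date (-d) option
date --date="yesterday" +%Y-%m-%d
date --date="2 weeks ago" +%Y-%m-%d

# GNU find: -newermt takes the same kind of timestamp, e.g.
# list files in the current directory modified in the last day:
find . -maxdepth 1 -type f -newermt "1 day ago"
```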

~~~
tobz
Whattttttttt? This is amazing.

------
gesman
>> visitor IP addresses are anonymized before they are ever stored in Google
Analytics.

That applies to data available to the GA user. Google, of course, has access
to more detailed data.

~~~
konklone
Google claims not to; they say they anonymize the IP before it ever hits disk:

[https://support.google.com/analytics/answer/2763052?hl=en](https://support.google.com/analytics/answer/2763052?hl=en)

You have to trust Google to follow through, but my understanding is this was
originally implemented to satisfy German privacy laws, so there's reason to
think they take it seriously.

How that impacts logging at their outermost network edge, and whether logs at
that level could be feasibly correlated with anything at the application
layer, I have no idea. It'd be nice if they clarified that.
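
For reference, that doc describes the IPv4 case as zeroing the last octet
before storage (and zeroing the last 80 bits for IPv6). A rough sketch of
what that means for IPv4, not Google's actual code:

```javascript
// Sketch of last-octet IPv4 anonymization as Google's doc describes it:
// the stored address then identifies a /24 network, not a single host.
function anonymizeIPv4(ip) {
  var octets = ip.split('.');
  if (octets.length !== 4) {
    throw new Error('not a dotted-quad IPv4 address: ' + ip);
  }
  octets[3] = '0';
  return octets.join('.');
}
```

So `203.0.113.195` would be stored as `203.0.113.0`.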

------
snarkyturtle
Also, they used Neat/Bourbon as their css framework!

------
gagabity
Can you really overload S3? Also, how long is the expiration time of
CloudFront?

~~~
chrisan
CloudFront's default is 24 hours unless you specify otherwise.

~~~
konklone
As of launch day, we started specifying a 5 minute cache time:

[https://github.com/GSA/analytics.usa.gov#deploying-the-app](https://github.com/GSA/analytics.usa.gov#deploying-the-app)

We might increase that over time, but having fixes take effect within 5
minutes while the site is under the spotlight is a nice thing.
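
For anyone replicating this on their own S3/CloudFront setup: setting a short
`Cache-Control` max-age on the uploaded objects is the usual mechanism. A
generic example with a made-up bucket name, not necessarily what 18F's deploy
script does:

```shell
# Upload the built site with a 5-minute cache lifetime; by default
# CloudFront respects the origin's Cache-Control header for its TTL.
aws s3 sync ./_site s3://example-bucket --cache-control "max-age=300"
```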

------
nbevans
TL;DR is "we just used AWS in a completely unexciting way".

~~~
snarkyturtle
Speed is sexy...

~~~
chrisan
And simplicity!

