How we built analytics.usa.gov (gsa.gov)
158 points by taylorbuley on Mar 20, 2015 | 44 comments

This is truly beautiful! I have also been experimenting with building HTML pages that load all content from a BaaS provider. I can host a single static HTML page on S3 and include a login view and content editor view in the same file to let authenticated users make changes to the content that loads into the page for public users.

I was so impressed by your dashboard earlier; reading that you built it using the same technique (an empty page, populated by JSON) makes me really happy!
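A minimal sketch of that "empty page, populated by JSON" pattern (the endpoint, element id, and data shape below are all invented for illustration, not the dashboard's actual code):

```javascript
// Turn a JSON report into markup; locale pinned so output is deterministic.
function renderVisits(rows) {
  return rows
    .map(r => `<li>${r.page}: ${r.visits.toLocaleString('en-US')} visits</li>`)
    .join('\n');
}

// In the browser, the otherwise-empty static page would hydrate itself
// roughly like this (fetch target is hypothetical):
// fetch('/data/top-pages.json')
//   .then(res => res.json())
//   .then(rows => {
//     document.querySelector('#top-pages').innerHTML =
//       `<ul>${renderVisits(rows)}</ul>`;
//   });
```

Because all the rendering happens client-side, the HTML itself can live on S3 or any dumb static host.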

Why would the US government send all their traffic stats to Google? Why not use Piwik? http://piwik.org/

Am I being too paranoid? This is, after all, third-party software served by the government.

I think it's a reasonable question and I'm a big fan of Piwik, but running it at this kind of scale is very hard. We get about 10k hits/day on the Piwik instance we run and it's consistently taking more resources than the application it's tracking.

Are there any other viable open source traffic analysis tools than Piwik? I'd hate to have to roll my own.

Snowplow (I'm a co-founder) can happily scale to billions of events per day: https://github.com/snowplow/snowplow

Judging from the repo: "Collectors receive Snowplow events from trackers. Currently we have three different event collectors, sinking events either to Amazon S3 or Amazon Kinesis" (etc.) -- is it still not viable to self-host Snowplow on your own hardware/internal cloud? Or is it possible, but you need to run a full cloud? (I understand why one would want a setup that runs on Amazon if one uses Amazon, but when you host your own infrastructure, a self-host option would be nice ... if viable.)

Without an option to self-host, Snowplow isn't really an alternative to Piwik.

Hey e12e! It's a great question. You are right - at the moment Snowplow is still tied to the AWS cloud; we use a variety of AWS services which support massively horizontal processing, including Elastic MapReduce, Kinesis and Redshift. We are working on a Kafka+Samza version of Snowplow which we will release later this year, most likely running on a Mesos cluster that you can deploy where you want.

We have to move away from US hosted services, so we have to wait for the Kafka+Samza version if we go that route. Thanks!

https://github.com/divolte/divolte-collector is quite nice and can handle extreme loads

That is interesting too for us, as Kafka is possibly in our future too. Thanks!

I'm building something based on Splunk: open source + the free Splunk license.


Especially interesting would be a Go or Node.js project that uses a caching layer (like Memcached) to scale better than writing directly to a SQL/NoSQL database.
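The cache-aside shape that idea implies can be sketched in a few lines. Here a `Map` with expiry timestamps stands in for Memcached, and `queryDb` for a real SQL/NoSQL read; both are placeholders, not any specific client library's API:

```javascript
// Cache-aside: consult the cache first, fall back to the database,
// then populate the cache with a TTL so hot keys stop hitting the DB.
const cache = new Map();

function queryDb(key) {
  // Stand-in for an expensive database read.
  return { key, hits: 42 };
}

function getStats(key, ttlMs = 60000) {
  const entry = cache.get(key);
  if (entry && entry.expires > Date.now()) {
    return entry.value;                       // cache hit: skip the DB
  }
  const value = queryDb(key);                 // cache miss: read through
  cache.set(key, { value, expires: Date.now() + ttlMs });
  return value;
}
```

With real Memcached the expiry would be handled server-side, but the read-through/populate flow is the same.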

Having worked with the US government, I'm guessing that they don't really know what 18F is doing, or the implications.

The implications are really scary. I'm the first one to applaud the new digital office initiative and the talent behind it, but when it comes to third-party software, the government should trust no one, no matter their competence.

Scary scenario #1: All of my interactions with the government are known to google.

Scary scenario #2: Google's CDN is compromised and malware is served to everyone!

> This data comes from a unified Google Analytics profile

I wish my tax dollars were not being spent to help a for-profit entity track me across the Web.

If you'd like to opt out, try this:


It says, "Available for Microsoft Internet Explorer 8-11, Google Chrome, Mozilla Firefox, Apple Safari and Opera".

Personally, I prefer this to having my tax dollars spent on reinventing web analytics. Especially given that it wouldn't be cheap; Google bought what is now Google Analytics in 2005 for circa $30 million [1] and has put a lot into improving it since. And now that I think about it, I'm not sure that tax dollars are, net, being spent here. GA is very easy to install, and good analytics typically help save money through better decisions and focusing effort on what's actually being used.

[1] http://searchenginewatch.com/sew/news/2062461/google-to-acqu...

Having to install software so that a government doesn't give your details for free to a for-profit company is actually awful.

$30M isn't much, considering the scale at which they run, and considering that it would be something freely available to the public as well. There's also the fact that open-source alternatives exist today.

Opt-out, really? We shouldn't have to opt-out of something that shouldn't be happening in the first place.

The government shouldn't be inserting a 3rd-party -- Google -- into my communications with them!

As for this:

> ... and good analytics typically help save money through better decisions and focusing effort on what's actually being used.

Pffft. That's only true for startups whose sole goal is maximizing traffic/retention/whatever while absolutely minimizing the amount of creativity they expend on moving those "key performance metrics".

>The government shouldn't be inserting a 3rd-party -- Google -- into my communications with them!

I think you'd be surprised to discover that the government has been using third parties for years to manage various services. Did you know the government also uses Oracle databases to store your information?

Similarly, we should stop caring about domestic surveillance by the gov because the gov has been doing it for years and we've only learned of it now. /s

Btw, using Google Analytics is a lot different from using an Oracle database - Oracle isn't privy to what you're storing.

>Oracle isn't privy to what you're storing.

Do you work at Oracle? Are you 100% sure this is true? Have you audited all the hosted 3rd party solutions for the Government, and interviewed every employee to ensure that they aren't privy to your data? Have you ensured that all code written for the US govt has no cryptographic errors?

The claim that the US Govt. shouldn't use Google Analytics because Google can read their data is hardly any different from arguing they shouldn't use managed Oracle solutions because some Oracle exec has access to their data. Where does it end? Should the US Govt. now build its own analytics infrastructure? Should the US Govt. write its own databases? Print its own silicon? And what about everyone else? If a US company isn't trusted by the US Govt to keep data secure, what about insurance companies? Or healthcare companies? Should these companies, too, be printing their own silicon? How is this a solution?

If the problem is privacy, the solution is to create processes that respect privacy. Sitting in the dark because electricity might burn your house down is not a solution.

And God forbid that the Govt coerces Google to share data on how people are using the government's own servers -- data the government is already making public.

They use Oracle Database because it provides the best solution.

You'd rather they pay {m,b}illions more making their own system?

I'd rather they spend nothing.

So now that they're spending nothing, they have no way to know if people are using their sites. They have no idea which sites are useful and which are not. Do they assume sites are not being used and have just a "Hi, we are the USA" homepage, or do they spend money on sites that are never visited?

Even the government needs analytics, friend.

I'm all for the government not spending any money on any sites.

I think using a pre-existing software service as a base instead of a massive handmade distributed system may not be a bad idea for the USG. Remember the healthcare.gov rollout? Although it is certainly a government contract to a private corporation, the cause is warranted since it helps the federal government better allocate resources and make funding decisions. Your comment does raise the question, though, of whether the Digital Analytics Program managing the account requested any safeguards for user privacy or are just using the standard free or premium service. There was an article earlier around the launch of healthcare.gov that raised the same privacy issues you are now relating to Google Analytics, but I can't remember who published it.

Dang. I wistfully hoped this would be drill-down, real-time analytics for the U.S. economy and government operations.

Testers often assumed any three-digit number was actually short for a six-digit number.

This is interesting; I haven't come across this before. What is the best way to display big numbers? 550K or 550,000? 5M or 5,000,000?
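For what it's worth, both display styles fall out of the standard `Intl.NumberFormat` API (ECMA-402); which reads better is the open question, but the compact notation at least makes the abbreviation explicit. Note the `notation: 'compact'` option arrived well after this thread, in ES2020:

```javascript
// Grouped vs. compact display of the same numbers.
// Locale pinned to en-US so the output is deterministic.
const grouped = new Intl.NumberFormat('en-US');
const compact = new Intl.NumberFormat('en-US', { notation: 'compact' });

grouped.format(550000);    // '550,000'
compact.format(550000);    // '550K'
compact.format(5000000);   // '5M'
```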

Just a technical question, but are '90daysago' and 'yesterday' a common way of representing time in the Node world, or are they just trying to show a generic case?

Also, I know there's some turmoil around the Node.js/io.js fork; what kind of practical implications does that have for projects like these?

There's at least one other language (PHP, [1]) that accepts these natural language versions of times. It's not quite the same syntax, but I don't think it's necessarily uncommon for some libraries to support these sorts of queries, for lack of a better word. On the flipside, there's definitely a lot of libraries that support displaying times in this way (like moment.js, [2]).

[1] http://php.net/manual/en/function.strtotime.php [2] http://momentjs.com/

The GNU versions of find and date accept natural language for times: find in the -newermt and similar predicates, and date in the --date option.

Whattttttttt? This is amazing.

That's actually the Google Analytics Reporting API's formal syntax:


It's pretty handy, since queries are so commonly exactly that -- "the last 30 days" or whatever -- and it saves the client from having to perform date calculations to figure out just what that is.
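Resolving those relative forms client- or server-side is a few lines. A sketch, assuming the tokens documented for the GA Core Reporting API ('today', 'yesterday', 'NdaysAgo'); the regex is case-insensitive since the comment above spells it '90daysago', and the function name is mine:

```javascript
// Resolve a GA-style relative date string into a concrete Date.
function parseRelativeDate(value, now = new Date()) {
  const d = new Date(now);
  d.setHours(0, 0, 0, 0);                  // GA dates are whole days
  if (value === 'today') return d;
  if (value === 'yesterday') {
    d.setDate(d.getDate() - 1);            // setDate rolls months/years
    return d;
  }
  const m = /^(\d+)daysago$/i.exec(value);
  if (m) {
    d.setDate(d.getDate() - Number(m[1]));
    return d;
  }
  throw new Error(`unrecognized relative date: ${value}`);
}
```

The API side then only needs the literal string; all the arithmetic stays in one place.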

>> visitor IP addresses are anonymized before they are ever stored in Google Analytics.

That applies to the data available to the GA user. Google, of course, has access to more detailed data.

Google claims not to, that they anonymize the IP before it ever hits disk:


You have to trust Google to follow through, but my understanding is this was originally implemented to satisfy German privacy laws, so there's reason to think they take it seriously.

How that impacts logging at their outermost network edge, and whether logs at that level could be feasibly correlated with anything at the application layer, I have no idea. It'd be nice if they clarified that.

Also, they used Neat/Bourbon as their CSS framework!

Can you really overload S3? Also, how long is CloudFront's expiration time?

I'm guessing they misspoke or misattributed their use of CloudFront to protecting S3. It's almost certainly used for predictable performance, since S3 is measurably worse in that respect.

CloudFront's default is 24 hours unless you specify otherwise.

As of launch day, we started specifying a 5-minute cache time:


We might increase that over time, but having fixes take effect within 5 minutes while the site is under the spotlight is a nice thing.
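One common way to get that behavior, sketched under the assumption that CloudFront is left to honor the origin's Cache-Control header (the bucket/key names are invented and the aws-sdk v2 `putObject` call is illustrative, not necessarily the team's actual setup):

```javascript
// Hypothetical sketch: cap the CloudFront edge TTL at five minutes by
// setting Cache-Control on the S3 origin object, which CloudFront
// honors when no distribution-level TTL overrides it.
const params = {
  Bucket: 'example-analytics-site',   // invented name
  Key: 'index.html',
  Body: '<!doctype html>...',
  ContentType: 'text/html',
  CacheControl: 'max-age=300',        // 300 seconds = 5 minutes
};
// const AWS = require('aws-sdk');
// new AWS.S3().putObject(params, callback);  // actual upload elided
```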

TL;DR is "we just used AWS in a completely unexciting way".

Speed is sexy...

And simplicity!
