We ditched Google Analytics (spideroak.com)
490 points by felipebueno on Dec 8, 2015 | 262 comments

    It took us only a few weeks to write our home-brew 
    analytics package. Nothing super fancy yet now we have 
    an internal dashboard that shows the entire company much
    of what we used analytics for anyway - and with some 
    nice integration with some of our other systems too.
I never quite grasp how the above isn't just a matter of intuition to anyone working in the tech sector. Google Analytics thrives on developers' laziness in my opinion.

And to echo other posters: SpiderOak deserve thanks. If I find myself with any need for a service like theirs, I know I'll be looking at them.

>>> I never quite grasp how the above isn't just a matter of intuition to anyone working in the tech sector. Google Analytics thrives on developers' laziness in my opinion.

Ah, the "not invented here" syndrome!

There are tons of things that you could do "in a couple of weeks" that more or less work. However, it doesn't mean you have to or even that it would be a good idea.

If all developers adopted the attitude that you have expressed, there would be thousands of sad, sad developers who need to maintain shitty in-house analytics systems because someone once said "I could do it in a week". There are tons of awful CMSes already because someone once said "I could do better than WordPress" / "I could create a better framework" / etc.

In a lot of cases, GA is just good enough. Sure, you might need to spend some time exploring its features (custom dimensions, etc.); there's more to GA than the number of pageviews on a given day. There are cases when GA is not enough. Fair enough. But that's definitely not the majority of cases.

Sure, it makes sense for SpiderOak given its target audience. However, there's no need to make such a generic statement about "anyone working in the tech sector".

The answer is open source. Where is the WordPress to Google's analytics product?

Piwik, Snowplow, etc. They do exist.

Then the question is: do you really want to maintain the infrastructure required to run the analytics smoothly? Especially if your company has tens of millions of pageviews a month and depends on real-time reporting (which takes extra infrastructure to support).

Are you familiar enough with the stack that you could have a high degree of confidence in fixing production issues, which are inevitable? Quite often, an honest answer here is 'no'. Then can you afford to lose a few hours/days/weeks (whatever it takes to fix the issue) of data? Again, the answer here is often 'no'.

Of course, you have hosted solutions. But they are no better than GA in terms of privacy.

Paid support exists too but the cost can skyrocket pretty quickly, on top of paying for the infrastructure and maintaining it.

Processing logs is a lot cheaper than the javascript download and other additional http requests needed for google analytics, not to mention the privacy costs. Cheaper for the website, the user, and the web in general.

Not to mention you get perfectly accurate analytics, with no loss due to request blockers or disabled javascript.

The code for this is generic. An open source solution costs nothing beyond some CPU to process the logs and a database to store the analytics.
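To make "some CPU to process the logs" concrete, here is a minimal Python sketch that counts pageviews from web server access logs. The Common Log Format layout and the GET/2xx filter are my assumptions, not anything from the thread:

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status bytes
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def pageviews(lines):
    """Count pageviews per path: successful (2xx) GET requests only."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group("method") == "GET" and m.group("status").startswith("2"):
            counts[m.group("path")] += 1
    return counts
```

Of course, a real log pipeline also needs bot filtering and sessionization, which is where most of the actual work hides.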

It's been a while since I've used GA, but being able to segment by age, gender, and interests(1) is something you can't do without paying a marketing aggregator hundreds of thousands of dollars a month, or using GA. You can do some geolocation classification and things like campaign effectiveness, bounce rate, etc., but since Google has so much aggregate data on hand, being able to classify user-x as "Male, 40s, interests similar to the demographic we sell to"(2) is invaluable whether you're selling seats of enterprise software, high-fashion luxury items, or cheapo stoner knick-knacks. You can't really do market segmentation with your own software.

(1) https://support.google.com/analytics/answer/3125360?hl=en&re... (2) https://support.google.com/analytics/answer/2819948

Sure. So now the question is, why would Google offer all of this for "free"? Is it really free? Who pays, and in what ways?

Obviously, they're using the same information that's helping you calibrate your campaigns to add to the hive mind, so they can data-mine further. You're sacrificing the anonymity of your end users in doing so. Obviously they're offering it so that they can refine their profile of you more accurately, to sell ads and direct more relevant traffic to you. I'm not an industrial engineer but I've been reading about it for the last few weeks. I turned off Adblock for a while and even with my opt-out plugins(a,b) I started getting ads for $4,500 Fluke multimeters. The combination of one's search history plus a fairly comprehensive history of the sites you visit(b) profiles you to a terrifying degree, but at the same time, the average business with only a few million dollars a year going towards both sales and marketing can't really approach Quantcast and ask for access to their API.

a: https://tools.google.com/dlpage/gaoptout b: https://chrome.google.com/webstore/detail/do-not-track/ckdcp... b: I don't have the study off-hand, but IIRC some guy, after finishing his master's at Stanford, wanted to assess how much information Google had re: an average user's browser history. The findings, based off Common Crawl data of the top 100k sites plus the presence of ga.js, yielded something like ~75% of the web being tracked (not to be confused with how much of an end-user's traffic is tracked; that number will be far higher), based on sites with a ga.js history and factoring in Referer tags. Those were unweighted numbers, i.e., I bet more than one out of two 45-year-old women's traffic can be analyzed to a 95% degree of completeness based entirely off of Pinterest, Facebook, search history, and the outbound links from her e-mail.

Interesting points. I think there are many ways to use Google Analytics that go beyond what many people want from "visitor data". Some of the kinds of questions GA can answer are only answerable if one is willing to collude in destroying (meaningful) privacy.

I've had "simple foss analytics" on my todo-list for quite some time. I'm hoping one can build on what Piwik has collected wrt bot agent strings, IPs, etc., and combine it with a simpler collector (adding PHP to the stack just for analytics isn't very appealing, never mind a PHP codebase of somewhat questionable quality).

Snowplow looks good, but I'm not sure if they have a supported "self-host" stack yet (they started out very awz/s3 centric).

I actually think there's room for a new product that puts a little bit more thought into what questions it makes sense to ask, and how best to answer them (e.g.: does collecting metrics on every visitor even make sense if you can answer the same questions just as well by doing random sampling? You might want to quantify where your bandwidth goes, but simple log analysis might do that easily enough, and it might have very little to do with your human visitors, etc.).

If you make decisions with money riding on the answers, it costs a lot more than CPU and DB.

Perhaps systems administration is somehow very cheap for you, but I'm willing to bet it is still not "nothing" - even if the cost is you personally not watching a TV show you like because you're patching the web server on your analytics box for your personal vanity domain, that's still a cost.

For most operations, sysadmins are somewhat expensive, and because of that, busy. This is why Urchin was such a good idea, and why Google bought them - the proposition is to trade your users' privacy for the admin time it takes to support another internal app. It's an absolute no-brainer, assuming you don't care about your users' privacy (IIRC, they were going to sell the service before Google ate them, but that's ancient and trivial history).

>because you're patching the web server on your analytics box

If your business is so small that an additional low-volume web server just to display your analytics (you don't need one for the actual tracking) is a big deal, then the same web server that serves your product can serve your analytics. Not a big deal.

I'm glad we both agree that analytics for a vanity domain is not a big deal. It also was a bounding example for my argument, not my argument.

I don't think it's a matter of laziness. It's more a question of where it's best to spend your expensive/valuable developer resources: on the product, or on some home-baked analytics framework?

I applaud SpiderOak, but they are much different from most other sites. They have privacy conscious customers to begin with, this is something that is good press for them and probably a net positive on their bottom line for doing it, not the case with most other sites. Also it's something they are doing after having a very mature product for many years, clearly not the first or most important thing they needed to tackle as a company.

Agreed - for some cases just pasting the GA snippet onto a site is sufficient. For others you should add events and such. For others you must roll your own.

If it's worth A/B testing your site, it's worth doing it with a tool that understands your costs and revenue structure.

GA is mostly used by people that don't need it, yet want to pretend they get actionable data out of it.

It's not laziness, it's opportunity cost. For SpiderOak, it makes sense to spend a few weeks of a few developers' time to roll their own analytics. For me, it doesn't. Our customers aren't privacy-focussed. In fact, our app depends on them explicitly sharing [quite a lot of very personal] data with us. I would rather spend that time building something that delivers value to them and us than indulging my personal beliefs about privacy.

Aren't there self-hosted analytics anyway? Piwik[1] comes to mind first, but I'm sure there are many.

1. https://piwik.org/

Piwik is incredible. But it should be noted that it does provide a scaling challenge for high traffic use cases (> hundred million actions per month), and hosting your own analytics is expensive.

I bring this up because people had been slamming moot for using GA on 4chan instead of piwik without understanding why.

We have much lower traffic than that and our Piwik servers, with paid support from the Piwik team, often struggle to generate reports etc. Not convinced Piwik is that easy to scale.

People have scaled it to over a billion actions per month. No clue how much of that includes customizations though... It sounds way past the out-of-the-box limit.

Look at the comments from sandfox and afterlastangel in this thread. afterlastangel is pushing a billion, sandfox is around 300 MM per month.


I'm looking into replacing GA Premium ever since Easylist blocked GA tracking for Adblocked users and self-hosted Piwik seems like the best solution. I'd be well into the billions.

With that kind of traffic hopefully you have the resources to pull it off. Good luck!

Do any of the Google Analytics alternatives scale to that size?

Free alternatives? Not really. Paid? Yes, SiteCatalyst and Webtrekk come to mind.

People seem to ignore that the tracking JavaScript is not what you're paying for. It's the backend + servers.

People have taken piwik to 300MM up to over 1 billion actions per month. But it certainly isn't "set it and forget it."


See https://news.ycombinator.com/item?id=10697045

Piwik is still using (unsalted) MD5 for passwords in 2015, and probably will still be using unsalted MD5 in 2016.

This is pretty bad. Piwik could be a high value target depending on the nature of the site it is used to analyze.

I can't believe unsalted MD5 is "by design" (https://github.com/piwik/piwik/issues/8753).


Considering Piwik is used by the GCHQ, I find it hilarious.

They're using an open source analytics software package to analyse the very data it was designed to analyse.

I don't find it using poorly implemented hashing in the administrative interface to be at all relevant to what they're doing, or why they shouldn't be using it.

Information on who visits WikiLeaks - and what they read and upload - is an incredibly high value target. I don't see how you can argue otherwise, when Britain's top intel agency has an expensive line item in their budget just to get at that info.

Given these known security flaws, it's not a stretch to assume anyone who can see the GCHQ's Piwik server can have that data too, regardless of whether they are authorized.

See below for a small preview of what an attacker could exfiltrate (dissident IPs redacted for a reason):


While we're talking about poor security practices: the privileged username in the screenshot is apparently still the default ("admin"), so I hope the password isn't still "changeMe" ... http://piwik.org/faq/how-to/faq_191/

Unsurprisingly, Wikipedia has a list: https://en.wikipedia.org/wiki/List_of_web_analytics_software

Wikipedia's love of lists is absolutely amazing: https://en.wikipedia.org/wiki/List_of_lists_of_lists

Strangely, Microsoft's is missing: Application Insights.

Pretty much works like Google Analytics but utilises both client JavaScript and embedded runtime code to generate a richer picture of what is going on.

Too bad the interface on the Azure Portal is terrible. They spent too much time making it look fancy, and not enough time getting the 101s of usability right (which is a criticism I'd lay at the feet of the new Azure portal in general).

Who makes these lists?!

Well you can see the list of users here:


Good question!

Probably the vendors of the software concerned. Perhaps it started out as a list of three with a major bias towards a particular product. And then the competitors responded, moderators did their thing, and eventually an accurate list evolved.

Do Adblock/uBlock etc. block this as well?

Am looking to use this in lieu of Google Analytics.

Self-hosted means that it will be served from your own servers, and thereby your own domain. So unless your domain is on a block list, it will be loaded.

EDIT: Sorry, I've been dealing with uBlock Matrix for too long, and forgot how advanced the other blockers pattern matching is. See the many responses to this for better information.

(my apologies for the tone - I have edited the post to try to keep it purely fact based)

From EasyPrivacy[1]

This doesn't include any renamed versions, nor does it include the numerous domain-specific variations.

[1] https://easylist-downloads.adblockplus.org/easyprivacy.txt

Slow down there guy, it was a simple mistake. I've been using uBlock Matrix for too long is all.

The EasyPrivacy block list contains an entry that will block the piwik.js file. Of course, when you're self-hosting, it's trivial to serve that file with a non-default name.

That's an interesting choice. I mean, it's not like you can hide from the web server that you are making the request. But then again, I'm assuming -- by the sheer necessity of having a JS file -- that they are collecting some additional metrics not available to the server in the request.

Those filters could be in place to block the Piwik cloud service: https://piwik.pro/cloud/?pk_source=Piwik.org&pk_medium=Cloud...

It will probably take a while, but trackers will move to aggregating log files, and blockers will move to Tor. And the arms race continues...

Piwik, for example, can already import log files as an alternative to JavaScript tracking: http://piwik.org/log-analytics/

No, it parses web server logs; however, as mentioned above, that doesn't work well for very high-traffic sites.

Piwik relies on client-side JavaScript for tracking, not log analysis.

They have both, most users use the client side javascript. I'm not familiar with how well the log analysis works.

Sorry, I was not aware of that feature.

And I never quite grasp why many people working in the tech sector are insistent on reinventing things that already exist. Such thinking thrives on developers' personal sense of exceptionalism in my opinion.

Yeah, a nontrivial app comprises so many parts that if you tried to reinvent a few of them yourself you'd never get anywhere. Also, try looking at the commit history and issue lists of seemingly trivial libraries. It's incredibly easy to underestimate how complex something that looks simple at first can be.

That starts going down the path of the "not invented here" mindset. You could then attribute not hand-rolling every bit of infrastructure yourself as "laziness". Yes, I am lazy to the point that I don't want to hand-roll an industrial-strength RDBMS myself, or the operating system, or the networking protocol, or the key/value store, etc etc.

If all you want to know is who accessed a site, with which browser, how long for, and which pages they looked at, then you could get all that from your webserver's log files without writing any code. On the other hand, to build something that's robust, relatively scalable, works across browsers and devices, and can give you an event-watching platform like GAnalytics gives you (i.e. the useful bit), that is far from trivial.

Most developers don't develop (major) libraries, languages, and OSes in house. It doesn't mean they are lazy; it means the company needs to focus limited resources on its core business.

>> Google Analytics thrives on developers' laziness in my opinion.

Every service does. Pingdom, GA, Olark, Github...

It took them a few weeks to write their own analytics. What features did they not implement? How many people worked on it?

Does your 1 or 2 person startup have 4 weeks to write their own analytics package or do you have more important stuff to do? (I'm betting you do. Like launching your product instead of re-inventing the wheel with analytics)

> Google Analytics thrives on developers' laziness in my opinion.

It's almost never "developers" who are deciding to use GA; it's middle managers or marketing departments.

Isn't GA's main draw its close integration with adwords and whatnot? The dashboard and UI seem pretty clearly aimed at someone who needs to manage their spending on google marketing services, not on someone who needs to count pageviews.

So it's not hard to imagine marketing wanting it; presumably it provides them a lot of value that wouldn't be easy to recreate in-house.

My experience with in-house recreations of off the shelf solutions is disappointing at best.

Well, it's not a thing to implement in a few days, but a few weeks.

That may be no option...

If you can reimplement GA in a few weeks, you need to do this over December, then enjoy your FU money.

GA is rather deep, with tons of integration and ways to slice and segment data.

Yeah, maybe in a few weeks you can get _something_ that'll make some manager not too unhappy. Seems like a terrible value proposition for almost all companies since, unfortunately, approximately no one cares (or they run adblock anyway).

I mean implementing an analytics tool that does what you need. If you do it just for yourself, you don't need all those fancy things, so it is often doable in a few weeks.

If it takes you more than a few days to put together a basic analytics platform and reporting system, you're a script kiddie.

Not hard to track page hits, time on, time off, and arbitrary events.


Seriously? Folks, it's a table for analytics events, a few SQL queries to do basic reporting (at least in Postgres), a little bit of client-side JS to post the events, and a bit of server-side code to create the routes and maybe display the report page.

I guess if it doesn't include Kafka, Mesos, Kubernetes, Neo4j, and Docker, it isn't delivering business value.
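To make the "table plus a few SQL queries" claim concrete, here is a toy version of that approach. SQLite is used purely so the sketch is self-contained; the Postgres setup the comment describes would look much the same, and all column and event names here are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        ts      TEXT NOT NULL,   -- ISO-8601 timestamp
        visitor TEXT NOT NULL,   -- anonymised visitor id
        name    TEXT NOT NULL,   -- 'pageview', 'signup', ...
        path    TEXT             -- page the event occurred on
    )
""")

def track(ts, visitor, name, path=None):
    """What the server-side route does with each event posted by the client JS."""
    conn.execute("INSERT INTO events VALUES (?, ?, ?, ?)", (ts, visitor, name, path))

def report():
    """Basic reporting: pageviews and unique visitors per path, most viewed first."""
    return conn.execute("""
        SELECT path, COUNT(*) AS views, COUNT(DISTINCT visitor) AS uniques
        FROM events
        WHERE name = 'pageview'
        GROUP BY path
        ORDER BY views DESC
    """).fetchall()
```

The drill-down reports marketers actually ask for (funnels, cohorts, attribution) are where this stops being a toy.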


It is quite costly to write to the database for each hit; I guess most of the downvotes are because of this. If you limit writes by keeping them in some memory cache, it's doable for slightly higher loads.

If you're Google, sure. Most of the startups here aren't Google.

People are prematurely optimizing if their fear is "but but but mah datamoose".

Also, it's not prohibitively costly if you do even slight batching of the events, say batch-loading every five minutes to every hour.

I'd love to hear somebody with war stories chime in though!
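The batching idea mentioned above is only a few lines: buffer events in memory and flush when the buffer is full or stale. A minimal sketch, with thresholds invented for illustration:

```python
import time

class EventBuffer:
    """Collect events in memory; write to the DB in batches, not per hit."""

    def __init__(self, flush_fn, max_events=500, max_age=300.0):
        self.flush_fn = flush_fn        # e.g. one bulk INSERT per batch
        self.max_events = max_events
        self.max_age = max_age          # seconds between forced flushes
        self.buf = []
        self.last_flush = time.monotonic()

    def add(self, event):
        self.buf.append(event)
        if (len(self.buf) >= self.max_events
                or time.monotonic() - self.last_flush >= self.max_age):
            self.flush()

    def flush(self):
        if self.buf:
            self.flush_fn(self.buf)
            self.buf = []
        self.last_flush = time.monotonic()
```

The trade-off is that a crash loses at most one buffer's worth of events, which for analytics is usually acceptable.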

You criticize people for _premature optimization_ while in the same breath advocating rolling your own, shitting implementation for page views? Right...

Incorrect--remember, there is no way of guaranteeing your users' privacy if you outsource analytics for your page.

You're missing the point of why "rolling your own shitting [sic] implementation" is worth it: it's not the speed, it's the privacy.

> thrives on developers' laziness in my opinion.

Frankly, most of what I read out of the tech world these days seems to be about pandering to developer laziness.

All manner of APIs and services seem to exist in their current form simply to extract rent from developers that don't want to do back end "dirty work".

Being paid for doing work by those who need the work done is the opposite of "extracting rent".

Being paid for doing work has nothing to do with extracting rent, which is the practice of inserting yourself as a middleman so other people have to pay you "rent"[1] where none should be required.

The entire idea behind writing a Service as a Software Substitute[2] is about extracting rent.

[1] https://en.wikipedia.org/wiki/Rent-seeking

[2] http://www.gnu.org/philosophy/who-does-that-server-really-se...

I understand Stallman's dislike of SaaSS in [2], but I fail to see how it meets any definition of rent-seeking. People who provide SaaSS are using economies of scale to offer services that are desirable to some, because they're offered at a cost that is less than the cost of developing and maintaining their own private solution. There is certainly a loss of freedom in using these services, as Stallman points out. But rent-seeking, not so much. Users of SaaSS need to decide whether the cost savings of using SaaSS is outweighed by the freedom they give up. Nothing more, so far as I can tell.

Perhaps you should have read that wikipedia page before so helpfully linking to it. There's nothing about "middlemen" there. "Rent" is political economy jargon; it's not just a synonym for "distasteful practices". Adam Smith wasn't complaining about shopkeepers or shipping companies, and he certainly wasn't talking about "back end" software services. There is no royal decree enforcing how such services shall be provided. If you don't like AWS then use GCP.

I feel like someone needs to rewrite Stallman's missives to eliminate the term redefinition and the connotation management. His usage of these rhetorical techniques is far too ham-handed to be persuasive to those who aren't already convinced, even when his message is important.

Laziness is a fine quality to have in a developer: http://threevirtues.com/

I would add wisdom to that list: the wisdom to know which modifications will allow you to be lazy in the future, and to produce the best results before the user realizes they needed them. I think wisdom is a very important one.

> I never quite grasp how the above isn't just a matter of intuition to anyone working in the tech sector. Google Analytics thrives on developers' laziness in my opinion.

Unless I'm mistaken, one big difference is that not using Google Analytics means you don't know which Google search pages people used to access your website. That can be a really important difference for some websites.

Can't you find out from the Referer header anymore? It's been years since I tried, so it may have changed.

Only if your site is HTTPS

I think you can get that info with webmaster tools without using analytics.

Having implemented two different custom analytics dashboards, it's a lot more complex than you think.

Sure, the basics are easy. But marketers and business people want to drill into a lot of data which is non-trivial to gather.

Unless you have a compelling business case (which SpiderOak does), it's not worth it.

A lot of people are replying to the suggestion of implementing your own analytics by calling out its NIH-ness.

I've recently been faced with this problem, and a solution doesn't have to be too complex.

There are roughly two parts to an analytics solution: event logging and, well, the actual analytics.

Writing your own logger in JavaScript is super simple; you're just sending off JSON objects to be inserted into an Elasticsearch cluster. Since you have to define that logging anyhow, the only extra work you need to do is the layer that makes the actual AJAX requests.

What's left is defining and running your queries in Elasticsearch.

BAM! Analytics

I realize it's not fit for every situation, but you can do some pretty complex things this way without a huge amount of effort ...
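The "just sending off JSON objects" layer really is small. Here is a sketch of the event-assembly part, in Python rather than browser JavaScript so it stays self-contained; the endpoint path is invented and the transport is injected, so nothing here is Elasticsearch-specific:

```python
import json
import time

def build_event(name, session_id, **props):
    """Assemble the JSON document the logger ships off."""
    return {
        "event": name,
        "session": session_id,
        "ts": int(time.time() * 1000),  # epoch millis; indexes cleanly as a date
        "props": props,
    }

def send(event, post):
    """post() is whatever transport applies: an AJAX call in the browser,
    or an HTTP client talking to the Elasticsearch index server-side."""
    post("/analytics/events", json.dumps(event))
```

The analytics half is then just queries and aggregations run against wherever these documents land.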

I get what you are trying to say and I was one of the NIH-sayers, it totally makes sense in some cases and looks like it made sense in your case. Great! :)

I don't think anyone was saying that GA is always better, it's just more often than not it is. It takes some skill and quite a bit of experience to draw the line at a reasonable place and correctly recognize the trade offs.

I've replaced Google Analytics in all my projects with my CouchDB-only web analytics service, Microanalytics[1], which I could access from a CLI[2] and worked very well.

But then I started to run short of disk space from storing too many events. This is a problem.

[1]: https://github.com/fiatjaf/microanalytics

[2]: https://github.com/fiatjaf/microanalytics-cli

    much of what we used analytics for anyway
Until your requirements grow and you're stuck building something that was in GA 5 years ago.

Don't the ad blockers disable Google Analytics by default? If I am not wrong, I think uBlock Origin does.

So, I think, as more and more people will start using ad blockers, site owners will start getting less and less accurate stats from Google Analytics, forcing them to implement their own solutions. Hopefully, open source solutions will start providing the best features that Google does.

Anything that is widely used (open source or not) will be blocked because of common names or other patterns that can be recognized and blocked. If you need exact statistics you need to roll your own sooner or later. Or at least heavily customize some other product.

And GA is inscrutable. I don't use it very much because it's got way too many layers of abstraction. It was fine before as Urchin. Maybe this is a category like email clients — there should be a sustainable paid product that doesn't suck.

There's also http://get.gaug.es/, which seems great.

I have gone through their trial, but now I think I will register for the Solo account ($6/mo).

Maybe http://haveamint.com/ is what you're looking for? (I'm not affiliated--just a former user.)

Everything developers don't do is a matter of laziness if you ignore the fact that they might have other priorities.

Looking for a good npm / express middleware module that does this. Combines privacy concerns + developer laziness!

Not strictly on topic so I apologise if this is unwanted but I thought I'd share my experience with SpiderOak in case anyone here was thinking of purchasing one of their plans.

In February SpiderOak dropped its pricing to $12/month for 1TB of data. Having several hundred gigabytes of photos to backup I took advantage and bought a year long subscription ($129). I had access to a symmetric gigabit fibre connection so I connected, set up the SpiderOak client and started uploading.

However I noticed something odd. According to my Mac's activity monitor, SpiderOak was only uploading in short bursts [0] of ~2MB/s. I did some test uploads to other services (Google Drive, Amazon) to verify that things were fine with my connection (they were) and then contacted support (Feb 10).

What followed was nearly __6 months__ of "support", first claiming that it might be a server side issue and moving me "to a new host" (Feb 17) then when that didn't resolve my issue, they ignored me for a couple of months then handed me over to an engineer (Apr 28) who told me:

"we may have your uploads running at the maximum speed we can offer you at the moment. Additional changes to storage network configuration will not improve the situation much. There is an overhead limitation when the client encrypts, deduplicates, and compresses the files you are uploading"

At this point I ran a basic test (cat /dev/urandom | gzip -c | openssl enc -aes-256-cbc -pass pass:spideroak | pv | shasum -a 256 > /dev/zero) that showed my laptop was easily capable of hashing and encrypting the data much faster than SpiderOak was handling it (Apr 30) after which I was simply ignored for a full month until I opened another ticket asking for a refund (Jul 9).

I really love the idea of secure, private storage but SpiderOak's client is barely functional and their customer support is rather bad.

[0]: http://i.imgur.com/XEvhIop.png

Many of these types of services seem to intentionally cap upload speeds to reduce their potential storage liability (since they're likely over-selling storage to be able to offer 1 TB for $12 with the level of redundancy, staffing costs, etc, needed).

I wonder if that is happening in this specific case? Although if it were the case the vendor should still be honest about it. Just saying they limit uploads to 2 Mbps is better than giving the run-around.

> reduce their potential storage liability

It's to reduce the maximum bandwidth capacity required. I don't see it as a problem, considering their price points. They're selling you storage, not "slam 1TB of your data into our storage system in a day". If you're looking for that, ship a hard drive to Iron Mountain.

EDIT: Even AWS limits how fast you can upload to S3, and built an appliance for you to rent and ship back and forth if you need to move data faster. That station wagon full of tape is still alive and well.

> Even AWS limits how fast you can upload to S3...

I'm on gigabit fiber and use S3 to back up hundreds of gigs per month. I've never seen them limit upload speeds; it clearly saturates the connection for the entire duration of my upload. I would expect that because I am paying for the storage, they would be happy to let me write data to their machines as fast as I like. Is there a citation you can provide from their docs that supports your statement? Genuinely curious, because my experience has been different.

To the point that some of these sync or backup providers limit bandwidth, I have definitely experienced that. Tested SpiderOak and Dropbox and upload speed was horrid. Dropbox in particular was disappointing because they can't even claim to have the extra encryption overhead SpiderOak does, it was just shit speed every day. I'm paying a premium for gigabit fiber to the home and you really can tell who over-promises and under-delivers quickly. Fortunately my 'roll your own' backup + sync works well and is price competitive so I'll stick with that.

> I would expect that because I am paying for the storage, they would be happy to let me write data to their machines as fast as I like.

I don't understand why you'd think this. You're paying for storage, not an SLA as to how fast you can fill it.

> I'm paying a premium for gigabit fiber to the home and you really can tell who over-promises and under-delivers quickly. Fortunately my 'roll your own' backup + sync works well and is price competitive so I'll stick with that.

This is the preferred solution if a) commercial services are too slow for you and b) you're willing to spend the time to implement and manage it. It appears, based on commercial services out there, that there is no competition based on upload speeds.

He thinks this because it's in Amazon's interest to let him dump as much data as possible. It's not a matter of an agreement, it's a matter of aligned incentives.

Thanks, I had not thought of that.

> Its to reduce their maximum bandwidth capacity required.

They should be looking to partner with someone who has bandwidth problems in the other direction. By combining a backup service's upload bandwidth and a streaming video service's download bandwidth into one AS, you can get a more balanced stream, and qualify for free peering.

Yeah, agreed. The problem is, you're limited to partners in the same DC as you (unless you're going to bite the bullet and start using fiber loops between datacenters to accomplish this). Backblaze (for example only) is only in one DC in Northern California if I recall, which limits them to whoever is in that datacenter.

A great model would be to partner with CDNs; they pour content out to eyeball networks, but you could run a distributed network of your storage system across all of their POPs.

If I have to buy a 1TB plan to hold 999GB of data and it takes months to push that data up...

That ZOMG WHAT A DEAL! of a plan is kinda worthless...

"Slam" is a bit of a loaded word, since... if they are selling 1TB of storage, shouldn't we get 1TB of storage?

That's the same crap that ISPs tried to pull with UNLIMITED INTERNET!!! (as long as you stay under 30 GB per month)

> if they are selling 1TB of storage, shouldn't we get 1TB of storage?

You do, they're just not allowing you to store it in 24 hours. Some services (Backblaze, if I recall) allow you to ship a drive to get around this limitation.

Notice that all services do this? If you can do better, build one! Prepare to go broke from the peak bandwidth requirements you'll need to build your networking architecture to support such transfer rates, but I always encourage experimentation and learning lessons over complaints.

Sounds like an upload speed cap should be stated somewhere, at least in a FAQ, then?

I agree, it should be disclosed upfront.

The appliance is so that you don't need to send terabytes of data over a 10 Gbit/sec connection for example to their datacenter.

The limitation is actually the pipe that connects you to Amazon, not an inherent limitation within S3 or other services within Amazon on connection speed. If you have a good enough connection, or peering with Amazon things go amazingly fast.

When I worked at an ISP, we slammed about 20 Gbit/sec into S3 without issues, but even then data we were backing up -- about 300 TB of data a day -- at that rate took 1.4 days to upload to the cloud, so we ended up backing it up in-house instead. (we needed to store the data for 7 days, after that it went bye bye).
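The arithmetic above is self-consistent (decimal units assumed):

```python
# Sanity check on the quoted figures: 20 Gbit/s sustained vs. 300 TB/day.
rate_bytes_per_sec = 20e9 / 8                       # 2.5 GB/s
tb_per_day = rate_bytes_per_sec * 24 * 3600 / 1e12  # 216 TB/day
days_for_300_tb = 300 / tb_per_day                  # ≈ 1.39 days
```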

> When I worked at an ISP, we slammed about 20 Gbit/sec into S3 without issues, but even then data we were backing up -- about 300 TB of data a day -- at that rate took 1.4 days to upload to the cloud, so we ended up backing it up in-house instead. (we needed to store the data for 7 days, after that it went bye bye).

Seems like the perfect use case for S3; inbound transfer is free, and you're only paying for a rolling 7-day window of storage with lifecycle rules :/

Can I ask, why would an ISP upload 300TB data/day? Are you wiretapping all users' packets?

Definitely looks like it to me. Took me a good month to back up my (video) files with CrashPlan, as it was using some 10% of my upload.

I think it would be a good selling point for a service like this to allow higher upload speeds.

A good upsell, yes. But initial seeding to "affordably priced" online services at full data rate can never be economically viable to the provider. Bandwidth is cheap(er) these days, but routers which can handle big bandwidth are still big bucks.

Hold on, this is hacker news. VCs, this is a great idea!

No, no of course it's not. Initial seeding is a competitive moat for the first mover. Moving a few hundred gigs to a new backup company just to save a few bucks? I don't think I could be bothered, because I KNOW how long it will take.

Pricing is falling rapidly for storage. Consider that S3 - IA is $15/mo for a TB, and backblaze B2 can offer 1 TB for $5/mo. I would assume both are making some profit at those price points, so $12/TB/mo should be workable if the service is doing their own hardware.

Backup services especially have low operational requirements for their hardware and network connection, since once the files are uploaded they only need to be verified periodically.

> Many of these types of services seem to intentionally cap upload speeds to reduce their potential storage liability (since they're likely over-selling storage to be able to offer 1 TB for $12 with the level of redundancy, staffing costs, etc, needed).

SpiderOak is definitely overselling the 1TB plan, as well as another one that pops up once in a while called the "unlimited" plan for $149 a year. This is clear from the disproportionate pricing structure - $79 a year for 30GB that jumps to $129 a year for 1TB and then to $279 a year for 5TB - which entices users to go for the higher amounts because they appear to be great deals. What people with residential broadband connections may not realize is that a) uploading even 1TB of data will take a long time and b) SpiderOak cannot, and does not, provide any minimum guarantees on upload or download speeds (assuming everything else between SpiderOak and the user looks fine).
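Point (a) is easy to make concrete. Assuming, for illustration, a 10 Mbit/s residential uplink (a hypothetical figure):

```python
# How long does seeding 1 TB take at an assumed 10 Mbit/s uplink?
payload_bits = 1e12 * 8             # 1 TB expressed in bits
uplink_bits_per_sec = 10e6          # hypothetical 10 Mbit/s upstream

seconds = payload_bits / uplink_bits_per_sec  # 800,000 s
days = seconds / 86400                        # ≈ 9.3 days of nonstop upload
```

And that assumes the service never throttles and the link is otherwise idle; real-world seeding takes longer.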

The thing that is silly about that is related to the cost of acquiring and retaining customers. If a company can take in more data faster, it is more valuable to the customer and more likely to be used and retained. If organizations offer storage as a solution while trying to minimize the costs of that solution by minimizing utilization of that storage, they are exchanging fixed costs associated with storage (which should be easily built into pricing) for large variable costs related to customer acquisition, retention, and branding.

Yup, I've noticed the same with Wuala. The uploads were pretty slow. I've heard similar complaints from people using OneDrive. I would be very willing to switch to a smaller competitor even if it meant paying more than I do at Dropbox. But from my experience Dropbox is the only provider capable of synchronizing large amounts of data 24/7.

The Backblaze client uploads at several MB/s, very close to my connection's upload speed limit.

It's definitely possible to offer that on a monthly basis if you model that each customer stays for 36-39 months. Also, I doubt that they are using replicated storage, but are using erasure coding instead. Also, they dedupe before upload, so more cost savings there.

Spideroak doesn't and cannot dedupe, since everything you upload is encrypted by a key held only by you.

Spideroak can and does dedupe client side before uploading. It can't dedupe across multiple clients, but it does dedupe within the client. It also tracks syncs so that data synced between multiple client machines only has to be stored once (with appropriate redundancy).
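Client-side dedupe of this sort typically reduces to hashing chunks locally and uploading only hashes the client hasn't seen before. A toy sketch with fixed-size chunks (illustrative only, not SpiderOak's actual scheme, which also handles versioning and cross-device sync):

```python
import hashlib

def dedupe_chunks(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep only chunks whose
    SHA-256 digest hasn't been seen yet (toy client-side dedupe)."""
    seen = set()
    unique = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((digest, chunk))
    return unique
```

For example, 8 KB of identical bytes followed by 4 KB of different bytes yields only two unique chunks to upload instead of three.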

That explains why the data are visualized the way they are in the view menu :)

That doesn't sound good. On the other hand, I use SpiderOak with not a lot of cloud storage use, with clients on OS X, Linux, and until this morning Windows 10. The only problem I ever had was more or less my fault - trying to register a new laptop with a previously named setup.

BTW, why store photos and videos on encrypted storage? For that I use Office 365's OneDrive: everyone in my family gets a terabyte for $99/year and I really like the web versions of Office 365 because when I am on Linux and someone sends me an Excel or Word file, no problem, and I don't use up local disk space (with SSD drives, something to consider).

I prefer to store photos and videos on encrypted storage because I want to control who sees them. Storing them on unencrypted storage means I don't have that control, the storage provider does and is kind enough to let me make suggestions.

As for OneDrive, I tried it for a while but it didn't work out. Their clients and web interface were terrible and their API was severely lacking. I expect more functionality when I'm sacrificing my privacy.

I ended up going with Google Drive in the end, as you can get 1TB for $9/month with an Apps for Work Unlimited account (I actually seem to have Unlimited under that plan, which isn't supposed to happen until 4 users). That of course means sacrificing encryption but I trust Google enough to make the privacy tradeoff in exchange for extra features (OCR, Google Photos etc.).

I also buy extra storage from Google but I have had some problems downloading large backup files (50 GB, or so) that I have stored on Google Drive, so no system is perfect.

A little off topic, but Google really seems to be upping their consumer game lately with Google Music, Youtube Red, Google Movies + TV, etc. I am now less a user of other services like GMail and Search, but Google gets those monthly consumer app payments from me. I have the same kind of praise for Microsoft with Office 365.

This has been my experience as well, not to mention how much the client slowed down my machine. It's been really slow going, but the client is getting better. I never tried doing the encryption on my side, though; they also do diffs on each file you upload, so I imagine that has something to do with the lag. I still use SpiderOak; they're the only company I'm aware of that encrypts locally, and they have also done a lot to advance personal security for all of us. So I've gotten used to the slow speeds and buggy software; it keeps getting better, so that's a big plus :)

There are other backup applications that encrypt locally before sending to server. Two examples are https://www.tarsnap.com/ and https://www.haystacksoftware.com/

I was going to post a comment about how cloud storage is more of a means to move data around rather than back it up, until I dug a little deeper and saw that SpiderOak actually pitches itself primarily as a backup provider. I agree, it needs to be much faster than that.

Is it possible that they are working on batches, and not doing any hashing/compression in parallel with the uploading? It seems feasible from your screenshot that they are getting ~10GB of data at a time, compressing(?) and hashing, and then uploading, and then starting on the next ~10GB.

The only issue I have, which is similar to what I see with some other providers, is that the first non-free plan is a huge jump in storage space and price. If I want a Dropbox replacement, I'd be looking at a 25GB or 50GB plan (just comparing what I have with all kinds of free storage bonuses accumulated over years). Having some more "in-between" plans that are more linear in storage and price would've been an incentive to try this out since I'm not willing to fork $49 a year for 500GB while knowing that my Dropbox usage is less than one-tenth of that.

Love the pricing and features but Win+Mac only and no API largely kills it for me as I need Linux access at the very least.

This comment is ridiculous, and so is the fact that it's at the top. This is supposed to be about Google Analytics, come on.

It is off-topic, yes. For me personally it was very valuable however since I’m in the market for a backup application, and I will definitely take Veratyr’s comment into consideration when choosing between the available offerings.

Well, the post is from SpiderOak, so it's understandable.

But it's an ad hominem argument, that's for sure.

Could the issue be caused by bad peering between your ISPs?

If that was the case I'd expect the upload to be consistent but slow. Since it was intermittent, I believe it's an app issue.

>> my laptop was easily capable of hashing and encrypting the data much faster than the network was capable of handling it

You are assuming that you are the only one using that uplink and that server

Updated my comment:

> easily capable of hashing and encrypting the data much faster than SpiderOak was handling it

I can believe that there was upstream congestion somewhere outside my network (speeds to Google, Amazon indicated that there were no issues inside) or that their server was overloaded but the engineer who investigated seemed to attribute it to the client:

> Additional changes to storage network configuration will not improve the situation much. There is an overhead limitation when the client encrypts, deduplicates, and compresses the files you are uploading.

Why not move to push GA data server-side?

Trivial to set up, immune to adblockers affecting the completeness of data, prevents the writing of tracking cookies, and leaves the data and utility of the GA dashboard mostly complete (loss of user client capabilities and some session-based metrics).

This is the route I'm preferring to take (being applied this Christmas via https://pypi.python.org/pypi/pyga ).

One may argue that Google will still be aware of page views, but the argument presented in the article is constructed around the use of the tracking cookie and that would no longer apply.

I'm shifting to server-push to restore completeness; I'm presently estimating that client-side GA captures barely 25% of my page views (according to a quick analysis of server logs for a 24hr period). I'm looking for insight into how my site is used rather than the capabilities of the client, so this works for what I want.
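For the curious, a minimal sketch of what such a server-side pageview push can look like. This uses GA's documented Measurement Protocol rather than pyga's internals; the tracking ID is a placeholder, and the hit is only built here, not sent:

```python
from urllib.parse import urlencode

def build_pageview_hit(tracking_id: str, page_path: str) -> str:
    """Build a GA Measurement Protocol pageview payload.

    Only the page path is sent. The client id ('cid', required by the
    protocol) is a fixed placeholder rather than a per-visitor value,
    so GA cannot stitch hits into sessions -- matching the
    privacy-preserving approach described above.
    """
    params = {
        'v': '1',            # protocol version
        'tid': tracking_id,  # property ID, e.g. 'UA-XXXXX-Y' (placeholder)
        'cid': '555',        # anonymous, constant client id
        't': 'pageview',
        'dp': page_path,
    }
    return urlencode(params)

# The payload would be POSTed server-side to
# https://www.google-analytics.com/collect
payload = build_pageview_hit('UA-XXXXX-Y', '/foo/bar')
```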

I agree. Server-side analytics were actually fairly mature before Google came along. It's just more complicated in some cases, but manageable. The biggest downside these days would be SPAs, since they are not necessarily touching the server in any regular way.

People don't care about the cookie or any of the details of the implementation. They care about being tracked across the whole internet. If you are still contributing to that then you are disrespecting your customers. I hope that I am not one of them.

Except that basically nobody cares about "being tracked across the whole internet", as shown by GA, Facebook, etc being on virtually every popular website and nobody noticing or caring at all. If you care enough to make even the most trivial change in behavior, then you're optimistically 1 in 1000.

I said that I care and that I hope that I am not a customer of businesses that track me and contribute to Google's tracking. In that case I am 1 in 1000, and if a site doesn't work without GA and I don't have to use it (as in, have to in order to file my taxes), then I won't; I will purchase from a competitor.

EDIT: Most people do notice and do care; this has come up in countless conversations. They just accept it as a necessary evil that they can't do anything about, and accept (wrongly) that they as individuals can't change the world.

Did you read my comment?

You will have no GA cookie from any of my sites, I am not recording client identifying things or capabilities. It is a server-side push of GA and avoids all client-side interactions.

It is merely, "A page has been viewed, this one: /foo/bar?bash".

There's nothing in there that is tracking you. I'm not even embracing the session management aspect.

I get to use the tool that is best-in-class, in a way that lacks capability to track you.

Without any "client identifying things", how would GA be able to chain several page hits into a session, then? That is, do a basic visits vs. hits split.

If you are in fact anonymizing everything about a client as you claim you do, then it won't be able to. Unless, of course, you are feeding GA some opaque client ID that you then internally map to and from actual clients that hit your server. However something tells me that you aren't doing that, or you would've mentioned it already.

(edit) I re-read your comment. You aren't apparently interested in session counts. But what good is the GA summary then if you can't tell 10 bounced visitors from one visitor with 10 hits? This makes no sense. If you want to look at just page hit numbers, there are dramatically simpler ways to do that.

In the test I've done, sending no session/user data over, I lose all sense of a "session".

But I do retain insight into what content has been viewed, how much, what is rising and falling, etc.

The question really is: what info are you reporting on? Ad blockers make us blind and tracking is horrible, but I get a far more complete view of the simple stuff Urchin used to be great at.

Why not use your server logs for this information?

Ah, so you are passing some client IDs over the GA after all. An IP address perhaps? You know that's a leading question, right?

Incidentally, I ran a similar experiment with gaug.es a few years ago - pushed to their tracking API from our server side. While it worked as expected, these sorts of shenanigans are good for only one thing - hiding the fact that you are using 3rd party analytics from your visitors.

On a more general note - the thing is that you either care about other people's privacy or you don't. It's not a grayscale, it's binary. And if you do, there's no place for GA in the picture.


I am not passing IP. I am not passing a client-id. I am not passing any kind of correlation identifier from which a session can be inferred or created. I am not passing user-agent information. I am not passing a cookie ID.

I am only passing a page view event. "Page /foo/bar?bash has been viewed".

Take a look here: https://code.google.com/p/serversidegoogleanalytics/

Tell me where in that example (mine is similar) you see any client identifying information.

There is none. If GA deduces anything, it will be a property of my origin server and not a client.

I do not agree that using GA in the way I have described allows Google to invade privacy at all. Please explain clearly how it does in your opinion.

But isn't this the same kind of data you could extract from Apache logs? What you describe is basically a log of all your requests.

GA has many uses: the main one is to follow users through the funnel they take, and the second is to monitor marketing campaigns. If you don't need this, then Apache logs + Webalizer are perfect for everyone.

I persist with GA, because every now and then I work with partners who would like to verify the activity on my websites (and yes my user agreements and privacy policy allow this) and have a means to compare this with historical data or data from other sites.

Those partners frustrate me, in that they won't trust me to provide stats generated from server logs, but they all trust GA by default.

This technique allows me to use GA, produce the view of the content they need, export the PDF, and share that... and they trust it.

GA is the de facto store of trusted data when it comes to web site activity. For my sites that is tracking content page views.

I don't understand why you bother with GA then.

That's OK. It wasn't a requirement of my system.

Spectacular joke.

This whole conversation started with you saying: why abandon GA when you can use it without compromising clients' privacy? The exchange that followed shows that one can't actually derive the same function from GA that way, or indeed virtually any function at all. Yes, you can feed data in, but the usefulness of what you can get back out is next to zero. What am I missing?

From your opening comment:

> Why not move to push GA data server-side?

Because it renders GA largely useless if clients' privacy is actually observed.

> I am only passing a page view event. "Page /foo/bar?bash has been viewed".

I would like to say, as someone extremely hostile to tracking of any kind, that if this is all you're sending to Google, that sounds perfectly fine from a privacy perspective. (Google gets your information, but that's between you and Google.)

Thank you for choosing a method that respects the privacy of your readers.

> (edit) I re-read your comment. You aren't apparently interested in session counts. But what good is the GA summary then if you can't tell 10 bounced visitors from one visitor with 10 hits? This makes no sense. If you want to look at just page hit numbers, there are dramatically simpler ways to do that.

I do not care to track users/sessions, page views are enough for me. I am tracking content and content views... and I get this big tool that is awesome at slicing data and presenting trend information... for free.

The only issue I can see with this is a lot of HTTPS connections with your analytics platform from your web service. If you choose to use a work queue/proxy to do it, it's additional work/point of failure, etc.. It's not as 'simple' as adding a JS at the bottom of your page.

What info does Google get on your customers in exchange for your free use of their service?

You've emphasised that word as if it changes the question somehow, but I don't see how it changes the answer.

Because by using your website I become your customer; I am doing business with you. I don't always want a third party involved.

You never answered: what info do you send to the owner of the tracking library that you license? Or if you send them no info, how do they get paid?

How about open-sourcing your product before worrying about improving other products? SpiderOak has been "investigating a number of licensing options, and do expect to make the SpiderOak client code open source in the not-distant future" for a very, very long time now. It's no trivial thing to have a closed source client for a "zero knowledge" service.


EDIT: I'd welcome discussion, in addition to your up/down votes

I came here for this exact thing. They said they were going to go open source in 2014 IIRC, and failed to deliver. I have stopped using SpiderOak - how am I supposed to trust them with my most private files when I can't verify that they're not doing anything shady on my machine?

The opening line of this post is amusing. They ought to give thought to fixing their core product first.

I am also concerned with that. That message has been there unchanged for some time now. To be fair, there's a lot of stuff on the Github page, including the Android client under Apache license. Although as far as I can tell, desktop client is not there yet.

The other thing is that google analytics is on many adblockers lists, precisely for that reason. As adblockers are getting widespread, the analytics is going blind.

I've been running a blocker to block GA and other junk on my PC, but I imagine I'm in a statistically insignificant minority. And I still can't block them on my iPhone unless I disable JavaScript entirely (though I'm running iOS 9, I'm not able to install a blocker for some reason; I guess Apple arbitrarily doesn't support them on my older iPhone model).

>I guess Apple arbitrarily doesn't support them on my older iPhone

It's not arbitrary - it requires a 64-bit CPU (of which Apple has now shipped 3 generations).

Ah, is that the differentiator? I see. Still strikes me as somewhat arbitrary, though - is content blocking such a strenuous task that it requires a 64-bit CPU? Wouldn't using a blocker cause the CPU to do less work in most cases, since it doesn't have to download so many ad media files or execute as much JavaScript?

Yeah, I guess it's just time to get a friggin' new phone already, but this one ain't broke yet, ya know?

If anyone is looking for a good blocker for stuff like this, I recommend Ghostery. I set it to block everything by default, and whitelist the few things I want. It doesn't block scripts served by the site you are on, so it doesn't totally break your browsing experience, like others do.

I could install an adblocker on iOS 9, and you can customize the block list.

I don't think you are a minority. I understand adblocking usage is around 20%-ish now.

What I don't understand is people who use adblockers but still log in to their Google account on Chrome. It sort of defeats the purpose...

If your device is jailbroken (not sure if there's a jailbreak for iOS 9), you could add entries for GA to its hosts file. I use these on my desktop PC:

    www.google-analytics.com
    google-analytics.com
    ssl.google-analytics.com

AFAIK ad blockers are only supported on iOS devices with 64-bit CPUs.

An open-source, self-hostable solution providing 80% of common Google Analytics functionality seems doable to me.

Is there anything out there in this realm? If not, why not?

Have a look at piwik: http://piwik.org/

Unbelievable. Unsalted MD5, no less. There's an issue to fix this that's been open for seven years! https://github.com/piwik/piwik/issues/5728

Eh. The analytics data is pretty low value as far as hacker targets, and this can be mostly mitigated anyways by sane segregation of the admin backend from the publicly accessible site.

There's an open ticket for it, but it looks like it hasn't been addressed in a while since they don't want to break all existing passwords.


A low value target maybe, but having a critical security ticket open for seven years is unforgivable. If they don't want to break compatibility it's pretty simple: use something like PHPass and upgrade the hash when the user next logs in. i.e. what every halfway sensible web app did at least five years ago.

It does not have to break all existing passwords. Just add an envelope for the old passwords.

There's a $555 bounty if you can demonstrate a security vulnerability in Piwik because of that.

I'm not interested in further dehumanizing myself with participation in a bug bounty program.

I'll write an exploit for it (the general case, not just Piwik in particular) and drop it on OSS Sec some day, but here's a theoretical attack:

1. Guess a username somehow. Maybe "admin"? Whatever, we're interested in the security of the hash function. Let's assume we have the username for our target.

2. Calculate a bunch of guess passwords, such that we have one hash output for each possible value for the first N hexits.


    substr(md5($string), 0, 2) === "00"
    substr(md5($string), 0, 2) === "01"
    substr(md5($string), 0, 2) === "02"
    // ...
    substr(md5($string), 0, 2) === "ff"
3. Send these guess passwords repeatedly and use timing information to get an educated guess on the first valid MD5 hash.

4. Iterate steps 2 and 3 until you have the first N bytes of the MD5 hash for the password.

5. Use offline methods to generate password guesses against a partial hash.

The end result: A timing attack that consequently allows an optimized offline guess. So even if their entire codebase is immune to SQL injection, you can still launch a semi-blind cracking attempt against them.

By the way, if anyone else wants to try to claim the $555 from Piwik based on the above theoretical attack, feel free.

How to protect from timing attacks - It's All About Time: http://blog.ircmaxell.com/2014/11/its-all-about-time.html

password_verify() compares hashes in constant-time, so, yeah...
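For readers outside PHP, the same constant-time primitive exists in Python's standard library. A toy sketch contrasting the vulnerable pattern with the safe one (illustrative only, not Piwik's code):

```python
import hashlib
import hmac

def verify_naive(password: str, stored_md5_hex: str) -> bool:
    # Vulnerable pattern: '==' can return as soon as a byte differs,
    # leaking (via timing) how many leading characters of the hash match.
    return hashlib.md5(password.encode()).hexdigest() == stored_md5_hex

def verify_constant_time(password: str, stored_md5_hex: str) -> bool:
    # hmac.compare_digest runs in time independent of where the
    # inputs differ, defeating the timing side channel.
    guess = hashlib.md5(password.encode()).hexdigest()
    return hmac.compare_digest(guess, stored_md5_hex)
```

(Either way, unsalted MD5 is still the wrong hash for passwords; the comparison only closes the timing channel, not the offline-cracking weakness.)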

http://snowplowanalytics.com/ is worth considering if you have larger volumes of traffic

Snowplow is great. Super scalable and they have instructions on how to host everything on your own AWS infrastructure.

+1 for Snowplow. We've been using it for more than a year now with high-traffic sites and it's working great.

What do you use as a front-end?

Why not for low volume sites?

Well, it takes a lot of time to set up, and it needs a few extra servers: an event collector, log cleanup and enrichment, data loading, and a database server.

Yes, there is: www.piwik.org

Self hosted analytics is still centralizing behavioral data on the users. It's not really any better from a privacy standpoint than GA.

It's nowhere near as centralized as Google Analytics, though - at least if you're self-hosting, that data is confined to the silo that is your own analytics, rather than Google being able to aggregate it with users' behaviour on every other site they visit as well.

That silo is still aggregating data. Trying to argue its "less" centralized by using quantification of the amount of centralization is still akin to dissonance. Clearly people here don't agree with this, but that's to be expected when the topic is so polarizing. Traffic analytics must be important, so we rationalize our actions, or inactions around how we collect them.

Any centralized solution, at any scale, can possibly violate someone's privacy. Period. If we want to really fix things, we should stop circle jerking ourselves and do something about it.

Not at all. The entire point is that Google is able to track one person across many, many sites. That is simply not possible if each site had its own self-hosted analytics.

To any of the SpiderOak team: thank you.

It's more than just the tracking cookie, though. It's also about Google aggregating all its website data into a unified profile. The data they have on everyone is frightening—all because of free services like GA.

Yes, thank you SpiderOak, even though I don't use you: High profile companies quitting GA means we get aware of alternative solution. Today, I've learnt about http://piwik.org .

Spideroak user here. I stopped using Dropbox and started using Spideroak about 18 months ago. I really like the product. It's not as good as Dropbox in some ways (like automatically syncing photos from my phone) but it really is easy to use. I still have a mobile client on Android and I can keep my files in sync across multiple computers. I pay for the larger storage size and I'm not even close to using it all.

It syncs fast too. Just thought I'd share my experience with people.

Is it just me or is this a click-bait title with hollow content?

It is. It's no big deal to stop using Google Analytics. It is, however, a big deal not to use Google Search, something I am considering for my company.

Well, the title says they stopped using Google analytics, and the article explained that they stopped using Google analytics, why they did it, and what they're doing instead. You may not find it interesting, but the title clearly reflects the content, so I'm not sure how it's click bait.

> Like lots of other companies with high traffic websites, we are a technology company; one with a deep team of software developer expertise. It took us only a few weeks to write our home-brew analytics package.

I'm a little curious why they decided to go this route instead of using one of the open-source solutions. Aren't there good solutions to this problem already?

I was curious as well and just assumed the usual NIH (not invented here) syndrome. Web analytics was quite mature before Google bought Urchin and turned it into Google Analytics. Since that time countless open source projects have sprung up (Piwik was the first that came to mind). A Google search for open source alternatives brings up thousands of pages of projects.

Writing your own is easy for the basic stuff. When you want to move beyond the basics, as SpiderOak will find, it becomes much more difficult.

I'm doing my part. I'm moving to DuckDuckGo for searching more and more. It's a process. Google does have better results. For work I still rely on Google, for private stuff I use https://duckduckgo.com/

And for the sake of ducks, I'm eating less meat as well. No more chicken - too much antibiotics, and as little meat as possible, only when it's worth it, so great taste and good quality.

I'm a big DDG fan too. I don't really notice their results being "worse" than Google's (but maybe that's just because I haven't used Google for so long). The Bang feature is also very handy once you get in the habit of remembering to use it. https://duckduckgo.com/bang

More and more, if I'm not satisfied with DDG, I will Google that term, only to get the same stuff.

Do you also experience slower response times at DDG?

Here in Europe 'ping -c 5' gives an average of about 10ms for google.com and 30ms for duckduckgo.com. Since search is such a fundamental part of browsing, this is very noticeable.

You can also try https://search.disconnect.me.

> Sadly, we didn’t like the answer to that question. “Yes, by using Google Analytics, we are furthering the erosion of privacy on the web.”

The only thing "wrong" with using an analytics service to better understand your customers is that it places all knowledge of visits, including ones that wished to be private, in a centralized location. This can be useful in providing correlation data across all visitors in aggregate, such as which browser you should make sure your site supports most of the time.

In other words, there exists some data in aggregate that is valuable to all of us, but the cost is a loss of privacy for smaller sets of personal data.

If individuals don't want certain behaviors analyzed by others, then they shouldn't use centralized services which exist outside their realm of control. These individuals would be better off using a "website" that is hosted by themselves, inside their own four walls, running on their own equipment. A simple way for SpiderOak to address this is to put their website on IPFS or something similar.

I appreciate the fact that SpiderOak is thinking about these things. It's important!

>why does Google and their advertisers need to know about it I would ask

Google is pretty clear about this. The only reason they track you is for advertising, and there isn't any evidence of them using the info for anything else. In fact, there is a lot of evidence pointing the other way, such as their insistence on encrypting data flowing between their datacenters.

This is Google we are talking about, not Kazakhstan, China or Russia.

Google could eventually use this information to determine your eligibility for a home loan. They have already dipped their toes in this area [1]. With all this data, we have to ensure that it is used fairly (or not at all). There is enough concern about digital redlining that a 2014 report to the White House covers it [2]. As we know, machine learning is quite capable of inferring sensitive attributes [3].

This inference doesn't even need to be intentional; machine learning is capable of accidentally picking up on latent variables. Even if your neighborhood (the target of the original redlining) isn't a feature in the training data, it can be inferred from the other variables.

TL;DR: Your surfing behavior could be used to deny you a home loan one day.

[1] http://techcrunch.com/2015/11/23/google-launches-mortgage-sh...

[2] https://www.whitehouse.gov/sites/default/files/docs/big_data...

[3] http://www.pnas.org/content/110/15/5802.abstract
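A toy sketch of that last point, with entirely synthetic data (this is not any real lender's model): even when the sensitive attribute is excluded from the features, a correlated proxy variable lets a plain least-squares fit recover it far above chance.

```python
# Synthetic demo: a "dropped" sensitive attribute leaks through a proxy.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Sensitive attribute the model is deliberately not given (0/1 neighborhood)
neighborhood = rng.integers(0, 2, n).astype(float)
# An innocuous-looking feature that happens to correlate with it
proxy = neighborhood + rng.normal(0.0, 0.3, n)
# A genuinely unrelated feature
noise = rng.normal(0.0, 1.0, n)

# Regress the sensitive attribute on the remaining features only
X = np.column_stack([proxy, noise, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, neighborhood, rcond=None)
recovered = (X @ coef) > 0.5

acc = (recovered == neighborhood).mean()
print(f"recovered the excluded attribute with accuracy {acc:.0%}")
```

The fix is not simply deleting the sensitive column; the correlations have to be audited too.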

Lots of speculation about what they might do. You could also say that the US government could use all of your data to spy on people who criticise the government, so they shouldn't have any of that data either.

Still, Google is a corporation whose only purpose is profit; trusting them blindly is foolish.

Their other purpose is "don't be evil". There may be some debate about that at times, but they certainly aren't going to screw their customers. They know their customer base would evaporate pretty quickly if they tried.

> Their other purpose is "don't be evil".

I feel so tempted to laugh in your face right now.

>screw their customers

Google's customers are advertisers, not common people.

>I feel so tempted to laugh in your face right now.

I wouldn't recommend it.

>Google's customers are advertisers, not common people.

Their users are also their customers. Without users there are no advertisers.

>Their users are also their customers

We disagree.

Kudos for this!!

It's interesting that there's still a meta tag in place, probably a leftover:

    <meta name="google-site-verification" content="pPH9-SNGQ9Ne6q-h4StA3twBSknzvtP9kfEB88Qwl0w">
EDIT: wow, thanks for your answers guys!! so nice to see Cunningham's law in action ;)

That's for Google's Webmaster Tools, which is another service Google offers (although in this case, you get to peer into their data).

That is for webmaster tools, not analytics.

Surprised that they don't use the DNS verification option. Bloating your HTML for every request is unnecessary.

This could just be for Google Webmaster Tools, which gives them information on their search presence.

IIRC that meta tag is also used for verification for Google Webmaster Tools
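For reference, the DNS verification option mentioned above publishes a TXT record at the domain root instead of a meta tag, roughly like this (the token below is illustrative, not SpiderOak's real one):

```
spideroak.com.  3600  IN  TXT  "google-site-verification=<token-from-search-console>"
```

Google checks the record once, and the HTML stays untouched on every request.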

> It took us only a few weeks to write our home-brew analytics package.

Unfortunately, there's no way to replicate what Google Analytics currently offers (for free!) within a couple of weeks (or even months). Not with big data sets. Yes, GA does enforce sampling if you don't pay for GA Premium, but the free edition is still one hell of a deal (if you don't care about privacy).

If you only use Google Analytics as a hit counter, sure, you can do that yourself within a couple of minutes. The advanced features are way more complicated, though (think segmentation and custom reports).

This also raises the question: why not use Piwik?

I suspect most of the people saying "you don't need Google Analytics! Do it yourself!" have never used GA for anything that meaningful. As you begin to really familiarize yourself with your website traffic and understand how to look at your clickstream data in a more investigative and analytical way, you'll start to see how nice GA is and how easy it is to answer your questions.

You also underestimate how ubiquitous GA is because it's free and extremely popular. I'd consider myself an intermediate to advanced user of GA, but for people less experienced, I can easily share stuff with them for complicated tasks or they know how to do a lot of the basics themselves.

In hiring digital marketing people, GA is pretty much on par with Word in terms of familiarity. It's something a lot of people have a basic competence with.

I agree completely.

GA has become very, very capable in the last five years or so. Combined with their current APIs, you can do pretty much anything you want.

To me, it is the cost that matters. Most other analytics products cost $30-$50 per million pageviews/datapoints, which is expensive. Even when you scale to 100M pageviews, volume pricing still comes to roughly $20 per million.

Piwik doesn't scale. At least, it doesn't scale unless you spend a lot of resources tinkering with it. Its cloud edition is even more expensive than GoSquared, which I consider a much better product.

What we basically need is a simple, effective, and cheap enough alternative to GA. And so far there are simply none.

Instead of rolling your own look at Piwik. It works very well and is basically a GA clone. I actually like it better than GA in some ways. It's easy to set up and you can run it on your own site so you're not contributing to a global tracking fabric.

I don't get it. SpiderOak states that they dropped GA because it furthers "the erosion of privacy on the web", but then they just started tracking in-house.

How is tracking in house more private than GA? The user is still being tracked.

I believe their point was that they want to track their traffic, but when they use a third party like Google, Google provides tracking services for SpiderOak, Google also tracks you as well, which SpiderOak has no control over.

With it in-house, it is under their control: they can anonymize it, choose not to collect certain information, it can't be cross-indexed with your traffic from other sites, etc.
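Anonymizing before storage can be as simple as zeroing host bits, similar in spirit to GA's anonymizeIp option (last octet for IPv4, trailing bits for IPv6). A minimal sketch, with illustrative prefix lengths:

```python
# Zero host bits of an IP before logging it, so stored hits can't be
# tied back to a single machine. Prefix lengths are illustrative.
import ipaddress

def anonymize_ip(addr: str) -> str:
    ip = ipaddress.ip_address(addr)
    prefix = 24 if ip.version == 4 else 48
    net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(net.network_address)

print(anonymize_ip("203.0.113.42"))  # 203.0.113.0
print(anonymize_ip("2001:db8::1"))   # 2001:db8::
```

Applied at ingest time, the raw address never touches disk, which is a guarantee a third-party service can't give you.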

Google likely (or at least theoretically) has the capability to track the user across all the sites that use GA. In-house tracking doesn't allow this.

I haven't checked my GA in months, since it became clear that Google won't bother fixing the referrer spam problem that makes the stats useless if you don't have a high-volume site. It's not like these abusers are hard to track down, but I'll be damned if I'm going to manually add filters to get rid of them every time they come in from a new domain.

Admin -> select account -> select property -> View Settings (on the view you want) there's a checkbox:

Exclude all hits from known bots and spiders
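And if you do roll your own analytics, filtering referrer spam server-side is a few lines. A sketch with a tiny hand-maintained blocklist (the two domains below are well-known referrer spammers; a real list would be longer and community-sourced):

```python
# Drop hits whose referrer domain is on a spam blocklist.
from urllib.parse import urlparse

SPAM_DOMAINS = {"semalt.com", "buttons-for-website.com"}

def is_spam_referrer(referrer: str) -> bool:
    host = urlparse(referrer).hostname or ""
    # Match the domain itself and any subdomain of it
    return any(host == d or host.endswith("." + d) for d in SPAM_DOMAINS)

hits = [
    {"path": "/", "referrer": "https://news.ycombinator.com/"},
    {"path": "/", "referrer": "http://forum.semalt.com/crawler"},
]
clean = [h for h in hits if not is_spam_referrer(h["referrer"])]
print(len(clean))  # 1
```

The maintenance burden the parent complains about doesn't go away, but at least the filter is yours to automate.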

For anyone here looking for a really good, free, self-hosted, hackable, open source alternative to Google Analytics that's been around for a long time, please consider Piwik.org.

I've been using it for prob 8-10 years and it has never missed a beat. I use it on all my personal / business sites as well as some client websites that are super high traffic.

Analytics, fonts, css. We include it everywhere by default. Then I realized hey we are all giving away too much. My sites now happily run self-hosted piwik, for the last six or so months.

I won't be surprised if in the coming years we hear much more about Google Fonts being used to count site accesses when no analytics is in place.

It should also be noted that SpiderOak has opensourced many components of their product stack, including Crypton, which is the encryption framework underpinning many of their clients.

The source is at https://github.com/SpiderOak

Make your custom analytics library into a separate product in its own right and sell it or open-source it!

Usually I start with Google Analytics but continue to add to our own in-house analytics solution targeting the specific metrics we're interested in tracking. GA often doesn't provide us with the real insights we're looking for, but it's good for the vanity stats.

What about AWStats or GoAccess? Both are great log analyzers, although I like GoAccess better.



The look and feel of awstats hasn't changed since I last used it back in 2004...

Piwik is the new AWStats, check out http://piwik.org/log-analytics/

I'd hate to be the intern or the guy in charge of keeping that thing running.

Random fact: the GA cookie is distinct from the AdWords (google.com) cookie, and it is illegal for Google to join them (not sure if it is even technically feasible).

I've had good experience with Statcounter (http://www.statcounter.com)

At Cloudron, our vision is to allow companies to host their own apps easily. We dogfood and don't use Google. We don't use analytics on our website (a conscious decision). Our emails are based on IMAP servers and we use thunderbird. We selfhost everything other than email (which is on gandi).

We just entered private beta yesterday - https://cloudron.io/blog/2015-12-07-private-beta.html

How's this compare to Sandstorm[0]? It seems like a Closed/SaaS equivalent.

[0] https://sandstorm.io/

Cloudron and Sandstorm are similar projects. I think the main difference is the user experience (also how we handle domains, how apps are packaged, etc.). You can see a demo of the Cloudron here - https://my-demo.cloudron.me/ (username: cloudron, password: cloudron). All apps use the same credentials (because of single sign-on).

Is there any analytics solution that allows for on-prem data storage? Only one I know of is Adobe Analytics (formerly Omniture).

Yes, check out Piwik.org / Piwik.pro

Now if they could also just ditch Facebook, Microsoft, and Apple, they'd be getting somewhere.

Why not just use Piwik?

