We dual-write event data to BigQuery and Redshift. BigQuery is incredibly cheap at the storage level, but queries against large datasets can get pricey. Personally, I'm not a fan of paying per query.

Historically BigQuery data was immutable. You could delete a table, but you couldn't modify or delete a single row. That support was added recently, but there are daily limits imposed on those actions.
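
A minimal sketch of what a row-level correction looks like today, using the google-cloud-bigquery Python client against a hypothetical events table (each DML statement like this counts toward the daily quota mentioned above):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Delete a single bad row via DML; table and id are hypothetical.
    job = client.query(
        "DELETE FROM `my-project.events.raw_events` WHERE event_id = @event_id",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("event_id", "STRING", "abc-123"),
            ]
        ),
    )
    job.result()  # block until the statement finishes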

I'm a big fan of immutable data, right up until someone makes a mistake somewhere. Naturally you can make modifications in ETL, but do that long enough and your ETL job becomes a mess of time-based if statements.

We originally added Redshift because BigQuery didn't support row-level updates at the time. Someone on the content site would forget to put an & between URL params and our source data would be busted. Like I said, immutable data is great as long as you never make mistakes...
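
To make that concrete, here's an illustrative sketch (not our actual ETL) of how those time-scoped fixes pile up in a transform step; the dates, field names, and fixes are all made up:

    from datetime import datetime

    def clean_event(event: dict) -> dict:
        """Apply time-scoped corrections for known-bad source data."""
        ts = datetime.fromisoformat(event["timestamp"])

        # Window where the content site dropped the '&' between URL params
        # (hypothetical dates); patch the raw URL before parsing it.
        if (datetime(2016, 3, 1) <= ts < datetime(2016, 3, 9)
                and "&" not in event["url"]):
            event["url"] = event["url"].replace("utm_source=", "&utm_source=")

        # Another one-off fix, scoped to its own window.
        if ts < datetime(2016, 6, 1) and event.get("bucket") == "test_b":
            event["bucket"] = "variant_b"

        return event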

I'm also not a huge fan of nested records. If you use BigQuery, do yourself a favor and make sure you have a uuid field (it's good practice anyway). When you get data back from the client, it can be a pain to piece records back together (nested records mostly come back as single rows). Nesting makes certain queries really fast, but it can be a pain to work with in aggregate.
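
What I mean in practice, as a rough sketch (the event_id field name and client-side regrouping are assumptions, not a BigQuery feature): tag every event with a UUID at write time, then group flattened result rows back into logical records by that id.

    import uuid
    from collections import defaultdict

    def new_event(payload: dict) -> dict:
        """Attach a UUID at write time so the record can be reassembled later."""
        return {"event_id": str(uuid.uuid4()), **payload}

    def regroup(rows) -> dict:
        """Group flattened query-result rows back into records by event_id."""
        records = defaultdict(list)
        for row in rows:
            records[row["event_id"]].append(row)
        return dict(records)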

My preferred setup these days is to dual-write to Redshift and BigQuery. We run almost all analysis off of Redshift, but have a natural backup of the data stored elsewhere. And like I said, BigQuery storage rates are dirt cheap. If our Redshift cluster goes down for any reason, we can hop over to BigQuery to find what we need (and backfill any missing data points).
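
Roughly, the write path looks like the sketch below: a streaming insert into BigQuery plus a direct INSERT into Redshift via psycopg2 (a real pipeline would more likely batch to S3 and COPY; all names and credentials here are hypothetical).

    import psycopg2
    from google.cloud import bigquery

    bq = bigquery.Client()
    rs = psycopg2.connect(
        host="example.redshift.amazonaws.com", port=5439,
        dbname="analytics", user="etl", password="secret",
    )

    def write_event(event: dict) -> None:
        # 1) BigQuery: streaming insert into the raw events table.
        errors = bq.insert_rows_json("my-project.events.raw_events", [event])
        if errors:
            raise RuntimeError("BigQuery insert failed: %s" % errors)

        # 2) Redshift: the same record into the mirrored table.
        with rs.cursor() as cur:
            cur.execute(
                "INSERT INTO raw_events (event_id, user_id, name, ts) "
                "VALUES (%s, %s, %s, %s)",
                (event["event_id"], event["user_id"], event["name"], event["ts"]),
            )
        rs.commit()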




Wait, what's the advantage of dual-writing? My understanding is that there are a lot of cases where BigQuery is significantly cheaper, but if you're already paying for Redshift, there's really nothing BigQuery can do that Redshift can't.


We dual-write mostly to be able to offload certain query patterns, like joining large tables on non-distkey columns. Redshift handles joins much better than BQ as long as they're on the same DISTKEY, or if you can afford to make a table DISTSTYLE ALL; but if you can't, the resulting shuffle hurts not only the runtime of that specific query but also any other queries running at the same time. We have a job that took 1 hour on a 28-node dc2.8xlarge cluster but only 4 minutes on BigQuery with basically no rewriting of the query, and we saw a reduction in runtimes of concurrent jobs as well.
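
For anyone unfamiliar with the trade-off being described, a hypothetical schema sketch (kept as a plain string, purely for illustration): tables sharing a DISTKEY join node-locally, a small dimension can be DISTSTYLE ALL, and anything else forces the shuffle described above.

    # Hypothetical Redshift DDL, not the poster's actual schema.
    COLLOCATED_DDL = """
    CREATE TABLE events (
        user_id BIGINT,
        name    VARCHAR(64),
        ts      TIMESTAMP
    ) DISTKEY (user_id) SORTKEY (ts);

    CREATE TABLE users (
        user_id BIGINT,
        plan    VARCHAR(32)
    ) DISTKEY (user_id);   -- joins with events on user_id avoid redistribution

    CREATE TABLE plans (
        plan      VARCHAR(32),
        price_usd INTEGER
    ) DISTSTYLE ALL;       -- small table replicated to every node
    """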


Redundancy across providers. BigQuery storage costs very little. It was implemented first, and we introduced Redshift later to deal with data issues we didn't want to handle long-term in ETL. It didn't make sense to kill off the BigQuery integration, and for $20/mo or so I've decided that backup storage on a separate provider is worth it.

Redshift and BigQuery are both solid services, and BigQuery has gotten better with time. But I still have some reservations about it as a pure Redshift replacement.


Thanks for the details! Could I ask you about your query patterns and the kind of data loads you're pushing into those?


Most of our use case is storing raw event data: user clicked on X; interacted with feature Y; was bucketed into test group Z; audit logs for subscription events (joined, upgraded, downgraded, paid, payment denied, etc.). Basically everything we care about that isn't a core concern of providing the service itself.

The main app needs to know if a user has access to entitlement A, but the history of that process is outside the concern of the app itself.

We throw everything into BQ/Redshift. Then when we find a need for it, we bring it into a separate reporting dataset in Redshift via ETL.

The nice part about this is that you can join your sanitized ETL data against raw event data if you need to run a one-off query like "for everyone put into variant A of experiment B, how many converted from a free trial to a paid subscription?" I haven't found a great way to programmatically recreate ETL data in BQ without recreating a new dataset every day (because of the history of immutability). It's not impossible, it's just not as nice (imho) as an upsert-type workflow.
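
A sketch of that kind of one-off question (every table and column name here is hypothetical):

    # "For everyone put into variant A of experiment B, how many converted
    # from a free trial to a paid subscription?"
    CONVERSION_SQL = """
    SELECT COUNT(DISTINCT s.user_id) AS converted_users
    FROM reporting.subscriptions AS s
    JOIN raw.events AS e
      ON e.user_id = s.user_id
    WHERE e.name = 'bucketed'
      AND e.experiment = 'B'
      AND e.variant = 'A'
      AND s.previous_status = 'free_trial'
      AND s.status = 'paid';
    """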

Occasionally we'll have a one-off query we run by hand, but that's relatively rare at this point. Jobs run around 1am and everything is ready to go when the team gets in in the morning.

You have the odd job where you have to manage distkeys, and that's rarely fun, but it's a trade-off we're comfortable making.

And, while I'm plugging other people's services, Redshift + Looker is incredible.



