
From the Google side: In traditional BigQuery, the answers to all three questions are related. You shard the files by partition key and put the key into the file name. You can then filter on the file name in the WHERE clause, and the query will skip the filtered-out objects, but it otherwise fully scans every object it touches.
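For external tables over Cloud Storage, that file-name filter uses BigQuery's _FILE_NAME pseudo-column; a minimal sketch (table and bucket names made up):

    SELECT user_id, event_ts, payload
    FROM mydataset.events_external
    -- _FILE_NAME holds the full gs:// path, so a prefix match prunes whole objects
    WHERE _FILE_NAME LIKE 'gs://my-bucket/events/dt=2024-01-15/%';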

There is apparently now experimental support for using Hive partitions natively. Never used it, literally found out two minutes ago.
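From a quick look at the docs, the DDL appears to be roughly this - treat it as a sketch, since I haven't run it (names made up):

    CREATE EXTERNAL TABLE mydataset.events_hive
    WITH PARTITION COLUMNS  -- partition keys inferred from dt=.../ style paths
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://my-bucket/events/*'],
      hive_partition_uri_prefix = 'gs://my-bucket/events'
    );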

The number of records per object is usually "all of them" (restricted by partition keys). The main exception is live queries of compressed JSON or CSV data, because BigQuery can't parallelize them. But generally you trust the tool to handle workload distribution for you.

This works a little differently if you load the data into BigQuery instead of querying it in place in Cloud Storage. You can use partitioning and clustering columns to cut down on full-table scans.
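A minimal sketch of a partitioned and clustered native table (names made up):

    CREATE TABLE mydataset.events_native
    PARTITION BY DATE(event_ts)   -- date filters prune whole partitions
    CLUSTER BY customer_id        -- clustering cuts scanned bytes within a partition
    AS
    SELECT * FROM mydataset.events_external;

Queries that filter on DATE(event_ts) then only read the matching partitions instead of the whole table.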




That’s basically how the GA export worked at my previous job - everything in a session is nested. The upshot is basically what’s described above - easy to filter, and you don’t get partial data.

The catch is that if you need to filter by a property of the session, you end up opening every session in the range to check whether it’s one you want. That gets expensive quickly and is a bit slow.
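For example, with the GA BigQuery export (ga_sessions_* tables, hits nested inside each session - field names from memory, so treat them as approximate), filtering sessions by a hit-level property means unnesting every session in the range:

    SELECT fullVisitorId, visitId, totals.pageviews
    FROM `myproject.mydataset.ga_sessions_*`
    WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
      AND EXISTS (
        SELECT 1 FROM UNNEST(hits) AS h
        WHERE h.page.pagePath = '/pricing'   -- the session-level filter we actually want
      );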

For data lakes, Parquet and Spark support fairly sane date partitioning. Partitioning by anything else, such as a customer ID, is a question of whether you actually need it - but remember, this is a data lake, not the source table for your CEO’s daily report. The purpose of the lake is to capture everything that you sanely can.
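A minimal sketch of that date-partitioned layout in Spark SQL (paths and names made up):

    CREATE TABLE lake.events
    USING PARQUET
    PARTITIONED BY (event_date)
    LOCATION 's3://my-lake/events/'
    AS
    SELECT *, CAST(event_ts AS DATE) AS event_date
    FROM staging.raw_events;
    -- files land under .../event_date=2024-01-15/, so date filters prune whole directories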

When you can’t store everything, usually due to cost, you then have to aggregate and only keep the most valuable data. For example, in AdTech, real-time bidding usually involves a single ad request, hundreds of bid requests, a few bid responses and the winning bid. Value here is inversely related to size - bid requests without responses are useful for predicting whether you should even ask next time, but the winning bid plus the runner-up tell you a lot about the value of the ad request.
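As a sketch of that kind of trimming (hypothetical bids table and column names), keeping only the winning bid and the runner-up per ad request is a window function:

    SELECT ad_request_id, bidder_id, bid_price
    FROM (
      SELECT ad_request_id, bidder_id, bid_price,
             ROW_NUMBER() OVER (PARTITION BY ad_request_id
                                ORDER BY bid_price DESC) AS bid_rank
      FROM lake.bids
    ) AS ranked
    WHERE bid_rank <= 2;  -- winner + runner-up; the rest gets aggregated or dropped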

For structuring the warehouse for reporting/ad hoc querying, to me the flatter the better - it plays to the native strengths of columnar stores and makes analysis a lot faster. The downside: good luck keeping everything consistent and up to date. Usually you end up just reprocessing everything each day/hour/whatever the need is, and at a certain point you say no new updates to rows older than X.
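One common shape for that reprocess, in BigQuery-flavoured SQL (table names and the 7-day cutoff are made up):

    -- rebuild only the last 7 days; rows older than that are frozen
    DELETE FROM warehouse.report_flat
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);

    INSERT INTO warehouse.report_flat
    SELECT event_date, customer_id, COUNT(*) AS events, SUM(revenue) AS revenue
    FROM lake.events
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    GROUP BY event_date, customer_id;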

The cool thing about modern data warehouses is that they include interfaces to talk to the data lakes - such as Redshift Spectrum (which is basically Athena) and the aforementioned BigQuery ability to use tables, streams and files from GCP - so your analysts don’t have to jump between different toolchains.
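With Redshift Spectrum, for instance, the lake shows up as an external schema you can join against ordinary Redshift tables (catalog, role and table names made up):

    CREATE EXTERNAL SCHEMA lake
    FROM DATA CATALOG
    DATABASE 'lake_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/spectrum-role';

    SELECT c.customer_id, COUNT(*) AS events
    FROM lake.events AS e          -- Parquet files in S3, via the Glue catalog
    JOIN warehouse.customers AS c  -- a regular Redshift table
      ON c.customer_id = e.customer_id
    GROUP BY c.customer_id;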

It’s an incredibly productive time to be working with all this! Even 10 years ago, you’d need a lot of budget and a team just to keep the lights on; today it’s all compressed into these services and software.



