Hacker News new | past | comments | ask | show | jobs | submit login

That’s basically how GA export worked from my previous work - everything in a session is nested. Upshot is basically what’s above - easy to filter and you don’t get partial data.

The catch is if you need to filter by a property of the session, you are opening every session in range to check if it’s the one you want. That gets expensive quickly and is a bit slow.

For data lakes, parquet and Spark support fairly sane date partitioning. Partitioning by anything else is a question of whether you need it, such as a customer ID, etc. but remember this is a data lake, not a source table for your CEOs daily report. The purpose of the lake is to capture everything that you sanely can.

When you can’t store everything, usually due to cost, you then have to aggregate and only keep the most valuable data. For example in AdTech, real-time bidding usually involves a single ad request, hundreds of bid requests, a few bid responses and the winning bid. Value here is inversely related to size - bid requests without responses are useful for predicting whether you should even ask next time, but the winning bid + the runner up tell you a lot about the value of the ad request.

For structuring warehousing for reporting/ad hoc querying, to me the flatter the better - this uses the native capabilities of columnar stores and makes analysis a lot faster. Downside, good luck keeping everything consistent and up to date. Usually you end up just reprocessing everything each day/hour/whatever the need is, and at a certain point say no new updates to rows older than X.

The cool thing about modern data warehouses, is that they include interfaces to talk to the data lakes, so your analysts don’t have to jump to different tool chains, such as Redshift Spectrum (which is basically Athena) and the aforementioned BigQuery ability to use tables, streams and files from GCP.

It’s an incredibly productive time to be working with all this! Even 10 years ago, you’d need a lot of budget and a team to just keep the lights on, today it’s all compressed into these services and software.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: