Fairly new to this topic and coming from a traditional RDBMS background. How do you go about deciding how many rows/records to store per object? And how does Athena/Bigquery know which objects to query? Do people use partitioning methods (e.g. by time or customer ID etc) to reduce the need to scan the entire corpus every time you run a query?



From the Google side: In traditional BigQuery, the answers to all three questions are related. You shard the files by partition key and put the key into the file name. You can filter on the file name in the WHERE clause, and the query will skip the filtered-out objects, but it will otherwise fully scan every object it touches.
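Roughly what that filename filter looks like with the Python client - a minimal sketch, assuming a GCS-backed external table with dt= in the object paths (project/dataset/table names are made up):

    from google.cloud import bigquery

    client = bigquery.Client()

    # _FILE_NAME is the pseudo-column exposed on GCS-backed external tables;
    # filtering on it lets the query skip objects whose names don't match.
    query = """
        SELECT user_id, event_ts
        FROM `my_project.my_dataset.events_external`
        WHERE _FILE_NAME LIKE '%/dt=2024-01-15/%'
    """
    for row in client.query(query).result():
        print(row.user_id, row.event_ts)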

There is apparently now experimental support for using Hive partitions natively. Never used it, literally found out two minutes ago.

The number of records per object is usually "all of them" (restricted by partition keys). The main exception is live queries of compressed JSON or CSV data, because BigQuery can't parallelize them. But generally you trust the tool to handle workload distribution for you.

This works a little differently if you load the data into BigQuery instead of querying data that lives in Cloud Storage. You can use partitioning and clustering columns to cut down on full-table scans.
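A sketch of what that looks like for a native table (names invented) - partitioning prunes whole partitions, clustering narrows scans within them:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Partition by day and cluster by customer, so queries that filter on
    # event date and/or customer_id scan far less than the full table.
    ddl = """
        CREATE TABLE `my_project.my_dataset.events`
        PARTITION BY DATE(event_ts)
        CLUSTER BY customer_id
        AS SELECT * FROM `my_project.my_dataset.events_staging`
    """
    client.query(ddl).result()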


That’s basically how the GA export worked at my previous job - everything in a session is nested. The upshot is what’s described above: it’s easy to filter, and you don’t get partial data.

The catch is that if you need to filter by a property of the session, you end up opening every session in the range to check whether it’s one you want. That gets expensive quickly, and it’s a bit slow.
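For a sense of why that's expensive, here's roughly what filtering sessions by a nested hit property looks like against the GA 360 export (field names follow that schema; the rest is illustrative) - every session in the date range gets unnested before the filter applies:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Wildcard scan over the daily export tables, unnesting hits to test a
    # per-hit property; only the date range limits what gets opened.
    query = """
        SELECT DISTINCT fullVisitorId, visitId
        FROM `my_project.my_dataset.ga_sessions_*`,
             UNNEST(hits) AS h
        WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
          AND h.page.pagePath = '/checkout'
    """
    rows = client.query(query).result()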

For data lakes, Parquet and Spark support fairly sane date partitioning. Partitioning by anything else - a customer ID, say - is a question of whether you actually need it; remember this is a data lake, not the source table for your CEO's daily report. The purpose of the lake is to capture everything that you sanely can.
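A minimal PySpark sketch of the date-partitioned lake write (bucket paths and column names are assumptions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

    # Land raw events as Parquet, partitioned by date; readers that filter
    # on dt only touch the matching dt=YYYY-MM-DD directories.
    events = spark.read.json("s3://my-bucket/raw/events/")
    (events
        .withColumn("dt", F.to_date("event_ts"))
        .write
        .partitionBy("dt")
        .mode("append")
        .parquet("s3://my-bucket/lake/events/"))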

When you can’t store everything, usually due to cost, you have to aggregate and only keep the most valuable data. For example, in AdTech, real-time bidding usually involves a single ad request, hundreds of bid requests, a few bid responses, and the winning bid. Value here is inversely related to size: bid requests without responses are useful for predicting whether you should even ask next time, but the winning bid plus the runner-up tell you a lot about the value of the ad request.
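One way that kind of reduction can look in Spark, keeping only the winner and runner-up per ad request - a sketch with an invented schema:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.appName("bid-compaction").getOrCreate()

    # Rank bids within each ad request by price and keep the top two.
    bids = spark.read.parquet("s3://my-bucket/lake/bids/")
    w = Window.partitionBy("ad_request_id").orderBy(F.col("bid_price").desc())
    top_two = (bids
        .withColumn("rnk", F.row_number().over(w))
        .filter(F.col("rnk") <= 2)
        .drop("rnk"))
    top_two.write.mode("overwrite").parquet("s3://my-bucket/lake/bids_top2/")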

For structuring the warehouse for reporting and ad hoc querying, to me the flatter the better - it plays to the native capabilities of columnar stores and makes analysis a lot faster. The downside: good luck keeping everything consistent and up to date. Usually you end up reprocessing everything each day/hour/whatever the need is, and at a certain point you say no new updates to rows older than X.
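One way that "reprocess recent rows, freeze anything older than X" pattern can look, as a BigQuery MERGE scoped to a rolling window (table and column names are made up):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Only rows from the last 30 days are eligible for update; older rows
    # are left alone, and genuinely new rows are inserted.
    merge = """
        MERGE `my_project.mart.orders_flat` AS t
        USING (
          SELECT * FROM `my_project.staging.orders_flat`
          WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
        ) AS s
        ON t.order_id = s.order_id
           AND t.order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
        WHEN MATCHED THEN
          UPDATE SET status = s.status, amount = s.amount
        WHEN NOT MATCHED THEN
          INSERT (order_id, order_date, status, amount)
          VALUES (s.order_id, s.order_date, s.status, s.amount)
    """
    client.query(merge).result()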

The cool thing about modern data warehouses is that they include interfaces to talk to the data lake, so your analysts don’t have to jump between toolchains: Redshift Spectrum (which is basically Athena), and the aforementioned BigQuery ability to query tables, streams, and files across GCP.

It’s an incredibly productive time to be working with all this! Even 10 years ago you’d have needed a lot of budget and a team just to keep the lights on; today it’s all compressed into these services and software.


To summarize the answers below: it all depends on what you are trying to do. Data lakes are generally less structured than the alternatives, and they can also contain non-text data, like images and videos, that can be mined as well.

It sounds like you are thinking more of a data warehouse, which is structured data on an engine designed for querying large volumes of it. I’d recommend starting with your objectives and then going with whatever solves them with the least amount of “stuff”.

I don’t work on data warehousing or pipelines now, but when I did a year ago, AWS and GCP both offered great tools with slight differences: AWS was a bit pricier to start but focused on more predictable pricing, while GCP was much cheaper with pay-as-you-go pricing but could get you into trouble on cost if you didn’t follow their best practices.


If you're using AWS Athena for querying, you're also using the AWS Glue catalog (a managed, Hive-metastore-ish service) to know where the partitions are. But yeah, you'll need to partition and sort your data to make sure you're not doing full table scans.
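A rough sketch with boto3 (bucket, database, and columns are hypothetical): the Glue catalog stores the schema plus the partition list, and Athena prunes to the matching dt= prefixes when you filter on that column. New partitions still have to be registered (ALTER TABLE ... ADD PARTITION, a Glue crawler, or partition projection).

    import boto3

    athena = boto3.client("athena")

    # Partitioned external table over the lake; the schema and partitions
    # live in the Glue catalog.
    ddl = """
        CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events (
          user_id string,
          event_ts timestamp
        )
        PARTITIONED BY (dt string)
        STORED AS PARQUET
        LOCATION 's3://my-bucket/lake/events/'
    """
    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )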


Glue worked well at my previous gig, but honestly it felt like a bit of overkill. If you have a large org with a lot of tribal knowledge and new fields showing up out of the blue, then yes, you need to organize and keep track.

If you are a relatively small operation, I’d recommend weighing the additional complexity against the benefits. Sometimes a few well-written pages can suffice; other times you need to make the investment.


The first step is to figure out whether you actually need a data lake.

I’d recommend starting off with an OLAP database and going from there, reaching for a data lake once (and only once) you’ve hit the limits of the OLAP DB.



