Lessons Learned from Scaling to Multi-Terabyte Datasets (v2thegreat.com)
149 points by thunderbong 10 months ago | 45 comments



Servers have grown much bigger. 256GB of RAM is pretty much a standard rack server on a major cloud, and on AWS an ultra-memory instance with 24TB of RAM is one API call away :-). The article takes note of that, but its choice of single-node tooling puzzles me.

So multi-TB can still fit on a single node with DuckDB, or on a rather small ClickHouse cluster. DuckDB sounds easier to use than the proposed tools.

In many use cases, if it's not a 24/7 workload, you can use a data lake (e.g. Iceberg) and query it when needed. Databricks seems like the way to go, since the author uses Spark.


Iceberg is working hard to support pure Python[0] / Rust[1] workflows without Spark. Following Tabular's acquisition [2], I hope it keeps moving in this direction at the same clip (I've been very impressed with the speed of development on pyiceberg).

We're using iceberg + duckdb to power analytics in our app[3] and I'm really happy with the combo.

0 - https://github.com/apache/iceberg-python

1 - https://github.com/apache/iceberg-rust

2 - https://x.com/thisritchie/status/1800522255426072647

3 - https://www.definite.app/
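
For anyone curious what the combo can look like, here's a minimal sketch of querying an Iceberg table from DuckDB via its iceberg extension (the table path and column names below are invented for illustration; real S3 access also needs the httpfs extension and credentials configured):

  INSTALL iceberg;
  LOAD iceberg;

  -- Scan an Iceberg table straight from object storage and aggregate it.
  -- 's3://analytics/events' and the columns are placeholders.
  SELECT event_type, count(*) AS events
  FROM iceberg_scan('s3://analytics/events')
  GROUP BY event_type
  ORDER BY events DESC;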


You can already have this in Delta via delta-rs, which provides Rust and Python bindings: https://github.com/delta-io/delta-rs


Yes, we're evaluating Delta. We went with Iceberg out of concern that Delta was too closely tied to Databricks.

Following the Tabular acquisition, the decision is murkier.


What do you mean by “not 24/7”? Why would a data lake go offline?


Probably in the sense of "you're only running the really heavy analytics queries once a day/week/month", and thus don't constantly need the 24TB instances.


It's pretty common to run analytics workloads on a fixed schedule. I've set up analytics workflows that ingest data every day/week/quarter and generate a report once a day.

With workflows like that, something like DuckDB can be amazing.
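
A minimal sketch of that kind of scheduled job in DuckDB, assuming the raw data lands as Parquet in object storage (the bucket path and columns are hypothetical):

  -- Needed for reading from S3.
  INSTALL httpfs;
  LOAD httpfs;

  -- Aggregate one day's partition and write out a small report file.
  COPY (
    SELECT customer_id,
           sum(amount) AS daily_total
    FROM read_parquet('s3://lake/sales/dt=2024-06-01/*.parquet')
    GROUP BY customer_id
    ORDER BY daily_total DESC
  ) TO 'report_2024-06-01.parquet' (FORMAT PARQUET);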


It is very common that the majority of work happens during business hours. In some setups it's easy: it's better to get a beefy instance during that window than to have it running all the time.


Not sure, but maybe it's only possible to update the model at specific times? Are there other uses for the data apart from training and validation?


A “multi-terabyte dataset” is something that fits on a single machine and can be loaded into ClickHouse in a few minutes.



I wonder what the standard for "big" is in this context. For example, I always thought that a few million rows was very big, but only recently learned that it's a size an RDBMS such as PostgreSQL can handle with no problem.


For an RDBMS we typically consider millions to be tiny to small. A billion is somewhere around the boundary between medium and big.


Whatever "big" is defined as, it should at least mean that your data can't fit into RAM on a high-end server.

There's also the threshold where your indexes don't fit into RAM. And the threshold where your data no longer fits into PCIe SSDs on a single server. (The combined bandwidth of the SSDs will rival the RAM, but with more latency.)


These days I’d probably describe “big” as “doesn’t make sense to use SQL anymore”.

Qualitatively, I think it becomes “big” when you have to leave the space of generic “it just works” technologies and you have to start doing bespoke optimizations. It’s amazing how far you can get these days without having to go custom.


These days terabytes is a medium-sized database. A trillion rows of indexed data will fit on a single cloud VM and be reasonably performant.

I think a good definition of "large" is "several times larger than will fit on a practical modern server". Servers with a petabyte or more of fast attached storage are increasingly common, so that threshold is pretty high. Machine-generated data models (sensing, telemetry, et al) routinely exceed this threshold though.


> A trillion rows of indexed data will fit on a single cloud VM and be reasonably performant.

Creating and subsequently updating an index on a trillion rows is a very nontrivial task performance-wise. Which DB would you suggest for this?


You just have to plan and do it in the background at those data sizes.
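
In Postgres, for example, "in the background" is roughly a sketch like this (the table and column names are hypothetical):

  -- CONCURRENTLY avoids holding a lock that blocks writes for the whole
  -- build; it is slower and does extra scans, but the table stays usable.
  CREATE INDEX CONCURRENTLY idx_events_user_id
    ON events (user_id);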


> just have to plan and do it in the background

This can mean building some nontrivial infra, so the task is much more complicated than just using a cloud VM.

My point is that the bottleneck, and the reason you'd need a cluster rather than a single VM, is likely CPU and IOPS, not data storage.


> this can mean building some nontrivial infra

As someone who does this all the time on 10B+ record tables, not really. If you don't have the extra resources to build the occasional index, your DB is under-provisioned and you're close to falling over, cluster or not.


A 10B+ record table and trillions of indexed rows on a single machine are very different tasks.


Ah, I forgot the grandparent comment said trillion. Yeah, that's an order of magnitude I would distribute across a cluster if we were in active development and indexes were changing, etc. If you're storing that much data, you should be able to afford it :)


I think 1 PB at least.


I can confirm that. 10 TB uncompressed, 3 TB compressed in ClickHouse, 60 billion rows, and queries are fast.


Really curious what "fast" is in this context. Milliseconds, seconds, minutes?


A full-range query with aggregations takes around 40-60 seconds. That can be optimized, of course.


Based on the numbers you gave, you would need a stable 50-100 GB/s from your disk system just to read 3 TB of data in 40-60 seconds. Would you mind sharing your setup?


That's only if you read everything and none of it is in RAM. With a columnar DB you can just scan the values of the columns you need.


Then the 3 TB of total data volume is irrelevant. What's relevant is the size of the column.
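
To make that concrete: a ClickHouse-style query like the sketch below (the table and columns are invented) only reads the couple of columns it touches, so the I/O is their compressed size, not the full 3 TB.

  -- Only the created_at and amount columns are read from disk;
  -- every other column in the table is never touched.
  SELECT toStartOfMonth(created_at) AS month,
         sum(amount) AS total
  FROM events
  GROUP BY month
  ORDER BY month;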


How about many-to-many comparisons? I'm interested in this use case for fuzzy matching of many data records using a local DB.


Probably an ignorant question, but how do I avoid the issues of a cross join, say across several tables (or the equivalent)? It seems like I always need many TBs of memory.

Do you spin up a u-18tb1.112xlarge or something equivalent as needed? Or are you running it all day? I've never used ClickHouse, but I've dealt with a Spark system that was written like 10 years ago. I'm wondering if it's something that could be easily migrated to ClickHouse?


You can use awk for that and pipe it into separate files for each bucket. And then you awk a bit more, and then you are done.

This was sarcasm based on several earlier discussions, and I am prepared for the downvotes.

I think awk is fine for some things, but awk proponents underestimate the scale of a map-reduce job with multiple dimensions in both directions.

For example, I believe awk will struggle with a question like "of one million customers and one billion sales, give me the top 10 spenders". And to stick to the topic, I believe a one-node database will also struggle.


> one million customers and one billion sales, give me the top 10 spenders... I believe a one-node database will also struggle.

Doubtful. Assume a simplified (but still normalized) schema like below:

  CREATE TABLE customer (
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email TEXT NOT NULL,
    first_name TEXT NOT NULL,
    middle_name TEXT NULL,
    last_name TEXT NULL
  );

  CREATE TABLE product (
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    is_active BOOL NOT NULL DEFAULT TRUE,
    name TEXT NOT NULL,
    description TEXT,
    price NUMERIC(8,2) NOT NULL
  );

  CREATE TABLE sale (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    customer_id INT NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customer (id)
  );

  CREATE TABLE sale_item (
    sale_id BIGINT NOT NULL,
    id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    product_id INT NOT NULL,
    product_qty SMALLINT NOT NULL,
    product_price NUMERIC(8,2) NOT NULL,
    FOREIGN KEY (sale_id) REFERENCES sale (id),
    FOREIGN KEY (product_id) REFERENCES product (id)
  );
An empty row in Postgres is 24 bytes, and with the optimal column ordering above, a `sale` row would be 24 + 8 + 8 + 4 = 44 bytes. `customer` would be 40 bytes, plus the actual length of the text. Assume 30 bytes for each to cover most cases, so a total of 160 bytes per row. 1E9 sales and 1E6 customers, so 44 GB for `sale` and 160 MB for `customer`.

Assume 10,000 products, all active. `product` is 44 bytes per row, plus the actual length of the text. Assume `name` is 30 bytes, and `description` is 300 bytes, for a total of 374 bytes per row, so ~4 MB.

Assume each sale had 10 items. `sale_item` is 47 bytes per row. 1E9 sales * 10 items per sale, so 470 GB for `sale_item`.

Each table will presumably also have some indices other than the PK. Since the query in question is not likely to be run frequently, you don't have to worry about fitting it entirely into RAM (but if you did want to, you could on larger instances). Alternatively, you could create a materialized view, then create a descending index on the total. This would also have the benefit of severely reducing the buffer thrashing, since you'd only need to read in `sale_item` periodically to refresh the materialized view, and the latter would be quite small in comparison (since it would have a number of rows equal to `customer`).
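
The materialized-view route described above could look roughly like this (a sketch against the schema above, not a tuned implementation):

  CREATE MATERIALIZED VIEW customer_spend AS
  SELECT s.customer_id,
         SUM(si.product_qty * si.product_price) AS total_spent
  FROM sale s
  JOIN sale_item si ON si.sale_id = s.id
  GROUP BY s.customer_id;

  -- Descending index on the total, as described above.
  CREATE INDEX ON customer_spend (total_spent DESC);

  -- Refresh periodically (e.g. nightly); the report itself is then trivial.
  REFRESH MATERIALIZED VIEW customer_spend;

  SELECT customer_id, total_spent
  FROM customer_spend
  ORDER BY total_spent DESC
  LIMIT 10;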


I have nothing else to say other than thanks for the writeup.

And you are most likely correct.

I think what I (and OP) wanted to say was that there should be a stage where you store the intermediate results of the aggregation as well. Unless you have that in an index already. And that also costs.

But you may be right that it's possibly on the scale of a million rows, so not a huge deal.


Well, that is only true (and helpful) if it is a dataset that benefits from being in a tabular format.

Consider, for instance, particle-based simulation data (loading that into any database is a waste of time...), or a set of images that you need to perform non-standard individualized processing on (again, a waste of time).

Stuffing data into Postgres or ClickHouse is great for your typical 'data science' workflows where the data is consistent and the problem is 'oh no, our transaction volume has increased 100x'. But in other (some would claim more interesting) cases, using database systems is unhelpful.


Hi! I'm the author.

You're right! The context I was talking about was actually the GIS space! Depending on what you're trying to do, it's not uncommon to have a single dataset that's 50TB in size (think: a 100-500m resolution global raster with daily data for 30 years).

And that's not even considering any memory overhead when performing operations on top of that data.

That type of stuff wouldn't fit into Postgres (we tried once, a long time ago), and usually, you might be trying to use multiple datasets simultaneously, so that data adds up pretty quickly!

It's my first post (and writing) ever, so I have room for improvement. Thanks for taking the time to read!


When I first read the title I had it in my head that it said multi-petabyte. When I think multi-terabyte, I think 'yeah, that is mostly in RAM'. 4TB of RAM is sort of the new lower norm in DB machines, and GCP and AWS have high-RAM machines at 12TB and 24TB. Not that these aren't good ideas about scaling to multi-terabytes... it's just interesting how much data has grown in size.


Hi! I'm the author. I've seen this comment several times and thought I'd address it here.

I should've clarified that when I say multi-terabyte, I mean 100s of TBs of data. It's my first time writing anything, so I'm still getting the hang of things.

Damn, that would've been a more clickbaity title, too.

Oh well.


Ah yeah, no worries. It's always good to consider ways to optimize, whether for TB or PB!


Since you are already running on AWS, I wonder if you would consider AWS EMR Serverless an option?

https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGu...

I guess the pricing will be cheaper than renting entire machines since you are doing something like micro-spot allocations for map or reduce jobs.

We have done a lot of analytical and batched workflows with AWS Athena. It is really cheap, and once you get the hang of it, you'd be surprised how much you can achieve with only SQL.

I am hoping EMR Serverless will be roughly equivalent, just with another language...


I’m not familiar with AWS Athena. But I am very familiar with being restricted to SQL for data processing, and the sky really is the limit as long as you're working with tabular data.


Athena is Presto/Trino rebranded. But you don't really pay for CPU, and it can basically scale to 100 CPUs for free.

It is not really a database but an SQL interface for map/reduce jobs. But it can handle any size of data and is pretty efficient when the data is partitioned.

And it has zero setup cost.
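
To illustrate the partitioning point: in Athena you declare the partition column in a CTAS and then filter on it, so only the matching partitions get scanned (and billed). The bucket, table, and column names below are made up.

  -- Write partitioned Parquet back to S3 (partition column must come last).
  CREATE TABLE sales_parquet
  WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/sales_parquet/',
    partitioned_by = ARRAY['dt']
  ) AS
  SELECT customer_id, amount, dt
  FROM raw_sales;

  -- Only the dt partitions in the range are scanned.
  SELECT customer_id, sum(amount) AS total
  FROM sales_parquet
  WHERE dt BETWEEN '2024-01-01' AND '2024-01-31'
  GROUP BY customer_id;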


Ah yes, then I am very familiar with Athena-like tools. At my old job I used PySpark, and I must say I enjoyed how easy it was to make custom Python UDFs in combination with SQL, so I miss that in a pure SQL environment. But for a lot of applications SQL gets the job done.


Yeah, this is why I am hoping that EMR Serverless will be the next step.

You get the freedom of not having to set up the resources. It is cheap because you are basically eating the breadcrumbs of available CPU capacity: "oh, I have 10 idle CPUs here in Italy to do one part of your map job", etc.

But you are not locked into SQL. (Athena has UDFs, but they are not very user friendly.)


https://github.com/Eventual-Inc/Daft is also great at these types of workloads since it’s both distributed and vectorized!



