Servers have grown much bigger. 256 GB of RAM is pretty much a standard rack server on a major cloud, and on AWS an ultra-memory instance with 24 TB of RAM is one API call away :-). The article takes note of that, but the choice of tooling for a single node puzzles me.
So, multi-TB can still be a single node with DuckDB, or a rather small ClickHouse cluster. It sounds easier to use DuckDB than the proposed tools.
In many use cases, if it's not 24/7, you can use a data lake (e.g. Iceberg) and query it only when needed. Databricks seems like the way to go, since the author uses Spark.
Iceberg is working hard to support pure python[0] / rust[1] workflows without Spark. Following Tabular's acquisition [2], I hope it still moves in this direction at the same clip (very impressed with the speed of development on pyiceberg).
We're using iceberg + duckdb to power analytics in our app[3] and I'm really happy with the combo.
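For anyone curious what that combo looks like, here's a minimal sketch using DuckDB's iceberg and httpfs extensions (the bucket path, table location, and column names are made up):

    INSTALL iceberg;
    LOAD iceberg;
    INSTALL httpfs;
    LOAD httpfs;

    -- hypothetical Iceberg table location and columns
    SELECT event_date, count(*) AS events
    FROM iceberg_scan('s3://my-bucket/warehouse/analytics/events')
    GROUP BY event_date
    ORDER BY event_date;

You can also resolve the table through a catalog with pyiceberg and hand DuckDB an Arrow table instead of pointing at the storage path directly.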
Probably in the sense of "you're only running the really heavy analytics queries once a day/week/month", and thus don't constantly need the 24TB instances.
It's pretty common to run analytics workloads on a fixed schedule. I've set up analytics workflows that ingest data every day/week/quarter and generate a report once a day.
With workflows like that, something like DuckDB can be amazing.
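For concreteness, a scheduled job like that can be little more than a single DuckDB query run from cron; a sketch, with hypothetical file paths and column names:

    -- read the latest Parquet drop and write the report back out as Parquet
    COPY (
        SELECT customer_id, sum(amount) AS total_spend, count(*) AS orders
        FROM read_parquet('ingest/sales_*.parquet')
        GROUP BY customer_id
        ORDER BY total_spend DESC
    ) TO 'reports/daily_spend.parquet' (FORMAT PARQUET);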
It is very common for the majority of the work to happen during business hours. In some setups it's easy: spin up a beefy instance for that window and avoid having it running all the time.
I wonder what the standard for "big" is in this context. For example, I always thought that a few million rows was very big, but I only recently learned that it's a size an RDBMS such as PostgreSQL can handle with no problem.
Whatever "big" is defined as, it should at least mean that your data can't fit into RAM on a high-end server.
There's also the threshold where your indexes don't fit into RAM. And the threshold where your data no longer fits into PCIe SSDs on a single server. (The combined bandwidth of the SSDs will rival the RAM, but with more latency.)
These days I’d probably describe “big” as “doesn’t make sense to use SQL anymore”.
Qualitatively, I think it becomes “big” when you have to leave the space of generic “it just works” technologies and you have to start doing bespoke optimizations. It’s amazing how far you can get these days without having to go custom.
These days terabytes is a medium-sized database. A trillion rows of indexed data will fit on a single cloud VM and be reasonably performant.
I think a good definition of "large" is "several times larger than will fit on a practical modern server". Servers with a petabyte or more of fast attached storage are increasingly common, so that threshold is pretty high. Machine-generated data models (sensing, telemetry, et al) routinely exceed this threshold though.
As someone who does this all the time on 10B+ record tables: not really? If you don't have the extra resources to build the occasional index, your DB is under-provisioned and you're close to falling over, cluster or not.
Ah, I forgot the grandparent comment said trillion. Yeah, that's an order of magnitude I would distribute across a cluster if we were in active development and indexes were changing, etc. If you're storing that much data you should be able to afford it :)
Based on the numbers you gave, you would need a sustained 50-100 GB/s from your disk system just to read 3 TB of data in 40-60 seconds. Would you mind sharing your setup?
Probably an ignorant question, but how do I avoid the issues of a cross join across several tables (or the equivalent)? It seems like I always need many TBs of memory.
Do you spin up a u-18tb1.112xlarge or something equivalent as needed, or are you running it all day? I've never used ClickHouse, but I've dealt with a Spark system that was written like 10 years ago. I'm wondering if it's something easily migrated to ClickHouse?
You can use awk for that and pipe it into separate files for each bucket. And then you awk a bit more, and then you are done.
This was sarcasm, based on several earlier discussions. And I am prepared for the downvotes.
I think awk is fine for some things, but awk proponents underestimate the scale of a map-reduce job with multiple dimensions in both directions.
For example, I believe awk will struggle with a question like "of one million customers and one billion sales, give me the top 10 spenders". And to stick to the topic, I believe a single-node database will also struggle.
> one million customers and one billion sales, give me the top 10 spenders... i believe a one node database will also struggle.
Doubtful. Assume a simplified (but still normalized) schema like below:
    CREATE TABLE customer (
        created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        email TEXT NOT NULL,
        first_name TEXT NOT NULL,
        middle_name TEXT NULL,
        last_name TEXT NULL
    );

    CREATE TABLE product (
        created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        is_active BOOL NOT NULL DEFAULT TRUE,
        name TEXT NOT NULL,
        description TEXT,
        price NUMERIC(8,2) NOT NULL
    );

    CREATE TABLE sale (
        id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        customer_id INT NOT NULL,
        FOREIGN KEY (customer_id) REFERENCES customer (id)
    );
    CREATE TABLE sale_item (
        sale_id BIGINT NOT NULL,
        -- BIGINT here: ~1E10 rows would overflow an INT identity column
        id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        product_id INT NOT NULL,
        product_qty SMALLINT NOT NULL,
        product_price NUMERIC(8,2) NOT NULL,
        FOREIGN KEY (sale_id) REFERENCES sale (id),
        FOREIGN KEY (product_id) REFERENCES product (id)
    );
An empty row in Postgres is 24 bytes, and with the optimal column ordering above, a `sale` row would be 24 + 8 + 8 + 4 = 44 bytes. `customer` would be 40 bytes, plus the actual length of the text fields. Assume 30 bytes for each to cover most cases, so a total of 160 bytes per row. That's 1E9 sales and 1E6 customers, so 44 GB for `sale` and 160 MB for `customer`.
Assume 10,000 products, all active. `product` is 44 bytes per row, plus the actual length of the text. Assume `name` is 30 bytes, and `description` is 300 bytes, for a total of 374 bytes per row, so ~4 MB.
Assume each sale had 10 items. With the BIGINT id, `sale_item` is 51 bytes per row. 1E9 sales * 10 items per sale, so ~510 GB for `sale_item`.
Each table will presumably also have some indices other than the PK. Since the query in question is not likely to be run frequently, you don't have to worry about fitting it entirely into RAM (but if you did want to, you could on larger instances). Alternatively, you could create a materialized view, then create a descending index on the total. This would also have the benefit of severely reducing the buffer thrashing, since you'd only need to read in `sale_item` periodically to refresh the materialized view, and the latter would be quite small in comparison (since it would have a number of rows equal to `customer`).
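For reference, the query in question and the materialized-view variant described above would look something like this against that schema:

    -- top 10 spenders, straight off the normalized tables
    SELECT s.customer_id, sum(si.product_qty * si.product_price) AS total_spend
    FROM sale s
    JOIN sale_item si ON si.sale_id = s.id
    GROUP BY s.customer_id
    ORDER BY total_spend DESC
    LIMIT 10;

    -- or precompute it, so the big sale_item scan only happens on refresh
    -- (periodically: REFRESH MATERIALIZED VIEW customer_spend;)
    CREATE MATERIALIZED VIEW customer_spend AS
    SELECT s.customer_id, sum(si.product_qty * si.product_price) AS total_spend
    FROM sale s
    JOIN sale_item si ON si.sale_id = s.id
    GROUP BY s.customer_id;

    CREATE INDEX ON customer_spend (total_spend DESC);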
I have nothing to say other than thanks for the writeup.
And you are most likely correct.
I think what I (and OP) wanted to say was that there should be a stage where you store the intermediate results of the aggregation as well. Unless you have that in an index already. And that also costs.
But you may be right in that it is possibly in the scale of a million so not a huge deal.
Well, that is only true (and helpful) if it is a dataset that benefits from being in a tabular format.
Consider, for instance, particle-based simulation data (loading that into any database is a waste of time...), or a set of images that you need to perform non-standard individualized processing on (again, a waste of time).
Stuffing stuff into postgres or clickhouse is great for your typical 'data science' workflows where data is consistent and the problem is 'oh no our transaction volume has increased 100x'. But in other (some would claim more interesting) cases, using database systems is unhelpful.
You're right! The context that I was talking about was actually in the GIS space! Depending on what you're trying to do, it's not uncommon to have a single dataset that's 50TB large (think: 100-500m resolution global raster with daily data for 30 years)
And that's not even considering any memory overhead when performing operations on top of that data.
That type of stuff wouldn't fit into Postgres (we tried once, a long time ago), and usually, you might be trying to use multiple datasets simultaneously, so that data adds up pretty quickly!
It's my first post (and writing) ever, so I have room for improvement. Thanks for taking the time to read!
When I first read the title, I had in my head that it said multi-petabyte. When I think multi-terabyte, I think "yeah, that's mostly in RAM." 4 TB of RAM is sort of the new lower norm in DB machines, and GCP and AWS have high-memory machines at 12 TB and 24 TB. Not that these aren't good ideas about scaling to multiple terabytes... it is just interesting how much data has grown in size.
Hi! I'm the author. I've seen this comment several times and thought I'd address it here.
I should've clarified that when I say multi-terabyte, I mean hundreds of TBs of data. It's my first time writing anything, so I'm still getting the hang of things.
Damn, that would've been a more clickbait title, too.
I guess the pricing will be cheaper than renting entire machines since you are doing something like micro-spot allocations for map or reduce jobs.
We have done a lot of analytical and batched workflows with AWS Athena. It is really cheap, and once you get the hang of it, you get surprised by how much you can achieve with only SQL.
I am hoping serverless EMR would be roughly equivalent, just with another language...
I’m not familiar with AWS Athena. But I am very familiar with being restricted to SQL for data processing and the sky really is the limit as long as you’re working with tabular data.
Athena is Presto/Trino rebranded. But you don't really pay for CPU time (pricing is per terabyte scanned), and it can basically scale out to 100 CPUs for free.
It is not really a database but an SQL interface for map/reduce-style jobs. It can handle any size of data and is pretty efficient when the data is partitioned.
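A sketch of what that looks like in practice, with a made-up bucket, columns, and partition scheme; the table is just a pointer at Parquet files in S3, and filtering on the partition column keeps the scanned bytes (and therefore the bill) small:

    -- hypothetical Athena table over Parquet files in S3, partitioned by day
    CREATE EXTERNAL TABLE events (
        user_id    BIGINT,
        event_type STRING,
        payload    STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/events/';

    -- register the day partitions (or use partition projection)
    MSCK REPAIR TABLE events;

    -- only the January prefixes get scanned (and billed)
    SELECT event_type, count(*) AS n
    FROM events
    WHERE dt BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY event_type;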
Ah yes, then I am very familiar with Athena-like tools. At my old job I used PySpark, and I must say I enjoyed how easy it was to write custom Python UDFs in combination with SQL, so I miss that in a pure-SQL environment. But for a lot of applications SQL gets the job done.
Yeah, this is why I am hoping that EMR Serverless would be the next step.
You get the freedom of not having to set up the resources. It is cheap because you are basically eating the breadcrumbs of available CPU capacity: "oh, I have 10 idle CPUs here in Italy to do one part of your map job", etc.
But you are not locked into SQL. (Athena has UDFs, but they are not very user-friendly.)