That works for files on disk but not for files fetched via HTTP - though apparently DuckDB can do that in some situations, e.g. if they are in an S3 bucket that it can list files in.
(sorry, not picking on you Simon! Awesome post and I just love reading and talking about this stuff)
With DuckDB running in Python you can register your own filesystem adapters. This means you can do things like intercept globbing, transform URLs/paths, or take over fetching the files yourself.
That lets you inject whatever listing/lookup logic your read_{format}() table function use case needs.
Yep - HTTP has no native support for file listings. In the old days a listing would be served at the index of a URL path if no index file was available - but that was always a feature of the HTTP server, not of HTTP itself. Protocols like S3 and FTP have listings built in.
Cheers and thank you! I had read the article at an earlier stage and thought you had only copied that trick here (where I would still think it would be nice to cite or link the person directly here on HN).
It's still an uncredited effort by another person, copied and posted elsewhere as a magic trick. I just think it is nice to publicly credit the people through whom one builds one's wisdom.
I love and use both clickhouse and duckdb all the time, but it's become almost comical how senior clickhouse leadership comment on any duckdb-related HN thread like clockwork. And only very rarely bring up their affiliation :)
...says the completely anonymous internet guy. I'm laughing as much as you are.
I wasn't trying to imply anything negative about DuckDB with my post - was just sharing how ClickHouse does the same thing. FWIW: the blog author added my query to his blog, so my non-combative comment was politely received.
but what the person said is true: it seems like clickhouse comments descend upon every recent duckdb post as if it's some sort of competition, or born out of an inferiority complex.
clickhouse is really cool tech. duckdb is really cool tech. i grow weary of the CH infiltration to the point where it’s working against the intent to make me love CH
I ran the same queries and got similar results, but the bandwidth utilization I measured was significantly different. On the same fly.io instance with 1 vCPU / 256MB, both queries completed successfully, but ClickHouse/chdb peaked at 10MB/s and so completed the count faster, while DuckDB only peaked at around 2.5MB/s.
This might be due to the tiny resources, but I like rock-bottom measurements. Did anyone else notice a similar bandwidth utilization gap?
If that's raw network transfer, it's probably just a difference in headers or MTU size: a larger MTU means fewer headers are required. Or maybe a difference in network configuration that requires more or less data in the headers.
Here is some other interesting and related DuckDB SQL that you all might find helpful.
Inspect the Parquet metadata:
SELECT *
FROM parquet_metadata('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet');
If the data were on blob storage you could use a glob instead of a generator:
SELECT SUM(size) AS TOTAL_SIZE
FROM read_parquet('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/*.parquet');
You can use the Hugging Face API to list the files and then read them with DuckDB:
SELECT concat('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/', path) AS parquet_file
FROM read_json_auto('https://huggingface.co/api/datasets/vivym/midjourney-messages/tree/main/data');
So this means we can combine the file-listing and file-reading SQL into a single statement!!!
Error: Binder Error: Table function cannot contain subqueries
:( No
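One workaround (a sketch I haven't verified end to end - it assumes a DuckDB version with SQL variables, i.e. SET VARIABLE / getvariable) is to stash the Hugging Face listing in a variable and then hand it to read_parquet. Not a single statement, but it keeps everything in SQL:
-- Assumes DuckDB 1.1+ with SQL variable support
SET VARIABLE parquet_files = (
    SELECT list(concat('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/', path))
    FROM read_json_auto('https://huggingface.co/api/datasets/vivym/midjourney-messages/tree/main/data')
);
SELECT SUM(size) AS total_size
FROM read_parquet(getvariable('parquet_files'));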
Want the query in a more concise, reusable form?
CREATE MACRO GET_MJ_TOTAL_SIZE(num_of_files) AS TABLE (
SELECT
SUM(size) AS size
FROM read_parquet(
list_transform(
generate_series(0, num_of_files),
n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
format('{:06d}', n) || '.parquet'
)
)
);
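For example: SELECT * FROM GET_MJ_TOTAL_SIZE(55);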
According to this discussion [1], a variety of tools are available on Linux distros (including nettop), but the version baked into Apple's OS may have more features.
Parquet files can also be split into multiple row groups. By ordering the contents by whatever you're most likely to filter on, you can cut down the accessed data even more.
E.g. if you filter on creation time, order the data by that. Then when you query it, DuckDB (or similar) reads the metadata for each row group and skips any that cannot contain the data you want. Ordering by user ID, country, etc. are other common choices. Order by the coarsest group first (so country, then timestamp, if you often filter on both).
You can tune the size of the row groups depending on your access pattern.
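As a rough sketch of doing that with DuckDB (the table and column names here are made up for illustration), you can control both the sort order and the row group size when writing Parquet:
-- Hypothetical table/columns; sort by the coarsest filter column first
COPY (
    SELECT * FROM events ORDER BY country, created_at
) TO 'events.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 100000);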
The trick it's using is the HTTP Range header, which lets a client request e.g. bytes 4500-4900 of a file over HTTP.
Most static file hosting platforms - S3, GCS, nginx, Apache etc - support Range headers. They're most commonly used for streaming video and audio.
The Parquet file format is designed with this in mind. You can read the metadata (stored in the file's footer) and use it to figure out which ranges to fetch. Columns are grouped together, so a sum() against a single column can be handled by fetching a subset of the file.
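You can watch this happen from DuckDB itself: with recent versions, EXPLAIN ANALYZE over httpfs includes HTTP request counts and bytes transferred in the profile output (the exact output depends on your DuckDB version), which makes it easy to confirm that only a slice of the file was fetched:
EXPLAIN ANALYZE
SELECT sum(size) AS size
FROM read_parquet('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet');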
Thank you for explaining. I never really looked into this to understand it, and because of that it felt like magic, which is always an indicator that you just don't understand something. I was going to add "in tech", but it is an indicator of that with anything in life.
Parquet is columnar: you can read each column independently, and a different compression scheme can be used for each column. You first have to read the beginning of the file to get the layout, though (and the end of it as well, according to a library I have been using - I don't know why). The HTTP server must of course support range requests.
The offsets of said metadata are well defined (i.e. in the footer), so for S3 / blob storage, as long as you can efficiently request a range of bytes, you can pull the metadata without having to read all the data.
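You can inspect those offsets with DuckDB's parquet_metadata() - each column chunk reports where its data starts and how many compressed bytes it occupies, which is exactly what the range requests are built from (column names may vary slightly between DuckDB versions):
SELECT row_group_id, path_in_schema, data_page_offset, total_compressed_size
FROM parquet_metadata('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet');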
If you want something that looks like a table, while still benefiting from not having to download 8GB of Parquet in order to run the query, you can get one using a CTE:
with midjourney_messages as (
select
*
from read_parquet(
list_transform(
generate_series(0, 2),
n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
format('{:06d}', n) || '.parquet'
)
)
)
select sum(size) as size from midjourney_messages;
Or you can create a view:
create view midjourney_messages as
select * from read_parquet(
list_transform(
generate_series(0, 2),
n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
format('{:06d}', n) || '.parquet'
)
);
select sum(size) as size from midjourney_messages;
Using a view in this example, you can't dynamically change which files are being selected (not even with joins or WHERE clauses). What if new files are generated and suddenly there are more or fewer files? So you probably wouldn't want to encapsulate this SQL in a view. Most of the time you would bind the list of files into your SQL as needed:
SELECT SUM(size) AS size FROM read_parquet(:files);
But in this case, a table macro/function might also be an option:
CREATE MACRO GET_TOTAL_SIZE(num_of_files) AS TABLE (
    SELECT
        SUM(size) AS size
    FROM read_parquet(
        list_transform(
            generate_series(0, num_of_files),
            n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                 format('{:06d}', n) || '.parquet'
        )
    )
);
SELECT * FROM GET_TOTAL_SIZE(55);
Not necessarily related to this article, but CTEs are useful for breaking down a complex query into more understandable chunks. Moreover, you can do interesting things within the CTE's temp tables, like recursion, or freezing/materializing a temp table's results so that it gets evaluated only once instead of every time it is referenced.
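For example (a sketch - the MATERIALIZED hint needs a reasonably recent DuckDB version; it follows the Postgres syntax), this forces the CTE to be evaluated once even though it is referenced twice:
WITH messages AS MATERIALIZED (
    SELECT size
    FROM read_parquet('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet')
)
SELECT
    (SELECT sum(size) FROM messages) AS total_size,
    (SELECT count(*) FROM messages) AS total_rows;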
From my experience with other databases my assumption for DuckDB is:
- Using a raw query, a CTE or a view will have no impact at all on query performance - they'll be optimized down to exactly the same operations (or to a query plan that's similar enough that the differences are negligible)
- CTEs are mainly useful for breaking down more complicated queries - so not great for this example, but really useful the moment you start doing anything more complicated.
- Views are effectively persistent CTEs - they're great if you want to permanently "bookmark" more complex pieces of queries to use later.
and guess what those tables use to store data a lot of the time? Parquet.
I assume that when you say "tables" you mean "external tables" since you're in the data world.
If you didn't, then I guess when you say "tables" you mean "tables" with everything including the kitchen sink - database, compute, etc.? Does the database always have to be running for the table to be accessible? Who is responsible for the database hardware, resources, or services?
Of course there are fantastic new databases like Snowflake and BigQuery that separate compute and storage... but do they, really? Separating storage and compute is just incredible for scaling, suspend/resume, etc. But can you query a Snowflake/BigQuery table without also having to use their compute? Is there a way that I can just get a "table" and not be forced into using a specific compute-engine and all the other bells and whistles?
So when you say "table", where and how do I get one? And to maintain the theme of the article, a columnar/OLAP/analytics "table" in particular?
As you probably know, there are several (external table) options, Apache Iceberg probably being the most talked about one at the moment. External "tables" are just collections of metadata about your files, or conventions about how to lay your files down. When you query these tables with SQL using Athena, Redshift, Snowflake, DuckDB, etc., each and every one of those query engines is reading Parquet files.
(Snowflake, BigQuery and others are working on features to both manage and read Iceberg tables, so I kinda lied earlier.)
As much as I love to tout the benefits of duckdb, the ability to only pull down the columns that you need is not really functionality that is enabled by or exclusive to duckdb.
It really has to do with the parquet file format. And it gets even more powerful than just being able to download specific columns. To a certain degree, you can also limit rows at the same time.
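A small sketch of that (the threshold below is arbitrary, just for illustration): DuckDB pushes both the column projection and the filter down into the Parquet reader, so only the size column is fetched, and row groups whose min/max statistics rule out the predicate can be skipped entirely.
SELECT count(*) AS big_messages
FROM read_parquet('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet')
WHERE size > 1000000;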
I still don't understand DuckDB's use case. To handle a huge amount of data, you need a large database connected to the network. DuckDB will eventually hit a ceiling.
DuckDB use case? I remember being in the same spot. Then I realized "the" use case for DuckDB didn't really exist. There were dozens of use-cases. No wonder I couldn't figure out "the" use case from reading all the articles/blogs/hype people were creating about what they were doing with DuckDB. It seemed like everyone was using it in different ways. It's a data engineering/analytics Swiss army knife.
ETL/ELT, ad-hoc/exploratory, munger/wrangler, edge-analytics, and so on...
> DuckDB will eventually hit a ceiling.
Yeah, probably. Just about every database has hit a ceiling in the past. But then someone comes out with some fantastic new idea to overcome the challenges to some degree. Map-reduce, moving the query to the data instead of the data to the query, serverless, separating storage/compute...
However, what if we start thinking along these lines with DuckDB?
Reading Parquet files addresses separating storage and compute. Parquet also gives us columnar, row-grouped data and therefore predicate pushdown (so it kinda moves part of the query closer to the data).
We can run a DuckDB instance (EC2/S3) closer to the data, so that sorta helps too.
What I'm really excited about is using DuckDB in a way similar to map-reduce. What if there were a way to take some SQL's logical plan and turn it into a physical plan that uses compute resources from a pool of (serverless) DuckDB instances? Starting at the leaves of the graph (physical plan), they would pull/filter data from the source (Parquet files) and return their completed work up the branches until it is finished and ready to be served up as results.
I've seen a few examples of this already, but nothing that I would consider production ready. I have a hunch that someone is going to drop such a project on us shortly, and it's going to change a lot of things we have become used to in the data world.
I think the most interesting thing about duckDB (for us) is that you can "take out the DB" from duckDB and use the built-in engine over a (stateless, ephemeral) table you got from wherever.
We didn't do much map-reduce aside from some quick tests (i.e. the ones in that repo), but we did pursue a larger vision for a serverless lake-house in which SQL and Python co-exist and data flows seamlessly between functions and across languages. In case you're curious, we have a vision paper out and are always happy to chat: https://arxiv.org/pdf/2308.05368.pdf
It fits a pretty nice niche between small data (32GB in-memory, pandas level) and truly huge data where the disk files can't even fit on one machine (2TB+). Most data that people want to process on a daily cadence fit somewhere in between those ranges.
As for whether to use a database or not, I think that's a more fundamental question about where your different read and write loads are coming from. Scaling a database to support billions of rows with high query performance is not trivial; if your use case is more write-infrequently, read-often, then OLAP is a pretty good choice of index structure.
> It fits a pretty nice niche between small data (32GB in-memory, pandas level) and truly huge data where the disk files can't even fit on one machine (2TB+)
you can build/rent a server with, say, a 50TB NVMe RAID and run DuckDB on it?..
It's for analytics. It's a tool that can run on my laptop that means I can run analytical queries against potentially TBs of data hosted in an S3 bucket, without needing to spin up an expensive data warehouse.
No no, it is insanely useful and I wish I had a job where I could use it more! No one really has big big data, they only think they do.
You can do the analytics that Databricks and Snowflake are trying to sell your boss for $100,000 a month, only on an $80 VPS and a couple of S3 buckets full of Parquet files.
It's a fantastic tool to run analytics. Lean, purposeful. I hope they can manage to stay out of the clutches of corporate enshittification for a while longer.
You seem to be quite sure of how useful duckdb is even though you haven't found a use case for it. This is actually a common pattern I've seen everywhere - the people super excited about duckdb are people who never use it in their job, or at best use it for toy use cases.
If your data is truly small enough to run on a VPS with duckdb, your monthly Snowflake bill will not break a few hundred dollars by any stretch. A terabyte stored on Snowflake costs 23 bucks a month, and running a query on it, depending on complexity, will cost no more than a dollar. And you don't pay any cost other than these two.
Wow, I didn't know that Snowflake charges at the query level. Very good business idea: first create dependency, then rip customers off once their business grows.
> Wow, I didn't know that Snowflake charges at the query level.
It doesn't. You get charged "credits" for virtual data warehouse operation plus storage. A credit corresponds (approximately) to running a VM of a certain size for an hour. You can turn VDWs on and off more or less at will.
Not taking a position on whether that's good for users or not. Just pointing out how it works.
Not disagreeing, but pick a lane - is your data too small or too big? The discussion here was about data that's allegedly too small to be called "big data"...