Summing columns in remote Parquet files using DuckDB (simonwillison.net)
146 points by simonw on Nov 16, 2023 | 75 comments



The real magic is this query here (which I got via Twitter after posting this article):

    SELECT
        SUM(size) AS size
    FROM read_parquet(
        list_transform(
            generate_series(0, 55),
            n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                format('{:06d}', n) || '.parquet'
        )
    );
Turns out DuckDB has a lambda function feature these days! The n -> syntax.
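You can try the lambda on its own too - a trivial sketch using the Python client:

    import duckdb

    # list_transform applies the n -> ... lambda to each element of the list
    duckdb.sql("SELECT list_transform([1, 2, 3], n -> n * 10) AS tens").show()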


TIL. Usually I just went with SELECT * FROM 'folder/prefix*.parquet'


That works for files on disk but not for files fetched via HTTP - though apparently DuckDB can do that in some situations, e.g. if they are in an S3 bucket whose files it can list.


Actually...

(sorry, not picking on you Simon! Awesome post and I just love reading and talking about this stuff)

With DuckDB running in Python you can register your own file-system adapters. This means you can do things like intercept globbing, transform URLs/paths, or take over physically fetching the files.

This means you could inject whatever listing/lookup might be needed for your read_{format}() table function use case.

http://duckdb.org/docs/guides/python/filesystems
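Here's a minimal sketch of what that registration looks like, assuming the fsspec package is installed (plus aiohttp for its HTTP filesystem) - a custom fsspec filesystem subclass with its own listing/globbing logic could be registered the same way:

    import duckdb
    from fsspec import filesystem

    # hand DuckDB an fsspec filesystem object; a custom subclass could
    # override globbing, rewrite paths, or change how bytes are fetched
    duckdb.register_filesystem(filesystem("http"))

    duckdb.sql("""
        SELECT count(*)
        FROM read_parquet('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet')
    """).show()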

However, it is true that there is no standard way to glob over HTTP unless you're using something like S3.


Yep - HTTP has no native support for file listings. In the old days a listing would be served as the index of a URL path if no actual index file was available - but that was always a feature of the HTTP server, not part of HTTP itself. Protocols like S3 and FTP have listings built in.


HTTP doesn't, but I think you can with WebDAV, HTTP's long-lost "file management" extension, which actually has great support in most HTTP servers.


Yeah, my bad. I've only used it with S3 and somehow my brain thought HTTP would be the same.


For http(s) remote files, we can support glob patterns, for example: https://<thehfurl>/resolve/main/data/0000[00-55].parquet

Databend supports this pattern: https://databend.rs/doc/load-data/load/http#loading-with-glo...


Twitter isn't a person, it would be nice to credit the actual human who provided this to you.


I did. Read my article. https://til.simonwillison.net/duckdb/remote-parquet#user-con...

I directly name and link to four people who contributed tips that helped me expand it.


Cheers and thank you! I had read the article at an earlier stage and thought you had only shared the trick here (in which case I would still think it would be nice to cite or link the person directly here on HN).


I mean, it’s not like he created this feature, or that it’s a “weird undocumented trick” he found out…

Do you have to credit every person who shows you anything on the internet nowadays? Or is it just the common twitter-hate on HN?


It's still an uncredited effort by another person, copied and posted elsewhere as a magic trick. I just think it is nice to publicly credit the people via whom one builds one's wisdom.

No Twitter hate at all here, on the contrary.


ok notachatbot123


Nice. I did the same thing on ClickHouse - took 11 seconds:

    SELECT sum(size)
    FROM url('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/0000{01..55}.parquet')
    
    Query id: 9d145763-0754-4aa2-bb7d-f6917690f704
    
    ┌───────sum(size)─┐
    │ 159344011148016 │
    └─────────────────┘
    
    1 row in set. Elapsed: 11.615 sec. Processed 54.08 million rows, 8.50 GB (4.66 million rows/s., 731.83 MB/s.)
    Peak memory usage: 458.88 KiB.


I love and use both clickhouse and duckdb all the time, but it's become almost comical how senior clickhouse leadership comment on any duckdb-related HN thread like clockwork. And only very rarely bring up their affiliation :)


You mean this username composed of a random assortment of Latin characters isn't an unaffiliated, independent commentator?!


...says the completely anonymous internet guy. I'm laughing as much as you are.

I wasn't trying to imply anything negative about DuckDB with my post - was just sharing how ClickHouse does the same thing. FWIW: the blog author added my query to his blog, so my non-combative comment was politely received.


but what the person said is true: it seems like clickhouse comments descend upon every recent duckdb post as if it's some sort of competition or born out of an inferiority complex.

clickhouse is really cool tech. duckdb is really cool tech. i grow weary of the CH infiltration to the point where it’s working against the intent to make me love CH


Is it so annoying? I am happy that we have at least two projects that can query big datasets with very reasonable performance.



yes it is quite annoying. i would love to see a CH dedicated post to extol its virtues and not a piggyback “us too” comment on a duckdb post.


Hey man - not trying to get on the bad side of one of my favourite databases! Making an observation, one that others have made as well.


Do you know how much data it had to download to run that query? Did it pull all 8GB?


ClickHouse just reads the values from the one column being summed: https://www.markhneedham.com/blog/2023/11/15/clickhouse-summ...


The same query takes only 1.4 seconds on my server, so I assume that the query does not read all 8 GB.


I ran a network monitor while running that query, it pulled down ~290MB


Almost exactly the same as DuckDB then - I measured ~287MB.


I ran the same queries and got similar results, but the bandwidth utilization I measured was significantly different. On the same fly.io instance (1 vCPU / 256 MB RAM) both queries completed successfully, but ClickHouse/chdb reached 10 MB/s (max) and, logically, completed the count faster, while DuckDB only peaked at around 2.5 MB/s.

This might be due to the tiny resources but I like rock bottom measurements. Did anyone else notice a similar bandwidth utilization gap?


Probably compression


If that's raw network transfer, it's probably just a difference in headers or MTU size. Larger MTU -> fewer headers required. Maybe a difference in network configuration that requires more or less data in the header.


Here is some other interesting and related duckdb SQL that you all might find helpful.

inspect the parquet metadata:

    SELECT 
        * 
    FROM 
        parquet_metadata('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet');
If the data was on blob storage you could use a glob instead of a generator:

    SELECT 
        sum(size) AS TOTAL_SIZE
    FROM 
        read_parquet('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/*.parquet');
You can use the huggingface API to list and then read with duckdb:

    SELECT 
        concat('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/', path) as parquet_file 
    FROM 
        read_json_auto('https://huggingface.co/api/datasets/vivym/midjourney-messages/tree/main/data');
So this means we can combine the list files and read files SQL into a single statement!!!

    Error: Binder Error: Table function cannot contain subqueries
:( No

Want the query in a more concise, reusable form?

    CREATE MACRO GET_MJ_TOTAL_SIZE(num_of_files) AS TABLE (
        SELECT
            SUM(size) AS size
        FROM read_parquet(
            list_transform(
                generate_series(0, num_of_files),
                n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                    format('{:06d}', n) || '.parquet'
            )
        )
    );
You can simply query the table function:

    SELECT * FROM GET_MJ_TOTAL_SIZE(55);
You don't need to run nettop:

    EXPLAIN ANALYZE SELECT * FROM GET_MJ_TOTAL_SIZE(55);

    ┌─────────────────────────────────────┐
    │┌───────────────────────────────────┐│
    ││            HTTP Stats:            ││
    ││                                   ││
    ││            in: 295.5MB            ││
    ││            out: 0 bytes           ││
    ││             #HEAD: 55             ││
    ││             #GET: 166             ││
    ││              #PUT: 0              ││
    ││              #POST: 0             ││
    │└───────────────────────────────────┘│
    └─────────────────────────────────────┘


Neat, thank you. I added the parquet_metadata() tip to my article: https://til.simonwillison.net/duckdb/remote-parquet#user-con...


Includes a bonus TIL on nettop - today I learned that on macOS you can run this to see exactly how much bandwidth a specific process is using:

    nettop -p $PID


According to this discussion [1], a variety of similar tools are available on Linux distros (including a nettop), but the version baked into macOS may have more features.

[1] https://askubuntu.com/questions/257263/how-to-display-networ...


Parquet files are also (or can also be) split into different row groups. By ordering the contents by whatever you're most likely to filter on, you can cut down on accessed data even more.

E.g. if you filter on creation time, order the data by that. Then when you query it, DuckDB (or similar) should read the metadata for each row group and ignore any group that cannot contain the data you want. Ordering by user ID or country, etc., are other common choices. Order by the biggest group first (so country then timestamp if you often filter on both).

You can tune the size of the row groups depending on your access pattern.
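For example, here's a hedged sketch (DuckDB from Python, with a hypothetical events table) of writing a Parquet file sorted by the columns you filter on most, with an explicit row group size:

    import duckdb

    # hypothetical "events" table; sorting by country then created_at means
    # each row group's min/max stats let readers skip non-matching groups
    duckdb.sql("""
        COPY (SELECT * FROM events ORDER BY country, created_at)
        TO 'events.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 100000)
    """)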


This is one of my favorite features of DuckDB and Parquet files!

I combined these capabilities with mounting via NFS and documented in the last part of my post here: https://about.xethub.com/blog/version-control-machine-learni...


Ok that seems like magic. How is it able to only read a subset of the data like that? Is this because of the parquet file format?

Also can the same thing work for files hosted on s3?


Yes, this works for S3 files as well.

The trick it's using is the HTTP Range header, which lets a client request eg bytes 4500-4900 of an HTTP file.

Most static file hosting platforms - S3, GCS, nginx, Apache etc - support Range headers. They're most commonly used for streaming video and audio.

The Parquet file format is designed with this in mind. You can read metadata at the start of the file and use it to figure out which ranges to fetch. Columns are grouped together, so sum() against a column can be handled by fetching a subset of the file.
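A quick way to see this in action is to ask the server for just the first 100 bytes of one of those files (a small sketch; a 206 response means Range requests are honoured):

    import urllib.request

    url = ("https://huggingface.co/datasets/vivym/midjourney-messages/"
           "resolve/main/data/000000.parquet")
    req = urllib.request.Request(url, headers={"Range": "bytes=0-99"})
    with urllib.request.urlopen(req) as resp:
        print(resp.status)       # 206 Partial Content if ranges are supported
        print(len(resp.read()))  # 100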


Thank you for explaining. I never really looked into this to understand it, and because of that it felt like magic, which is always an indicator that you just don't understand something. I was going to add "in tech", but it is an indicator of that with anything in life.


Parquet is columnar: you can read each column independently, and a different compression scheme can be used for each column. You first have to read the beginning of the file to get the layout though (and the end of it as well, according to a library I have been using - don't know why). The HTTP server must of course support range requests.


Right, there's all sorts of metadata and often stats included in any parquet file: https://github.com/apache/parquet-format#file-format

The offsets of said metadata are well-defined (i.e. in the footer), so for S3 / blob storage, as long as you can efficiently request a range of bytes, you can pull the metadata without having to read all the data.


Does anybody know why this thread appears to have just vanished?

It was on page 2 a moment ago, and now it's not listed anywhere (pages 1-14)


you know there's this great abstraction over files we came up with in the data world called 'tables'..


If you want something that looks like a table, while still benefiting from not having to download 8GB of Parquet in order to run the query, you can get one using a CTE:

    with midjourney_messages as (
        select
            *
        from read_parquet(
            list_transform(
                generate_series(0, 2),
                n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                    format('{:06d}', n) || '.parquet'
            )
        )
    )
    select sum(size) as size from midjourney_messages;
Or you can create a view:

    create view midjourney_messages as
    select * from read_parquet(
        list_transform(
            generate_series(0, 2),
            n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                format('{:06d}', n) || '.parquet'
        )
    );
    select sum(size) as size from midjourney_messages;


Are there any pros/cons to the raw query, CTE, and view approaches?


Yes.

Using a view in this example, you can't dynamically change which files are being selected (not even with joins or where clauses). What if new files are generated and suddenly there are more or fewer files? Therefore you probably wouldn't want to encapsulate your SQL into a view. Most of the time you would probably bind the list of files in your SQL as needed:

   SELECT SUM(size) AS size FROM read_parquet(:files);
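Or build the list in the host language - a hedged sketch in Python, assuming a recent DuckDB that accepts a list of paths in read_parquet():

    import duckdb

    # only the first three files here; the list can come from anywhere
    files = [
        "https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/"
        f"{n:06d}.parquet"
        for n in range(3)
    ]
    duckdb.read_parquet(files).aggregate("sum(size) AS size").show()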
But in this case, a table macro/function might also be an option:

   CREATE MACRO GET_TOTAL_SIZE(num_of_files) AS TABLE (
        SELECT
            SUM(size) AS size
        FROM read_parquet(
            list_transform(generate_series(0, num_of_files), n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' || format('{:06d}', n) || '.parquet'
            )
        )
    );

    SELECT * FROM GET_TOTAL_SIZE(55);
Not necessarily related to this article, but CTEs are useful for breaking down a complex query into more understandable chunks. Moreover, you can do interesting things with a CTE's temporary results, like recursion, or freezing/materializing a CTE's result so that it only gets evaluated once instead of every time it gets referenced.

http://duckdb.org/docs/sql/query_syntax/with#recursive-ctes http://duckdb.org/docs/sql/query_syntax/with#materialized-ct...
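For instance, a hedged sketch of the materialized form - the CTE body is evaluated once even though it is referenced twice:

    import duckdb

    # MATERIALIZED freezes the CTE's result so the Parquet file is only
    # scanned a single time for the two subqueries below
    duckdb.sql("""
        WITH sizes AS MATERIALIZED (
            SELECT size
            FROM read_parquet('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet')
        )
        SELECT
            (SELECT sum(size) FROM sizes) AS total_bytes,
            (SELECT count(*) FROM sizes) AS row_count
    """).show()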


From my experience with other databases my assumption for DuckDB is:

- Using a raw query, a CTE or a view will have no impact at all on query performance - they'll be optimized down to exactly the same operations (or to a query plan that's similar enough that the differences are negligible)

- CTEs are mainly useful for breaking down more complicated queries - so not great for this example, but really useful the moment you start doing anything more complicated.

- Views are effectively persistent CTEs - they're great if you want to permanently "bookmark" more complex pieces of queries to use later.

I wrote a bit more about CTEs here: https://datasette.io/tutorials/data-analysis#ctes


and guess what those tables use to store data a lot of the time? Parquet.

I assume that when you say "tables" you mean "external tables" since you're in the data world.

If you didn't then I guess when you say "tables", you mean "tables" with everything including the kitchen sink? Database, compute, etc...? Does the database always have to be running for the table to be accessible? Responsibility for the database hardware, resources or services?

Of course there are fantastic new databases like Snowflake and BigQuery that separate compute and storage... but do they, really? Separating storage and compute is just incredible for scaling, suspend/resume, etc. But can you query a Snowflake/BigQuery table without also having to use their compute? Is there a way that I can just get a "table" and not be forced into using a specific compute-engine and all the other bells and whistles?

So when you say "table", where and how do I get one? And to maintain the theme of the article, a columnar/OLAP/analytics "table" in particular?

As you probably know, there are several (external table) options, Apache Iceberg probably being the most talked-about one at the moment. External "tables" are just collections of metadata about your files, or conventions for how to lay your files down. When you query these tables with SQL using athena, redshift, snowflake, duckdb, etc... each and every one of those query engines is reading parquet files.

(Snowflake, BigQuery and others are working on features to both manage and read Iceberg tables, so i kinda lied earlier)


Use polars it’s faster



Can it run aggregate queries against 8 GiB of remotely hosted data while only pulling down the 287 MiB of columns it needs?


Looks like somebody requested it after reading your TIL. https://github.com/pola-rs/polars/issues/12493#issuecomment-...

It will be in the next release. (later today?)



Nice :-D

Do you happen to know what happened to this thread?

(I recall reading a previous post of yours about tracking HN positions?)

I also asked as a top-level comment: https://news.ycombinator.com/item?id=38304938


As much as I love to tout the benefits of duckdb, the ability to only pull down the columns that you need is not really functionality that is enabled by or exclusive to duckdb.

It really has to do with the parquet file format. And it gets even more powerful than just being able to download specific columns. To a certain degree, you can also limit rows at the same time.

http://peter-hoffmann.com/2020/understand-predicate-pushdown...
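A hedged sketch of what that looks like in practice - the projection (just the size column) and the filter can both be pushed down into the Parquet reader, so only the needed column chunks and row groups are fetched:

    import duckdb

    # only the "size" column is read, and row groups whose min/max stats
    # rule out size > 1000000 can be skipped entirely
    duckdb.sql("""
        SELECT count(*) AS big_messages
        FROM read_parquet('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/000000.parquet')
        WHERE size > 1000000
    """).show()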

All that being said... yes, as far as I know, polars can do all of that. Probably because it has the necessary parquet dependencies.

Now whether you should use polars instead of duckdb or vice versa? Is polars "faster"? It depends


I still don't understand DuckDB's use case. To handle huge amounts of data, you need a large database that's connected to the network. DuckDB will eventually hit a ceiling.


DuckDB use case? I remember being in the same spot. Then I realized "the" use case for DuckDB didn't really exist. There were dozens of use-cases. No wonder I couldn't figure out "the" use case from reading all the articles/blogs/hype people were creating about what they were doing with DuckDB. It seemed like everyone was using it in different ways. It's a data engineering/analytics Swiss army knife.

ETL/ELT, ad-hoc/exploratory, munger/wrangler, edge-analytics, and so on...

> DuckDB will eventually hit a ceiling.

Yeah, probably. Just about every database has hit a ceiling in the past. But then someone comes out with some fantastic new idea to overcome the challenges to some degree. Map-reduce, moving the query to the data instead of the data to the query, serverless, separating storage/compute...

However, what if we start thinking along those lines with DuckDB? Reading parquet files addresses separating storage and compute. Parquet also provides columnar/row-grouped data, giving us push-down predicates (so kinda moving part of the query closer to the data). We can run a DuckDB instance close to the data (e.g. EC2 in the same region as the S3 bucket), so that sorta helps too.

What I'm really excited about is using DuckDB in a similar way to map-reduce. What if there was a way to take some SQL's logical plan and turn it into a physical plan that uses compute resources from a pool of (serverless) DuckDB instances - starting at the leaves of the graph (physical plan), pulling/filtering data from the source (parquet files), and returning the completed work up the branches until it is ready to be served up as results?

I've seen a few examples of this already, but nothing that I would consider production ready. I have a hunch that someone is going to drop such a project on us shortly, and it's going to change a lot of things we have become used to in the data world.

https://github.com/BauplanLabs/quack-reduce https://www.boilingdata.com/


> What if there was a way to take some SQL's logical plan and turn it into a physical plan that uses compute resources from a pool

This is how ClickHouse works on a cluster.

For example, writing the same query as

    SELECT sum(size)
    FROM urlCluster('default', 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/0000{01..55}.parquet')
(note the urlCluster usage)

will give the result in 0.3 seconds on a cluster: https://pastila.nl/?00ef0aac/a54918ef6d3536fad34a5eca0e1157f...


Quack-reduce author here.

I think the most interesting thing about duckDB (for us) is that you can "take out the DB" from duckDB and use the built-in engine over a (stateless, ephemeral) table you got from wherever.

We didn't do much map-reduce aside from some quick tests (i.e. the ones in that repo), but we did pursue a larger vision for a serverless lake-house in which SQL and Python co-exist and data flows seamlessly between functions and across languages. In case you're curious, we have a vision paper out and are always happy to chat: https://arxiv.org/pdf/2308.05368.pdf


I am extremely curious and going to make an effort to read it. Thank you!


It fits a pretty nice niche between small data (32GB in-memory, pandas level) and truly huge data where the disk files can't even fit on one machine (2TB+). Most data that people want to process on a daily cadence fit somewhere in between those ranges.

As for whether to use a database or not, I think that's a more fundamental question about where your different read and write loads are coming from. Scaling a database to support billions of rows and have high query performance is not trivial; if your use-case fits more in a write-infrequently, read-often then OLAP is a pretty good choice of index structure.


> It fits a pretty nice niche between small data (32GB in-memory, pandas level) and truly huge data where the disk files can't even fit on one machine (2TB+)

you can build/rent a server with, say, a 50TB NVMe RAID and run duckdb on it?..


It's for analytics. It's a tool that can run on my laptop that means I can run analytical queries against potentially TBs of data hosted in an S3 bucket, without needing to spin up an expensive data warehouse.


No no, it is insanely useful and I wish I had a job where I could use it more! No one really has big big data, they only think they do.

You can do the analytics that Databricks and Snowflake are trying to sell your boss for $100,000 a month, only then on an $80 VPS and a couple of S3 buckets full of parquet files.

It's a fantastic tool to run analytics. Lean, purposeful. I hope they can manage to stay out of the clutches of corporate enshittification for a while longer.


You seem to be quite sure of how useful duckdb is even though you haven't found a use case for it. This is actually a common pattern I've seen everywhere - the people super excited about duckdb are people who never use it in their job, or at best use it for toy use cases.

If your data is truly small enough to run on a vps with duckdb, your monthly snowflake bill will not break a few hundred dollars by any stretch. A terabyte stored on snowflake costs 23 bucks a month and running a query on it depending on complexity will cost no more than a dollar. And you don’t pay any cost other than these two.


Wow I didn't know that snowflake charges on query level. Very good business idea to first create dependency and rip them off once their business grows.


> Wow I didn't know that snowflake charges on query level.

It doesn't. You get charged "credits" for virtual data warehouse operation plus storage. A credit corresponds (approximately) to running a VM of a certain size for an hour. You can turn VDWs on and off more or less at will.

Not taking a position on whether that's good for users or not. Just pointing out how it works.


Not disagreeing, but pick a lane - is your data too small or too big? The discussion here was about data that's allegedly too small to be called "big data"..


Similar use case to SQLite: an embedded database system, but with a columnar/analytics focus.


He did all this work and wrote a blog post because he didn't want to download... 8 GB of data?


Or because I wanted to learn how to use DuckDB to access remote parquet files, and this was a good opportunity to figure that out.


Totally understandable but just say that in the post then :)

DuckDB has been a godsend to me for certain types of stuff (mostly accessing very large parquet files over S3)


That's kind of implied by the whole TIL site! It's for writing about things I've learned.


It's more that DuckDB and Clickhouse make it wicked easy to query data on remote servers or object stores in a fairly efficient manner.



