Hacker News new | past | comments | ask | show | jobs | submit login

The real magic is this query here (which I got via Twitter after posting this article):

    SELECT
        SUM(size) AS size
    FROM read_parquet(
        list_transform(
            generate_series(0, 55),
            n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                format('{:06d}', n) || '.parquet'
        )
    );
Turns out DuckDB has a lambda function feature these days! The n -> syntax.



TIL. Usually I just went with SELECT * FROM 'folder/prefix*.parquet'


That works for files on disk but not for files fetched via HTTP - though apparently DuckDB can do that for some situations, eg if they are in an S3 bucket that it can list files in.


Actually...

(sorry, not picking on you Simon! Awesome post and I just love reading and talking about this stuff)

With duckdb running on python you can register your own file-system adapters. This means that you can do things like intercept globbing, transform urls/path or physically getting files.

This means that you could inject whatever listing/lookup that might be needed for your read_{format}() table function use case.

http://duckdb.org/docs/guides/python/filesystems

However, it is true that there is no standard way to glob over HTTP unless you're using something like S3.


Yep - HTTP has no native support for file listings. In the old days it would be served at the index of a url path if no actual file was available - but that was always a feature of the http server and not anything unique to HTTP. Protocols like S3 and FTP have listings built in.


HTTP doesn't but I think you can in WebDAV, HTTP's long lost "file management" extension which actually has great support from most HTTP servers


Yeah my bad. I only used it with S3 and somehow my brain think http will be the same.


For http(s) remote file, we can support glob pattern, for example: https://<thehfurl>/resolve/main/data/0000[00-55].parquet'

Databend supports this pattern: https://databend.rs/doc/load-data/load/http#loading-with-glo...


Twitter isn't a person, it would be nice to credit the actual human who provided this to you.


I did. Read my article. https://til.simonwillison.net/duckdb/remote-parquet#user-con...

I directly name and link to four people who contributed tips that helped me expand it.


Cheers and thank you! I had read the article in an earlier stage and thought you had copied that trick only here (where I would still think it would be nice to cite or link the person directly here on HN).


I mean, it’s not like he created this feature, or that it’s a “weird undocumented trick” he found out…

Do you have to credit every person who shows you anything on the internet nowadays? Or is it just the common twitter-hate on HN?


It's still an uncredited effort by another person, copied and posted elsewhere as magic trick. I just think it is nice to publicly credit the people via whom one builds one's wisdom.

No Twitter hate at all here, on the contrary.


ok notachatbot123




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: