The real magic is this query here (which I got via Twitter after posting this ar...

ies7 · on Nov 17, 2023

TIL. Usually I just went with SELECT * FROM 'folder/prefix*.parquet'

simonw · on Nov 17, 2023

That works for files on disk but not for files fetched via HTTP - though apparently DuckDB can do that for some situations, eg if they are in an S3 bucket that it can list files in.

chrisjc · on Nov 17, 2023

Actually...

(sorry, not picking on you Simon! Awesome post and I just love reading and talking about this stuff)

With duckdb running on python you can register your own file-system adapters. This means that you can do things like intercept globbing, transform urls/path or physically getting files.

This means that you could inject whatever listing/lookup that might be needed for your read_{format}() table function use case.

http://duckdb.org/docs/guides/python/filesystems

However, it is true that there is no standard way to glob over HTTP unless you're using something like S3.

pests · on Nov 17, 2023

Yep - HTTP has no native support for file listings. In the old days it would be served at the index of a url path if no actual file was available - but that was always a feature of the http server and not anything unique to HTTP. Protocols like S3 and FTP have listings built in.

booi · on Nov 17, 2023

HTTP doesn't but I think you can in WebDAV, HTTP's long lost "file management" extension which actually has great support from most HTTP servers

ies7 · on Nov 17, 2023

Yeah my bad. I only used it with S3 and somehow my brain think http will be the same.

BohuTANG · on Nov 21, 2023

For http(s) remote file, we can support glob pattern, for example: https://<thehfurl>/resolve/main/data/0000[00-55].parquet'

Databend supports this pattern: https://databend.rs/doc/load-data/load/http#loading-with-glo...

notachatbot123 · on Nov 17, 2023

Twitter isn't a person, it would be nice to credit the actual human who provided this to you.

simonw · on Nov 17, 2023

I did. Read my article. https://til.simonwillison.net/duckdb/remote-parquet#user-con...

I directly name and link to four people who contributed tips that helped me expand it.

notachatbot123 · on Nov 17, 2023

Cheers and thank you! I had read the article in an earlier stage and thought you had copied that trick only here (where I would still think it would be nice to cite or link the person directly here on HN).

MF-DOOM · on Nov 17, 2023

I mean, it’s not like he created this feature, or that it’s a “weird undocumented trick” he found out…

Do you have to credit every person who shows you anything on the internet nowadays? Or is it just the common twitter-hate on HN?

notachatbot123 · on Nov 17, 2023

It's still an uncredited effort by another person, copied and posted elsewhere as magic trick. I just think it is nice to publicly credit the people via whom one builds one's wisdom.

No Twitter hate at all here, on the contrary.

vdm · on Nov 17, 2023

ok notachatbot123