Hacker News | broner's comments

Works fine for me on TB+ datasets. Maybe you were using an in-memory rather than a persistent database and running out of RAM? https://duckdb.org/docs/stable/clients/cli/overview.html#in-...


Wait, do you insert the data from S3 into duckdb? I was just doing select from file.


Nope, just reading from S3. Check this out: https://duckdb.org/2024/07/09/memory-management.html


Maybe it's your terminal that chokes because it tries to display too much data? 500 GB should be no problem.


You can use DuckDB in ipython to solve the globbing issue. Then you don't have to worry about OOMs with geopandas.
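For illustration, DuckDB-style glob patterns over S3 keys match the same way shell globs do; a minimal sketch using Python's stdlib `fnmatch` (the keys below are made up):

```python
from fnmatch import fnmatch

# Hypothetical object keys; a DuckDB query like
#   SELECT * FROM 's3://bucket/data/year=2024/*.parquet'
# selects keys the way a shell glob would.
keys = [
    "data/year=2023/part-0.parquet",
    "data/year=2024/part-0.parquet",
    "data/year=2024/part-1.parquet",
]
matched = [k for k in keys if fnmatch(k, "data/year=2024/*.parquet")]
print(matched)  # → ['data/year=2024/part-0.parquet', 'data/year=2024/part-1.parquet']
```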


I'm doing this with ClickHouse querying Parquet files on S3 from an EC2 instance in the same region as the S3 bucket (yes, DuckDB is pretty similar). S3 time to first byte within AWS is ~50 ms, and I get close to saturating a big EC2 instance's 100 Gb link doing reads. For OLTP-type queries fetching under 1 MB you'll see ~4 round trips plus the transfer time of the compressed data, so 150-200 ms latency.
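A back-of-the-envelope sketch of that latency estimate. The ~50 ms TTFB and ~4 round trips are from the comment; the payload size and per-request throughput below are illustrative assumptions:

```python
# Rough S3 latency model for a small OLTP-style read.
TTFB_S = 0.050          # S3 time to first byte within AWS (from the comment)
ROUND_TRIPS = 4         # typical request count for a sub-1 MB fetch
PAYLOAD_MB = 1.0        # compressed data fetched (assumed)
LINK_MB_PER_S = 100.0   # assumed effective single-request throughput

latency_s = ROUND_TRIPS * TTFB_S + PAYLOAD_MB / LINK_MB_PER_S
print(f"{latency_s * 1000:.0f} ms")  # → 210 ms
```

Round trips dominate: the transfer itself adds only ~10 ms, which is why the comment's 150-200 ms range barely moves with payload size.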


Are you using the S3 local cache? Do you have heavy writes? What type of S3 disk, if any, are you using (s3, s3_plain, s3_plain_rewritable)? Or are you just using the S3 functions?

ClickHouse is amazing, but I still struggle to get it working efficiently on S3, especially writes.


My workload is 100% read, querying zstd Parquet on S3 Standard. Neither ClickHouse nor DuckDB has a great S3 driver, which is why smart people like https://www.boilingdata.com/ wrote their own. I compared a handful of queries and found DuckDB makes a lot of round trips, while ClickHouse takes the opposite approach and just reads the entire Parquet file.


Switch your blind faith to Go (s5cmd) and get ~1200 MB/s upload speed!

zfs send tank/pgdata@snapshot | pbzip2 > mybackup.zfs.bz2

s5cmd cp mybackup.zfs.bz2 s3://mygooglebucket/

https://github.com/peak/s5cmd/blob/master/README.md#Benchmar...


If S3 is the goal, there is gof3r, which can be piped into, i.e. you can skip storing mybackup.zfs.bz2 locally.

Overall, I didn't see whether they've identified the bottleneck. My guess is pbzip2 is the slowest, ssh second. For compression bandwidth I'd check zstd. For ssh there are various cipher/compression options. Or perhaps skip it altogether and use WireGuard.
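A quick sketch of how to check whether the compressor is the bottleneck: measure throughput on a sample buffer. zstd isn't in the Python stdlib, so zlib at level 1 stands in here for a fast codec against bz2 (the data and figures are illustrative, not a benchmark):

```python
import bz2
import time
import zlib

# Illustrative only: compare compressor throughput on a repetitive sample buffer.
data = b"some highly repetitive backup data " * 100_000  # ~3.5 MB

for name, compress in [("bz2", bz2.compress), ("zlib-1", lambda d: zlib.compress(d, 1))]:
    start = time.perf_counter()
    out = compress(data)
    secs = time.perf_counter() - start
    print(f"{name}: {len(data) / secs / 1e6:.0f} MB/s, ratio {len(data) / len(out):.0f}x")
```

If the compressor's MB/s is far below the network link's, it is the pipeline's bottleneck.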


gof3r looks amazing, thank you!


Should be a tax advantage to the non-grind income. $920k with 20% fed + 5% state capital gains tax?
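Spelling out that arithmetic (the 20% federal and 5% state rates are the comment's hypothetical, not tax advice):

```python
# Hypothetical long-term capital gains treatment of the $920k.
income = 920_000
fed_rate, state_rate = 0.20, 0.05

tax = income * (fed_rate + state_rate)
after_tax = income - tax
print(f"tax: ${tax:,.0f}, after-tax: ${after_tax:,.0f}")  # → tax: $230,000, after-tax: $690,000
```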


Could you use this to create 100s of on-demand read replicas and get 100s of GB/s out of S3? Seems like a nice way to cheaply and quickly do batch jobs.



For $1,500 per TB per month? No thanks.


This looks pretty neat if you're ok moving to AWS https://www.boilingdata.com/


Reminds me of this previous HN submission Blotato: https://news.ycombinator.com/item?id=38206235



As the climate rapidly changes and gets more dangerous, it's good to know what's in store for you. This site gives you a picture: https://climatecheck.com

