Hacker News | broner's comments

Works fine for me on TB+ datasets. Maybe you were using an in-memory rather than a persistent database and running out of RAM? https://duckdb.org/docs/stable/clients/cli/overview.html#in-...


Wait, do you insert the data from S3 into duckdb? I was just doing select from file.


Nope, just reading from S3. Check this out: https://duckdb.org/2024/07/09/memory-management.html


Maybe it's your terminal that chokes because it tries to display too much data? 500 GB should be no problem.


You can use DuckDB in ipython to solve the globbing issue. Then you don't have to worry about OOMs with geopandas.
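For illustration, DuckDB-style glob patterns over S3 keys match the same way shell globs do; a minimal sketch using Python's stdlib `fnmatch` (the keys below are made up):

```python
from fnmatch import fnmatch

# Hypothetical object keys; a DuckDB query like
#   SELECT * FROM 's3://bucket/data/year=2024/*.parquet'
# selects keys the way a shell glob would.
keys = [
    "data/year=2023/part-0.parquet",
    "data/year=2024/part-0.parquet",
    "data/year=2024/part-1.parquet",
]
matched = [k for k in keys if fnmatch(k, "data/year=2024/*.parquet")]
print(matched)  # → ['data/year=2024/part-0.parquet', 'data/year=2024/part-1.parquet']
```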


I'm doing this with ClickHouse querying Parquet files on S3 from an EC2 instance in the same region as the S3 bucket (yes, DuckDB is pretty similar). S3 time to first byte within AWS is ~50 ms, and I get close to saturating a big EC2 instance's 100 Gb link doing reads. For OLTP-type queries fetching under 1 MB you'll see ~4 round trips plus the transfer time of the compressed data, so 150-200 ms latency.
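A back-of-the-envelope sketch of that latency estimate. The ~50 ms TTFB and ~4 round trips are from the comment; the payload size and per-request throughput below are illustrative assumptions:

```python
# Rough S3 latency model for a small OLTP-style read.
TTFB_S = 0.050          # S3 time to first byte within AWS (from the comment)
ROUND_TRIPS = 4         # typical request count for a sub-1 MB fetch
PAYLOAD_MB = 1.0        # compressed data fetched (assumed)
LINK_MB_PER_S = 100.0   # assumed effective single-request throughput

latency_s = ROUND_TRIPS * TTFB_S + PAYLOAD_MB / LINK_MB_PER_S
print(f"{latency_s * 1000:.0f} ms")  # → 210 ms
```

Round trips dominate: the transfer itself adds only ~10 ms, which is why the comment's 150-200 ms range barely moves with payload size.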


Are you using the S3 local cache? Do you have heavy writes? What type of S3 disk, if any, are you using (s3, s3_plain, s3_plain_rewritable)? Or are you just using the S3 functions?

ClickHouse is amazing, but I still struggle to get it working efficiently on S3, especially writes.


My workload is 100% read, querying zstd Parquet on S3 Standard. Neither ClickHouse nor DuckDB has a great S3 driver, which is why smart people like https://www.boilingdata.com/ wrote their own. I compared a handful of queries and found DuckDB makes a lot of round trips, while ClickHouse takes the opposite approach and just reads the entire Parquet file.


Switch your blind faith to Go (s5cmd) and get ~1200 MB/s upload speed!

zfs send tank/pgdata@snapshot | pbzip2 > mybackup.zfs.bz2

s5cmd cp mybackup.zfs.bz2 s3://mygooglebucket/

https://github.com/peak/s5cmd/blob/master/README.md#Benchmar...


If S3 is the goal, there is gof3r, which can be piped into, i.e. you can skip storing mybackup.zfs.bz2 locally.

Overall, I didn't see whether they've identified the bottleneck. My guess is pbzip2 is the slowest, ssh second. For compression bandwidth I'd check zstd. For ssh there are various cipher/compression options. Or perhaps skip it altogether and use WireGuard.
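A quick sketch of how to check whether the compressor is the bottleneck: measure throughput on a sample buffer. zstd isn't in the Python stdlib, so zlib at level 1 stands in here for a fast codec against bz2 (the data and figures are illustrative, not a benchmark):

```python
import bz2
import time
import zlib

# Illustrative only: compare compressor throughput on a repetitive sample buffer.
data = b"some highly repetitive backup data " * 100_000  # ~3.5 MB

for name, compress in [("bz2", bz2.compress), ("zlib-1", lambda d: zlib.compress(d, 1))]:
    start = time.perf_counter()
    out = compress(data)
    secs = time.perf_counter() - start
    print(f"{name}: {len(data) / secs / 1e6:.0f} MB/s, ratio {len(data) / len(out):.0f}x")
```

If the compressor's MB/s is far below the network link's, it is the pipeline's bottleneck.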


gof3r looks amazing, thank you!


Should be a tax advantage to the non-grind income. $920k with 20% fed + 5% state capital gains tax?
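Spelling out that arithmetic (the 20% federal and 5% state rates are the comment's hypothetical, not tax advice):

```python
# Hypothetical long-term capital gains treatment of the $920k.
income = 920_000
fed_rate, state_rate = 0.20, 0.05

tax = income * (fed_rate + state_rate)
after_tax = income - tax
print(f"tax: ${tax:,.0f}, after-tax: ${after_tax:,.0f}")  # → tax: $230,000, after-tax: $690,000
```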


Could you use this to create 100s of on-demand read replicas and get 100s of GB/s out of S3? Seems like a nice way to cheaply and quickly do batch jobs.



For $1,500 per TB per month? No thanks.


This looks pretty neat if you're ok moving to AWS https://www.boilingdata.com/


Reminds me of this previous HN submission Blotato: https://news.ycombinator.com/item?id=38206235



As the climate rapidly changes and gets more dangerous, it's good to know what's in store for you. This site gives you a picture: https://climatecheck.com

