Hacker News
A 16TB Mirror of Data.gov on Source.Coop (source.coop)
139 points by dbreunig 39 days ago | 18 comments



I'm surprised there is no torrent available


If I had to guess, this is the problem:

> This is a regularly updated mirror

P2P software really struggles with mutability.

We need stable chunking (i.e. rolling-hash chunking around file boundaries) and seeders that seed addressable chunks (similar to Bitswap), so that seeders from the previous revision are preserved when a 16TB archive partially changes.
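To make the "stable chunking" idea concrete, here is a minimal sketch of content-defined chunking with a polynomial rolling hash. It is not the commenter's implementation or any particular client's; the constants (`WINDOW`, `MASK`, `MIN_SIZE`, `MAX_SIZE`) and the function names are assumptions picked for illustration. Cut points depend on the last few bytes of content rather than on absolute offsets, so an insertion near the start of a file shifts later chunks without changing their hashes.

```python
import hashlib

# All constants below are illustrative assumptions, not values from any real client.
WINDOW = 48            # number of trailing bytes the rolling hash covers
MASK = (1 << 12) - 1   # cut when the low 12 bits are zero (about 4 KiB between cuts)
MIN_SIZE = 1024        # never cut a chunk shorter than this
MAX_SIZE = 16384       # always cut once a chunk reaches this size
BASE = 257
MOD = (1 << 61) - 1


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks in data."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window
    start = 0
    h = 0
    for i, b in enumerate(data):
        pos = i - start  # index of this byte within the current chunk
        if pos >= WINDOW:
            # Slide the window: drop the byte that is now WINDOW positions back.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        size = pos + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            yield start, i + 1
            start = i + 1
            h = 0  # restart the hash for the next chunk
    if start < len(data):
        yield start, len(data)  # trailing partial chunk


def chunk_digests(data: bytes) -> list[str]:
    """Return the SHA-256 hex digest of every chunk, in order."""
    return [hashlib.sha256(data[s:e]).hexdigest() for s, e in chunk_boundaries(data)]
```

Comparing `set(chunk_digests(old))` with `set(chunk_digests(new))` for two revisions of a file shows which chunks a peer holding the old revision could still serve. With fixed-size pieces, by contrast, a single inserted byte changes every piece after it.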

This is something my team is actively researching alongside other p2p problem statements.

If it interests you, you know Rust, and you’re looking for a job, email me.


There's a BEP for updatable torrents, and a number of features of BitTorrent v2 make it easier.

Most archive torrents like this that I see freeze the older contents so they don't change in later releases, or they release new torrents with the new data and keep using the old ones.


Thanks, will take a look at the BEP.


For real, I have a 20TB drive available, and I would seed this indefinitely.


Yo, just since you mention that -- got any recs for a 20TB drive? Currently the biggest thing I rock is a Seagate 8TB NAS HDD, which I nabbed pretty cheap, but these days the archiving I do kinda makes me wanna step up even bigger.


I got this on Black Friday for $270:

https://www.newegg.com/wd-elements-20tb-black-usb-3-0/p/N82E...

It's $350 now, but that's still not bad for 20TB.





That is the point. "Redundant" duplication of data is needed.


The public facing data was never the problem.

We have no idea what is being nuked off the backend, copied offline, and fed into AI.

Employee records, tax records, medical records, things like that.

Then there is the problem of private servers plugged into government networks that are obviously bypassing firewalls and prime targets for foreign governments because they aren't secured and designed to be remotely accessed.


Do you have any examples of being able to get sensitive info from an LLM? I've never heard of it.


One of the kids was asking on Twitter which LLM to use for PDF parsing.


> Then there is the problem of private servers plugged into government networks that are obviously bypassing firewalls and prime targets for foreign governments because they aren't secured and designed to be remotely accessed.

This isn't getting ... any attention. (Though, I've contacted my congressional representative and I urge everyone else to do the same.) Any concerns have been dismissed as FUD and allayed by reassurances that _their access is read-only_, which is a wet fucking blanket if ever there was one. I know if I were a nation state with access to Pegasus or similar systems, I'd be actively targeting these DOGEngineers. But, sure, cry me a fucking river about DeepSeek.


How do they guarantee none of it is modified?


BagIt and sha256 and dropping Harvard's name and "look, squirrel!"

https://en.wikipedia.org/wiki/BagIt
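For anyone wondering what the BagIt plus sha256 check amounts to in practice, here is a rough sketch, assuming a bag directory that contains a `manifest-sha256.txt` with one `<checksum> <relative path>` entry per line, as the BagIt spec describes. The function names are made up for illustration, and this is not the mirror's own tooling. It only shows that the payload files still match the manifest; whether the manifest itself can be trusted is a separate question.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large payload files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_bag(bag_dir: str) -> list[str]:
    """Return manifest paths that are missing or whose SHA-256 does not match.

    Simplifications (assumptions of this sketch): percent-encoded characters
    in paths are not decoded, tag manifests are not checked, and payload
    files that are absent from the manifest are not reported.
    """
    root = Path(bag_dir)
    failures = []
    manifest = root / "manifest-sha256.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty list from `verify_bag("path/to/bag")` means every listed file is present and matches its recorded digest.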


Comment history certainly living up to your name.



