
Because then you can access the archived destination if you already know the short URL. You just can't get a full list of potentially sensitive short URL/destination pairs.




You are aware of which thread you're discussing this in, right? The one where a bunch of like-minded souls enumerated all the address space in a few weeks?

The sibling link above that queries Wayback's warc index shows that at least the first several codes are only 6 alphanumeric characters wide, so it's no wonder the ArchiveTeam got them in reasonable time
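For a sense of scale, here is a rough back-of-the-envelope sketch of that address space. It assumes the codes draw from a case-sensitive 62-character alphabet (a-z, A-Z, 0-9), which the thread doesn't confirm, and the request rate is a made-up number purely for illustration.

```python
import string

SECONDS_PER_DAY = 86_400


def keyspace_size(alphabet_size: int, length: int) -> int:
    """Number of distinct codes of exactly `length` characters."""
    return alphabet_size ** length


def days_to_enumerate(total: int, requests_per_second: float) -> float:
    """Days needed to try every code at a steady aggregate request rate."""
    return total / requests_per_second / SECONDS_PER_DAY


# Assumption: mixed-case letters plus digits, i.e. 62 symbols.
ALPHABET = string.ascii_letters + string.digits

total = keyspace_size(len(ALPHABET), 6)
print(f"{total:,} possible 6-character codes")

# Assumption: a hypothetical aggregate rate summed across all volunteers.
hypothetical_rate = 20_000  # requests per second
print(f"{days_to_enumerate(total, hypothetical_rate):.1f} days at {hypothetical_rate:,} req/s")
```

If the codes turn out to be case-insensitive the alphabet shrinks to 36 symbols and the space gets roughly 26 times smaller, so the estimate is very sensitive to that assumption.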

Picking one at random, it seems the super sekrit deets you're safeguarding include buyrussia21.co.kr which, yes, is for sure very, very secret


i asked them why they did this. surprisingly, the answer is that they fear they will get blocked if they release the full dumps, because of the AI scraping wars.

Feels like a bit of a kick in the teeth that I contributed towards archiving something that I don’t even get access to. What happens if they disappear? The dataset is gone forever.

This does seem off. Especially as I can navigate to any of those URLs myself. Hell, if I wanted to spin up 50 virtual servers and go crazy I could probably pay a few thousand bucks to re-scrape the thing myself.

You get access to it via the wayback machine

This whole thread is starting to read like some kind of misguided practical joke. I also recognize that it may seem like this is directed toward you, but I'm not shooting the messenger; I'm just anchoring my reply under this new information. Sorry about that.

But, ok, let's continue in good faith

scenario 1: they don't want to uncork the .warc files because it will potentially leak the means and methods of the Archive Warrior or its usages

scenario 2: they don't want to expose the target of the redirects because it will feed the boundaries of the ravenous AI slurp machines

If it's scenario 1, then CSV exists and allows mapping from the 00aa11 codes to the "location:" header, no means and methods necessary

If it's scenario 2, then what the hell were they expecting to happen? Embargo the .warc until the AI hype blows over so their great-grandchildren can read about how the Internet was back in the day? I guess the real question is "archive for whom?" because right now, unless they have a back-channel way to feed the Wayback Machine's boundary using the .warc files, so that it secretly populates the Wayback without wholesale feeding the AI boundary, this whole thing is just mysterious
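To make the scenario 1 point concrete, here is a minimal sketch of the kind of CSV I mean: short code in one column, the "location:" header value in the other, nothing else from the capture. The function names, the column names, and the idea that you already have each response's raw header text in hand are all assumptions for illustration, not how ArchiveTeam's tooling actually works.

```python
import csv
import io
from typing import Iterable, Optional, TextIO, Tuple


def location_from_response(raw_headers: str) -> Optional[str]:
    """Return the Location header value from raw HTTP response headers, if any."""
    # Skip the status line, stop at the blank line that ends the headers.
    for line in raw_headers.splitlines()[1:]:
        if not line:
            break
        name, _, value = line.partition(":")
        if name.strip().lower() == "location":
            return value.strip()
    return None


def write_mapping(rows: Iterable[Tuple[str, str]], out: TextIO) -> None:
    """Write (short_code, raw_headers) pairs as a two-column CSV."""
    writer = csv.writer(out)
    writer.writerow(["short_code", "location"])  # hypothetical column names
    for code, raw_headers in rows:
        writer.writerow([code, location_from_response(raw_headers) or ""])


# Made-up sample capture for illustration only.
sample = [
    ("00aa11", "HTTP/1.1 301 Moved Permanently\r\nLocation: https://example.com/\r\n\r\n"),
]
buf = io.StringIO()
write_mapping(sample, buf)
print(buf.getvalue())
```

Nothing about the crawler, its headers, or its timing ends up in that file, which is the whole point.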


i think you're missing some key information. the warcs do not just contain the location header information. and their methods are fully public/open source so scenario 1 makes no sense.

sure maybe the warcs will be unlocked at some point in the future. this is a fairly small volunteer effort. i doubt there is some "unlock in 100 years" feature on IA.


Yes exactly, Wayback Machine can use the warc files despite them being blocked for direct download.

Who fears they will get blocked by whom?

ArchiveTeam getting blocked by hosts that want to protect their data from AI companies (presumably because they want to extract money from them)

Yeah what they did is probably the best way to handle it.


