The Internet Archive is under a DDoS attack (archive.org)
499 points by toomanyrichies on May 27, 2024 | 211 comments


This is why I’ve gotten into the habit of maintaining my own WWW archive of sites I find interesting. Probably have around 1 TiB now, and One Of These Days I’d like to set my network up so it can serve arbitrary sites directly from local archive to revive any site I want.

I have a `wget-mirror` shell function invoking wget with all the trimmings that takes care of 99% of sites. I’ll edit the full command into this comment when I get home if anybody else wants to start doing the same :)


I assume everyone is familiar with this project, dating back to 1996:

https://en.wikipedia.org/wiki/WWWOFFLE

https://ftp.netbsd.org/pub/pkgsrc/distfiles/wwwoffle-2.9j.tg...

The way the www is going, it seems like downloading a copy of libgen, i.e., nonfiction books, and scimag, i.e., academic journals, via torrent, would be more valuable than archiving websites, in general. These primary sources are part of the material used to train so-called "AI" anyway. The problem is that this so-called "AI" also includes all the garbage from the www.

Worst case is eventually these books and journals will again become publicly inaccessible but "AI" will be offered as a bogus substitute; a future where few people will do research using primary materials anymore, they will just submit questions to a remote "AI" server. Truth will be decimated.



Do we know for sure that they trained on data from libgen etc? It's such a powerful source of information you'd assume they must have, although they would never admit it. There must be a way to test if they have, via enquiring about some niche information only found in certain books.


It is apparently widely suspected that a certain "Books2" dataset mentioned by OpenAI is basically just LibGen:

https://blusharkmedia.medium.com/the-ongoing-battle-against-...

https://techhq.com/2023/09/can-libgen-shadow-library-survive...

https://www.twitter.com/theshawwn/status/1320282152689336320

https://qz.com/openai-books-piracy-microsoft-meta-google-cha...

https://qz.com/shadow-libraries-are-at-the-heart-of-the-moun...

https://goodereader.com/blog/e-book-news/authors-file-lawsui...

When asked about whether this was true, they refused to answer based on confidentiality concerns, then said they had deleted all copies of the dataset, stopped using it, and no longer employed the individuals that compiled it:

https://www.businessinsider.com/openai-destroyed-ai-training...

We do know for a fact that the (non-OpenAI-controlled) "Books3" dataset is just "all of bibliotik":

https://www.twitter.com/theshawwn/status/1320282149329784833

https://github.com/soskek/bookcorpus/issues/27

And we also apparently know for a fact that this was included in the datasets used to train LLaMA:

https://en.wikipedia.org/wiki/The_Pile_(dataset)

https://aicopyright.substack.com/p/the-books-used-to-train-l...

https://aicopyright.substack.com/p/has-your-book-been-used-t...


Thanks a lot for all the links. Fascinating stuff.


We already kind of see this with search.


This "AI" nonsense seems like a legitimate threat to literacy. Why would young people read a nonfiction book when they can just send questions to a so-called "tech" company that has used the book in training a LLM. These companies, needless intermediaries with zero experise on the subject matter of the book, exist only to collect data and use it for commercial purposes. Unlike the books' authors and publishers they have no responsibility for publishing information that is adequately researched and factually correct.


s/experise/expertise/


Same. I've been scraping PDF'ed magazines, etc. and keeping them locally. In addition to feeding my byte-hoarding tendencies, I like the idea I could be off-grid in my van/RV somewhere and reading a "Popular Electronics" magazine from 1972 on my laptop.

(Oh, never mind YouTube videos that I once added to playlists ... that later disappear leaving only holes in my playlists.)


My problem with this approach is that the stuff I want to look at in 10 yrs time is never the stuff I think of saving right now. In the 2000s there were browser extensions I've forgotten the names of (shelf? slogger?) that would automatically save local copies of every webpage on page load. But I don't think they're around anymore and have no idea how you could achieve similar functionality with dynamic pages anyway.


> But I don't think they're around anymore and have no idea how you could achieve similar functionality with dynamic pages anyway.

Chromium's MHTML "Save as…" and the SingleFile WebExtension should both save copies of the rendered DOM.

Apparently Safari has WebArchive and Mozilla had MAFF for similar use cases.

I think WARC is supposed to save enough data about network streams for dynamic pages to work. At least on the Wayback Machine, infinite scrolling and "Load More" buttons do kinda work sometimes. You may have to load the archived pages in a browser and try to use each dynamic feature at least once, to trigger requests for needed resources.

SingleFile: https://github.com/gildas-lormeau/SingleFile

LWN on WARC, tools: https://anarc.at/blog/2018-10-04-archiving-web-sites/

Self-hostable web archives: https://awesome-selfhosted.net/tags/archiving-and-digital-pr...

Wayback Machine addons, bookmarklets: https://help.archive.org/help/save-pages-in-the-wayback-mach...


> But I don't think they're around anymore and have no idea how you could achieve similar functionality with dynamic pages anyway.

It is probably easiest to save the render as a picture and then store text separately for searchability?


There's a way to get "the best of both worlds" (tbh, works most of the time): print to pdf.
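For pages where that works, it can even be scripted; headless Chromium can print to PDF from the command line (binary name varies by install: chrome, chromium, google-chrome):

  # render the page with a real browser engine, then save it as a PDF
  chromium --headless --print-to-pdf=page.pdf 'https://example.com/'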


How is this a solution? The Archive performs a valuable service. They're collecting way more of the internet than you are (I assume), so when that thing you didn't back up today is not available in 10 yrs, it's more likely to be on the archive.

I donate to The Archive. More people should too.


I don't know why you're treating them as mutually exclusive. Single points of failure are as bad when it comes to organizations as they are with anything else. Internet Archive (the org) could stop existing with the flick of a pen. I don't think “Let Somebody Else Do It” is a healthy attitude to take, and I'm going to keep doing what I'm doing.

Plus for as great of a service as Wayback Machine is, it can be very unpleasant to actually browse. I dislike how it injects its own toolbar into every page (yes I know how to massage the URLs to get the raw page data, but it isn't browsable that way). Have you never encountered sites in Wayback Machine where certain pages were just randomly not archived? Or when you click a link and get a page from years earlier or later than the one you came from? Never encountered a page or an entire domain that was blocked from Wayback Machine? Why do you think I would get started doing something like this in the first place if I didn't find it more fun to browse my own archives than Somebody Else's?


Heck IA will even temporarily IP-block you just for loading an archived page with too many images. It's a very useful resource but often also very painful to use.


Even if you're logged in?


Can we see the command? I've been using this one:

  wget \
    --recursive \
    --mirror \
    --timestamping \
    --page-requisites \
    --html-extension \
    --convert-links \
    --restrict-file-names=windows \
    --no-parent \
    $url



You can also use ArchiveBox; it's very user friendly.


> I’ll edit the full command into this comment when I get home if anybody else wants to start doing the same :)

I would love that. I have a little version with fewer parameters, but I feel yours is more tried and true.


See my comment here, and happy archiving! https://news.ycombinator.com/item?id=40496558


Missed the edit window, but here's the command I use. Newlines added here for clarity.

  wget-mirror() {
    wget --mirror --convert-links --adjust-extension --page-requisites \
    --no-parent --content-disposition --content-on-error \
    --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
    --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:129.0) Gecko/20100101 Firefox/129.0" \
    --restrict-file-names="windows,nocontrol" -e robots=off --no-check-certificate \
    --no-hsts --retry-connrefused --retry-on-host-error --reject-regex=".*\/\/\/.*" $1
  }

Some notes:

— This command hits servers as fast as possible. Not sorry. I have encountered a very small number of sites-I-care-to-mirror that have any sort of mitigation for this. The only site I'm IP banned from right now is http://elm-chan.org/ and that's just because I haven't cared to power-cycle my ISP box or bother with VPN. If you want to be a better neighbor than me, look into wget's `--wait`/`--waitretry`/`--random-wait`.

— The only part of this I'm actively unhappy with is the fixed version number in my fake User-Agent string. I go in and increment it to whatever version's current every once in a while. I am tempted to try automating it with an additional call to `date` assuming a six-week major-version cadence.
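Something like this is what I'm tempted by, assuming GNU date (the v100 anchor date of 2022-05-03 and the six-week cadence are the assumptions here):

  # sketch: guess the current Firefox major version from today's date
  ff_major() {
    # Firefox 100 shipped 2022-05-03; add one major per six weeks
    echo $(( 100 + ( $(date +%s) - $(date -d 2022-05-03 +%s) ) / (6*7*86400) ))
  }
  # then: --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:$(ff_major).0) Gecko/20100101 Firefox/$(ff_major).0"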

— The `--reject-regex` is a hack to work around lots of CMS I've encountered where it's possible to build up links with an infinite number of path separators, e.g. an `www.example.com///whatever` containing a link to `www.example.com////whatever` containing a link to…

— I am using wget1 aka wget. There is a wget2 project, but last time I looked into it wget2 did not support something I needed. I don't remember what that something was lol

— I have avoided WARC because I usually prefer the ergonomics of having separate files and because WARC seems more focused on use cases where one does multiple archives over time (as is the case for Wayback Machine or a search engine) where my archiving style is more one-and-done. I don't tend to back up sites that are actively changing/maintained.

— However I do like to wrap my mirrored files in a store-only Zip archive when there are a great number of mostly-identical pages, like for web forums. I back up to a ZFS dataset with ZSTD compression, and the space savings can be quite substantial for certain sites. A TAR compresses just as well, but a `zip -0` will have a central directory that makes it much easier to browse later.
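Concretely the wrap step is just a store-only recursive Zip, e.g.:

  # -0 = store only (no DEFLATE), -r = recurse; the fs-level ZSTD then
  # compresses the stored data, and the Zip central directory keeps
  # the site browsable
  zip -0 -r preserve.mactech.com.store.zip preserve.mactech.com/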

Here is an example of the file usage for http://preserve.mactech.com with separate files vs plain TAR vs DEFLATE Zip archive vs store-only Zip archive. These are all on the same ZSTD-compressed dataset and the DEFLATE example is here to show why one would want store-only when fs-level compression is enabled.

  982M    preserve.mactech.com.deflate.zip
  408M    preserve.mactech.com.store.zip
  410M    preserve.mactech.com.tar
  3.8G    preserve.mactech.com
Also I lied and don't have a full TiB yet ;)

  [lammy@popola#WWW] zfs list spinthedisc/Backups/WWW
  NAME                      USED  AVAIL     REFER  MOUNTPOINT
  spinthedisc/Backups/WWW   772G   299G      772G  /spinthedisc/Backups/WWW


  [lammy@popola#WWW] zfs get compression spinthedisc/Backups/WWW
  NAME                     PROPERTY     VALUE           SOURCE
  spinthedisc/Backups/WWW  compression  zstd            local



  [lammy@popola#WWW] ls 
  Academic                        DIY                             Medicine                        SA
  Animals                         Doujin                          Military                        Science
  Anime                           Electronics                     most_wanted.txt                 Space
  Appliance                       Fantasy                         Movies                          Sports
  Architecture                    Food                            Music                           Survivalism
  Art                             Games                           Personal                        Theology
  Books                           History                         Philosophy                      too_big_for_old_hdds.txt
  Business                        Hobby                           Photography                     Toys
  Cars                            Humor                           Politics                        Transportation
  Cartoons                        Kids                            Publications                    Travel
  Celebrity                       LGBT                            Radio                           Webcomics
  Communities                     Literature                      Railroad
  Computers                       Media                           README.txt


Some of this could stand to be re-organized. Since I've gotten more into it I've gotten better at anticipating an ideal directory depth/specificity at archive time instead of trying to come back to them later. Like `DIY` (i.e. home improvement) should go into `Hobby` which did not exist at the time, `SA` (SomethingAwful) should go into `Communities` which did not exist at the time, `Cars` into `Transportation`, etc.

`Personal` is the directory that's been hardest to sort, because personal sites are one of my fav things to back up but also one of the hardest things to try and organize when they reflect diverse interests. For now I've settled on a hybrid approach. If a site is geared toward one particular interest or subculture, it gets sorted into `Personal/<Interest>`, like `Academics`, `Authors`, `Artists`, `Goth` (loads of '90s goths had web pages for some reason). Sites reflecting The Style At The Time might get sorted into `1990s` for a blinking-construction-GIF Tripod/Angelfire site or `2000s` for an early blog. Sometimes I sort personal sites by generation like `GenX` or `Boomer` (said in a loving way — Boomers did nothing wrong) when they reflect interests more typical of one particular generation.


Maybe save the log automatically? Then check and report unresolved errors at the end of the function, or better, in a separate function so the log can be reinspected at any time.

I have encountered "GnuTLS: The TLS connection was non-properly terminated. Unable to establish SSL connection." multiple times, and the retry options seem to be useless when that happens. Some searches suggest it could be related to TLS handshake fragmentation, but you would still expect wget to retry when the retry options are set. A manual retry does download the missing URLs; otherwise, mirroring jobs are randomly incomplete.
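A rough sketch of what I mean, untested, with the mirror flags trimmed for brevity (the grep patterns would need tuning to wget's actual log output):

  mirror-logged() {
    local log="wget-$(date +%F-%H%M%S).log"
    # -o writes wget's full transcript to a file for later reinspection
    wget --mirror --convert-links --adjust-extension --page-requisites \
      --no-parent -o "$log" "$1"
    # surface anything that still looks failed at the end of the run
    grep -E 'ERROR|Unable to establish SSL connection' "$log" \
      || echo "no unresolved errors in $log"
  }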


It's weirdly specific, but I remember old versions of Go caused that error. The final packet (close_notify) to close the connection was sent with the wrong alert level.


This is great, thanks for sharing with that additional context.


Wow. Only 772GB. Way under 1TB. Liar!!


yes please. extra credit for anyone who shares instructions on how to inject this into every website i browse sans blocklist


Keeping a local, searchable record of all web browsing comes up every few months here, but it took me a while to find a lengthier discussion like this one from 2022: https://news.ycombinator.com/item?id=31848210


incredible linkfinding. you deserve 10 upvotes.



Why, what's the point in doing such nonsense? Unless it's someone with lots of money, contacts on the dark web, and some historic Barbra Streisand-type chip on the shoulder.


DDOS attacks are dirt cheap and can be contracted from large professional sites offering customer support and the works. The largest one taken down had hundreds of thousands of users, and had carried out some 4 million attacks, for prices starting at $14.99/month. [1]

So in other words, anybody can carry out a DDOS for basically no cost. So trying to analyze the purpose, let alone suspects, is probably not going to be fruitful.

[1] - https://wccftech.com/865619-2/


And they're curiously usually protected by Cloudflare.


Doubt Cloudflare has anything to do with it. The operators most likely don't want to openly expose their websites' IP addresses.


That is exactly the problem. These services are constantly at war with each other and are attacked by competitors. Cloudflare provides DDoS protection to the DDoS providers so they can keep their services online, which directly benefits Cloudflare by DDoS being a bigger problem than if they were all busy attacking each other.

This is a sampling of currently available services and who they use for DDoS protection:

  stresslab.app - Cloudflare
  maxstresser.com - Cloudflare
  sunnystress.com - Cloudflare
  tresser.io - Cloudflare
  ip-stresser.net - Cloudflare
  hardstresser.com - DDoSGuard
  zdstresser.net - Cloudflare
  starkstresser.net - Cloudflare
  stresserhub.org - Cloudflare
  nightmarestresser.net - DDoSGuard
Just for fun head over to Cloudflare's abuse reporting site and try to figure out how to get one of these taken down. https://abuse.cloudflare.com/


DDoSGuard has a reputation for being The Crime CDN, disproportionately serving things like phishing campaigns, black hat forums, piracy sites, etc, so the fact that they are merely the second most popular CDN amongst DDOS providers after Cloudflare speaks volumes.


TIL. That's shocking. I doubt it's intentional, but "institutions will preserve the problem to which they are the solution." No need to ascribe to malice that which can be blamed on simple incentives (and of course it's a big problem, things fall through the cracks, etc etc).


I find the idea of DDoS providers confusing. If someone tried to operate a service that can be abused easily to cause similar disruption in the physical world, the operation would be taken down quickly and the people behind it would probably end up in prison. But somehow the internet is still a lawless zone where crime is tolerated and everyone is out for themselves.


It used to be very rare for DDoS providers to publicly advertise their services, you kinda had to know a guy who knew a guy. If you put up a website offering this service the Good Guys of the Internet would track you down and get your provider to take you down, or that provider would in turn get disconnected from the internet.

Now they hide behind Cloudflare, who will refuse to turn over any information so that security folks can get them taken down. Unfortunately Cloudflare has grown so large that we can't just block all of it or depeer them like we would any other network that provided services to bad actors.


That kind of vigilante justice is part of the general lawlessness.

Most of the listed domain names are under US jurisdiction. That means the authorities should be able to take them down. If Cloudflare is found to have been knowingly enabling crime, it could face fines, and the CEO and other key people could end up in prison. The Cloudflare services have probably been paid using means that are under US jurisdiction. Those payment accounts can be closed and the people behind them tracked down and potentially charged with crimes.

Or at least that's how things work in the real world. The internet is still apparently too new for the authorities to understand how to deal with it.


It's true that you can't practically block Cloudflare without impacting legitimate users, but they can absolutely be depeered if you're willing to pay a higher transit bill.


The added element of international relations makes it a fair bit more tricky than any real-world equivalent. Usually these operate out of places that are not on good terms with the countries they target. Russia and China are the big ones.


It's obvious why a DDOS provider would want to use Cloudflare, but their point is that Cloudflare turns a blind eye to DDOS providers using their services. Actively helping to keep DDOS providers online while also selling DDOS mitigation isn't a good look to say the least.


Cloudflare is a data goldmine setup by people who love fedoras and newspapers. Professional DDOS providers won't use Cloudflare ever and have the skills, metal and (human) network to do everything in-house.


Yeah, you're actually worse off using Cloudflare because you can't block attacker IPs anymore once you're dependent on them to protect you, and they're not very good at protecting. I run an online service that invites hackers to DDoS the server. Cloudflare's servers would usually go down before we did. The only way we could stay online was by switching to GCS and using token buckets to blackhole IPs in the raw PREROUTING table, which made the hostile packets into mighty Google's problem. Thankfully they don't charge for ingress, so it was about as cheap as Cloudflare too.
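For anyone wanting to try the same, a token-bucket rule of that general shape can be written with iptables' hashlimit match; an illustrative sketch (thresholds made up, not the exact rules used):

  # per-source-IP token bucket in the raw table, so offending packets
  # are dropped before conntrack or userspace ever see them
  iptables -t raw -A PREROUTING -p tcp --dport 443 \
    -m hashlimit --hashlimit-name ddos --hashlimit-mode srcip \
    --hashlimit-above 100/second --hashlimit-burst 200 -j DROP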


You mention the key feature for DDoS (self-)protection: zero ingress fees. Unavailability hurts you in harder-to-quantify terms than a bill for bandwidth used.

Zero ingress puts the upfront bandwidth cost onto the attacker. Because... you actually may succeed in defending and staying up. Their success is not guaranteed; they might be shouting into the void.

Attack success (as in, "impact on you") is guaranteed if your ingress is chargeable.


And you probably think glass repairers don’t drive around freeways at night dusting gravel.


What makes that curious?


Cloudflare's "protection" is basically a racket. e.g.

https://robindev.substack.com/p/cloudflare-took-down-our-web...


While I think there might be valid arguments to support that claim, that blog post hardly qualifies. The author runs a gambling site and while the way Cloudflare handled the situation (according to the author) could certainly be improved, they clearly were affecting other users by "tainting" shared IPs.


And yet if they ponied up the money, that issue of "tainting" shared IPs would suddenly go away. You can bet Cloudflare would graciously give the gambling site as much time as they need to bring their own IP (they went out of their way to link third-party sellers of IPs with dubious provenance, after all).


Did you read the blog post? It doesn't include the entire correspondence, so it's not clear how explicit Cloudflare was about this, but the Enterprise plan they were trying to upsell them to includes BYOIP. It's clear to me that Cloudflare insisted they buy the Enterprise plan because it includes BYOIP.

So in other words: Cloudflare noticed the author was running a gambling site and decided that this was negatively impacting the shared IPs, so the author would need to upgrade to a plan that included BYOIP, because they would need that feature to continue using Cloudflare. They likely insisted on prepayment for the annual plan because gambling sites have a reputation for being flaky, and prepaying would have demonstrated the liquidity necessary to continue operating the site on that plan.

Again, Cloudflare could have communicated this better (and maybe they did in parts of the correspondence the author didn't share) but this all seems perfectly understandable, especially given how the sales team kept referencing Trust and Safety (implying the alternative is ending the contract for violating the ToS).

The issue of tainting shared IPs would indeed have suddenly gone away had the author brought their own IP (which would have required an Enterprise plan to do while staying on Cloudflare). Instead the author feigns ignorance, arguing they don't even need the features of the Enterprise plan, and doesn't acknowledge the issue with sharing IPs, while sheepishly mentioning that maybe they're accidentally evading bans of their domain in certain countries by having alternative domains, which they of course don't actually need because most traffic comes from their main domain, yet somehow having these alternative domains is critical to running their business.

What are you even trying to argue here? The author is being deliberately dishonest in how they frame the incident and Cloudflare's motivation is perfectly understandable. The only thing to take offense with is the communication style which we can only judge based on a select few messages the author shows us. We have to rely on their word after they have already demonstrated dishonesty.


To me it’s similar to the whole “SSO wall of shame” thing, where a vital feature is locked behind more expensive pricing. As said in the article:

“We tried saying that we don't need any number of the 14 features that are included”

Which, to me, is the crux of the issue. Is it fair for Cloudflare to say "You are breaking the terms of service if you do not change your setup in this specific way, and also the way you need to change your setup is locked behind significantly more expensive pricing"? Being able to bring your own IP does not, to me, seem like something that should require a plan that is orders of magnitude more expensive than the standard one. It seems much more like something fundamental that should be included as an option in a lesser version of the product. Maybe I'm wrong, and there is actually significant overhead to Cloudflare for letting customers bring an IP. But as is, it feels very much like a situation where something vital was locked at the most expensive tier to force certain kinds of customer to pay more.


I'd say at that point it's essentially compensation for personal suffering.

Yes, BYOIP as a feature does not seem complex enough to warrant paying for an Enterprise license. But the kind of customers who need BYOIP (especially if they need it to avoid harming the shared IP reputation) are likely to be at a higher risk of being flaky or otherwise painful, so this is very much a tax on running that kind of business (just as porn sites often find it hard to find payment processors because of the high risk of credit card fraud).

As a freelancer I have absolutely made offers at 10x my going rate for clients I did not want. The idea is that if they really want me to work for them, at least I get reimbursed for the suffering that will entail. This kind of pricing structure is no different.


Maybe it's a form of advertising certain capabilities and services.


IIUC, that's always a good theory for unexplained DDoS. Though, even if they have only profit motivations, I'm a little surprised when they don't seem to let ideology influence their selection of targets for demos.

For the sake of argument (maybe not true), let's say that all techies are aware of archive.org, and consider it beneficial, probably using it themselves.

Why don't they instead demo against a target that will be proof of capability, and one that someone won't pay them to do (no freebies), yet one that they perceive as bad or deserving in some way?

Probably improper to suggest "better" targets here, but I really wonder what's going on when some relative do-gooder gets attacked.

Similarly, ransomware attack on a children's hospital, of all places? Doesn't that get you uninvited to criminal mastermind dinner parties?

As Omar of "The Wire" told us, a man's gotta have a code.


One thing to keep in mind about LockBit ransomware: it was SaaS — errr, RaaS — and there is a good chance the target was picked by an insider there, or it was at least some opportunistic hacker not really associated with those who provided the service, beyond signing up as an affiliate.

LockBit was so successful partly because they didn't have to hack anyone themselves. It was basically advertised as: "Got SSH or RDP access? Let's make a bunch of money."

This attracted hackers who might not trust themselves to do the extortion part safely, as well as people who didn't actually hack anything but hated their boss and wanted a payday.


Perhaps cruelty is the point.

Perhaps they intentionally attack targets that are generally seen in a positive light, to prove to potential customers that morals are not an issue.

Oh, you want me to DDOS a children's hospital? No problem.


Thus is the power of The Dark Side^W^W^W late-stage capitalism.


Extortion is usually the motive. “Nice porn/gambling/crypto website you’ve got here. Shame if something happened to it”.


Maybe there is something damning on there that someone needs kept quiet for a while?


No.


> Please don't post shallow dismissals

https://news.ycombinator.com/newsguidelines.html


The user you're responding to is Jason Scott of TIA.


What's the significance of that?

(Googling "Jason Scott TIA" gives me "Dr Jason Scott is a Senior Research Fellow in the Tasmanian Institute of Agriculture" which doesn't explain much to me)


Jason Scott works at the Internet Archive[1].

[1]: https://en.wikipedia.org/wiki/Jason_Scott


And he knows about every single file in it?


Every archived file knows Jason Scott.


The beauty of acronyms/initialisms that people are too lazy to spell out!

TIA = The Internet Archive (i.e. the victim of the DDoS).

>The user you're responding to is Jason Scott of The Internet Archive


The Tasmanian Institute of Agriculture are well known for their work on biological models of computer security architecture.

I am shocked that any HN reader could be ignorant of this fact. Their director is a (controversial) Turing Award winner.


"Homomorphic encryption using selectively-bred rice yeast and corn fungus" by Jason Scott, TIA.


Shallow dismissal anyway, even if he was the Supreme Majestic King of New Americania. He might further explain his answer. And I'm truly sorry for the DDoS happening to this guy's organisation!


What is there to explain further?


Some evidence or reasoning that there's nothing damning anywhere on IA that anyone could possibly need kept quiet to the point of having ordered this particular attack. Just being the target of an attack doesn't mean you have perfect information about the perpetrator or their motives.

This could even be as simple as "Some aspect of the attack pattern is inconsistent with such a motive", or "We spotted the perpetrator credibly gloating about it". But just from IA's public statements, the pattern ("launching tens of thousands of fake information requests per second") is quite consistent with simple denial.


That would have been an excellent addition to that comment.


Which, without context that was not given, I certainly didn't know, so it isn't safe to expect others to know either. Are we supposed to dig into people's profiles to derive relevant context?


One could also argue that this is merely being conversant on the topic one is conversing on.

And not taking on the job of police when you don't know as much as you think you do, such as the speakers whose speech you presume to police.

You might not know the significance of "textfiles says no", but you do know that in general it's a thing that on HN, sometimes the rando is no rando, and you do know that you're not dang.

That's all it takes to avoid looking like a douche. And soon enough some comment or other would fill in the significance, from someone else looking like a douche and having it explained to them.

Jason could have added "Internet Archive here, it's not that." But he would have to say that in front of every comment he ever writes, which I think would get old for him and probably no small number of other people would criticize that too "yesss we know you work for IA FFS get over yourself..."

I think it's fine for him to just speak and let everyone else take care of themselves.


> but you do know that in general it's a thing that on HN, sometimes the rando is no rando

Sometimes. More often than many other places we could mention. But in the majority of cases, even here, a rando is a rando. And if the randos see the accepted conventions being ignored without comment it might encourage them to do it more.

> Jason could have added

I'd argue should have.

> But he would have to say that in front of every comment he ever writes

Only comments where it is significantly relevant, or in this case where his experience and proximity to the issue at hand might be considered enough to ignore the standard commenting conventions.


Oh sorry, I guess that makes it a very detailed and well thought-out response.



It does, yes. The single word, from that source, on this topic, communicates all relevant information.


Hackernews, it never disappoints.


Makes sense: Large media outlets don't like their old BS stories staying accessible. I've seen it used as an accountability tool.


There are some very bad very shitty people about, just trying to make earth worse.

Npm has been under pretty severe attack for ~6 weeks now. I forget who else.

The scariest thing to me is what we might do in the face of persistent online attacks. If this stuff gets rolled up into western nations rolling back privacy & liberty? That's a theonion.com "bin laden plan to sit back and enjoy collapse" situation. Freak out & let cyber security paranoia reign & destroy free communication & connection.


My wild guess is most of them are run by companies offering DDoS mitigation services.


My thoughts exactly, what is the point of attacking a library... so lame... =/


It's probably because someone saved incriminating evidence on it and they refused to take it down.


If it's on their own site, that wouldn't be a problem. IIRC, archive.org stops serving pages that later appear in a robots.txt file.


It could be on a website they don't control like twitter.


Al-shabaab, Boko Haram, China, Russia, DPRK, basement dweller, or third-party offensive hackers.


Are you upset? Can't do anything about it? Did it even make a headline, or even just a thread on a forum? That's reason enough for some. It could easily be a teenager with no better excuse than not having a fully developed brain and no better reason than liking to ruin things. Having seen how much that happens, I guess it's more likely than a conspiracy or a crime with any rationality behind it.


It's probably just some kid with a botnet that's showing off. You all give these people way too much credit, lmao.


Seriously? People do this shit for fun. There used to be a program (LOIC) popular on 4chan used for DDoS attacks all the time, it's the origin of the "firin mah lazer" meme.


The lazer meme (2006) predates LOIC (~2010) by years https://knowyourmeme.com/memes/shoop-da-whoop


I stand corrected.


How much $$ does archive.org spend on infra and such? How much does one need to endure the most damaging DDoS attacks? I remember seeing somewhere that Google went through some huge DDoS attack without going down.

Given its benefit to lay persons, I recommend everyone who uses their services give a small amount once in a while. I already did so, but if not for family issues I'd donate way more.


DDoS attacks are usually volumetric attacks: send more bits than the pipe the website has to the internet can carry.

To combat this you need to buy enough pipes to the internet for your regular traffic, plus an extra 500 Gbps or so. That is a lot of unused bandwidth to be paying for every month. Then, once the packets arrive at your datacenter, you still need dedicated appliances to scrub out the bad traffic and let the good flow through.

Google is constantly under attack, but their normal daily traffic volume (multiple Tbps) is so large that just the extra capacity they keep on hand to deal with traffic spikes like the World Cup or a popular YouTube video is larger than what most attackers can muster.


What are the ways to manage a DDoS attack, preferably using open source? Don't say Cloudflare because they're an extortionist firm.


Cloudflare is not "an extortionist firm". It is a large tech company, where occasionally teams employ shitty sales tactics to meet their numbers, but generally provides a valuable service and acts reasonably ethically.

There are open source tools to mitigate DDoS, but all of them will have some marginal cost to run, and they will all be significantly worse than Cloudflare, as they benefit from neither Cloudflare's data moat nor its scale.


No, thanks. Cloudflare acts ethically only as long as it suits them. It is the pre-exploitation phase to lure a customer. We are not fools here. The report at https://news.ycombinator.com/item?id=40481808 says it all.

Secondly, considering Cloudflare would MITM all traffic, it would make a good data source for the NSA, thereby violating all user privacy.


> Secondly, considering Cloudflare would MITM all traffic, it would make a good data source for the NSA, thereby violating all user privacy.

This seems like a weak argument. Should we just take down anything widely accessed because it might be used by the NSA? What about AWS?


Yes, that's pretty much the view in Schrems II from the European Court of Justice. The CLOUD Act does not respect data protection rights.


Is AWS providing DDoS mitigation services now, coupled with MITM access to user traffic?


They're the man at the end, actually. No extortion necessary.


That argument doesn't really fly. The "poor little customer" was an online casino who was using Cloudflare to avoid getting taken down in countries where online gambling isn't allowed.

This had a high risk of getting Cloudflare's limited IPv4 addresses onto a blacklist - affecting ALL of their customers.

All CF did was ask them to switch to an Enterprise plan and bring their own IP addresses. They refused to do either and instead cried on the internet claiming CF to be bullies. It's not like the price they asked was even a fraction of the profits an online casino brings in every DAY.


Spoken like a true Cloudflare employee! But no, we see through your false reframing of the facts.

If a customer's action is truly illegal, the customer's account should be terminated, either immediately or after a reasonable warning to fix things. Under no circumstances should money be sought to support the illegal activity.

CF is engaged in a pig-butchering scam whereby they lure customers, then when the customer is all fat and happy, they get asked to pay up or lose their business.

In this case, CF destroyed the customer's business as soon as CF got word that the customer was going to move to Fastly, considering that the protection money was not paid. It's an open and shut case of racketeering.


Nah. Just don’t do anything illegal and you’re good


> Cloudflare is not "an extortionist firm". It is a large tech company, where occasionally teams employ shitty sales tactics to meet their numbers, but generally provides a valuable service and acts reasonably ethically.

Nice.


Plenty of DDoS mitigation firms use open source tech, but that's only step one of mitigation; most normal firms will never be able to stop a DDoS attack unless someone else with a lot of resources tanks the attack for them.

Even if you go all out and buy a bunch of huge IP transit links, you are not gonna be able to stop the IXP 800 miles away from getting congested and blocking your customers from accessing your site anyway. You need access to a backbone to route traffic differently to avoid those kinds of issues, which is why DDoS scrubbing services will partner with a Tier 1 ISP to do most of the work.


HAProxy + DDOS protection? https://www.haproxy.com/blog/application-layer-ddos-attack-p...

Or any proof-of-work proxy that delays the ingress traffic. If you only have one server, there is very little you can do except maybe redirect to a static page or kill the DNS entries.
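The gist of the linked HAProxy approach is a per-source request-rate stick-table; a minimal sketch (names and thresholds are made up):

  frontend web
    bind :80
    # track each source IP's HTTP request rate over a 10s window
    stick-table type ip size 1m expire 10m store http_req_rate(10s)
    http-request track-sc0 src
    # reject sources exceeding ~100 requests per 10s (tune to taste)
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 100 }
    default_backend app

  backend app
    server s1 127.0.0.1:8080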


HAProxy looks good, but I would also consider a unique unguessable subdomain per established customer. This won't thwart the IP-level attacks, but it will thwart most subdomain attacks for those customers.


It's not (just) a matter of the software; it's the hardware and the position in the network. These DDoSes primarily just saturate your internet pipe: you need to be able to coordinate with core ISPs to block the DDoS traffic before it concentrates too much.


I have asked this a few times and never gotten an answer beyond "One day they could turn evil." What is the reason Cloudflare is an extortionist firm? I am way more concerned about Amazon than Cloudflare.


Beyond the upselling under duress, I've also seen complaints that Cloudflare protects the client-facing websites of DDoS-as-a-service operators. This enables them to sell their service, which then creates demand for Cloudflare's service from their targets.

Cloudflare describes that policy as a commitment to content neutrality rather than extortion, and I think that's more or less sincere (since they've protected many other unpopular sites that didn't give them such a benefit, with a few high-profile exceptions). It does work out very conveniently for them, though.


> Cloudflare describes that policy as a commitment to content neutrality rather than extortion

But we know that's not true. Point out problems with a very controversial blogger and they'll cancel your service.


There was a recent anonymous critique of Cloudflare on the front page: https://news.ycombinator.com/item?id=40481808


"What is the reason Cloudflare is an extortionist firm?"

Because they are a publicly traded company


So is Amazon.


Nailed it. The implications are saddening.


The only universal way, IMHO, is to associate a small cost with every internet request. The cost has to be as small as possible, but it has to be there, so that millions/billions/trillions of requests add up and make it uneconomical to continue the attack past some point.

It is the same problem as with email spam. What's stopping someone from sending billions of spam mails?

If we suppose that a blockchain exists which is fast enough, cheap enough, and spread out enough across the globe (to mitigate latency), then there is no reason for a TCP packet not to carry with it a small money transaction, on the order of a millionth of a cent. Information gets served back only when the transaction is confirmed.

That way, any request with no transaction gets discarded, and only requests bearing a small cost pass through. Suddenly, with requests being sent one after another and no end in sight, DDoS attacks and mail spam start to cost money. It is the serving of requests that makes DDoS attacks and mail spam effective.

The problem however, is that no blockchain is fast enough and cheap enough as of today. But there will be one in a handful of years.


Similar systems were proposed in the late '90s/early 2000s (hashcash/micropayments) to combat spam. The big problem isn't a technological one, it's that it presupposes some "sweet spot" price (negligible for legitimate users, yet prohibitive for abusers) that has never been shown to exist in reality.

(also, you're arguably just moving the problem to DDoSing the payment processing / firewall mechanism)


> Similar systems were proposed in the late '90s/early 2000s (hashcash/micropayments) to combat spam.

These ideas indeed exist for decades.

> The big problem isn't a technological one, it's that it presupposes some "sweet spot" price (negligible for legitimate users, yet prohibitive for abusers) that has never been shown to exist in reality.

Advances in technology, software and hardware, make it easier and easier for that sweet spot to exist. That sweet spot certainly didn't exist in the past, but we are close right now.

One example that I think is useful here is aluminum cans for fizzy drinks. Aluminum, a strong metal compared to cardboard or plastic or glass, is better at withstanding pressurized gases without exploding. The downside is that it's more expensive. When manufacturing prices dropped a lot, it became feasible to drink half a liter of liquid and just throw away the metal. Aluminum is still not free, though, but the small price is worth it. A huge waste of energy as well, to smelt all that metal and throw it away after 10 minutes of drinking, but it is economically viable.

One could manufacture Titanium cans, and drink even more fizzy drinks. But that's not economically viable as of today.

> you're arguably just moving the problem to DDoSing the payment processing / firewall mechanism.

Yes, the problem is moved elsewhere; that's the weak link in the scheme I described. The thing is that a flood of transactions still costs money. Blockchains cannot be flooded just with requests; they have to be flooded with transactions. Take a look at the article [1], which outlines some ideas. I don't agree with a lot of things in there, but it states the problem and gives some numbers.

The theory when it comes to blockchains deterring DDoS attacks (and other kinds of attacks) is that there are no bad guys in general, just rational economic actors who use dirty tricks. When a dirty trick starts to cost money, and profit disappears from an attack, the rational economic actor will stop the attack. A true bad guy would resume the attack regardless of profit, but that's one of the axioms of the theory: that there are no bad guys.

[1] https://www.dlnews.com/articles/defi/ddos-attacks-are-an-inc...


This wouldn’t fix anything. Most DDoS attacks today are amplification attacks, e.g. “I send 10 bytes to this unpatched NTP server and as a result it sends 500 KB to this target server,” so in your scheme the costs would not be borne by the attacker.


That would assume the cost is borne by the attacker, and not every smart thermostat in their botnet.


Aw darn, they are? I was just considering migrating my frontend to them after seeing all the positive reviews. What's the issue?


Refer to the report at https://news.ycombinator.com/item?id=40481808

Secondly, considering Cloudflare would MITM all traffic, it would make a good data source for the NSA, thereby violating all user privacy.


If your application can take it, drop it in the application. If your load balancers can take it, drop it on your load balancers. Otherwise you have to get your provider to drop it, if they can take it. Worst case, they'll drop all traffic meant for you to protect the rest of their network.


Proof of work gateways and really annoying captchas.


That won't fix the issue of your inbound pipe being saturated, preventing legitimate users from accessing your site.


I believe it would though? Isn't that the whole point?

If you need PoW for every connection, it's going to end up being very expensive for an attacker to saturate the connection. And the captcha is probably on a different server from the main site.
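To make "expensive" concrete, PoW here means hashcash-style puzzles: the client must brute-force a nonce before the gateway answers, while the server verifies in a single hash. A toy sketch of the client's side in shell (values made up; each extra leading zero multiplies the average work by 16):

  challenge="abc123"; difficulty=4; nonce=0
  # find a nonce such that sha256(challenge+nonce) starts with
  # <difficulty> zero hex digits
  until printf '%s%s' "$challenge" "$nonce" | sha256sum | grep -q "^0\{$difficulty\}"; do
    nonce=$((nonce + 1))
  done
  echo "proof found: nonce=$nonce"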


You can't check the PoW until it's at your connection. Do you mean that every packet everywhere would have PoW that is checked by their ISP/upstream?


No, I meant just a usual gateway.

I thought a gateway would work for that purpose.

Are you sure it wouldn't? In that case there's no defending against ddos.


A gateway (router/firewall) on your own connection cannot protect you against a DDoS. Really nothing at your connection can prevent you from getting DDoSed. The attacker can cause you to go offline/degraded even if you are 100% successful at dropping his packets. They still fill up your inbound pipe and prevent (or majorly hinder) legitimate traffic from fitting in.

The filtering/dropping of packets has to be upstream of your connection to be able to protect your connection - additionally somewhere where the available bandwidth is greater than the attacker's bandwidth.


This is like an arsonist lighting an orphanage or library on fire. Why would you do something like that?


To sell your services as an arsonist. Being able to point to a big successful attack helps professional DDoSers sell their services.


more cynically, to sell fire-safety insurance


I always wondered if the NSA and Cloudflare do stuff like this with websites not behind Cloudflare's umbrella.

"Make em offer they can't refuse."


Most DDoS for hire services are on CloudFlare and they refuse to drop them. You're at least in the right ZIP code.


But wouldn't they need to telegraph the attack in advance to customers? Taking credit after the fact is risky as many competitors will also take credit


Yes, that is part of it. They tell potential clients, "On May 27th, I'm going to take down the Internet Archive". Then they do it, and then go back to their clients and say, "now that you've seen my work, do you want to pay me?".


Then wouldn't it make more sense to take down a target with few eyes on it? Since you're not paid, why deal with the risk of an attack that will make the news?


Making the news is the goal. As long as a few people can verify it was you, word will get around about the person who can take down big targets, and will cite the news articles as part of the proof.


Isn't word getting out about you bad though? As it puts you in the spotlight of law enforcement?


I wonder where one can find a middleman for that kind of service. Criminal groups would be pretty stupid not to pay a middleman to do the checks and filtering.


I don't understand: what prevents them from running to your competition if you don't mark your brand?


Some people just want to watch the world burn.


Precisely. No honor amongst thieves.


Probably no good reason. In technical terms, some asshole is dicking around.


Because the orphanage refuses to use cloudflare.


Precisely this. There's a reason Cloudflare was part of the CIA's incubator. Totally organic growth and didn't have government and spook hands all over it /s.


People who want to get rid of data history should be considered the enemy of humanity. I hope the archive is fine after all this.


> Sorry to say, archive.org is under a ddos attack. The data is not affected, but most services are unavailable. We are working on it & will post updates in comments.


Seems to be back online. For now, anyway!

https://mastodon.archive.org/@brewsterkahle/1125141764988452...

  @internetarchive is back up!
  this is a back-and-forth with attackers.  Sees weekends and holidays are popular.
  we made adjustments, but we will see.
  at least Happy Memorial Day!


If anyone from the archive.org team reads this: love the website, by the way. You've saved so much rare content, it's really awesome.


would a decentralized internet archive make sense? impossible to ddos.


I'd love to work on something like this. But the internet archive themselves don't seem too interested in it I guess.


They literally organize an annual conference called "DWeb Camp" to support a community of distributed web nerds, and they have a partnership with Filecoin to mirror content to IPFS. They opened up a Vancouver headquarters to start acting as a mirror - they are very interested in removing themselves as a single point of failure.


That exists and it's called Arweave. You even collect imaginary brownie points for maintaining the archive.


> Arweave

looks a bit broader than i wanted. as is it's like a thin wrapper around IPFS.


It's a fake brownie point scheme (i.e. cryptocurrency) to let the people who archive the most stuff decide what goes into the archive. There's also IPFS but that has absolutely no way to decide what gets archived.




Who's their biggest enemy at the moment?


Anyone who doesn't like the availability and accessibility of history and documents.

Lots of people want to rewrite or erase history.

Quoting a story I wrote about this a few years ago:

"Everything you speak, all ideas, all things, all thoughts, they are all of the past. Society and knowledge is a composite of the shadows of former presents.

When people lie or misrepresent knowledge they speak of a past they wish to change.

What if people who have the most to gain from deceit had a tool to actually change the past and make these lies the truth?"

Here it is if you're curious https://kristopolous.medium.com/stephen-hawking-had-a-time-t...


Paywalled sites?


let me go further: the whole of the copyright industry

including all media conglomerates (obviously) and all scientific, literary, etc, publishing houses.

also, there's a global war on, so it may well be a fog-of-war technique, or, like somebody else also mentioned, someone needs something to stay quiet for a little bit as part of some larger operation


The establishment always gets the most advanced technology, attack and defense, because they have the big bucks. That's why I never believed that technological advancement promotes individualism or distributed X (whatever X is: money, power, whatever). Eventually it always points to a more centralized world, because the elites are able to control more with each technological advancement.


Doubt this is coordinated - more likely a singular (m/b)illionaire wanted a post/photo/video, or multiple of them, deleted for good, perhaps for suppression of legal evidence, and this was one way of bringing some firepower to a… library. One of the internet's biggest libraries, too. Odd.


But a ddos doesn't remove the page...


it may prevent its capture in the first place, and it can also prevent consulting the page (a delay tactic)


Good points.


That's a tough question to answer without devolving into politics, which is off topic for Hacker News.

I think that's also the wrong question to ask. "Who's doing it?" is less interesting than "What's enabling them to succeed?"


Politics isn't fully off-topic on HN; per guidelines, most political stuff is off-topic, "unless they're evidence of some interesting new phenomenon"

https://news.ycombinator.com/newsguidelines.html


It's an attack on human progress. Attack on our ability to build and grow over the past.



Everyone needs to see this, back up those words with action (I'm presently a donor)


Is there any way to know who is responsible?


I think "follow the money" is a decent heuristic here. Why else would anyone do it?


Is it possible for archive.org to trace back the IP addresses? I assume maybe the attackers used a lot of IoT devices or VMs in the cloud?


Botnets usually, sometimes amplification attacks against NTP or DNS. The Chinese government’s Great Firewall also has offensive capabilities known as the Great Cannon, though those are generally used against GitHub because it hosts censorship-circumvention software like VPNs.


Are botnets usually hosted on personal computers, servers, or IoT devices? I'm thinking maybe archive.org can block a whole range of IPs if needed.


Resistance to that kind of simple countermeasure is exactly what distinguishes a DDoS attack from a non-distributed DoS attack. The traffic basically comes from "everywhere". Not literally every IP block and route, but widespread enough that it's difficult to separate from legitimate users without actually processing the traffic (which is what you're trying to avoid by e.g. blocking an IP range).


Thanks. And I assume they mostly come from friendly countries which makes it even harder to block?

This is indeed very tough to resist.


Yes. This is why it's maddening that most people don't take computer security seriously. Virus infected devices are what give these botnets their scale and wide distribution.


So compile a long list of compromised source IP's and just block those upstream?


All of the above, and also increasingly IoT devices, especially printers and routers. There is no simple way to block them.


There are many individuals with embarrassing things on the Internet. Perhaps one of the recent university encampment participants wants to get an internship and doesn't want prospective employers to see articles about them screaming and yelling. I think the odds that a major publisher is behind this are slim....


It's those damn record labels throwing a cog in the wheel, isn't it


Looks like someone at Amazon is trying to juke those remaindered servers again. When is the IRS gonna catch up with this? They are worse than OpenAI at abusing their non-profit status.


Cui bono?


The guy from U2, I think


lol, thanks for the good laugh :)


Publishers.


Clicking this link isn't helping them with their current situation. You'd just be contributing to the ddos.


The link directs to an announcement via "toot" on mastodon.archive.org


Nope. This should have no impact on the rest of what they’re doing.


Don't worry, you can visit it through the wayback machine instead!

oh wait


It's where I get all my PD movies, along with Jamendo's dump of music albums.

This is really bad for CC media.


1716850657

Internet Archive is now working for me


Seems to have stopped or been stopped.


so is purl.org


archive-it.org is still working


  #!/bin/sh
  # TLS forward proxy listening on port 80
  read x
  tnftp -4o/dev/stdout "http://wayback.archive-it.org/all/20250000000000/$x"

Example:

https://wayback.archive-it.org/all/20240506083041/https://ar...



Url changed from https://bsky.app/profile/archive.org/post/3ktiatctiqm2r, which points to this.


it's probably fine now but maaaybe not great policy when a ddos is involved


“AWS has been losing money, quick dump more old stuff to IA. So we can write it off!” - Amazon now.


they are trying to make their case for BGP regulation.




