
Wayback Machine was down - k-ian
https://web.archive.org/web/*/https://news.ycombinator.com
======
jarfil
The Internet Archive just announced no-waitlist book lending due to COVID-19;
I'd guess their servers might not be too happy about the influx of users.

[http://blog.archive.org/2020/03/24/announcing-a-national-eme...](http://blog.archive.org/2020/03/24/announcing-a-national-emergency-library-to-provide-digitized-books-to-students-and-the-public/)

~~~
mirimir
Right, and
[https://archive.org/details/nationalemergencylibrary](https://archive.org/details/nationalemergencylibrary)
is up and responsive.

So I'm guessing that they've shifted resources.

------
dublinben
Now would be a great time to donate to the Internet Archive if you're able to.
They can surely use the help.

~~~
dheera
If only shareholders would think the same way about PG&E and other companies
that could use infrastructure upgrades ...

~~~
hinkley
PG&E is so far behind on deferred maintenance that people have been
petitioning for California to socialize it so the state stops catching fire
from high-tension power lines.

You will be glad to know that they've protected executive bonuses, though.

~~~
CameronNemo
Didn't PG&E get blocked from performing upgrades by the state utilities
commission? Something about not wanting to increase customers' power bills...
utility incentives are not easy to get right.

------
nikisweeting
In the meantime, distributed archiving ftw, run your own archives with
Webrecorder.io, ArchiveBox.io, SingleFile, kiwix.org, etc!

~~~
trevyn
The Kiwix _Wikipedia-en_ full scrape with images has been broken for over 18
months, and I think they could use some technical help. I tried running their
scraper myself on a nice AWS instance and it just stalls after many days of
downloading articles. Could probably use a rewrite. ;)

[https://sourceforge.net/p/kiwix/discussion/604121/thread/1f2...](https://sourceforge.net/p/kiwix/discussion/604121/thread/1f27659cdb/)

[https://github.com/openzim/mwoffliner/issues/1020](https://github.com/openzim/mwoffliner/issues/1020)

[https://github.com/openzim/mwoffliner](https://github.com/openzim/mwoffliner)

~~~
traverseda
The whole ZIM file infrastructure is pretty broken. I've been trying to put
together a system for generating a WARC file by rendering all the wikitext
content in a database dump, which is a much more reasonable approach.

Rendering wikitext is challenging, though, since wikitext can transclude
chunks of other wikitext and can use some pretty complicated templating
functionality.

Oddly enough, where I've run into the biggest issues is in weird slowdowns in
the Python warcio library that make dealing with large archives just about
impossible. I haven't had time to really track that down, but if anyone wants
to, it's pretty easy to reproduce: just try adding a few million lorem-ipsum
articles and look at how far from linear time it's running in.

There are a lot of advantages to starting from a dump: you can provide much
better tools for filtering articles, and probably even rudimentary document
classification. You can also do things like re-compress and minify images; a
dump intended for a cellphone probably doesn't need 4K images.

WARC is also probably a better tool for distributing web-archive-type content
like Wikipedia dumps. You can distribute a package of text content and a
package of image content as separate files, for example. Generally I have not
been very impressed with the quality of ZIM file tooling. One disadvantage is
that you need to provide separate search indexing, but that's doable.

I'd love to be able to get a wikimedia grant to work on this, and take on less
contract work, but so far their grant process is pretty hard to follow.

~~~
nikisweeting
I'm actually working on the ZIM toolchain for Kiwix on a contract basis at the
moment, so I'd be interested to hear more about your pain points; they might
be something I can help out with.

In general, I'd say that ZIM and WARC are not really direct competitors or
solutions to the same problems; they're aimed at distinct use cases. ZIM is a
highly compressed format designed solely for static articles and flat content;
it doesn't store headers or anything else that WARC does in order to support
full request/response replaying. ZIM is optimized for storing thousands to
millions of pages of homogeneous content, while WARC is optimized for
high-fidelity collections of smaller amounts of content.

If you want to help out with our efforts, feel free to DM me on Twitter
@theSquashSH or reply here and I can introduce you to the ZIM people (who get
grants to improve this process on the regular, and are open to hiring contract
workers).

------
thegeekpirate
I sent them a bug report yesterday: I was being blocked with "Too Many
Requests" errors on an endpoint I wasn't actually using (they thought I was
attempting to submit URLs via the "Save Page Now" feature), so they've been
having issues across the board.

This is good though, as they're now hopefully aware of some previously unknown
deficiencies.

Best of luck to the Archive team to get things up and running again with
minimal stress!

------
tgsovlerkhgsel
That would explain the random issues I saw recently (within the past ~12
hours) where I asked for a page version from 2019 and got one from 2018.

------
pcdoodle
It's been down for a few weeks for cnn.com (can't load Feb 1st to the current
day). I wonder if they're getting pressure from somewhere. Check it out for
yourself.

------
username2020
Massive layoffs today I heard.

~~~
wideasleep1
Any kinks would be appreciated.

~~~
saagarjha
I think you may have had a Freudian slip ;)

~~~
wideasleep1
Indeed! Kinks still welcome. (was going to make a joke about fat-fingering,
butt decided to let it lay).

