
The GitHub Arctic Code Vault - gingernaut
https://github.blog/2020-07-16-github-archive-program-the-journey-of-the-worlds-open-source-code-to-the-arctic/
======
tw4l
As David Rosenthal (formerly of Sun, NVIDIA, and Stanford) explains, the
actual Arctic Code Vault is a PR stunt, and has almost no chance of helping
anyone in any kind of realistic disaster scenario:
https://blog.dshr.org/2019/11/seeds-or-code.html

That said, the _rest_ of the project, which focuses on preserving several
independent copies of repositories hosted on GitHub with a handful of partner
organizations, is quite useful. From the same post: "They are using a range of
technologies, making feeds available over the Internet, and partnering with
the Internet Archive, the Software Heritage Foundation and the Bodleian
Library. These are mostly things which will get used in the foreseeable
future, and should be applauded for that reason."

~~~
rezendi
Archive Program director here - it's really not a PR stunt. We genuinely
believe it will be of significant historical value, and that there's quite a
good chance it will be of practical value too.

Much of that is "if we forget technology which we realize somewhere down the
road we actually might want to use again." History provides plenty of examples
of this, and it's particularly important with a technology which mostly lives
on ephemeral media that only lasts a few decades.

Even if you do expand your speculation to post-disaster scenarios, though,
while it's true the archive wouldn't be an instant reset button, it would
greatly accelerate the recovery of technology. It's worth noting that it will
come with a slew of (human-readable, not encoded) technical works regarding
subjects ranging from modern software engineering to microprocessor design to
photolithography to power systems, which we call the Tech Tree, along with a
guide and index to all the stored repos. Wherever its inheritors / discoverers
may be in terms of technological advancement, and especially if they have
modern-ish hardware (which can last much, much longer than most storage
media), recovering the archive's contents will be a lot faster than
rediscovering them from scratch.

(Also worth noting we'll be storing "greatest hits" copies of the ~15,000
most-starred / most-relied-on repos, along with a sampling of several thousand
repos with few/no stars, in a selection of places like Oxford's Bodleian
Library; our hypothetical future tech seekers won't have to go all the way to
Svalbard for those.)

I don't want to stress the doomsday scenarios too much, though, despite our
ongoing pandemic. I think the most likely outcome by far is that progress will
continue; the archive may be useful to recover a couple of otherwise forgotten
technologies that suddenly become important / interesting; and it will
ultimately be chiefly of interest to historians. That historical value is a
key reason why it casts such a broad net. I too have a couple of fairly
unsophisticated pet projects in there that the future won't be interested in
individually - but collectively is another matter. One of the most interesting
things our advisory committee told us is that history is replete with lists
composed by wealthy people of the books they thought most important, carefully
preserved for posterity, whereas what modern historians _really_ want is
ordinary people's shopping lists, of which almost none survived. That's one
reason there are millions of repos in the Arctic now, instead of e.g. just the
most-starred 100K: some of those may be the modern technological equivalent of
Renaissance shopping lists, for the historians who may take a particular
interest in this (possibly) especially wacky and volatile era.

I know it's an inherently cinematic and dramatic project and so it's tempting
to call it a PR stunt ... but I assure you, it's not, and, speaking
personally, I would never have gotten involved with it if I thought it was.

~~~
rkagerer
Did people with repositories know this was going to happen and did you give
them a choice to opt out?

~~~
throwaway368765
Rather more eloquently asked than by the other person I saw querying this[0]!
I suspect it's covered under GitHub's TOS - specifically[1], only public
repositories were included and these are all effectively just backups,
especially in the case of the vault in Svalbard. But you can opt out of the
'warm storage'[0].

[0] https://github.com/github/archive-program/issues/36
[1] https://docs.github.com/en/github/site-policy/github-terms-of-service#4-license-grant-to-us

~~~
rkagerer
I recognize they wouldn't have done it unless they felt confident of having
the legal right, but it's just bad manners not to ask first.

If that's the case, this not-a-PR stunt degraded my impression of them.

I'm quite certain this isn't what their customers contemplated when reading
"backup" in their ToS.

EDIT: Interestingly it says " _This license does not grant GitHub the right to
sell Your Content or otherwise distribute or use it outside of our provision
of the Service._ "

It also says " _You still have control over your content_ ".

Is a subarctic vault really within the ordinary course of providing the
service? Did content owners have an opportunity to exert any control?

Most probably think it's neat, but GitHub would be naive to imagine everyone
would consent.

Also what happens if it turns out one of those repos had personal information
in it and the subject makes a GDPR right-to-be-forgotten demand? Are they
going to drag it out and purge that bit of tape?

~~~
throwaway368765
>Also what happens if it turns out one of those repos had personal information
in it and the subject makes a GDPR right-to-be-forgotten demand? Are they
going to drag it out and purge that bit of tape?

I believe GDPR has exemptions for archives ([0] section 28) so that's less of
a concern for them I imagine. I recognise what you're saying, but I think
anyone _very_ opposed would have a difficult time in court arguing GitHub
should remove their work/name/etc. My (very loose) understanding of the law is
that they would have to demonstrate some kind of loss. That being said, GitHub
could just have sent a notification email with very little effort. Maybe 'no
harm, no foul' applies here?

[0] https://www.legislation.gov.uk/ukpga/2018/12/schedule/2/part/6/enacted

------
ca_parody
Honestly, whether this project is (a) a genuine archaeological move to
preserve information or (b) a play for good press, all I genuinely thought
when this happened was "aw shucks - wish I'd fixed those bugs before they
zapped it onto film and flew it to Santa Claus".

~~~
jcahill
I am a web archivist with an archival project on Svalbard that predates this
GitHub initiative.

Additionally, large-scale GitHub-specific projects like
[https://gharchive.org](https://gharchive.org) (formerly GitHub Archive) have
existed for some time.

In my experience, code is more likely than not to be preserved in a stale
revision, if at all.

The most common forms of preservation are (a) simple tarballing and (b) git
bundles.
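
For anyone unfamiliar with option (b): a git bundle packs a repository's refs
and their entire history into a single self-contained file that can later be
cloned from directly. A minimal sketch in Python (hypothetical helper; it just
shells out to stock git commands):

```python
import subprocess

def bundle_repo(repo_path: str, bundle_file: str) -> None:
    """Pack every ref and its full history into one self-contained file."""
    # --all includes all branches and tags; a relative bundle_file is
    # created inside repo_path because of cwd.
    subprocess.run(["git", "bundle", "create", bundle_file, "--all"],
                   cwd=repo_path, check=True)
    # Bundles can be integrity-checked without unpacking anything.
    subprocess.run(["git", "bundle", "verify", bundle_file],
                   cwd=repo_path, check=True)
```

Restoring later is an ordinary clone: `git clone repo.bundle restored/`.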

------
Gollapalli
Beautiful.

Honestly, nothing scares me more than losing all the code and all the
technology we've developed in the past 70 or so years. There's been so much
advancement, but it's also transferred in such a way (institutional knowledge,
proprietary software, proprietary hardware, etc.) that it's super easy to lose.
If we preserve open hardware and software, then we could rebuild in the case
of civilizational decline and the accompanying knowledge loss, something which
we would neither be the first nor the last to experience.

~~~
Wowfunhappy
> If we preserve open hardware and software, then we could rebuild in the case
> of civilizational decline and the accompanying knowledge loss

...can we?

I'm sometimes a little concerned about how complicated chip fabs are. They
feel like something that could take generations to rebuild, even if we had all
the knowledge on what to do.

~~~
helldritch
Home photolithography and chemical-etching setups aren't common, but several
people have built them. We wouldn't be able to jump straight to 14 nm, but we
could probably get to 500-300 nm feature sizes relatively quickly (a year or
two, maybe, if starting from scratch) and shrink down from there.

Devices would be much bigger and less efficient, but we would be able to run
code and pump out 8086 processors within 6 months.

~~~
quicklime
That's just one layer of the stack though. Future archaeologists will also
need to create mock npm registries and maven repositories, and set up docker
and k8s so they can deploy a complex set of microservices to look up our
birthdays.

~~~
Wowfunhappy
...all the code for which should be right there in the GitHub vault, right?

Idk, the hardware part seems much more difficult to me.

------
toomuchtodo
> The Internet Archive is a well-known, widely beloved non-profit digital
> library which provides free public access to collections of digitized
> materials. In partnership with the GitHub Archive Program, the Internet
> Archive (IA) commenced its ongoing archive of GitHub public repositories on
> April 13 of this year. At present, IA is using a two-pronged approach.
> First, their well-known Wayback Machine is accessing and archiving raw
> GitHub data as WARCs, or Web ARChive files. As of this writing they have
> archived some 55TB of data. Second, they have the goal of making entire
> archived GitHub repositories available via “git clone,” while also keeping
> repo comments, issues, and other metadata easily accessible on the web. This
> second initiative is well underway and initial archiving is expected to
> commence this month.

Tremendous news.
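
(For the curious: WARC, the Web ARChive format, is an ISO-standardized
container for captured HTTP traffic. A minimal sketch of reading one in
Python, using the third-party warcio library as an assumption here; IA's
internal tooling may well differ:)

```python
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Iterate the HTTP responses captured in a (gzipped) WARC file and
# print each archived URL with its payload size.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(url, len(body), "bytes")
```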

------
atonse
This is so awesome, but the most surprising part to me is that all the public
source code on GitHub totals only 21 TB.

I forget that they fundamentally host text, not video etc.

I somehow thought it would be petabytes. The private repos might add up to
more than that, but those have historically been paid.

~~~
no_wizard
On the topic of size, I wonder how small it would be if you were able to
deduplicate all repositories against each other. I sometimes suspect there is
a tremendous amount of copy/pasted code out there being passed off as
original.

Even a naive deduplication might yield some very interesting results.
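
(A back-of-the-envelope version of that naive dedup: hash whole files and
count each distinct content once. A sketch only, with a hypothetical helper;
real dedup would work at the git-object level:)

```python
import hashlib
import os

def dedup_ratio(root: str) -> float:
    """Fraction of bytes under root that survive if each distinct
    file content is stored exactly once."""
    total = 0
    unique: dict[str, int] = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            with open(os.path.join(dirpath, name), "rb") as f:
                data = f.read()
            total += len(data)
            # Remember the size only the first time we see this content.
            unique.setdefault(hashlib.sha256(data).hexdigest(), len(data))
    return sum(unique.values()) / total if total else 1.0
```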

Reminds me of a time I caught someone using someone else's code in an
interview and passing it off as their own. (Using it was fine; it was the
claim that it was theirs that bugged me.)

~~~
progval
I work at Software Heritage, where we archive all source code we can find,
including all GitHub repositories, and deduplicate them internally.

The size of all file contents (including older versions of files) is a few
hundred TB, and everything else (directory structures, revision history,
etc.) is under 10 TB.

So for GitHub alone it would be a little under that.
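
That dedup falls out of content addressing: like git itself, the archive keys
each file by a hash of its bytes, so identical files across millions of repos
collapse into a single stored object. A sketch of the git-style blob
identifier (the same construction git uses; shown here for illustration):

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """git-style blob hash: SHA-1 over a header of the form
    b'blob <size>\\x00' followed by the raw bytes."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Identical contents always map to the same ID, wherever they appear.
assert git_blob_id(b"hello\n") == git_blob_id(b"hello\n")
```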

------
gdsdfe
Am I the only one thinking this is a waste of money and time?! How does any of
this make sense? Maybe as a weird PR stunt, but ... just strange.

~~~
dakiol
I agree. I can't believe they are spending so much money and effort to
preserve code I haven't given a damn about since the moment I pushed it to
GitHub. And the same goes for 99% of the devs I know personally.

~~~
cmrx64
it's probably less effort to just archive the whole damn thing and let the
future figure it out than to decide which things are important enough to
archive and leave everything else to disappear someday

~~~
rezendi
Archive Program director here. One of the most interesting things our advisory
committee told us is that it's really hard to determine what's important in
advance: history is replete with lists composed by wealthy people of the books
they thought most important, carefully preserved for posterity, whereas what
modern historians _really_ want is ordinary people's shopping lists, of which
almost none survived. That's one reason we cast a wide net and archived
millions of repos instead of e.g. just the most-starred 100K. Even seemingly
trivial repos might collectively be the modern technological equivalent of
Renaissance shopping lists, for the historians who may take a particular
interest in this (possibly) especially wacky and volatile era.

~~~
cmrx64
thank you so much for doing this work btw, archival is one of my loves :)

------
rwky
This means that after the apocalypse people will be able to reclaim the Linux
source code but not Windows. I find it poetic that open source may one day be
the norm.

~~~
jedieaston
I’m thinking that someone at Microsoft may have snuck the code for Windows
into the archive after it was pulled from Github. Between Windows and OS X, a
ton (most?) of the end user software would be unusable to a future generation
in its original form since they didn’t have the desktop OS it was used on.

Ironically, 500 years from now, they may think that the year of the Linux
desktop was 2008 :-D

~~~
zaptrem
[https://github.com/reactos/reactos](https://github.com/reactos/reactos) would
probably make this less of an issue as well.

------
1337shadow
Where can we find the list of the 6000 repos? On my profile it just shows 3
and an "and more" link; I'd like to get the full list. TYIA ;)

~~~
axegon_
Same. Or how they were picked. I kept scratching my head all evening because I
haven't made any updates or contributions to mine in quite a while.

~~~
zenhack
My best guess is that it's some function of popularity. The three that my
profile shows are

- capnproto/capnproto

- sandstorm-io/sandstorm

- erlang/otp

(I don't remember the order).

I contribute heavily to sandstorm. I've sent patches here and there to
capnproto, and it's vaguely a sister project to sandstorm. Those are probably
some of the most popular projects I have multiple contributions to, though
there are others.

otp feels a bit odd though, if there's an "and more" -- I sent them a one-line
patch to fix a build error when building against musl. I haven't really been
involved since, nor was I before. But it's a high-profile project.
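
If it really is a popularity function, one way to sanity-check would be
pulling star counts from the public GitHub REST API (a hypothetical check,
not anything GitHub has said about how it selected repos):

```python
import json
import urllib.request

def stargazers(owner: str, repo: str) -> int:
    """Fetch a repository's star count from the public GitHub REST API."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["stargazers_count"]

for full_name in ["capnproto/capnproto", "sandstorm-io/sandstorm", "erlang/otp"]:
    owner, repo = full_name.split("/")
    print(full_name, stargazers(owner, repo))  # unauthenticated: rate-limited
```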

------
grogenaut
What I really want from GitHub is a way for people who own open source
projects they no longer want to maintain to hand them off into escrow, so
that at a later date a reputable group like Apache can take over maintenance
if needed.

~~~
sudhirj
Couldn’t Apache just make a fork and announce it? Or is this just about the
convenience and marketing?

~~~
grogenaut
If it's done this way then all of the web links stay live, and a new owner
doesn't need to be found immediately. Think of it as a special
permission-holding pool. There are many cases of "done" libraries that need
changes later, and this would help with them. When it's not done this way,
you can spend weeks or months trying to get hold of the author and waiting
for them to decide "oh yeah, I don't really care about x anymore".

------
brendanmc6
I'm curious, do they perform some sort of test reads on the reels to make sure
that the data was actually copied over correctly?

------
sixhobbits
What stops this stored data from degrading? Do they have to periodically
check / renew the reels?

------
makerofspoons
I was hoping for more of a description of how they plan to keep this vault
safer than the Global Seed Vault, which was once flooded due to soaring Arctic
temperatures:
https://www.theguardian.com/environment/2017/may/19/arctic-stronghold-of-worlds-seeds-flooded-after-permafrost-melts

~~~
erikbye
That was sensationalism, per usual. Bit of water in the access tunnel, no seed
damage.

~~~
price
The story says right up front in the subhed that the flooding didn't reach the
seeds. But the quotes make it pretty clear that what did happen was out of
spec.

For something that's meant to survive any catastrophe that might happen over
centuries to come, it's not a good sign to see that happen so early. It's
extra bad to see it driven by a trend, namely global warming, that we're
continuing to push farther and farther and have shown few signs of stopping.

------
etaioinshrdlu
It looks like the code is actually stored in plain text, and that this is
basically microfilm?

~~~
rob-olmos
I don't think so. Project Silica talks about storing the data in droplet-
looking voxels rather than etching language symbols. Cool video of the
process:
[https://www.youtube.com/watch?v=6CzHsibqpIs](https://www.youtube.com/watch?v=6CzHsibqpIs)

~~~
etaioinshrdlu
But, from the article, it doesn't look like they used Project Silica here,
they used piqlFilm.

~~~
rejschaap
Yeah, Project Silica is another project within the GitHub Archive Program. You
can see the microfilm in this video

[https://www.youtube.com/watch?v=fzI9FNjXQ0o&feature=youtu.be...](https://www.youtube.com/watch?v=fzI9FNjXQ0o&feature=youtu.be&t=72)

------
nomoreservices
Strong _A Fire Upon the Deep_ vibes, thinking of future archaeologists
studying that.

------
Zamicol
They __are not__ using QR codes for storage, as has been misreported by a few
media outlets.

See https://earth.esa.int/documents/1656065/3222865/170922-Piql-ESA_Slides-Final
for piql's storage method.

------
un_montagnard
> The next morning, it traveled to the decommissioned coal mine set in the
> mountain, and then to a chamber deep inside hundreds of meters of
> permafrost, where the code now resides fulfilling their mission of
> preserving the world’s open source code for over 1,000 years.

What is the probability that we still have the required tech to read that code
in 1,000 years?

~~~
davedx
Depends on whether the Great Filter is ahead of us or behind us.

------
benatkin
Ice. Not to be confused with ICE.

~~~
rvz
GitHub is working with both? Very chilling.

------
therealmarv
Only disappointed that the new badge does not show the 2 open source projects
I contributed to most over my last 10 years of open source work :( They are
not super big, but also not super small.

It seems organisation work is ignored and only forks/PRs under an individual
username are counted (is this a bug?). Software is teamwork ;)

I mean, awesome-react, tldr-pages, and homebrew-cask are probably not
unimportant, but those aren't where I contributed most.

~~~
Phillips126
I am not a huge GitHub user and have only contributed some code to a single
repo, which was merged. I was surprised to see I had the badge on my profile.

------
fnord77
[https://en.wikipedia.org/wiki/5D_optical_data_storage](https://en.wikipedia.org/wiki/5D_optical_data_storage)

------
symplee
1000 years from now, I can only imagine the hidden Y3K bugs...

------
malechimp
If things come to that I doubt the practicality of it all. But it makes easy
headlines. It also makes open source an immortality project for a lot of
people.

------
TheSpiciestDev
Ha, that README grammar fix years ago finally pays off!

------
girst
In a 1000 years people will surely benefit from the millions of copy-pasted
dotfiles :^)

------
Google234
This is a waste of money.

------
chickenpotpie
So, if I'm in the EU can I GDPR my repo out of their vault?

~~~
adrianpike
I'm guessing you're being facetious, but it has come up and it's covered in
the FAQ:
[https://archiveprogram.github.com/faq/](https://archiveprogram.github.com/faq/)

------
juanbyrge
This is pointless - a complete waste of time, effort, and energy. Isn’t there
something more beneficial they could have done instead? Why pollute the Arctic
with plastic and film canisters?

