
Public appeal to back up the Climate Monitoring and Diagnostics Laboratory's FTP repo - bane
https://www.reddit.com/r/DataHoarder/comments/5q4xxe/erik_fichtner_on_twitter_please_wget_m_np/
======
astrodust
We should be mirroring _fucking everything_.

Where's a good place to put data like this? Is there even a place where people
can start to inventory what's at risk?

The biggest problem is not knowing what's out there and which records need to
be protected, as there are so many people doing important work.

~~~
kalleboo
The Internet Archive has made this their mission.

Archive Team keeps a constant eye out for major content sites at risk of being
shut down and proactively mirrors their content.
[http://www.archiveteam.org/index.php?title=Main_Page](http://www.archiveteam.org/index.php?title=Main_Page)

On EPA data
[https://twitter.com/textfiles/status/824110893034311680](https://twitter.com/textfiles/status/824110893034311680)

They don't just archive online content, either. They devote substantial
resources to scanning magazines and to ingesting (and cataloging) CDs and
floppies before they all rot away.

~~~
closeparen
Won't they delete it all when epa.gov puts up a robots.txt?

~~~
af16090
My understanding is that they will disallow access to a webpage if a
robots.txt goes up but they will keep the copy of the webpage in case the
robots.txt is ever changed[1]. So a future administration could change the
robots.txt and the webpages would be accessible again.

[1]: Seems like that happened here:
[https://en.wikipedia.org/wiki/Wayback_Machine#Netbula_LLC_v....](https://en.wikipedia.org/wiki/Wayback_Machine#Netbula_LLC_v._Chordiant_Software_Inc.)

~~~
hunter2_
I know robots.txt is all about respect in the first place, but is it really
expected that mirrors obtained via robot will be made inaccessible to humans
whenever the upstream later prohibits robotic fetching? It seems much more
reasonable for the mirror to merely prohibit robots, not humans, from fetching
its copy while the upstream's prohibition stands.

~~~
CM30
I suspect the original thought was that a website owner might later decide
they don't want their site publicly viewable in the Internet Archive, and
would sue/complain if it was still visible there. Like, say, if they
accidentally allowed robots to archive a 'hidden' directory.

Problem is, those use cases are greatly outweighed by ones where:

1\. The site has changed ownership to a different (legitimate) company or
organisation, and they don't want their new site archived.

2\. A domain squatter/seller has bought the domain, blocked all bots to stop
the holding page being indexed in Google and accidentally blocked the archive
in the process.

3\. A technician or developer has accidentally blocked the archive/all robots
due to a personal mistake/copying code from the internet.

4\. The site owner doesn't know the archive has a bot, and has blocked all
non-Google/Bing bots to 'reduce strain' on the server. That last one is
depressingly common (see the sketch after the link below):

[http://webmasters.stackexchange.com/questions/75993/only-all...](http://webmasters.stackexchange.com/questions/75993/only-allow-google-and-bing-bots-to-crawl-a-site)
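
For illustration, a hypothetical robots.txt in the spirit of case 4: whitelist
Google and Bing, turn everything else away, and the Wayback Machine's crawler
(which respects the ia_archiver user-agent token) is locked out as collateral
damage:

    User-agent: Googlebot
    Disallow:

    User-agent: bingbot
    Disallow:

    # Everyone else -- ia_archiver included -- is refused the whole site.
    User-agent: *
    Disallow: /

An empty Disallow means "allow everything", so only the two named crawlers get
through.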

------
smkellat
Any particular reason we absolutely _must_ DDoS this server at the moment?

From a current session:

    ftp> open aftp.cmdl.noaa.gov
    Connected to aftp.cmdl.noaa.gov.
    421 There are too many connected users, please try later.
    ftp>

~~~
bane
One of the beautiful things about ftp servers is that you can set a connection
limit and they'll just refuse connections above that limit until a slot frees
up.
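
The "421 There are too many connected users" reply upthread looks like
vsftpd's stock message for exactly this cap. Assuming it's vsftpd (an
assumption; I can't see NOAA's config), the limit is a couple of lines in
vsftpd.conf, with made-up numbers here:

    # vsftpd.conf -- illustrative values, not the NOAA server's actual settings
    max_clients=200   # refuse connections beyond 200 concurrent sessions
    max_per_ip=2      # and cap each client IP at 2 connections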

~~~
IgorPartola
You can do the same with almost every service I know of. Well, at least the
connection-oriented ones.
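
As a minimal sketch of the same idea for a generic connection-oriented service
(the port, the cap, and the messages are all made up for illustration): count
live connections and turn away the overflow.

    import socket
    import threading

    MAX_CLIENTS = 2  # illustrative cap, analogous to vsftpd's max_clients
    slots = threading.Semaphore(MAX_CLIENTS)

    def handle(conn):
        # Refuse immediately when every slot is taken, FTP-421 style.
        if not slots.acquire(blocking=False):
            conn.sendall(b"421 There are too many connected users, please try later.\r\n")
            conn.close()
            return
        try:
            conn.sendall(b"220 Welcome.\r\n")
            while conn.recv(4096):  # placeholder for a real protocol loop
                pass
        finally:
            slots.release()
            conn.close()

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 2121))
    srv.listen()
    while True:
        conn, _addr = srv.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()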

------
tsomctl
Also note [https://github.com/climate-mirror/datasets/issues?q=is%3Aiss...](https://github.com/climate-mirror/datasets/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc)

Seems like archive.org should be archiving this stuff, so it's all in one
place and future researchers don't have to go hunting all over the internet
for it.

~~~
pimlottc
Jason Scott (twitter handle @textfiles) is already on it:

[https://twitter.com/textfiles/status/824104225688981504](https://twitter.com/textfiles/status/824104225688981504)

~~~
bane
Jason Scott et al. may find themselves in line for a Medal of Freedom at some
point in the future.

------
niftich
Things to keep in mind:

\- In the US, the CFAA has language that may or may not be relevant [1]. This
isn't _intended_ to be FUD, although I can understand if you think so. As with
most legal things, don't believe me -- a random guy -- on the internet; ask a
lawyer, or _at the very least_ be aware of your risks.

\- Check the license and terms, if included. A Reddit post claims these [2]
are the terms; they seem okay, but I have not confirmed this for myself.

\- FTP is unencrypted and not tamperproof, so data integrity, data
authenticity, and connection privacy aren't guaranteed. And because the
original source did not publish checksums over a secure channel, there is no
way to know whether these files are the originals. Unless the original source
has safeguarded strong cryptographic checksums the entire time, or published
them over a tamperproof channel, it will be difficult to prove the provenance
and accuracy of the data, no matter how many third-party copies exist. (A
checksum sketch follows the references below.)

[1] [https://www.law.cornell.edu/uscode/text/18/1030](https://www.law.cornell.edu/uscode/text/18/1030)

[2] [https://www.reddit.com/r/DataHoarder/comments/5q4xxe/erik_fi...](https://www.reddit.com/r/DataHoarder/comments/5q4xxe/erik_fichtner_on_twitter_please_wget_m_np/dcwo823/)
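
Given that, the best a mirrorer can do right now is probably to compute strong
checksums of whatever they fetched and publish them, so independent copies can
at least be cross-checked against each other. A minimal sketch in Python --
the mirror directory and manifest name are placeholders:

    import hashlib
    from pathlib import Path

    MIRROR = Path("cmdl-mirror")   # wherever the wget -m output landed
    MANIFEST = Path("SHA384SUMS")  # output compatible with `sha384sum -c`

    def sha384(path, bufsize=1 << 20):
        """Stream the file so multi-GB datasets never have to fit in RAM."""
        h = hashlib.sha384()
        with path.open("rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    with MANIFEST.open("w") as out:
        for p in sorted(MIRROR.rglob("*")):
            if p.is_file():
                # Same "<digest>  <relative path>" layout the coreutils tools emit.
                out.write(f"{sha384(p)}  {p.relative_to(MIRROR)}\n")

Another mirror can then run `sha384sum -c SHA384SUMS` against its copy --
though agreement between mirrors still doesn't prove fidelity to what NOAA
originally served, which is the point above.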

~~~
jMyles
> strong cryptographic checksums

They needn't even be strong, right? Even a weak checksum on a data set of this
size is basically collision-proof, no?

~~~
niftich
It _absolutely_ needs to be a strong cryptographic hash, because you want to
protect against both deliberate tampering and accidental modification.

Any other function -- one that's non-cryptographic, too short, broken, or has
known weaknesses -- may be adequate for detecting accidental corruption, but
it will not protect against deliberate tampering.

For example, SHA-384 or SHA-3 are great choices.

~~~
Dylan16807
Though this isn't a case that needs to be append-proof (length extension only
matters when a hash doubles as an authenticator), so SHA-256 is fine.

------
richard_todd
Making important data "impossible" to destroy/lose/vandalize is sensible on
its own merits. The more signed and decentralized data is, the better. You
don't have to be against the new US President to think it's a good idea, but
if the political climate brings attention to the matter, that's at least one
silver lining.

------
barney54
Why is there any threat of these databases going away?

~~~
astrodust
If you're asking that question you haven't been paying attention.

Science is the enemy. It will be destroyed if it continues to get in the way
of the administration.

It happened here (Canada) [http://www.macleans.ca/news/canada/vanishing-canada-why-were...](http://www.macleans.ca/news/canada/vanishing-canada-why-were-all-losers-in-ottawas-war-on-data/) and it could happen on an even bigger
scale in America. The Conservative party went about cutting funding, slashing
archives, burning everything to the ground, sometimes literally. They'd do it
with little warning, zero fanfare, and an aggressive timeline. One day you had
a climate archive, the next the shredding company had taken care of it.

Those servers aren't free. They depend on budgets and grants to stay running.
If that money is cut, those files are _gone_.

------
bane
The reason I posted a Reddit thread is that its discussion links out to many
other related discussions.

------
Skunkleton
It would be cool to get all of this available via BitTorrent. The data could
be split up into reasonably sized chunks, so everyone could store a little
while still having immediate access to the whole thing.
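
To make the chunking concrete: a single-file .torrent is essentially a
bencoded dictionary holding the SHA-1 of every fixed-size piece, which is what
lets each peer hold and verify just a subset. A rough stdlib-only sketch --
the filename, tracker URL, and piece size are all placeholders:

    import hashlib
    from pathlib import Path

    PIECE = 1 << 20  # 1 MiB pieces: the "reasonably sized chunks"

    def bencode(x):
        """Minimal bencoder covering only the types a .torrent needs."""
        if isinstance(x, int):
            return b"i%de" % x
        if isinstance(x, str):
            x = x.encode()
        if isinstance(x, bytes):
            return b"%d:%s" % (len(x), x)
        if isinstance(x, dict):  # the spec requires keys in sorted order
            return b"d" + b"".join(bencode(k) + bencode(v)
                                   for k, v in sorted(x.items())) + b"e"
        raise TypeError(type(x))

    def make_torrent(path, announce):
        data = Path(path).read_bytes()  # fine for a sketch; stream for real use
        # Concatenated per-piece SHA-1s; peers verify each chunk independently.
        pieces = b"".join(hashlib.sha1(data[i:i + PIECE]).digest()
                          for i in range(0, len(data), PIECE))
        info = {"name": Path(path).name, "length": len(data),
                "piece length": PIECE, "pieces": pieces}
        return bencode({"announce": announce, "info": info})

    Path("dataset.torrent").write_bytes(
        make_torrent("some_cmdl_file.txt", "http://tracker.example/announce"))

In practice you'd reach for an existing tool rather than roll this by hand,
and the Internet Archive already publishes torrents for its items, which
covers the "everyone stores a little" part via partial seeding.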

------
losvedir
Should I know who Erik Fichtner is? Is he an NOAA scientist? From what I heard
on NPR this morning and read in a NYTimes piece, the Trump administration
doesn't seem to be doing anything unusual. I thought it was customary for the
incoming administration to make sure the agency is all on the same page. I'm
getting hysteria fatigue, I think.

~~~
eropple
The anti-science posture of Trump specifically and Republicans generally is
alarming by itself. The steady denial of _observable fact_ by this
administration, and the immediate scrubbing of executive-controlled public
relations pages of _factual materials_ about topics that are very inconvenient
for them, make worries of Orwell's memory hole very, very reasonable.

This isn't hysteria. This is insurance. Don't be That Guy.

