
Never trust a corporation to do a library’s job - ColinWright
https://medium.com/message/never-trust-a-corporation-to-do-a-librarys-job-f58db4673351?repost=HN2
======
asuffield
Hang on, this article links to instructions on how to do some of the things it
claims can't be done.

It complains that the news archives frontend is gone, but then links to the
page which explains how to do the same things using search:
[https://support.google.com/news/answer/1638638?hl=en](https://support.google.com/news/answer/1638638?hl=en)

It also complains that groups is dead because you can't search by date... but
the exact same method used in those instructions works just fine:
[https://www.google.co.uk/search?q=site%3Agroups.google.com+a...](https://www.google.co.uk/search?q=site%3Agroups.google.com+after%3A1995+before%3A1999&biw=1485&bih=952&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1995%2Ccd_max%3A1999&tbm=#tbs=cdr:1%2Ccd_min:1995%2Ccd_max:1999&q=site:groups.google.com+linux)
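
For anyone who wants to build that kind of date-restricted query by hand, here is a minimal sketch. The `q` and `tbs=cdr:1,cd_min:...,cd_max:...` parameters are copied from the link above; the function name and the default base URL are my own, and whether Google still honours these parameters is an assumption.

```python
from urllib.parse import urlencode


def date_range_search_url(terms, year_min, year_max, site=None,
                          base="https://www.google.com/search"):
    """Build a web-search URL restricted to a custom date range.

    The parameter names (q, tbs) and the cdr/cd_min/cd_max syntax are taken
    from the link in the comment above, not from any official documentation.
    """
    # Optionally restrict results to one site, e.g. groups.google.com
    query = terms if site is None else f"site:{site} {terms}"
    # cdr:1 switches on the custom date range; cd_min/cd_max are its bounds
    tbs = f"cdr:1,cd_min:{year_min},cd_max:{year_max}"
    return base + "?" + urlencode({"q": query, "tbs": tbs})


url = date_range_search_url("linux", 1995, 1999, site="groups.google.com")
print(url)
```

This only reproduces the URL shape; it says nothing about how complete the index behind it is.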

The article claims that book scanning is slowing and links to an article
which says it's still going in some places, but explicitly says that it's
slowing down because some of the libraries are running out of books that need
scanning.

It links to an old Quartz article from 2012 claiming that "20% time is dead".
After the first three paragraphs, that article links to the rebuttals:
[http://qz.com/116196/google-engineers-insist-20-time-is-not-...](http://qz.com/116196/google-engineers-insist-20-time-is-not-dead-its-just-turned-into-120-time/)
[http://qz.com/117164/20-time-is-officially-alive-and-well-sa...](http://qz.com/117164/20-time-is-officially-alive-and-well-says-google/)

I'm not all that interested in arguing the title point of the article, but
when an article provides a whole stack of "evidence" and superficial
investigation reveals that most of it does not support what the article
claims, I question the motives of the author.

~~~
waxpancake
Hi! I wrote the article. First off, I find it disingenuous that you don't
mention you work for Google. But, hey! I'll give you the benefit of the doubt
that it was a simple oversight.

Addressing your comments:

1\. Google News Archive is, without question, a dead project. No new material
is being added, no new development is being made, and it's unsupported. They
removed the News Archive homepage and redirected it to News.

The method Google suggests for web search isn't limited to news articles,
making it effectively useless for research. (It shows _everything_ indexed in
Google.)

You can search for some newspapers in Google Search, but it's impossible to
find any date before January 1970, order by date, or filter by publication.
You're stuck with post-1970 date filtering for all papers, ordered by
relevance.
[https://www.google.com/search?q=site%3Agoogle.com%2Fnewspape...](https://www.google.com/search?q=site%3Agoogle.com%2Fnewspapers+%22the+Berlin+wall%22&es_sm=119&source=lnt&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A01%2F01%2F1970&tbm=)

For reference, these were the options that were available in News Archive
Search:
[http://www.library.illinois.edu/hpnl/images/newspapers/gna_a...](http://www.library.illinois.edu/hpnl/images/newspapers/gna_advanced_screen.gif)

2\. I didn't say Groups was dead. I said it was effectively dead for research
purposes, which is true. For example, you can't search or filter by date
across groups anymore:
[https://groups.google.com/forum/#!search/linux](https://groups.google.com/forum/#!search/linux)

In your example, how would you propose (for example) finding the first mention
of Linux on Usenet? You can't, at least in part because the option to order by
date is completely broken:
[https://www.google.co.uk/search?q=site%3Agroups.google.com+a...](https://www.google.co.uk/search?q=site%3Agroups.google.com+after%3A1995+before%3A1999&biw=1485&bih=952&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1995%2Ccd_max%3A1999&tbm=#q=site:groups.google.com+linux&tbs=cdr:1,cd_min:1995,cd_max:1999,sbd:1)

Not to mention, only a fraction of the total posts are indexed and available
in Google Search. For example, limiting your query to 1995 returns only 70
posts. There were many more than that being posted monthly in
1995 in comp.os.linux.advocacy alone.

3\. It's entirely plausible that Google's library partners are running low on
books, though that doesn't explain why the project appears to be completely
dormant. As I mentioned, the official blog stopped updating in 2012 and the
Twitter account's been dormant since February 2013. It doesn't seem like any
books have been added in the last year -- no new books from January 2014 to today:
[https://www.google.com/search?q=a&biw=1146&bih=933&source=ln...](https://www.google.com/search?q=a&biw=1146&bih=933&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2014%2Ccd_max%3A&tbm=bks)

4\. The 20% time thing is interesting. As a Google engineer, I imagine you'd
have a better perspective on that than I would.

Former employees have explicitly said that 20% time no longer exists in the
way it used to, and current employees, including here on Hacker News, say that
it exists but only on top of your existing workload (effectively making it
120% time). I tend to trust them over a PR person, but really, that was a
brief aside in my overall article.

The fact that a tiny fraction of the former functionality of a service is
possible, albeit with an obscure and user-unfriendly method, does not detract
from the overall point:

Google's current priorities don't appear to be in archiving the past.

~~~
asuffield
Sigh. Standard disclaimer: nothing I write here has got anything to do with my
job, I'm not representing my employer in any way. I _cannot_ talk about
anything I know personally. I've also never worked on any of these projects.
However, I can read, describe, and link to public material on the internet
like anybody else.

In all honesty, I have no interest in the news archive projects; I read an HN
link, followed some of the links in it, and said "this isn't what I was
promised on the previous page". It sounds like you just made a second attempt
at writing the article. I suggest you take the original one down and put this
one up instead; it stands up to at least the completely superficial fact-
checking of reading the links in it, which makes it a significant improvement
- although it now appears to be a list of fairly straightforward bug reports.
(I like bug reports. Bug reports are actionable.)

Engaging with the subject would require substantially more effort on my part
to research and investigate what's going on here, because I don't know
anything about it beyond what I read in links here. I'm not going to do that.
However, I would encourage anybody with an interest in this subject to do the
research and write up their findings.

~~~
britta
Saying that these problems look like bug reports is dismissive of the depth of
the problems. Stopping development of products and removing access to features
isn't unintentional, and a lot of people have already complained about each of
these problems over the years. Andy's article is making a larger point that
what has happened to these products is part of a pattern, that Google is not
being as responsible in stewarding its information as its mission statement
said it would try to be.

~~~
asuffield
> Saying that these problems look like bug reports is dismissive of the depth
> of the problems.

Personally I completely disagree with your priorities. I think a bug report is
far more valuable, since people can act on bug reports and make things better,
while I would not anticipate any meaningful action as a result of speculation
about mission statements.

~~~
joepie91_
That makes me wonder why the Chromium bugtracker appears to effectively be a
black hole, if the bug reports are so "valuable". I don't think I've ever
gotten a single response to my CSS calculation bug report.

------
brador
Why isn't archive.org distributed P2P at this point? Instant, massive,
redundancy.

Let me download some software and allocate how much of my drive space I'd like
to help them with. The software would then intelligently use that space as
their distributed backup system. Then they can focus on collection and
collation with one less thing to worry about.

~~~
ivank
I'll assume IA has 20PB+ of unique data today, going by
[https://archive.org/web/petabox.php](https://archive.org/web/petabox.php)

Let's guess that the average person can contribute some 300GB of their disk
space. If IA wants to keep a minimum of 5 copies on the network (probably a
safe number given how many people will be constantly dropping out), they need
(20x1024x1024x5)/300 = 349,525 people contributing their disk space. That
doesn't seem even close to attainable.
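
As a quick check of that estimate, a small sketch with the same inputs (20 PB, 5 copies, 300 GB per person, 1 PB taken as 1024 x 1024 GB). The function name is made up for illustration, and the inputs are the guesses above, not measured figures.

```python
def contributors_needed(archive_pb, copies, gb_per_person):
    """Return how many contributors are needed to hold `copies` full replicas."""
    # Convert PB to GB using binary units, then multiply by the replica count
    total_gb = archive_pb * 1024 * 1024 * copies
    return total_gb / gb_per_person


people = contributors_needed(archive_pb=20, copies=5, gb_per_person=300)
print(f"{people:,.0f} contributors")  # 349,525 contributors
```

Changing the number of copies or the per-person contribution scales the result linearly, so 3 copies at 1 TB each would need roughly a sixth as many people.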

~~~
cma
If the goal is perfection rather than triage.

------
mwsherman
This is a good example of narrative trumping reality, and we geeks are every
bit as susceptible as anyone else.

The actual thesis of the article is that Google is losing interest in
archiving efforts. Perhaps true, perhaps partially true, perhaps false, or
perhaps unsupported either way. Entirely valid and worth exploring.

The second thesis is that the Internet Archive is doing good work. Great!

The title, however, is “Never trust a corporation to do a library’s job”. A
generalization which no article can prove or disprove. Just dogma.

Why must we make these leaps?

~~~
tedunangst
Correct or not, the article's thesis is more like you can't know which
corporations to trust with archiving, therefore you can't (or shouldn't) trust
any. It's not possible to determine, a priori, when or if Google will decide
that archiving is no longer interesting.

Apparently reports of the demise of Google's archive are greatly exaggerated,
although the fact that they are plausible represents the danger.

~~~
mwsherman
s/corporations/organizations

------
hackuser
The Internet Archive also seems to be building a Usenet archive, though the
site doesn't seem clear about their plans or status:

[https://archive.org/details/usenet](https://archive.org/details/usenet)

I also found The Usenet Archive, which claims to be "much larger scale" than
Google's: [http://www.theusenetarchive.com/](http://www.theusenetarchive.com/)

Are there others?

------
jacquesm
I struggle with something similar. Building reocities was a one-time affair;
hosting and maintaining it really starts to add up. But I'll keep it alive as
long as I can and as long as it is being used.

~~~
textfiles
Hi, jacquesm. Can we talk about mirroring reocities?

------
jccalhoun
archive.org is awesome but it has limits on what it can archive. Many of the
sites I look for there have broken images or missing Flash elements, and with
more and more sites using complex JavaScript, the limits of what archive.org
can curate are only going to become more of an issue.

There is tons of stuff from the 90s that is gone. For example, back in the day
there were a few audio shows that were recorded in RealAudio. archive.org
doesn't capture that stuff. I was going through the archives of bluesnews.com
for a project and found all these references to interviews with people like
Carmack and Romero and other more obscure people and the audio is just gone. I
tweeted at one of the guys who owned the company and he says the hard drives
are in storage and he wants to get them online again one day but that day may
never come.

Another example: GameSpot used to have some good articles about the history of
things like RTS games and such. archive.org has most of that stuff but not
all. I messaged the people doing the tech support on the site about it and
they were like "yeah, that was a few site upgrades ago" and had no interest
in trying to get that stuff back online properly.

With Joystiq and TUAW shutting down it is only a matter of time before AOL
just pulls the plug on those sites, too.

~~~
ghaff
All that is true but it was also true before digital media. While saving
"everything" (whatever that means exactly) may or may not be a laudable goal
there are always going to be practical limits to what any organization can
realistically achieve for the reasons you cite and others. I could probably
name any number of online magazines that have gone belly-up or at least
restructured, and old content in their CMSs is effectively gone forever as a
result. I'm not sure how you would even approach archiving a complex site with
high fidelity.

Even if the Library of Congress, say, took the task on with good funding, lots
of things wouldn't be preserved. (And I'm not sure I would like seeing the
government in this role in any case.)

------
lettergram
One of the paragraphs:

"Even Google Search, their flagship product, stopped focusing on the history
of the web. In 2011, Google removed the Timeline view letting users filter
search results by date, while a series of major changes to their search
ranking algorithm increasingly favored freshness over older pages from
established sources. (To the detriment of some.)"

I don't know if this is what they mean, but I can search by date just fine:

[http://imgur.com/e1tEq4M](http://imgur.com/e1tEq4M)

Perhaps it's just not as clear of a view?

Overall, I agree with the article, but we have the Internet Archive for the
internet; perhaps it's time for another organization for everything else. I am
sure there are those organizations, but they don't seem all that large or
effective.

~~~
narcissus
I think the difference is that the date search is for finding stuff that was
originally posted / crawled in that date range. (I believe) the original
comment was regarding searching some HTML, for example, as it was at that
date.

That is to say: if you had a page that changed over time, you would be able to
search for that page as it existed in a particular date range, not just
searching for a page that was posted / first crawled in that range.

------
ghaff
I can't really disagree with the basic point. Expecting a for-profit
organization--even if they're Google--to reliably over a long period of time
provide an archiving service that's mostly a money [EDIT] leak isn't a
realistic hope. The fact that it's embroiled Google in a number of ongoing
legal disputes doesn't make it any easier.

On the other hand, it's not as if anyone (other than the Internet Archive of
course) has exactly been stepping up to the plate. There's also the question
of what a library is for these purposes and what do they have the right to do
with respect to archiving copyrighted digital text and media.

~~~
DanBC
Deja News used to archive Usenet. They got bought by Google. It's now not
possible[1] to specify a date range when searching Google Groups.

They took it, they broke it, and now we can't use it properly.

[1] [http://webapps.stackexchange.com/questions/73179/how-do-i-sp...](http://webapps.stackexchange.com/questions/73179/how-do-i-specify-a-date-range-when-searching-google-groups)

~~~
ghaff
That's sort of the point of the article, no? And it's not like Deja News was
likely to stay functional in any case.

It's probably also worth pointing out that before Deja News and Google were
ever involved, a large chunk of Usenet came close to being lost [1] and
significant chunks of it have been in any case. So it's not as if preservation
necessarily happens in the absence of corporate involvement either.

[1]
[http://www.salon.com/2002/01/08/saving_usenet/](http://www.salon.com/2002/01/08/saving_usenet/)

~~~
mrottenkolber
What I don't understand is why companies aren't held accountable for their
_obvious responsibilities_. Google _has_ to publicize these archives. It
shouldn't be up to them if their public records are publicly accessible. I was
born on this planet and I have the _right_ to be able to get a copy of any
Usenet archive Google has. For free. Whenever I want.

I really really hate them for locking up this data (like so much it's not
funny, like major humanitarian crime imho).

~~~
Tomte
I don't see how Google should have any responsibility wrt. Usenet.

Google is not the only party having Usenet archives. They are not even in any
way special. They aren't the "official" archives, just as Deja wasn't.

If you're really interested in getting your hands on Usenet archives, going
way back to the Ice Age, there are plenty of ways to do so.

Especially with the huge Hamster community
([http://www.tglsoft.de/freeware_hamster.html](http://www.tglsoft.de/freeware_hamster.html)),
who are archiving almost fanatically, you should be able to get virtually
everything you'd want just by asking nicely.

Other parties who certainly have archives are all the major news servers.

~~~
hackuser
> Especially with the huge Hamster community
> ([http://www.tglsoft.de/freeware_hamster.html](http://www.tglsoft.de/freeware_hamster.html)),
> who are archiving almost fanatically, you should be able to get virtually
> everything you'd want just by asking nicely.

What is the Hamster community? It sounds interesting. The link points to some
software that could be used to help archive, but doesn't say much more (at
least browsing around the Google translation of it).

~~~
Tomte
Hamster is a local news and mail server for Win32.

It's comparable to sn on Linux: it acts as a reader, not as a "real news
server" (IHAVE/SENDME).

So people could use it with their regular Usenet account on a dialup PC.

And since it's scriptable (alas, in its own scripting language), people extended
it in lots of ways.

~~~
hackuser
Thanks. How does "archiving almost fanatically" fit in with that? Or is this
archiving culture just coincidental?

~~~
Tomte
Coincidental. I haven't kept my archives, even though at one point I had
sucked postings for several nights (all of de.*, and the Big8, I think).

But others kept their "purple data" religiously.

------
PaulHoule
Is it just me or is the Wayback Machine behind a 2400 bps modem? When I try
it, it either doesn't work or takes minutes to load.

I can also say that hard experience has taught me to fear nonprofits. A
corporation makes money by satisfying needs as well as from rich people
financing it. Nonprofits are financed by rich people so you are working 100%
for the 1% instead of 99%.

You see all the same niggardliness with nonprofits but at least in a
corporation it is possible you can improve your service and make money and get
rewarded for it. Nonprofits tend to be lose-lose or no deal.

~~~
scraplab
On the other hand:
[http://ourincrediblejourney.tumblr.com](http://ourincrediblejourney.tumblr.com)

------
jkot
This sort of question will become obsolete, once data are freely available.
4TB tarball with Usenet archive. And nice ecosystem of open-source tools for
mining.

Already happened with maps thanks to Open Street Map.

~~~
ghaff
Usenet's just an example. Its hosting at another site with better search tools
could (I assume) be dealt with relatively easily given that it's pretty much
just text. There may, of course, be issues of which I'm not aware. After all,
this hasn't happened.

By no means though is the ongoing archiving and sharing of petabytes of
complex web sites and other information repositories a simple problem even if
the data were readily available and the issues associated with rehosting
copyrighted material worked through.

------
kenrick95
"Don't be Google" is a good motto.
[http://doonesbury.washingtonpost.com/strip/archive/2014/06/0...](http://doonesbury.washingtonpost.com/strip/archive/2014/06/01)

------
kemiller
You either die a hero, or live long enough to see yourself become the villain.

~~~
minthd
Personally I think Google's mission is much more heroic.

~~~
chippy
One of the article's main points was that the mission has changed; it's no
longer about "all the world's information" because it (storing old historical
information) doesn't help their business.

------
wtbob
The Internet Archive is a 501(c)(3) corporation, isn't it?

As an aside, its new interface is utterly terrible with JavaScript turned off.

~~~
briandear
Why would you turn off JavaScript?

~~~
Figs
Security and better performance on cheap hardware are two reasons I turn it
off by default and only enable it when I really need it.

~~~
textfiles
You should turn it on when browsing archive.org - you'll really need it.

------
freelikegnu
I just played Oregon Trail through the browser with my wife and daughter
taking turns! So much fun!

------
patrickaljord
Two very cliché anti-Facebook and anti-Google articles on the front page
today. Slow news day?

------
chippy
A library is a public institution, which is not really the same thing as a
"non profit". It could be thought of as an organisation that the public
has shares in, that the public are the angel investors in. Thus, we, the
people, have an investment in a library. Allowing a private corporation to run
a part of our institutions effectively dilutes the investment of the angel
investors.

This shifts the thinking about money and sustainability away from the spectrum
of non-profit, charity, and for-profit, and towards public ownership.

The lesson that we, the investors (and thus a kind of board member) in our
public institutions, are learning is that big, public-spirited, helpful
corporations are ultimately corporations, and will ultimately behave as such.
And we should never allow our investments to be diluted in this way again.

