Hacker News
The Internet Archive takes over foreign dissertations from Leiden University (universiteitleiden.nl)
164 points by gpvos 55 days ago | 57 comments



> “...it decided to deselect these dissertations, so that 3.2 km could be freed up for new acquisitions”

Am I reading this correctly and they have 3.2 kilometers of dissertations? What an interesting unit of paper archive size, though it makes sense.


I think linear bookshelf distance is a normal unit for talking about collections. At least as informative as the number of books. Guessing 15 meters per bookshelf from photos, "214 bookshelves" doesn't sound as cool to me.


3.2km of linear storage space makes sense for books. You aren't just piling them up in stacks, where volume might be a useful measure, and you aren't putting them arbitrarily deep on the same row because that prevents access. You'll usually store things like this one book deep. If you have a 4-row shelf where you could have an 8-row shelf with the same width, each row 1m wide, you have 4m vs 8m of linear storage space.


About 3 200 000 cm... That is actually a surprisingly large number if you assign any number of centimetres to each.


You are of by a factor of 10.


You are off by one f.


My dad's PhD is listed on Google Scholar, but not digitized. Although I never read it (I don't understand it), I would like it to be preserved. All universities should provide digital copies of their students' bachelor's and master's theses as well as PhDs. Data storage is so cheap these days.


>> All universities should provide digital copies of their students' bachelor's and master's theses as well as PhDs

I'm not sure that is healthy, at least not for undergraduates. I'm all for open access to knowledge, but I question how much knowledge is actually in the average undergraduate thesis. I think a greater danger exists in people being held to things they said as undergraduate students.

Famously, some of the stuff written by President Obama while he was a law student at Harvard has not been released, nor should it be. We shouldn't hold people for a lifetime to the incorrect, dangerous, or just outright silly stuff they might have said in a paper when they were new to a subject. Putting undergrad work into a perpetual public archive would also have a chilling effect amongst young students, who should be enjoying academic freedom. I cannot remember 99% of the stuff I wrote as an undergraduate, but I know that somewhere in there is something horrible that I am glad to have forgotten.


Or we could try to accept that everyone makes mistakes and that's fine. Scientific advancement is basically making slightly fewer mistakes.

My bachelor's thesis was pretty terrible, and there probably is not much for an expert to learn from it. It would have been helpful to me to read other people's theses when I was a student, though, and maybe that would have led to a better outcome.

At least here in Germany, a lot of the funding to do the research comes from the government. As a taxpayer, I'd like to be able to know the outcome of the research. I am sure there are some real gems in there too.

If a student has reasonable concerns, I would be fine with it not getting published. I believe that the default should be that it gets published.


Ha! My university (University of Florida) doesn't even keep its graduation records. They have an error in my 30-year-old graduation records, but it has been impossible to fix because they don't maintain the records anymore; at some point they outsourced it to a third party who is almost impossible to contact.


logging into a long dormant account to say i went to uf and there were hard copies of masters theses sitting on a shelf in the corner of one of my classrooms dated to the 70s. sounds about right for them to mess up.


There are strict legal rules about educational records.


While PhD theses are typically quite straightforward (at many, even most, universities a PhD thesis needs to be a proper publication, often with an associated ISBN and a copyright licence assigned to the university, or at least a number of hard copies given to the university library), master's and bachelor's theses differ considerably. Often the copyright fully belongs to the students, and the theses are not required to be published (often they are even not supposed to be, as they were done at some industry partner, or the results have not yet been published in journals due to time constraints...). So it's legally not that easy for universities to publish or even archive them, especially in retrospect.


Shodhganga in India does that on a national level.

https://shodhganga.inflibnet.ac.in:8443/jspui/browse?type=ti...


I'm guessing most recent dissertations have been digitized, but this has probably been the norm only for the last 10-15 years? Most universities have likely never given thought to digitizing anything from before then, due to the extra costs that would be involved in digitizing those physical copies. I am curious how much such an effort would cost, though.


Everything was digital at UC Berkeley back in the early 1990s and before.


> Everything was digital at UC Berkeley back in the early 1990s and before.

I can't believe I have to say this, but not every university is UC-Berkeley. Digitization isn't free and requires specialized labor and technology.

And are you really saying that in the late 1980s, all dissertations were submitted digitally? In what format?


I should have qualified this with "the engineering departments at UC Berkeley". Everything we put out (papers, technical reports, open source software) was on the Internet. Formats were varied; LaTeX and Postscript were commonly used. PDF a bit later.


There needs to be a global effort to backup the Internet Archive at this point.


Just need to find someone with ~220 PB of storage and the ability to increase that by approximately 50% annually forevermore.


That's only about 38 racks of storage, at a cost of ~$3.5M for the hard drives (redundancy included). Not that big, in the grand scheme of things.
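A back-of-the-envelope sketch of that estimate. The drive size, price per TB, redundancy factor, and drives per rack here are all my assumptions for illustration, not IA's actual hardware:

```python
# Rough sanity check of the drive-count, rack, and cost figures above.
capacity_pb = 220        # archive size, petabytes
redundancy = 1.5         # assumed overhead for extra copies / parity
drive_tb = 20            # assumed capacity of one hard drive, TB
price_per_tb = 11        # assumed rough street price, USD per TB
drives_per_rack = 450    # assumed dense chassis, ~10 x 45-bay 4U per rack

raw_tb = capacity_pb * 1000 * redundancy
drives = raw_tb / drive_tb
cost = raw_tb * price_per_tb
racks = drives / drives_per_rack

print(f"{drives:.0f} drives, {racks:.0f} racks, ${cost / 1e6:.1f}M")
# → 16500 drives, 37 racks, $3.6M
```

With these numbers it lands in the same ballpark as the comment's ~38 racks and ~$3.5M; the answer is dominated by the price per TB and the redundancy factor.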


That actually sounds remarkably accessible. Considering how much of a donation you need to make for naming rights to a rural university professorship/library building, surely this would appeal to some freshly minted startup decamillionaire with a slight peterthielite anti-establishment bent?


Actually 15 racks if you're using Backblaze storage pods. Which, now that I think about it, is about how many racks I saw in the various rooms of the church. [I just happened to be at IA headquarters last weekend.] The storage pod hardware itself would be another $1M, and then let's assume another $0.5M for various things I'm not considering (network equipment, power transformers, etc.). Still just $5M for the base hardware to store that info.

Yeah, pretty affordable.


Well, buying 220 PB of storage space is really not the problem nowadays, at least from a cost perspective. But you need to maintain all that stuff. What happens when a disk fails, what if a network switch fails, how do you update your software at scale, and so on.

I think it would be best to put it on AWS S3 Glacier Deep Archive for about 2.5 million dollars per year.
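A quick check of that figure, using AWS's published Deep Archive storage price of roughly $0.00099 per GB-month (storage only; retrieval and request fees are extra):

```python
# Yearly S3 Glacier Deep Archive storage cost for the archive size above.
capacity_pb = 220
price_per_gb_month = 0.00099   # USD per GB-month, published Deep Archive rate

gb = capacity_pb * 1_000_000   # decimal petabytes -> gigabytes
yearly = gb * price_per_gb_month * 12

print(f"~${yearly / 1e6:.1f}M per year")
# → ~$2.6M per year
```

So the ~$2.5M/year figure checks out for storage alone, before transfer or retrieval costs.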


2.5 million per year is about 10x what the worst case ongoing costs would be.


I doubt that you can do it cheaper. Permanently archiving the whole internet is an ongoing task that basically requires a small company; that's why the Internet Archive (169 employees) exists (and costs more than 2.5 million dollars per year). It is not done by buying a huge bunch of disks. Setting up a permanent stream to S3 is the only solution I can think of that a single human could handle.


Whenever you have that much data stored, how do you actually know the data is still there and can be retrieved? Even if you have absolutely insane connectivity to it, at some point don't you run out of time to check it? Apparently, 200 PiB at 1 GiB per second would take about 58,254 hours to retrieve.


It's not like it's all coming from one disk, or going to one single CPU.

20 TB drives with 500 MB/s sequential read are available today. Reading a whole disk takes about half a day.

If your storage pod has 12 of those, even a $50 N100 CPU can run xxHash at 6 GB/s (it could probably even manage MurmurHash).
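A sketch of that arithmetic, using the drive and throughput numbers above as assumptions: a parallel integrity scrub is bounded by one drive's read time, not the archive's total size.

```python
# Serial retrieval time for the whole archive vs. parallel per-drive scrub.
drive_tb = 20                  # assumed drive capacity, TB
seq_read_mb_s = 500            # assumed sequential read speed per drive, MB/s
hours_per_drive = drive_tb * 1_000_000 / seq_read_mb_s / 3600

total_pib = 200                # archive size from the comment above
serial_hours = total_pib * 2**50 / 2**30 / 3600   # single 1 GiB/s reader

print(f"serial: ~{serial_hours:.0f} h, parallel: ~{hours_per_drive:.1f} h per drive")
# → serial: ~58254 h, parallel: ~11.1 h per drive
```

Reading every drive concurrently turns a multi-year serial job into roughly half a day of wall-clock time, which is why scrubbing at this scale is feasible at all.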


I won't even pretend to know how to begin with this type of project.


Crawlers with jobs, building searchable indexes? Similar to YouTube. Down at the source it's blobs, but above it all floats a layer of tags, metrics, and searchable text. That is what the searches run against and what the preferences algo builds its lineup against?


There is, at least with books etc.:

https://annas-archive.org/torrents


I wonder if this is a large enough catalog for IA to fly out to the Netherlands to ship these in as they do with entire libraries:

>We will be very accepting of materials that you will pack, ship and de-dupe, and we are more selective when we have to pay and coordinate. But we can do this and we have done so for many many collections of items we do not have. For full libraries our Away Team will travel to your location to pack and ship.[0]

See also "Preserving the legacy of a library when a college closes."[1]

[0] https://help.archive.org/help/how-do-i-make-a-physical-donat...

[1] https://blog.archive.org/2019/12/10/preserving-the-legacy-of...


The British Library, which is responsible for hosting our PhDs, has been offline for a year following a cyber attack. It's really frustrating how long it is taking them to bring it back, and I would really value IA having an archive.


Taking that long is an indicator of permanent damage? As in, they had one copy, it's encrypted, and they hope to keep it lowkey...


The interesting question is why they aren’t expanding their archival storage space. What’s higher priority for any university archives than keeping dissertations?


These are dissertations from other universities, where the originating university still has a copy.

> The dissertations were originally part of an exchange programme between (mostly European) universities until the year 2004 but were never catalogued on arrival. ... The universities where these dissertations originally were defended informed UBL that they still have the dissertations and were not interested in receiving back the Leiden copy.


Wonder when the day will arrive when universities decide to offload all archives to online media only, just keeping the most important books and maybe unique manuscripts in libraries.


This is already happening in the Netherlands. It used to be that every book and newspaper was stored as a hard copy; now they scan it.

I think people underestimate just how much it takes to archive everything that is released in the information age.


You sure they weren't using microfilm? Quoting https://en.wikipedia.org/wiki/Microform

> Libraries began using microfilm in the mid-20th century as a preservation strategy for deteriorating newspaper collections. Books and newspapers that were deemed in danger of decay could be preserved on film and thus access and use could be increased. Microfilming was also a space-saving measure. In his 1945 book, The Scholar and the Future of the Research Library, Fremont Rider calculated that research libraries were doubling in space every sixteen years. His suggested solution was microfilming, specifically with his invention, the microcard. Once items were put onto film, they could be removed from circulation and additional shelf space would be made available for rapidly expanding collections. The microcard was superseded by microfiche. By the 1960s, microfilming had become standard policy.

and

> Harvard University Library was the first major institution to realize the potential of microfilm to preserve broadsheets printed on high-acid newsprint and it launched its "Foreign Newspaper Project" to preserve such ephemeral publications in 1938


That will be a sad day. One of the best books I checked out of the library during my graduate studies was a copy of "Wind Waves" from 1965 with handwritten corrections written in pen by some former student.


their generation and propagation on the ocean surface?


Yes, by Blair Kinsman. There are a few errors in the equations of the first edition. Minus signs that should be plus signs and things like that. My copy was stamped "University of Alberta" as the University of Calgary only became an independent institution in 1967.

If you're interested in this topic, my second-favourite was Biology and the Mechanics of the Wave-Swept Environment by Mark W. Denny. The books are both good overviews of the subject of ocean waves and they have a folky charm that makes them quite enjoyable.


Presumably most of the dissertations produced at reputable universities would be valuable enough to keep at least 2 copies in storage.


Tangential: Archive.org is giving alert popup "Have you ever felt like the Internet Archive runs on sticks and is constantly on the verge of suffering a catastrophic security breach? It just happened. See 31 million of you on HIBP!"


Wow, I'm seeing that as well.

Earlier today, I was seeing reports on Bluesky that it was down for a lot of people.


Possible supply-chain "attack" (or demonstration, from what I can tell) on wherever they get their polyfill library? It's coming from:

https://polyfill.archive.org/v3/polyfill.min.js?features=fet...


Possibly unrelated. How could they escalate from a script injected in the frontend to the database of all users?

Also, the vulnerability seems to be a domain takeover. But Archive is self-hosting a static version of the dependency?


One way might be to capture credentials for admin accounts if they have a "god mode".




https://blog.archive.org/2021/02/04/thank-you-ubuntu-and-lin... They openly show a possible vector. "The Internet Archive is wholly dependent on Ubuntu and the Linux communities that create a reliable, free (as in beer), free (as in speech), rapidly evolving operating system. It is hard to overestimate how important that is to creating services such as the Internet Archive." Maybe CUPS?


I mean, that gives nothing away; if someone compromised Ubuntu the OS, they would have a lot more targets than IA.


somebody really wants to cause digital scarcity in direct opposition to digital abundance

my guess is they are people who confuse the fact that scarcity amplifies value with the idiot idea that scarcity creates value

and also these are entities holding on to a way to do business, publishing, and media that made sense back when the internet wasn't around


This is an amazing trove of human knowledge - if made digitally accessible, the titles should be on a Web page and the references crawled by Google Scholar.

We should eventually OCR all that stuff to use it to train LLMs. Seen from that commodity perspective, it has financial value.

Unfortunately, the human species is pretty bad at long-term archiving of digital assets. Good luck to the Internet Archive; they have had their share of recent troubles, and I hope their continuation is secure.

Imagine the struggle, sweat and suffering that went into these 3.2 kilometers of shelf space; actually, probably only someone who has done a Ph.D. can appreciate that.


When IPO?


off topic, b


I believe there was pressure on IA so that bigger corporate players of data hoarding could monopolize access.



