Archive.org and California to start a data sharing and preservation project (archive.org)
228 points by bpierre on June 5, 2018 | 45 comments



Does anyone here work at Archive.org? Can you speak to how well-funded the organization is and what sort of measures are in place to keep it afloat? I think it's a fantastic service and I donate, but I worry that it could vanish the next day if funding suddenly dries up. I feel like a large corner of the Internet collectively takes the site for granted and doesn't bother doing its own in-house archiving because "The Archive will just suck it up for us."


I can't speak formally for the Internet Archive, but the existing content and services are not going to disappear overnight: funding comes from several sources, thought has been put into organizational structure, and things have been designed to keep core access and preservation infrastructure running with minimal cost and effort (e.g., if the economy tanks).

Getting the content coverage people sometimes assume we already have is another matter. Additional funding (thanks for your donation!) goes towards additional crawling and keeping up with the endless treadmill of media types and protocols. E.g., headless browser crawling development and deployment to capture JavaScript-heavy sites (https://github.com/internetarchive/brozzler); this is much more expensive than "classic" crawling.

For more on increasing storage costs and the under-funded state of web archiving in general, I recommend David Rosenthal's blog, eg:

https://blog.dshr.org/2018/05/longer-talk-at-msst2018.html

https://blog.dshr.org/2014/03/the-half-empty-archive.html

Far more effective and robust than hoping the archive will "suck it up for us" is to upload snapshots/dumps/exports yourself! Anybody can create an archive.org account and upload content (I recommend https://github.com/jjjake/internetarchive over the HTML form), within reasonable limits. Obviously, care needs to be taken to remove sensitive (and personal) information first.
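For anyone who wants to try the upload route, here is a minimal sketch using the `internetarchive` Python package linked above. The item identifier, file name, and metadata values are made up for illustration, and it assumes credentials have already been set up (for example with `ia configure`). Treat it as a starting point, not an official recipe.

```python
from pathlib import Path


def build_metadata(title, description, mediatype="data"):
    """Return a metadata dict for a new archive.org item (values are illustrative)."""
    return {
        "title": title,
        "description": description,
        "mediatype": mediatype,
    }


def upload_dump(identifier, path, metadata):
    """Upload one local file to archive.org under the item `identifier`."""
    # Fail early if the export does not exist locally.
    if not Path(path).is_file():
        raise FileNotFoundError(path)

    # Imported lazily so build_metadata() works without the package installed.
    from internetarchive import upload

    # upload() returns a list of requests.Response objects, one per file.
    responses = upload(identifier, files=[str(path)], metadata=metadata)
    return all(r.status_code == 200 for r in responses)


if __name__ == "__main__":
    # Hypothetical identifier and file name; scrub personal data from the dump first.
    md = build_metadata("Example forum export, June 2018", "Monthly public dump")
    print(upload_dump("example-forum-export-2018-06", "export-2018-06.tar.gz", md))
```

Identifiers are global on archive.org, so in practice you would pick something unique and descriptive.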


These blog posts are fantastic. Thanks so much for sharing them.


brozzler looks great - do you know how they decide which sites are JavaScript-heavy and how they queue them up?


You can read their IRS 990 forms to get a sense of their financial health. Ironically, I couldn't locate these anywhere on Archive.org.

2016: http://990s.foundationcenter.org/990_pdf_archive/943/9432427...

more: http://990finder.foundationcenter.org/990results.aspx?action...


I don't work for them but I think the move towards decentralization is the best way to address those concerns.


> I don't work for them but I think the move towards decentralization is the best way to address those concerns.

How feasible would a reliable, distributed archive be, given the massive amount of data Archive.org has? After all, it was created precisely because the already-decentralized web was too ephemeral and unstable. I don't think decentralization is a panacea in this case.

Some problems are best solved by institutions.


Yeah, spot on. Governance and durability through org structure is a thing.

The way to solve this is to provide the Internet Archive with enough resources to build out a globally distributed storage system. Could you hack something together using their torrent tracker for every item served? Yes. But you don't hack together something made to preserve digital human culture in perpetuity.


> Yeah, spot on. Governance and durability through org structure is a thing.

> The way to solve this is to provide the Internet Archive with enough resources to build out a globally distributed storage system.

Yeah, I agree. I do see a space for other appropriate institutions (such as the Library of Congress, British Library, etc.) to pool resources and facilities with the Internet Archive to achieve that goal.

Ultimately, it'd be awesome to see each national library run a semi-autonomous IA copy that synchronizes with all the others, but can continue to operate independently (scrapers and all), if need be.


Please research dat and similar systems more before you dismiss them as unscalable and ephemeral. That isn't the case.



To be clear, this is not the state government of California, but a division of the University of California (the California Digital Library) working with the Internet Archive and Code for Science & Society.


> To be clear, this is not the state government of California, but a division of the University of California

The University of California is a part of the government of the State of California established in the State Constitution, whose governing body is comprised of 18 members appointed by the Governor and confirmed by the Senate, plus seven ex-officio members, three of whom are State elected Constitutional officers (Governor, Lt. Governor, and Superintendent of Public Instruction) and one of whom is the Speaker of the Assembly.

(That said, it is unusual and potentially misleading to refer to UC as “California”, but not because UC is actually separate from the government of the State.)


I hope archive.org doesn't fall victim to GDPR and the upcoming EU "copyright" reform (e.g., by having to stop serving the EU). I'm not a fan of the vague "right to be forgotten" concept as it applies to individuals, and I think history rewriting is a much more serious issue going forward.

Though I've heard credible complaints from "copyright" holders vs archive.org.


Yes. The popular sentiment here is "GDPR good, EU copyright reform bad". And that's understandable.

But both data privacy and copyright try to create ownership of information, and they must do so through intrusive legal measures because the physical nature of information works against it.


TL;DR:

The project aims to demonstrate how members of a cooperative, decentralized network can leverage shared services to ensure data preservation while reducing storage costs and increasing replication counts.


The former to be influenced to the "common truth" by the latter, no doubt.


[flagged]


> the copyright holder of the copy you download from archive.org tomorrow, will be copyright archive.org

That is not true.


I think he's taking poetic license there.


Mea culpa, I gave way too much benefit of the doubt here.


https://archive.org/details/TaylorSwiftReputation

That is the latest Taylor Swift album, streamable in full from archive.org.


Ok, but that is not related to what he said. Archive.org will not hold Taylor Swift's copyright in the future.


When a DMCA takedown notice is filed, the item will go dark. It will still be archived, but inaccessible. If your Internet Archive uploader account has frequent DMCA requests against it, it's disabled.

Items don't lose their copyright status by being uploaded or stored in the Internet Archive.


It will be inaccessible until it becomes public domain, but the copy that you download from archive.org will be copyright archive.org. You could get the music from somewhere else if you can, but you most likely won’t be able to.


Copyrights are not transferable like that; once they expire, they expire.

Further, that's the opposite of the entire purpose of Archive.org.

Where are you getting this information?


If I take a photo of an original work of Shakespeare, then I own the copyright on the photo. And I could license the usage of my photo. If I make a copy of anything, then I own the copyright on the copy. What if, in time, the only copy you could get hold of was my copy? My copy would be copyright me, and you’d need to wait for it to become public domain before you could use it under your terms.


> If I take a photo of an original work of Shakespeare, then I own the copyright on the photo. And I could license the usage of my photo.

That's 100% not true. Please stop spreading misinformation because you have no idea what you are talking about.

"According to a landmark 1999 federal district court ruling, The Bridgeman Art Library, Ltd., Plaintiff v. Corel Corporation, 'exact reproductions of public domain artworks are not protected by copyright.'"

https://www.huffingtonpost.com/bernard-starr/museum-painting...


The important word is ‘exact’. Non exact means the copy contains copyrightable differences.


You're just flat out wrong, dude. You've got some very bizarre beliefs about copyright law and I can't fathom where you got them. Whoever you learned from did you a great disservice, or if you 'taught' yourself then you didn't do a very good job at reading.


That's a bit unfair; copyright is a fairly unintuitive concept if you haven't been exposed to it.


That's a fair point, but somebody without much exposure to copyright law should avoid saying inflammatory things like:

>"The internet wayback machine is one of the biggest copyright thiefs in my opinion."

(https://news.ycombinator.com/item?id=17220943)


Hah, I stand corrected.


If I take a video of Jimi Hendrix improvising on guitar, then he owns the copyright on the music, but I own the copyright on the recording.


I believe that in the context of copyright, 'exact' doesn't mean a 100% perfect copy. If I make a film, I hold the copyright for it whether it's distributed on VHS, DVD, or online streaming, even if the quality on VHS is different from the film shown in theaters. Otherwise, it'd be as if converting an image from JPEG to TIFF changed its copyright status. Your example elsewhere of correcting a few typos and saying it's a different work is similar: it's still the same work, even with the minor tweaks.

Anyways, copyright is weird and complicated and there are many people who don't really understand it. But your theory that archive.org is going to get a copyright on everything they scan and store is not rooted in any sort of legal reality.


> If I take a photo of an original work of Shakespeare, then I own the copyright on the photo.

If the photo meets the requirement for independent copyrightability, which requires being a distinct creative work, that's true, and for a photo that would often but not always be the case.

For a simple non-transformative digital copy, there's no copyrightable new work and no copyright in such a work.

> What if, in time, the only copy you could get hold of was my copy? My copy would be copyright me, and you’d need to wait for it to become public domain before you could use it under your terms.

If someone were copying elements that were original to your photograph, true; but if they were merely copying the public domain text of which you had taken a photograph, no, not true at all. Copyright protects your original work.


The difference is that picture is a derived work. When you just make a copy you don't get copyright over the copy.


> It will be inaccessible until it becomes public domain, but the copy that you download from archive.org will be copyright archive.org. You could get the music from somewhere else if you can, but you most likely won’t be able to.

No.

> For example, if a book published in 1995 is a reprint of a book published in 1900, then it is eligible. However, the onus is on us to prove that it is a reprint, and if it doesn't say on the TP&V that it is a reprint, confirming its eligibility may be impractical.

https://www.gutenberg.org/wiki/Gutenberg:Volunteers%27_FAQ#V...


What if the 1995 copy had a few deliberate errors thrown in, like spelling mistakes and reworded sentences, and your new copy reproduced those errors verbatim? Then I could say that you copied the 1995 version, which is still under copyright by the publisher, and not the 1900 version, which we can all agree is public domain.


Adding a few spelling mistakes and reworded sentences will most likely not pass the originality requirement (https://en.wikipedia.org/wiki/Derivative_work#Originality_re...), and therefore will not be considered a new copyrighted work.


I'm not sure how worried I am about archive.org, specifically, in this regard. But the concern does reflect a trend in copyright, promulgated by those with significant vested interests (IP value they're seeking to maintain or grow) and so-called "maximalists".

There's a growing push to legitimize copyright claims for "instances" of a work, even after the base work has entered the public domain.

If this "sounds ridiculous" to you, just recall how most of us are increasingly worried that, in the U.S., Congress and the Executive are going to... "re-Mickey" the copyright term. As in, they already pushed it to life plus 70 years when certain Disney copyrights were about to expire. (And keep using "trade agreements" as one mechanism to try to "back door" increases to "plus 70" into other countries' IP terms.)

A separate concern I have is that archive.org currently continues to "retroactively respect" robots.txt changes.

404 your once public content, and archive.org "disappears" it from their corresponding records/copies.

As long as that's true, you can't really view it as a permanent, unbiased archive.
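To make the "retroactive" part concrete, here is a toy sketch of such a policy using Python's standard `urllib.robotparser`. It is not how archive.org is actually implemented; the `ia_archiver` user agent, the snapshot tuples, and the function names are all assumptions for illustration. The point is only that the site's current robots.txt decides whether old captures are shown.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, url, user_agent="ia_archiver"):
    """Return True if `robots_txt` lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def visible_snapshots(snapshots, current_robots_txt, user_agent="ia_archiver"):
    """Filter (timestamp, url) snapshots by the site's *current* robots.txt.

    Under a retroactive policy, what robots.txt said at capture time is ignored.
    """
    return [s for s in snapshots if allowed(current_robots_txt, s[1], user_agent)]


if __name__ == "__main__":
    snapshots = [("20100101", "http://example.com/post")]
    open_robots = ""                              # site originally allowed crawling
    closed_robots = "User-agent: *\nDisallow: /"  # a later owner blocks everything
    print(visible_snapshots(snapshots, open_robots))
    print(visible_snapshots(snapshots, closed_robots))
```

A non-retroactive archive would instead evaluate the robots.txt that was in effect when each snapshot was captured.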

As politicians, commercial interests, and their lawyers continue to have a field day constraining "online rights" (and IP rights, etc.), currently the only "guarantee" the public has of continued access and a more complete historical record is, ironically, local copies.

They say, "History is written by the winners."

Well, unless they can't find the copy you have squirreled away.



