There’s a simple alternative to the current web (hapgood.us)
150 points by mgunes on Aug 23, 2014 | 143 comments



Clearly I'm a biased observer, but I really think people should take steps to archive stuff that is important to them. Of course it's terrible when large sites go offline and take vast swaths of the Internet with them, and we should continue to shame the ones that do it. At the same time, if something is really important to you, you shouldn't store it in the form of links to random third-party servers.

One problem we need to solve as coders is giving people better tools for saving stuff. It's really hard right now to save a webpage (or worse, series of connected pages) with any confidence that you've captured everything you need to see it again if the original server disappears.

A project that I think has struck a really good balance between permanence and retaining authors' control over their writing is the Archive of Our Own (AO3). A bunch of fanfic authors got tired of sites falling out from under them, and decided to implement their own system, along with sensible governance and a way to fund its ongoing operations. The only broken links I've ever seen to AO3 are ones where the author consciously decided to take the material offline.


It always seemed like such a regression to me when browsers disabled entire page caches and started only ever showing the freshest content. Before, if a page served a 404, I could easily see my locally cached version from my previous visit. Now, I'm shit out of luck.


What is amazing is that:

1) Never in the history of civilization has local knowledge storage (disk) and local compute been so cheap

2) Never have we had a larger free software ecosystem or more hackers to deploy free software locally

3) Never have we had more evidence about the differences in civilizational freedom between local and central storage/compute

What if backup/restore skills were taught alongside home economics? Is a home a place of both shelter & storage? Are we abandoning a set of possible futures because we want the "convenience" of "someone else" backing up our digital selves/souls?

"For want of a nail the shoe was lost.

For want of a shoe the horse was lost.

For want of a horse the rider was lost.

For want of a rider the battle was lost.

For want of a battle the kingdom was lost.

And all for the want of a horseshoe nail."


> I really think people should take steps to archive stuff that is important to them

I understand why you didn't post a "self promoting" link, but I want others to know about this option:

https://pinboard.in/tour/#archive

Currently $10 for a lifetime bookmarking account, plus $25/year to archive every bookmarked page:

"Pinboard offers a bookmark archiving service for an annual fee of $25. The site will crawl and store a copy of every bookmark in your account, and display a special icon you can click to see the cached copy. If the page you bookmarked goes offline, you'll still be able to see the archived copy indefinitely."


My second reply, but I think this is really important.

We only need to look at early film history to know how easy it is to lose massive parts of our history.

Going back to old pages, I frequently get 404 results. For politically sensitive documents, the problem is much more widespread.

I would like something that not only archives pages I visit, but also versions them and tracks changes. If there was a bookmarking tool that did this, you could easily have an opt-in feature that shared content. This type of system would be a huge boost to something like the wayback machine.


What do you think of this generalized architecture?

HARDWARE: differs depending on whether you want local search/analytics or just network storage.

For mobile use, either a VPN back to your personal home/cloud server, or a hackable wifi hard drive proxy, e.g. Seagate Wireless Plus + HackGFS.

For non-analytics home use, hackable router with USB3 storage and Linux software RAID, connected to a USB3 drive chassis with room for 2-4 disks.

For analytics home use, a microserver like the HP N54L, Dell T20 or Lenovo TS140: up to a Xeon processor with ECC memory, plus 4-6 internal disks and up to 32GB RAM. Sold without a Windows tax, these support hardware virtualization and Linux, possibly FreeBSD with ZFS.

SOFTWARE: generalized multi-tier cache AND compute. Camlistore and git-annex are tackling multi-device storage sync. For archives, we need a search interface that will query a series of caches, e.g. mobile > home > trusted friends private VPN (tinc overlay) > public paid cloud archive (pinboard et al) > public free cloud archive (archive.org).

It's important for usability to have a simple, local UX that will take a search string, propagate across all private/public federated tiers of storage and compute, then aggregate the metasearch results on the client.

With this approach, we can collectively pool resources to improve on CommonCrawl.org, without locking up the 300TB index at AWS. This would turn web search engines into a secondary source, rather than a primary source. First search your archive + trusted friends, then trusted verticals (e.g. HN, StackOverflow), then a generic web search.

Let's be clear: the goal is not to archive "everything" in the world, only that which is personally important to the viewer. This attention metadata has long-term value. With this architecture, it is always optional to escalate a query to a public archive or search engine. Most importantly, there is technical autonomy and low-latency compute for local queries.
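
A rough sketch of the escalation logic I have in mind (the tier names, types and functions here are illustrative placeholders, not a spec):

    # Hypothetical sketch of the tiered metasearch described above.
    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class SearchTier:
        name: str                            # e.g. "local", "home", "friends-vpn", "archive.org"
        query: Callable[[str], List[Dict]]   # returns result dicts with at least a "url" key
        private: bool                        # private tiers are always queried

    def federated_search(q: str, tiers: List[SearchTier],
                         escalate_to_public: bool = False) -> List[Dict]:
        """Query private tiers first, optionally escalate to public archives,
        then aggregate and dedupe the metasearch results on the client."""
        results: List[Dict] = []
        for tier in tiers:
            if tier.private or escalate_to_public:
                for hit in tier.query(q):
                    results.append({**hit, "tier": tier.name})
        seen, merged = set(), []
        for hit in results:                  # naive dedupe by URL, keeping the most local copy
            if hit["url"] not in seen:
                seen.add(hit["url"])
                merged.append(hit)
        return merged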

For web pages, wget with WARC output (per HN advice in another thread) and wkhtmltopdf (also available as a Firefox plugin to print to PDF) can keep local archives. Recoll.org (a Xapian front-end with user-customizable Python filters) on Linux will search full text and provide preview snippets, or Lucene/Solr can be adapted.
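
For example, something along these lines (a sketch assuming GNU wget with WARC support is installed; the directory layout and naming scheme are arbitrary):

    # Sketch: save one page (plus requisites) into a dated WARC file,
    # ready for later full-text indexing with recoll or solr.
    import subprocess
    from datetime import date
    from pathlib import Path
    from urllib.parse import urlparse

    def archive_page(url: str, archive_dir: str = "web-archive") -> None:
        Path(archive_dir).mkdir(parents=True, exist_ok=True)
        warc_name = f"{archive_dir}/{urlparse(url).netloc}-{date.today().isoformat()}"
        subprocess.run([
            "wget",
            "--page-requisites",          # images, CSS, JS needed to render the page
            "--convert-links",            # rewrite links for offline viewing
            "--adjust-extension",
            f"--warc-file={warc_name}",   # also writes warc_name.warc.gz
            "--directory-prefix", archive_dir,
            url,
        ], check=True)

    archive_page("https://example.com/some-article")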


Holy living God.

Solves my problem, and then some. Also provides an alternative to search engines as the go-to for internet browsing. Brilliant stuff. I can't tell you how grateful I am for your work on this.

The only thing left is FLOSS version control for sound and video editing, and an effective "publish to BitTorrent" feature, and then we can pretty much put this "Web 2.0" crap to bed.

EDIT: Out of curiosity, is there any reason that analytics couldn't be done with a dedicated PC, or do you think it requires server hardware to run effectively?


It can be done on a dedicated PC, ideally one that supports h/w virtualization (VT-x and VT-d). People who want to purchase a new PC would need known-good configs. The suggested devices are relatively cheap (no Windows tax, the Dell one officially supports drivers for RedHat Linux) and can be used as PCs. With virtualization support, one could run a local Windows desktop, analytics in a Linux VM, and NAS storage in a separate Linux/FreeBSD VM - all on the same computer. For defensive security, you want to separate a read-only content store from potentially-vulnerable programs which parse & analyze data in the content store.

As someone said in another comment, the challenge with these solutions is making them usable to a mass audience who won't know or care about Linux. Android & OS X both created user experiences that hid the underlying Unix OS. If someone can only afford a single computer, then virtualization allows that device to play the role of both "server" and "desktop".

With local s/w RAID, the system can be designed so that backup consists of (1) shutting down the computer, (2) removing one hard drive and replacing it with another, and (3) taking the (encrypted) drive to an offsite location, e.g. a trusted friend or family member.

> The only thing left is FLOSS version control for sound and video editing

Could you expand more on this use case? Do you mean archiving binary blobs and storing their metadata in git? git-annex does this, and Camlistore can be adapted for this purpose. Is this only for a local production workflow, or is there a need for remote collaboration on pre-final artifacts?

> an effective "publish to BitTorrent" feature

Could you spec out what's needed? Are there trackers which specialize in public domain content? Strong DRM is coming to browsers and new hardware. While it will create unexpected user experiences and change the web as we know it today, it could increase demand for public domain content. We are going to need distribution channels for public domain content which treat copyrighted content like viruses, i.e. reject before they can enter the channel.

The same fingerprinting techniques used for copyrighted content can equally be used to improve discovery of public domain content. This means central directories (which can be cached locally) of metadata and hashes/fingerprints for public domain content (text, audio, video, raw data).

Another feature: support for software agents that take action based on the computed result of (remote event + private data). A FLOSS version of IFTTT or Yahoo Pipes, e.g. https://github.com/cantino/huginn . A remote event could be a private signal from a mobile device app. This would increase flexibility, since the user has a larger private dataset about themselves than any web/cloud service does. The user can configure FLOSS algorithms, instead of relying on remote black boxes without appeal or transparent governance.


I'm starting to think that remix culture has been essentially stifled by licensing. Artists like Steinski, Danger Mouse, and Girl Talk have gained notoriety exclusively through word of mouth. Copyright and licensing prevent them from monetizing their music, so they are forced to promote and release it in other ways, either quick pressed vinyl or free internet distribution. In a remix culture, I'm hesitant to describe anything as "pre-final", or, perhaps we should label everything as "pre-final".

I think a public domain or Creative Commons music community could develop a thriving remix scene in unique ways. Remixes can be recontextualized as forks of music. I would imagine that fingerprinting could be used really effectively here, which will require some thought. As a recording artist, it would be really useful for me to be able to delete takes but retain them in version history. The music industry is filled with stories of the person who owns the recording studio retaining the masters to a session, and then refusing to cooperate with the artist. Opening that data up would be a massive boon not only to musicians but, I think, to recording studios as well, as the finer aspects of a recording session become much easier to access.

There's a deeper problem, though. There's a dichotomy between the binary blobs that DAWs use and the user-readable stem files that artists like Radiohead release. It's as though all the information about the studio session and all the non-temporal aspects of editing are obfuscated by assembly code. Ideally, it would be nice to replace the binary blobs that programs like Audacity and Ardour save to with human-readable stems that encode sound or video editing in metadata. Ultimately I think this is a critical UI requirement, but for now I think the best solution is to use git-annex to store .flacs and .oggs, as well as both individual exported stem files and binary project blobs. In the final analysis, different DAWs are really more like different instruments, and cross compatibility between instruments is a requirement.

Uploading files to a server is pretty easy, but what is a little more challenging is automatically licensing the content, uploading to your server, creating a torrent that uses the server as a webseed, and then publishing it to a tracker. This publishing flow would let me directly publish content even on my puny little shared web hosting account. Traffic load is automatically distributed through BitTorrent, and censorship becomes much more difficult (albeit not impossible). From there, I think, it would be fairly trivial to build a front-end to replace an interface like YouTube's "Submit Video" page.
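
Roughly, I imagine the flow looking something like this (a sketch assuming rsync and mktorrent are installed; the host, paths and tracker URL are placeholders, and the licensing and tracker-submission steps are left as comments because they vary per site):

    # Sketch of a "publish to BitTorrent" pipeline: upload to your own host
    # (which then acts as a web seed), build the torrent, hand it to a tracker.
    import subprocess
    from pathlib import Path

    def publish(path: str,
                remote: str = "user@example.com:/var/www/media/",
                webseed_base: str = "https://example.com/media/",
                announce: str = "https://tracker.example.org/announce") -> str:
        filename = Path(path).name

        # 1. upload the (already licensed/tagged) file to the shared hosting account
        subprocess.run(["rsync", "-avz", path, remote], check=True)

        # 2. create a torrent listing the HTTP copy as a web seed
        torrent = path + ".torrent"
        subprocess.run([
            "mktorrent",
            "-a", announce,                  # tracker announce URL
            "-w", webseed_base + filename,   # web seed URL for this file
            "-o", torrent,
            path,
        ], check=True)

        # 3. submit the .torrent to the tracker (site-specific: web form or API)
        return torrent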

Treating copyrighted content like viruses is a feature I would really like to see more of. While I'm a firm supporter of PopcornTime/Time4Popcorn/AsPopcornGoesBy, I don't really want to use it. I'd rather find public domain things to watch than Hollywood Movies or TV shows. I wish PopcornTime had a public domain feature.

If there are trackers that only use public domain or Creative Commons content, nobody has invited me to them. ;) I think it wouldn't be too hard to make a tracker that scanned for a machine readable public domain or CC license, but I don't have much experience with torrent trackers. I'd like to try and set one up soon, I think initially you could just moderate content. I'm kind of baffled by the direction Bittorrent, Inc. has taken, because they could just as easily be advocating for number of seeds as a surveillance-free fanbase metric, rather than just releasing DRM'd music through Bittorrent.

Huginn looks really, really interesting. I sure appreciate your posts. Thanks very much.


It clearly hasn't occurred to you that private collections evaporate over time, just like public ones.

> I would like something that not only archives pages I visit, but also versions them and tracks changes.

That's a huge storage requirement, as you must realize. If you're an avid web surfer, and if every archived page had to look as it originally did (i.e. with all the linked resources), you could accumulate several terabytes per week.

> This type of system would be a huge boost to something like the wayback machine.

Here's the road to madness. Someone, aware of the rapidly declining cost of storage, rebuilds the Wayback machine based on your scheme, with the intent of archiving every Web page in existence, including all required resources so the pages look just as they originally did. Then, as the project approaches completion, this genius says, "For the next phase, I need to archive the Wayback machine itself." At that point, as the implications of what he's said occur to him, a strange look crosses his face and his imagination begins writing checks his intellect can't cash.


Several terabytes per week is a huge overestimate. Right now I'm getting 12 Mbps download speed (sadly typical for US broadband). If I saturated the connection, and if no one throttled me, I could download 900 gigabytes in a week. That would be some intense web surfing.

My practical experience is that you need about 1.5 MB per URL for storing large numbers of web pages, if you exclude video.
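
Back-of-the-envelope, for anyone who wants to check the numbers:

    # Sanity check of the figures above (12 Mbps saturated for a week, ~1.5 MB/URL).
    link_mbps = 12
    bytes_per_week = link_mbps / 8 * 1e6 * 7 * 24 * 3600
    print(f"{bytes_per_week / 1e9:.0f} GB/week at full saturation")         # ~907 GB

    bytes_per_url = 1.5e6
    print(f"{bytes_per_week / bytes_per_url:,.0f} URLs/week at that rate")  # ~600,000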


For compressed static HTML and maybe images, you could store your entire history on a flash drive nowadays. See this: http://memkite.com/blog/2014/04/01/technical-feasibility-of-...


I never said I was trying to save my entire web cache locally. Just bookmarks.

I don't have Flash Player installed. Instead, I use youtube-dl for flash videos across many websites. I'm well aware of the storage implications of this kind of activity.


The storage implications are really not so bad. You would have trouble ever breaking 100 GB unless you are some kind of bookmarking mutant.


Slightly related to this, the other day I tried to download every youtube video in my watch history.

Turns out it's pretty much impossible. The YouTube API only delivers about 20 results, which is a bug that has existed for about 2 years. I tried manually loading the watch history page, and was only able to get about 1000 out of ~8000 results. When I selected those 1000 results and tried to add them to a playlist, the interface crashed.

Does anyone have a solution for this, or is my watch history just at the mercy of Google?


You can use youtube-dl to do this:

    $ youtube-dl ':ythistory' -u USER -p PWD --write-pages
Of course, you won't get your entire history since g00gle would rather mine all your personal data itself, and never share it back with you again.


I tried this earlier. The result:

[youtube:history] playlist Youtube Watch History: Collected 0 video ids (downloading 0 of them)


Just a thought: do you have Google two-factor auth turned on? Maybe you need to generate an application password and use that to get access.


It baffles me how we got from "tracking is creepy" to "please track all I do on your site" in ~10 years.

Your watch history should be stored by your browser on your machine. It is only your business and only locally it is fully in your control.


A week ago, I had my phone forcefully smashed into a floor. The screen ended up completely dead, but fortunately, the phone worked, and the remote debugging option was on, so I was able to pull all the data on it via ADB.

But if the motherboard or some other vital component had died, I would've been locked out of my data. It's at times like this I'm glad that Google Keep syncs my notes to their servers, and that Viber stores my contacts' phone numbers remotely. If it hadn't been for them and my phone had, say, fallen in water and completely broken, I'd have lost all of that. Same argument for a hard drive death. In 2004 or so, we were expected to live with that risk. But I don't think we should have to now. Witnessing the explosion of the "cloud" buzzword everywhere, you'd think everyone in the world has convenient access to their own private space "in the cloud" (or, as those soooo-2000s people would say, "on the Internet"). And we do have access to such space, but for the most part, not quite so for the "private" thing.

Of course, ideally, I'd like to generate random keys, keep safe local backups of them and then sync encrypted data to remote servers. But I don't think the companies are eager to accept that as they probably owe a lot of their statistics and targeted advertising opportunities to big-scale mining of that plaintext data we provide them in exchange for convenience.


I agree with you.

My watch history is currently in the hands of Google.

I'm trying to copy it so that I have it saved locally.


> The Youtube API only delivers about 20 results, which is a bug that has existed for about 2 years.

That may not be a bug. It might be a way to limit the traffic created by scrapers who submit requests and then download all the videos listed on the result page.


Hrm, then why offer the feature?

I'm kind of still a beginner when it comes to APIs, but it seems disingenuous to offer an API feature and then intentionally break it so that it can't be used.


> Hrm, then why offer the feature?

Because people can still get 20 results. I mean, 20 results is way better than no results.

> I'm kind of still a beginner when it comes to APIs, but it seems disingenuous to offer an API feature and then intentionally break it so that it can't be used.

It's not broken. It returns 20 results. Maybe Google decided that was enough of a hit on their database. One could also argue that most people wouldn't want more than 20 results on their small-screen Android device served by a slow connection.


https://code.google.com/p/gdata-issues/issues/detail?can=2&s...

The number of results returned is random. Some people report seeing videos watched on computers but not locally. One person says it works.

That's not a limit. That's broken.


> Clearly I'm a biased observer, but I really think people should take steps to archive stuff that is important to them.

I recently converted all my bookmarks to saved copies of the pages in Evernote. This means I have full text search over everything I "bookmark" -- still have the links -- and never lose it regardless of what happens.

In this process, I was shocked at the number of 404s I encountered from relatively recent (last 18 months) bookmarks.

Evernote is a bit of a wreck, but so far the combination of decent page copies, browser plugins, tags, search, and accessibility (phone clients, web client, some native clients) has made it my best option.


Since 2008 or so I've been saving all pages that I would otherwise be bookmarking. Only in MHT format, but it's convenient. After reading a few quotes on linkrot years ago, and noticing many pages I'd come back to 404'ing, I decided bookmarks are useless long term (or even, at times, short term).


Is this a feature built into Evernote? If not, what tools did you use for the conversion?


Built in. "Clip This Page" or "Clip This Selection". It is a single keypress in Firefox for me.


This is the only real solution (even if it's through an intermediate agent, which may have the same problem; remember Google Reader). Of course, keeping that info in a good state (indexable, accessible, backed up, etc.) is another interesting problem. I'm afraid most people are simply not aware of any of this (on both the content supplier and content consumer sides).

This is also somewhat of a problem with printed books. There are tons of books that are out of print and cannot be easily found.


Luckily, if Stack Overflow goes down, the one billion shady scraper bots will have (at least) the most viewed questions.


Many people just don't have the clue to make usable local archives.

You're also proposing mass copyright infringement which - stupidly in this example - is not legal.


I don't think copyright infringement is the issue. Making a copy to view offline is always legal (well, fair use clauses sort of make it mostly legal), and you are doing that with a web browser anyway with cached pages; caching proxies and similar devices also perform a similar function. The protections afforded by copyright all pertain to public reproduction and dissemination.

The WSJ point made in another reply about restricting the number of free article views similarly has nothing to do with copyright. This is more of a contract issue, and anyway, if I can view only N articles for free, I can only save those N, so it is really moot.


Is saving a webpage to your local harddrive copyright infringement? The data is already on your local system when you view a webpage.


Copyright is a legal construct, not a technical construct. It doesn't matter where the data is. If a judge decides it's copyright infringement to save webpages, it will be. I can't imagine that the WSJ or any other paywalled institution wouldn't consider saving pages locally to be copyright infringement; how would they enforce a limit on article views?


It doesn't matter what the WSJ thinks, it matters how the law sees this. Let's wait for one of the nice real HN lawyers to answer!


The same way they do now? I can't save articles if I can't view them.


You can view all WSJ articles if you use Google as the referrer.


I'm interested in artists who release things under Creative Commons, or works that exist or are released into the public domain.

These tools would work fine for me, without risk of copyright infringement. Just because you don't use libre content doesn't mean that the rest of us should have a broken internet.


If the system itself is federated the same way as the rest of the data, then it doesn't matter if it's legal or not.

You can't make an omelette without breaking a few eggs and you can't fix the world without breaking a few laws.


>You can't make an omelette without breaking a few eggs and you can't fix the world without breaking a few laws.

Heh. The NSA should put this on a t-shirt and sell it.


> You can't make an omelette without breaking a few eggs and you can't fix the world without breaking a few laws.

I desperately want to see you present this as your defense in a court of law.


> It’s interesting that Andreessen can’t see the solution, but perhaps expected.

What a weird dig. It's neither expected, nor established that he can't see a solution. I'm not as smart as Andreessen and I could come up with half a dozen solutions.

The author's favourite is fine but far, far from obvious. How viable is it to run your own federated wiki anyway? Are there packages for popular systems? Are there plugins for major browsers? Is there any federation actually happening? I skimmed the resources[1] and don't know. Does anyone here run one? That would be a solution; this seems more like an idea.

And it's not like no one's doing anything. There are services like Pocket or Readability to store an article until you want to read it, Evernotes, Google Keeps... Our very own 'idlewords will archive the contents of your bookmarks for a fee[2]. Finally, there's archive.org.

[1] https://github.com/WardCunningham/Smallest-Federated-Wiki#ho...

[2] https://pinboard.in/tour/#archive


What he seems to be getting at is a much larger, more revolutionary approach to not just "the web", but "the internet" as we know it:

https://en.wikipedia.org/wiki/Named_data_networking


That's a very academic and static view of content. I don't see how it would work in today's hyper-dynamic environment, where the ads displayed on a site are priced by millisecond real-time auctions before they are delivered to the user, and websites are single-page apps with REST APIs in the background. How would that work?


By transclusion. You'd cache the big, reusable chunks of content, and serve up a fresh transient little document that transcluded both the larger content chunks and the dynamically-included advertisements.


I've been thinking about this for a while now. Please check out my web app to solve this problem: https://www.purplerails.com/

The main idea is to use a browser extension to automatically save pages that you read to the cloud (including the images, stylesheets etc) in the background. Saved pages are searchable and sharable.


This sounded really great until I went to the website and saw that I can't use my own cloud storage, only purplerails'. As soon as purplerails disappears all my saved pages are gone. I already have this functionality with diigo and it makes me very uncomfortable not to have a copy of the data.


Excellent point. The ability to download your data in a well-documented format is coming soon. See also my reply to hollerith on a native client.

Time limitations are what is preventing me from doing this.

Thanks for your feedback! Hope you will use Purplerails. :)


An early design idea I had for Pinboard was as a browser plugin that just saved everything it saw in passing to an upstream server. But the problem that stumped me was that there's much more downstream bandwidth than upstream on a typical residential connection, so it was hard to push things to a server in anything like real time. How did you end up dealing with this issue?


Pages are saved in the background. Nothing too fancy. I dedup when uploading: that helps a lot.


Okay, but how do you handle things like big PDFs or image gallery sites? Or pages that just pull in a lot of javascript includes? That stuff downloads in parallel, but then I would find myself trying to push it upstream through a little straw of bandwidth, sequentially.


You're right, it takes longer to upload a page than to download it. And image-gallery-like pages take a long time (I know because I save Imgur pages now and then :)). But in practice, this isn't a problem.

There are a couple of heuristics to avoid wastefully uploading pages: the full page is uploaded only if the reader expresses "sufficient interest" in the page. Currently the heuristic is 90 seconds of continuous reading of a page, or scrolling to the bottom. If a page is read for a minimum of 10 continuous seconds then only the text of the page is uploaded.
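
It's not the actual implementation, but the heuristic as described boils down to roughly:

    # Sketch of the interest heuristic described above (not the real code).
    def upload_decision(seconds_read: float, scrolled_to_bottom: bool) -> str:
        """Decide what to upload for a finished page view."""
        if seconds_read >= 90 or scrolled_to_bottom:
            return "full"   # full page with assets
        if seconds_read >= 10:
            return "text"   # extracted text only
        return "none"       # not interesting enough to save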

Static assets like JS files benefit from deduping: they take time to upload the first time, but subsequently processing them is much faster.

Typically, people read multiple pages in a browsing session: I rapidly open many tabs and then read each one for multiple seconds. There's a debug mode in Purple Rails in which a timer counts up when I switch to a tab. I find that I typically spend 100+ (usually many more) seconds on a page that I read through to the end. This is usually enough time on a residential broadband connection (I have Sonic DSL) to finish uploading a page. I also use Purple Rails on a tethered 4G connection almost every day: uploading is slower than DSL, but it works.

Basically, by the time you finish reading a tab, the previous tab you read would have finished saving.

Like I said, nothing too fancy.


Sweet, thanks for the detailed answer! I look forward to checking it out.


How can you dedupe if the content is encrypted?


Deduping is per-user, not across users. For the sort of content here, this works well. E.g., the static assets of web pages are the ones that get dedup-ed.

The basic algo is to generate an HMAC of the plaintext and compare it against a table of previously uploaded blobs' HMACs. The HMAC is keyed with a key derived from the user's password. When a blob is uploaded, the ciphertext and the HMAC of the plaintext are both uploaded.
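
A simplified sketch of that scheme (the key derivation and the server-side tag table here are placeholders, not the production code):

    # Per-user dedup sketch: equal blobs from the same user produce the same tag,
    # but the server cannot compare blobs across users.
    import hashlib, hmac

    def user_key(password: str, salt: bytes) -> bytes:
        # stand-in KDF; the real derivation scheme isn't specified here
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

    def dedup_tag(plaintext: bytes, key: bytes) -> str:
        return hmac.new(key, plaintext, hashlib.sha256).hexdigest()

    def should_upload(plaintext: bytes, key: bytes, known_tags: set) -> bool:
        tag = dedup_tag(plaintext, key)
        if tag in known_tags:      # server already holds this user's (encrypted) copy
            return False
        known_tags.add(tag)        # otherwise upload ciphertext together with the tag
        return True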


I'd prefer for the pages to be saved to the hard drive of the machine running the web browser.

But maybe browser extensions cannot obtain permission to do that?


I see your point. I may release installable native apps that can serve as the local storage backend for the truly paranoid. Time limitations etc: the usual. :)

I anticipate that PurpleRails will be used on multiple computers and over several years. Which is why pages are sync'd to the cloud.

I've adopted the current architecture because I feel that the energy barrier that needs to be overcome to persuade somebody to install an app is much higher than the one to install an extension.


I tried giving purplerails a shot. I use LastPass for password management, and I am not going to type in a 25/30 character password every single time I want to log into an application. I think you're going way too hard on that part. This is the first time I've had this happen to me when using a web app, and it immediately made me close the page.


Thanks for taking the time to write your feedback. I appreciate it.

With PurpleRails, you will rarely type your password once you successfully log in. It works kinda like Gmail/Facebook etc.: you remain logged in for months at a time.

This page explains why password saving doesn't work yet: https://www.purplerails.com/blog/saving-passwords-why-we-hav...

I'll see if I can do something fancier than what I do now to allow saved passwords to work. I suspect that by using a JavaScript-based AJAX authentication system instead of a plain old HTML form, I can achieve the privacy goals as well as ease of use.

I hope you will hang in there and use the current version until I figure out a way to fix this. Thanks!


Thanks for responding. Another 'issue' I've noticed is that it logs me out if I don't have the extension installed and click on the installation button, which, as you might figure, is a pain to deal with when you have a 25 character password. I guess you're saving the web page data from the client side and not through your servers. Another thing I've noticed is that when I add a page, it takes quite some time to show up in the web app. Also, what's with the timer? Apart from these issues, I really believe in your idea, and I have been working on making such a system for myself for months. I wish you the best, and I hope you succeed. I'll update you if I find any new issues; if all the issues are fixed, I can see myself using this as my primary bookmarking service.


Thanks for the report!

Logout if extension not installed is related to the same reasons why saved passwords don't work yet.

The timer is supposed to be a debug feature that I thought was off by default! :) It shows the amount of time you've spent in that tab. I'll turn it off by default in the next update. For now, go into the extension options and uncheck "show page view timer" (near the bottom of the options page).

wrt taking quite some time: the first time you save a page from a site, it's likely to take some time, since things like static assets will also be uploaded. Subsequent saves should be much faster due to dedup'ing which works well on static assets. Let me know if this is not the case.


I see it also saves content from emails and all that. That certainly justifies the need for extensive security. What are your plans for pricing? I suggest you don't take the route that Pocket did; $5 for anything and everything (which seems like the norm these days) is ridiculous.


You can go into the options and disable your email host. It's not the easiest interface :) but the format is hopefully obvious: it's a JSON array; if any of the strings appears as a substring in the page URL, that page is excluded from text-only or full-page saving.

Copy and paste the same thing into "Autoindex exclude rules" and "Autosave exclude rules".
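
For example, hypothetical rules (these strings are made-up examples, not the defaults) and the matching logic look roughly like:

    # Made-up exclude rules plus a sketch of the substring matching described above.
    import json

    rules = json.loads('["mail.google.com", "outlook.", "/webmail/"]')

    def is_excluded(page_url: str) -> bool:
        # a page is excluded if any rule string appears as a substring of its URL
        return any(rule in page_url for rule in rules)

    print(is_excluded("https://mail.google.com/mail/u/0/"))   # True
    print(is_excluded("https://news.ycombinator.com/item"))   # False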

Could you expand on your feedback on pricing? Can't tell if you're saying $5 is too high/too low. If you wish you can also email me (I couldn't find your email on your site).


That wasn't too hard, but obviously a non-programmer would have difficulty understanding it. But hopefully and quite possibly you're working on a better interface, so not much of a problem for the time being.

I've got another piece of feedback. The numbers seem to be off for me in the web app. It says I have over 300 pages bookmarked, while the interface only displays 100 (which seems like the more plausible number).

$5 is kind of a norm these days, and a lot of times it's too high, depending on what the service is. For instance, there's IRCCloud. Obviously, the interface is good, and they provide easy-to-use mobile apps, but all that doesn't justify $60 per year. Another example is Pocket. I really like their apps, but they also followed the $5 norm. They aren't providing me enough value to be worth $5 (every month), so I switched over to Pinboard (which, by the way, is one of those few services doing pricing right these days). There are a lot of services that I can justify paying $5 per month for, cloud storage for instance, but a bookmarking service, nope.

If you want to see the kind of backlash Pocket got for pricing their Premium option at $5, just have a look at this thread on reddit:

http://www.reddit.com/r/Android/comments/26qaif/pocket_intro...


The 300 number is probably correct. The UI shows the most-recent 100 pages by default. I'll be adding 'next' and 'previous' page links soon. I mostly use the search function since I have many pages saved, so it hasn't been a high priority.

You can list all pages you've currently saved by adding '?n=10000' to the list page URL. Let me know if the 300 number is incorrect.

Thx for the feedback on pricing. Will take it into account.


Sounds like Evernote. If I'm mistaken, please enlighten me :)


1) saving is automatic

2) privacy-first architecture (e.g., plaintext is never uploaded, plaintext URLs are never uploaded etc)


ah, so it saves... everything? wow! that's nuts. I mean, where the heck are you going to store all of that?

and it is terribly inefficient to store, say, 30,000 copies of the exact same article. Or do you have a way to check and not store duplicates? If so, what if a blog post is saved today and has 10 comments, and it is saved tomorrow by another person and it has 11 comments?

technically the page is different so it would be saved again

I think you need to explain exactly how it works a bit better or maybe I'm just not getting it

:)


> ah, so it saves... everything? wow! that's nuts. I mean, where the heck are you going to store all of that?

Please see my reply to idlewords. It isn't literally everything; there are some heuristics to detect what was interesting to you.

I understand you might find this excessive. I routinely find that useful. :) See also "As We May Think" by Vannevar Bush.

> and it is terribly inefficient to store say, 30,000 examples of the exact same article ...

You're right: no deduping is or can be done across users (HMACs keyed with the user's password are used to create dedup hashes). Storage is sufficiently cheap that I feel it's an acceptable tradeoff vs. privacy (i.e., the server being unable to confirm that two users have saved the same page).


"The Tyranny of Print" has a nice ring to it, but mediums that give the creator more control over appearance+behavior are going to lend themselves to crafting more compelling experiences.

Sure, not disappearing in 10 years (or whenever the original server goes poof) would also be nice, but it's of little benefit if no one ever sees the thing in the first place.

And disappearing is the default, natural state of things.

If I see some people playing music on a corner and return the next night to see they've left, I may be wistful, but it would be silly to argue "playing live music is broken and we should fix it".

If you think of web sites as performances put on for a limited time by the server, it doesn't seem so terrible that they disappear after a while.


> And disappearing is the default, natural state of things.

Books, clay tablets, scrolls, engraved stone, to which humans owe their entire knowledge of their premodern history, seem to have held up pretty well against entropy. The same is not the case for information disseminated in a controlled manner from privately owned servers.

> If I see some people playing music on a corner and return the next night to see they've left, I may be wistful, but it would be silly to argue "playing live music is broken and we should fix it".

> If you think of web sites as performances put on for a limited time by the server, it doesn't seem so terrible that they disappear after a while.

Thankfully, the generations who produced and preserved knowledge on paper, clay and stone before the onset of digital technology - that is, every generation of humans that has ever lived, except ours - did not think of books and libraries as throwaway pamphlets. And it would take more than an arbitrary interchange of modes of cultural production to argue that we should be doing otherwise in the technological circumstance we find ourselves in.

The "tyrants of the server" are not thinking of server-centric aggregation and dissemination of as a performance put on for a limited time: they are betting on it as the future of all human literary activity. Google doesn't want to read you a paragraph, take your money and say goodbye; it wants to swallow all the world's books and information, chop it to tiny pieces, store and own it forever, and extract the maximum profit from each tiny piece, without having you pay a penny. And it wants you to come back for more. The persistence of the server-centric model of content dissemination is not an accident; it is dictated by the political economy of the web brought about by the Googles of the world.


"Books, clay tablets, scrolls, engraved stone, to which humans owe their entire knowledge of their premodern history, seem to have put up pretty well against entropy."

Only the ones that have survived. For every book or tablet we have, there are certainly tens of THOUSANDS of which every copy ever published has been lost - most of those are ephemera that wouldn't mean much to us anyway, but the lost also include things that would be nice to have: the majority of Livy, any of the original source material for the Gospels, etc.

Even considerably more modern material has been vanishing at a significant rate; for instance, most of the output of the silent film era has already been lost.


You're absolutely right. Maybe it's a fool's errand to try and hold onto the past.

But many people consider those losses to be an immeasurable tragedy.


The mental model of web browsing and bookmarking is: "If I see it, I can get to it again." There's a partial feeling of ownership. "I've read it, so I should be able to refer back to it later."

Nobody (at least, no sane regular person) reads a webpage and thinks "I have a time-limited license from the originator of this content to consume the material and only use it for their expressly condoned purposes."

The vanishing content problem is like if books in your house randomly walked away just because it's the "natural state" of things to disappear.


There should be room for a spectrum of stuff online, from evanescent to permanent. It's one thing for the musicians on that corner to be gone the next night, but you do expect the corner to still be there. Online, it's depressingly common for even large bits of infrastructure (like GeoCities) to just go poof.


> mediums that give the creator more control over appearance+behavior are going to lend themselves to crafting more compelling experiences.

There is a difference between the creator and the server. Most of the content you consume is created by people who don't own the servers. Separating appearance+behavior from content source would help actual creators because they wouldn't have to worry about their host deciding one day to delete all their content because the service is being discontinued or the creator is competing with some business interest of the host.

> If you think of web sites as performances put on for a limited time by the server, it doesn't seem so terrible that they disappear after a while.

The problem is that the web is being used for everything, even things that can and should work like books rather than like live performances.


> If I see some people playing music on a corner and return the next night to see they've left, I may be wistful, but it would be silly to argue "playing live music is broken and we should fix it".

Actually, that's exactly what the inventors of (various) recording machines did. Something might have disappeared in one form, only to return in another. Just ask the Project Gutenberg people.


I will pay good money for a Chrome extension that does the following:

1) I can select (or select all) Chrome bookmarks that I want to keep offline page backups/archives of (saved to Google Drive or Dropbox or some such).

2) Whenever I want, instead of seeing the current online version of that bookmarked page, I can look up the originally bookmarked archived page.

3) It allows me to choose how many levels of links from the bookmarked page to also back up/archive (e.g., every page linked from that page, x links deep, is also automatically archived -- think httrack or wget).

As someone on Hacker News once said to me: my bookmarks are my knowledge graph. As important to me as any library.


Pinboard archiving costs about $25 a year. Not sure it does the deep link archiving.


Coming up with architectures to decentralize servers is the fun part. Convincing people outside of our bubble to use the new system is the very hard part. It has to be able to do something the regular person really wants that the previous system didn't allow. This is why Linux never caught up on the desktop.

Now excuse me while I go curate my socks collection.


This wouldn't work for any web page that has dynamic content stored in a database. If the database no longer exists a decade from now this doesn't solve that problem.

Also, wouldn't this break analytics and reporting for most websites too? It'll be much tougher to track user behavior to improve user experience. And debugging using log data? I get what the author is suggesting but "fixing the web" this way would break more things that large websites and companies rely on.


> It'll be much tougher to track user behavior to improve user experience.

That sounds tragically Orwellian. "We do [BAD THING] to help you! [and it also, just by chance, makes us more money too by exploiting you, but never you mind that]"

> break more things that large websites and companies rely on.

The historical "spy everywhere" privacy model of the web isn't a natural state anybody has a right to exploit. Breaking the current centralized curation model would be a benefit for everyone.


Well, as the Internet once said to the music companies, it's not our fault if our new technology breaks your business model. People would find new ways to solve these problems.


It's fine to be that stubborn if you can win. But against the world of business that needs analytics... I suspect their business model trumps your technology for now.


"Also, wouldn't this break analytics and reporting for most websites too?"

True, but that's a "YP" not an "MP" (Boogie Nights: "Your Problem," not "My Problem").

I mean we're not talking about a public health issue after all.


Link rot is a serious problem: http://www.gwern.net/Archiving%20URLs#link-rot

>In a 2003 experiment, Fetterly et al. discovered that about one link out of every 200 disappeared each week from the Internet. McCown et al. (2005) discovered that half of the URLs cited in D-Lib Magazine articles were no longer accessible 10 years after publication [the irony!], and other studies have shown link rot in academic literature to be even worse (Spinellis, 2003, Lawrence et al., 2001). Nelson and Allen (2002) examined link rot in digital libraries and found that about 3% of the objects were no longer accessible after one year.

>Bruce Schneier remarks that one friend experienced 50% linkrot in one of his pages over less than 9 years (not that the situation was any better in 1998), and that his own blog posts link to news articles that go dead in days; the Internet Archive has estimated the average lifespan of a Web page at 100 days. A Science study looked at articles in prestigious journals; they didn’t use many Internet links, but when they did, 2 years later ~13% were dead. The French company Linterweb studied external links on the French Wikipedia before setting up their cache of French external links, and found - back in 2008 - already 5% were dead. (The English Wikipedia has seen a 2010-2011 spike from a few thousand dead links to ~110,000 out of ~17.5m live links.) The dismal studies just go on and on and on (and on). Even in a highly stable, funded, curated environment, link rot happens anyway. For example, about 11% of Arab Spring-related tweets were gone within a year (even though Twitter is - currently - still around).


My own research (which I hope to publish soon) shows a slightly better link rot rate for bookmarked URLs (which are presumably ones people are most interested in keeping). The attrition rate I see so far is roughly linear and about 5% a year. Which is still shocking by any non-web standard, but a little better than the figures cited above.


That's consistent with most of those studies. The next paragraph does an interesting experiment showing a very conservative rate of 3% per year:

>My specific target date is 2070, 60 years from now. As of 10 March 2011, gwern.net has around 6800 external links (with around 2200 to non-Wikipedia websites). Even at the lowest estimate of 3% annual linkrot, few will survive to 2070. If each link has a 97% chance of surviving each year, then the chance a link will be alive in 2070 is 0.97^(2070−2011) = 0.16 (or to put it another way, an 84% chance any given link will die). The 95% confidence interval for such a binomial distribution says that of the 2200 non-Wikipedia links, ~336-394 will survive to 2070. If we try to predict using a more reasonable estimate of 50% linkrot, then an average of 0 links will survive (0.50^(2070−2011) × 2200 = 1.735×10^−16 × 2200 ≈ 0). It would be a good idea to simply assume that no link will survive.
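
A quick sanity check of that arithmetic (using a normal approximation, so the numbers come out slightly different from the quoted figures):

    # Reproducing the survival arithmetic from the quoted passage.
    import math

    years = 2070 - 2011
    p_alive = 0.97 ** years                    # chance one link survives every year until 2070
    print(round(p_alive, 2))                   # ~0.17

    n = 2200                                   # non-Wikipedia links
    mean = n * p_alive
    sd = math.sqrt(n * p_alive * (1 - p_alive))
    print(round(mean - 1.96 * sd), round(mean + 1.96 * sd))   # roughly 330-399 survivors

    print(0.50 ** years * n)                   # ~4e-15, effectively zero at 50% annual linkrot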


You do research using bookmarks on Pinboard as your dataset? May I ask how this data is used and disclosed to others?



> To run the experiment, I am going to be drawing a few thousand links at random from the entire pool of Pinboard bookmarks. This will include private bookmarks, which make up about half the Pinboard collection.

You chose to include everyone's private bookmarks in your research without asking their consent? What?

> I will publish some aggregate information about what I find, and use it to seek glory, and persuade people to sign up for archiving. But I won't release anything that could lead back to specific users or links.

There is roughly a boatload of evidence that anonymized datasets can be deanonymized in unexpected ways.

Even if you don't release any anonymized datasets, it's really not good that you decided to take such liberties with people's private links in the first place.


Why would I need consent to study the global link rot rate? Publishing it reveals no information about users, either individually or in the aggregate.

I've made an effort to let anyone who wants to opt out of the research do so, because I know people can have strong feelings about privacy.

I agree with you that publishing an 'anonymized' dataset would be a violation of privacy guarantees. I wouldn't even do it for public bookmarks.


"Private" means "private," not "private unless Maciej wants to study them."

I didn't get an email saying "There's a chance I might select your private bookmarks and examine them." A blog post doesn't count when you're messing with people's private data. Certainly it should be opt-in, and not opt-out?

You're doing this for a noble purpose, but for what it's worth, this is the first time my trust in you has ever felt violated.

People have entrusted you with years worth of private data, and you just asked, "Why should I ask permission to study their private links?"

Actually, as far as I can tell, your comment seems to implicitly assume that you already have consent to examine all private links, and that asking consent would only be necessary if you were planning on publishing something that might reveal some of their private links. Isn't that the opposite of privacy?


I think the tension between us is that you think 'private' means 'visible only to me', and I think private means 'never displayed to any other user, or on a public page'.

There are a thousand routine tasks that require me to have unrestricted access to bookmarks and URLs. I try to be as uninvasive as I can about it, but you have no way of verifying that.

If you want something to remain truly private to you—and I say this in full sympathy to your feelings—don't put it on a stranger's computer. Where there's a server, there's an admin.


It seems like "private" could be defined as, "This is my stuff, and if you want access to it, then come ask me. It's okay if you accidentally access it, but if you want to intentionally look through it, ask me first." It seems hard to argue that most people wouldn't feel that way about their own private stuff.

I fully understood the implications of giving you the data. I'm a fan of your work and your writing, and I had full confidence in your stewardship of my data. Essentially, I was totally okay with you being the admin, or anyone you decided to hire, and I trusted you to take reasonable steps not to look through your users' private data unless it was to track down some bug, test some new feature, or some other incidental task that was unrelated to analyzing that private data.

What I didn't expect was that you'd specifically and intentionally create a program whose sole purpose was to analyze private user data and report on the results.

Why didn't I expect that? The only answer is that I should have expected that. I just didn't realize you were that type of developer. It was a bit shocking that someone who has trumpeted the benefits of sticking with businesses that haven't taken VC investment would explicitly break their users' trust like this.

In this case, you have both the legal right and the moral high ground. But intentionally seeking through your own users' private data without getting consent isn't something that can easily be forgotten.


You are on tilt. You lost it, completely, when you wrote "specifically and intentionally create a program whose sole purpose was to analyze private user data", a statement that is not only self-evidently false but deceptive. You've gone from "aggressive good-faith commenter whose points I often do not get" to "alarming and hostile" all in the span of a single comment thread.

Re-evaluate. You can often dig your way out of these stupid message board holes by simply apologizing. It's worked for me repeatedly.


You're right. I apologize. Both to you, for getting heated, and to Maciej, for misrepresenting his actions in this matter. The misrepresentation was accidental, but it happened nonetheless. My comments were also extremely disrespectful and totally uncalled for. I'm truly sorry.


I don't share your outrage. The linked blog post clearly states the scope and purpose, the researcher says only aggregated data will be released and individual user data is unnecessary for the purposes described. He then offers those who are still wary of the research an option to not have their data used. Maybe I have a blind spot, but this seems pretty straightforward and harmless to individual privacy.


Is it okay for any owner of a website to go through their userbase's private data, simply because they own that website?

I don't mean "okay" in a legal sense, but rather a moral sense.

It reminds me quite a lot of http://i.imgur.com/5quY1Iq.png except that Maciej is a researcher and isn't disclosing the data to other people. However, he's still going through people's private stuff. Notifying them that you're planning to go through their private stuff is the most basic common courtesy; it's why landlords can't simply walk into a tenant's house whenever they feel like it, even though they own the property.

Let's put it another way: I didn't know Maciej was the type of person to trawl through people's private information that they trusted him with. If I did, I would've investigated other options for a bookmarking site a couple years ago, or would've written my own, and I wouldn't have breathlessly recommended Pinboard to whoever would listen. The recommendation would be more like "Pinboard is great, but the owner likes to look at your stuff, even if it's marked 'private,' so keep that in mind."


I like the image of myself sitting at the computer with a box of bon-bons, lazily hitting 'next' on the special admin page that selects only the juiciest private bookmarks for my delectation.

The reality is less fun. I have to look at (potentially) private bookmarks when:

- someone's import file fails to parse, or has an encoding problem

- there are garbled or missing results for a search query

- I need to answer questions like 'how much disk space does a typical bookmark use', so I can provision what I need

- there's a bug in the fulltext parser

- the twitter API client misses some tweets or mutilates a URL

- the pinboard API is misbehaving in one of a thousand ways

- I need to verify that backups I make actually contain everything they're supposed to

- I want to find and fix privacy bugs!

Along with a thousand other scenarios that will be familiar to anyone who has ever had the misfortune to import, format and store user-provided data.

Anywhere bookmarks come into or leave the site, or are displayed on the site, there will be bugs. If I tried to enforce some kind of viewing restrictions on myself, it would just introduce an additional layer of bugs while making my job completely intractable.

I'm not a landlord walking into a tenant's house without permission. I'm a hotel manager, doing my best to be discreet, but ultimately requiring full access to everything in order to do my job. I'm going in to check the sprinklers and fire alarm even if you've left the 'do not disturb' sign on.

This will be the case on any outside site you use, even one that makes sweeter promises to you than I ever did. Please think twice, and then three times, before uploading your data anywhere if you have these kinds of expectations.


I'm scratching my head wondering if I'm being unclear. Let me try again:

You created a program whose sole purpose was to analyze private user data and report on the results. (In fact, not merely "private user data," but "data which users explicitly marked as 'private.'")

What you did was equivalent to a hotel manager sending employees to peep into 1,000 random rooms and compile a detailed report of what those rooms contained and what their occupants were doing, and then claiming it was for the betterment of all hotels. Yes, that may be true, and the data may be quite helpful, but people still expected their rooms to be private.

Intentions matter. You weren't accessing the private bookmarks in order to fix a bug or test a new feature.


If you want to run with this metaphor, I sent employees to look into random rooms and tell me what proportion of occupants were dead.


Not quite. The bookmark is someone's "stuff." The fact that they were interested in a certain site is something they trusted you with.

It looks like you and I won't see eye to eye on this, and it also looks like I'm apparently the only person in the world to be naive enough to trust that you'd refrain from going through other people's private belongings that they entrusted you with.


You seem to have a weird understanding of web services.

All of Pinboard's servers belong to Pinboard. All of Pinboard's hard drives belong to Pinboard. All of the bits on those hard drives belong to Pinboard.

Your private data on Pinboard's server belongs to Pinboard. This is how web services work. If the web service couldn't access its own data, it wouldn't be able to operate.

You can only assume that some bits are yours (to use your terms: your private belongings) when they live on your own server and your programs are the only ones with access to them.

Pinboard is not a hosting company (a la DigitalOcean); it's a web service. It's not a hotel, and it's not a bank giving you a safe.


> If the web service couldn't access its own data, it wouldn't be able to operate.

He intentionally created a program to trawl through data which was explicitly marked by users as private. He didn't do that for operational reasons.


No, once again, what you just wrote is a lie. You can be cavalier with logic, consistency, tone, thoughtfulness, thoroughness, and even basic correctness. You cannot be cavalier with immediately evident facts. No amount of squirreling back to how Pinboard actually touches private links in this experiment will ever make what you wrote here truthful.

It is much better to lose an argument, even one you're right about, than to be a liar.


This is the most hurtful comment I've ever received on HN, and it's the only one that's ever made me want to leave the site altogether.

You didn't even explain how, precisely, I'm a liar.

You know what a liar is? It's someone who deliberately goes out of their way to distort the truth in order to gain some kind of advantage, even when they know they're not being truthful.

Do you think I'm sitting here trying to win this debate merely because I have a problem with being wrong? If that's the case, then I had no idea you thought so little of me. This isn't even about me.

Fact: When a user submits a bookmark to Pinboard and flags the bookmark as 'private,' they have a reasonable expectation of privacy. That includes privacy from the admin: it's acceptable for them to use the data for debugging or to test new features, but not to write a script whose purpose is to delve through data marked as private.

Fact: The owner deliberately wrote a script whose purpose was to go through private data. You can read the original blog post on the topic. It doesn't matter that it was an experiment for the betterment of all of the web. The fact is, he didn't seek consent, and users had no idea that the data they had flagged as 'private' was being examined by his script.

Now, you can call me a charlatan, a dilettante, or whatever other hurtful thing you wish to call me, but I have no idea how this case warrants me being called a liar. Normally, I'd let this drop, but you have publicly targeted my character and reputation. I'd like an explanation, please.


Yes. I'm saying you've gone out of the way to distort the truth. Exhibit A: your second "fact": not a fact. What the site is actually doing is right there in the link the guy who runs the site showed you.

You have options available to you besides seething about this and leaving the site.

Curiosity: what's your Pinboard account name? Mine's the same as my HN name.


From the blogpost: "The data we will examine will include private bookmarks, which make up about half of all Pinboard bookmarks."

It's not a direct quote because I'm on mobile; I had to get out of the house and go for a walk after one of my heroes called me a liar. But it's factually accurate. What part isn't?

I'd like an answer to this question: Would you hire a liar? Someone you believe would go out of their way to be deliberately untruthful? Rather than even try to figure out if there was some sort of misunderstanding, you went straight for calling me a liar. You, of all people.

I am not seething. I'm quite hurt.

Also, the fact that half of all Pinboard links are marked 'private' should give some indication that people commonly use Pinboard as a repository for links they don't want to associate with themselves publicly. That's what we're paying for.

EDIT: Here's the full quote, from https://blog.pinboard.in/2014/08/researching_link_rot/

> To run the experiment, I am going to be drawing a few thousand links at random from the entire pool of Pinboard bookmarks. This will include private bookmarks, which make up about half the Pinboard collection.


And here is what you wrote:

> He intentionally created a program to trawl through data which was explicitly marked by users as private.

And here's what you wrote earlier:

> you'd specifically and intentionally create a program whose sole purpose was to analyze private user data and report on the results.

He did no such thing.

Perhaps, instead of directing all your energy into maximizing the feels you generate from being called on something, you could instead introspect and re-evaluate and consider that maybe you said something very wrong. People do that all the time. They do not ritually kill their accounts when that happens. The older, wiser ones are likely to just acknowledge it and apologize. Some of the somewhat younger, dumber ones, like me, have done that too.

We are both crudding the thread up now, so I'm going to stop posting about this.


I could have worded this more carefully. I apologize.


The fact that he thinks a federated wiki would be "simple" or "easy" leads me to believe he has not actually thought through the details of how it would work in practice.


Yes. Two problems immediately come to mind:

1. Copyright law.

2. Dynamic content.


It would not be simple or easy, but crypto-currency blockchains make it more possible than ever.


What does this mean? I don't understand the relationship between the two ideas.


In the most rudimentary form, this is how:

http://www.righto.com/2014/02/ascii-bernanke-wikileaks-photo...

More reasonable ways of utilizing a blockchain for undeletable data are emerging, though.
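The linked post describes how arbitrary bytes were smuggled into the Bitcoin blockchain by encoding them as the 20-byte hash portion of fake pay-to-pubkey-hash addresses: coins sent to those outputs are unspendable, but the data is permanently embedded in the chain. Here is a rough, illustrative Python sketch of just that encoding step (my own reconstruction, not the actual tool the post describes; transaction construction is omitted, and later approaches tend to use OP_RETURN outputs instead):

    import hashlib

    # Bitcoin's base58 alphabet (no 0, O, I, or l).
    B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

    def base58check(payload: bytes) -> str:
        """Base58check: payload followed by a 4-byte double-SHA256 checksum."""
        checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
        data = payload + checksum
        n = int.from_bytes(data, "big")
        encoded = ""
        while n > 0:
            n, rem = divmod(n, 58)
            encoded = B58_ALPHABET[rem] + encoded
        # Each leading zero byte becomes a leading '1' character.
        pad = len(data) - len(data.lstrip(b"\x00"))
        return "1" * pad + encoded

    def data_to_fake_addresses(data: bytes) -> list:
        """Split data into 20-byte chunks and dress each up as a mainnet P2PKH address."""
        addresses = []
        for i in range(0, len(data), 20):
            chunk = data[i:i + 20].ljust(20, b"\x00")       # pad the final chunk
            addresses.append(base58check(b"\x00" + chunk))  # 0x00 = mainnet P2PKH version byte
        return addresses

    if __name__ == "__main__":
        for addr in data_to_fake_addresses(b"Hello, permanent record."):
            print(addr)

Anything actually sent to such addresses is unspendable, since no private key corresponds to them; that's the cost of this style of permanent embedding.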


I find Bret Victor's comparison between the internet and the LOC a little weird. I've always thought of the internet as a publishing/sharing medium, not an archive.

There are plenty of books that go out of print within ten years; we just happen to have infrastructure beyond publishers (libraries) that preserves published copies.


I think it's significant that the Library of Congress has funding, an official mandate, employees, a clear legal status, and stores complete copies of the works in its catalog. A similar model would work great for the Internet (and archive.org is doing its best to fill the role).


> I find Bret Victor's comparison between the internet and the LOC a little weird. I've always thought of the internet as a publishing/sharing medium, not an archive.

The Library of Congress is archiving all tweets from the US[0], which I think is what he is referring to.

[0] http://www.businessinsider.com/library-of-congress-is-archiv...


This would not work with dynamic content.

We already have systems like this (bittorrent, freenet, etc.), and almost no one sees them as a viable replacement for the web because they can't do 99.9% of the things we want (social networks, forums, email, etc.)


> There’s actually a pretty simple alternative to the current web. In federated wiki, when you find a page you like, you curate it to your own server (which may even be running on your laptop). That forms part of a named-content system, and if later that page disappears at the source, the system can find dozens of curated copies across the web.

This is a simple and very bad idea. If it were the norm, instead of one or no copies of a particular work online, you would have any number of "curated" copies of uncertain vintage, downloaded at different times in the lifetime of an original whose content might well have changed as time passed. You would have curations of curations, and curations of those, ad infinitum.

Pages that depended on remote Web content (increasingly common) and/or that linked to online references would gradually become unreadable or incomprehensible as their links vanished into other offline "curations".

Not to mention the copyright issues. And I'm not crazy about the term "curation" either -- it's obviously meant to try to elevate the practice of downloading anything we please, without regard to copyright.


I'd rather a page go offline than have it taken out of context. As if plagiarism wasn't bad enough already. (Yahoo Answers cough)

These crooks will even steal your copyright notice. It's quite possible the original content producers are offline because scraper thieves stole so much content that it's no longer possible to earn a living.

As an artist, this reminds me of the condescending attitude that gave us fake Rolexes, Facebook & North Korea's 28 state-approved haircuts. Either it's "just content" to stuff in a database somewhere or you understand the medium is the message too.


As someone with a teensy bit of film background, I have to disagree. The number of early Hollywood films that were lost is astounding. This is a massive part of our visual history that is completely gone. It will never be restored.

With the current environment on the internet, with DRM'd video, music and text, I have to assume that we will lose far more from this time period than we ever had before.

While I don't pirate things (I'd rather just consume Creative Commons and Public Domain content), I wholeheartedly support people who are trying to archive the things that are part of our collective culture. When I have kids, I'd like to be able to show them where they came from.


I'm not against reproduction if care is taken to preserve what can be preserved, as close to the original as possible while giving credit, compensating creators, etc. In your example, I think reproduction/conservation technology was available but the studios couldn't justify spending the time/money it would cost to preserve their entire library. Who would have paid for that? I don't know. Supposedly, half of Van Gogh's entire lifetime output was burned because he was too poor to find a place to store it. At the same time, I'd rather see a bug-eaten Van Gogh with fugitive reds long faded away, than a flat, lifeless high-def copy. I suppose it's a complex subject and each work is unique. Sometimes I'm thrilled when I can find an old page in "Wayback Machine" but what they manage to save is typically broken and low quality.


I'm having a little trouble understanding your perspective. Copying films is good, letting bugs eat Van Gogh is good, but copying websites is bad?

Well, luckily, with digital technology, we can copy things flawlessly at very little cost. Unfortunately, most content creators are still stuck trying to adapt physical distribution models to the information age, which is why we're stuck with DRM. You can't say "The Medium is the Message" and then get angry because you're producing content for a medium that is infinitely copyable.

We need to move to a model of perceiving data as holographic. Especially with the advent of blockchain technology, we're increasingly moving to a model where every node in a network contains the entire network. Trying to adapt 20th century Disney copyright to that paradigm is stupid.


> I'd rather see a bug-eaten Van Gogh with fugitive reds long faded away, than a flat, lifeless high-def copy

The beauty of perfect copies is that you can create your own faded version if you want, and other people can have their own perfect or artificially aged version too. You can't do it the other way.


The author is not suggesting taking others' content and calling it their own -- he is suggesting something closer to keeping a backup with a bibliography. In fact, this already exists today -- do you hate archive.org?


I like the idea of the federated wiki, but search engines rank copies of pages poorly, so it is not clear how visible copies would be after the original content disappears.

I used Evernote for years, but recently canceled the service because I spent too much time curating compared to reading old material.

One option that I am considering is archiving really good web content as web archive files and saving them locally in folders indicating the year of capture. Local file search would quickly find old stuff and if I stored the yearly web archive folders in Dropbox, I would have them available on different systems.
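A minimal Python sketch of that workflow, assuming a made-up ~/WebArchive/<year>/ layout (and grabbing only the raw HTML, not page requisites like images or CSS, so it's cruder than a proper web archive file):

    import re
    import urllib.request
    from datetime import date
    from pathlib import Path

    # Root of the local archive; point this at a Dropbox folder to sync it across systems.
    ARCHIVE_ROOT = Path.home() / "WebArchive"

    def archive_page(url: str) -> Path:
        """Fetch a URL and store the HTML under a folder named for the capture year."""
        year_dir = ARCHIVE_ROOT / str(date.today().year)
        year_dir.mkdir(parents=True, exist_ok=True)

        with urllib.request.urlopen(url) as response:
            html = response.read()

        # Build a filesystem-safe filename from the URL plus the capture date.
        slug = re.sub(r"[^A-Za-z0-9]+", "-", url).strip("-")[:80]
        target = year_dir / "{}-{}.html".format(date.today().isoformat(), slug)
        target.write_bytes(html)
        return target

    if __name__ == "__main__":
        print(archive_page("https://example.com/"))

Because each capture becomes a dated file in a year folder, ordinary local file search is enough to find old material later.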


Hosting your own server might not be a scalable solution either. There's a reason why SaaS is popular: it's not that easy.

On the downside: stuff hosted by others might go away. Web pages, web services, apps requiring server side support...

Investing a lot in a service makes it more painful to lose, like the apparently discontinued Amazon Cloud Drive (supposed to be a cheaper Dropbox): https://news.ycombinator.com/item?id=8219257


Named data networking of some kind is likely to become popular at some point. This is that kind of idea, but it doesn't look like a really general protocol, since he mentions a specific wiki.

I wonder if there are browser extensions that do p2p caching/distribution of content. Then you could standardize a protocol used for that type of communication.

I believe there are many efforts along these lines. The trick is as usual getting everyone on the same page or at least working together more.


I'd love it if browsers natively supported URIs derived from cryptographic hashes of content, resolving them by lookup in a distributed store a la BitTorrent. Imagine if Chrome supported such a thing, for example. Perfectly reliable cacheability (or archivability), P2P hosting, ... all the good stuff for any web content that its creator chooses to expose this way, albeit at the price of immutability.
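The verification half of that idea is tiny; here's a sketch where the name is just the SHA-256 of the bytes (the sha256: scheme is an illustrative convention of mine, not something any browser supports today):

    import hashlib

    def content_uri(data: bytes) -> str:
        """Name a blob by the SHA-256 hash of its bytes."""
        return "sha256:" + hashlib.sha256(data).hexdigest()

    def verify(uri: str, data: bytes) -> bool:
        """Check that bytes fetched from an untrusted peer really match the name."""
        return content_uri(data) == uri

    if __name__ == "__main__":
        page = b"<html><body>immutable content</body></html>"
        name = content_uri(page)
        print(name)
        assert verify(name, page)              # any intact copy validates
        assert not verify(name, page + b"!")   # a tampered copy does not

The immutability trade-off mentioned above falls straight out of this: change one byte and you have, by definition, a different name.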


Especially under the current copyright regime, finding some solution that preserves the intent of the creator to publish in a fixed format would be a great component of a distributed publishing system. I don't think this proposal has as good a fair-use defense as the Internet Archive's Wayback Machine does.

In the world today, we often think of publishing online as providing access to something under our control. I think a technology that aims to solve these problems should embody a different spirit, one closer to "making public". The word "mine" doesn't need to imply ownership in the sense of exclusive control. I mean, "my children" is at least as meaningful a relationship as "my property". Some kind of copyright-licensing ability built into a distributed document publishing system would be nice.


I think the federated wiki is a neat idea, but in its current incarnation, I find it exceedingly unlikely that a page I'm looking at there will still be around in 10 years.

Even if I'm making a copy of every page I see, I'm not sure I'll still run a federated wiki on my server in 10 years.

I don't think this is a real solution to the problem posed by Bret Victor.


It's a great idea for a real problem that needs to be solved. Still, for dynamic pages, what would be the desired behaviour? Updating them whenever possible, which could mean the specific info we wanted to save disappears or changes? Leaving them outdated? I really can't answer that.


Museums, libraries and cultural heritage institutions are working hard on digital preservation. I think there's a lot to learn from them. Check this out, for instance: http://www.lockss.org/about/how-it-works/


Right to be forgotten?


>Right to be forgotten?

A ridiculous concept of a "right". I do not recognize that anyone has a right to force other people to forget things.


I forgot about that. AS REQUIRED BY EUROPEAN LAW.


Y U no archive.org?


Archive.org doesn't have everything and their archives can be cleared of specific content on request of the copyright holder.

There is a big copyright issue here, as in the UK we don't have the relatively liberal Fair Use exceptions that are in the USC -- we only just got permission to format-shift and make personal-use backups (so MP3 players [using tracks ripped from CD] are legal as of June this year!). Copying down a website, beyond caching, is generally speaking copyright infringement for those in the UK.


> their archives can be cleared of specific content on request of the copyright holder.

Not only that. If a domain expires and is picked up by a squatter, the squatter can instruct Archive.org to delete ALL copies of content archived from that domain. Unfortunately many do so.


We changed the title to the first sentence of the article because it is less linkbaity.



