One problem we need to solve as coders is giving people better tools for saving stuff. It's really hard right now to save a webpage (or worse, a series of connected pages) with any confidence that you've captured everything you need to see it again if the original server disappears.
A project that I think has struck a really good balance between permanence and retaining authors' control over their writing is the Archive of Our Own (AO3). A bunch of fanfic authors got tired of sites falling out from under them, and decided to implement their own system, along with sensible governance and a way to fund its ongoing operations. The only broken links I've ever seen to AO3 are ones where the author consciously decided to take the material offline.
1) Never in the history of civilization has local knowledge storage (disk) and local compute been so cheap
2) Never have we had a larger free software ecosystem or more hackers to deploy free software locally
3) Never have we had more evidence about the differences in civilizational freedom between local and central storage/compute
What if backup/restore skills were taught alongside home economics? Is a home a place of both shelter & storage? Are we abandoning a set of possible futures because we want the "convenience" of "someone else" backing up our digital selves/souls?
"For want of a nail the shoe was lost.
For want of a shoe the horse was lost.
For want of a horse the rider was lost.
For want of a rider the battle was lost.
For want of a battle the kingdom was lost.
And all for the want of a horseshoe nail."
I understand why you didn't post a "self promoting" link, but I want others to know about this option:
Currently $10 for a lifetime bookmarking account, plus $25/year to archive every bookmarked page:
"Pinboard offers a bookmark archiving service for an annual fee of $25. The site will crawl and store a copy of every bookmark in your account, and display a special icon you can click to see the cached copy. If the page you bookmarked goes offline, you'll still be able to see the archived copy indefinitely."
We only need to look at early film history to know how easy it is to lose massive parts of our history.
Going back to old pages, I frequently get 404 results. For politically sensitive documents, the problem is much more widespread.
I would like something that not only archives pages I visit, but also versions them and tracks changes. If there was a bookmarking tool that did this, you could easily have an opt-in feature that shared content. This type of system would be a huge boost to something like the wayback machine.
HARDWARE: differs depending on whether you want local search/analytics or just network storage.
For mobile use, either a VPN back to your personal home/cloud server, or a hackable wifi hard drive proxy, e.g. Seagate Wireless Plus + HackGFS.
For non-analytics home use, hackable router with USB3 storage and Linux software RAID, connected to a USB3 drive chassis with room for 2-4 disks.
For analytics home use, a microserver like HP N54L, Dell T20 or Lenovo TS140. Up to Xeon processor with ECC memory, plus 4-6 internal disks and up to 32GB RAM. Sold without a Windows tax, supports hardware virtualization and Linux. Possibly FreeBSD with ZFS.
SOFTWARE: generalized multi-tier cache AND compute. Camlistore and git-annex are tackling multi-device storage sync. For archives, we need a search interface that will query a series of caches, e.g. mobile > home > trusted friends private VPN (tinc overlay) > public paid cloud archive (pinboard et al) > public free cloud archive (archive.org).
It's important for usability to have a simple, local UX that will take a search string, propagate across all private/public federated tiers of storage and compute, then aggregate the metasearch results on the client.
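To make the tiered idea concrete, here's a minimal sketch of that metasearch flow: query each cache tier in order of trust/latency and merge results on the client, deduplicating by URL. The tier names and result shapes are invented for illustration, not any real API.

```python
# Query cache tiers in priority order (mobile > home > friends > public)
# and aggregate results client-side, keeping the first (most trusted)
# hit for each URL.

def search_tiers(query, tiers):
    seen, merged = set(), []
    for tier_name, search_fn in tiers:
        for url, snippet in search_fn(query):
            if url not in seen:
                seen.add(url)
                merged.append({"tier": tier_name, "url": url, "snippet": snippet})
    return merged

# Stub tiers standing in for a local archive, a friends' VPN, and a
# public archive; each returns (url, snippet) pairs.
local = lambda q: [("http://a.example/1", "local hit")]
friends = lambda q: [("http://a.example/1", "duplicate"), ("http://b.example/2", "friend hit")]
public = lambda q: [("http://c.example/3", "archive.org hit")]

results = search_tiers("warc", [("local", local), ("friends", friends), ("public", public)])
```

The key design choice is that escalation stops being automatic: the client decides how far down the tier list a query propagates.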
With this approach, we can collectively pool resources to improve on CommonCrawl.org, without locking up the 300TB index at AWS. This would turn web search engines into a secondary source, rather than a primary source. First search your archive + trusted friends, then trusted verticals (e.g. HN, StackOverflow), then a generic web search.
Let's be clear: the goal is not to archive "everything" in the world, only that which is personally important to the viewer. This attention metadata has long-term value. With this architecture, it is always optional to escalate a query to a public archive or search engine. Most importantly, there is technical autonomy and low-latency compute for local queries.
For web pages, wget with WARC output (per HN advice in another thread) and wkhtmltopdf (available as a Firefox plugin to print to PDF) will keep local archives. Recoll.org (xapian front-end with user-customizable python filters) on Linux will search full text and provide preview snippets, or lucene/solr can be adapted.
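A sketch of the wget part of that pipeline, wrapped in Python. The flags shown (`--warc-file`, `--page-requisites`, `--convert-links`) are standard GNU wget options; the helper function and prefix names are my own.

```python
import subprocess

def warc_command(url, warc_prefix):
    """Build a wget invocation that saves a page plus its requisites
    and also writes a WARC archive alongside the fetched files."""
    return [
        "wget",
        "--page-requisites",           # also fetch images/CSS/JS the page needs
        "--convert-links",             # rewrite links for local offline viewing
        "--warc-file=" + warc_prefix,  # write warc_prefix.warc.gz
        url,
    ]

cmd = warc_command("https://example.com/article", "article-archive")
# subprocess.run(cmd, check=True)  # uncomment to actually crawl
```

The resulting `.warc.gz` is what a Recoll or solr indexing pass would then pick up for full-text search.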
Solves my problem, then some. Also provides an alternative to search engines as the goto for internet browsing. Brilliant stuff. I can't tell you how grateful I am for your work on this.
The only thing left is FLOSS version control for sound and video editing, and an effective "publish to BitTorrent" feature, and then we can pretty much put this "Web 2.0" crap to bed.
EDIT: Out of curiosity, is there any reason that analytics couldn't be done with a dedicated PC, or do you think it requires server hardware to run effectively?
As someone said in another comment, the challenge with these solutions is making them usable to a mass audience who won't know or care about Linux. Android & OS X both created user experiences that hid the underlying Unix OS. If someone can only afford a single computer, then virtualization allows that device to play the role of both "server" and "desktop".
With local s/w RAID, the system can be designed so that backup consists of (1) shutting down the computer, (2) removing one hard drive and replacing it with another, (3) taking the (encrypted) hard drive to an offsite location, e.g. a trusted friend/family member.
> The only thing left is FLOSS version control for sound and video editing
Could you expand more on this use case? Do you mean archiving binary blobs and storing their metadata in git? git-annex does this and camlistore can be adapted for this purpose. Is this only for local production workflow, or is there a need for remote collaboration on pre-final artifacts?
> an effective "publish to BitTorrent" feature
Could you spec out what's needed? Are there trackers which specialize in public domain content? Strong DRM is coming to browsers and new hardware. While it will create unexpected user experiences and change the web as we know it today, it could increase demand for public domain content. We are going to need distribution channels for public domain content which treat copyrighted content like viruses, i.e. reject before they can enter the channel.
The same fingerprinting techniques used for copyrighted content can only be used to improve discovery of public domain content. This means central directories (which can be cached locally) of metadata and hash/fingerprints for public domain content (text, audio, video, raw data).
Another feature: support for software agents that take action based on computed result of (remote event + private data). A FLOSS version of IFTTT, Yahoo Pipes, e.g. https://github.com/cantino/huginn . A remote event could be a private signal from a mobile device app. This would increase flexibility since the user has a larger private dataset about themself than any web/cloud service. The user can configure FLOSS algorithms, instead of relying on remote black boxes without appeal or transparent governance.
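A minimal sketch of that agent idea: an action fires only when a remote event satisfies a user-configured condition evaluated against private local data. The event/rule shapes here are invented for illustration; huginn's actual agent model is richer.

```python
# IFTTT-style rule engine: each agent is (name, condition, action).
# The condition sees both the remote event and the user's private data,
# so decisions can use information no cloud service holds.

def run_agents(event, agents, private_data):
    fired = []
    for name, condition, action in agents:
        if condition(event, private_data):
            fired.append((name, action(event, private_data)))
    return fired

# Example: act on a price-drop event only if it fits the user's
# private budget, which never leaves the local machine.
agents = [
    ("price-watch",
     lambda ev, priv: ev["type"] == "price" and ev["price"] <= priv["budget"],
     lambda ev, priv: "buy %s at %d" % (ev["item"], ev["price"])),
]
result = run_agents({"type": "price", "item": "ssd", "price": 80},
                    agents, {"budget": 100})
```

The point of the architecture is the asymmetry: the remote side only ever sees the event, never the private data the decision was based on.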
I think a public domain or creative commons music community could develop a thriving remix community in unique ways. Remixes can be recontextualized as forks of music. I would imagine that fingerprinting could be used really effectively here, which will require some thought. As a recording artist, it would be really useful for me to be able to delete takes but retain them in version history. The music industry is filled with stories of the person who owns the recording studio retaining the masters to a session, and then refusing to cooperate with the artist. Opening that data up would be a massive boon not only to musicians but I think to recording studios as well, as the finer aspects of a recording session become much easier to access.
There's a deeper problem, though. There's a dichotomy between the binary blobs that DAWs use and the user-readable stem files that artists like Radiohead release. It's as though all the information about the studio session and all the non-temporal aspects of editing are obfuscated by assembly code. Ideally, it would be nice to replace the binary blobs that programs like Audacity and Ardour save to with human-readable stems that encode sound or video editing in metadata. Ultimately I think this is a critical UI requirement, but for now I think the best solution is to use git-annex to store .flacs and .oggs, as well as both individual exported stem files and binary project blobs. In the final analysis, different DAWs are really more like different instruments, and cross compatibility between instruments is a requirement.
Uploading files to a server is pretty easy, but what is a little more challenging is automatically licensing the content, uploading to your server, creating a torrent that uses the server as a webseed, and then publishing it to a tracker. This publishing flow would let me directly publish content even on my puny little shared web hosting account. Traffic load is automatically distributed through bittorrent, and censorship becomes much more difficult (albeit not impossible). From there, I think, it would be fairly trivial to build a front-end to replace an interface like YouTube's "Submit Video" page.
Treating copyrighted content like viruses is a feature I would really like to see more of. While I'm a firm supporter of PopcornTime/Time4Popcorn/AsPopcornGoesBy, I don't really want to use it. I'd rather find public domain things to watch than Hollywood Movies or TV shows. I wish PopcornTime had a public domain feature.
If there are trackers that only use public domain or Creative Commons content, nobody has invited me to them. ;) I think it wouldn't be too hard to make a tracker that scanned for a machine readable public domain or CC license, but I don't have much experience with torrent trackers. I'd like to try and set one up soon, I think initially you could just moderate content. I'm kind of baffled by the direction Bittorrent, Inc. has taken, because they could just as easily be advocating for number of seeds as a surveillance-free fanbase metric, rather than just releasing DRM'd music through Bittorrent.
Huginn looks really, really interesting. I sure appreciate your posts. Thanks very much.
> I would like something that not only archives pages I visit, but also versions them and tracks changes.
That's a huge storage requirement, you must realize this. If you're an avid Web browser, and if every archived page had to look as it originally looked (i.e. all the linked resources) you could accumulate several terabytes per week.
> This type of system would be a huge boost to something like the wayback machine.
Here's the road to madness. Someone, aware of the rapidly declining cost of storage, rebuilds the Wayback machine based on your scheme, with the intent of archiving every Web page in existence, including all required resources so the pages look just as they originally did. Then, as the project approaches completion, this genius says, "For the next phase, I need to archive the Wayback machine itself." At that point, as the implications of what he's said occur to him, a strange look crosses his face and his imagination begins writing checks his intellect can't cash.
My practical experience is that you need about 1.5 MB per URL for storing large numbers of web pages, if you exclude video.
I don't have Flash Player installed. Instead, I use youtube-dl for flash videos across many websites. I'm well aware of the storage implications of this kind of activity.
Turns out it's pretty much impossible. The Youtube API only delivers about 20 results, which is a bug that has existed for about 2 years. I tried manually loading the watch history page, and was only able to get about 1000 out of ~8000 results. When I selected those 1000 results and tried to add them to a playlist, the interface crashed.
Does anyone have a solution for this, or is my watch history just at the mercy of Google?
$ youtube-dl ':ythistory' -u USER -p PWD --write-pages
[youtube:history] playlist Youtube Watch History: Collected 0 video ids (downloading 0 of them)
Your watch history should be stored by your browser on your machine. It's your business alone, and only locally is it fully in your control.
But if the motherboard or some other vital component had died, I would've been locked out of my data. It's at times like this I'm glad that Google Keep syncs my notes to their servers, and that Viber stores my contacts' phone numbers remotely. If it hadn't been for them and my phone, say, fell in water and completely broke, I'd have lost all of that. Same argument for a hard drive death. In 2004 or so, we were expected to live with that risk. But I don't think we should have to now. Witnessing the explosion of the "cloud" buzzword everywhere, you'd think everyone in the world has convenient access to their own private space "in the cloud" (or, as those soooo-2000s people would say, "on the Internet"). And we do have access to such space, but for the most part, not quite so for the "private" thing.
Of course, ideally, I'd like to generate random keys, keep safe local backups of them and then sync encrypted data to remote servers. But I don't think the companies are eager to accept that as they probably owe a lot of their statistics and targeted advertising opportunities to big-scale mining of that plaintext data we provide them in exchange for convenience.
My watch history is currently in the hands of Google.
I'm trying to copy it so that I have it saved locally.
That may not be a bug. It might be a way to limit the traffic created by scrapers who submit requests and then download all the videos listed on the result page.
I'm kind of still a beginner when it comes to APIs, but it seems disingenuous to offer an API feature and then intentionally break it so that it can't be used.
Because people can still get 20 results. I mean, 20 results is way better than no results.
> I'm kind of still a beginner when it comes to APIs, but it seems disingenuous to offer an API feature and then intentionally break it so that it can't be used.
It's not broken. It returns 20 results. Maybe Google decided that was enough of a hit on their database. One could also argue that most people wouldn't want more than 20 results on their small-screen Android device served by a slow connection.
The number of results returned is random. Some people report seeing videos watched on computers but not locally. One person says it works.
That's not a limit. That's broken.
I recently converted all my bookmarks to saved copies of the pages in Evernote. This means I have full text search over everything I "bookmark" -- still have the links -- and never lose it regardless of what happens.
In this process, I was shocked at the number of 404s I encountered from relatively recent (last 18 months) bookmarks.
Evernote is a bit of a wreck, but so far the combination of decent copy of page, browser plugins, tags, search, accessibility (phone clients, web client, some native clients) have made it my best option.
This is also somewhat of a problem with printed books. There are tons of books that are out of print and cannot be easily found.
You're also proposing mass copyright infringement which - stupidly in this example - is not legal.
The WSJ point about restricting the number of free article views, made in another reply, similarly has nothing to do with copyright. This is more of a contract issue, and anyway, if I can view only N articles for free, I can only save those N, so it is really moot.
These tools would work fine for me, without risk of copyright infringement. Just because you don't use libre content doesn't mean that the rest of us should have a broken internet.
You can't make an omelette without breaking a few eggs and you can't fix the world without breaking a few laws.
Heh. The NSA should put this on a t-shirt and sell it.
I desperately want to see you present this as your defense in a court of law.
What a weird dig. It's neither expected, nor established that he can't see a solution. I'm not as smart as Andreessen and I could come up with half a dozen solutions.
Author's favourite is fine but far, far from obvious. How viable is it to run your own federated wiki anyway? Are there packages for popular systems? Are there plugins for major browsers? Is there any federation actually happening? I skimmed the resources and don't know. Does anyone here run one? That would be a solution, this seems more like an idea.
And it's not like no one's doing anything. There are services like Pocket or Readability to store an article until you want to read it, Evernotes, Google Keeps... Our very own 'idlewords will archive the contents of your bookmarks for a fee. Finally, there's archive.org.
The main idea is to use a browser extension to automatically save pages that you read to the cloud (including the images, stylesheets etc) in the background. Saved pages are searchable and sharable.
Time limitations are what is preventing me from doing this.
Thanks for your feedback! Hope you will use Purplerails. :)
There are a couple of heuristics to avoid wastefully uploading pages: the full page is uploaded only if the reader expresses "sufficient interest" in the page. Currently the heuristic is 90 seconds of continuous reading of a page, or scrolling to the bottom. If a page is read for a minimum of 10 continuous seconds then only the text of the page is uploaded.
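The heuristic described above reduces to a small decision function. This is a sketch of the logic as stated in the comment (90 seconds or scroll-to-bottom for a full save, 10 seconds for text-only), not Purple Rails' actual code:

```python
# "Sufficient interest" heuristic: decide how much of a page to upload
# based on how long the reader engaged with it.

def upload_level(read_seconds, scrolled_to_bottom):
    if read_seconds >= 90 or scrolled_to_bottom:
        return "full"   # page + images, stylesheets, etc.
    if read_seconds >= 10:
        return "text"   # extracted text only
    return "none"       # not interesting enough to save

```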
Static assets like JS files benefit from deduping: they take time to upload the first time, but subsequently processing them is much faster.
Typically, people read multiple pages in a browsing session: I rapidly open many tabs and then read each one for multiple seconds. There's a debug mode in Purple Rails in which a timer counts up when I switch to a tab. I find that I typically spend 100+ (usually much more) seconds on a page that I read through to the end. This is usually enough time on a residential broadband connection (I have Sonic DSL) to finish uploading a page. I also use Purple Rails on a tethered 4G connection almost everyday: uploading is slower than DSL but it works.
Basically, by the time you finish reading a tab, the previous tab you read would have finished saving.
Like I said, nothing too fancy.
The basic algo is to generate an HMAC of the plaintext and compare it against a table of previously uploaded blobs' HMACs. The HMAC is keyed with a key derived from the user's password. When a blob is uploaded, the ciphertext and the HMAC of the plaintext are both uploaded.
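A sketch of that scheme using Python's standard library. The key-derivation parameters (PBKDF2, salt, iteration count) are my own assumptions; the point is that the dedup tag is keyed per user, so the server can match a user's own repeat uploads but can't link identical pages across users.

```python
import hashlib
import hmac

def dedup_tag(password: bytes, salt: bytes, plaintext: bytes) -> str:
    """HMAC of the plaintext, keyed with a key derived from the user's
    password. Deterministic per user, unlinkable across users."""
    key = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000)
    return hmac.new(key, plaintext, hashlib.sha256).hexdigest()

salt = b"per-user-salt"        # illustrative; would be stored per account
page = b"<html>same page</html>"

tag_alice_1 = dedup_tag(b"alice-pw", salt, page)
tag_alice_2 = dedup_tag(b"alice-pw", salt, page)  # same user, same page
tag_bob = dedup_tag(b"bob-pw", salt, page)        # different user, same page
```

Same user + same page yields the same tag (upload skipped); a different user's key yields a different tag, so cross-user dedup is deliberately impossible.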
But maybe browser extensions cannot obtain permission to do that?
I anticipate that PurpleRails will be used on multiple computers and over several years. Which is why pages are sync'd to the cloud.
I've adopted the current architecture because I feel that the energy barrier that needs to be overcome to persuade somebody to install an app is much higher than the one to install an extension.
With PurpleRails, you will rarely be typing your password once you successfully login. It works kinda like Gmail/Facebook etc: you remain logged in for months at a time.
This page explains why password saving doesn't work yet: https://www.purplerails.com/blog/saving-passwords-why-we-hav...
I hope you will hang in there and use the current version until I figure out a way to fix this. Thanks!
Logout if extension not installed is related to the same reasons why saved passwords don't work yet.
The timer is supposed to be a debug feature that I thought was off by default! :) It shows the amount of time you've spent in that tab. I'll turn it off by default in the next update. For now, go into the extension options and uncheck "show page view timer" (near the bottom of the options page).
wrt taking quite some time: the first time you save a page from a site, it's likely to take some time, since things like static assets will also be uploaded. Subsequent saves should be much faster due to dedup'ing which works well on static assets. Let me know if this is not the case.
Copy and paste the same thing into "Autoindex exclude rules" and "Autosave exclude rules".
Could you expand on your feedback on pricing? Can't tell if you're saying $5 is too high/too low. If you wish you can also email me (I couldn't find your email on your site).
I've got another piece of feedback. The numbers seem to be off for me in the web app. It says I have over 300 pages bookmarked, while the interface only displays 100 (which seems like the more plausible number).
$5 is kind of a norm these days, and a lot of times, it's too high, depending on what the service is. For instance, there's IRCCloud. Obviously, the interface is good, and they provide easy to use mobile apps, but all that doesn't justify $60 per year. Another example is Pocket. I really like their apps, but they also followed the $5 norm. They aren't providing me much value to be worth $5 (every month) to me, so I switched over to Pinboard (which, by the way, is one of those few services doing pricing right these days). There are a lot of services that I can justify paying $5 per month for, cloud storage for instance, but a bookmarking service, nope.
If you want to see the kind of backlash Pocket got for pricing their Premium option at $5, just have a look at this thread on reddit:
You can list all pages you've currently saved by adding '?n=10000' to the list page URL. Let me know if the 300 number is incorrect.
Thx for the feedback on pricing. Will take it into account.
2) privacy-first architecture (e.g., plaintext is never uploaded, plaintext URLs are never uploaded etc)
and it is terribly inefficient to store say, 30,000 examples of the exact same article. or do you have a way to check and not store duplicates? if so, what if a blog post is saved today and has 10 comments and it is saved tomorrow by another person and it has 11 comments.
technically the page is different so it would be saved again
I think you need to explain exactly how it works a bit better or maybe I'm just not getting it
Please see my reply to idlewords. It isn't literally everything; there are some heuristics to detect what was interesting to you.
I understand you might find this excessive. I routinely find that useful. :) See also "As We May Think" by Vannevar Bush.
> and it is terribly inefficient to store say, 30,000 examples of the exact same article ...
You're right: no deduping is or can be done across users (HMACs keyed with a key derived from the user's password are used to create the dedup hashes). Storage is sufficiently cheap that I feel it's an acceptable tradeoff vs. privacy (i.e., the server being unable to confirm that two users have saved the same page).
Sure, not disappearing in 10 years (or whenever the original server goes poof) would also be nice, but it's of little benefit if no one ever sees the thing in the first place.
And disappearing is the default, natural state of things.
If I see some people playing music on a corner and return the next night to see they've left, I may be wistful, but it would be silly to argue "playing live music is broken and we should fix it".
If you think of web sites as performances put on for a limited time by the server, it doesn't seem so terrible that they disappear after a while.
Books, clay tablets, scrolls, engraved stone, to which humans owe their entire knowledge of their premodern history, seem to have held up pretty well against entropy. The same is not the case for information disseminated in a controlled manner from privately owned servers.
> If I see some people playing music on a corner and return the next night to see they've left, I may be wistful, but it would be silly to argue "playing live music is broken and we should fix it".
> If you think of web sites as performances put on for a limited time by the server, it doesn't seem so terrible that they disappear after a while.
Thankfully, the generations who produced and preserved knowledge on paper, clay and stone before the onset of digital technology - that is, every generation of humans that has ever lived, except ours - did not think of books and libraries as throwaway pamphlets. And it would take more than an arbitrary interchange of modes of cultural production to argue that we should be doing otherwise in the technological circumstance we find ourselves in.
The "tyrants of the server" are not thinking of server-centric aggregation and dissemination of content as a performance put on for a limited time: they are betting on it as the future of all human literary activity. Google doesn't want to read you a paragraph, take your money and say goodbye; it wants to swallow all the world's books and information, chop it to tiny pieces, store and own it forever, and extract the maximum profit from each tiny piece, without having you pay a penny. And it wants you to come back for more. The persistence of the server-centric model of content dissemination is not an accident; it is dictated by the political economy of the web brought about by the Googles of the world.
Only the ones that have survived. For every book or tablet we have, there are certainly tens of THOUSANDS of which every copy ever published has been lost - most of those are ephemera that wouldn't mean much to us anyways, but the lost also include things that would be nice: the majority of Livy, any of the original source material for the Gospels, etc.
Even considerably more modern material has been vanishing at a significant rate; for instance, most of the output of the silent film era has already been lost.
But many people consider those losses to be an immeasurable tragedy.
Nobody (at least, no sane regular person) reads a webpage and thinks "I have a time-limited license from the originator of this content to consume the material and only use it for their expressly condoned purposes."
The vanishing content problem is like if books in your house randomly walked away just because it's the "natural state" of things to disappear.
There is a difference between the creator and the server. Most of the content you consume is created by people who don't own the servers. Separating appearance+behavior from content source would help actual creators because they wouldn't have to worry about their host deciding one day to delete all their content because the service is being discontinued or the creator is competing with some business interest of the host.
The problem is that the web is being used for everything, even things that can and should work like books rather than like live performances.
Actually, that's exactly what the inventors of (various) recording machines did. Something might have disappeared in one form, only to return in another. Just ask the Project Gutenberg people.
1) I can select (or do select all) Chrome bookmarks that I want to keep offline page backups/archives of (saved to google drive or dropbox or some such).
2) Whenever I want, instead of seeing the current online version of that bookmarked page, I can look up the originally bookmarked archived page.
3) It allows me to choose the level of links to the bookmarked page to also backup/archive (e.g., every single page that is linked to that page, x links deep, is also automatically archived -- think httrack or wget).
As someone on Hacker News once said to me: my bookmarks are my knowledge graph. As important to me as any library.
Now excuse me while I go curate my socks collection.
Also, wouldn't this break analytics and reporting for most websites too? It'll be much tougher to track user behavior to improve user experience. And debugging using log data? I get what the author is suggesting but "fixing the web" this way would break more things that large websites and companies rely on.
That sounds tragically Orwellian. "We do [BAD THING] to help you! [and it also, just by chance, makes us more money too by exploiting you, but never you mind that]"
> break more things that large websites and companies rely on.
The historical "spy everywhere" privacy model of the web isn't a natural state anybody has a right to exploit. Breaking the current centralized curation model would be a benefit for everyone.
True but that's a "YP" not a "MP" (Boogie Nights: "your problem", not "my problem").
I mean we're not talking about a public health issue after all.
>In a 2003 experiment, Fetterly et al. discovered that about one link out of every 200 disappeared each week from the Internet. McCown et al. (2005) discovered that half of the URLs cited in D-Lib Magazine articles were no longer accessible 10 years after publication [the irony!], and other studies have shown link rot in academic literature to be even worse (Spinellis, 2003, Lawrence et al., 2001). Nelson and Allen (2002) examined link rot in digital libraries and found that about 3% of the objects were no longer accessible after one year.
>Bruce Schneier remarks that one friend experienced 50% linkrot in one of his pages over less than 9 years (not that the situation was any better in 1998), and that his own blog posts link to news articles that go dead in days; the Internet Archive has estimated the average lifespan of a Web page at 100 days. A Science study looked at articles in prestigious journals; they didn’t use many Internet links, but when they did, 2 years later ~13% were dead. The French company Linterweb studied external links on the French Wikipedia before setting up their cache of French external links, and found - back in 2008 - already 5% were dead. (The English Wikipedia has seen a 2010-2011 spike from a few thousand dead links to ~110,000 out of ~17.5m live links.) The dismal studies just go on and on and on (and on). Even in a highly stable, funded, curated environment, link rot happens anyway. For example, about 11% of Arab Spring-related tweets were gone within a year (even though Twitter is - currently - still around).
>My specific target date is 2070, 60 years from now. As of 10 March 2011, gwern.net has around 6800 external links (with around 2200 to non-Wikipedia websites). Even at the lowest estimate of 3% annual linkrot, few will survive to 2070. If each link has a 97% chance of surviving each year, then the chance a link will be alive in 2070 is 0.97^(2070−2011) = 0.16 (or to put it another way, an 84% chance any given link will die). The 95% confidence interval for such a binomial distribution says that of the 2200 non-Wikipedia links, ~336-394 will survive to 2070. If we try to predict using a more reasonable estimate of 50% linkrot, then an average of 0 links will survive (0.50^(2070−2011) × 2200 = 1.735×10^−18 × 2200 ≃ 0). It would be a good idea to simply assume that no link will survive.
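The survival arithmetic in that quoted passage is easy to reproduce; under an annual survival probability p, the chance a link is still alive after n years is p^n, and the expected number of survivors among N links is N·p^n:

```python
# Link survival under a constant annual rot rate, using the figures
# from the quoted passage (2200 links, 2011 -> 2070).

years = 2070 - 2011            # 59 years
links = 2200

# Optimistic: 3% annual linkrot (97% survive each year).
p_survive = 0.97 ** years      # ~0.166, i.e. an ~84% chance any link dies
expected = p_survive * links   # ~365 expected survivors

# Pessimistic: 50% annual linkrot.
p_harsh = 0.50 ** years        # ~1.7e-18
expected_harsh = p_harsh * links  # effectively zero survivors
```

Even the optimistic rate loses five of every six links over six decades, which is the passage's point: plan as if no link survives.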
You chose to include everyone's private bookmarks in your research without asking their consent? What?
I will publish some aggregate information about what I find, and use it to seek glory, and persuade people to sign up for archiving. But I won't release anything that could lead back to specific users or links.
There is roughly a boatload of evidence that anonymized datasets can be deanonymized in unexpected ways.
Even if you don't release any anonymized datasets, it's really not good that you decided to take such liberties with people's private links in the first place.
I've made an effort to let anyone who wants to opt out of the research, because I know people can have strong feelings about privacy.
I agree with you that publishing an 'anonymized' dataset would be a violation of privacy guarantees. I wouldn't even do it for public bookmarks.
I didn't get an email saying "There's a chance I might select your private bookmarks and examine them." A blog post doesn't count when you're messing with people's private data. Surely it should be opt-in, not opt-out?
You're doing this for a noble purpose, but for what it's worth, this is the first time my trust in you has ever felt violated.
People have entrusted you with years worth of private data, and you just asked, "Why should I ask permission to study their private links?"
Actually, as far as I can tell, your comment seems to implicitly assume that you already have consent to examine all private links, and that asking consent would only be necessary if you were planning on publishing something that might reveal some of their private links. Isn't that the opposite of privacy?
There are a thousand routine tasks that require me to have unrestricted access to bookmarks and URLs. I try to be as uninvasive as I can about it, but you have no way of verifying that.
If you want something to remain truly private to you—and I say this in full sympathy to your feelings—don't put it on a stranger's computer. Where there's a server, there's an admin.
I fully understood the implications of giving you the data. I'm a fan of your work and your writing, and I had full confidence in your stewardship of my data. Essentially, I was totally okay with you being the admin, or anyone you decided to hire, and I trusted you to take reasonable steps not to look through your users' private data unless it was to track down some bug, test some new feature, or some other incidental task that was unrelated to analyzing that private data.
What I didn't expect was that you'd specifically and intentionally create a program whose sole purpose was to analyze private user data and report on the results.
Why didn't I expect that? The only answer is that I should have expected that. I just didn't realize you were that type of developer. It was a bit shocking that someone who has trumpeted the benefits of sticking with businesses that haven't taken VC investment would explicitly break their users' trust like this.
In this case, you have both the legal right and the moral high ground. But intentionally seeking through your own users' private data without getting consent isn't something that can easily be forgotten.
Re-evaluate. You can often dig your way out of these stupid message board holes by simply apologizing. It's worked for me repeatedly.
I don't mean "okay" in a legal sense, but rather a moral sense.
It reminds me quite a lot of http://i.imgur.com/5quY1Iq.png except that Maciej is a researcher and isn't disclosing the data to other people. Still, he's going through people's private stuff. Notifying them that you're planning to go through their private stuff is the most basic common courtesy; it's why landlords can't simply walk into a tenant's house whenever they feel like it, even though they own the property.
Let's put it another way: I didn't know Maciej was the type of person to trawl through people's private information that they trusted him with. If I did, I would've investigated other options for a bookmarking site a couple years ago, or would've written my own, and I wouldn't have breathlessly recommended Pinboard to whoever would listen. The recommendation would be more like "Pinboard is great, but the owner likes to look at your stuff, even if it's marked 'private,' so keep that in mind."
The reality is less fun. I have to look at (potentially) private bookmarks when:
- someone's import file fails to parse, or has an encoding problem
- there are garbled or missing results for a search query
- I need to answer questions like 'how much disk space does a typical bookmark use', so I can provision what I need
- there's a bug in the fulltext parser
- the Twitter API client misses some tweets or mutilates a URL
- the Pinboard API is misbehaving in one of a thousand ways
- I need to verify that backups I make actually contain everything they're supposed to
- I want to find and fix privacy bugs!
Along with a thousand other scenarios that will be familiar to anyone who has ever had the misfortune to import, format and store user-provided data.
Anywhere bookmarks come into or leave the site, or are displayed on the site, there will be bugs. If I tried to enforce some kind of viewing restrictions on myself, it would just introduce an additional layer of bugs while making my job completely intractable.
I'm not a landlord walking into a tenant's house without permission. I'm a hotel manager, doing my best to be discreet, but ultimately requiring full access to everything in order to do my job. I'm going in to check the sprinklers and fire alarm even if you've left the 'do not disturb' sign on.
This will be the case on any outside site you use, even one that makes sweeter promises to you than I ever did. Please think twice, and then three times, before uploading your data anywhere if you have these kinds of expectations.
You created a program whose sole purpose was to analyze private user data and report on the results. (In fact, not merely "private user data," but "data which users explicitly marked as 'private.'")
What you did was equivalent to a hotel manager sending employees to peep into 1,000 random rooms and compile a detailed report of what those rooms contained and what their occupants were doing, and then claiming it was for the betterment of all hotels. Yes, that may be true, and the data may be quite helpful, but people still expected their rooms to be private.
Intentions matter. You weren't accessing the private bookmarks in order to fix a bug or test a new feature.
It looks like you and I won't see eye to eye on this, and apparently I'm the only person in the world naive enough to trust that you'd refrain from going through other people's private belongings that they entrusted you with.
All of Pinboard's servers belong to Pinboard. All of Pinboard's hard drives belong to Pinboard. All of the bits on those hard drives belong to Pinboard.
Your private data on Pinboard's servers belongs to Pinboard. This is how web services work. If the web service couldn't access its own data, it wouldn't be able to operate.
You can only assume that some bits are yours (to use your terms: your private belongings) when they sit on your own server, when your programs are the only ones with access to them.
Pinboard is not a hosting company (à la DigitalOcean); it's a web service. It's not a hotel, and it's not a bank giving you a safe.
He intentionally created a program to trawl through data which was explicitly marked by users as private. He didn't do that for operational reasons.
It is much better to lose an argument, even one you're right about, than to be a liar.
You didn't even explain how, precisely, I'm a liar.
You know what a liar is? It's someone who deliberately goes out of their way to distort the truth in order to gain some kind of advantage, even when they know they're not being truthful.
Do you think I'm sitting here trying to win this debate merely because I have a problem with being wrong? If that's the case, then I had no idea you thought so little of me. This isn't even about me.
Fact: When a user submits a bookmark to Pinboard and flags the bookmark as 'private,' they have a reasonable expectation of privacy. That includes privacy from the admin: it's acceptable for them to use the data for debugging or to test new features, but not to write a script whose purpose is to delve through data marked as private.
Fact: The owner deliberately wrote a script whose purpose was to go through private data. You can read the original blog post on the topic. It doesn't matter that it was an experiment for the betterment of the web. The fact is, he didn't seek consent, and users had no idea the data they flagged as 'private' was being examined by his script.
Now, you can call me a charlatan, a dilettante, or whatever other harmful thing you wish to call me, but I have no idea how this case warrants me being called a liar. Normally, I'd let this drop, but you have publicly targeted my character and reputation. I'd like an explanation, please.
You have options available to you besides seething about this and leaving the site.
Curiosity: what's your Pinboard account name? Mine's the same as my HN name.
It's not a direct quote since I'm on mobile, because I had to get out of the house and go for a walk after one of my heroes called me a liar. But it's factually accurate. What part isn't?
I'd like an answer to this question: Would you hire a liar? Someone you believe would go out of their way to be deliberately untruthful? Rather than even try to figure out if there was some sort of misunderstanding, you went straight for calling me a liar. You, of all people.
I am not seething. I'm quite hurt.
Also, the fact that half of all Pinboard links are marked 'private' should give some indication that people commonly use Pinboard as a repository for links they don't want to associate with themselves publicly. That's what we're paying for.
EDIT: Here's the full quote, from https://blog.pinboard.in/2014/08/researching_link_rot/
To run the experiment, I am going to be drawing a few thousand links at random from the entire pool of Pinboard bookmarks. This will include private bookmarks, which make up about half the Pinboard collection.
He intentionally created a program to trawl through data which was explicitly marked by users as private.
And here's what you wrote earlier:
you'd specifically and intentionally create a program whose sole purpose was to analyze private user data and report on the results.
He did no such thing.
Perhaps, instead of directing all your energy into maximizing the feels you generate from being called on something, you could instead introspect and re-evaluate and consider that maybe you said something very wrong. People do that all the time. They do not ritually kill their accounts when that happens. The older, wiser ones are likely to just acknowledge it and apologize. Some of the somewhat younger, dumber ones, like me, have done that too.
We are both crudding the thread up now, so I'm going to stop posting about this.
1. Copyright law.
2. Dynamic content.
More reasonable ways of utilizing a blockchain for undeletable data are emerging, though.
There are plenty of books that go out of print within ten years, we just happen to have infrastructure beyond publishers (libraries) that preserve published copies.
The Library of Congress is archiving all tweets from the US, which I think is what he is referring to.
We already have systems like this (bittorrent, freenet, etc.), and almost no one sees them as a viable replacement for the web because they can't do 99.9% of the things we want (social networks, forums, email, etc.)
This is a simple and very bad idea. If it were the norm, instead of one or no copies of a particular work online, you would have any number of "curated" copies of uncertain vintage, downloaded at different times in the lifetime of an original whose content might well have changed as time passed. You would have curations of curations, and curations of those, ad infinitum.
Pages that depended on remote Web content (increasingly common) and/or linked to online references would gradually become unreadable or incomprehensible as their links vanished into other offline "curations".
Not to mention the copyright issues. And I'm not crazy about the term "curation" either -- it's obviously meant to try to elevate the practice of downloading anything we please, without regard to copyright.
These crooks will even steal your copyright notice. It's quite possible the original content producers are offline because scraper thieves stole so much content that it's no longer possible to earn a living.
As an artist, this reminds me of the condescending attitude that gave us fake Rolexes, Facebook & North Korea's 28 state-approved haircuts. Either it's "just content" to stuff in a database somewhere or you understand the medium is the message too.
With the current environment on the internet, with DRM'd video, music and text, I have to assume that we will lose far more from this time period than we ever had before.
While I don't pirate things (I'd rather just consume Creative Commons and Public Domain content), I wholeheartedly support people who are trying to archive the things that are part of our collective culture. When I have kids, I'd like to be able to show them where they came from.
Well, luckily, with digital technology, we can copy things flawlessly with very little cost. Unfortunately, most content creators are still stuck trying to adapt physical distribution models to the information age, which is why we're stuck with DRM. You can't say "The Medium is the Message" and then get angry because you're producing content for a medium that is infinitely copyable.
We need to move to a model of perceiving data as holographic. Especially with the advent of blockchain technology, we're increasingly moving to a model where every node in a network contains the entire network. Trying to adapt 20th century Disney copyright to that paradigm is stupid.
The beauty of perfect copies is that you can create your own faded version if you want, and other people can have their own perfect or artificially aged version too. You can't do it the other way.
I used Evernote for years, but recently canceled the service because I spent too much time curating compared to reading old material.
One option that I am considering is archiving really good web content as web archive files and saving them locally in folders indicating the year of capture. Local file search would quickly find old stuff and if I stored the yearly web archive folders in Dropbox, I would have them available on different systems.
On the downside: stuff hosted by others might go away. Web pages, web services, apps requiring server side support...
Investing a lot in a service makes it more painful to lose, like the apparently discontinued Amazon Cloud Drive (supposed to be a cheaper Dropbox): https://news.ycombinator.com/item?id=8219257
I wonder if there are browser extensions that do p2p caching/distribution of content. Then you could standardize a protocol used for that type of communication.
I believe there are many efforts along these lines. The trick is as usual getting everyone on the same page or at least working together more.
In the world today, we often think of publishing online as providing access to something under our control. I think a technology that aims to solve these problems should embody a different spirit, one closer to "making public". The word "mine" doesn't need to imply ownership in the sense of exclusive control; "my children" is at least as meaningful a relationship as "my property". Some kind of copyright-licensing ability built into a distributed document publishing system would be nice.
Even if I'm making a copy of every page I see, I'm not sure I'll still run a federated wiki on my server in 10 years.
I don't think this is a real solution to the problem posed by Bret Victor.
A ridiculous concept of a "right". I do not recognize that anyone has a right to force other people to forget things.
There is a big copyright issue here as in the UK we don't have the relatively liberal Fair Use exceptions that are in the USC - we only just got permission to format shift and make personal-use backups (so MP3 players [using tracks ripped from CD] are legal as of June this year!). Copying down a website, beyond caching, is generally speaking copyright infringement for those in the UK.
Not only that. If a domain expires and is picked up by a squatter, the squatter can instruct Archive.org to delete ALL copies of content archived from that domain. Unfortunately many do so.