Archiving your website in the Wayback Machine (simon-frey.eu)
50 points by l1am0 on June 12, 2018 | 39 comments

Of course, if you're going to offload hosting, GDPR concerns, or just your own backup peace of mind to the Internet Wayback Machine, don't forget to say thank you - https://archive.org/donate/

A few years back I removed all like buttons, Disqus comments and analytics from my blog, even though I had hosted the images myself and generated the links. It took me maybe an hour, and only because I was browsing reddit or something at the same time. Now I do not need to do anything about GDPR.

Keep in mind you still need a privacy policy, even if it just says "we store anonymised access logs which we delete after X months and that's all" (paraphrased).

EDIT: Not sure why the downvotes. Prove me wrong if you disagree. Every lawyer I've heard after the GDPR echoes the sentiment that even simple sites that don't do anything fancy should have at least a minimal privacy policy.

There are certain requirements a site could theoretically meet to be entirely exempt, but if you have a public-facing website in 2018 you're probably not, even if it's a "digital business card".

Besides, the point is that a privacy policy that explicitly tells a visitor that you're really not doing much at all with their data is a positive signal, whereas a missing privacy policy is more likely to indicate that a) you don't care about your users' rights, b) you have no idea what you're doing or c) you don't want to tell your users what you're doing with their data.
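To make the "anonymised access logs which we delete after X months" wording concrete, here is a minimal sketch of what such a policy could mean in practice. The prefix lengths (/24 for IPv4, /48 for IPv6) and the 90-day retention period are assumptions chosen for illustration, not values from this thread, and whether truncation alone counts as anonymisation is a legal question this sketch does not answer.

```python
import ipaddress
from datetime import datetime, timedelta


def anonymise_ip(ip: str) -> str:
    """Zero the host part of an address before it is written to a log."""
    addr = ipaddress.ip_address(ip)
    # Assumed prefix lengths: keep the first 24 bits (IPv4) or 48 bits (IPv6).
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def is_expired(logged_at: datetime, now: datetime, retention_days: int = 90) -> bool:
    """True if a log entry is older than the (assumed) retention period."""
    return now - logged_at > timedelta(days=retention_days)
```

A log rotation job could call `is_expired` on each entry and drop the ones that return `True`.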

AFAIK whether IP addresses and user agents are considered personal information is still up for debate.

Nevertheless it is not much work to add that so I probably will. I've already added that kind of policy to my app: "I keep nothing because I don't collect anything"

IP addresses combined with time stamps are personal information if there's only one person using that IP address at that moment in time. If you don't want to personally verify that the IP address was used by more than one person at the given time, you should consider it personal information.

What is apparently unintuitive to some people (especially programmers) about the GDPR definition of "personal information" is that it's not a clear cut list but highly contextual: if it can be used to unambiguously identify a single individual by someone, then to that someone it is personal information and they need to treat it as such.

Absurd example: if someone called "John Smith" writes their name on a piece of paper and you find it ten years later, it's probably not personal information as you have no way to determine who that name refers to or even whether it refers to anyone at all.

Absurd counter-example: any given number can be personal information when used to identify specific individuals in the real world (e.g. numeric user IDs), so slipping your coworker a piece of paper with the number "12345" on it can be personal information if both of you know who it refers to (or can look it up).

It is personal information but is it personally identifiable?

For example in order to anonymize medical data you have to ensure that parties who do have access can not go from data to the person. This usually means that the database with data only has some identifiers and another database has the correspondence of identifier -> personal information.
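The two-database arrangement described above could be sketched roughly as follows. The table names, the random-token pseudonyms and the sample fields are all hypothetical, and in a real setup the identity table would live in a separate, access-restricted system rather than in the same process.

```python
import secrets

# Kept separately and restricted: pseudonym -> personal information.
identity_table: dict[str, str] = {}
# Shared with researchers: (pseudonym, data) rows only, no names.
research_table: list[tuple[str, str]] = []


def add_record(name: str, diagnosis: str) -> str:
    """Store a record under a pseudonym, reusing the person's existing one."""
    pseudonym = next((p for p, n in identity_table.items() if n == name), None)
    if pseudonym is None:
        pseudonym = secrets.token_hex(8)  # random, so it reveals nothing by itself
        identity_table[pseudonym] = name
    research_table.append((pseudonym, diagnosis))
    return pseudonym
```

Anyone holding only `research_table` sees pseudonyms; going back to a person requires `identity_table` as well.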

Back to the IP address topic, it is true that an IP address plus a timestamp is enough to identify a specific computer. However, as a site owner I have no realistic means to go from there to knowing who the person is. For example, if somebody asked me to remove their data, they would have to tell me which IP they had when. (This is actually an interesting legal/technical question: should Apache logs be purgeable?)

This being said, to me it seems fair to specify that these logs are created and kept. It would be nice to have a boilerplate paragraph to paste into sites that only keep the Apache logs.

I'm using the two phrases synonymously because the GDPR protects personal information, which is the superset (the distinction of PII as a subset seems to be more relevant in the US).

The GDPR is concerned with privacy. If you derive information from my personal information in a way that makes it impossible to go back from the derived information to identifying me as an individual, it's anonymous and thus not relevant to my privacy. However if you use this exact same process but only use it once and record that you only did it with my information, it becomes linked to me again and can no longer be considered anonymous.

For another example: imagine you have a closed group of 1000 participants. "One of the guys with blonde hair" is probably a fairly ambiguous identifier because you'd expect there to be more than one person in the group that description could apply to. However if I'm the only blonde guy in that group, it's now clearly referring to me as an individual and thus affecting my privacy.
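The blonde-hair example can be expressed as a small check: a description singles someone out when exactly one member of the group matches it. The group data and attribute names below are made up for illustration.

```python
def matching(group, **attrs):
    """Return every member whose attributes match the description."""
    return [p for p in group if all(p.get(k) == v for k, v in attrs.items())]


def singles_out(group, **attrs):
    """True if the description matches exactly one member of the group."""
    return len(matching(group, **attrs)) == 1


# Hypothetical closed group of 1000 participants with a single blonde member.
group = [{"id": i, "hair": "brown"} for i in range(999)]
group.append({"id": 999, "hair": "blonde"})
```

With this data `singles_out(group, hair="blonde")` is true, and it stops being true as soon as a second blonde participant joins.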

The thing about storing IP plus timestamp on the other hand is that while it may be practically anonymous to you, you're storing it. Even though you can't resolve that information to a single person right now, someone else could if you gave them access to it.

You can make an argument about where exactly the line should be drawn considering it's rarely impossible that someone somewhere could use seemingly innocuous data to identify someone but it's not much of a leap to go from an IP address and a time stamp to a subscriber who might be a single individual: IPs are publicly registered to ISPs and those ISPs know who they assigned the IP to at a given point in time (especially in countries like Germany where ISPs are required to keep records of this) so you can already easily convert "IP plus timestamp" to "IP plus timestamp plus an organisation that is capable of resolving that IP plus timestamp to a subscriber". In other words: at best an IP plus timestamp isn't anonymous, it's at least pseudonymous (even if you have no legal means of resolving that pseudonym).

FWIW there are free privacy policy generators out there for small scale websites (e.g. most blogs). Here's a good one for Germany: https://datenschutz-generator.de/

EDIT: For programmers:

You have a pair (IP, timestamp).

IPs are publicly registered to ISPs so you can resolve that to an ISP, which can act as a function (IP, timestamp) => subscriber.

A subscriber can be, among other things, a single individual.

So storing the IP and timestamp is the equivalent of storing an identifier from a lookup table of subscribers (some of whom are single individuals).

Whether the result of the lookup table is accessible by legal means (e.g. a warrant) or technical means (e.g. a decryption key) or practical means (e.g. a literal key to a safe) makes no difference.
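As a rough sketch of the same argument in code: the `LEASES` table stands in for records that only the ISP holds, and every address, time and subscriber name in it is invented for illustration.

```python
from datetime import datetime
from typing import Optional

# Hypothetical ISP-side records: (ip, lease start, lease end, subscriber).
LEASES = [
    ("198.51.100.7", datetime(2018, 6, 12, 0, 0), datetime(2018, 6, 12, 12, 0), "subscriber-A"),
    ("198.51.100.7", datetime(2018, 6, 12, 12, 0), datetime(2018, 6, 13, 0, 0), "subscriber-B"),
]


def resolve(ip: str, ts: datetime) -> Optional[str]:
    """The ISP as a function (IP, timestamp) => subscriber."""
    for lease_ip, start, end, subscriber in LEASES:
        if lease_ip == ip and start <= ts < end:
            return subscriber
    return None
```

The site owner only stores the arguments to `resolve`, but those arguments are exactly the key into the ISP's table, which is why the pair behaves like a pseudonym rather than anonymous data.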

"So I searched trough my websites: Remove Facebook Like Button, Remove analytics, Add privacy statement, Add cookie opt-in/out …. the list goes on."

This is exactly why GDPR is so essential.

Not a single thought about why tracking was being allowed, or about the trade-offs made for others' gain, had occurred before GDPR.

For a blog GDPR is a non-issue.

Though I'm deeply disappointed in how many large sites have interpreted "opt in" as spending 5 minutes vigorously unchecking boxes with ambiguous meaning, hopefully that will bite them hard.

This isn't rocket science. Yes, if you go out of your way trying to game your users' privacy as much as possible then things will get hairy. That's a feature and the whole point.

> Though I'm deeply disappointed in how many large sites have interpreted "opt in" as spending 5 minutes vigorously unchecking boxes with ambiguous meaning, hopefully that will bite them hard.

The fun part about that one is actually that the "opt in" by default is a blatant violation of the GDPR when that's supposedly what they're trying to comply with. If you're in the EU, make sure to file a complaint with your data protection agency (e.g. in the UK that's the ICO, in Germany that's your Landesdatenschutzbehörde).

Yeah, it is surreal.

I'm overly stressed out at the moment but I've encountered a couple of local sites that behave quite badly in this regard. I'll probably try to direct an email to the developers and see what they have to say about that and depending on the answer (or lack thereof) I'll continue with a complaint.

Have you actually tried to read and understand GDPR before getting to this level of self-righteousness?

A blog writer wants an easy way of tracking popularity of their posts. Clearly I need protection from this.

The internet is worse off after GDPR.

If you're using Google Analytics, you're paying for the analytics by letting them abuse your users. You can get some protection by using their anonymisation setting and signing a data protection addendum.

If you really want to be on the safe side and respect your users' privacy, the best choice is self-hosting a free analytics tool like Matomo (formerly Piwik). Of course you still need to make sure your hosting company respects your users' privacy too.

It's literally not the GDPR's fault if you're having problems protecting your users' privacy. It's the network effect of entire ecosystems of companies never having had any concern for privacy over decades. Don't complain to the EU, complain to companies not implementing privacy by default and only trying to shoe-horn it in as an afterthought.

Perfect use case for self-hosted analytics, such as https://usefathom.com

And they can have easy ways of tracking popularity. You can use Google Analytics and similar products in compliant ways, and the tools for that have existed for ages (at least in the case of GA).

Except that if you add a "login with" button then you suddenly start processing personal data (even if you only store social media ids).

But you don't just store social media ids.

You seem to misunderstand that you're entering into a partnership with a third party. They become a data processor on your behalf. They process a whole lot more, and you have access to a lot of that data. Fortunately, the social media platform includes the privacy policy and consent process in their onboarding of users, so you don't need to worry about it for the purposes of social login.

You actually need to worry, because ids are personal data.

Even if you abandon all these social media platforms and use emails for login, email addresses are still personal data, and you are still processing them.

There is already quite a good plugin available from Berkman, Harvard University, related to what the author's idea seems to be. It works with Drupal as well as WordPress, which I worked on as my GSoC project (there are nginx/httpd modules as well). http://amberlink.org

It uses Archive and local copies as backends, and there were ideas to support IPFS, among others.

What does Amber do?

Amber is an open source tool for websites to provide their visitors persistent routes to information. It automatically preserves a snapshot of every page linked to on a website, giving visitors a fallback option if links become inaccessible.

If one of the pages linked to on this website were to ever go down, Amber can provide visitors with access to an alternate version. This safeguards the promise of the URL: that information placed online can remain there, even amidst network or endpoint disruptions.

I'd argue that running a webserver on a VPS/Raspberry/etc, with no logs, serving a static site or a wget mirrored version of that site is better - you still own the site, but if you don't store any information of your visitors at all, in any form, there's no way GDPR will bite you.
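For illustration, a static file server that writes no request logs can be sketched with Python's standard library alone. This is not a production setup; the port and bind address are arbitrary assumptions, and the hosting provider or any proxy in front of the machine may still keep its own logs.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class NoLogHandler(SimpleHTTPRequestHandler):
    """Serves files from the current directory and discards all log output."""

    def log_message(self, format, *args):
        # The base class prints client address, time and request line to stderr;
        # overriding this method drops both request and error logging.
        pass


def make_server(port: int = 8000) -> ThreadingHTTPServer:
    """Create (but do not start) a server bound to localhost."""
    return ThreadingHTTPServer(("127.0.0.1", port), NoLogHandler)


# To run it: make_server().serve_forever()
```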

The tooling is ridiculous, while we can do this with some barebone bookmarklets.

Save url to archive:


Search the archive for url:
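Neither bookmarklet body appears above. As a stand-in, and explicitly not the commenter's code, here is a minimal Python sketch that builds the two Wayback Machine URLs such bookmarklets would presumably open. It assumes the public `web.archive.org/save/` and `web.archive.org/web/*/` URL patterns; the function names are hypothetical.

```python
WAYBACK = "https://web.archive.org"


def save_url(page_url: str) -> str:
    """URL that asks the Wayback Machine to capture page_url now."""
    return f"{WAYBACK}/save/{page_url}"


def search_url(page_url: str) -> str:
    """URL of the calendar view listing existing captures of page_url."""
    return f"{WAYBACK}/web/*/{page_url}"
```

Opening the returned URL in a browser, for example via `webbrowser.open(save_url(...))`, would do roughly what a one-line bookmarklet does with the current page's address.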



The people downvoting don't even know how to write a for loop...

A blog only displays public content on a page right? How can that be affected?

By adding a shit-ton of tracking and analytic code from 3rd parties that provide some convenience. Blog owners are worried because it's not always clear what happens through the imported 3rd-party code, and now they have to start caring about that.

There's Matomo[^1] and ancient tools like awstats[^2] which are self-hosted and can be configured to be completely GDPR friendly.

I thought the "need" for a silly amount of analytics died with 3rd party website visitor counters back in the day.

[^1]: https://matomo.org/

[^2]: http://www.awstats.org/

They should've been caring about that from the start. If GDPR is what it takes to get bloggers to think twice about embedding Like buttons and analytics/trackers up the wazoo, then I for one welcome our new European overlords.

Facebook like-buttons. Google Analytics. Comments function that stores user data. Server logs.

Can all be documented appropriately or adjusted to not collect unnecessary data, but it is some amount of work. It's kind of embarrassing that the author ran a platform for privacy-friendly stuff though and didn't have some of that down before, especially since in Germany a lot of these ideas are not new with GDPR.

A static archive would also have been very easy to self-host in a privacy-friendly way, but using archive.org is interesting (not sure if a good idea, but interesting)

Facebook have changed their privacy policy and ask for consent to track via external sites through things like the 'like' button.

Google Analytics is largely anonymised and deletes data after a set time now, including any custom user data that you may set up yourself. You have a legitimate reason to track people interacting with your web property.

Also, if a site is a personal project, not a business, it doesn't even come under GDPR as there's a household use exemption. (Not sure how a household use interacts with things like FB but let's assume non-third-party processing of data is exempt - e.g. comments with named user labels on a personal blog)

This feels like a whole lot of fuss over a very shallow reading of one of the most carefully thought-out pieces of privacy legislation the world has seen. It's got flaws but these aren't them.

I've just about had it with these GDPR shitposts on HN.

The website the author talks about (datenschutzhelden.de meaning "data protection heroes") was apparently a platform to share tools and best practices for online privacy. Now it turns out the same guy running that website thinks it's too much hassle to remove Facebook integration and offer a cookie opt-out.

That's truly next level hypocrisy.

Exactly what I thought.

Yes, respecting all the legal parameters can be time consuming when you are building new features (especially in corporate environments), but actually I think laws like the GDPR are quite important in our modern day life and as an EU citizen I am thankful we got it.

I can't get into Simon's head, but I've got a feeling that GDPR contradicts the basic maxim of hacker culture that my computer belongs to me. The website seems to teach how those who care enough can protect their privacy by getting control over one's own computer, not imposing requirements on the others'.

Your computer belongs to you. My data belongs to me. I can give you my data and you can keep that data if you tell me what you are going to keep it for and how you are going to use it and when I agree with all of that, but you don't get to abuse it for anything else and I can revoke that permission at any moment and you have to comply.

It's not "imposing requirements", it's called "respecting consent".

The more I hear arguments like this the more it reinforces my impression that "hacker culture" isn't really about experimenting with technology but more about self-entitled rich kids abusing other people and shared property for their own fun and profit (like young Zuckerberg marveling at being trusted with access to people's private information without understanding the implied mutual understanding his users assumed to be self-evident).

I feel like the GDPR is the Code of Conduct of privacy laws: it codifies a modicum of respect that should not need explicit mention but seems to have been entirely lost on entire generations of (aspiring) Silicon Valley hacker types and thus catches them by surprise when it really should be the least you can do.

At the very least you are now aware that when you're violating your users' privacy (if only by handing off their data to random BigCo's you have no formal contract with) you're breaking the law just as clearly as those cool '80s kids were breaking the law when they whistled into phones to cheat their way to free phone calls.

Not to start a long philosophical discussion, but hacker culture (you might not like it, but the author seems to be sympathetic to it) has traditionally been critical of the notion of 'intellectual property', that is, that by creating some intellectual work I can prohibit others from redistributing it. The idea that I 'own' my personal data seems to be another step further in diluting the notion of property: this time I don't even need to create anything to impose limitations on others.

It is also not about 'rich' and 'poor', it's about clear rules that are the same for the rich and for the poor alike.

I would have considered myself a "hacker" in my teenage years when I was teaching myself programming by digging through language specs online and looking at other people's code to understand what makes it work.

However it seems that "hacker culture" as the author likely sees it (also as described in Steven Levy's "Hackers") is really more about privilege than anything else. A lot of the antics that have entered hacker lore were only possible because the kids performing them were in relatively risk-free environments (particularly the notorious MIT Tech Model Railroad Club). Not necessarily privilege in the modern social sense but certainly in the sense of class (unless you believe being able to study at MIT is 100% about merit and nothing else).

It doesn't matter whether the "rules" of hacker culture are the same for those with privilege and those without: just as in startup culture, you're far freer to experiment if you have a safe environment to fall back on if you screw up. If you're an MIT kid with wealthy parents a botched prank is less likely to land you in jail and this knowledge allows you to take risks more easily.

Sure, there's a level of anarchism in hacker culture but too often the kind of "hacking" that lands you venture capital for your startup (especially "growth hacking") also includes a blatant disregard for others (again remember Zuckerberg and the "suckers").

You may argue that this is a deviation from the original hacker ethos or not "true hacking" but there doesn't seem to be anything in hacker culture to exclude these people by (which is why I mentioned the formal rules you now often find in codes of conduct, which many decry as superfluous and unnecessary because they seem to state the obvious).

As to your real point: the idea of owning data is the polar opposite of what copyright has come to be about (at least in the US): data is owned by the individual. You can grant a company usage rights but they're always highly specific and easily revocable. Personal data is not "intellectual property", it's an aspect of your own identity.

In the years since the "Social Web" we've seen many failed attempts to allow users to "reclaim" ownership of their data. Microformats, decentralisation, software like Diaspora, the Unhosted movement, and so on. Most of them failed for practical reasons. Few of them really addressed privacy concerns, even fewer really enforced data ownership. The GDPR is promising to accomplish what hundreds and thousands of hackers have tried to do for years: not by rebelling against the BigCo's, but by redefining privacy and data ownership as human rights.

If you understand hacker culture, you will also remember that before the Social Web the norm was to be anonymous: "on the Internet nobody knew you were a dog", "men were men, women were men and 14 year old girls were FBI agents". You'd go by pseudonyms by default and freely pick new ones to swap identities. Unmasking people was possible, to a degree, but difficult because of dial-up and dynamic IPs.

Nowadays every single coffee pot in your home could theoretically have a dedicated IP address and most of the Internet we use to share information is accessed using a browser that's often uniquely identifiable without even looking at the IP. It's no longer enough to rely on technology to grant us anonymity. The GDPR restores some of that early '90s anonymity. Not by outlawing technology but by enshrining new human rights and forcing us to respect them.


I wouldn't agree that these attempts failed completely. As the whole free software world does, they created better and better tools that at some point could have become good enough to actually protect one's privacy and at a later point could have become usable by non-hackers as well.

Now, at a time when it's easier than ever before for every (well, not every every but you get my point) schoolboy/girl to create their own standalone page with comments, their own e-mail server and whatever they want, they will probably not be able to do so without risking being drowned by an Abmahnungswelle. Not to mention that all decentralized social network projects are at risk for approximately the same reason.

One might hope that in the future we'll have a reproducible technology for creating GDPR-proof websites and the world will be a happy place again, but solving legal issues with code is a notoriously difficult problem. Legislative acts are not code, and something as vague as GDPR is not even a spec.

> young Zuckerberg marveling at being trusted with access to people's private information without understanding the implied mutual understanding his users assumed to be self-evident

I disagree; he understood just fine, hence saying "they trust me". I'd say he was marveling at the naivety of those users for putting such faith in a random stranger. He probably, like so many of us into computers back then, had an understanding that you didn't use your real name online, let alone pictures or addresses. Seeing 4000 people blindly disregard basic safety rules would certainly be remarkable.

Where I diverge from him is that my action after calling them dumb fucks would be to kill the experiment and warn people of the dangers of what they were doing, not to double down.

Sure, it was seen as a social faux pas at the time but mostly because the web was much more personal: companies had no need to do anything nefarious with your data, so they wouldn't have cared, but a lone individual running a random website could potentially know you and ridicule you in front of your friends (especially when it's some random kid at the same university).

What Facebook did was prove that you could make a business out of exploiting those users' personal data without needing to cause obvious harm to them directly (i.e. in ways that were still potentially unethical but 100% legal).

The reason I'm saying he was being unethical at the time is that he saw their misplaced trust as an opportunity to exploit them (initially by snooping for fun, later by exploiting their data for profit). The ethical response would have been to either reject the responsibility (as you describe) or acknowledge it and start thinking about how to protect that data for the users.

Hey uhnuhnuhn,

we actually dropped datenschutzhelden.org for other reasons (mostly time).

But after we closed it in February, we still would have been legally responsible for the site.

As the project was already dead I decided to go the described path.

So just to be clear: Datenschutzhelden.org was not closed because of GDPR. But keeping the archive of it up and running in a GDPR world was too risky for me.

That makes more sense.

Maybe better to change the article because it states the opposite:

"I found myself at a point where I got so desperate that I was close to taking all of my stuff offline. But in the end I didn’t. I decided for the middle way: For some projects it’s worth the hustle and for others not. One project I ditched was datenschutzhelden.org."

That reads as: GDPR compliance was such a hassle you almost closed all your projects, in the end you only closed some of them and replaced them with WBM.

The article is updated :D
