Show HN: I made a neural net that analyzes privacy policies (useguard.com)
502 points by rameerez 21 days ago | 120 comments

Hi guys!

So as the basis for my thesis on AI and NLP I've been working on an RNN-based text classifier that basically reads and analyzes privacy policies. It understands that "we don't share your data with third parties" is privacy friendly while "we may share your data with anyone" is a potential threat.

I've then created this website with a bunch of analyzed services to showcase the most relevant info about each service along with other interesting stuff like recent data breaches or instructions to delete your account in said service.

Happy to answer Qs about the tech behind it. It'd also be great to hear your feedback on what the site lacks and possible improvements!

Oh my god thank you so much for doing this!

I think it's better (for me as a user) if you don't boil things down to a score, as different people expect different things when talking privacy. It would help if you could simply highlight the potentially problematic clauses in different privacy statements along with some reason why each might be problematic.

I don't agree. I just looked in my password manager, and I have roughly 220 accounts across the web. If I want to go through that list and see which websites rank well and which rank poorly, and I want to do that in under two hours, that gives about 30 seconds per service.

In other words, giving a single score plus a two-sentence highlight is probably about the right amount of information.

How about a compromise - not a score for each policy but for "each" individual, or tranche of similarly concerned individuals. I go through a list of privacy options (maybe just once) and the point at which my "okay" becomes "not okay" determines my score. Then each policy simply passes or fails based on my score. And if you want more detail, the list of failure items for a given policy can be bulleted.
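The pass/fail idea above could be sketched roughly as follows. Everything here is hypothetical: the `Clause` type, the 0-10 `concern` scale and the example clauses are made up for illustration, and nothing is taken from Guard's actual model.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    text: str
    concern: int  # hypothetical scale: 0 = harmless, 10 = alarming


def failing_clauses(policy: list[Clause], tolerance: int) -> list[Clause]:
    """Clauses whose concern level exceeds what this user is okay with."""
    return [clause for clause in policy if clause.concern > tolerance]


def passes(policy: list[Clause], tolerance: int) -> bool:
    """A policy passes for a user only if no clause exceeds their tolerance."""
    return not failing_clauses(policy, tolerance)


# Made-up example policy, not taken from any real service.
policy = [
    Clause("We don't share your data with third parties.", 1),
    Clause("We keep your messages for 90 days.", 4),
    Clause("We may share your data with anyone.", 9),
]

for tolerance in (3, 9):
    verdict = "pass" if passes(policy, tolerance) else "fail"
    print(f"tolerance {tolerance}: {verdict}")
    for clause in failing_clauses(policy, tolerance):
        print(f"  - {clause.text}")
```

The same policy passes for a relaxed user and fails for a strict one, and the failing clauses double as the bulleted detail list.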

Or make the rank adjustable to some personal criteria that matches different privacy expectations.

This would also be helpful in determining how to weight (or not) user feedback in the training portion. I just tried it out (the 10 questions) and there were at least a few I thought, "huh, I know some others would disagree with me on this" because I value X and they value Y more.

Having scores that weight X more than Y would give me more accurate scores, while seemingly also giving other people more accurate scores at the same time.

A good compromise would be a Chrome extension that shows a 1-10 score. You click the extension to see the clauses.

Not for everyone, as thecleaner pointed out. You are assuming your requirements are universal among other users. Also, you are assuming the policies can be simplified to a weighted average of their parts, which is not necessarily the case.

well, just checked and i already see the majority of the web gathering around "C". so really, we can argue both ways about this score thing...

The scoring definitely needs some work – I think some factors should have more weight. Also, services' data varies a lot, so it's difficult to come up with a good measure for everyone. Ex: I try to take into account whether the service has had any recent data breach, so it penalizes a lot if it has but also scores rather low if it hasn't; privacy policies' lengths vary wildly and I think that also plays a large role... It needs some tweaking but I think with some improvements I'll reach a more accurate scoring.

For a site trying to fight for privacy, don't you think it would be better to not use Google Analytics to track the people who visit your site?

Yes, definitely. I mention it in Guard's own privacy policy. I don't like using it either, but the reasons are: (a) it's the simplest and, as far as I know, one of the few free analytics tools available, (b) not having a measure of the website activity would make me effectively blind and unable to make decisions, (c) I don't send any personally identifiable events (and, for that matter, I don't send any events apart from page-loaded events). I'm also open to suggestions to replace GA.

You could just not use user-level analytics; I run the infrastructure for PrivacySpy.org, and using CloudFlare's aggregate analytics has served us perfectly well.

Have you considered hosting your own Matomo server? You could just scrape logs or use a Javascript tracker. But the data will stay local to you.

+1 for Matomo, it's a no brainer for any 'privacy service' to stay far away from Google Analytics.

I’m on mobile, so I can’t check if you’re already using this technique, but you could always use the anonymize IP function. This way the last three digits (the final octet) of the IP address will not be sent through the Google Analytics script. More info: https://www.jeffalytics.com/gdpr-ip-addresses-google-analyti...
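For anyone wondering what that kind of masking amounts to, here is a rough sketch in Python. It assumes the behaviour Google documents for IP anonymization (the last octet of an IPv4 address, or the last 80 bits of an IPv6 address, is set to zero). `anonymize_ip` is a hypothetical helper written for this illustration, not part of any Google Analytics API.

```python
import ipaddress


def anonymize_ip(address: str) -> str:
    """Zero the last octet of an IPv4 address or the last 80 bits of an IPv6 one."""
    ip = ipaddress.ip_address(address)
    # Keep the first 24 bits for IPv4 and the first 48 bits for IPv6.
    prefix = 24 if ip.version == 4 else 48
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)


# Addresses from the ranges reserved for documentation.
print(anonymize_ip("203.0.113.77"))
print(anonymize_ip("2001:db8:abcd:12:34:56:78:9a"))
```

Note that the masked address still narrows a visitor down to a block of 256 IPv4 addresses, so this reduces precision rather than removing the identifier entirely.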

How did you create a data set large and accurate enough to be useful in training a model?

Some friends run an AI bootcamp and helped me find the initial set of users to help me with labelling. The initial labelled data was generated mostly through them, both by manual labelling and with the approach described in https://useguard.com/experiment. Also, the model I'm using relies heavily on transfer learning and achieves very reasonable results with few labelled items (the paper describing the technique actually maintains that with only 100 labelled examples they reached results comparable to those of older approaches trained on 10x that data).

What paper is this? Is it the UDA paper?

Important work, and this seems to be doing a decent job already. Cheers. One thing about the teaching: some sentences don’t have anything to do with privacy, so there might be a button to train the AI on that.

Looks cool. It's a small point, but not pluralising words when the number is 1 shows attention to detail (e.g. 1 scandals for instagram).

Thanks for the heads up!

> It understands that "we don't share your data with third parties" is privacy friendly while "we may share your data with anyone" is a potential threat.

Does it understand "we don't share your data with just any old third parties", or "we're not like our competitors who may share your data with anyone"?

Not OP, but curious, are those quotes actually from privacy policies or just hypotheticals?

With enough sites / privacy policies out there, even hypotheticals will end up in a real policy at some point (if not already). Is there something significant in the distinction or is it just curiosity? If the latter, I also share it. :)

How do we know it isn't just people doing the analysis if we can't actually use the AI ourselves?

I'm most probably publishing a paper later this year detailing the process. Also, the 80 pages of my thesis would have loved it if this hadn't involved AI, to make the whole thing simpler :)

Very nice! The fact you mentioned Tinder's T+C on Twitter got my attention.

Does your work integrate pretrained LMs like BERT or GPT2?

Yes, but not Transformer-based like these two, rather LSTM-based like ULMFiT

Interesting! Did you have to make any substantive changes to the ULMFiT approach to make it work for this problem? Did you use the fastai implementation, or write something from scratch?

(Disclaimer: I'm a co-author on the ULMFiT paper.)

off topic, but i saw in one of your fast.ai videos, you were running an autohotkey script. Made my day !

Great work - would love to read your thesis if it’s available online?

>>”along with other interesting stuff like recent data breaches”

Do you have an integration with HaveIBeenPwned?

> Happy to answer Qs about the tech behind [...]

What tech is involved to get something like this on the web?

Hey... can you reach out to me... I'm kevin ... at datastreamer.io (trying to hide that from spam but I think you can avoid that).

I'm working on something similar at the moment for a client.

Right now I'm just starting out but I've built a privacy policy classifier that is an RNN classifier based on TensorFlow that is just an 'is this a privacy policy' classifier.

I have about 650MB of privacy policies at the moment which I fetched via a crawler. I'm just about to classify the rest of them.

I'm trying to automate the whole thing so that we have a full workflow.

Anyway... ping me to discuss.

Hi, I like what you did. I am a trained German lawyer working with NLP at a computer science faculty. I would like to talk about this with you, as I am very interested in this topic. Also, I've been working for the German Data Protection Agency...

The "game" (training) that asks you to analyse "privacy threats" is a bit strange. It feels like it takes two random excerpts from a privacy policy and asks you to compare them, but with this, it feels like it is missing some global information, you are just looking at local details.

E.g. one policy might be disclosing what they do (but it's actually relevant to collect data, e.g. for a password manager) while the other just says "no we don't collect anything". In this case one feels like it's a better option, but it's not exactly the same situation; it's missing some context. I feel like this could potentially bias ratings.

I'm not sure if you could add in extra information with some of that global information, e.g. the type of service, classifying different "parts" of the privacy policy etc.

Yeah, it's not comparing like for like. Feels like the system is trying to collect training data from users.

To elaborate on your last sentence, context is critical in assessing whether a clause is pro- or anti- privacy. Is the collection of information critical to the provision of the service? What is collected, and how much? And so on.

Agreed with parent and gp. I gave up with the 10 questions, as the sentence comparisons were almost comically incomparable. I fear your model is going to be a random number generator.

I’m trying to help teach the AI, but some options don’t have to do with privacy at all.

For instance, I got (option A):

  The Games Press Web site can, optionally, store a Cookie on your computer in order to automatically log you into the site on each visit.
and option B:

  Back to Top ^ NO THIRD-PARTY BENEFICIARIES There shall be no third-party beneficiaries to this Agreement.
It would be great to have a skip/flag option for cases like these.

Edit to add:

Some other notes:

  - Mozilla, a tech company I consider ethical, is right down there with Netflix, LinkedIn and Waze
  - The box under “Sentence Breakdown by Risk Level” is empty when my ad blocker is enabled (Adguard on iOS Safari)
  - Telegram, a company I also consider ethical, has a score of 105%—is this an oversight?

And sometimes you get two privacy friendly policies and I'd like to say "equal".

After a few of those I picked one at random and then it puts me onto some kind of 'Game' which made me feel they were trying to train me instead of vice-versa. The game didn't respond to clicks so I closed the app.

The fact that you consider companies ethical is neither here nor there for the privacy score of their policy though. You haven't made a very strong case as to why your opinion needs to impact the metrics (or really given any justification for your feelings at all).

As to your two options having nothing to do with privacy: The fact that there cannot be any third-party beneficiaries is in fact a baldly privacy-friendly statement, because a third party is yet another party that may influence a privacy policy in a net-negative way.

I went to the homepage and ctrl-clicked a company card (Mozilla) to open the details in a new tab. Instead, the site hijacked my ctrl-click and tried to navigate to the link in the same tab. Middle-clicking does not work at all.

It then went to an error page instead of loading the details for Mozilla, but while it's an interesting idea I'm not sure how useful it is. I don't usually create an account on a website unless I have to do so, and the privacy policy is nonnegotiable. So why would I want to check what their privacy policy is?

I only give personal data to websites when I have to (e.g. to services that work or school use) or if I already trust the company to not do anything shady with it (Mozilla has done some sketchy stuff but I believe they won't leak my passwords).

And for websites where participation is more optional, like HN or Reddit, you don't usually need to give much personal data anyway.

Edit: the website is fully working now. Mozilla has had one security breach where emails and hashed passwords were leaked, in 2014. At the bottom the sentence breakdown is 2.5/12/22% concerning/mild/friendly. Meanwhile Reddit has no breaches, but keeps your messages forever and shares data with ad companies. Their sentence breakdown at the bottom is 3.3/23/9%. Overall, the AI rates Mozilla at 33% and Reddit at 41%. That doesn't really make sense to me.

I would really like to see more details about the privacy policy sentences on the website. If 2.5% of Mozilla's privacy policy is very concerning and 12% is mildly bad, I would like to see the actual sentences to know the risks. There is a button to view the full annotated policy, but clicking it says to send an email to you. Edit: this seems like a bug, it shows a few sentences in a WebKit-based browser [Falkon] but in Firefox it just shows the chain link icons.

Finally, I took the A/B test from Guard, and quite a few A/B choices seemed to not really have anything to do with privacy. If the dataset is kept the same, then I think a different test format would be to rate each A and B snippet as:

- Not about privacy

- Good for privacy # only if the other one is not about privacy

- Bad for privacy # only if the other one is not about privacy

- Better than [A/B] # only if both are about privacy

Anyway, the data itself may be somewhat useful to me if I want to learn more about a company's privacy practices. But for normal people, I think it would be helpful for the website to also explain why privacy is important and why people should care.

More and more websites are so advanced that they can’t even use an <a> tag anymore. Instead they do some convoluted onclick-scripting that breaks all standard behavior and accessibility functionality.

I think complicated is a better choice than advanced. Intentionally complicated without adding value in a lot of cases.

I'm sure "advanced" was sarcasm.

Should have italicized advanced (or put it in quotes)

I think this is mostly a problem of the tools used. An anchor tag by definition is an inline element, so it shouldn't really be a giant box that's clickable, so you default back to an onClick.

onClick in popular frameworks just means left click and nothing more, which makes sense, except for that use case of opening in a new tab with middle mouse button or right clicking etc. So you have to add a lot of logic to support all that.

If you break the HTML spec and make an anchor tag a block element, then you have to catch and stop its click event; otherwise it would work as a normal link when you actually just want to change state in your JS app.

So I think tools like Angular, React, Vue etc. should get a better way to create links on websites that just change state.

Styling an anchor as a block element has never violated the spec, and HTML5 deliberately added support for wrapping anchors around block elements because people had been doing that anyway, even though the browser wasn't required to make it work as intended (but it now is).

> So you have to add a lot of logic to support all that.

Or, to point out the thing that is obvious to anyone who grew up on the old web:

Removing a whole lot of logic also works.

> I only give personal data to websites when I have to

This is a great policy that I think more people should use. Not everything needs your real name, real birthday, or your real home address. Definitely not your real phone number. You often do need a real email address.

> you don't usually need to give much personal data anyway

This is where it gets tricky. Anonymized aggregate data can be surprisingly identifying. You only need 33 bits of information to uniquely identify any individual in the world. If your IP tells me you're from San Francisco, then I need just 20 bits of information to uniquely identify you.

Data mining 20 yes/no answers about one of your users is ... pretty easy.
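The arithmetic behind those two numbers can be checked quickly. This is a back-of-the-envelope sketch: the population figures are rough assumptions, and `bits_to_identify` is a name made up for this example.

```python
import math


def bits_to_identify(population: int) -> int:
    """Smallest number of yes/no answers that can tell everyone in a population apart."""
    return math.ceil(math.log2(population))


WORLD_POPULATION = 7_700_000_000  # assumption: rough world population
SAN_FRANCISCO = 880_000           # assumption: rough city population

print(bits_to_identify(WORLD_POPULATION))
print(bits_to_identify(SAN_FRANCISCO))
```

With these figures the results are 33 and 20, so knowing the city alone is already worth about 13 of the 33 bits.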

Hi. A related project that takes a more human-powered approach is PrivacySpy (https://privacyspy.org). Would be neat to see how these tools intersect.

PrivacySpy is open source, community run, and more about grading policies on a standardized rubric (as opposed to entrusting that to ML), so these tools might complement one another.

(Full disclosure: I'm a contributor to PrivacySpy.)

Another one: https://tosdr.org/

ToS;DR is great, although it's more focused on terms of service so if you're looking for privacy-only info, you'll have to cut through a bit of noise.

On the off chance the various responders in this sub-thread see this comment, for a while I've hoped someone would advocate for privacy/TOS policies to follow a similar model to OSS licensing.

Normal people could conceivably read and understand a given policy if the knowledge scaled.

Any substantial adoption would help focus effort/resources on services that deviate from the terms.

Sites that are already repositories of this knowledge could play some part in codifying best-practices, advocating for adoption, and tracking progress.

We have something similar as well over at https://privacy.commonsense.org/.

We have taken different ideas from many different implementations and applied them more specifically to ed tech products.

Analyse LastPass. Even without ML it's clearly saying they spy on everything you do and share it with anyone they want. I'm surprised they get recommended so often given their privacy policy.


Hmm, I was trying to go through your A/B options but it didn't seem to register any click. So I started clicking repeatedly. Then, it processed those clicks on the first 7 items, giving you bad data. FYI.

This just hit the frontpage so the server might be a bit overloaded, I'm sorry. Trying to resize resources right now. Thanks for the heads up, in the long term (ideally) noise shouldn't be a huge problem in a sufficiently large dataset (or at least I'm already expecting some noise haha)

Hi there. This is really neat. It reminds me of a talk I listened to recently on digital privacy, where the guy was using the price of "privacy products" as a way to measure how much people value their privacy. This seems like it would be one of those.

Did you find that there's one single variable, like length or the presence of certain words, that the system relies on heavily?

This is interesting, but I wonder what a boilerplate privacy policy would score. Given the clustering of scores between 30 & 50% for what read like "our lawyers pulled the standard policy and billed us for 4 hours of minor tweaks", it seems like some of the most effective privacy advocacy would come by challenging the most common "dangerous sentences" in court.

This is a good tool, but the execution of the website is disappointing.

1. Only showing excerpts of the highest threat levels. Trying to view the less severe threats asks us to email in. If you're willing to volunteer the information, why the hoops?

2. "Play a short game to continue using this tool" ensures I'm not going to share this with anyone. Putting a stranglehold on users is _never_ the way forward. I might have volunteered my time if I were at home and browsing through. I can't when I'm quickly flicking through while taking a five-minute break from work. But it's left me with a final negative impression before being unceremoniously blocked off.

Holy Crap, Tinder shares your profile with potential employers (or rather companies contracted by those employers).

So glad I gave up on online dating a long time ago.

But really this is amazing.

The next step would be to have a lawyer write a small opinion piece on the most popular sites.

Cool tech and good use of NLP, but isn't the privacy policy system entirely broken? It's like driving on unmarked, unpaved roads. Why don't we have a global template that lists, at the beginning, a checklist that is quickly human-understandable, like - Do we share your data: (Y/n)


Without something like that, the problem is that companies could change the wording and the neural net would not detect it until trained again, which is potentially more dangerous!

It's time we get some standard for user web privacy policy docs, like GDPR.

Hello, if the webmaster for this site is reading this, your `change.org` file is getting a Content-Type of `application/octet-stream` instead of `text/html`, which is giving me (in firefox) a prompt to download a file instead of displaying the page.

The problem is probably with your type mappings: https://nginx.org/en/docs/http/ngx_http_core_module.html#typ...

Tinder and Mozilla have the same score of 33%. I don't agree with that. Tinder willingly shares very, very personal data along with your contacts (presumably without said contacts' consent).

Keep in mind that it's analyzing their privacy policy, not their actions. Who a company specifically chooses to share the data with and how often they do it is likely not considered.



Made a similar project many moons ago and is still kicking along. Thanks all on HN for the feedback.

I used it as recently as a month ago! Thanks.

You're doing incredibly noble work. I've bookmarked the site and will make great use of it. Thank you for working hard to protect data privacy, I hope you never stop.

Thank you for these words :)

To the creator:

I actually just went through the trouble of resetting my Product Hunt account (I haven’t been on there in a long time) just to give you that upvote!

Thank you for this! Cheers.

This is great, I've been wanting to do a project like this for a while.

Would be great to get some insight into your data collection/labeling and model design process.

Some of the process on gathering the data to create the labelling dataset is described here: https://useguard.com/experiment

But most probably I'll be publishing a paper later this year detailing all the details and process :)

Very cool idea! One question I couldn’t seem to find the answer to on your site: are the policies featured in your A/B training exercise distinct from the policies that the AI grades? For example, will a user going through your A/B trainer ever see a snippet from the Instagram privacy policy?

Yes, initially it will all draw from the same dataset, so a user in theory could definitely see all services' snippets. But, to increase statistical significance in the data it gathers, I've restricted the initial amount of items in the test so right now this will not be the case (otherwise, I'd be dealing with circa 3,500,000,000 different pairwise comparisons hehe)
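To get a feel for that number: if every snippet is compared once with every other snippet (an assumption about how the pairs are counted), n snippets give n(n-1)/2 comparisons. The sketch below uses hypothetical helper names and inverts that formula to see how many snippets 3.5 billion pairs would correspond to.

```python
import math


def pair_count(n_items: int) -> int:
    """Number of unordered pairs among n_items, i.e. n * (n - 1) / 2."""
    return math.comb(n_items, 2)


def items_for_pairs(pairs: int) -> int:
    """Largest n such that n * (n - 1) / 2 <= pairs (positive root of the quadratic)."""
    return (1 + math.isqrt(1 + 8 * pairs)) // 2


n = items_for_pairs(3_500_000_000)
print(n, pair_count(n))
```

Under that assumption, roughly 84,000 snippets are enough to produce about 3.5 billion pairs, which shows why restricting the initial item set matters: the number of comparisons grows quadratically with the number of snippets.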

Where can we follow you to know abt the paper when you do eventually publish?

I'm most active on Twitter :) https://twitter.com/rameerez

Great work. Thank you. If I may, don't show badly scored results on the first page. Bad advertising is still advertising, and we all assume they are all bad anyway. I understand the surprise people have at first, but the next step is finding good apps. Also, sort by grade and search by name are a must.

Again, thank you for your time and work.

Edit: telegram had hacking news this year, search for "Telegram voicemail account hijacking"

The need for things like this is partly why I quit my job 6 months ago to start my own company (in profile). We need to start building companies and products that provide valuable communities and services (like social networks) without the need for ads/privacy violations. My belief is that the issue is largely related to incentive misalignment (users != customers).

Congrats on the launch!

Though it's nice you built this, I wish we didn't need it, because governments would be doing their job and protecting the people they were originally intended to serve.

How long can we keep arming ourselves in the battle against powerful and rich entities who steal our data and buy politicians to have direct access to the process of law-making?

Letting people do the training without any knowledge or context seems to make no sense.

How do you ensure that it isn’t a bot, or an army of bots?

What is the neural network doing?

How do you ensure your personal integrity, that of your team members and the overall integrity of your ‘system’?

Where is Facebook? I guess that would be one of the most interesting policies to analyze.

I'm slightly puzzled that Telegram is scoring more than 100%. What am I missing?

A bug in the scoring algorithm that I need to fix

Great website. Quick note - your site doesn't function without JavaScript. Having enabled it and trawled through the site, I see no reason to require JS to be enabled. Adding a non-JS fallback would be great.

A great idea. Some ideas for features:

- Keep a timeline of privacy policy changes, being able to compare scores between 2 versions.

- Subscribe to notifications for changes in score.

- Browser extension that shows you the score on the site.

I will join the chorus of great work. I love it. It may even make some people more privacy conscious ( very few people read those -- usually the ones who wrote it ).

Thank you! :) I've read some recent research and it looks like this has actually been measured: only 0.001% of all internet users start reading them (and an even smaller number of people likely finish reading them). On top of that, if you had to read all the privacy policies you accepted in the past 5 years alone, you would need 3,040 hours of non-stop reading. Crazy. Love your privacy-oriented username btw! ;)

Love the "biggest threat" for Telegram!

I love the work, and it is directly applicable to the work I am doing now. Have you published your thesis?

I own the domain policies.dev and would be happy to hand it over to this project if you’re interested.

Great site! Would be nice to have the option to submit scandals as some seem to be missing.

what if you took this idea and used it to normalize privacy policies into a normalized form?!


Some other user reported that same thing this morning but I couldn't find any explanation to this err. It basically works for everyone except for these two precise cases. One idea I have is that you might be behind some sort of firewall that's blocking my website (because in the past either the IP or the domain got flagged by some antivirus company and now some business networks block it) – might this be the case?

I'm not an expert but I wonder if this error comes up in the case of the HTTPS handshake not being able to agree on a protocol -- one side of the transaction is trying to insist on a crypto protocol that's out of date or a little too fashion-forward?

Just a thought.

When reporting this, it's useful to include your browser and exact version number. Most browsers ship updated SSL cert packs with new versions, and debugging can be hard without this information.

This is really cool, what prompted the idea to combine the two?

Thanks! Last semester I did a bootcamp on Artificial Intelligence and I had to do a final project. Last year I started becoming concerned about digital privacy when I discovered Facebook had an updated copy of your phone contacts including nicknames [1] (which basically means strangers at Facebook know the names I call my GF). I later found out this was explicitly said in FB's privacy policy. So when I did the bootcamp and discovered how powerful RNNs are to model the complexity of the English language I came up with the idea.

[1] https://news.ycombinator.com/item?id=16661735

Telegram has a 105% score? Is that expected or a bug?

Bug, one of the components of the score is not properly normalized I think. I'll fix it as soon as I handle the traffic overload :)
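A minimal sketch of how an unnormalized component can push a weighted score past 100%, and how clamping each component avoids it. How Guard actually computes its score is not described in this thread, so the component names, weights and values below are all invented for illustration.

```python
def overall_score(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of component scores, each clamped to [0, 1], as a percentage."""
    total_weight = sum(weights.values())
    weighted_sum = sum(
        weight * min(max(components[name], 0.0), 1.0)
        for name, weight in weights.items()
    )
    return 100.0 * weighted_sum / total_weight


# Hypothetical weights and component values; "breaches" is out of range on purpose.
weights = {"sentences": 0.5, "breaches": 0.3, "length": 0.2}
components = {"sentences": 0.9, "breaches": 1.3, "length": 1.0}

# Without clamping, the out-of-range component drags the total above 100%.
unclamped = 100.0 * sum(weights[name] * components[name] for name in weights)
print(round(unclamped, 1))
print(round(overall_score(components, weights), 1))
```

Clamping hides the symptom; the cleaner fix is to make sure each component is scaled to the 0-1 range where it is computed.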

It should be pretty close to 100%, though, they seem to be super privacy friendly!

Yep, I read their policy. I just wasn't sure if extra credit existed.

>BIGGEST THREAT: «We never delete your funny cat pictures, we love them too much»

I don't know, seems like a GDPR violation to me ;-).

This’d be great for contracts too. E.g. an NDA.

offtopic: Is it worth using producthunt for games? Or do we concentrate our efforts on platforms like steam, appstore and play store?

Why are you posting that to this thread? I know you said off topic, but usually when someone says that, it's still related: e.g. if the parent comment mentioned something tangential, or I could imagine if the link was to producthunt and you wondered about games on that platform... are you hijacking this thread for a completely unrelated question that you want to ask the HN audience, or is there a connection I'm missing?

They are on producthunt.

Ah I missed that top banner, must have automatically scrolled past it. Fair enough; thanks for responding.

This is amazing! How would you categorize this?

Great work! Would also work as a browser plugin.

This was actually one of the ideas to evolve the project! Another one is making an app that protects your digital privacy from these threats, kinda like an antivirus but for privacy threats instead of viruses (https://useguard.com/blog/future/). Would love to hear feedback on what this project should become next :)

How are you better than Firefox Monitor?

Very creative project! Congratulations!!

Thank you! :)

This is great, but like tosdr.org before it, what does it tell people that they don't already assume to be true? I use tosdr, only to often keep ignoring what it's telling me.

Also, I'm not sure a majority care about privacy when the value delivered is super high. They submit to the will of the service provider, as if it were the cost of doing business, without realising they could either look for alternatives or exercise stricter control over what they share and how [0].

To that end, I like tools that let users take action in addition to showing what's wrong rather than simply point it out. Actions can include:

- Replace: Push the users towards alternatives and help them seamlessly take their data elsewhere.

-- Help change their usage behaviour. Most digital-wellbeing / internet de-centralization tech falls under this category?

-- Translate / Pipe data exported from one service provider and import it into another. For instance, it is tiresome to move away from wordpress to ghost.org; or from WhatsApp to Signal. Emails work great.

- Reduce: Hand-hold them as they grasp various privacy and security settings on offer and exercise them, as appropriate.

-- JumboPrivacy does this for popular social networks.

-- The PrivacySettings plugin for Firefox is another example.

-- Plenty write blog posts to help others navigate arduous settings across popular web properties, and expect nothing in return.

- Restrict: Provide tools that let them control what the services can and cannot collect.

-- Application sandboxes like firejail / sandboxie, firewalls like Snitch / LuLu, DNS based content blockers like pi-hole, in-browser content blockers like uBlockOrigin are some examples.

[0] One of the first questions folks asked after an exodus-privacy (which is super nice and something I use every other month) presentation at fosdem was, 'What can I do now that you've exposed what apps on Android do with the permissions granted to them and the SDKs they embed?' exodus-privacy, as great as it is, doesn't let you take action but presents a nice overview of the dangers to your privacy due to the app you've installed. Instead, you might end up having to independently discover and install Blokada or AdAway or Pi-Hole or NetGuard or XPrivacyLua or microG or GrapheneOS or...

Very cool! Reminds me of https://tldrlegal.com/ and https://fossa.com/.
