As the basis for my thesis on AI and NLP, I've been working on an RNN-based text classifier that reads and analyzes privacy policies. It learns that "we don't share your data with third parties" is privacy-friendly, while "we may share your data with anyone" is a potential threat.
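For anyone curious what "RNN-based text classifier" means concretely, here is a minimal sketch in plain NumPy. This is not the author's actual model: the vocabulary size, dimensions, random weights, and the three risk labels (friendly / mild / concerning, as shown on the site) are illustrative assumptions.

```python
import numpy as np

# Tiny vanilla RNN that reads a sentence token by token and emits
# logits over three risk labels (friendly / mild / concerning).
# All sizes and the untrained random weights are illustrative.
rng = np.random.default_rng(0)
vocab, embed_dim, hidden_dim, num_classes = 50, 8, 16, 3

E  = rng.normal(size=(vocab, embed_dim))         # embedding table
Wx = rng.normal(size=(embed_dim, hidden_dim))    # input -> hidden
Wh = rng.normal(size=(hidden_dim, hidden_dim))   # hidden -> hidden
Wo = rng.normal(size=(hidden_dim, num_classes))  # hidden -> logits

def classify(token_ids):
    """Run the recurrence over a tokenised sentence, return raw logits."""
    h = np.zeros(hidden_dim)
    for t in token_ids:                  # one token at a time
        h = np.tanh(E[t] @ Wx + h @ Wh)  # recurrent state update
    return h @ Wo                        # shape: (num_classes,)

logits = classify([3, 14, 15, 9, 2])     # a 5-token "sentence"
print(logits.shape)                      # (3,)
```

A real model would of course be trained on labelled clauses (and in practice an LSTM/GRU would replace the vanilla cell), but the read-then-classify shape is the same.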
I then created this website with a bunch of analyzed services to showcase the most relevant info about each one, along with other interesting details like recent data breaches or instructions for deleting your account on that service.
Happy to answer questions about the tech behind it; it'd also be great to hear your feedback on what the site lacks and possible improvements!
I think it's better (for me as a user) if you don't boil things down to a score, as different people expect different things when it comes to privacy. It would help if you could simply highlight the potentially problematic clauses in different privacy statements, along with some reason why each might be problematic.
In other words, giving a single score plus a two-sentence highlight is probably about the right amount of information.
Having scores that weight X more than Y would give me more accurate scores, while other weightings would do the same for other people.
Does it understand "we don't share your data with just any old third parties", or "we're not like our competitors who may share your data with anyone"?
(Disclaimer: I'm a co-author on the ULMFiT paper.)
Do you have an integration with HaveIBeenPwned?
What tech is involved to get something like this on the web?
I'm working on something similar at the moment for a client.
I have about 650MB of privacy policies at the moment which I fetched via a crawler. I'm just about to classify the rest of them.
I'm trying to automate the whole thing so that we have a full workflow.
Anyway... ping me to discuss.
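A sketch of what one small step of such a crawler might look like, using only the standard library. The keyword heuristic for spotting privacy-policy links is my assumption for illustration, not this commenter's actual pipeline.

```python
from html.parser import HTMLParser

class PolicyLinkFinder(HTMLParser):
    """Collect hrefs on a fetched page that look like privacy policies."""
    KEYWORDS = ("privacy", "datenschutz", "policy")  # illustrative heuristic

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if any(k in href.lower() for k in self.KEYWORDS):
                self.links.append(href)

def find_policy_links(html):
    finder = PolicyLinkFinder()
    finder.feed(html)
    return finder.links

page = '<a href="/about">About</a> <a href="/legal/privacy-policy">Privacy</a>'
print(find_policy_links(page))  # ['/legal/privacy-policy']
```

The full workflow would then fetch each discovered link, store the page text, and hand it off to classification.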
E.g., one policy might disclose what the service does (where collecting data is actually relevant, e.g. for a password manager), while another just says "no, we don't collect anything". In this case the latter feels like the better option, but it's not the same situation; some context is missing. I feel like this could potentially bias ratings.
To elaborate on your last sentence, context is critical in assessing whether a clause is pro- or anti-privacy. Is the collection of information critical to the provision of the service? What is collected, and how much? And so on.
For instance, I got (option A):
The Games Press Web site can, optionally, store a Cookie on your computer in order to automatically log you into the site on each visit.
NO THIRD-PARTY BENEFICIARIES There shall be no third-party beneficiaries to this Agreement.
Edit to add:
Some other notes:
- Mozilla, a tech company I consider ethical, is right down there with Netflix, LinkedIn and Waze
- The box under “Sentence Breakdown by Risk Level” is empty when my ad blocker is enabled (Adguard on iOS Safari)
- Telegram, a company I also consider ethical, has a score of 105%—is this an oversight?
After a few of those I picked one at random, and then it put me into some kind of 'game', which made me feel they were trying to train me instead of vice versa. The game didn't respond to clicks, so I closed the app.
I only give personal data to websites when I have to (e.g. to services that my work or school uses) or if I already trust the company not to do anything shady with it (Mozilla has done some sketchy stuff, but I believe they won't leak my passwords).
And for websites where participation is more optional, like HN or Reddit, you don't usually need to give much personal data anyway.
Edit: the website is fully working now. Mozilla has had one security breach where emails and hashed passwords were leaked, in 2014. At the bottom the sentence breakdown is 2.5/12/22% concerning/mild/friendly. Meanwhile Reddit has no breaches, but keeps your messages forever and shares data with ad companies. Their sentence breakdown at the bottom is 3.3/23/9%. Overall, the AI rates Mozilla at 33% and Reddit at 41%. That doesn't really make sense to me.
Finally, I took the A/B test from Guard, and quite a few of the A/B choices seemed to have nothing to do with privacy. If the dataset is kept the same, then I think a different test format would be to rate each A and B snippet as:
- Not about privacy
- Good for privacy # only if the other one is not about privacy
- Bad for privacy # only if the other one is not about privacy
- Better than [A/B] # only if both are about privacy
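The constraints in that list can be written down as a tiny validity check on a labelled pair; a sketch (the label names are mine, purely illustrative):

```python
# Labels from the proposed rating scheme (names are illustrative).
NOT_PRIVACY = "not_about_privacy"
GOOD = "good_for_privacy"
BAD = "bad_for_privacy"
BETTER = "better_than_other"

def valid_pair(label_a, label_b):
    """True iff the two labels respect the scheme's constraints."""
    for this, other in ((label_a, label_b), (label_b, label_a)):
        # good/bad allowed only when the other snippet is off-topic
        if this in (GOOD, BAD) and other != NOT_PRIVACY:
            return False
        # 'better than' allowed only when both snippets are on-topic
        if this == BETTER and other == NOT_PRIVACY:
            return False
    return True

print(valid_pair(GOOD, NOT_PRIVACY))  # True
print(valid_pair(GOOD, BAD))          # False
print(valid_pair(BETTER, BETTER))     # True
```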
Anyway, the data itself may be somewhat useful to me if I want to learn more about a company's privacy practices. But for normal people, I think it would be helpful for the website to also explain why privacy is important and why people should care.
onClick in popular frameworks just means left click and nothing more, which makes sense, except for use cases like opening a link in a new tab via middle-click or right-click. So you have to add a lot of logic to support all that.
If you break the HTML spec and make an anchor tag a block element, then you have to deal with catching and stopping the event from it; otherwise it works as a normal link, when you actually just want to change state in your JS app.
So I think tools like Angular, React, Vue etc. should get a better way to create links on websites that just change state.
Or, to point out the thing that is obvious to anyone who grew up on the old web:
Removing a whole lot of logic also works.
This is a great policy that I think more people should use. Not everything needs your real name, real birthday, or your real home address. Definitely not your real phone number. You often do need a real email address.
> you don't usually need to give much personal data anyway
This is where it gets tricky. Anonymized aggregate data can be surprisingly identifying. You only need 33 bits of information to uniquely identify any individual in the world. If your IP tells me you're from San Francisco, then I need just 20 bits of information to uniquely identify you.
Data mining 20 yes/no answers about one of your users is ... pretty easy.
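The arithmetic behind those bit counts is just a base-2 logarithm of the population you're trying to distinguish within (the population figures below are rough ballpark assumptions):

```python
import math

world_pop = 8_000_000_000  # rough world population
sf_pop = 880_000           # rough San Francisco population

# Each bit of information halves the candidate pool, so log2(N)
# bits suffice to single out one person among N.
print(math.log2(world_pop))  # ~32.9 bits for anyone on Earth
print(math.log2(sf_pop))     # ~19.7 bits once narrowed to SF
```

Hence "33 bits" globally, and roughly 20 bits once an IP address has already narrowed things down to one city.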
PrivacySpy is open source, community run, and more about grading policies on a standardized rubric (as opposed to entrusting that to ML), so these tools might complement one another.
(Full disclosure: I'm a contributor to PrivacySpy.)
Normal people could conceivably read and understand a given policy if the knowledge scaled.
Any substantial adoption would help focus effort/resources on services that deviate from the terms.
Sites that are already repositories of this knowledge could play some part in codifying best-practices, advocating for adoption, and tracking progress.
We have taken different ideas from many different implementations and applied them more specifically to ed tech products.
Did you find that there's one single variable, like length or the presence of certain words, that the system relies on heavily?
1. Only showing excerpts of the highest threat levels. Trying to view the less severe threats asks us to email in. If you're willing to volunteer the information, why the hoops?
2. "Play a short game to continue using this tool" ensures I'm not going to share this with anyone. Putting a stranglehold on users is _never_ the way forward. I might have volunteered my time if I were at home browsing through; I can't when I'm quickly flicking through during a five-minute break from work. It left me with a final negative impression before being unceremoniously blocked off.
So glad I gave up on online dating a long time ago.
But really this is amazing.
The next step would be to have a lawyer write a small opinion piece on the most popular sites.
Without something like that, the problem is that companies could change their wording, and the neural net wouldn't detect it until retrained, which is potentially more dangerous!
The problem is probably with your type mappings: https://nginx.org/en/docs/http/ngx_http_core_module.html#typ...
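In case it helps, a minimal sketch of what the relevant nginx config usually looks like (paths are illustrative; `mime.types` ships with nginx and maps extensions like `.css` and `.js` to the right Content-Type):

```nginx
http {
    include       mime.types;                # extension -> MIME type map
    default_type  application/octet-stream;  # fallback for unmapped types

    server {
        listen 80;
        root   /var/www/example;             # illustrative path
    }
}
```

If the `include mime.types;` line is missing, everything falls back to `default_type` and browsers refuse to apply stylesheets or run scripts served as the wrong type.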
Made a similar project many moons ago, and it's still kicking along. Thanks to all on HN for the feedback.
I actually just went through the trouble of resetting my Product Hunt account (I haven’t been on there in a long time) just to give you that upvote!
Thank you for this! Cheers.
Would be great to get some insight into your data collection/labeling and model design process.
But most probably I'll be publishing a paper later this year detailing the whole process :)
Again, thank you for your time and work.
Edit: Telegram had hacking news this year; search for "Telegram voicemail account hijacking"
Congrats on the launch!
How long can we keep arming ourselves in the battle against powerful and rich entities who steal our data and buy politicians to get direct access to the law-making process?
How do you ensure that it isn't a bot, or an army of bots?
What is the neural network doing?
How do you ensure your personal integrity, that of your team members, and the overall integrity of your 'system'?
- Subscribe to notifications for changes in score.
- Browser extension that shows you the score on the site.
Just a thought.
I don't know, seems like a GDPR violation to me ;-).
Also, I'm not sure the majority care about privacy when the value delivered is super high. They submit to the will of the service provider, as if it were the cost of doing business, without realising they could either look for alternatives or exercise stricter control over what they share and how.
To that end, I like tools that let users take action in addition to showing what's wrong, rather than simply pointing it out. Actions can include:
- Replace: Push the users towards alternatives and help them seamlessly take their data elsewhere.
-- Help change their usage behaviour. Most digital-wellbeing / internet de-centralization tech fall under this category?
-- Translate / pipe data exported from one service provider and import it into another. For instance, it is tiresome to move away from WordPress to ghost.org, or from WhatsApp to Signal. Email works great here.
- Reduce: Hand-hold them as they grasp various privacy and security settings on offer and exercise them, as appropriate.
-- JumboPrivacy does this for popular social networks.
-- The PrivacySettings plugin for Firefox is another example.
-- Plenty write blog posts to help others navigate arduous settings across popular web properties, and expect nothing in return.
- Restrict: Provide tools that let them control what the services can and cannot collect.
-- Application sandboxes like firejail / sandboxie, firewalls like Snitch / LuLu, DNS based content blockers like pi-hole, in-browser content blockers like uBlockOrigin are some examples.
One of the first questions folks asked after an exodus-privacy presentation at FOSDEM (a tool which is super nice and something I use every other month) was: 'What can I do now that you've exposed what apps on Android do with the permissions granted to them and the SDKs they embed?' exodus-privacy, as great as it is, doesn't let you take action; it presents a nice overview of the dangers to your privacy from the apps you've installed. Instead, you might end up having to independently discover and install Blokada or AdAway or Pi-hole or NetGuard or XPrivacyLua or microG or GrapheneOS or...