Hacker News
Personal and social information of 1.2B people discovered in data leak (dataviper.io)
1439 points by bencollier49 13 days ago | 419 comments





I was at an Elasticsearch meetup yesterday where we had a good laugh about several similar scandals in Germany recently involving completely unprotected Elasticsearch running on a public IP address without a firewall (e.g. https://www.golem.de/news/elasticsearch-datenleak-bei-conrad..., in German). This beats any of that.

Out of the box it does not even bind to a public internet address. Somebody configured this to 'fix' that and then went on to make sure the thing was reachable from the public internet on a non-standard port that on most OSes would require you to disable the firewall or open a port. The ES manual section for network settings is pretty clear about this, with a nice warning at the top: "Never expose an unprotected node to the public internet."
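
To make that concrete, here is a minimal Python sketch of the kind of pre-deploy sanity check that would flag such a setup. The function name `classify_bind_address` is made up for this example and is not part of Elasticsearch. It only classifies an IP literal of the sort you might put in `network.host`; it knows nothing about firewalls or routing, so treat it as an illustration rather than a real audit.

```python
import ipaddress


def classify_bind_address(host: str) -> str:
    """Classify an IP literal as 'wildcard', 'loopback', 'private', or 'public'."""
    addr = ipaddress.ip_address(host)
    # Check the wildcard first: 0.0.0.0 and :: mean "every interface",
    # which includes any public interface the machine happens to have.
    if addr.is_unspecified:
        return "wildcard"
    if addr.is_loopback:
        return "loopback"
    if addr.is_private or addr.is_link_local:
        return "private"
    return "public"


if __name__ == "__main__":
    for candidate in ["127.0.0.1", "0.0.0.0", "10.0.0.5", "8.8.8.8"]:
        print(candidate, "->", classify_bind_address(candidate))
```

Anything that comes back as "wildcard" or "public" is the case the warning in the manual is about.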

Giving read access is one thing. I bet this thing also happily processes curl -X DELETE "http://<ip>:9200/*" (deletes all indices). Does it count as a data breach when somebody of the general public cleans up your mess like that?

In any case, Elasticsearch is a bit of a victim of its own success here and may need to act to protect users against their own stupidity, since clearly masses of people who arguably should not be making technical decisions now find it easy enough to fire up an Elasticsearch server and put some data in it (given the number of companies that seem to be getting caught with their pants down).

It's indeed really easy to set up. But setting it up properly still requires RTFMing, dismissing the warning above, and having some clue about what ip addresses and ports are and why having a database with full read write access on a public ip & port is a spectacularly bad idea.


I've been using ES off and on since before 1.0 came out. It has always baffled me that ES doesn't require a username and password by default.

ES is a database that has to exist on a network to be usable. Heck, it expects that you have multiple nodes, and will complain if you don't. So one of the first things you do is expose it to the network so you can use it.

Yes, it takes some serious incompetence to not realize you need to secure your network, but why in the world would you not add basic authentication into ES from the start? I'd never design a tool like a database without including authentication.

I am serious about my question. Could anyone clue me in?


It has to exist on a private network behind a firewall with ports open to application servers and other es nodes only. Running things on a public ip address is a choice that should not be taken lightly. Clustering over the public internet is not a thing with Elasticsearch (or similar products).

If you are running mysql or postgres on a public ip address, it would be equally stupid and irresponsible, regardless of the useless default password that many people never change, unless you also set up TLS properly (which would require knowing what you are doing with e.g. certificates). The security in those products is simply not designed for being exposed on a public ip address over a non-TLS connection. Pretending otherwise would be a mistake. Having basic authentication in Elasticsearch would be the pointless equivalent. Base64-encoded plaintext passwords (i.e. basic authentication over http) are not a form of security worth bothering with. Which is why they never did this. It would be a false sense of security.
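
To illustrate the Base64 point, here is a small Python sketch showing that anyone who can read an HTTP Basic `Authorization` header on the wire can recover the password with a single standard-library call. The helper name `decode_basic_auth` and the credentials `alice` / `s3cret` are invented for this example.

```python
import base64
from typing import Tuple


def decode_basic_auth(header_value: str) -> Tuple[str, str]:
    """Recover (username, password) from a 'Basic ...' Authorization header value."""
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic Authorization header")
    # Base64 is an encoding, not encryption: decoding needs no key.
    decoded = base64.b64decode(encoded).decode("utf-8")
    # The username cannot contain ':', so split on the first colon only.
    username, _, password = decoded.partition(":")
    return username, password


if __name__ == "__main__":
    # What a client would send for the made-up credentials alice / s3cret.
    header = "Basic " + base64.b64encode(b"alice:s3cret").decode("ascii")
    print(decode_basic_auth(header))  # prints ('alice', 's3cret')
```

Over TLS the header is encrypted in transit like everything else; over plain http it is effectively cleartext.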

At some point you just have to call out people for being utter morons. The blame is on them, 100%. The only deficiency here is with their poor decision making. Going "meh http, public IP, no password, what could possibly go wrong?! let's just upload the entirety of LinkedIn to that." That level of incompetence, negligence, and indifference is inexcusable. I bet MS/LinkedIn is considering legal action against individuals and companies involved. IMHO they'd be well within their rights to sue these people into bankruptcy.


Software should be secure by default. Don't blame the user.

MySQL in comparison won't even let you install without setting a root password. And it only listens on localhost/unix-socket by default. Then you need to explicitly add another user if you want to allow it to login from a non-local ip. I don't think it's even possible to both set a blank root password and allow it to login from a public IP.

So you really think the solution is to blame some low level worker, and sue him/her? The blame should always be on the people in charge, usually the CEO, who set the bar for engineering practices, proper training, etc, or the lack thereof.


While I don't think blaming labor is constructive or ethical, it seems like most tools pose danger to users in proportion to utility. For example, cars can squish people, electricity can fry people, and power tools can remove limbs.

Typically, people start out using knives and bicycles as children, learn through experience that crashing and getting cut hurt, and carry those lessons forward when they start using tablesaws and cars later in life. How does this apply to elasticsearch? I have no idea.


We could teach our children that software is very dangerous, especially databases. Or we could make software secure by default. But we also need to teach the user how to use the software properly. Learning by getting hurt is effective, but then we also need to have playgrounds.

That MySQL stuff is all quite recent... up until 5.7 (?, one of the most recent releases, anyway) there's no root password by default and running `mysql_secure_installation` is a common (but not mandatory) step to, well, secure the installation and set a root password. I think MariaDB still works this way? Not sure.

I'm not aware of "bind to localhost" being the default, either. The skip-networking setting to only allow local socket connections is definitely not the default, and I'm pretty sure the default is still to bind to all interfaces.


I installed MySQL a couple of months ago on an Ubuntu server, and got asked to set a root password. I've also installed MySQL many times on Windows. Secure install is the default. And it doesn't annoy me a bit. I like my software to be secure by default.

This is ridiculous.

Software should be built in the best method of delivering maximum value to its users. A trade-off for usability can be made for certain cases like ease-of-use for new software. Redis was part of this a while ago http://antirez.com/news/96.

Engineers should know their tools before using them. It's a huge part of our jobs. You could introduce a ton of other vulnerabilities in software: XSS, SQL injections, insecure cryptography. Security is part of our job and something we must know.

You don't blame a plane for a pilot mistake that was meant to be part of his training. Engineers in every other sector are responsible for their mistakes, we should be too.

Also, you don't sue the worker, you sue the company.


"Software should be built in the best method of delivering maximum value to its users."

Yes, and defaulting to insecure, thus repeatedly causing huge data breaches, is the exact opposite of delivering maximum value to users. It's delivering maximum liability.


I would argue that the single command to begin using the application and the ease of onboarding / querying data was a huge factor in expanding its usage. Elastic optimized for initial spin-up and getting things running fast. It works really well! Until you load it full of data on a public IP, that is.

That single command to spin up the application can easily generate and show a copyable random secret required to use it, so that you can use it easily but there's no option to use it that insecurely.
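
As a rough sketch of what that could look like (in Python, and not how Elasticsearch or any other product actually does it), a first-run step can mint a random secret with the standard `secrets` module and compare later requests against it in constant time. The function names `generate_bootstrap_secret` and `secrets_match` are made up for this example.

```python
import secrets


def generate_bootstrap_secret(num_bytes: int = 32) -> str:
    """Create a URL-safe random secret to print once on first start."""
    return secrets.token_urlsafe(num_bytes)


def secrets_match(supplied: str, expected: str) -> bool:
    """Compare two secrets without leaking where they differ via timing."""
    return secrets.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))


if __name__ == "__main__":
    secret = generate_bootstrap_secret()
    print("Copy this secret, it is required for every request:")
    print(secret)
```

The user still gets a one-command start; the only extra friction is copying one string.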

Onions. You need layers and defense in depth. Because even the best humans make mistakes and it is inhuman to assume perfectionism. Never rely on just one engineering feature.

> You don't blame a plane for a pilot mistake that was meant to be part of his training

Did you miss that Boeing is right now risking bankruptcy for doing exactly this?


Honestly a lot of the problem is: people aren’t studying systems engineering OR security. Look at all the “learn to code in 21 days” BS and all the code academies.

There’s so much emphasis on abstracting away the systems with cloud-this and elastic-that and developers don’t know much about general systems engineering.

My recommendation to software developers: take the Network+ and Security+ exams at the bare minimum.

Honestly as much as people complain about process getting in the way of things, there should be checks and balances at any business that deals with personal information. Financial institutions are heavily regulated—these fkers should be held accountable.


> "Engineers"

Maybe the hint is right there in your comment. Nearly all the people deploying these nodes aren't engineers in the slightest despite someone having given them such a title.


It's not always engineers that use them.

Sometimes software managers have the sudden need to show statistics and other things.

Yeah, that was fun...


If security is so important, why should we accept database developers who don't understand that?

Because... they dance the devops dance with their devop hats on! Security problems can be swiftly danced around until they actually surface, and can then be handled in the next round of "continuous delivery". It's also smart to postpone solving most issues until after they occur, so sales can continue bragging about "continuous improvement".

So, after some thought, here's why I don't consider it pointless to have basic auth built in.

It would keep ES from being completely open. If you wanted to get in, you'd have to compromise some part of the network that would let you read the username and password.

The way it is now, anyone can do a scan for port 9200 and get full access right away.
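
If you want to check your own box from the outside, a minimal Python sketch is below. The helper name `port_is_reachable` is made up for this example; it only tells you whether a TCP connection can be opened, which is the first thing a scanner learns. Run it from a machine outside your network against hosts you own.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unroutable, etc.: treat all as "not reachable".
        return False


if __name__ == "__main__":
    # 127.0.0.1 is only a placeholder; substitute your server's public address.
    print(port_is_reachable("127.0.0.1", 9200))
```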

It is also important to have a username and password, even on secured networks. My test instance is on an internal network, and protected by both network and host firewalls, but I still make sure to secure it beyond that.

Basic auth would not provide a false sense of security. It is simply a very basic part of overall security. Not having it is a mistake.


> At some point you just have to call out people for being utter morons. The blame is on them, 100%. [...]

Your attitude is a symptom of a broader issue that plagues this industry: Indifference to risk*probability. If you don't ship software with "secure defaults" (depending on the threat/attack model), you essentially are handing out loaded shotguns, then blaming the "dumb" user when they inevitably point it at their foot and click the trigger. Easy solution: Don't hand out the gun loaded -- make the user do specific actions that enable the usage. Yeah, it creates some friction to first time deployment, but that's a secondary concern to having your freaking DB leaking all over the place.


But ES doesn't hand over a loaded gun. Someone went out of their way to load the gun up.

Bullshit.

If firing up a piece of software creates an unauthenticated, unprotected (non-TLS) endpoint to read-write data, that's a loaded gun. That is PRECISELY the default behavior of ES.

ES has jacked around for years by making TLS and other standard security features premium. To that, I say this: Screw ES and their bullshit business model. Their business model is a leading cause of dumbasses dumping extremely sensitive PII into a DB that is unprotected - those same folks aren't going to go the extra mile to secure the DB, either by licensing or 3rd party bolt-ons.

Thus, why it must be shipped secure by default. Anything less is a professional felony, in my eyes. Also, screw ES again, in case I wasn't clear.


Is it a secondary concern, though? As a startup, uptake is as vital as oxygen

Tort law is going to catch up to software soon enough and people will be held accountable for negligently creating or deploying software that they should have known would cause harm.

The fact that someone else down the chain should have known better is not a perfect defense. If that misuse was foreseeable and you didn’t do enough to prevent or discourage it, then you can still be held liable.


If startups prioritize their growth over the good of society, isn't the logical conclusion that startups are a threat to society?

They're not a startup.

maybe. but there's always this....

http://www.team.net/mjb/hawg.html


There's something called defense in depth.

Even with ES deployed in an environment with proper network firewall rules, etc., I'd still want some sort of authentication/RBAC.


"Defense in depth" sounds, to me, like a phrase to justify multiple layers of imperfect security.

A single layer of cloth might not hold water, adding more layers of cloth may hold water for longer, but it's probably more cost effective to start with the right material.


> "Defense in depth" sounds, to me, like a phrase to justify multiple layers of imperfect security.

That’s absolutely correct! But you seem to be missing the fact that _all_ layers of security are always imperfect.


This is a fallacy of distributed systems. Never trust the network. Best case, you get packets destined for somewhere else; worst case, your network segment wasn't actually segmented.

i agree with GP here. ES is to blame here. not long ago apache airflow had a similar vulnerability discovered about not having sensible authentication defaults. the reasoning on their mailing list was eerily similar to those defending ES here. same arguments (iirc)

history is our greatest teacher. i think ES will end up doing what that team did: they agreed to provide sensible & secure defaults.


Security in depth. If I compromise one part of your network, I shouldn't compromise it all.

PostgreSQL does the following things by default to prevent this:

    1. Only listen to localhost and unix sockets
    2. Not generate any default passwords

So the only way to connect to a default configured fresh installation of PostgreSQL is via UNIX sockets as the postgres unix user. Where PostgreSQL is lacking is that it is a bit more work than it should be to use SSL.

> It has to exist on a private network behind a firewall with ports open to application servers and other es nodes only.

Have you ever heard of the end-to-end principle, IPv6, or number 4 of the eight fallacies? http://nighthacks.com/jag/res/Fallacies.html


> It has to exist on a private network behind a firewall with ports open to application servers and other es nodes only. Running things on a public ip address is a choice that should not be taken lightly. Clustering over the public internet is not a thing with Elasticsearch (or similar products).

I've met at least one cloud provider in the past (small Dutch thing) that provides _only_ public IP addresses. They do have customers, though one less now. Clustering over the public Internet is a thing. It shouldn't, but I could say the same thing about this website and yet here we are.


Heroku does the same in non-enterprise tiers. Their databases are accessible by the public internet with no option to limit it to your own dynos.

Well, let's agree it's a sad thing. Very sad.

Oh sure, but sad things happen. And they can be even messier: I had a Jenkins instance "made" public because a sysadmin new to a hosting provider forgot to remove the public IP that gets automatically assigned to new things. We were lucky, being fairly sure nothing found it before I realised, but it was a strong lesson learned:

Any network may become public by accident unless you go to great lengths to make sure it doesn't. Configurations change and mistakes are made even by seasoned people. People bring devices. Unless there's an air gap, people's devices may be hacked and let stuff through. Put authentication and anti-CSRF on _all_ your stuff, always.


> Clustering over the public internet is not a thing with Elasticsearch

It is, sort of, https://www.elastic.co/guide/en/elasticsearch/reference/curr...

But it's not a feature you'd be using without a really good reason IMO.


That does give me some food for thought. Not sure I agree a username and password is pointless though.

>Having basic authentication in Elasticsearch would be the pointless equivalent.

Instead of that they could implement a PAKE. That would provide security with no certificates.


Honestly, I as a user don't give a shit what a good engineer should do. All I see is that my personal data gets leaked left and right by elasticsearch and not mysql or postgres. But its fanbois just keep shifting blame instead of reflecting about reality and going "hey yeah maybe we should try to do something about it on our end". So fuck ES.

I agree. Every anti-moronic default adds friction. I love that I can play with ES quickly via simple URL without any auth.

That's how we got PHP, Javascript, Visual Basic, MySQL (before version 5), Mongo.

You'd think that at some point we'd understand that there's way more morons out there than sensible people.


It can still bind to localhost or a local socket without auth.

> It has always baffled me that ES doesn't require a username and password by default.

because auth was a part of their paid service (and by paid i mean 'very goddamned expensive') until like half a year ago when they made it free because of freshly emerged amazons opendistro free auth plugin


They offer security as a paid feature.

Actually it comes for free now with the standard ES distribution. https://www.elastic.co/blog/security-for-elasticsearch-is-no...

>Security for Elasticsearch is now free

What a horrific title. Even simply typing that should have been a blinking neon sign to them that they had their priorities in the wrong order.


That's incorrect.

The usual way of using this service is to have a backend network configured that connects your services and is not available from outside (i.e. you have to traverse through services to reach it).

The so called "security" is just a paid feature for companies that want to use ElasticSearch but want to use it in "legacy" way because, presumably, they don't have people to design it correctly.


That's still really insecure, because it means that as soon as someone manages to gain any access to that network or any of the services on that network has a security issue your database is wide open.

I'd say public internet with proper (encrypted) password auth is more secure than that.


If an attacker has access to the app server, it is already game over. The app server typically already has access to all of the data.

The pods are akin to localhost networking where there is only one externally available application with multiple networked components.


That's true, but there are usually multiple ways to compromise protected networks. You still need to protect the database against attacks that don't go through the app server.

If an attacker gets a hold of your app server, they will be able to get the connection details for that DB, including the username/password.

Having a password adds a small layer of protection to databases that the affected app wasn't meant to connect to.

It adds some protection in that case, but the user should use best judgement if it's worth doing.


If you set up elasticsearch on a cloud service like AWS, by default your firewall will prevent the outside world from interacting with it, and no authentication is really necessary. If you do use authentication, you probably wouldn't want username+password, you would probably want it to hook into your AWS role manager thing. So to me, username+password seems useful, but it isn't going to be one of the top two most common authentication schemes, so it seems reasonable that it should not be the default.

MongoDB also by default does not have username+password authentication turned on.

I think defaulting to username+password is a relic of the pre-cloud era, and nowadays is not optimal.


I don't see why, though. It's much safer to start with a secure setup and then have the user disable the security explicitly (hopefully knowing what they're doing). Yes, username/password auth is not that common, but isn't it better than having no auth at all?

Ok, let's say username/password is mandatory and enabled by default. I see two options.

Option one, they generate a unique password for every installation. That is non-trivial to do, because at which point do you do it? It can't be before a cluster is formed, as you'll have a split brain generating a bunch of credentials. If you do it afterwards, then there is a period of time when your cluster is not yet protected. Worse yet, unprotected and handshaking authentication. So you don't do that.

You could make the user input the credentials. What is to prevent them from creating weak credentials? And worse, they have to do that for every node (or at least the masters). Not a good experience and lost credentials will probably be the subject of a good many support calls.

So most products don't do that. What they do is default passwords. Which is arguably no security at all and doesn't protect anything. It may make it just a tiny bit easier to do the right thing afterwards (by changing to better credentials). Still, there's a period of time while the cluster is unprotected (default credentials are as good as no credentials).

Authentication does little to protect against the sort of people who are exposing databases to the public. If it is easily disabled, then they will be doing just that. Because they are already doing that by forcing databases to bind to publicly accessible interfaces.


I'd say option two is the only one viable. You deny access to the service until credentials are set by the user. You print huge warning labels while the credentials are set by the user to remind them of the possible consequences of setting weak credentials.
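
A minimal Python sketch of that behaviour, under the assumption that warnings are shown rather than weak passwords being rejected outright. Everything here (`LockedService`, `weakness_warnings`, the tiny common-password list, the length threshold) is invented for illustration and is not any real product's API.

```python
import hashlib
import hmac
import os
from typing import List, Optional

# Deliberately tiny, illustrative list; a real check would use a much larger one.
COMMON_PASSWORDS = {"password", "changeme", "admin", "123456"}
MIN_LENGTH = 12  # arbitrary threshold chosen for this sketch


def weakness_warnings(password: str) -> List[str]:
    """Return human-readable warnings about a proposed password."""
    warnings = []
    if len(password) < MIN_LENGTH:
        warnings.append(f"shorter than {MIN_LENGTH} characters")
    if password.lower() in COMMON_PASSWORDS:
        warnings.append("appears on a list of common passwords")
    return warnings


class LockedService:
    """Refuses every request until credentials have been set."""

    def __init__(self) -> None:
        self._salt: Optional[bytes] = None
        self._digest: Optional[bytes] = None

    @staticmethod
    def _hash(password: str, salt: bytes) -> bytes:
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

    def set_credentials(self, password: str) -> List[str]:
        """Store a salted hash and return any warnings to show the user."""
        self._salt = os.urandom(16)
        self._digest = self._hash(password, self._salt)
        return weakness_warnings(password)

    def handle_request(self, password: str) -> str:
        if self._salt is None or self._digest is None:
            raise PermissionError("credentials have not been set; refusing all requests")
        if not hmac.compare_digest(self._hash(password, self._salt), self._digest):
            raise PermissionError("bad credentials")
        return "ok"
```

The design choice is that the insecure state is simply unusable: there is no window in which the service answers without credentials.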

Yes, lost credentials will be the subject of many support calls. Then, it boils down to your priorities. If you care about minimizing support calls, then sure, leave everything open to everyone. It will surely result in fewer access problems.

On the other hand, if your motivation is actually preventing your end-users from doing stupid things, it makes sense to just do the most conservative thing as default. Let the user change to the more liberal option, but not before informing them of all dangers that might befall them in that case.

I refuse to believe in this narrative of the end-user just being a stupid automaton who does not have any agency, and that any default imposed upon them will just result in them overriding the default with their terrible practices and ideas. I think there is a possibility of education and risk reduction.


I'd argue that the "pre-cloud" era is still going strong. And that is a good thing. My workplace has its own data center. There are some downsides, but I prefer it.

So username+password really is needed. And should be included by default.

Also, I'd expect the same of something like MongoDB. That it doesn't have that by default is just baffling.


Password auth over HTTP is horrible. Short of binding a public IP address to your instance, basic auth without HTTPS setup is probably the worst thing you can do.

It's a marketing ploy by ES.

They aggregated the data and published it so that the viral breach would spread their name around because all publicity is good publicity.

Just riffing of course.


This addresses entirely the wrong question. By looking at it as a technical problem you're completely missing the broader ethical problem. Why was anyone allowed by law to amass this amount of data? And why did PDL not take the security and privacy concerns of 1.2 billion people seriously enough to ensure the data was handled correctly? They obviously thought it was valuable enough to amass a huge database. Do they sell this to just anyone? If not, who can buy access to this data? How much does it cost, and what steps are involved in doing so?

This makes me want to talk to a lawyer.


> Out of the box it does not even bind to a public internet address.

Bind to all interfaces used to be the default in 1.x - it changed pretty much because people were footgunning themselves.

Coupled with lack of security in the base/free distribution, that made for a dangerous pitfall. At least now security is finally part of the free offering, but the OSS version still comes with no access control at all.


You typically use these in pods which share networking but are not available from outside.

It doesn't matter then if you bind it to 0.0.0.0.


At the time it was common to deploy on bare hosts. Deploying ES into a network namespace isn't even the most common use case today.

That still puts you a single firewall mistake away from disaster. It also places a lot of trust into the applications and hosts that can access ES on a network level: They get full access with no control at all.

To add to that: no security also means no TLS, neither in cluster communication nor when speaking to clients.


I've come across several such ES instances that are 100% exposed to the world without even trying, and ES is by no means the first tool to have this problem. People are never going to stop doing this. Making it annoyingly difficult within ES just weakens them such that some other "wow it's so easy" search product will be better positioned to eat their lunch.

ES, Mongo, Redis used to be some of the easiest targets for production data (security vuln wise). Usually deployed by SWEs, with products that were early versions and didn't have access control by default.

ES's practice of making its security a proprietary paid for product is the cause for these kinds of things. It's a shitty practice, and this is one of the reasons I'm glad AWS forked it.

Other databases learned that not requiring a user/password upon install is completely irresponsible. ES and other dbs need to catch up ASAP, it's ridiculous.

Documentation is not security. If you need to "RTFM" to not be in an ownable state it's ES's fault.


Trusting software you install to be secure is ridiculous and completely irresponsible, especially if you did not pay for someone else to take the blame.

The only thing you can do to secure your software is to restrict its communication channels. Once you've secured the communication channels, the software auth is decorative at best.


That doesn't absolve ES of providing basic security defaults.

Wasn't this exact same thing a huge scandal just a few years ago for Mongo on Shodan?

I can't believe anyone shipping a datastore could let it happen after that. Doesn't postgresql still limit the default listen_addresses to local connections only? Seems like the best approach. On a distributed store, consistency operations between nodes should go on a different channel than queries and should be allowed on a node by node basis at worst. At least at that point, it requires someone who should know better to make it open to the world. Even when just listening for local connections, passwordless auth should never be a default.


Yes, and similar issues still exist with public MongoDB instances even though the defaults are secure.

This assumes it was incompetence and not done intentionally.

My understanding is that neither company owns this data set, and there is an assumption that it is a third company that has either legally or illegally obtained the data and is using it for their own services.

Another option is that the data was exfiltrated by a loose group of people who wanted this to be freely available on a random ip. Know the ip, get sick access to a trove of PII. No logins, no accounts, no trace.

Welcome to the early 90s internet.


> It's indeed really easy to set up. But setting it up properly still requires RTFMing, dismissing the warning above

I would bet that in a lot of cases, people that configure their servers like in the OP just don’t read the official docs at all.

Stack Overflow, Quora, etc. are great places to get answers, because of the huge amount of questions that have already been asked and answered there.

But when people rely solely on SO, Quora, blog posts and other secondary, tertiary, ..., nth-ary sources of information, Bad Stuff will result, because of all the information that is left unsaid on Q&A sites and in blog posts. (Which is fine on its own – the problem is when the reader is ignorant about the unsaid knowledge.)

> and having some clue about what ip addresses and ports are and why having a database with full read write access on a public ip & port is a spectacularly bad idea.

Again, not necessarily, for the same reason as above.

But even if they did, it is a sad fact that a lot of people dismiss concerns over security with the kinds of “counter-arguments” that I am sure we are all too familiar with. :(

Thankfully though, we are beginning to see a shift in legislation being oriented towards protecting the privacy of people whose data is stored by companies.

Ideally, the fines should cause businesses to go bankrupt if they severely mishandle data about people. Realistically that is not what happens. For the most part they will get but a slap on the wrist. But it’s a start.

Companies that can’t handle data securely, have no business handling data at all.


My favourite was Bitomat.pl's loss of 17k bitcoins in 2011 because they restarted their EC2 instance.

I understand that the "ephemeral" nature of EC2 was in the documentation, but ESL speakers may have glossed over the significance of a word they didn't fully comprehend.

https://siliconangle.com/2011/08/01/third-largest-bitcoin-ex...


Not to say this is what people are doing, but I don't think it requires much knowledge to run under Docker, and it's pretty easy to expose it to the public internet that way.

Incompetence and indifference will be the ruin of us all.

This is just another symptom of the Principal-agent problem writ large.


It's a tragedy that all of this data was available to anyone in a public database instead of.... checks notes... available to anyone who was willing to sign up for a free account that allowed them 1,000 queries.

It seems like PDL's core business model is irresponsible regarding their stewardship of the data they've harvested.


If you're in Europe or California, I suggest sending both companies an erasure request: https://yourdigitalrights.org/?company=peopledatalabs.com https://yourdigitalrights.org/?company=oxydata.io

Disclaimer: I'm one of the creators of yourdigitalrights.org.


Can I use this on behalf of my @company users HIBP has just emailed me about?

This is great. Thanks

Would it be better if this was a paid service? If the issue is access to the data, then maybe we should ask if this data should be collected in the first place.

> If the issue is access to the data, then maybe we should ask if this data should be collected in the first place.

Outlawing the collection of data would be hard and is unlikely to work, but the fact that companies like AT&T are allowed to sell your data, as they did with OP's (where else would that unused phone number come from), is an angle new legislation can use.

The EU now already has a piece of legislation aimed at stifling these practices. The US and other economies just need to follow suit.


I'm more thinking that not all data is equal. We really treat it like it is, at least from the public perspective (it clearly isn't from the perspective of those gathering data, but there's a clear disparity in how these groups view things). Some data is actually necessary to give up to have a well functioning internet (what browser you're using) and some data is not (canvas fingerprinting). There's a tough question here because the people deciding what data gets used are not us. It is the websites we visit. I would argue that there is no consent being given here and all is assumed to be "common consent" (which I'm using for lack of a better term. Things like that if you walk out in public people can see you. But conversely, someone can't run up to you and measure your height with a tape measure). There has to be some balance here. What that is, I don't know. But really the only people that can figure that out are us computer nerds who at least kinda understand these things. We have to be having these discussions, or else it becomes "fuck silicon valley" (a conversation that is becoming national). So if we don't think about these things, then we clearly live in a bubble and bubbles burst. If we do think about these things, maybe we don't live in a bubble.

I was recently told how private detectives from a national agency would actually go door-to-door (over a minimal area) under the pretext of AT&T store / sales employees. They’d try to convince their target (and some incidental neighbors as cover) to switch their bundled services to AT&T.

The private agents were armed with the latest available discounts (which you could find for yourself if you tried). But their skills made them particularly more successful than a typical front-line sales employee.

The catch? It wasn’t a scam, and they really were trying to get their targets to switch. It seems that AT&T was more willing to sell consumer data than the general public is aware of. Converting their targets to AT&T granted their agency access to additional data which they then passed on to their clients. And the target gets a discount, too. Win-Win-Win? :)


It seems like that is starting to happen with California's new data privacy law. I'm starting to get a lot of privacy policy update emails like I did when GDPR took effect.

That is OP's point.

I found a vulnerability in LinkedIn a few years back that allowed anyone to access a private profile (because client side validation was enough for them I guess..?)

They didn't take my report seriously (still not completely patched) and I feel like that told me all I needed to know about their security practices.


I reported an issue to the LinkedIn competitor https://about.me two years ago where signing in with my Google credentials gives me access to the account of some random other person with a similar name to me. I think that during registration, I attempted to register about.me/johnradio (except it's not "johnradio"), but he was already using it, and then the bug occurred that gave me this access.

I randomly check every 6 months or so and yep, still not fixed.


My gmail is my first initial followed by my last name. There are other people on this planet with the same first initial and last name, some of whom seem to think that must be their email too, because I keep on getting emails where they used it to sign up for things.

I had a lady send me a zip file that contained a VPN client, certificate and a word document with usernames and passwords to the VPN and a number of industrial control systems at the factory she was a manager of.

She sent it religiously, every 90 days.


Every few months I get scans of X-rays from random clients' teeth from some dentist in South America. I've tried so many times to respond and/or unsubscribe but never hear anything back.

Do you have any clue who she thought you were?

Oh yes, she was emailing a copy of her stuff to “herself”.

Seriously?

How the hell could she think that your email address was hers? I mean, wouldn't she notice that she never got the messages?


Totally serious. There are about a dozen people who regularly do this. One guy has missed 4-5 job interviews.

So is it typos? Like one letter off?

I can imagine someone mistyping an address, and then reusing the "to" link.


I faced the same problem (though my name is not at all very common). Banks and mobile companies never did anything even after I repeatedly told them on the phone and on Twitter (and have kept a record of it).

One day, after I had received a person's bank and mobile statements and many other bills for a few months, I decided to call him (his number was easily visible in many emails) and inform him of his mistake. He turned out to be a lawyer and he said he would "decide" what to do about it. And the next thing I know, he sent a carefully drafted email (as a legal notice) saying that I should hand over my email address to him without further delay and all that.

I didn't do that. I talked to a lawyer friend and he just told me to reply with a "G F Y" card. I didn't do that either. But that pushed me to finally move my emails to my personal domain as it was/is a Gmail account and if someone complained Google would have just terminated my account and I don't know anyone who works at Google.


That lawyer sounds like a douchebag. I super agree with your point too: I'm also slowly moving all my emails to my personal domain and it feels liberating.

I get several on a weekly basis. It's amazing how many services do not verify emails and just trust their users to own the email they claim to own.

It’s a common “growth hack” to postpone email verification.

Even more baffling are the ones who use it to fill out job applications.

I get bank statements, job offers, party invitations, and lately a bunch of, let's say, very questionable email verifications from euro 'dating' sites. I've identified the guy in the UK but it's too much (and getting embarrassing now) to keep forwarding his stuff to him.

Downside of getting in early on popular email services.


I went through several rounds of conversation with somebody's wedding planner over email.

> but it's too much (and getting embarrassing now) to keep forwarding his stuff to him

What amazes me is when I get misaddressed email, and I reply to say it's misaddressed (and I'm not talking about automated services, I'm talking about obviously manually sent stuff), and my reply just gets ignored and the misaddressed email just keeps on coming.


Somebody keeps phoning me and leaving messages. They don't answer their own phone (or messages clearly). I even have a sarky voicemail now, you'd think they'd notice. Nope!

Lady, whoever you think is going to be at that funeral isn't getting that message.

I've no idea if they'll get disconnected now that I've blocked their number. Hope so; maybe they'll notice then.


That's the most surreal, when you try to fix it and the behavior never changes.

My gmail is two initials and last name, so theoretically less susceptible to such errors. Yet I get misaddressed mail all the time—and a surprising amount of it is job applications!

Trust me, I used my full first name, it's not enough to stop these people. One is a UK doctor, one is a US teacher, and I think there are one or two more. Been sent a few baby pictures from their relatives too.

This happened to me and I keep getting the guy's notifications on instagram and all. So annoying!


I actually had a similar thing happen with Facebook, though we didn't share names.

For a while, our Comcast billing account accessed some other person’s account. Comcast didn’t take it seriously, and just told us to create a new account and not use the old one. (!!!)

We had full access. I could have signed this person up for the most expensive package, or even canceled their service.


Let's be realistic here. Everyone knows it's not possible to cancel Comcast service.

I managed to cancel my dad's after he died. They STILL tried to upsell me! One of my favorite phrases ever uttered: "He's dead, you asshole, he doesn't need more channels!" And that actually did it. Felt sorry for the salesperson, who didn't have much of a choice in the matter...

Surely by making it difficult to cancel they’re really just making it easier for people to get discounts. If I were a Comcast customer I’d be calling up to cancel every few months.

He's dead, he doesn't need discounts.

Obviously. Which is why I used a plural—I was referring to Comcast’s overall customer base.

Nice one. However, I cancelled in person a couple years ago (because I had equipment to return).

The first thing I said at the counter was "I know it's really hard to cancel Comcast, and I'm not going to accept anything but a cancel."

The girl at the counter smiled and said "We know ..." and immediately cancelled my account.


"Ah yes, cancelling requires a call because of security. A feature for the user!"

To be fair, the internet would have been equally outraged if there wasn't such a requirement, because sure as hell somebody would have found an exploit and cancelled a bunch of accounts, just for funzies.

That sounds like white hat hacking from all I've heard of Comcast...

Maybe that's how we drive their customer count and revenue down and put them out of business.


I signed up for a disposable Gmail account using my real name at one point, and accepted the randomly suggested address it offered. Gmail loaded with someone else's obviously in use mailbox

IIRC I logged out again and back in, same thing, my credentials worked. Went back to it a few days later and the password no longer worked


Hash collisions most likely.

Have heard this so many times about Gmail...

How have they not resolved this?


I think it's like EC2 instance IDs. When they first came up with it, they never thought there would be literally billions of unique email addresses/EC2 instances eventually.

I can only imagine about.me mass-creating profiles for names found on other web pages, and opening a way for someone to "claim" those profile with a matching Google account sign-in.

About.me's business model was quite unsettling to me and they have made little to no effort to protect the user data from scrapers.


I had a similar experience. In 2014 I reported an issue where you could take over someone's account by adding an email you control to it and having them complete the flow by sending them a link (which, unless they looked very carefully, looked exactly like the regular log-in flow at the time - especially if they used a public email service and you registered a similar-looking account).

I tried it on a friend and it worked, but LinkedIn's response was basically "meh".

My life has only gotten better since I deleted LinkedIn a few years ago. I know I'm in a privileged position to be able to do that, but I strongly recommend everyone here consider whether what they gain from their account is worth the crap and spam they have to put up with.


LI is terrible if you actually try to use it, but it's harmless enough if you just use it as a profile hosting service, where people are likely to look. I just auto-archive their emails and only visit the site a couple of times per year.

While not good, what's the connection to this story?

The article says some LinkedIn data was scraped, but I don't see anywhere that it specifically says a LinkedIn security flaw was used in the scraping. Although it is vague about what data was scraped and how, so it doesn't preclude that either.

In other words, are you saying a LinkedIn vulnerability was exploited here, or suggesting that it probably was, or are you just mentioning LinkedIn because it's tangentially related?


I signed up for an API key to see what they have on me, and the data it returned looks awfully close to what I have on LinkedIn.

A few years of heads up is sufficient to disclose publicly. Full disclosure helps keep companies honest about security.

I deleted my LinkedIn a few years back when they had some bug where I would randomly get page views as some other person, with all their connections and account details and whatnot. It would only last a few minutes then switch me back to my account, but they aggressively ignored my attempts to reach out to them about this bug so I just gave up.

[flagged]


Could you please stop posting unsubstantive comments to Hacker News? We're trying for a bit better than internet default here.

No it is not.

The number in the HN headline was changed from 1.2 billion to 1 billion (despite the original source's headline saying 1.2). It is kind of amazing that leaking the personal data of 200 million people is now just a rounding error that can be dropped from headlines.

Imho, it's more impressive that it's basically a non-story outside of IT security news.

The general public just shrugs upon hearing such news. They still think there is nothing dangerous if their data gets leaked.

I think the solution here is laws which require anonymity, and that includes in banking (where it will never happen).

That is because a couple days ago, I got a text message from T-Mobile (which seemed genuine) basically saying that my account was one of a larger subset of prepaid phone accounts which had been compromised and that my personal information had been potentially taken by "hackers".

To which I got a good chuckle, because T-Mobile is one of the few phone companies that will let you create completely anonymous prepaid accounts using cash and without filling out any information. AKA you buy a SIM card for $$$ and that is it. So, basically the only information of mine they lost, as far as I can tell, is the phone number and type of phone I'm using (which they gather from their network). If they got the "meta" data about usage/location/etc that would have been different but it didn't sound like the hacker got that far.

Had this been a post-paid account they would have my name/address/SSN/etc.


Do you think it’s reasonable to believe your name / address / SSN / DOB / etc is already out there?

I’m of the opinion it’s too late for prevention and we need, instead, mitigation.


Exactly. The very reason for existence of the two companies, pdl and oxy, is to tie n pieces of data with m pieces of data.

So depending on how the "anonymous" phone number was used, it's plausible that the number can be connected with other PII.

In fact I wonder if there is any such thing as non-PII, given the existence of such companies.


Companies need to stop treating knowledge of this information as proof that you are who you say you are. I would have no problem publicly posting my name, social security number, birthday, mother's maiden name, etc., if not for the fact that someone can actually use this information to open a bank account or take out a loan in my name. It's ridiculous that this is all it takes in most cases.

> Companies need to stop treating knowledge of this information as proof that you are who you say you are.

If we assume that isn't happening in the very immediate future due to the latency of introducing new legislation...

Do we have any other options to protect ourselves?

I've personally worked myself in to a bad credit rating. I have a home loan and a credit card, but any new credit applications auto-reject. Not the ideal scenario though!


> Analysis of the “Oxy” database revealed an almost complete scrape of LinkedIn data, including recruiter information.

"Oxy" most likely stands for Oxylabs[1], a data mining service by Tesonet[2], which is a parent company of NordVPN.

It is probably safe to assume that LinkedIn was scraped using a residential proxy network, since Oxylabs offers "32M+ 100% anonymous proxies from all around the globe with zero IP blocking".

[1] https://oxylabs.io/

[2] https://litigation.maxval-ip.com/Litigation/DetailView?CaseI...


The article says it is "Company 2: OxyData.Io (OXY)"* (http://oxydata.io)

OxyData and OxyLabs seem to be sister companies[1]: the former sells data as a product, the latter sells scraping as a service.

[1] https://vpnscam.com/wp-content/uploads/2018/08/2018-08-24-09...


Tesonet is true cancer. I am amazed how unethical (and successful) they are.

Knowing how quickly it's expanding, are the employees just as unethical, or do they not connect the dots (company got too big)?

I hate FB et al. as much as anyone here, but most people know that "if it's free - you are the product". Though with NordVPN users are paying money and are getting stabbed in the back.


> are the employees just as unethical

Most people's ethics are easily bought. Does working for a company that operates with questionable integrity outweigh providing a stable income for your family?

Remember Facebook is still a very highly desirable company to work at.


> NordVPN users are paying money and are getting stabbed in the back.

could you please expand on this claim?


From the comment they replied to: https://vpnscam.com/

"My name is Ripoff Reporter." For all that their schtick is about how they're "educating" the public about how shady VPN services are this could be anyone, including a front for a VPN service that isn't mentioned on the site.

How is that possible? LinkedIn blocked mining the data this way several years ago.

Is it still possible if you pay LinkedIn enough? Or is this old data?


It is strictly impossible to "block mining data" on the public web. Double that if the miner has free access to a pool of residential IPs.

[source: experience]


A large number of residential proxies and fake LinkedIn accounts would look the same to LinkedIn as normal browsing.

There's information in the leak that wouldn't be widely available without accessing LinkedIn data using their APIs. Phone numbers and emails, for example.

The article mentions it is a blend of data from http://oxydata.io/ and https://www.peopledatalabs.com/

Both are aggregators that get data from many sources, correlate them, and sell it. The phone numbers and emails could have come from anywhere.

See this screenshot from PeopleDataLabs: https://d1ennknj6q36vm.cloudfront.net/images/cblead.png


I'm a NordVPN user. Practices like this scare me though. I guess it's time to switch to a new VPN?


Ah... but that is very inconvenient :( I guess comfort comes at a cost.

Is there at least a less shady provider if I would like to compromise myself but a bit less than nordvpn? How far do we go in assuming all are bad?


Mullvad seems trustworthy (I used to share an office with one of their IT infrastructure staff), but it is impossible to say for sure.

You could set up your own VPN on a server you run.

Yes. This. And it is free to set up on big cloud services. Like free 24/7 with whatever amount of data. Guides are online.

All the way. It isn’t as if all VPN providers are part of a shadowy cabal to steal your data from an otherwise valuable service; the very premise of commercial VPNs is flawed. Any VPN service is inherently harmful.

Out of curiosity how do you guys think they managed to scrape LinkedIn on such a large scale?

I've been wanting to do some social graph experimentation on it (small scale - say 1000 people near me) but concluded I probably couldn't scrape enough via raw scraping without freaking out their anti-scraping. (And API is a non-starter since that basically says everything is verboten).


I've crawled a popular social network on a large scale, currently doing the same for dating services as a hobby. God, wish I'd still got paid for webscraping.

Here are some tricks which may or may not work today:

- Have an app where user logs in through said website, then scrape their friends using this user's token. That way you get exponential leverage on the number of API calls you can make, with just a handful of users.

- Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

- Scrape the mobile website. Even Facebook still has a non-js mobile version. This single WAP/mobile website defeats every anti-scraping measure they may have.

- From a purely practical perspective, start with a baremetal transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql", those articles will all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in hurry.

- Don't be too kind on the big websites. They can afford to keep all their data in hot pages, and as one person you will never exhaust them.
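
For context on the IPv6 tip above: a limiter keyed on the full address treats every address inside one customer's allocation as a separate client. A minimal sketch of what a subnet-based limiter could look like from the site's side, assuming one client per /64 and a simple fixed-window counter (the prefix length, class name and window scheme are illustrative assumptions, not anything a particular site is known to run):

```python
import ipaddress
from collections import defaultdict

IPV6_PREFIX = 64  # assumption: treat each /64 as a single client


def rate_limit_key(addr: str) -> str:
    """Return the key a request is counted under.

    IPv4 addresses are counted individually; IPv6 addresses are
    collapsed to their enclosing /64 network.
    """
    ip = ipaddress.ip_address(addr)
    if ip.version == 6:
        net = ipaddress.ip_network(f"{addr}/{IPV6_PREFIX}", strict=False)
        return str(net)
    return str(ip)


class FixedWindowLimiter:
    """Allow at most `limit` requests per key per time window.

    Old windows are never pruned here; a real limiter would expire them.
    """

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)

    def allow(self, addr: str, now: float) -> bool:
        key = (rate_limit_key(addr), int(now // self.window))
        self.counts[key] += 1
        return self.counts[key] <= self.limit
```

With this keying, rotating through addresses inside the same /64 no longer buys extra requests, which is the gap the tip relies on.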


> - Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

Nice tip!!

> -- From a purely practical perspective, start with a baremetal transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql", those articles will all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in hurry.

Somewhat ironically Elasticsearch would probably work really well for this too (just make sure your elasticsearch isn't open to the world on the internet!).


>Somewhat ironically Elasticsearch would probably work really well for this too (just make sure your elasticsearch isn't open to the world on the internet!).

Sure it will work, but I personally don't like Elasticsearch for anything high-intensity because of its HTTP REST API and the overhead it carries. Take a look at Cassandra's [1] "CQL binary protocol", it is simple and always to the point.

[1] https://github.com/apache/cassandra/blob/trunk/doc/native_pr...


You forgot the part about exposing your finished database to unprotected elasticsearch http endpoint ;)

In all seriousness does anyone know why you can even host an elasticsearch database as http and without credentials? Seems to be the default. What is the use case for this?
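
As the top comment notes, Elasticsearch does not bind to a public address out of the box, so the exposure comes from whatever bind address someone configures. A rough sketch of how one might sanity-check such a value, assuming it is a literal IP address rather than a hostname or one of Elasticsearch's special values (the function name and labels are made up for illustration):

```python
import ipaddress


def bind_exposure(host: str) -> str:
    """Classify a literal bind address by how widely it is reachable."""
    ip = ipaddress.ip_address(host)
    if ip.is_unspecified:   # 0.0.0.0 or :: means every interface
        return "all interfaces"
    if ip.is_loopback:      # checked before is_private, which also covers 127/8
        return "loopback only"
    if ip.is_private:       # RFC 1918 and similar non-routable ranges
        return "private network"
    return "public"


for candidate in ["127.0.0.1", "10.1.2.3", "0.0.0.0", "8.8.8.8"]:
    print(candidate, "->", bind_exposure(candidate))
```

Anything other than "loopback only" still needs a firewall or authentication in front of it; the classification only says where the socket listens.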


Tbh I'm still selling that data.

For a while I've had reoccurring nightmares that my DB had been stolen and published together with an article on how stupid and incompetent I am.


If I've understood you right, you break the TOS on other websites to collect users' personal info, and then you have nightmares about people taking that data from you? Doesn't that raise ethical concerns in your eyes?

>You forgot the part about exposing your finished database to unprotected elasticsearch http endpoint ;)

I'll cut straight to the chase and post it on hn. This intermediate step of waiting for someone to discover it takes too long


The use case is in a local datacenter, with a NAT-ed IP not exposed to the main web

A firewalled IP would be much more appropriate, and NAT is not a firewall or a security mechanism.

Same thing, more-or-less. And NAT is effectively a firewall for inbound traffic, even if a lot of people say it isn't.

> Have an app where user logs in through said website, then scrape their friends using this user's token.

That's some extremely shady thing to do.


Welcome to the internet!

> Don't be too kind on the big websites.

I usually recommend latency-based dynamic load control for that. Once the website starts to reply 500-1000ms longer than the average one-thread latency, it is time to take a bit of it back. It is also a co-operative strategy between fellow scrapers, even if they don't know about the other ones pushing larger load on the servers.
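
A minimal sketch of that kind of latency-based back-off, assuming a known single-thread baseline latency and a multiplicative increase / gentle decrease of the delay between requests (the class name, the 500 ms default and the factors are assumptions chosen for illustration, not the commenter's actual setup):

```python
class LatencyThrottle:
    """Adjust the pause between requests from observed response latency."""

    def __init__(self, baseline_ms: float, slowdown_ms: float = 500.0,
                 min_delay: float = 0.1, max_delay: float = 60.0):
        self.baseline_ms = baseline_ms    # average one-thread latency
        self.slowdown_ms = slowdown_ms    # extra latency that triggers back-off
        self.min_delay = min_delay        # seconds
        self.max_delay = max_delay        # seconds
        self.delay = min_delay

    def record(self, latency_ms: float) -> float:
        """Feed one measured latency; return the delay to wait before the next request."""
        if latency_ms - self.baseline_ms >= self.slowdown_ms:
            # Server looks loaded: double the pause, up to the cap.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Server looks healthy: shrink the pause slowly.
            self.delay = max(self.delay * 0.9, self.min_delay)
        return self.delay
```

The threshold is the knob that matters: a much smaller `slowdown_ms` backs off long before the slowdown would be noticeable to the site's real users.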


1000ms is a massive slowdown when revenue-noticeable impacts are far, far smaller. I don't know the legality, but hitting a site hard enough to cause 1000ms slowdowns seems like it's approaching DOS legality issues.

Don't you consider this unethical -- if not against the site itself, then against the other users of the site whose data you're scraping?

Wow these are some hot tips!

YMMV, and cloud providers would hate you for this, but you can automate the IP rotation with a cloud provider that bills you by the hour. It's easier than ever nowadays to spin up an instance in Frankfurt, use it for an hour, and then another in Singapore for the second hour.

Pretending to be Googlebot also helps.


>- Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

Clever. VMs with IPV6 are cheap as a bonus :)

Same for non-js mobile. Thanks for the tips


- Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

How would someone do that using node.js? Asking for a friend.


So far, the answers have contained non-technical answers like "Distributed Scraping." Well, yes, obviously.

A more useful answer is: I did this once, many years ago. Back then it was a matter of hooking up PhantomJS and making sure your user string was set correctly. Since PhantomJS was – I think – essentially the same as what headless chrome is today, the server can't determine that you're running a headless browser.

Now, it's not so easy nowadays to do that. There are mechanisms to detect whether the client is in headless mode. But most websites don't implement advanced detection and countermeasures. And in the ideal case, you can't really detect that someone is doing automated scraping. Imagine a VM that's literally running chrome, and the script is set up to interact with the VM using nothing but mouse movements and keyboard presses. You could even throw in some AI to the mix: record some real mouse movements and keyboard presses over time, then hook up some AI to your script such that it generates movements and keyboard presses that are impossible to distinguish from real human inputs. Such a system would be almost impossible to differentiate vs your real users.

The other piece of the puzzle is user accounts. You often have to have "aged" user accounts. For example, if you tried to scrape LinkedIn using your own account, it wouldn't matter if you were using 500 IPs. They would probably notice.

It's hard to counter a determined scraper.


I wrote a chrome headless framework that types using semi-realistic key presses (timing, mistakes, corrections) and does semi-realistic scrolling / swiping and clicking / tapping.

It's not very hard to get something that would be too hard for almost every website beside Google and Facebook to bother with. If it's a 1 on a 0-9 scale in difficulty, most websites just don't have the resources to detect it

It took me like ~3 hours to write it, but I guarantee it would take months for someone to detect it, and even then, they'd have a lot of false positives and negatives.


I think there's also a lot of bot-detection-as-a-service around here that can be used by sites smaller than Google and Facebook, like WhiteOps or IAS anti-fraud.

These are highly questionable under GDPR, many of them rely on tracking users wherever they go (e.g. Recaptcha is known for this).

> These are highly questionable under GDPR

How many fines has GDPR resulted in?


Not many yet, general consensus is to first warn and get companies to implement better compliance - only those who really openly shit on GDPR get the fines.

then release it!

Headless chrome cat and mouse game is a lot of fun. We need more players.


LinkedIn's protection doesn't seem to be that sophisticated at the moment. Someone I know maintains ~weekly up-to-date profiles of a few million users via a headless scraper that uses ~10 different premium accounts and a very low number of different IPs.

That is a violation of ToS (using registered accounts for scraping) and could carry potential legal implications.

So is leaking PII? ToS isn't a legal contract: it's not signed by anyone and it's changed every other week without consent of users. ToS is just a formal excuse why someone's account may be suspended.

As long as you are able to source more than one provider, this can work well enough. If you're dependent on a single data source, e.g., because that source is the only possible source of said data, you'll get nuked from orbit by legal rather than technical means.

I had a business that was generating more money than my full-time job for a while. We helped and greatly simplified matters for several thousand independent proprietors while having a positive effect on the load of the data source, since we were able to batch/coalesce requests, make better use of caches, and take notification responsibilities on ourselves.

Once in a while someone would get worried and grumpy at the data source and there were a couple of cat-and-mouse games, but we easily outwitted their scraping detection each time. When they got tired of losing the technical game, they sent out the lawyers, which was far more effective. We were acquiring facts about dates and times from the place that issued/decided those dates and times, so there wasn't really any reliable alternative data source, and we had to shut down.

The glimmer of hope on the horizon is LinkedIn v. HiQ, which seems poised to potentially finally overturn 4 decades of anti-scraping case law, but not holding my breath too hard there.


The US courts decided that scraping is legal, even if against EULA:

> In a long-awaited decision in hiQ Labs, Inc. v. LinkedIn Corp., the Ninth Circuit Court of Appeals ruled that automated scraping of publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA). This is an important clarification of the CFAA’s scope, which should provide some relief to the wide variety of researchers, journalists, and companies who have had reason to fear cease and desist letters threatening liability simply for accessing publicly available information in a way that publishers object to. It’s a major win for research and innovation, which will hopefully pave the way for courts and Congress to further curb abuse of the CFAA.

https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...


That is a blatant misrepresentation of that decision. That decision was upholding a lower court's preliminary injunction that prevents LinkedIn from blocking hiQ while the main case between the two is litigated. It is not a final decision and it doesn't purport to say that scraping is legal (it even points out other laws besides the CFAA that might be used to prohibit scraping.)

LinkedIn Sales Navigator is a paid tool which allows you to search their whole database. Then depending on how much you pay you can get all their personal details (Email address, phone number, even their address sometimes.) https://business.linkedin.com/sales-solutions/sales-navigato...

I've always been a little confused how this works. If I got all that info for free, it's a "data leak", but if I pay to get the same detailed personal information it's...

In either case my personal data is given away without my consent, but there's this implication that it's only an issue when someone doesn't pay for it.


You're right, my take on this is that a company scraped a bunch of publicly available information, that people left open (consciously or not.) That's why only a subset have phone numbers. The profile URLs, emails, most people don't even try to protect those.

Normally the company sells this data, but now they've given it away. It's not good this data got out because the curation has some value to spammers or whoever. But using the word "leak" here undermines the severity of a real leak where passwords and social security numbers are exposed. Data that was never meant by anyone to be open.

Everyone likely has (technically) provided consent for every piece of information here being shared with partners. Buried in fine print that it wasn't really expected they'd read, of course. It's the cost of being online, and that sucks, but it seems only a leak of what had already been given out.


> In either case my personal data is given away without my consent

You gave that consent when you put your info in Linkedin in the first place, according to their ToS.


I think everyone is confused. Everyone just wants their slice of the pie (aka $$$).

If you get drivers' info by hacking a DMV database, it's prison. If you got the same details by paying a few million for FOIA requests, you're a good citizen and a model tax payer.

Unless you're the state of Florida, and you make millions by selling the DMV database to private buyers... [0]

[0] https://www.abcactionnews.com/news/local-news/i-team-investi...


Jokes aside, can you really file FOIA requests to get personal driver details from DMV? I thought FOIA would only apply for stuff that is meant to be public, but isn't due to difficulties of hosting, putting it up, etc.

Mind you, I didn't research the topic of what can or cannot be requested with FOIA, so I might be totally wrong.


LinkedIn gives away email id and phone number (even if you had given just for 2FA) to all your contacts. I checked PDL, it has all the information from LinkedIn except for phone number, which I promptly removed once I identified the 2FA issue (now TOTP is available).

'Mobile Proxies' like https://oxylabs.io/mobile-proxies (no affiliation) allow you to use large pools of mobile or domestic IPs to scrape. It's expensive, but not prohibitively so. Once you've got a mobile IP you become incredibly hard to throttle, since you're behind a mobile NAT gateway.

You probably have to be highly distributed. At least that’s what I did when I tried to scrape a large site some years ago. I had around 100 machines in different countries and gave each of them random pages to scrape.

Distributed bot and scraper networks. Thousands of IPs geographically dispersed throughout the world. There is only so much you can do with rate limiting.
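To make the rate-limiting point concrete, here is a minimal sketch of a per-IP fixed-window limiter. The class name, the limit of 10 requests per window, and the example IP ranges are all made up for illustration; nothing here reflects how any particular site actually throttles.

```python
from collections import defaultdict


class PerIpRateLimiter:
    """Fixed-window limiter: at most `limit` requests per IP per window.

    Hypothetical illustration only, not any real site's implementation.
    """

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # (ip, window index) -> request count

    def allow(self, ip: str, now: float) -> bool:
        key = (ip, int(now // self.window_seconds))
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


def allowed_requests(limiter, ips, attempts_per_ip, now=0.0):
    """Count how many requests get through within a single window."""
    return sum(limiter.allow(ip, now) for ip in ips for _ in range(attempts_per_ip))


# One IP sending 1000 requests vs. 100 IPs sending 10 requests each.
single = allowed_requests(PerIpRateLimiter(limit=10), ["203.0.113.1"], 1000)
spread = allowed_requests(
    PerIpRateLimiter(limit=10), [f"198.51.100.{i}" for i in range(100)], 10
)
print(single, spread)
```

With the same total of 1000 attempts, the single IP is cut off at the limit while the spread-out traffic never trips it, which is why a per-IP limit alone does little against a large pool of addresses.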

They asked about LinkedIn, where the content is gated behind a login. If it was a rate limiting problem, that would be trivial.

Needing to be logged in as the same user defeats the purpose of proxying to hide your physical origin.

Registering thousands of different users to use in a distributed way is hard now that they require a text message verification for new accounts.


Public LinkedIn profiles (which is many of them) are open to scrapers, and LinkedIn lost a court case about it.

https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...


I go to LinkedIn without being logged in and nearly always get a login gate instead of the profile.

They were ordered to unblock hiQ specifically, they were not ordered to open up content to scrapers generally.

They can still throttle high volume traffic and put up captchas. I think the only specific thing the court ordered was for them to unblock hiQ IP ranges.


Proxies can also work well for cheaper than buying distributed compute.

Scraping LinkedIn is so common you can usually hire people with years of experience in it. It is not as complicated as you might think. There are at minimum hundreds of companies that sell LinkedIn data they have scraped.

You use a proxy botnet and route your scraping requests through that. Use something like hola proxy or crawlera for example.

I scraped 10 million records from LinkedIn a few years ago from a single IP by using their search function. I got a list of the top 1000 first names and top 1000 last names and wrote a script to query all combinations and scrape the results.

This may or may not still work.
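For a sense of scale, the "all combinations" step is just a Cartesian product of the two name lists. The sketch below only enumerates query strings and makes no network requests; the short name lists and the function name are placeholders, not the original script.

```python
from itertools import product


def name_queries(first_names, last_names):
    """Yield one 'First Last' query string per name combination."""
    for first, last in product(first_names, last_names):
        yield f"{first} {last}"


# Placeholder lists; the comment above used the top 1000 of each.
first_names = ["Alice", "Bob"]
last_names = ["Smith", "Jones", "Lee"]
queries = list(name_queries(first_names, last_names))

# 1000 first names x 1000 last names is one million distinct queries.
total_for_top_1000 = 1000 * 1000
print(len(queries), total_for_top_1000)
```

If all one million combinations were actually run, 10 million records would work out to roughly ten records per query on average.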


It looks like the purpose was data enrichment, so maybe it was pieced together over time from multiple sources. My LinkedIn data from PDL only had 1 bit of wrong info. I wasn't able to find anything on my personal email addresses, which is good.

I once worked on a project that tried to do just that, but at the time the LinkedIn API was already limited to seeing the authenticated user's connections' connections, which was too limited for what we wanted to do. I can only imagine it got worse. It's also the reason recruiters really want to connect to you on LinkedIn: even if you are not interested, your connections might be.

A very large distributed network of machines.

Hey - not related to your comment (apologies) but wanted to get in touch. You left a note on a previous post of mine about wanting to simplify FTP. I'd love to work on this project and wanted to see if you'd be willing to connect so I can understand the problem better. Feel free to email me at kunal@mightydash.com, and thanks in advance!

People Data Labs' data is pretty accurate. Here is mine: https://api.peopledatalabs.com/v4/person?api_key=9c6a1382204...

You can try it for yourself by changing the email. All of the information is public, so I don't mind. They are basically doing data integration.


Haha, when I was a kid and scared to use my real name for things, for some reason I used my email... which had my real name in it, to open a Github account with a fake name

So the api knows me as the famous architect, Art Vandelay


Reminds me of when I used to get free magazine subscriptions (and the subsequent junk mail/robocalls) addressed to Santos L. Halper.

There is a way to get every developer's email on GitHub thanks to git commits adding it :))

In your GitHub account you can add a new email address that doesn't even exist or have a valid TLD, like "name@mail.fake". Don't use it as your primary email and it won't require confirmation. You can now set your git user.email to this fake address and any commits you make will be attributed to your account without exposing your actual email address.

You can use yourgithubusername@users.noreply.github.com instead of adding a fake email, and your commits will still show up on your contribution graph and be linked to your username.
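A small sketch of building that noreply address and the matching git command. The helper name is made up, and the ID-prefixed form is an assumption based on my understanding that newer GitHub accounts get an address of the form ID+username@users.noreply.github.com; check your account's email settings for the exact address to use.

```python
from typing import Optional


def noreply_email(username: str, user_id: Optional[int] = None) -> str:
    """Build a GitHub-style noreply commit address (hypothetical helper).

    Without an ID this is the plain form from the comment above; with an
    ID it uses the ID+username form (assumed, verify in your settings).
    """
    local_part = username if user_id is None else f"{user_id}+{username}"
    return f"{local_part}@users.noreply.github.com"


address = noreply_email("yourgithubusername")
# Command to run inside a repository so new commits use this address.
git_command = f'git config user.email "{address}"'
print(git_command)
```

Note that this only affects future commits; addresses already recorded in pushed history stay there.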

That must have been a long time ago, Boorish Bears.

Wow.. I checked with an email address I use for disposable purposes. The only thing they had on it was a blank LinkedIn profile -- meaning that LinkedIn cancer has trawled some pretty questionable sites, harvesting email addresses as placeholders for their accounts. WTF.

Ah, looks like everyone's using that API key, I got 2 queries for my addresses and got a "rate limit exceeded" message.

Strangely it only says I work in real estate (no I don't) when I looked up the email address I use for LinkedIn...


You and others can use my API key; I just signed up.

e75ac28b25480e60071b24d819d4692a0b315c037046b9ff6ec9dfb1e99a895c


Status 429, Rate limit error.

Yours is gone too now. Very curious about this API lol

Try changing v4 to v3 in the URL.

Yup, that worked for me.

Indeed they do have a profile on me - a bare minimum, scraped from GitHub. That makes sense, since that's about the only social platform I use, aside from HN.

EDIT: My Gmail address has the most information gathered, which makes sense. It's gathered Facebook, LinkedIn, Pinterest, GitHub..

It lists my skills as: firefighting and emergency planning/management/services. I suppose, with a stretch of imagination..


Here's mine eaca37c25ca1a9c5d85efb8cbaf1742b4fbfeee0054d713961176ab9500c2f2b

It returned a 404 for my personal email account, so that appears to be sufficiently protected.

More surprisingly, it had data such as my name, title and work email address, which was connected to an old work email account (Okta managed - GSuite) that I never associated with external services, and absolutely never used on a social networking site like LinkedIn.


That API key is now public, too! Rate limited.

Yeah no kidding. Though if you wait until it flips to a new minute and refresh, that helps. Though it takes all of a minute to register a free key, so probably no big deal.
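The "wait until it flips to a new minute" trick can be sketched as a retry loop. This assumes the quota resets on wall-clock minute boundaries, which is only inferred from the behaviour described above; `fetch` is a hypothetical callable returning a (status, body) pair, and the demo uses a fake one rather than the real API.

```python
import time


def seconds_until_next_minute(now: float) -> float:
    """Seconds left until the next wall-clock minute boundary."""
    return 60.0 - (now % 60.0)


def fetch_with_minute_retry(fetch, max_attempts=3, clock=time.time, sleep=time.sleep):
    """Call fetch(); on HTTP 429, sleep until the next minute and retry."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(seconds_until_next_minute(clock()))
    return status, body  # still rate limited after the last attempt


# Demo with a fake fetch: rate limited once, then successful.
responses = iter([(429, "Rate limit error"), (200, "ok")])
sleeps = []
result = fetch_with_minute_retry(
    lambda: next(responses),
    clock=lambda: 45.0,   # pretend we are 45 seconds into the minute
    sleep=sleeps.append,  # record the wait instead of actually sleeping
)
print(result, sleeps)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting; with a key shared by a whole comment thread, of course, the next minute's quota may be gone just as quickly.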

Your API key is now permanently public. Even after a few days, people will still be able to use it for their own purposes.

A few days? It's already hit its limit :)

I'm actually a bit surprised at how little data they have on me. They've associated my main email with an old junk email, they've got my first and last name, and know that I'm male, but there's little more.

Nothing for most of my accounts, except one which somehow was falsely attributed to someone else. Odd given I do have a LinkedIn profile; their scraping must be far from perfect.

Wait, so is this mostly just LinkedIn data in JSON form?

My personal email seems to be based on GitHub and Gravatar, while my job search and work emails got linked together and appear to be based on LinkedIn.

This seems exceptionally unethical

Displaying public information publicly, or sharing your API key?

I would be really surprised if this were compliant with the GDPR. I live in the US but I tried email accounts of relatives in Europe and they had data in there.

It looks like it's a US-based company without enough of a European presence to fall under their jurisdiction.

https://gdpr.eu/companies-outside-of-europe/ it looks like it would? I'm no expert though.

Right, they can say it applies... but if a company does no business in Europe, how can a judgement be enforced?

> The whole point of the GDPR is to protect data belonging to EU citizens and residents. The law, therefore, applies to organizations that handle such data whether they are EU-based organizations or not, known as "extra-territorial effect."

They can say this all they want, but if you have no presence in the EU, and your jurisdiction does not have any agreement to apply GDPR regulations to you, then this is at most a strongly worded request.

Barring explicit agreements to the contrary (treaties, extradition agreements, etc), by definition a country's laws are only enforceable there.

If PDL has no business in Europe, no plans to expand there, and there's no treaty or other agreement making the provisions enforceable against them, the EU can say whatever it wants but PDL has no legal obligation to do anything about it.


One obvious answer in that case would be to establish who is buying the data from them and treat any PDL data as potentially tainted. If you find a downstream customer who does have a presence, then investigate accordingly. You might not be able to fine PDL directly, but you could certainly make the offending data risky or unprofitable...

Sure, but how do you propose doing that? Send another strongly worded letter to PDL demanding their customer list?

Usually you'd either track known errors in the dataset (implying that the companies had either bought it from PDL or copied the leak), or you'd ask the banks (who do have a presence) which accounts were paying them and who owned the accounts. If Bitcoin's involved at all, you assume there's something fishy going on and investigate accordingly.

(Assuming anyone were bothered enough to actually do this, of course.)


I’m also not an expert, but my understanding is that it applies but would be hard for the EU to take action against them

A law isn't a law if you can't enforce it, so "applies" has kind of a strange meaning in this context then, doesn't it?

A law always has a jurisdiction. EU laws generally don't apply to the US, even if the EU wants them to. There are exceptions, of course.

Theoretically, if it were egregious enough, the EU could say to the owners or management of the company that if they went to the EU they would be arrested. That’s enough of a threat that it might convince them.

Legal jurisdiction is a separate matter from the specific text of laws. The "this applies to non-European companies" thing just means that if you fall under the jurisdiction of European courts, you can't absolve yourself of responsibility for complying with this law simply by being a foreign-registered company.

On the other hand, if you never fall under European jurisdiction in the first place, you're free to ignore them, just as you can ignore Thai laws against insulting their king. One very important thing to note is that setting foot on European soil will expose you to their jurisdiction, so you've significantly limited your freedom of movement, but if GDPR compliance is a bigger deal than that then "just never go to Europe" can be a viable strategy.


Oh yes, I'm going to try and see if they have data on me and send a number of GDPR requests if they do. For others from the EU, it's very easy to do using: https://www.mydatadoneright.eu/request

So... if the owner is known, it will be quite costly ;-)

It's no secret who is behind that website [1].

Good luck to the EU on enforcing their law against an American company, though.

[1] https://angel.co/company/peopledatalabs/people


I don't know how accurate the coordinates of your address in India are, but it's 5 minutes away from me. Small world, huh?

I'm glad they don't have jack shit on me besides my email. Is there a list of their data source(s)?
