
Personal and social information of 1.2B people discovered in data leak - bencollier49
https://www.dataviper.io/blog/2019/pdl-data-exposure-billion-people/
======
jillesvangurp
I was at an Elasticsearch meetup yesterday where we had a good laugh about
several similar scandals in Germany recently involving completely unprotected
Elasticsearch running on a public IP address without a firewall (e.g.
[https://www.golem.de/news/elasticsearch-datenleak-bei-
conrad...](https://www.golem.de/news/elasticsearch-datenleak-bei-
conrad-1911-145091.html), in German). This beats any of that.

Out of the box it does not even bind to a public internet address. Somebody
configured this to 'fix' that and then went on to make sure the thing was
reachable from the public internet on a non-standard port that on most OSes
would require you to disable the firewall or open a port. The ES manual
section for network settings is pretty clear about this with a nice warning at
the top: "Never expose an unprotected node to the public internet."

Giving read access is one thing. I bet this thing also happily processes curl
-X DELETE "http://<ip>:9200/*" (deletes all indices). Does it count as a data
breach when somebody of the general public cleans up your mess like that?

In any case, Elasticsearch is a bit of a victim of its own success here and
may need to act to protect users against their own stupidity since clearly
masses of people who arguably should not be taking technical decisions now
find it easy enough to fire up an Elasticsearch server and put some data in it
(given the number of companies that seem to be getting caught with their pants
down).

It's indeed really easy to set up. But setting it up properly still requires
RTFMing, not dismissing the warning above, and having some clue about what ip
addresses and ports are and why having a database with full read write access
on a public ip & port is a spectacularly bad idea.
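
To make the bind-address point concrete, here is a minimal Python sketch that sorts an IP literal into loopback, private, wildcard, or public. The function name `classify_bind_address` is made up for illustration and is not part of Elasticsearch; it only shows the kind of check worth running on whatever address a service is told to listen on.

```python
import ipaddress

def classify_bind_address(host: str) -> str:
    """Classify an IP literal as 'wildcard', 'loopback', 'private', or 'public'."""
    addr = ipaddress.ip_address(host)
    if addr.is_unspecified:
        # 0.0.0.0 or :: means "listen on every interface", public ones included
        return "wildcard"
    if addr.is_loopback:
        return "loopback"
    if addr.is_private:
        return "private"
    return "public"

for host in ("127.0.0.1", "10.0.0.5", "0.0.0.0", "8.8.8.8"):
    print(host, "->", classify_bind_address(host))
```

Anything that comes back as "wildcard" or "public" deserves a firewall rule and authentication before any data goes in.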

~~~
jerrac
I've been using ES off and on since before 1.0 came out. It has always baffled
me that ES doesn't require a username and password by default.

ES is a database that has to exist on a network to be usable. Heck, it expects
that you have multiple nodes, and will complain if you don't. So one of the
first things you do is expose it to the network so you can use it.

Yes, it takes some serious incompetence to not realize you need to secure your
network, but why in the world would you not add basic authentication into ES
from the start? I'd never design a tool like a database without including
authentication.

I am serious about my question. Could anyone clue me in?

~~~
jillesvangurp
It has to exist on a private network behind a firewall with ports open to
application servers and other es nodes only. Running things on a public ip
address is a choice that should not be taken lightly. Clustering over the
public internet is not a thing with Elasticsearch (or similar products).

If you are running mysql or postgres on a public ip address it would be
equally stupid and irresponsible regardless of the useless default password
that many people never change unless you also set up TLS properly (which would
require knowing what you are doing with e.g. certificates). The security in
those products is simply not designed for being exposed on a public ip address
over a non TLS connection. Pretending otherwise would be a mistake. Having
basic authentication in Elasticsearch would be the pointless equivalent.
Base64 (i.e. basic authentication over http) encoded plaintext passwords are
not a form of security worth bothering with. Which is why they never did this.
It would be a false sense of security.
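
To illustrate why Base64 adds nothing on its own, here is a small Python sketch that builds an HTTP Basic `Authorization` header value and then recovers the credentials from it without any key. The helper names and the sample credentials are invented for the example.

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

def recover_credentials(header: str) -> tuple:
    """Decode a Basic header back into (user, password); no secret is needed."""
    scheme, token = header.split(" ", 1)
    if scheme != "Basic":
        raise ValueError("not a Basic auth header")
    user, password = base64.b64decode(token).decode("utf-8").split(":", 1)
    return user, password

header = basic_auth_header("user", "pass")
print(header)
print(recover_credentials(header))
```

Anyone who can observe the request on a plain http connection can run the second function, which is why Basic auth only makes sense on top of TLS.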

At some point you just have to call out people for being utter morons. The
blame is on them, 100%. The only deficiency here is with their poor decision
making. Going "meh http, public IP, no password, what could possibly go
wrong?! Let's just upload the entirety of LinkedIn to that." That level of
incompetence, negligence, and indifference is inexcusable. I bet MS/LinkedIn
is considering legal action against individuals and companies involved. IMHO
they'd be well within their rights to sue these people into bankruptcy.

~~~
z3t4
Software should be secure by default. Don't blame the user.

MySQL, in comparison, won't even let you install without setting a root password.
And it only listens on localhost/unix-socket by default. Then you need to
explicitly add _another user_ if you want to allow it to login from a non
local ip. I don't think it's even possible - to both set a blank root password
_and_ allow it to login from a public IP.

So you really think the solution is to blame some low level worker, and sue
him/her? The blame should always be on the people in charge, usually the CEO,
who set the bar for engineering practices, proper training, etc, or the lack
of.

~~~
m00x
This is ridiculous.

Software should be built in the best method of delivering maximum value to its
users. A trade-off for usability can be made for certain cases like ease-of-
use for new software. Redis was part of this a while ago
[http://antirez.com/news/96](http://antirez.com/news/96).

Engineers should know their tools before using them. It's a huge part of our
jobs. You could introduce a ton of other vulnerabilities in software: XSS, SQL
injections, insecure cryptography. Security is part of our job and matters we
must know.

You don't blame a plane for a pilot mistake that was meant to be part of his
training. Engineers in every other sector are responsible for their mistakes,
we should be too.

Also, you don't sue the worker, you sue the company.

~~~
CydeWeys
"Software should be built in the best method of delivering maximum value to
its users."

Yes, and defaulting to insecure, thus repeatedly causing huge data breaches,
is the exact opposite of delivering maximum value to users. It's delivering
maximum _liability_.

~~~
sailfast
I would argue that the single command to begin using the application and the
ease of onboarding / querying data was a huge factor in expanding its usage.
Elastic optimized for initial spin-up and getting things running fast. It
works really well! Until you load it full of data on a public IP, that is.

~~~
PeterisP
That single command to spin up the application can easily generate and show a
copyable random secret required to use it, so that you can use it easily but
there's no option to use it _that_ insecurely.
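
A rough Python sketch of that idea, assuming a hypothetical first-run step. The function names are invented here and do not describe how Elasticsearch or any other product actually behaves.

```python
import secrets

def generate_bootstrap_secret(num_bytes: int = 32) -> str:
    """Create a random, copy-pasteable secret from the OS entropy source."""
    return secrets.token_urlsafe(num_bytes)

def is_authorized(presented: str, expected: str) -> bool:
    """Compare secrets in constant time to avoid leaking matches via timing."""
    return secrets.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))

# Hypothetical first start: show the secret once, then require it on each request.
secret = generate_bootstrap_secret()
print("Generated access secret (copy it now):", secret)
print("correct secret accepted:", is_authorized(secret, secret))
print("wrong secret accepted:", is_authorized("guess", secret))
```

The single-command start stays intact; the only extra step for the user is copying one string.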

------
shadowgovt
It's a tragedy that all of this data was available to anyone in a public
database instead of.... _checks notes..._ available to anyone who was willing
to sign up for a free account that allowed them 1,000 queries.

It seems like PDL's core business model is irresponsible regarding their
stewardship of the data they've harvested.

~~~
yoaviram
If you're in Europe or California, I suggest sending both companies an erasure
request:
[https://yourdigitalrights.org/?company=peopledatalabs.com](https://yourdigitalrights.org/?company=peopledatalabs.com)
[https://yourdigitalrights.org/?company=oxydata.io](https://yourdigitalrights.org/?company=oxydata.io)

Disclaimer: I'm one of the creators of yourdigitalrights.org.

~~~
Already__Taken
Can I use this on behalf of my @company users HIBP has just emailed me about?

------
sparkywolf
I found a vulnerability in LinkedIn a few years back that allowed anyone to
access a private profile (because client side validation was enough for them I
guess..?)

They didn't take my report seriously (still not completely patched) and I feel
like that told me all I needed to know about their security practices.

~~~
john-radio
I reported an issue to the LinkedIn competitor
[https://about.me](https://about.me) two years ago where signing in with my
Google credentials gives me access to the account of some random other
person with a similar name to me. I think that during registration, I
attempted to register about.me/johnradio (except it's not "johnradio"), but he
was already using it, and then the bug occurred that gave me this access.

I randomly check every 6 months or so and yep, still not fixed.

~~~
skissane
My gmail is my first initial followed by my last name. There are other people
on this planet with same first initial and last name, some of whom seem to
think that must be their email too, because I keep on getting emails where
they used it to sign up for things.

~~~
Spooky23
I had a lady send me a zip file that contained a VPN client, certificate and a
word document with usernames and passwords to the VPN and a number of
industrial control systems at the factory she was a manager of.

She sent it religiously, every 90 days.

~~~
mirimir
Do you have any clue who she thought you were?

~~~
Spooky23
Oh yes, she was emailing a copy of her stuff to “herself”.

~~~
mirimir
Seriously?

How the hell could she think that your email address was hers? I mean,
wouldn't she notice that she never got the messages?

~~~
Spooky23
Totally serious. There are about a dozen people who regularly do this. One guy
has missed 4-5 job interviews.

~~~
mirimir
So is it typos? Like one letter off?

I can imagine someone mistyping an address, and then reusing the "to" link.

------
slg
The number in the HN headline was changed from 1.2 billion to 1 billion
(despite the original source's headline saying 1.2). It is kind of amazing
that leaking the personal data of 200 million people is now just a rounding
error that can be dropped from headlines.

~~~
class4behavior
Imho, it's more impressive that it's basically a non-story outside of IT
security news.

~~~
trickstra
The general public just shrugs upon hearing such news. They still think there
is nothing dangerous if their data gets leaked.

------
StillBored
I think the solution here is laws which require anonymity, and that includes
in banking (where it will never happen).

That is because a couple days ago, I got a text message from tmobile (which
seemed genuine) basically saying that my account was one of a larger subset of
prepaid phone accounts which had been compromised and that my personal
information had been potentially taken by "hackers".

To which I got a good chuckle, because tmobile is one of the few phone
companies that will let you create completely anonymous prepaid accounts using
cash and without filling out any information. AKA you buy a sim card for $$$
and that is it. So, basically the only information they lost of mine as far as
I can tell, is the phone number and type of phone I'm using (which they gather
from their network). If they got the "meta" data about usage/location/etc that
would have been different but it didn't sound like the hacker got that far.

Had this been a post-paid account they would have my name/address/SSN/etc.

~~~
TheSpiceIsLife
Do you think it’s reasonable to believe your name / address / SSN / DOB / etc
is already out there?

I’m of the opinion it’s too late for prevention and we need, instead,
mitigation.

~~~
a3n
Exactly. The very reason for existence of the two companies, pdl and oxy, is
to tie n pieces of data with m pieces of data.

So depending on how the "anonymous" phone number was used, it's plausible that
the number can be connected with other PII.

In fact I wonder if there is any such thing as non-PII, given the existence of
such companies.

~~~
ryandrake
Companies need to stop treating knowledge of this information as proof that
you are who you say you are. I would have no problem publicly posting my name,
social security number, birthday, mother's maiden name, etc., if not for the
fact that someone can actually use this information to open a bank account or
take out a loan in my name. It's ridiculous that this is all it takes in most
cases.

~~~
TheSpiceIsLife
> Companies need to stop treating knowledge of this information as proof that
> you are who you say you are.

If we assume that isn't happening in the very immediate future due to the
latency of introducing new legislation...

Do we have any other options to protect ourselves?

I've personally worked myself in to a bad credit rating. I have a home loan
and a credit card, but any new credit applications auto-reject. Not the ideal
scenario though!

------
krn
> Analysis of the “Oxy” database revealed an almost complete scrape of
> LinkedIn data, including recruiter information.

"Oxy" most likely stands for Oxylabs[1], a data mining service by Tesonet[2],
which is a parent company of NordVPN.

It is probably safe to assume, that LinkedIn was scraped using a residential
proxy network, since Oxylabs offers "32M+ 100% anonymous proxies from all
around the globe with zero IP blocking".

[1] [https://oxylabs.io/](https://oxylabs.io/)

[2] [https://litigation.maxval-
ip.com/Litigation/DetailView?CaseI...](https://litigation.maxval-
ip.com/Litigation/DetailView?CaseID=Epee88Womxg%3D&logstat=false&Party=Luminati%20Networks%20Ltd.%20v.%20UAB%20Tesonet)

~~~
gorbachev
How is that possible? LinkedIn blocked mining the data this way several years
ago.

Is it still possible if you pay LinkedIn enough? Or is this old data?

~~~
tyingq
A large number of residential proxies and fake LinkedIn accounts would look the
same to LinkedIn as normal browsing.

~~~
gorbachev
There's information in the leak that wouldn't be widely available without
accessing LinkedIn data using their APIs. Phone numbers and emails, for
example.

~~~
tyingq
The article mentions it is a blend of data from
[http://oxydata.io/](http://oxydata.io/) and
[https://www.peopledatalabs.com/](https://www.peopledatalabs.com/)

Both are aggregators that get data from many sources, correlate them, and sell
it. The phone numbers and emails could have come from anywhere.

See this screenshot from PeopleDataLabs:
[https://d1ennknj6q36vm.cloudfront.net/images/cblead.png](https://d1ennknj6q36vm.cloudfront.net/images/cblead.png)

------
Havoc
Out of curiosity how do you guys think they managed to scrape LinkedIn on such
a large scale?

I've been wanting to do some social graph experimentation on it (small scale -
say 1000 people near me) but concluded I probably couldn't scrape enough via
raw scraping without freaking out their anti-scraping. (And API is a non-
starter since that basically says everything is verboten).

~~~
kaivi
I've crawled a popular social network on a large scale, currently doing the
same for dating services as a hobby. God, I wish I still got paid for
webscraping.

Here are some tricks which may or may not work today:

- Have an app where the user logs in through said website, then scrape their
friends using this user's token. That way you get exponential leverage on the
number of API calls you can make, with just a handful of users.

- Call their API through IPv6, because they may not yet have a proper IPv6
subnet-based rate limiter.

- Scrape the mobile website. Even Facebook still has a non-JS mobile version.
This single WAP/mobile website defeats every anti-scraping measure they may
have.

- From a purely practical perspective, start with a baremetal,
transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on
googling "postgres vs mongodb" or "sql vs nosql", those articles will all end
in "YMMV". What you really need is massive IOPS, and a multi-node ring-based
index with ScyllaDB will achieve that easily. Or just use MongoDB on one
machine if you're not in a hurry.

- Don't be too kind on the big websites. They can afford to keep all their
data in hot pages, and as a one-man operation you will never exhaust them.
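
From the defender's side, the IPv6 point in the list above comes down to which key the rate limiter counts on. A minimal Python sketch, assuming a /64 is the unit handed to a single customer (a common allocation, but an assumption here) and using the invented helper name `rate_limit_key`:

```python
import ipaddress
from collections import Counter

def rate_limit_key(client_ip: str, v6_prefix: int = 64) -> str:
    """Return the key to count requests under: the IPv4 address, or the IPv6 subnet."""
    addr = ipaddress.ip_address(client_ip)
    if addr.version == 4:
        return str(addr)
    # strict=False lets us pass a host address and get its enclosing network back
    return str(ipaddress.ip_network(f"{addr}/{v6_prefix}", strict=False))

requests_seen = ["2001:db8::1", "2001:db8::2", "2001:db8:0:1::1", "203.0.113.7"]
hits = Counter(rate_limit_key(ip) for ip in requests_seen)
for key, count in hits.items():
    print(key, count)
```

A limiter keyed on the full 128-bit address would see the first two requests as unrelated clients, which is the gap the comment describes.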

~~~
davidhyde
You forgot the part about exposing your finished database to an unprotected
Elasticsearch HTTP endpoint ;)

In all seriousness does anyone know why you can even host an elasticsearch
database as http and without credentials? Seems to be the default. What is the
use case for this?

~~~
kaivi
Tbh I'm still selling that data.

For a while I've had reoccurring nightmares that my DB had been stolen and
published together with an article on how stupid and incompetent I am.

~~~
prawnsalad
If I've understood you right, you break the TOS on other websites to collect
users' personal info, and then you have nightmares about people taking that
data from you? Doesn't that raise ethical concerns in your eyes?

------
anilshanbhag
People Data Labs' data is pretty accurate. Here is mine:
[https://api.peopledatalabs.com/v4/person?api_key=9c6a1382204...](https://api.peopledatalabs.com/v4/person?api_key=9c6a13822048c34091683bc4ed9a6528d41e2529532cb6c20cc73287f731ef17&email=anilashanbhag@gmail.com&pretty=true)

You can try it for yourself by changing the email. All of the information is
public, so I don't mind. They are basically doing data integration.

~~~
BoorishBears
Haha, when I was a kid and scared to use my real name for things, for some
reason I used my email... which had my real name in it, to open a Github
account with a fake name

So the api knows me as the famous architect, Art Vandelay

~~~
EGreg
There is a way to get every developer’s email on github thanks to git commits
adding it :))

~~~
blotter_paper
In your github account you can add a new email address that doesn't even exist
or have a valid TLD, like "name@mail.fake". Don't use it as your primary email
and it won't require confirmation. You can now set your git user.email to this
fake address and any commits you make will be attributed to your account
without exposing your actual email address.

~~~
brobinson
You can use yourgithubusername@users.noreply.github.com instead of adding a
fake email, and your commits will still show up on your contribution graph and
be linked to your username.

------
mjparrott
It should be illegal for any company to store my private information like
this. The 'anonymous' sharing of my information is easily de-anonymized. Sites
asking for your phone number for "security purposes" are a joke.

You just have to accept that absolutely everything you've done online is
public information. If it isn't now, it is being stored and future tools /
databases will make what is either difficult to access or difficult to
interpret very easy to use in the future.

~~~
breischl
Using phone number as an example of private information is pretty hilarious.
Remember when the phone company used to literally print your name and phone
number in a book and send it to everyone in your town? Man, their security was
_terrible_!

But it works perfectly fine as a two-factor auth mechanism to prove that
whoever set up the account is the same person trying to log into it at some
later time.

~~~
perl4ever
Birthday is commonly used to verify people despite the practice of
broadcasting it to people on Facebook.

------
rohan1024
Firefox monitor can tell you if your information was leaked in data breaches.
I don't think they have this data set though.

[https://monitor.firefox.com/](https://monitor.firefox.com/)

~~~
mcbutterbunz
Does this cover more leaks than haveibeenpwned.com?

~~~
pmh
Maybe in the future it will, but it uses Have I Been Pwned. From the FAQ[0]:

How does Firefox Monitor know I was involved in these breaches?

Firefox Monitor gets its data breach information from a publicly searchable
source, Have I Been Pwned. If you don’t want your email address to show up in
this database, visit the opt-out page.

[0] [https://support.mozilla.org/en-US/kb/firefox-monitor-
faq#w_h...](https://support.mozilla.org/en-US/kb/firefox-monitor-faq#w_how-
does-firefox-monitor-know-i-was-involved-in-these-breaches)

------
arbuge
> 400 million+ phone numbers. 200 million+ US-based valid cell phone numbers.

Sounds like a nightmare in the making for those cell phone users and their
carriers when those begin to get SIM jacked.

~~~
pc86
Is that all you need to SIM jack a phone? The phone number?

~~~
peterwwillis
Yes and no. You need a phone number, but you still need to carry out a
variation of an attack that replaces the SIM associated with that phone
number. Sometimes this is carrier-specific. Sometimes it's trivial, sometimes
it requires a menial amount of work, and in extreme cases you might have to
access an actual network. Most of the time there is nothing stopping the
attack if they have your personal information.

------
kitotik
Yet another Elasticsearch server wide open. This is going to make the flurry
of open mongodb servers look trivial.

~~~
d33
I wouldn't be surprised if the starting point for this vulnerability wasn't
ES, but Docker. Docker by default modifies iptables and if you hack together a
system that uses both software running directly on the host and in containers,
it's going to expose the forwarded containers to the Internet - which you
might not be expecting, since a bind to localhost would be enough to expose a
service. It's always a good idea to have a separate firewall running outside
of your system - this is the one Docker can't fool.

~~~
arpa
No. It's not Docker's fault you did not read the manual and exposed the ports
wrong: you can bind the port to specific IPs for export, and that address
should be 127.0.0.1
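
As a sketch of that check, the Python below inspects a publish spec of the form `[host_ip:]host_port:container_port` and reports whether it is pinned to loopback. The helper name is invented, only IPv4 host addresses are handled, and it assumes a spec without a host IP is published on all interfaces.

```python
import ipaddress

def published_on_loopback(publish_spec: str) -> bool:
    """True only if the spec names a loopback host IP, e.g. 127.0.0.1:9200:9200."""
    spec = publish_spec.split("/", 1)[0]   # drop an optional "/tcp" or "/udp" suffix
    parts = spec.split(":")
    if len(parts) != 3:
        # No host IP (or a bracketed IPv6 form this sketch does not parse):
        # treat it conservatively as not restricted to loopback.
        return False
    try:
        return ipaddress.ip_address(parts[0]).is_loopback
    except ValueError:
        return False

for spec in ("9200:9200", "0.0.0.0:9200:9200", "127.0.0.1:9200:9200"):
    status = "loopback only" if published_on_loopback(spec) else "not restricted to loopback"
    print(f"{spec:22} {status}")
```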

~~~
d33
I see where you're coming from, but I disagree. I believe that good software
and abstractions should take little training to use - everything unintuitive
is a design failure and should be fixed. "Reasonably secure" should be the
implicit default, not something you need to explicitly add. E.g., it's
better to force authentication and force the administrator to add an account
than let everyone in by default. Or it's better to bind to 127.0.0.1 than to
0.0.0.0 by default, like most web servers built into frameworks I saw do.

Unfortunately, instead of good intuition, Docker is built on caveats, be it
networking, storage, caching, image sharing, container/image distinction,
authentication, deployment or building a cluster. Every subsystem I
experimented with "works", but fails in weird ways in some situations. In my
opinion, that means that Docker is a good idea, but has terrible
UX/functionality/error handling. I kind of think the same way of Git.

------
narrator
This is why I lie about my birthdate by a couple of days on anything where
it's not something like a medical record or where I am required to tell the
truth for whatever reason. I also never provide my social security number
unless it is required by law.

~~~
karlding
One of my coworkers generates a fake middle name for every service they sign
up with. According to him, this serves as a unique identifier allowing them to
determine when a service is selling their data to a third party (or data is
being leaked).

~~~
input_sh
Fastmail has subdomain addressing, so if your email is johndoe@example.com, you
can use hn@johndoe.example.com to sign up for HN.

That way you'll know for sure who leaks your data, and nobody's going to strip
it away like some services would strip away plus addressing (as in,
johndoe+hn@example.com).
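
A small Python sketch of the two schemes, using invented helper names and the example.com addresses from above. It also shows the stripping step that defeats plus addressing but leaves a subdomain alias untouched.

```python
def subdomain_alias(address: str, service: str) -> str:
    """johndoe@example.com + 'hn' -> hn@johndoe.example.com"""
    local, domain = address.rsplit("@", 1)
    return f"{service}@{local}.{domain}"

def plus_alias(address: str, service: str) -> str:
    """johndoe@example.com + 'hn' -> johndoe+hn@example.com"""
    local, domain = address.rsplit("@", 1)
    return f"{local}+{service}@{domain}"

def strip_plus_tag(address: str) -> str:
    """What a data broker might do: drop everything after '+' in the local part."""
    local, domain = address.rsplit("@", 1)
    return f"{local.split('+', 1)[0]}@{domain}"

base = "johndoe@example.com"
for alias in (plus_alias(base, "hn"), subdomain_alias(base, "hn")):
    print(alias, "->", strip_plus_tag(alias))
```

The plus alias collapses back to the base address, while the subdomain alias still identifies the service it was given to.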

~~~
archi42
I have excellent results with a subdomain. Even though PDL probably has a lot
of data on me, they have not (yet?) been able to glue it to my primary mail
address. That one only has my name, gender, github, country and name of my
employer. They can't seem to map the remainder to anything else.

------
zmmmmm
From what I could see the data returned on me was all derived from publicly
available sources (eg: my "public" LinkedIn page, my public github page etc).
Perhaps others have more but this looks more like an aggregator of public
information than a breach of non-public information.

Having said that, I find these companies unspeakably evil - their intent is to
make money by harming people (eroding their privacy by making otherwise
private personal information easier to get, obviously a gold mine for identity
thieves etc).

------
imglorp
In retrospect, it would have been interesting to have a bunch of accounts each
containing a unique "map trap", at all of the larger services. Then years
later, when the aggregator/broker guys get hacked/sold/leaked, you'd have some
picture of the genealogy involved.

~~~
3fe9a03ccd14ca5
The problem is that you often can’t find access to the actual “password” used
in the breach. Does anyone know where I can see if it was an actual password
or just some made up thing?

~~~
cannonedhamster
There was no password on the original ES instance it was open to the web.

~~~
3fe9a03ccd14ca5
I meant my password.

~~~
opless
There’s a torrent going around

~~~
spitfire
Do you know where I can find a torrent for this leak?

~~~
opless
I don't think it's wise to add a magnet link here.

But as I recall looking for Breach Compilation may help finding the requisite
gist on GitHub.

------
manigandham
This is all scraped public social media data. No credentials or govt
information. It's very easy to download or buy this data legally.

~~~
greyfox
are you sure? how did you come to that conclusion. thanks for the info though,
very glad to hear it.

~~~
manigandham
Yes, there are dozens of these data enrichment companies. They scrape public
sites and use browser extensions, SaaS tools, inbox addons, etc. They mix it
together into profiles, and pretty much have the same dataset by now.

Clearbit is one of them and even a YC company.

~~~
istorical
Yep! And as someone who has worked with these data sets and worked on the
scraping tools on services like LinkedIn, a lot of the data is outdated,
incorrect, or mixing together different entities with the same name into one
person or splitting the same person into separate entities incorrectly.

------
badrequest
Genuinely hope somebody goes to prison for this, but not gonna hold my breath.

~~~
roywiggins
This data is accessible at small scales just by registering for a free api key
at People Data Labs and making a GET request, and if you want more robust
access you could just pay PDL for it.

~~~
badrequest
Sorry, I should have been clearer, I'm talking about whoever is responsible
for leaving it completely open to the public internet.

~~~
cortesoft
I mean it is INTENTIONALLY exposed to the public... the only mistake is they
are giving it away instead of charging for it. If you don't like it when they
give out all the information for free, it doesn't make it better if they
charge money.

------
octocop
Where can i download the data?

------
Antoninus
LinkedIn is the last social media membership I have. I’ve been mulling over
whether to delete my account because I’m not sure how it will look to
prospective employers.

~~~
pcmaffey
Hope this helps: [https://www.pcmaffey.com/finally-i-closed-my-
linkedin](https://www.pcmaffey.com/finally-i-closed-my-linkedin)

~~~
Antoninus
Thank you for writing this. Much like the fear you expressed, I'm going to
delete my account as soon as I lock in my next job.

------
rm_-rf_slash
Good thing I just updated my LinkedIn profile. Wouldn’t want hackers to think
I have gaps in my resume.

------
tempsy
I've gotten some strange spam phone calls this last week, including like 3
from Egypt. Wonder if this is why.

~~~
giarc
Probably unrelated. These security researchers found this open database, it
doesn't necessarily mean someone else found it.

------
hnick
I guess it's time to start leaking billions of records of junk data to pollute
the waters.

------
codeulike
There's estimated to be 4.4 billion internet users in 2019, so this is over
25% of people on the internet.
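
The arithmetic behind that figure, as a quick Python check. It assumes the 1.2 billion in the headline counts unique people who are all internet users, and takes the 4.4 billion estimate as given; both are rough numbers.

```python
exposed_people = 1.2e9   # headline figure for the leak
internet_users = 4.4e9   # 2019 estimate quoted above

share = exposed_people / internet_users
print(f"{share:.1%} of estimated internet users")
```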

------
hutzlibu
Maybe I am missing something here, but I do not really see the scandal here
with the "leak" and I rather think the term is misleading in this context.

What happened?

As far as I understand, there are companies who search the web for public data
of people like me, without my consent.

Then they sell that data. Also without my consent.

So that data was available anyway, almost for free. If this data contained
sensitive information, then I would see this business practice as a scandal.

But the mere fact that all this data which was gathered without consent is now
available for free because of possible db misconfiguration .. is not a scandal
to me.

And a leak is usually when a company loses sensitive data of its customers,
who expected that data to remain confidential, like emails. Not what happened
here. Feels more like PR.

------
ars
I don't know about other people, but I have zero personal info with LinkedIn
and Facebook.

The only info they have about me is info I don't mind being public. If I want
something to be private I don't tell it to them. It's as simple as that.

Google on the other hand, knows lots of private things.

~~~
nontoxyc
Facebook has a lot of personal information about you even if you have never
had a Facebook account. For example: your GPS location data, approximate age,
gender, ethnicity....

Welcome to the future komrade. Sadly, it's not a matter of just "not giving
them" your location data. Your devices supply it.

~~~
mixmastamyk
And your friends too. I dutifully kept a new number out of FB until a friend
messaged me with, is this your number right? Xxx-xxx-xxx. They can also tag
you and auto tag you through face recognition.

------
aww_dang
Cyber alarmists would call a telephone directory: 'A verified threat
incident'. Yet these are the same companies selling OSINT data. These alarmist
groups need to put down the buzzwords, step off from their white horse and
take a look at the hypocrite in the mirror.

If you use social networks, you don't have a reasonable expectation of
privacy. You've published your data publicly. If you want to keep this
information private, then don't publish it on the Internet.

From: [http://www.dmlp.org/legal-guide/publication-private-
facts](http://www.dmlp.org/legal-guide/publication-private-facts)

>2\. Private Fact: The fact or facts disclosed must be private, and not
generally known.

------
retSava
> In order to test whether or not the data belonged to PDL, we created a free
> account on their website which provides users with 1,000 free people lookups
> per month.

Well that's very generous of them. Now I know what I'm gonna do next.

------
johnchristopher
Is there a way for an individual to check if he's in the dataset? I am
curious about what kind of data they'd have aggregated around me.

------
83-qw-13-f4-as
When this type of leak happens, where does this data actually appear? On the
dark web? Who has access to this and how does one get it?

------
kerng
Seems like the ball is with Google at the moment, the exposed data is on their
GCP servers. So, they can figure out next steps.

~~~
Scoundreller
Imagine the equivalent in another industry:

“Hello, Bank of America? There’s an ATM machine of yours that’s spitting out
cocaine.

Yes, I understand that it’s probably not your cocaine and that’s not your
business, but don’t you think you should maybe shut it down?”

~~~
lucb1e
But would you call VendingMachinesCo because there is a vending machine
outside the local supermarket, operated by said supermarket, that spits out
cocaine? Pretty sure that whatever you put in there is the machine owner's
responsibility, not the manufacturer's. GCP does not put content in their VPSes
themselves the way that a bank operates an ATM.

I think it's more like the responsibility of an ISP to poke their noses in
what they transfer, since it might be illegal content (similar to whether
Google should poke their noses into people's VPSes). I'm not sure if we should
want to require them to do that.

------
outworlder
Why are there people running _anything_ publicly accessible?

If you are running on the cloud, there is no need for any VMs to have any
public IPs at all. Exception for your Bastion host, and even that should be
restricted to known networks.

All incoming traffic needs a layer of indirection. On cloud providers that's
usually their load balancers.

------
DavideNL
I wonder whether FB/Linkedin can manipulate the timing of negative news like
this, for strategic reasons...

~~~
apetresc
Facebook/LinkedIn are not implicated in the breach at all; it was some random
third-party data enrichment service. The Facebook/LinkedIn in the title refers
to the fact that people's FB/LI accounts were one of the fields in the
database. So were their Github, and basically any other public-facing account
that these scrapers can gather.

------
_-___________-_
ES is the new Mongo. If you make software this easy to use, then people with
little or no experience are going to use it. Just have secure defaults, like
authentication. How many times do we have to learn this lesson...

------
neiman
Isn't it creepy that People Data Labs, "a data aggregator and enrichment
company", collected data on 1.2 billion people?

Isn't it exactly what GDPR came to prevent? Are there no Europeans among this
group?

------
lonelappde
Welp, time to change all my passwords, maiden names, and friendships.

------
OkGoDoIt
The IP address in question does not seem to be working at this time. Clearly
whoever runs the server has shut off access. I wonder if someone managed to
save a data dump somewhere?

------
harikb
Unless we go after every customer who used the services of PDL, nothing is
going to change. We will see a $3 fine per individual after 1 or 2 years of
talking about this.

------
bogwog
When will people start going to prison for stuff like this?

~~~
sdinsn
For what? For scraping public data?

------
chiefalchemist
"including close to 260 million in the US."

So basically, _everyone_ in the USA minus those not online. And I bet this
will go unreported by the mainstream media.

------
eyeball
Is there any way to see what data they had on me?

------
naniwaduni
"1B" is a surprisingly bad abbreviation here, considering its resemblance to a
much ... less impressive number.

------
ngneer
Ugh. To whoever is currently wasting their time and effort on differential
privacy, take a good long look.

~~~
padraigoleary
Why? Interested in why you think differential privacy would make any
difference... The fault here seems to be an open ES server.

~~~
ngneer
That is precisely my point. Differential privacy would NOT make any
difference, and I was pointing the many folks who are working on it to the
much simpler issues that are in fact being encountered in the field. This past
IEEE S&P had quite a few theoretical privacy talks.

------
CriticalCathed
Is this public facing information that's been crawled, collected, and
categorized?

------
aussieguy1234
Is it illegal to download/scrape data from a wide open database like this one?

------
rcarmo
I just had a look at “my” data on this and it is almost hilariously wrong.

~~~
sideshowb
Where can we look up our data?

------
roshanravan
To check if you're affected use haveibeenpwned.com

------
greyfox
Does anyone know how we can search the data to find info about our (more than
likely) entries in this database? Or did they simply find it but not release
the info?

~~~
bovermyer
Not the aggregate data set, but one of the two data sources (People Data Labs)
offers free access for under 1,000 searches per month.

------
otakucode
Does this mean we don't need to do a census any more?

------
thephyber
It would be a shame if someone corrupted these ES indexes.

------
musicale
>According to their website, the PDL application can be used to search: Over
1.5 Billion unique people, including close to 260 million in the US. Over 1
billion personal email addresses. Work email for 70%+ decision makers in the
US, UK, and Canada. Over 420 million Linkedin urls. Over 1 billion facebook
urls and ids. 400 million+ phone numbers. 200 million+ US-based valid cell
phone numbers.

Too bad there aren't any laws regulating this sort of private data aggregation
and sale. Well, besides GDPR (which apparently isn't enforced) and CCPA (which
won't be enforced either.)

------
magnusss
Let me make sure I understand: If I take gigabytes of “enriched” personal
information and make it available to the public for free, then I’m an
irresponsible, idiotic, incompetent buffoon. But if I put a paywall in front
of it and sell that same data for a fair price, then I’m a business genius?

Seems to me that if the data is legally acquired and can be legally
distributed, doing so at a cost of zero does not constitute a data leak. It
may be bad business, but since when is that a crime?

------
roshanravan
that IP:9200 address is down, any mirrors?

------
Pywarrior
Where can I download the leaked data?

------
thelittleone
Data Enrichment Companies. Marketing speak for highly vulnerable privacy
eradication service.

Vote #1 for some sort of global GDPR where these businesses are no longer
profitable.

------
4ad
Can I do a GDPR request for the data about myself? How?

~~~
Havoc
>Can I do a GDPR request for the data about myself? How?

And send it where? It's unclear who owns this server

~~~
4ad
People Data Labs?

~~~
vlz
From the article it seems that you can just create a free account and query
your own name.

> In order to test whether or not the data belonged to PDL, we created a free
> account on their website which provides users with 1,000 free people lookups

------
WA
I wonder how high the GDPR fine will be.

------
avocado4
Is Elastic going to be punished under GDPR especially given that it's a Dutch
company?

~~~
arpa
That is a terrifying thought with terrible chilling effect should somebody
official would even voice this thought in any way.

~~~
avocado4
Was this an AI-generated sentence?

------
jka
There's a video at
[https://www.youtube.com/watch?v=VNLEEogFo18](https://www.youtube.com/watch?v=VNLEEogFo18)
where People Data Labs' chief executive speaks at an insurance conference this
year about their business.

They describe the data as being sourced from a 'data co-op' of over 1k
companies which share data. It wasn't clear whether that means that those
companies are collaborating and pooling data, or whether it's a
roundabout/wordy way of saying that they scrape public personal information
from thousands of sites.

They also claim that they're GDPR and CCPA compliant; I'm no expert but I do
find one or two references that seem to suggest that scraping EU citizens'
personal data without consent hasn't been GDPR-compliant for some time.

It does also raise another question: even if PDL themselves aren't GDPR-
compliant, would any resulting fines against them reclaim a significant
portion of the utility captured from the distribution of that data? As per
comments on this thread, PDL API keys seem to be free to create.

Hypothetically speaking it could be within the interests of a group of
businesses to provide a small amount of funding towards operation(s) that
harvest and redistribute personal data: if the revenue base is low, the
operation(s) can eventually fail (once legal proceedings catch up with them)
and the group as a whole incurs little cost.

The speaker also takes a question from the audience regarding potential use-
cases for this kind of personal data, and answers that knowing about an
individual's life events (such as marriage) can be an opportunity to sell
products to them, as can differentiating pricing if they'd just started
smoking cigarettes.

Although I'm no expert, my understanding of insurance has been that risk is
spread across a large pool of customers, allowing them each to pay similar
premiums despite potentially slightly different backgrounds, with the
understanding that they mutually benefit by paying into a shared fund so that
the (random, potentially high-cost) risk of loss to each member is greatly
softened.

We're seeing a situation here where more precise, per-individual data is being
collected across large populations and could potentially be used for price
differentiation.

If the insurance industry doesn't defend itself, this could lead to premiums
which are essentially calculations based on 'pre-existing data' --
information which the consumer may not have consented to sharing, and which an
insurance company might not be able to collect from application forms.
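
A rough sketch of the arithmetic behind that worry, using entirely
hypothetical numbers (nine low-risk members and one high-risk member) and
ignoring overhead, profit margin and regulation. The function names are made
up for illustration. With pooling, everyone pays the average expected loss;
with per-individual pricing, each person pays their own.

```python
def pooled_premium(expected_losses):
    """Everyone in the pool pays the average expected loss."""
    return sum(expected_losses) / len(expected_losses)


def individual_premiums(expected_losses):
    """Each member pays exactly their own expected loss."""
    return list(expected_losses)


# Hypothetical pool: nine low-risk members and one high-risk member.
expected_losses = [100.0] * 9 + [1000.0]

pooled = pooled_premium(expected_losses)          # 190.0 for every member
individual = individual_premiums(expected_losses)

if __name__ == "__main__":
    print("pooled premium per member:", pooled)
    print("individual premiums range:", min(individual), "to", max(individual))
```

The insurer collects the same total either way in this toy model; what changes
is that the high-risk member's premium goes from 190 to 1000 once the insurer
can tell members apart.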

We don't seem to be particularly good, collectively, at escaping from cycles
which seem to introduce or further wealth disparity at the moment and I worry
that this kind of tech-driven attempt to optimize revenue efficiency of the
insurance industry would only lead to further inequality.

