Out of the box it does not even bind to a public internet address. Somebody configured this to 'fix' that and then went on to make sure the thing was reachable from the public internet on a non-standard port that on most OSes would require you to disable the firewall or open a port. The ES manual section on network settings is pretty clear about this, with a nice warning at the top: "Never expose an unprotected node to the public internet."
Giving read access is one thing. I bet this thing also happily processes curl -X DELETE "http://<ip>:9200/*" (which deletes all indices). Does it count as a data breach when somebody from the general public cleans up your mess like that?
In any case, Elasticsearch is a bit of a victim of its own success here and may need to act to protect users against their own stupidity, since clearly masses of people who arguably should not be making technical decisions now find it easy enough to fire up an Elasticsearch server and put some data in it (given the number of companies that seem to be getting caught with their pants down).
It's indeed really easy to set up. But setting it up properly still requires RTFMing, heeding the warning above, and having some clue about what IP addresses and ports are and why having a database with full read-write access on a public IP & port is a spectacularly bad idea.
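For context, the 'fix' in question usually comes down to a single line in elasticsearch.yml; a sketch (values illustrative, option names per the ES network settings docs):

```yaml
# Default behaviour: bind to loopback only; the node is unreachable from outside.
# network.host: 127.0.0.1

# The dangerous "fix": bind to every interface, public ones included.
network.host: 0.0.0.0
http.port: 9200
```

One edit, and the node answers anyone who can reach port 9200.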
ES is a database that has to exist on a network to be usable. Heck, it expects that you have multiple nodes, and will complain if you don't. So one of the first things you do is expose it to the network so you can use it.
Yes, it takes some serious incompetence to not realize you need to secure your network, but why in the world would you not add basic authentication into ES from the start? I'd never design a tool like a database without including authentication.
I am serious about my question. Could anyone clue me in?
If you are running MySQL or Postgres on a public IP address, it would be equally stupid and irresponsible, regardless of the useless default password that many people never change, unless you also set up TLS properly (which would require knowing what you are doing with e.g. certificates). The security in those products is simply not designed for being exposed on a public IP address over a non-TLS connection. Pretending otherwise would be a mistake. Having basic authentication in Elasticsearch would be the pointless equivalent. Base64-encoded plaintext passwords (i.e. basic authentication over HTTP) are not a form of security worth bothering with. Which is why they never did this. It would be a false sense of security.
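The "base64 is not security" point is easy to see concretely; a quick sketch using Python's standard library (the credentials are made up):

```python
import base64

# What the client sends for HTTP basic auth: just the base64
# encoding of "username:password", with no secrecy at all.
credentials = "admin:hunter2"  # made-up credentials
header_value = "Basic " + base64.b64encode(credentials.encode()).decode()
print(header_value)  # Basic YWRtaW46aHVudGVyMg==

# Anyone who can see the traffic (plain HTTP) reverses it trivially:
decoded = base64.b64decode(header_value.split(" ", 1)[1]).decode()
print(decoded)  # admin:hunter2
```

Without TLS underneath, the "password" travels the wire in a trivially reversible encoding.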
At some point you just have to call out people for being utter morons. The blame is on them, 100%. The only deficiency here is their poor decision making. Going "meh, HTTP, public IP, no password, what could possibly go wrong?! Let's just upload the entirety of LinkedIn to that." That level of incompetence, negligence, and indifference is inexcusable. I bet MS/LinkedIn is considering legal action against the individuals and companies involved. IMHO they'd be well within their rights to sue these people into bankruptcy.
MySQL, in comparison, won't even let you complete the install without setting a root password. And it only listens on localhost/unix socket by default. You then need to explicitly add another user if you want to allow login from a non-local IP. I don't think it's even possible to both set a blank root password and allow login from a public IP.
So you really think the solution is to blame some low level worker, and sue him/her? The blame should always be on the people in charge, usually the CEO, who set the bar for engineering practices, proper training, etc, or the lack of.
Typically, people start out using knives and bicycles as children, learn through experience that crashing and getting cut hurt, and carry those lessons forward when they start using tablesaws and cars later in life. How does this apply to elasticsearch? I have no idea.
I'm not aware of "bind to localhost" being the default, either. The skip-networking setting to only allow local socket connections is definitely not the default, and I'm pretty sure the default is still to bind to all interfaces.
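Whatever a given distro ships, the knobs under discussion live in the server config; a sketch of my.cnf (shipped defaults vary by distro and MySQL version):

```ini
[mysqld]
bind-address = 127.0.0.1   # listen on loopback only (what Debian/Ubuntu ship)
# skip-networking          # refuse TCP entirely; unix socket connections only
```

So "bind to localhost" may be a distro packaging decision rather than an upstream default.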
Software should be built in whatever way delivers maximum value to its users. A trade-off for usability can be made in certain cases, like ease of use for new software. Redis went through this a while ago: http://antirez.com/news/96.
Engineers should know their tools before using them. It's a huge part of our jobs. You could introduce a ton of other vulnerabilities in software: XSS, SQL injection, insecure cryptography. Security is part of our job, and something we must know.
You don't blame a plane for a pilot mistake that was meant to be part of his training. Engineers in every other sector are responsible for their mistakes, we should be too.
Also, you don't sue the worker, you sue the company.
Yes, and defaulting to insecure, thus repeatedly causing huge data breaches, is the exact opposite of delivering maximum value to users. It's delivering maximum liability.
Did you miss that Boeing is right now risking bankruptcy for doing exactly this?
There’s so much emphasis on abstracting away the systems with cloud-this and elastic-that and developers don’t know much about general systems engineering.
My recommendation to software developers: take the Network+ and Security+ exams at the bare minimum.
Honestly, as much as people complain about process getting in the way of things, there should be checks and balances at any business that deals with personal information. Financial institutions are heavily regulated; these fkers should be held accountable.
Maybe the hint is right there in your comment. Nearly all the people deploying these nodes aren't engineers in the slightest, despite someone having given them such a title.
Sometimes software managers have the sudden need to show statistics and other things.
Yeah, that was fun...
It would keep ES from being completely open. If you wanted to get in, you'd have to compromise some part of the network that would let you read the username and password.
The way it is now, anyone can do a scan for port 9200 and get full access right away.
It is also important to have a username and password, even on secured networks. My test instance is on an internal network, and protected by both network and host firewalls, but I still make sure to secure it beyond that.
Basic auth would not provide a false sense of security. It is simply a very basic part of overall security. Not having it is a mistake.
Your attitude is a symptom of a broader issue that plagues this industry: indifference to risk*probability. If you don't ship software with "secure defaults" (depending on the threat/attack model), you are essentially handing out loaded shotguns, then blaming the "dumb" user when they inevitably point one at their foot and pull the trigger. Easy solution: don't hand out the gun loaded. Make the user take specific actions that enable the usage. Yeah, it creates some friction for first-time deployment, but that's a secondary concern next to having your freaking DB leaking all over the place.
If firing up a piece of software creates an unauthenticated, unprotected (non-TLS) endpoint to read-write data, that's a loaded gun. That is PRECISELY the default behavior of ES.
ES has jacked around for years by making TLS and other standard security features premium. To that, I say this: Screw ES and their bullshit business model. Their business model is a leading cause to dumbasses dumping extremely sensitive PII data into a DB that is unprotected - those same folks aren't going to go the extra mile to secure the DB, either by licensing or 3rd party bolt-ons.
That's why it must be shipped secure by default. Anything less is a professional felony, in my eyes. Also, screw ES again, in case I wasn't clear.
The fact that someone else down the chain should have known better is not a perfect defense. If that misuse was foreseeable and you didn’t do enough to prevent or discourage it, then you can still be held liable.
Even with ES deployed in an environment with proper network firewall rules...etc, I'd still want some sort of authentication/RBAC
A single layer of cloth might not hold water, adding more layers of cloth may hold water for longer, but it's probably more cost effective to start with the right material.
That’s absolutely correct! But you seem to be missing the fact that _all_ layers of security are always imperfect.
History is our greatest teacher. I think ES will end up doing what that team did: agree to provide sensible & secure defaults.
1. Only listen to localhost and unix sockets
2. Not generate any default passwords
Have you ever heard of the end-to-end principle, IPv6, or number 4 of the eight fallacies? http://nighthacks.com/jag/res/Fallacies.html
I've met at least one cloud provider in the past (a small Dutch outfit) that provides _only_ public IP addresses. They do have customers, though one less now. Clustering over the public Internet is a thing. It shouldn't be, but I could say the same about this website, and yet here we are.
Any network may become public by accident unless you go to great lengths to make sure it doesn't. Configurations change and mistakes are made even by seasoned people. People bring devices. Unless there's an air gap, people's devices may be hacked and let stuff through. Put authentication and anti-CSRF on _all_ your stuff, always.
It is, sort of, https://www.elastic.co/guide/en/elasticsearch/reference/curr...
But it's not a feature you'd be using without a really good reason IMO.
Instead of that they could implement a PAKE. That would provide security with no certificates.
You'd think that at some point we'd understand that there's way more morons out there than sensible people.
Because auth was part of their paid service (and by paid I mean "very goddamned expensive") until about half a year ago, when they made it free in response to Amazon's freshly emerged Open Distro free auth plugin.
What a horrific title. Even simply typing that should have been a blinking neon sign to them that they had their priorities in the wrong order.
The usual way of using this service is to have backend network configured that connects your services that is not available from outside (ie you have to traverse through services to reach it).
The so-called "security" is just a paid feature for companies that want to use Elasticsearch but want to use it in a "legacy" way because, presumably, they don't have the people to design it correctly.
That means that if someone manages to get access to that network, everything is open. I'd say the public internet with proper (encrypted) password auth is more secure than that.
The pods are akin to localhost networking where there is only one externally available application with multiple networked components.
Having a password adds a small layer of protection to databases that the affected app wasn't meant to connect to.
It adds some protection in that case, but the user should use best judgement if it's worth doing.
MongoDB also by default does not have username+password authentication turned on.
I think defaulting to username+password is a relic of the pre-cloud era, and nowadays is not optimal.
Option one: they generate a unique password for every installation – non-trivial to do, because at what point do you do it? It can't be before a cluster is formed, as you'd have a split brain generating a bunch of credentials. If you do it afterwards, then there is a period of time when your cluster is not yet protected. Worse yet, unprotected and handshaking authentication. So you don't do that.
You could make the user input the credentials. What is to prevent them from creating weak credentials? And worse, they have to do that for every node (or at least the masters). Not a good experience and lost credentials will probably be the subject of a good many support calls.
So most products don't do that. What they do is default passwords. Which is arguably no security at all and doesn't protect anything. It may make it just a tiny bit easier to do the right thing afterwards (by changing to better credentials). Still, there's a period of time while the cluster is unprotected (default credentials are as good as no credentials).
Authentication does little to protect against the sort of people who are exposing databases to the public. If it is easily disabled, then they will be doing just that. Because they are already doing that by forcing databases to bind to publicly accessible interfaces.
Yes, lost credentials will be the subject of many support calls. Then it boils down to your priorities. If you care about minimizing support calls, then sure, leave everything open to everyone. It will surely result in fewer access problems.
On the other hand, if your motivation is actually preventing your end-users from doing stupid things, it makes sense to just do the most conservative thing as default. Let the user change to the more liberal option, but not before informing them of all dangers that might befall them in that case.
I refuse to believe in this narrative of the end-user just being a stupid automaton who does not have any agency, and that any default imposed upon them will just result in them overriding the default with their terrible practices and ideas. I think there is a possibility of education and risk reduction.
So username+password really is needed. And should be included by default.
Also, I'd expect the same of something like MongoDB. That it doesn't have that by default is just baffling.
They aggregated the data and published it so that the viral breach would spread their name around because all publicity is good publicity.
Just riffing of course.
This makes me want to talk to a lawyer.
Bind to all interfaces used to be the default in 1.x - it changed pretty much because people were footgunning themselves.
Coupled with lack of security in the base/free distribution, that made for a dangerous pitfall. At least now security is finally part of the free offering, but the OSS version still comes with no access control at all.
It doesn't matter then if you bind it to 0.0.0.0.
To add to that: no security also means no TLS, neither in cluster communication nor when speaking to the client, etc.
Documentation is not security. If you need to "RTFM" to not be in an ownable state it's ES's fault.
The only thing you can do to secure your software is to restrict its communication channels. Once you've secured the communication channels, the software auth is decorative at best.
I can't believe anyone shipping a datastore could let it happen after that. Doesn't PostgreSQL still limit the default listen_addresses to local connections only? Seems like the best approach. In a distributed store, consistency operations between nodes should go over a different channel than queries, and should be allowed on a node-by-node basis at worst. At least at that point, it requires someone who should know better to make it open to the world. Even when just listening for local connections, passwordless auth should never be a default.
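For comparison, a sketch of the PostgreSQL default being referred to, from postgresql.conf (the commented line shows what opening it up would look like):

```ini
# postgresql.conf -- connection settings
listen_addresses = 'localhost'   # default: local connections only
#listen_addresses = '*'          # exposing to the world is a deliberate act
port = 5432
```

Remote access additionally has to pass the host rules in pg_hba.conf, so the secure path is the path of least resistance.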
My understanding is that neither company owns this data set, and there is an assumption that it is a third company that has either legally or illegally obtained the data and is using it for their own services.
Another option is that the data was exfiltrated by a loose group of people who wanted this to be freely available on a random ip. Know the ip, get sick access to a trove of PII. No logins, no accounts, no trace.
Welcome to the early 90s internet.
I would bet that in a lot of cases, people that configure their servers like in the OP just don’t read the official docs at all.
Stack Overflow, Quora, etc. are great places to get answers, because of the huge amount of questions that have already been asked and answered there.
But when people rely solely on SO, Quora, blog posts and other secondary, tertiary, ..., nth-ary sources of information, Bad Stuff will result, because of all the information that is left unsaid on Q&A sites and in blog posts. (Which is fine on its own – the problem is when the reader is ignorant about the unsaid knowledge.)
> and having some clue about what ip addresses and ports are and why having a database with full read write access on a public ip & port is a spectacularly bad idea.
Again, not necessarily, for the same reason as above.
But even if they did, it is a sad fact that a lot of people dismiss concerns over security with the kinds of “counter-arguments” that I am sure we are all too familiar with. :(
Thankfully though, we are beginning to see a shift in legislation being oriented towards protecting the privacy of people whose data is stored by companies.
Ideally, the fines should cause businesses to go bankrupt if they severely mishandle data about people. Realistically that is not what happens. For the most part they will get but a slap on the wrist. But it’s a start.
Companies that can’t handle data securely, have no business handling data at all.
I understand that the "ephemeral" nature of EC2 was in the documentation, but ESL speakers may have glossed over the significance of a word they didn't fully comprehend.
This is just another symptom of the Principal-agent problem writ large.
It seems like PDL's core business model is irresponsible regarding their stewardship of the data they've harvested.
Disclaimer: I'm one of the creators of yourdigitalrights.org.
Outlawing the collection of data would be hard and is unlikely to work, but the fact that companies like AT&T are allowed to sell your data, as they did with OP's (where else would that unused phone number come from), is an angle new legislation can use.
The EU now already has a piece of legislation aimed at stifling these practices. The US and other economies just need to follow suit.
The private agents were armed with the latest available discounts (which you could find for yourself if you tried). But their skills made them far more successful than a typical front-line sales employee.
The catch? It wasn't a scam, and they really were trying to get their targets to switch. It seems that AT&T was more willing to sell consumer data than the general public is aware of. Converting their targets to AT&T granted their agency access to additional data, which they then passed on to their clients. And the target gets a discount, too. Win-Win-Win? :)
They didn't take my report seriously (still not completely patched) and I feel like that told me all I needed to know about their security practices.
I randomly check every 6 months or so and yep, still not fixed.
She sent it religiously, every 90 days.
How the hell could she think that your email address was hers? I mean, wouldn't she notice that she never got the messages?
I can imagine someone mistyping an address, and then reusing the "to" link.
After I had received a person's bank and mobile statements and many other bills for a few months, I decided one day to call him (his number was easily visible in many of the emails) and inform him of his mistake. He turned out to be a lawyer, and he said he would "decide" what to do about it. The next thing I knew, he sent a carefully drafted email (as a legal notice) saying that I should hand over my email address to him without further delay, and all that.
I didn't do that. I talked to a lawyer friend, and he just told me to reply with a "G F Y" card. I didn't do that either. But it pushed me to finally move my email to my personal domain, as it was/is a Gmail account, and if someone had complained, Google might have just terminated my account, and I don't know anyone who works at Google.
Downside of getting in early on popular email services.
What amazes me is when I get misaddressed email, and I reply to say it's misaddressed (and I'm not talking about automated services, I'm talking about obviously manually sent stuff), and my reply just gets ignored and the misaddressed email just keeps on coming.
Lady, whoever you think is going to be at that funeral isn't getting that message.
I've no idea if they'll get disconnected now as I've blocked their number. Hope so maybe they'll notice then.
We had full access. I could have signed this person up for the most expensive package, or even canceled their service.
The first thing I said at the counter was "I know it's really hard to cancel Comcast, and I'm not going to accept anything but a cancel."
The girl at the counter smiled and said "We know ..." and immediately cancelled my account.
Maybe that's how we drive their customer count and revenue down and put them out of business.
IIRC I logged out again and back in; same thing, my credentials worked. Went back to it a few days later and the password no longer worked.
How have they not resolved this?
About.me's business model was quite unsettling to me and they have made little to no effort to protect the user data from scrapers.
I tried it on a friend and it worked, but LinkedIn's response was basically "meh".
My life has only gotten better since I deleted LinkedIn a few years ago. I know I'm in a privileged position to be able to do that, but I strongly recommend everyone here consider whether what they gain from their account is worth the crap and spam they have to put up with.
The article says some LinkedIn data was scraped, but I don't see anywhere that it specifically says a LinkedIn security flaw was used in the scraping. Although it is vague about what data was scraped and how, so it doesn't preclude that either.
In other words, are you saying a LinkedIn vulnerability was exploited here, or suggesting that it probably was, or are you just mentioning LinkedIn because it's tangentially related?
That is because a couple of days ago, I got a text message from T-Mobile (which seemed genuine) basically saying that my account was one of a larger subset of prepaid phone accounts which had been compromised, and that my personal information had potentially been taken by "hackers".
To which I got a good chuckle, because T-Mobile is one of the few phone companies that will let you create completely anonymous prepaid accounts using cash, without filling out any information. I.e., you buy a SIM card for $$$ and that is it. So basically the only information of mine they lost, as far as I can tell, is the phone number and the type of phone I'm using (which they gather from their network). If they had got the metadata about usage/location/etc., that would have been different, but it didn't sound like the hackers got that far.
Had this been a post-paid account they would have my name/address/SSN/etc.
I’m of the opinion it’s too late for prevention and we need, instead, mitigation.
So depending on how the "anonymous" phone number was used, it's plausible that the number can be connected with other PII.
In fact I wonder if there is any such thing as non-PII, given the existence of such companies.
If we assume that isn't happening in the very immediate future due to the latency of introducing new legislation...
Do we have any other options to protect ourselves?
I've personally worked myself into a bad credit rating. I have a home loan and a credit card, but any new credit applications auto-reject. Not the ideal scenario though!
"Oxy" most likely stands for Oxylabs, a data mining service by Tesonet, which is a parent company of NordVPN.
It is probably safe to assume that LinkedIn was scraped using a residential proxy network, since Oxylabs offers "32M+ 100% anonymous proxies from all around the globe with zero IP blocking".
Knowing how quickly it's expanding: are the employees just as unethical, or do they simply not connect the dots (has the company got too big)?
I hate FB et al. as much as anyone else here, but most people know that "if it's free, you are the product". With NordVPN, though, users are paying money and still getting stabbed in the back.
Most people's ethics are easily bought. Does working for a company that operates with questionable integrity outweigh providing a stable income for your family?
Remember Facebook is still a very highly desirable company to work at.
could you please expand on this claim?
Is it still possible if you pay LinkedIn enough? Or is this old data?
Both are aggregators that get data from many sources, correlate them, and sell it. The phone numbers and emails could have come from anywhere.
See this screenshot from PeopleDataLabs: https://d1ennknj6q36vm.cloudfront.net/images/cblead.png
Is there at least a less shady provider if I would like to compromise myself but a bit less than nordvpn? How far do we go in assuming all are bad?
I've been wanting to do some social graph experimentation on it (small scale - say 1000 people near me) but concluded I probably couldn't scrape enough via raw scraping without freaking out their anti-scraping. (And API is a non-starter since that basically says everything is verboten).
Here are some tricks which may or may not work today:
- Have an app where user logs in through said website, then scrape their friends using this user's token. That way you get exponential leverage on the number of API calls you can make, with just a handful of users.
- Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.
- Scrape the mobile website. Even Facebook still has a non-js mobile version. This single WAP/mobile website defeats every anti-scraping measure they may have.
- From a purely practical perspective, start with a baremetal, transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql"; those articles all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in a hurry.
- Don't be too kind on the big websites. They can afford to keep all their data in hot pages, and as a one man you will never exhaust them.
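On the IPv6 trick above: the flip side is what a prefix-aware limiter has to do, since one host can trivially hop through an entire /64. A minimal sketch in Python's stdlib (the /64 granularity is my assumption; a site might bucket coarser, e.g. /56 or /48):

```python
from collections import Counter
from ipaddress import IPv6Address, ip_address, ip_network

def rate_limit_key(addr: str) -> str:
    """Collapse IPv6 clients to their /64 prefix; keep IPv4 addresses as-is."""
    ip = ip_address(addr)
    if isinstance(ip, IPv6Address):
        # strict=False masks off the host bits, yielding the containing /64
        return str(ip_network(f"{ip}/64", strict=False))
    return str(ip)

hits = Counter()
for client in ["2001:db8::1", "2001:db8::2", "2001:db8::ffff", "203.0.113.9"]:
    hits[rate_limit_key(client)] += 1

# The three v6 addresses all land in the same /64 bucket.
print(hits)
```

A limiter keyed on the raw address instead of the prefix sees three "different" clients where there is really one.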
> -- From a purely practical perspective, start with a baremetal transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql", those articles will all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in hurry.
Somewhat ironically Elasticsearch would probably work really well for this too (just make sure your elasticsearch isn't open to the world on the internet!).
Sure it will work, but I personally don't like Elasticsearch for anything high-intensity because of its HTTP REST API and the overhead it carries. Take a look at Cassandra's "CQL binary protocol"; it's simple and always on point.
In all seriousness does anyone know why you can even host an elasticsearch database as http and without credentials? Seems to be the default. What is the use case for this?
For a while I've had reoccurring nightmares that my DB had been stolen and published together with an article on how stupid and incompetent I am.
I'll cut straight to the chase and post it on HN. This intermediate step of waiting for someone to discover it takes too long.
That's some extremely shady thing to do.
I usually recommend latency-based dynamic load control for that. Once the website starts to reply 500-1000ms slower than the average one-thread latency, it is time to back off a bit. It is also a co-operative strategy between fellow scrapers, even if they don't know about the other ones putting extra load on the servers.
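A minimal sketch of that latency-based strategy (the thresholds and the doubling/halving policy are illustrative, not the one true tuning):

```python
def next_delay(baseline_ms: float, observed_ms: float,
               current_delay_s: float) -> float:
    """Back off when the server slows down; ramp up again when it recovers.

    baseline_ms: average one-thread latency measured when the site is idle.
    observed_ms: latency of the most recent request.
    current_delay_s: the delay currently inserted between requests.
    """
    overload = observed_ms - baseline_ms
    if overload > 500:
        # Server is 500ms+ slower than baseline: double the delay (cap at 60s).
        return min(current_delay_s * 2 or 1.0, 60.0)
    if overload < 100:
        # Server looks healthy again: halve the delay.
        return max(current_delay_s / 2, 0.0)
    return current_delay_s
```

Because every scraper observes the same rising latency, each one backs off independently, which is what makes the strategy implicitly co-operative.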
YMMV, and cloud providers would hate you for this, but you can automate the IP rotation with a cloud provider that bills you by the hour. It's easier than ever nowadays to spin up an instance in Frankfurt, use it for an hour, and then another in Singapore for the next hour.
Pretending to be Googlebot also helps.
Clever. VMs with IPV6 are cheap as a bonus :)
Same for non-js mobile. Thanks for the tips
How would someone do that using node.js? Asking for a friend.
A more useful answer is: I did this once, many years ago. Back then it was a matter of hooking up PhantomJS and making sure your user string was set correctly. Since PhantomJS was – I think – essentially the same as what headless chrome is today, the server can't determine that you're running a headless browser.
It's not so easy to do that nowadays. There are mechanisms to detect whether the client is in headless mode. But most websites don't implement advanced detection and countermeasures. And in the ideal case, you can't really detect that someone is doing automated scraping. Imagine a VM that's literally running Chrome, with a script set up to interact with the VM using nothing but mouse movements and keyboard presses. You could even throw some AI into the mix: record some real mouse movements and keyboard presses over time, then hook up some AI to your script such that it generates movements and presses that are impossible to distinguish from real human input. Such a system would be almost impossible to differentiate from your real users.
The other piece of the puzzle is user accounts. You often have to have "aged" user accounts. For example, if you tried to scrape LinkedIn using your own account, it wouldn't matter if you were using 500 IPs. They would probably notice.
It's hard to counter a determined scraper.
It's not very hard to get to something that would be too hard for almost every website besides Google and Facebook to bother with. If it's a 1 on a 0-9 difficulty scale, most websites just don't have the resources to detect it.
It took me like ~3 hours to write it, but I guarantee it would take months for someone to detect it, and even then, they'd have a lot of false positives and negatives.
How many fines has GDPR resulted in?
Headless chrome cat and mouse game is a lot of fun. We need more players.
I had a business that was generating more money than my full-time job for a while. We helped and greatly simplified matters for several thousand independent proprietors while having a positive effect on the load of the data source, since we were able to batch/coalesce requests, make better use of caches, and take notification responsibilities on ourselves.
Once in a while someone would get worried and grumpy at the data source and there were a couple of cat-and-mouse games, but we easily outwitted their scraping detection each time. When they got tired of losing the technical game, they sent out the lawyers, which was far more effective. We were acquiring facts about dates and times from the place that issued/decided those dates and times, so there wasn't really any reliable alternative data source, and we had to shut down.
The glimmer of hope on the horizon is LinkedIn v. HiQ, which seems poised to potentially finally overturn 4 decades of anti-scraping case law, but not holding my breath too hard there.
> In a long-awaited decision in hiQ Labs, Inc. v. LinkedIn Corp., the Ninth Circuit Court of Appeals ruled that automated scraping of publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA). This is an important clarification of the CFAA’s scope, which should provide some relief to the wide variety of researchers, journalists, and companies who have had reason to fear cease and desist letters threatening liability simply for accessing publicly available information in a way that publishers object to. It’s a major win for research and innovation, which will hopefully pave the way for courts and Congress to further curb abuse of the CFAA.
In either case my personal data is given away without my consent, but there's this implication that it's only an issue when someone doesn't pay for it.
Normally the company sells this data, but now they've given it away. It's not good this data got out because the curation has some value to spammers or whoever. But using the word "leak" here undermines the severity of a real leak where passwords and social security numbers are exposed. Data that was never meant by anyone to be open.
Everyone here likely has (technically) provided consent for every piece of this information being shared with partners. Buried in fine print that nobody really expected them to read, of course. It's the cost of being online, and that sucks, but it seems to be only a leak of what had already been given out.
You gave that consent when you put your info in Linkedin in the first place, according to their ToS.
Mind you, I didn't research the topic of what can or cannot be requested with FOIA, so I might be totally wrong.
Needing to be logged in as the same user defeats the purpose of proxying to hide your physical origin.
Registering thousands of different users to use in a distributed way is hard now that they require a text message verification for new accounts.
They were ordered to unblock hiQ specifically, they were not ordered to open up content to scrapers generally.
They can still throttle high volume traffic and put up captchas. I think the only specific thing the court ordered was for them to unblock hiQ IP ranges.
This may or may not still work.
You can try it for yourself by changing the email. All of the information is public, so I don't mind. They are basically doing data integration.
So the api knows me as the famous architect, Art Vandelay
Strangely it only says I work in real estate (no I don't) when I looked up the email address I use for LinkedIn...
Indeed they do have a profile on me - a bare minimum, scraped from GitHub. That makes sense, since that's about the only social platform I use, aside from HN.
EDIT: My Gmail address has the most information gathered, which makes sense. It's gathered from Facebook, LinkedIn, Pinterest, GitHub..
It lists my skills as: firefighting and emergency planning/management/services. I suppose, with a stretch of imagination..
More surprisingly, it had data such as my name, title and work email address connected to an old work email account (Okta-managed GSuite) that I never associated with external services, and absolutely never used on a social networking site like LinkedIn.
They can say this all they want, but if you have no presence in the EU, and your jurisdiction does not have any agreement to apply GDPR regulations to you, then this is at most a strongly worded request.
Barring explicit agreements to the contrary (treaties, extradition agreements, etc), by definition a country's laws are only enforceable there.
If PDL has no business in Europe, no plans to expand there, and there's no treaty or other agreement making the provisions enforceable against them, the EU can say whatever it wants but PDL has no legal obligation to do anything about it.
(Assuming anyone were bothered enough to actually do this, of course.)
On the other hand, if you never fall under European jurisdiction in the first place, you're free to ignore them, just as you can ignore Thai laws against insulting their king. One very important thing to note is that setting foot on European soil will expose you to their jurisdiction, so you've significantly limited your freedom of movement, but if GDPR compliance is a bigger deal than that, then "just never go to Europe" can be a viable strategy.
Good luck to the EU on enforcing their law against an American company, though.