Lawsuit filed against GitHub in wake of Capital One data breach (thehill.com)
101 points by danso 75 days ago | 72 comments

As my lawyer friend said, a good class action is a lawyer’s startup. Litigate one good one and you’re set for life.

In a case like this, you have to be the first to file so that everyone else gets merged with yours.

It’s sort of sad that they were so quick on this but that’s why.

Is the allegation that GitHub was just supposed to know about this? I am sure it's plenty busy trying to keep up with reports, let alone proactively seeking out questionable content. What's next, should there be lawsuits against Pastebin sites?

When it comes to seeking damages, it's standard procedure to throw everything against the wall and see what sticks. "The wall" being everyone with deep pockets, such as Microsoft. I'm surprised Amazon hasn't been sued over this yet.

I expect GitHub to be dropped from the case pretty quickly. They have a safe-harbor exemption from responsibility for content posted by users on their site. The complaint doesn't even allege that anyone ever warned GitHub about the presence of the data; it just baselessly asserts that GitHub "should have known" without providing any support. Section 230 of the Communications Decency Act provides safe-harbor protection to platforms as long as they do not do things that amount to exercising editorial control. (Holding posts for moderation before publishing can cross that line and forfeit the safe harbor, as happened to LiveJournal years ago, but taking posts down afterwards and responding to flagged content does not.)

Capital One could have been proactive by running a script to scrape every github account to see if they had data that was compromised. I’m sure a lot of governments, hackers and graduate student researchers knowingly or unknowingly collected this data.

Are you suggesting they should have been doing this as a matter of course, or once they were notified of the breach and the data available on GitHub?

Wouldn't be a bad idea to throw a few dummy records into the database with unique names and other information, and constantly search for them.
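
A minimal sketch of that canary-record idea (Python; the field names and generator are hypothetical, not anything Capital One actually does):

```python
import secrets

def make_canary_record() -> dict:
    # Unique, obviously-fake values that should never occur anywhere else.
    token = secrets.token_hex(8)
    return {
        "name": f"Zz-Canary-{token}",
        "email": f"canary-{token}@example.invalid",
    }

def contains_canary(text: str, canaries: list) -> bool:
    # Scan a scraped page or dump for any planted canary value.
    return any(c["name"] in text or c["email"] in text for c in canaries)

canaries = [make_canary_record() for _ in range(3)]
simulated_dump = "jsmith@example.com," + canaries[1]["email"] + ",..."
assert contains_canary(simulated_dump, canaries)
assert not contains_canary("ordinary page content", canaries)
```

The hard part, as the replies point out, is deciding where to search, not generating the records.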

Search for them where? Github? Pastebin? Numerous other known and unknown websites? In what format? Do ToS for each of the sites being scraped support this? Do you stick dummy records in every single data set?

That's why I'm asking if the GP was talking about proactively searching or searching once they were aware.

Well, my main point is that I think it is somewhat absurd to blame GitHub for this. Sure, they can scan for SSNs, etc., but how many hackers are going to post stolen information to their GitHub account? A second matter is that there might be a business model around a service that scans for sensitive information or proprietary code on GitHub, Bitbucket, etc. My guess, though, is that it is much easier to detect a breach upon data exfiltration than by scanning the web.
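
For what a naive scan would look like (a sketch; I have no idea what tooling GitHub actually runs):

```python
import re

# SSN-shaped strings: 123-45-6789, or nine bare digits.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b|\b\d{9}\b")

def find_ssn_like(text: str):
    return SSN_RE.findall(text)

# Real-looking leak data trips the scan...
assert find_ssn_like("ssn: 078-05-1120") == ["078-05-1120"]
# ...but so does plenty of benign content, e.g. an order id or seed data.
assert find_ssn_like("order_id = 123456789") == ["123456789"]
```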

Or, if you use UUIDs: just do a web search for some real ones, and scrape for them on any places you are aware of that aren't indexed.

(Don't know how common it is for leaks to be scrubbed before uploading.)

Useless information like UUIDs is normally removed prior to a leak going public. Public dumps rarely resemble the form of the original data.

(IANAL) Seems to me that, yeah, GitHub should have known. I'd recommend going to the complaint[1] and skipping the news article. The complaint states[1],

> This outside individual (“the hacker”) posted this Personal Information on GitHub.com, GitHub’s website, which encourages (at least friendly) hacking and which is publicly available. As a result of GitHub’s failure to monitor, remove, or otherwise recognize and act upon obviously-hacked data that was displayed, disclosed, and used on and by GitHub and its website, the Personal Information sat on GitHub.com for nearly three months.

First, I do not like that the lawyers are punning between hacker "a person who uses computers to gain unauthorized access to data" ("the hacker") and hacker "an enthusiastic and skillful computer programmer or user" ("GitHub's website, which encourages […] hacking").

They allege it was the actual data posted:

> Not surprisingly, therefore, the hacker, a software developer, posted the breached data on GitHub.com

This part is interesting:

> According to the timestamp on the file containing certain Capital One customers’ breached data, the hacker posted the data on GitHub.com on or about April 21, 2019.

Because AFAIK, GitHub does not display upload times on its website, so I'm curious how the plaintiffs came to this conclusion. IIRC, the times GitHub displays, e.g., in its file listings, are the times in the git data, which do not necessarily reflect when the data was first uploaded to GitHub. (E.g., a commit carries a commit timestamp, but you can commit locally, wait two days, and only then push the commit to GitHub.)

> Nevertheless, Capital One did not even begin to investigate the data breach until or around July 17, 2019, when it received an email apparently from a GitHub.com user alerting Capital One that there “appear[ed] to be some leaked” customer data publicly available on GitHub.com.

> GitHub, meanwhile, never alerted any victims that their highly sensitive Personal Information—including Social Security numbers—was displayed on its site, GitHub.com. Nor did GitHub timely remove the obviously hacked data. Instead, the hacked data was available on GitHub.com for three months.

> 22. GitHub apparently did not even suspend the hacker’s GitHub account or access to the site, even though it knew or should have known that the hacker had breached GitHub’s own Terms of Service, which state that: “GitHub has the right to suspend or terminate [a user’s] access to all or any part of the [GitHub.com] Website at any time, with or without cause, with or without notice, effective immediately.”

It seems likely that GitHub wasn't aware. Nowhere in the complaint do I see where GitHub is made aware of this issue prior to it being public.

This part is interesting:

> 28. GitHub had an obligation, under California law, to keep off (or to remove from) its site Social Security numbers and other Personal Information.

> 29. Further, pursuant to established industry standards, GitHub had an obligation to keep off (or to remove from) its site Social Security numbers and other Personal Information.

I don't know if "established industry standards" holds up in court, but the 28 there is interesting. Lawyers writing this complaint, y u no cite what part of CA law? CA's law is actually really easy to browse/lookup if you know what code and what section you're looking for.

[1]: https://www.dropbox.com/s/cjdflk7rh4z8ery/TZ_GitHub_CapitalO...

> First, I do not like that the lawyers are punning between hacker

The term here is equivocating, not punning. It's an informal logical fallacy [1]. People use it a lot when they don't have a good argument.

[1] https://en.wikipedia.org/wiki/Equivocation

If the 'established industry standards' were real, they would hold up in court. Software has no established industry standards. This has been an issue in many court cases: there are many "standards," but none of them are official, so claims of negligence when it comes to software fail no matter how egregious the behavior was.

We saw this with the Toyota 'unintended acceleration' scandal. The court acknowledged that out of 90+ automotive industry 'recommended' and 'suggested' coding practices, Toyota's code followed only 4. It acknowledged that Toyota let software engineers play no role in deciding scheduling. It acknowledged that Toyota software engineers did not have static analysis and other tools, or even a bug tracker. But the court had to find them not guilty of criminal negligence because there simply aren't any legal standards or regulations they could be said to be negligent of.

If you were building a bridge and you hired unqualified engineers, deprived them of the tools they needed, ignored them when it came time to determine scheduling, didn't follow established regulations and standards, etc., the company's executives would be prosecuted for criminal negligence and sent to prison. If software is involved, however, the situation couldn't be more different. It's an issue that has been debated for well over a decade in the ACM at least. Companies don't want to have to pay more for talent, and most software engineers don't want to raise the barrier to entry.

The real danger is that if the software industry waits too long to establish some way of handling these issues, some public tragedy will inspire a kneejerk government response that results in a suffocating set of standards that makes everyone unhappy.

How is a Toyota car not "automotive engineering", regardless of whether the flaw was hardware or software?

I assume 28 is a reference to CCPA, which doesn't take effect until Jan 1 2020 (and won't apply to safe harbor content I'm sure)

The law firm seems to claim the compromised data was hosted on github, but in the article it seems like it was only the hacking tools?

I don't see where you get the impression that it was hacking tools, from the article, which states,

> obviously-hacked data that was displayed, disclosed, and used on or by GitHub and its website, the Personal Information sat on GitHub.com for nearly three months,” the law firm alleged in its complaint against GitHub and Capital One.

Why not sue Cisco for transiting the stolen data on routers they made?

Don't give the MPAA / RIAA any ideas ...

Intel too while we're at it, someone had to calculate those bits!

I believe this is covered by Section 230. Would expect to see this thrown out, but law is crazy so who knows.


This is incredible: they're suggesting that, in the same way that YouTube has content moderators, GitHub should moderate every repository that has a 9-digit sequence. They also say that GitHub "promotes hacking" without any nuance regarding modern usage of the word, and they claim that GitHub had a "duty" to put processes in place to monitor submitted content, and that by not having such processes they were in violation of their own terms of service.

I hope that this gets thrown out. If not, it could have severe consequences for any site hosting user-generated content.

Reminds me of the time our security team tried to add a hook preventing any high-entropy strings from being pushed to git. It lasted half a day, since they forgot about public keys, UUIDs, hex codes...

I've added a few hooks to try to improve code quality over the years. I can't imagine doing that without first running it over the whole repo to make sure it wasn't flagging good code and cleaning up any legitimate bad code that was caught.

Our code quality checks run as a CI make target instead of a hook. We then have our repo set up to disallow direct pushes: everything has to go through a PR, and therefore CI.

Either way, it's important to validate the checks on existing code before making them a blocker.

Yes, which is why I like the make target method. You can make the target apply to new files exclusively at first, then incrementally add old files as they are touched in the course of normal work.

I could see that working if it wasn't done naively. Prompting for confirmation and then whitelisting that high-entropy string if it is deemed safe would probably only be annoying very occasionally and catch slipups. Did they ever revisit it with a more nuanced strategy?

Maybe it's because it's late and I'm tired, but I'm really struggling to come up with a plausible security reason to prevent high-entropy strings from going into version control. Do you recall what the rationale was?

The idea is that a string being high entropy is a good signal that it's a password, API key, cert, etc..
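
Roughly what such a hook computes, sketched in Python (the threshold and example strings are made up):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    # Bits per character of the string's empirical character distribution.
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

api_key = "9f8a3bc17d2e4f0a9c6b5d8e1f2a3b4c"  # random-looking secret
word = "configuration"                         # ordinary identifier

assert shannon_entropy(api_key) > shannon_entropy(word)
# The catch from upthread: UUIDs, public keys, and hex constants are
# just as "random-looking", so a bare threshold flags them too.
```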

>GitHub should moderate every repository that has a 9-digit sequence

and this is assuming it even works, which it won't. It's a cat and mouse game. Once they start to monitor repos for 9 digit sequences, hackers would base64 encode their payload. Once github adds automatic base64 decoding, people would encrypt the payload.
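
The evasion really is trivial; a quick illustration in Python:

```python
import base64
import re

NINE_DIGITS = re.compile(r"\b\d{9}\b")

leak = "ssn=123456789"
assert NINE_DIGITS.search(leak) is not None   # plain dump is caught

encoded = base64.b64encode(leak.encode()).decode()
assert NINE_DIGITS.search(encoded) is None    # same data, invisible to the scan
```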

Try explaining that to the lawyer who filed this.

Tangentially, this reminds me of Github's (very useful) feature that auto-notifies you when your Gemfile contains a gem known to have security vulnerabilities. Maybe someone will someday sue Github for not forcing a repo owner to fix a vuln, thereby causing people to unwittingly clone and use the codebase in production.

Don't most OSS licenses have a "no warranties" clause that protects the author against this? I'm sure that github also has a clause in their ToS saying they make no warranties about the code posted to their platform.

I think the general direction we're moving in is towards a world where freemium and advertising supported services will be phased out and walled gardens are all that will remain on what was once a free web.

Nah... We're going towards a silly lawsuit being thrown out.

> This is incredible: they're suggesting that, in the same way that YouTube has content moderators, GitHub should moderate every repository that has a 9-digit sequence

YouTube does not manually moderate content either; they just have some automated bots for ContentID, and I would argue that's not working very well.

I don't see how GitHub could imitate that; surely companies that are not on GitHub don't want to send GitHub their secrets to check whether they're there. It makes no sense.

Youtube has content moderators


And, if anything, github is almost all text, and most of the interesting stuff is structured. That should be relatively straightforward compared to raw video and freetext on Youtube.

I'm aware of those, but realistically, what percentage of YouTube videos is really reviewed by moderators? I'd put it in the single digits in the best-case scenario. To me it's more a legal excuse of "we are doing something" than something that actually impacts videos.

What fraction of videos are actually problematic? What if it's one in a thousand, and they're reviewing 10x over that? I mean, they also have $120B in the bank. If they thought more reviewers would fix things, I'm pretty sure they'd get more reviewers.

Those numbers only matter if you're very good at finding the problematic videos via some automatic means. You are vastly underestimating the size of the problem.

See also: https://en.m.wikipedia.org/wiki/Base_rate_fallacy

How do you know I'm vastly underestimating the size of the problem? Do you have a source? Random sampling from the ingest stream sounds like a good start for your base rate. Once you have some metadata about bad actors, you can also start hunting their videos and compare your random sample collection stream to the ever-improving bad-actor collection stream.

To be clear, "size" here refers not to the volume of undesirable content, but rather to the total volume of videos submitted to YouTube. In order to use automatic classification, one needs to examine every new video submitted. YouTube gets (last I heard) double-digit hours of video submitted per second. You have to scan an appreciable portion of the video to determine if it's objectionable.

Edit: the common suggestion that you can just ban the few bad actors effectively ignores decades of research into spam detection. Accounts are too easy to create, you have to go after content.

Let's say I have a river. I can only examine 1 drop under a microscope at a time. How do I find the cholera in the river? I take drops and look at them under the microscope. Over and over. I may not sample the whole river, but I can definitely build a representative sample. I can buy more microscopes and hire a bunch of microscopists and increase the sampling rate. Eventually, I'll figure out the amount of cholera per cubic liter, or accept that it's below the limit of detection. In this case, you're arguing that the bad thing is above the limit of detection, so I should be able to find some using random sampling.
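
The sampling idea, sketched with made-up numbers: even a modest random sample pins down the rate of bad items reasonably well.

```python
import math
import random

random.seed(42)

# Simulated "river": 1 in 1,000 items is bad (unknown to the sampler).
TRUE_RATE = 0.001
stream = [random.random() < TRUE_RATE for _ in range(1_000_000)]

# Examine a random sample of "drops" and estimate the rate.
n = 50_000
sample = random.sample(stream, n)
p_hat = sum(sample) / n

# Normal-approximation 95% confidence half-width.
half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)

assert 0.0 < p_hat < 0.01   # estimate lands in the right ballpark
assert half_width < 0.001   # and the uncertainty is small
```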

The metadata I refer to could be all sorts of things. Fingerprints of key frames in the videos, patterns of IP addresses, cities where those streams are coming from, browser fingerprints, etc.

If the videos are drops of water, then you can imagine they start as rain. Every drop falls from the sky, somewhere in the river's basin. You can't possibly examine every drop before it gets to the ground. But you can sample them in the river.

In this case, I bet Google does, in fact, monitor the browser fingerprints of uploaders. Monitors all sorts of data around the upload event. Every single time. And does use all sorts of fancy algorithms to aid in assessment. And, and, they sample the river, with actual human microscopists/moderators, because they surely assume new patterns of plague are continuously evolving.

First, the goal here isn't to find the cholera. We know there's cholera in this river; the goal is to remove it before it gets to the village and is drunk by the townspeople. Same method, though: you can only remove one dropper of water per person sampling.

Now imagine you have not a river, but the Bering strait. As Napoleon said, quantity has a quality all its own. Please, please read about the base rate fallacy.

You can't remove all the bad videos any more than you can remove all the cholera from the river. Do you know how we protect populations from cholera?

* isolate the source and provide supportive care (infected people).

* make sure the source is downstream of the population (livestock).

* On inland waterways where multiple cities depend on the same river, you filter the water until your sampling demonstrates it is below the limit of infectivity.

It took a while to figure out the limit of infectivity for cholera. You still do the work. It's going to take a while to figure out the limit of infectivity for videos, github, etc.

Also, you really have to let go of this base rate fallacy bit. The methods I describe are entirely general.

Also, since you keep talking about the base rate fallacy, please understand I live in the realm of diagnostics. I understand the problem. So, since you brought it up:

The difference between the Bering strait and the Mississippi river (not far from where I grew up) is trivial on the scale of cholera. Flow rate of the Bering strait: ~2.1e4 m3/s. Flow rate of the Mississippi: ~1.7e4 m3/s. Volume of a V. cholerae cell: 1.17e-19 m3. In either case, the difference is about 23 orders of magnitude. If the Bering strait were the lifestream of all 8 billion humans alive today, finding a vibrio in the flow would be equivalent to finding a single person generating half a picosecond of violent video.
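
The order-of-magnitude arithmetic can be checked directly (using the figures above):

```python
import math

bering_flow = 2.1e4        # m^3/s
mississippi_flow = 1.7e4   # m^3/s
vibrio_volume = 1.17e-19   # m^3, a single V. cholerae cell

# One second of flow vs. one cell, in orders of magnitude.
assert round(math.log10(bering_flow / vibrio_volume)) == 23
assert round(math.log10(mississippi_flow / vibrio_volume)) == 23
```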

The point is not finding all of them. The point is sampling well enough that you can calculate the limit of infectivity. Once you can do that, you can start engineering solutions at social scale, e.g. move the village, move the cattle, install filtration. Maybe even fund development of new filtration systems.

> You can't remove all the bad videos any more than you can remove all the cholera from the river.

Good, we agree. YouTube cannot win here.

> Do you know how we protect populations from cholera?

Yes. It's a fundamentally different problem, and the analogy is deeply flawed. Objectionable YouTube videos are curated and concentrated by the very population you're trying to protect. They're also created and circulated by thinking humans who are actively trying to evade the censors and get their videos to the population. The minimum number of videos needed to do significant damage to a vulnerable person is shockingly low (more than one, but frequently fewer than five).

Take these points together and you'll realize that, unlike in the biological examples, it is in fact necessary to virtually, if not entirely, eliminate objectionable videos to avoid their negative consequences. And now we're back to my original point: this is unattainable, even though it seems like it would be. Because of the base. rate. fallacy.

Source: I and many of my colleagues have researched and attacked almost this exact problem.

Bots being right 99.999% of the time, with the remaining 0.001% of cases reviewed manually, is much more practical than hiring an inordinate number of manual reviewers.

This assumes that you know which 0.001% the bot labeled incorrectly. To do that you would need to be able to label everything correctly, i.e. you would also need a perfect bot. More realistically, you would need to sample much more than 0.001% of the input to find the part that is wrong.
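
Hypothetical numbers make the base-rate point concrete: when bad content is rare, even a very accurate bot's flags are mostly false alarms.

```python
daily_videos = 500_000_000
bad_rate = 1 / 100_000            # 1 in 100k videos is actually bad
false_positive_rate = 1 / 10_000  # bot wrongly flags 0.01% of good videos

truly_bad = daily_videos * bad_rate                          # 5,000 videos
false_alarms = (daily_videos - truly_bad) * false_positive_rate

# Even assuming the bot catches every truly bad video, fewer than
# 1 in 10 flagged videos is actually bad.
precision = truly_bad / (truly_bad + false_alarms)
assert false_alarms > truly_bad
assert precision < 0.10
```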

Incorrect results are brought to attention with an appeals process by the uploader, then reviewed manually (in some cases, else might just be black holed by google)

This is not true. ContentId filters copyrighted content.

YouTube also moderates content and demonetizes video with certain content. I am pretty sure at least part of that must be done manually.

You also have automated filters to some extent. Anything featuring the 9/11 tower crashes, for example, is automatically demonetized and hidden from the suggestions regardless of the context.

There are probably several orders of magnitude fewer GitHub repos than videos on YouTube (almost anyone can record a video; few people know how to use git). It's also easier to review source code/text files than videos, where you'd have to watch every frame.

I am not saying that github should be liable for content, but it just goes to show how hard of a problem this is.

GitHub is a touch more difficult I’d think though. How can GitHub monitor for things like credit card numbers, addresses and SSN (as the complaint alleges they were obligated) when there are legitimate reasons for a repo to have completely benign seed data belonging to no one that would constantly trip those “filters.” Even if GitHub were alerted to the data they would have no way of knowing whether they’ve identified criminal activity or practice data.
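
One standard trick for at least the card-number case (a sketch; I have no idea what GitHub actually runs) is to pair the pattern match with a checksum, e.g. the Luhn check that real card numbers satisfy, which filters out roughly 90% of random digit strings:

```python
def luhn_valid(number: str) -> bool:
    # Luhn checksum: double every second digit from the right,
    # subtract 9 from any result over 9, and require sum % 10 == 0.
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# A well-known test card number passes; a benign 16-digit id usually fails.
assert luhn_valid("4111-1111-1111-1111")
assert not luhn_valid("1234567812345678")
```

Of course, seed or practice data with valid checksums would still trip it, which is exactly the commenter's point: pattern matching alone can't distinguish criminal data from benign data.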

I wouldn't say this is a solved problem, but there is a lot of prior art in this space. They're a little more sophisticated than grepping for a nine digit number.

Identity Finder, Varonis, CPUSpider, GroundLabs, StealthBITS

I don't know. I would imagine that if a GitHub employee came across a repository with realistic-looking data from millions of people, they would probably suspend the account.

> GitHub should moderate every repository that has a 9-digit sequence.

I mean when I worked in med and edu tech this was very literally our mandate -- we had to scan every ounce of internal data looking for PII with tools of widely varying quality and scrub what we found.

So I agree that it would be a burdensome and annoying requirement -- I can attest personally -- but it's not impossible. I would imagine the requirement would boil down to a question on an audit that just requires that you're taking some reasonable measures to identify, notify owners of, and take down PII.

Luckily for Github they already have vulnerability scanning so at least some of the infrastructure to build this is already there.

Google already monitors GitHub repositories and notifies you if a Firebase configuration file was accidentally committed.

The same could be done for credit cards, passwords etc, not sure it's Github's responsibility though...

Actually, github monitors for that. They discuss it a bit at https://help.github.com/en/articles/about-token-scanning, although the list of companies is a bit outdated.

This is not a new battle. Essentially the same legal arguments justified the MegaVideo takedown, where the MPAA convinced US military to invade their servers/HQ in New Zealand -- a legal battle that is still ongoing. When something goes wrong, shoot the messenger. You can always claim they should have done more.

But I have faith in the legal team from Microsoft. Capital One though...

I don't think it's exactly the same. The MegaVideo case was about copyright. That is normally protected by the DMCA, but they had some proof that MegaVideo employees uploaded copyrighted material themselves.

This case, on the hand, is not about copyright but privacy sensitive data.

Do you mean Megaupload? If so, it isn't essentially the same thing at all. Megaupload was deliberately hosting content without permission of the copyright holders, pretending to remove it when notified but actually just invalidating the particular URLs that the copyright holders knew about but keeping the material available at different URLs. Their business model was to be a piracy site.

If you can find a copy of the original Megaupload indictment, it includes several emails between Kim Dotcom and other top Megaupload people that lay this all out.

"Convinced the US military"?? They filed a request for extradition, and local police arrested him.

How difficult would it be to gather all the software engineers who are Capital One customers and use GitHub, then have everyone cancel their account? It's not like their product is particularly unique.

> GitHub should moderate every repository that has a 9-digit sequence

You'd need some kind of enterprise data loss prevention system, or at least a regular expression, and how could a little startup like Microsoft afford that?


It's a typical "how much we can pull to settle" lawsuit.

I hope I can claim another $10,000,000/500,000,000 = jack shit from my PII being released again!

Frivolous. Throw it out.

Most websites and their owners on the internet are broke.

Github is flush with cash (relatively speaking).

Makes sense why they would go after them (Microsoft), not that it will stick.

They learned from the Steve Dallas example of suing Nikolta Camera.


But why sue Amazon?
