In a case like this, you have to be the first to file so that everyone else's suit gets merged into yours.
It's sort of sad that they were so quick to file, but that's why.
That's why I'm asking whether the GP was talking about proactively searching or searching once they were made aware.
(I don't know how common it is for leaks to be scrubbed before uploading.)
> This outside individual (“the hacker”) posted this Personal Information on GitHub.com, GitHub’s website, which encourages (at least friendly) hacking and which is publicly available. As a result of GitHub’s failure to monitor, remove, or otherwise recognize and act upon obviously-hacked data that was displayed, disclosed, and used on and by GitHub and its website, the Personal Information sat on GitHub.com for nearly three months.
First, I do not like that the lawyers are punning between hacker "a person who uses computers to gain unauthorized access to data" ("the hacker") and hacker "an enthusiastic and skillful computer programmer or user" ("GitHub's website, which encourages […] hacking").
They allege it was the actual data posted:
> Not surprisingly, therefore, the hacker, a software developer, posted the breached data on
This part is interesting:
> According to the timestamp on the file containing certain Capital One customers’ breached data, the hacker posted the data on GitHub.com on or about April 21, 2019.
Because AFAIK GitHub does not display upload times on its website, I'm curious how the plaintiffs came to this conclusion. IIRC, the times GitHub does display, e.g., in its file listings, are the times recorded in the git data, which do not necessarily reflect when the data was first uploaded to GitHub. (For example, a commit carries a commit timestamp, but you can commit locally, wait two days, and only then push it to GitHub.)
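For what it's worth, the dates GitHub's REST API exposes for a commit come straight out of the git object, so they're set by whoever created the commit, not by the upload. A minimal sketch (owner, repo, and SHA here are made up; substitute a real commit to run it):

```python
import requests

# Hypothetical repo and commit, purely to show which dates the API returns.
OWNER, REPO, SHA = "someuser", "somerepo", "abc123def456"

resp = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/commits/{SHA}",
    headers={"Accept": "application/vnd.github+json"},
)
commit = resp.json()["commit"]

# Both dates live in the git commit object itself; neither says when
# the commit was actually pushed to GitHub.
print("author date:   ", commit["author"]["date"])
print("committer date:", commit["committer"]["date"])
```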
> Nevertheless, Capital One did not even begin to investigate the data breach until or around July 17, 2019, when it received an email apparently from a GitHub.com user alerting Capital One that there “appear[ed] to be some leaked” customer data publicly available on GitHub.com.
> GitHub, meanwhile, never alerted any victims that their highly sensitive Personal Information—including Social Security numbers—was displayed on its site, GitHub.com. Nor did GitHub timely remove the obviously hacked data. Instead, the hacked data was available on GitHub.com for three months.
> 22. GitHub apparently did not even suspend the hacker’s GitHub account or access to the site, even though it knew or should have known that the hacker had breached GitHub’s own Terms of Service, which state that: “GitHub has the right to suspend or terminate [a user’s] access to all or any part of the [GitHub.com] Website at any time, with or without cause, with or without notice, effective
It seems likely that GitHub wasn't aware. Nowhere in the complaint do I see GitHub being made aware of this issue before it became public.
> 28. GitHub had an obligation, under California law, to keep off (or to remove from) its site Social Security numbers and other Personal Information.
> 29. Further, pursuant to established industry standards, GitHub had an obligation to keep off (or to remove from) its site Social Security numbers and other Personal Information.
I don't know if "established industry standards" holds up in court, but the 28 there is interesting. Lawyers writing this complaint, y u no cite what part of CA law? CA's law is actually really easy to browse/lookup if you know what code and what section you're looking for.
The term here is equivocating, not punning. It's an informal logical fallacy. People use it a lot when they don't have a good argument.
If you were building a bridge and you hired unqualified engineers, deprived them of the tools they needed, ignored them when it came time to set schedules, and didn't follow established regulations and standards, the company's executives would be prosecuted for criminal negligence and sent to prison. If software is involved, however, the situation couldn't be more different.

It's an issue that has been debated for well over a decade in the ACM, at least. Companies don't want to have to pay more for talent, and most software engineers don't want to raise the barrier to entry. The real danger is that if the software industry waits too long to establish some way of handling these issues, some public tragedy will inspire a knee-jerk government response that results in a suffocating set of standards that makes everyone unhappy.
> obviously-hacked data that was displayed, disclosed, and used on or by GitHub and its website, the Personal Information sat on GitHub.com for nearly three months,” the law firm alleged in its complaint against GitHub and Capital One.
I hope that this gets thrown out. If not, it could have severe consequences for any site hosting user-generated content.
And this is assuming it even works, which it won't. It's a cat-and-mouse game. Once they start to monitor repos for nine-digit sequences, hackers will base64-encode their payloads. Once GitHub adds automatic base64 decoding, people will encrypt the payload.
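A toy version of round one of that game (the pattern and record are invented; 078-05-1120 is the famously already-public Woolworth sample SSN):

```python
import base64
import re

# Naive scanner of the kind being proposed: look for SSN-shaped digit runs.
SSN_LIKE = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")

leaked = "name: Jane Doe, ssn: 078-05-1120"
print(SSN_LIKE.search(leaked))    # <re.Match ...> -- caught

# One pass of base64 and the same scanner is blind.
encoded = base64.b64encode(leaked.encode()).decode()
print(SSN_LIKE.search(encoded))   # None -- the pattern is gone
```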
YouTube does not manually moderate content either; they just have some automated bots for their Content ID system, and I would argue that's not working very well.
I don't see how GitHub would imitate that. Surely the companies that are not on GitHub don't want to send their secrets over just to check whether they've leaked; it makes no sense.
And, if anything, GitHub is almost all text, and most of the interesting stuff is structured. That should be relatively straightforward compared to raw video and free text on YouTube.
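To illustrate with an invented record: structured leaks often label their own sensitive fields, which makes scanning them far easier than scanning free text.

```python
import json

# Invented example: field names in structured dumps do half the work.
SUSPECT_KEYS = {"ssn", "social_security_number", "dob", "card_number"}

record = '{"name": "Jane Doe", "ssn": "078-05-1120", "city": "Springfield"}'
hits = [key for key in json.loads(record) if key.lower() in SUSPECT_KEYS]
print(hits)  # ['ssn']
```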
See also: https://en.m.wikipedia.org/wiki/Base_rate_fallacy
Edit: the common suggestion that you can just ban the few bad actors effectively ignores decades of research into spam detection. Accounts are too easy to create; you have to go after content.
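To make the base rate problem concrete, here's a back-of-the-envelope sketch with invented numbers: even a very good classifier drowns in false positives when almost everything it scans is benign.

```python
# All numbers invented, purely to illustrate the base rate fallacy.
items_scanned = 500_000_000   # pieces of content scanned per day
bad_rate      = 1e-6          # 1 in a million is actually bad
tpr           = 0.99          # detector catches 99% of bad content
fpr           = 0.001         # and wrongly flags 0.1% of good content

bad  = items_scanned * bad_rate
good = items_scanned - bad

true_hits  = bad * tpr        # ~495 real catches
false_hits = good * fpr       # ~500,000 false alarms

precision = true_hits / (true_hits + false_hits)
print(f"{true_hits + false_hits:,.0f} flags/day, {precision:.2%} of them real")
# => roughly 500,495 flags a day, only ~0.10% of them real
```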
The metadata I refer to could be all sorts of things. Fingerprints of key frames in the videos, patterns of IP addresses, cities where those streams are coming from, browser fingerprints, etc.
If the videos are drops of water, then you can imagine they start as rain. Every drop falls from the sky, somewhere in the river's basin. You can't possibly examine every drop before it gets to the ground. But you can sample them in the river.
In this case, I bet Google does, in fact, monitor the browser fingerprints of uploaders, and monitors all sorts of data around the upload event, every single time, and does use all sorts of fancy algorithms to aid in assessment. And they sample the river, with actual human microscopists/moderators, because they surely assume new patterns of plague are continuously evolving.
Now imagine you have not a river but the Bering Strait. As Napoleon said, quantity has a quality all its own. Please, please read about the base rate fallacy.
* Isolate the source and provide supportive care (infected people).
* Make sure the source is downstream of the population (livestock).
* On inland waterways where multiple cities depend on the same river, you filter the water until your sampling demonstrates it is below the limit of infectivity.
It took a while to figure out the limit of infectivity for cholera. You still do the work. It's going to take a while to figure out the limit of infectivity for videos, GitHub, etc.
Also, you really have to let go of this base rate fallacy bit. The methods I describe are entirely general.
Also, since you keep talking about the base rate fallacy, please understand I live in the realm of diagnostics. I understand the problem. So, since you brought it up:
The difference between the Bering Strait and the Mississippi River (not far from where I grew up) is trivial on the scale of cholera. Flow rate of the Bering Strait: ~2.1e4 m3/s. Flow rate of the Mississippi: ~1.7e4 m3/s. Volume of a single V. cholerae: ~1.17e-19 m3. In either case, the difference is about 23 orders of magnitude. If the Bering Strait were the lifestream of all 8 billion humans alive today, finding a vibrio in the flow would be equivalent to finding a single person generating half a picosecond of violent video.
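Sanity-checking that ratio with the figures above:

```python
import math

# Figures from the comment above.
flows = {"Bering Strait": 2.1e4, "Mississippi": 1.7e4}  # m3/s
vibrio = 1.17e-19                                       # m3 per cell

for name, flow in flows.items():
    print(f"{name}: one cell is 1 part in 1e{math.log10(flow / vibrio):.0f}"
          " of a second's flow")
# Both work out to roughly 23 orders of magnitude.
```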
The point is not finding all of them. The point is sampling well enough that you can calculate the limit of infectivity. Once you can do that, you can start engineering solutions at social scale, e.g. move the village, move the cattle, install filtration. Maybe even fund development of new filtration systems.
Good, we agree. YouTube cannot win here.
> Do you know how we protect populations from cholera?
Yes. It's a fundamentally different problem, and the analogy is deeply flawed. Objectionable YouTube videos are curated and concentrated by the very population you're trying to protect. They're also created and circulated by thinking humans who are actively trying to evade the censors and get their videos to that population. The minimum number of videos needed to do significant damage to a vulnerable person is shockingly low (more than one, but frequently fewer than five).
Take these points together and you'll realize that, unlike in the biological examples, it is in fact necessary to virtually, if not entirely, eliminate objectionable videos to avoid their negative consequences. And now we're back to my original point: this is unattainable, even though it seems like it would be. Because of the base. rate. fallacy.
Source: I and many of my colleagues have researched and attacked almost this exact problem.
YouTube also moderates content and demonetizes videos with certain content. I am pretty sure at least part of that must be done manually.
I am not saying that GitHub should be liable for content; it just goes to show how hard a problem this is.
Identity Finder, Varonis, CPUSpider, GroundLabs, StealthBITS
I mean, when I worked in med and edu tech, this was quite literally our mandate -- we had to scan every ounce of internal data looking for PII with tools of widely varying quality and scrub what we found.
So I agree that it would be a burdensome and annoying requirement -- I can attest to that personally -- but it's not impossible. I would imagine the requirement would boil down to a question on an audit that just requires you to take some reasonable measures to identify PII, notify its owners, and take it down.
Luckily for GitHub, they already have vulnerability scanning, so at least some of the infrastructure to build this is already there.
The same could be done for credit cards, passwords, etc. Not sure it's GitHub's responsibility, though...
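Credit card numbers are actually one of the friendlier cases, because the Luhn checksum lets a scanner throw out most random sixteen-digit strings before anyone has to look at them. A rough sketch:

```python
def luhn_valid(digits: str) -> bool:
    """Luhn checksum: passes real card numbers, fails ~90% of random digit runs."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:   # double every second digit from the right
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0

print(luhn_valid("4111111111111111"))  # True  -- the well-known Visa test number
print(luhn_valid("4111111111111112"))  # False -- one digit off, checksum fails
```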
But I have faith in the legal team from Microsoft. Capital One though...
This case, on the other hand, is not about copyright but about privacy-sensitive data.
If you can find a copy of the original Megaupload indictment, it includes several emails between Kim Dotcom and other top Megaupload people that lay this all out.
You'd need some kind of enterprise data loss prevention system, or at least a regular expression, and how could a little startup like Microsoft afford that?
GitHub is flush with cash (relatively speaking).
Makes sense why they would go after them (Microsoft), not that it will stick.