
Github is failing the same way usenet failed: everybody could post stuff to usenet, just like everybody can create a github repository, and there is nothing that sets an official repository apart from a spammer's repository.

When Amazon has "the everything store" as main strategic goal, they get hit by "90% of everything is junk". So they end up being a store of mostly junk.

Github should figure out if their product is "a repository for everybody" or it is "I can trust this code".

E.g. look at the official PG JDBC: nothing here couldn't be reproduced by a spammer. How do I know that I can trust this and that it is not an infected repo? https://github.com/pgjdbc




> Github should figure out if their product is "a repository for everybody" or it is "I can trust this code".

I'm pretty sure they decided on "repository for everybody" when they first launched the company 16 years ago.


That's a Java library, so you would download it from Maven Central, not GitHub (unless you're doing something non-default)... And Sonatype requires that you prove ownership of the reversed domain used in the groupId, which in this case is `<groupId>org.postgresql</groupId>`. You can see how to do that here: https://central.sonatype.org/faq/how-to-set-txt-record/
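As a rough illustration of the "reversed domain" idea: the groupId is the domain name with its labels in reverse order, so the domain whose ownership has to be proven can be recovered by flipping it back. This is a minimal sketch that assumes the groupId is exactly a reversed domain with no extra sub-namespace; `group_id_to_domain` is a made-up helper name, not part of any Sonatype tooling.

```python
def group_id_to_domain(group_id: str) -> str:
    """Reverse the dot-separated labels of a Maven groupId.

    Assumption: the groupId is exactly a reversed domain name.
    Longer groupIds with extra sub-namespaces would need more care.
    """
    labels = group_id.strip().split(".")
    return ".".join(reversed(labels))


# The groupId quoted in the comment above.
print(group_id_to_domain("org.postgresql"))  # postgresql.org
```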

For extra peace of mind, you can also check the GPG signatures, as all artifacts are signed when published to Maven Central... but you need to get the key used by Postgres to sign that somehow, independently from Sonatype. That's a downside of this mechanism: you just need to know, for each publisher, where to get their GPG keys from. In the case of PG, I couldn't even find it with a quick Google search.


I don't think you truly grasp how small this number is, this is actually good, like really really good. Github has about half a billion repositories.


Not only that, millions of these types of repos get created, and the vast majority are caught and deleted. The article mentions this: "Most of the forked repos are quickly removed by GitHub, which identifies the automation. However, the automation detection seems to miss many repos, and the ones that were uploaded manually survive. Because the whole attack chain seems to be mostly automated on a large scale, the 1% that survive still amount to thousands of malicious repos."


Notice that, as it seems, the vast majority are caught and deleted due to the intense automation, not the detection of malicious contents. If the actor were to run a smoother automation process, probably nothing would have been deleted. (disclaimer: author of this article)


Getting the actual number is probably very hard. These are the infected repos the OP found during their research.


For public repos you can get an approximate number by querying various public datasets.

    SELECT uniqHLL12(repo_name) FROM github_events;
Against https://play.clickhouse.com/play?user=play#U0VMRUNUIHVuaXFIT... returns:

    361648383


They probably mean that the actual number of malicious repos is probably very hard to get.

The article reaches the 100K number by searching for repos with patches with a particular string contained in this specific attack, so it's likely missing many malicious repos that use different methods of infection.


Exactly, GitHub claims to have 400M+ repos making this number 0.025% of repos. I'm sure they could get it lower but less than half of 1% is pretty damn good.
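A quick sanity check of that percentage, as a sketch. The inputs are the round figures mentioned in this thread (about 100k infected repos, 400M+ claimed repos, and the 361,648,383 estimate from the ClickHouse query above); `share_percent` is just an illustrative helper name.

```python
def share_percent(flagged: int, total: int) -> float:
    """Return `flagged` as a percentage of `total`."""
    return flagged / total * 100


reported_infected = 100_000       # the "100K" figure from the article
claimed_total = 400_000_000       # GitHub's claimed 400M+ repos
clickhouse_estimate = 361_648_383  # approximate count from the query above

print(f"{share_percent(reported_infected, claimed_total):.3f}%")  # 0.025%
print(f"{share_percent(reported_infected, clickhouse_estimate):.3f}%")
```

Either denominator lands in the same ballpark, well under half of 1%.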

As a developer I have to do some due diligence about where I'm getting my data from. If I'm slurping in random repos because the name matches that's a people problem, not a github specific problem.


Although finding over 100k infected repos is not good, it does not mean github is failing because the kind of programmer who would include an infected repo can find many other ways to create an insecure product if there weren't infected repos on github.


To be fair, the kind of programmer who would include an infected repo is almost everyone. Many infected repos have no indicators except for username to help you notice without a careful examination, especially in niche repos. When you have to move fast, it's natural to make such mistakes.


Further, transitive dependencies are a real risk. If A depends on B, which depends on C, which depends on D, which depends on E, which depends on F, and F is compromised without the author of E catching it, everyone depending on any of the deps in the chain is at risk.

It's why the JavaScript ecosystem of micro packages is absolutely insane. If someone infected isEven, they'd have a blast radius of 90% of JavaScript devs.

It's much like having a single password protecting everything. JavaScript has way too many of these high value packages that find their way into every modern JavaScript project.
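The transitive-dependency chain described above can be sketched as a small graph walk: invert the "depends on" edges and collect everything that can reach the compromised package. This is only an illustration; the package names are the hypothetical A through F from the comment, and `affected_by` is a made-up helper, not any package manager's API.

```python
from collections import defaultdict, deque


def affected_by(compromised, depends_on):
    """Return every package that directly or transitively depends on `compromised`."""
    # Invert the edges: for each package, record who depends on it.
    dependents = defaultdict(set)
    for pkg, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(pkg)

    # Breadth-first walk up the inverted graph.
    affected = set()
    queue = deque([compromised])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


# Hypothetical chain from the comment: A -> B -> C -> D -> E -> F
chain = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["E"], "E": ["F"], "F": []}
print(sorted(affected_by("F", chain)))  # ['A', 'B', 'C', 'D', 'E']
```

The deeper in the graph a package sits, the more dependents the walk collects, which is the "blast radius" point about tiny, widely used packages.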


It’s possible to get a verified badge on your org page if you prove you own your domain. This can go a long way to improve trust. Your example just seems to have not done it.


Strong disagree. It’s not GitHub’s job to tell you what’s good or bad. Only the user of the code can do that because it’s context specific. “I can trust this code” is a fantasy that won’t happen. Don’t trust code, test it.


This seems like the “don’t use seat belts, drive safely” argument.

Trust mechanisms in GitHub/etc can’t solve the whole problem, for sure.

But some automated safety mechanisms at scale can reduce the risk for those who don’t follow perfect security practices, which has value to the world at large.

Very few of us have the capacity to do even cursory validation for every update to every dependency of every bit of software we use.


I’m not saying don’t scan code for vulnerabilities, I’m saying GitHub shouldn’t be the place that the scanning happens. A good place would be where the code is getting compiled/executed.


That’s simply not possible. How do you define a vulnerability? That’s all context dependent. It could be something as subtle as skipping an auth check if a magic string is part of the payload.

The main benefit of reusing software packages is that you don’t want to spend the effort of writing/reviewing all the internals of the component.

At some point, to trust an abstraction blindly, you need to instead follow reputation. Who has authority to say what is reputable or not is the difficult dilemma.

As seen with CVE authorities lately, it’s not easy. As much as they undermine their own authority by declaring everything as a CVE, vice versa, declaring every org in GitHub as “Verified” may eventually be easy for scammers to get as well.

Back in the day, just having an SSL certificate on your web site was a big stamp of trust. Now everybody has it and it doesn’t mean anything.


Are you saying github should not scan? I don’t think there’s a central planner who will enforce that scanning is only in one place.


Comparing github to usenet feels like a reach. Github has been filled with junk since day 1: people post the same projects, coding exercises, etc. Only a small N% of the repos are actually the interesting ones. This is by design.


It is possible to be a repository for junk, but harmless junk only.


> or it is "I can trust this code".

what might be better would be some kind of trust layer built into package managers so they (optionally) only allow verified repos to be installed


There are countless solutions that try to do this, both official and unofficial, at both the package and repository level. npm from NodeJS comes with a security audit tool, for example, and most code hosting solutions nowadays have at least a SAST tool built in, but expecting more from free services is a bit of a pipe dream.

Obviously it's hard to make a one-size-fits-all solution. Bottom line is that if you use third-party code for anything serious, you have to do your due diligence from a security pov, a vulnerability assessment at the bare minimum.

Lots of big companies are in fact maintaining their own versions of whole package ecosystems just to manually address any security concern, which is a crazy effort.


Doing that well would cost money, and people are used to getting their package managers for free.


If only there was some sort of system of named domains within which the products and services of various organizations could be located...



