Self-Service SBOMs (github.blog)
63 points by zvr on March 31, 2023 | hide | past | favorite | 54 comments



> The resulting JSON file [...] can then be [... blah blah blah...] or reviewed in Microsoft Excel (use a JSON-to-CSV converter for compatibility with Google Sheets)

GitHub's status as a Microsoft-owned subsidiary is showing. What a dig to manage to slip in—that one of your competitors, ostensibly a vanguard for openness and interoperability (and known esp. for favoring tech solutions that have a Web-y flavor in particular), hasn't taught its competing product to interoperate with JSON.


Isn’t that saying the opposite of what you’re concerned about? It sounds like Microsoft has added support to Excel for the industry-standard format SPDX files which many tools, including GitHub, generate but is acknowledging that Google has not yet done so. Since both tools are proprietary I’d much prefer that they use standard formats like SPDX. SBOMs are the kind of thing large organizations ask for - note the U.S. executive order linked – and that means lots of non-technical people are going to be exposed to them.


> Isn’t that saying the opposite of what you’re concerned about?

What do you think I'm concerned about?

Also: no.


Aren’t you criticizing GitHub for mentioning Excel? I’d be more sympathetic to that criticism if it wasn’t the most popular tool in the world for data analysis, especially in the target market.


> Aren’t you criticizing GitHub for mentioning Excel?

No.


I actually found that callout quite useful. I've often had to send SBOMs to lawyers, auditors etc. and typically in Excel.


Aren't these things basically incompatible formats, since JSON can be arbitrarily nested and spreadsheets have no depth?


There's a popular JSON 'schema' that makes it more compatible.

Sorry, I've forgotten the name. And I'm not sure if this adheres to it.


Kinda sounds to me like the format should just be CSV in the first place, then. If anyone has reasons why they prefer constrained JSON, I'd like to hear it, though. :)


No, the format is SPDX which can be serialized in JSON or in a spreadsheet (as well as RDF, XML, YAML, tag-value text, etc.)


Although you can't reliably handle arbitrary JSON, they're not ipso facto incompatible formats; the fact that you can run a CSV-to-JSON converter is proof. (If you have a tool that operates on the kind of data that CSV is used for, then you can, of course, implement JSON support in the same tool, at least to the extent that it is able to handle any JSON of the form that a CSV-to-JSON converter would produce...)
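To make the point concrete, here's a minimal sketch of that round trip: flat, tabular JSON (hypothetical package records, loosely modeled on SPDX fields) mapping losslessly onto CSV rows.

```python
import csv
import io
import json

# A flat, tabular JSON structure -- the kind a CSV-to-JSON converter
# would produce -- maps directly onto spreadsheet rows.
# (Field names here are illustrative, loosely modeled on SPDX.)
doc = json.loads("""
[
  {"name": "left-pad", "versionInfo": "1.3.0",   "licenseConcluded": "MIT"},
  {"name": "lodash",   "versionInfo": "4.17.21", "licenseConcluded": "MIT"}
]
""")

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "versionInfo", "licenseConcluded"])
writer.writeheader()
writer.writerows(doc)
print(buf.getvalue())
```

Nesting only becomes a problem when the JSON stops being a list of flat records, which is why a constrained profile stays spreadsheet-friendly.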


How much of this functionality only works when both the dependency and the code that depends on it are both hosted on GitHub?

I'm not looking forward to a "Code must be hosted (or at least mirrored) on Github" future.


As much as that's always a possibility, there's so much competition in this space that Github is just entering into that I don't see it happening anytime soon. The angle for Microsoft here is they're probably going to charge enterprise GH customers a severe premium to avail of this - it will only be freely available to users who aren't the primary target audience.


They note that you can upload SBOMs to GitHub, which makes sense given how many places have their own build infrastructure and may not be able or comfortable posting all of their dependencies.


You know that Microsoft would love this outcome and will gladly use the need for regulatory compliance as a mechanism to push for it.


Too bad the ability to understand dependencies is quite poor and not extensible. The dependency graph in GitHub Enterprise is useless, especially for the price - which makes any SBOM also equally useless.


What does "understand dependencies" mean to you? What kinds of things would help you understand dependencies better?


Could you describe its defects in more detail?


Simple example: using advanced security in GitHub Enterprise with dependabot, it understands our usage of actions and their dependencies - it can understand that action workflow X depends on Y and there's a new version. But if we use a docker image in those workflows, hosted in GitHub packages, it isn't able to understand that.

This is a fairly basic case I would have expected to work - but it doesn't. For anything C++ or more complex examples it's less useful... And dependabot is no longer extensible so this can't even be solved by using open source additions.

Currently looking at things like Mend's Renovate as an alternative.


I don't understand all this modern "supply chain security" stuff. What is the point of getting cryptographic signatures and bills of material from people you don't know and don't trust?


This is targeted at public sector and other enterprise customers. It doesn't need to actually solve the problem. It just needs to clear the hurdle of giving the people who work at these companies a way to look like they got something done. Bonus points when it does so in terms (like "bill of materials") that they're already familiar with.


It's a step (and I guess for many, a coping strategy).

Getting a BOM about people you don't know/trust lets you know who you don't know/trust: known unknowns. At the very least it presents a quantifiable risk profile that you can use to push for disengaging from untrusted dependencies. At best, it's a stepping stone to start building relationships.


It's a step towards what? What's the next step?

Some rando's BOM does not give you "known unknowns", it gives you "unknown knowns"/"untrusted knowns": an untrustable list of stuff they say they know, which only gives you a risk profile if the potential attacker is not attacking you, and is definitely not "quantifiable risk".


I'm not really sure what "unknown knowns"/"untrusted knowns" are (or how switching the order of the words is significantly different in this particular context).

If I'm using a piece of software from Alice, & Mallory is using a supply-chain attack to compromise that software, I have chosen to trust Alice without knowing who Alice has chosen to trust (unquantified risk). An SBOM tells me that Alice has chosen to trust Mallory - it doesn't necessarily tell me whether Mallory should be trusted, but it allows the risk associated with trusting Alice to be better quantified.

The next (out-of-band) step is actively engaging with & trusting Mallory (or not).
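That transitive edge is exactly what an SPDX document's relationship graph records; a minimal sketch reading it out (the fragment below is a hand-written SPDX-style illustration, not GitHub's actual output):

```python
import json

# Hand-written SPDX 2.3-style fragment: Alice's package declares a
# DEPENDS_ON relationship on Mallory's, which is what surfaces the
# previously invisible trust edge.
sbom = json.loads("""
{
  "packages": [
    {"SPDXID": "SPDXRef-alice-app",   "name": "alice-app"},
    {"SPDXID": "SPDXRef-mallory-lib", "name": "mallory-lib"}
  ],
  "relationships": [
    {"spdxElementId": "SPDXRef-alice-app",
     "relationshipType": "DEPENDS_ON",
     "relatedSpdxElement": "SPDXRef-mallory-lib"}
  ]
}
""")

names = {p["SPDXID"]: p["name"] for p in sbom["packages"]}
deps = [(names[r["spdxElementId"]], names[r["relatedSpdxElement"]])
        for r in sbom["relationships"]
        if r["relationshipType"] == "DEPENDS_ON"]
print(deps)  # [('alice-app', 'mallory-lib')]
```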


This assumes that you trust the manifest from Alice. You rely on Alice being truthful to evaluate the risk in trusting Alice. This is a catch-22.


I'm not sure we're talking about the same thing. What manifest are you referring to here?

The post is detailing a feature offered by GitHub to generate SBOMs from source code repositories - these will provide machine-readable inventory based on the files within the repository. You can upload a dependency graph to feed the SBOM as an out-of-band step, but trusting Alice's out-of-band dep graph is only a function of access to the source code as a source of truth.

Ultimately that's what all of this comes down to: creating a machine-readable graph from a predictable central place that tells us information we could have ascertained ourselves before anyway (just with considerably more effort).


If they're wrong, you don't trust them in the future. Sure, they can just generate a new key. Eventually there will be businesses with good reputations, and ones with unknown reputations. Just like in "real life" right?

Maybe I'm misunderstanding the discussion here?


We're all already doing this, what benefit does the key provide?


The first step in solving the trust problem is solving the identity problem. At the very least, once you've got cryptographic identities for entities involved in your supply chain, you can use a TOFU policy and check whenever an identity changes.

Simple operations like rotating a key shouldn't trigger any security warnings, as long as the new key is signed by the old one, and even adding new people to a team should happen seamlessly if (a majority of) the existing team members approve that new identity being added.

Of course it doesn't solve key compromise, or someone selling their keys to someone else, but with long-lived (even pseudonymous) identities, it becomes possible to reason about the trust level of packages just based on how long an identity has been used without being compromised.

No system is perfect, and there's still a long way to go, but the existing systems make the remaining problems more tractable, and already increase the cost for attackers, which should reduce attacks.
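A minimal sketch of the TOFU policy described above, assuming a simple fingerprint-pinning store (the key-rotation and endorsement checks are omitted; a real system would verify that a new key is signed by the old one):

```python
import hashlib

# Trust-on-first-use: pin the first key fingerprint seen for an
# identity, then flag any change for manual review.
pinned: dict[str, str] = {}

def check_identity(package: str, public_key: bytes) -> str:
    fingerprint = hashlib.sha256(public_key).hexdigest()
    if package not in pinned:
        pinned[package] = fingerprint     # first use: trust and pin
        return "pinned"
    if pinned[package] == fingerprint:
        return "ok"                       # same key as before
    return "KEY CHANGED"                  # rotation or compromise: investigate

print(check_identity("leftpad", b"key-A"))  # pinned
print(check_identity("leftpad", b"key-A"))  # ok
print(check_identity("leftpad", b"key-B"))  # KEY CHANGED
```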


> The first step in solving the trust problem is solving the identity problem

I disagree entirely. Knowing that the random "leftpad" library you pulled in was in fact authored by "John Brown, 46 years old, from Milwaukee" does absolutely nothing for your software security.

The only way to audit your dependencies is to actually have someone you trust (e.g. works for you) go and audit your dependencies. The entire system is built on a broken premise.


I'm glad you agree that knowing someone's name, age, and address doesn't prove their trustworthiness, because I don't want trust decisions to be dependent on threats of state-backed violence or mob vigilantism.

It is possible to build up trust in an identity based on how long that identity has been used, and the "transitivity of trust" principle. So you wouldn't trust someone because "John sounds like a trustworthy name", and instead you'd look at how long the author's key had been associated with the library, and whether their key had previously been endorsed on other people's projects (for example having their PRs reviewed and accepted).

Admittedly this introduces a new danger that the social graphs start to become very dangerous honeypots of metadata, especially if we start letting employers vouch for their employees, but the ultimate goal here should be to use something like Verifiable Credentials with zero knowledge proofs, which will allow very strong probabilistic arguments to be made about whether an author (and all the code reviewers) have suddenly gone rogue and decided to burn their hard-earned reputations.


> I'm glad you agree that knowing someone's name, age, and address doesn't prove their trustworthiness

My point is that NOTHING about their "identity" provides trustworthiness, unless you actually know that person and you're contracting them in some way.

> build up trust in an identity based on how long that identity has been used

Why would that be true? Time and time again, we have seen popular packages take a wrong turn. An "identity" is just a key with some untrustable name on it, which can be sold or mishandled just as easily as your NPM or GitHub password.

If your entire security still relies on "this rando didn't do me wrong in the past, they're probably fine" or "they have a lot of GitHub stars", why introduce key management? What does it really get you?


I think https://keyoxide.org provides some kind of middle ground for verifying identity here. The identity there is not meant to be real life names but rather a collection of all social profiles bi-directionally linked together with OpenPGP signatures.


This again verifies identities and in no way software. What's the point?

If you decide to trust "the Python Foundation", what does this key do for you if you're already downloading binaries from python.org? And if you don't, how much does the fact that they have a key help you? Anyone can get a key.


Multi-perspective validation.

Hackers can compromise python.org and sign stuff with a key advertised there. But the site is just one point. It's much harder to hack python.org and also their GitHub and Twitter account (and DNS and dozens of other supported services).

Keyoxide links the signing key on multiple sites, raising the bar for accepting a fake key. It's not a silver bullet, obviously. It just makes the attack harder to pull off, and it's machine-readable (instead of making humans check the keys).


I totally disagree. If John Brown is a US citizen, works for a major tech company, etc. I feel more comfortable than if it's some anime avatar, location unknown, etc. Risk is a gradient and security at enterprise scale is a huge challenge. This helps move in the right direction. It would be better (of course) to review every line of every package, but what’s the timeline on a typical org achieving that?


I admit I don't know a lot about these systems, but aren't they trying to solve a different problem? Aren't they trying to solve visibility?

Once you know a vulnerability exists in a particular version of a particular piece of software, you need a way to go from there, through dependencies, to builds, and then to deployments. You can't push fixes to all the right places without knowing what those places are.

In other words, you need to correctly and completely answer questions like, "Oh, CVE-YYYY-NNNNN was just announced. What's the list of every binary on every machine of every platform we have in every environment (production, corporate network, employee cell phones) that needs a fix?"

You can do it manually, but that's basically asking for failure because you'll miss things. Also it's tedious.


One useful thing you can do with an SBOM is quickly check whether your dependencies carry any known vulnerabilities. That kind of thing can be harder than you'd think to track, especially across large organizations with many projects.


Or maybe you do trust it, but you want to verify. Centrally collected SBOMs are useful to match against CVEs and other vulnerability sources to quickly identify affected codebases.
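The lookup described here is essentially a join of the SBOM inventory against a vulnerability feed keyed by name and version; a toy sketch (the component list is made up, though CVE-2021-44228 is the real Log4Shell advisory):

```python
# Inventory as (name, version) pairs, as extracted from an SBOM.
sbom_components = [
    ("log4j-core", "2.14.1"),
    ("commons-text", "1.10.0"),
]

# Vulnerability feed keyed by (name, version); real feeds use version
# ranges, but exact matching shows the shape of the join.
known_vulns = {
    ("log4j-core", "2.14.1"): "CVE-2021-44228",
}

affected = [(name, version, known_vulns[(name, version)])
            for name, version in sbom_components
            if (name, version) in known_vulns]
print(affected)  # [('log4j-core', '2.14.1', 'CVE-2021-44228')]
```

With SBOMs collected centrally, this join runs across every codebase at once, which is the whole point of the exercise.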


Would you rather have a receipt for something you buy or to take someone's word on it if anything goes wrong?


I don't understand what you mean at all. If I don't trust you or the software you sell me, how does you handing me a receipt that says "software is trustworthy I swear" increase that trust? What do I do with that receipt if anything goes wrong?


Say you bought a car from someone. They give you their word that it runs. It runs for a while but then something breaks. You have no idea why, so you take it to a shop and they find a number of reasons it could be faulty. Many days and countless dollars later, you finally find a root cause.

Say you bought a car and this time they gave you an itemized receipt of every part in that car and the current condition of each part. It breaks down. Now when you go to take it in to a shop to investigate, you may correlate a specific defective or unmaintained part as the reason. You replace that part and move on with life.

You’ve saved time, energy, and got plenty of peace of mind.

You do not trust who you got the car from, that largely doesn’t matter. Because now you have to deal with real problems owning that car can bring.


This is a fine analogy for accidents. I am talking specifically about security.


It wouldn’t be far fetched to include in the same analogy any known safety recalls or provenance of where you got the part. You may just have a different definition of security here. Best of luck.


Safety != security.


You are taking an academic security perspective, but think about it from a liability / legal perspective.

Suppose your supplier gets breached through a well-known vulnerability that should have been patched by any competent vendor, e.g. log4j. You are negatively impacted and it has materially affected your business. The software component that was compromised was not in the SBOM. Now you can pursue legal action because they didn't give you accurate reporting.


Was the screenshot in the blog post a mockup, or is GitHub changing to that style? https://github.blog/wp-content/uploads/2023/03/sbom1.png?w=1...

Also what is a "cloud repository"?

> As part of GitHub’s supply chain security solution, self-service SBOMs are free for all cloud repositories on GitHub.


I assume it's in comparison to on-prem repositories that are available through the Github Enterprise product.


That is the "Light High Contrast" theme that you can set on GitHub under Settings -> Appearance.


That's neat, maybe I'm getting old but it's a lot easier to browse GitHub this way.


There are essentially 2 approaches to SBOM.

1) Scanning source code: this will also generate a lot of false positives and a need to manually edit (amend/correct) the SBOM. It works best with containers. One FOSS tool that does this is Anchore's syft/grype. Others have jumped onto the SBOM bandwagon simply because they thought it's easy (they were already offering SAST/DAST).

2) The other class of tools only looks at a binary artifact (firmware, executable, library, etc.) and then tries to extract the SBOM from that. I am not aware of any FOSS tool here and it would be cool to learn of anything out there. But chances are some custom work is needed for each specific firmware. The only vendors I know of today are Binare.io and Cybellum.

For firmware and embedded, this is really the only way that IMHO makes sense for embedded engineering. Because you can't just scan an SDK and say "yep, it has mbedtls x.y.z or cJSON in it ...". A scanner won't find it because there is no spec file or anything similar that can be parsed reliably. Your only option would be to strong-arm the silicon vendor into supplying a machine-readable SBOM along with the SDK.

It also has fewer false positives and doesn't find vulnerabilities (or license issues, depending on what you match for) you don't care about.
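A toy illustration of the binary-analysis approach: scanning a blob for embedded version strings, since there's no manifest to parse. (The blob and patterns are made up; real tools like the vendors mentioned above do far more than string matching.)

```python
import re

# Fake firmware blob with embedded library version strings.
blob = b"\x00\x7fELF...mbedtls 2.28.3...\x00cJSON v1.7.15\x00"

# Per-library regexes over the raw bytes -- the kind of signature a
# binary SBOM extractor might start from.
patterns = {
    "mbedtls": rb"mbedtls (\d+\.\d+\.\d+)",
    "cJSON":   rb"cJSON v(\d+\.\d+\.\d+)",
}

found = {name: m.group(1).decode()
         for name, pat in patterns.items()
         if (m := re.search(pat, blob))}
print(found)  # {'mbedtls': '2.28.3', 'cJSON': '1.7.15'}
```

Only components actually linked into the artifact show up, which is why this approach produces fewer false positives than scanning the whole SDK.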

The NTIA envisions the SBOM as a key element in building a risk-based approach to DevSecOps, with Vulnerability Exploitability eXchange (VEX) and much more. We're still far from this.

Fact is, though, it's already becoming part of regulation (RED, CRA) in Europe and the US (the Executive Order Biden signed not too long ago).


> scanning source code: it will generate also lot of false positive and a need to manually edit (ammend/correct) the SBOM

Can you give any examples of this? I've never seen a tool like this report a true false-positive, but based on your mention of embedded tools I'm guessing you might be referring to something like a build dependency which is completely optimized out, such as a debugging feature on a production build? In such cases I would be inclined to either prepare separate SBOMs or otherwise indicate when something is present but not reachable. My concern about editing the files is that these are intended for a somewhat legalistic context, and if something blew up it's not hard to imagine someone's lawyer using those changes to argue that you were aware of a problem and tried to cover it up. Even if you could defend your policy, that's not a conversation anyone wants to have.


An example would be trying to create the SBOM on a build server which is meant to produce the artifact, when you're only interested in what your artifact needs (pulls in). How would any tool know what to include in the SBOM?

Normally the SBOM generation tool will try to query your package manager, whether that's rpm, dpkg, or PyPI, among others, to understand what it sees. So looking at the build stage might include a lot more (toolchain etc.).

Most commercial tools cater for this by allowing you to rectify incomplete or wrong information.

The problem you mention, liability, is an important one. If the law requires the SBOM to be correct, then this is something humans need to verify, and that's really annoying.


There are actually more "types" of SBOMs than the two you mention. I don't think CISA has published the corresponding table yet, but depending on the point in the software lifecycle that you are at, you can create "Design", "Source", "Build", "Deploy", or "Analysis" SBOMs -- and they are all for different use cases.



