Hacker News new | past | comments | ask | show | jobs | submit login
Behind GitHub’s new authentication token formats (github.blog)
244 points by todsacerdoti 12 days ago | hide | past | favorite | 79 comments

The key insight here is that random tokens should be self-describing, so you know their intended use and therefore can make decisions and take action when one is detected.

If a script sees "ABC123" in a code commit, that's meaningless. If you see "secret-token:ABC123", now you can fail the commit with an error message: "Secret token detected in public commit, aborting."
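The fail-the-commit check described above can be sketched in a few lines (the regex is an approximation of RFC 8959's token grammar, not an exact implementation):

```python
import re

# Approximation of RFC 8959: the "secret-token:" scheme followed by
# one or more pchar characters (percent-encoding allowed).
SECRET_TOKEN_RE = re.compile(r"secret-token:[A-Za-z0-9\-._~%!$&'()*+,;=:@]+")

def scan_commit(text: str) -> list[str]:
    """Return any secret-token URIs found in the commit contents."""
    return SECRET_TOKEN_RE.findall(text)

diff = "api_key = 'secret-token:E92FB7EB-D882-47A4-A265-A0B6135DC842'"
for token in scan_commit(diff):
    print(f"Secret token detected in public commit, aborting: {token}")
```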

For those who haven't seen it, "secret-token:" is an RFC (RFC 8959). I've started using it at work.


Note that the RFC's category is "informational", which doesn't give it as much weight as something that is "standards track". Usually the important RFCs are "standards track" though there are some "informational" RFCs that are also important.

From Wikipedia[1]:

> An informational RFC can be nearly anything from April 1 jokes to widely recognized essential RFCs like Domain Name System Structure and Delegation (RFC 1591). Some informational RFCs formed the FYI sub-series.

[1]: https://en.wikipedia.org/wiki/Request_for_Comments#Informati...

An unencrypted version marker would be pretty useful, too. If anything is long-lived, you can safely bet that it'll need to evolve.

Fair, though usually a token like this is just an ID into a database table and doesn't itself contain any data.

One of the biggest lessons learned in our org was to prefix tokens with the entity type. It's helped immensely.


* Helps in migrations, especially complex ones where you split up entity types.

* Identifies what tokens are, so people can look them up if they see them in logs.

* Polymorphic relationships can delegate to the appropriate owning service easily without additional bookkeeping.

You can also encode other stuff in the token entropy, too, such as the author DC/region for active-active setups where you need to forward the request to the source of truth in the brief window where the other regions don't know about it yet.
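A minimal sketch of the entity-type-prefix pattern (the prefix names and `_` separator here are illustrative, not any particular company's scheme):

```python
import secrets

KNOWN_PREFIXES = {"user", "org", "tok"}  # hypothetical entity types

def mint_id(entity_type: str) -> str:
    if entity_type not in KNOWN_PREFIXES:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    return f"{entity_type}_{secrets.token_hex(16)}"

def entity_type_of(opaque_id: str) -> str:
    """Recover the owning entity type, e.g. to route a lookup from logs."""
    prefix, _, _ = opaque_id.partition("_")
    if prefix not in KNOWN_PREFIXES:
        raise ValueError(f"unknown ID prefix: {prefix!r}")
    return prefix

uid = mint_id("user")
print(entity_type_of(uid))  # → user
```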

I’ve always thought that a Java-style reverse domain name format (or, perhaps, URLs) is a great way to encode IDs: com.foo.bar.Person:0000-11111-22222-33333 or whatever. That way, any code that logs IDs or transfers IDs across the network gets tracing “for free” and, when you see an ID in a bug report, you can use it to help focus the investigation.

Ruby/Rails has this in the form of GlobalID[1]. To be honest I haven't seen it used outside of whatever Rails itself automatically does, but the concept is there.


That's how Stripe prefixes their IDs too, depending on the entity type. Makes debugging, docs, etc. easier.

FWIW it is still very much worth reading the article, since they talk about how they implement that approach. They bring up why they use an underscore, checksumming, entropy, etc.

The ID prefixing is cool from an identification point of view, but we've been using UUIDs for tokens and if we implemented this we wouldn't be able to use the UUID optimized datatype field in Postgres.

Surely you can just strip off the prefix in the application layer before sending it to Postgres? You still get the benefits, while being able to use the native UUID type.
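That round trip might look like this (hypothetical `tok_` prefix; the database only ever sees a native uuid value):

```python
import uuid

PREFIX = "tok_"  # hypothetical display prefix

def to_db_value(token: str) -> uuid.UUID:
    """Strip the display prefix so Postgres can store a native uuid."""
    if not token.startswith(PREFIX):
        raise ValueError("unexpected token prefix")
    return uuid.UUID(token[len(PREFIX):])

def to_display_value(stored: uuid.UUID) -> str:
    return PREFIX + str(stored)

token = "tok_e92fb7eb-d882-47a4-a265-a0b6135dc842"
assert to_display_value(to_db_value(token)) == token
```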

I do it like that. We're only using a two char prefix (I copied Twilio)

Why not? You don’t have to store the prefix and the UUID in the same column?

Can’t you just add a column to your schema with the prefix?

>One other neat thing about _ is it will reliably select the whole token when you double click on it

Shout out and kudos to whoever brought that up

Frankly the fact that this doesn't happen with `-` in a `<code>` block should be considered a browser bug.

Well it doesn't happen in IDEs either (by default, at least)

I can see the argument for multi-line pre-formatted code blocks, but for inline `<code>` it would be nice if double clicking anywhere selected the whole thing.

I'd rather have consistent behaviour TBH. I'm not too happy that `-` is a non-word character, but I'd rather it always behave the same everywhere without having to think about context.

I wasn't quite aware of this because there are slight inconsistencies in some programs and between OSs.

e.g. in powershell on Windows, Ctrl+Backspace deletes the word part, but in cmd shell, Ctrl+Backspace deletes until whitespace. The keybindings to delete word also vary on Windows vs MacOS (Alt+Backspace deletes a word on mac, but deletes the whole line on Windows. Windows uses Ctrl+Backspace for that).

I was also confused, since by default Emacs doesn't treat `_` as a word character.

Is it `a-long-identifier` or is it `x-y`?

Ideally it’d use the language defined word separators. I fought for that in vscode but was out-voted :)

'-' is a minus sign. It will appear in code blocks much more often in contexts where you do want it to act as a word delimiter.

True, good point. Although if there are no spaces around it, it’s probably not acting as an operator. But at that point any heuristics to determine what it’s doing are well beyond the scope of the browser highlighting feature.

It seems weird that in a blog post about a new format for tokens, there isn't a single example of what a GitHub token now looks like.

From the opening paragraph: "old authentication token formats are hex-encoded 40 character strings"

description != example

https://tools.ietf.org/html/rfc8959 - "secret-token:E92FB7EB-D882-47A4-A265-A0B6135DC842%20foo"

> Thus, we are adding a separator: _. An underscore is not a Base64 character which helps ensure that our tokens cannot be accidentally duplicated by randomly generated strings like SHAs.

This is just a little bit misleading. Base64 isn’t a single neat and tidy thing: there are several alternatives for the encoding characters 62 and 63, padding, line break behaviour and one or two more things; see https://en.wikipedia.org/wiki/Base64#Variants_summary_table. When you’re talking about Base64 on the web, you’ll very commonly be talking about base64url, the URL- and filename-safe variant, rather than what’s most commonly called base64 and typically the default. But base64url is in widespread use, and has _ as character 63.
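To make the variant distinction concrete — the standard and URL-safe alphabets differ exactly in characters 62 and 63:

```python
import base64

# The bytes 0xFB 0xF0 produce 6-bit groups 62 and 63, exercising
# exactly the characters where the two alphabets disagree.
print(base64.b64encode(b"\xfb\xf0"))          # b'+/A='  (standard: + and /)
print(base64.urlsafe_b64encode(b"\xfb\xf0"))  # b'-_A='  (base64url: - and _)
```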

Also, “randomly generated strings like SHAs” aren’t typically Base64 anyway, but rather hexadecimal encoding.

The post isn't super clear, but I think they're describing base62 here. It says base62 later in the text when talking about the checksum, and the "Our implementation for OAuth access tokens are now 178" section uses `a-zA-Z0-9`, which is base62.
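Reading it that way, the checksum scheme can be sketched roughly like this (the `ghx` prefix, the alphabet ordering, and the exact CRC32/padding details are assumptions based on the post's description, not GitHub's actual code):

```python
import secrets
import string
import zlib

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # one base62 ordering

def base62(n: int, width: int) -> str:
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out)).rjust(width, "0")

def mint(prefix: str = "ghx") -> str:
    """30 chars of base62 entropy, then a 6-char CRC32 checksum of them."""
    body = "".join(secrets.choice(ALPHABET) for _ in range(30))
    return f"{prefix}_{body}{base62(zlib.crc32(body.encode()), 6)}"

def looks_valid(token: str) -> bool:
    """Offline sanity check only: a matching checksum proves the token is
    well-formed, not that it is live."""
    _, _, rest = token.partition("_")
    body, checksum = rest[:-6], rest[-6:]
    return base62(zlib.crc32(body.encode()), 6) == checksum

print(looks_valid(mint()))  # → True
```

The point of the checksum is that a secret scanner can reject most false positives without ever hitting GitHub's API.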

ok, great, when will I be able to scope tokens to acting on only one repo, or only one org, or in any meaningful way?

I'm sure this is a valid complaint about Github, but it has nothing to do with the article, which is a bit annoying since there's some cleverness in that article (checksumming tokens, for instance) that we could be talking about stealing, rather than turning this thread into a generic referendum on whether Github is good.

It’s valid because GitHub is spending cycles on token format when the functionality is more important.

I think these new formats are nice, but I don't much care, given how hard their token scheme is to use. Just some mention of other stuff coming would be nice.

The problem I have is I’m not even sure if GitHub recognizes this as a problem (I have to grant access to every repo I can access to every script) and it’s been broken for years. Would be nice to know what they’re working on.

This is what they are working on: https://github.blog/changelog/

It seems to me that it's a PM-centric process, i.e. lots of end-user features rather than fixing the behind-the-scenes parts that are poorly designed and/or broken.

We tried to work around the inadequate token support by using their Terraform module for automation, only for it to delete our repos, losing years of issues, because of a bug where renaming = deleting (a 3-year-old issue that has never been prioritised).

a) I don't see what is particularly clever about their token algorithm.

b) Talking about fundamental flaws in their token implementation seems relevant in a discussion about their token implementation.

c) Public discussions about flaws are often the best way to educate others and make the company aware of them.

The article is about GitHub's authentication tokens. It seems relevant to bring up a complaint about their scopes.

And yet it's not, because this is an article about token formats, and not about entire authorization schemes.

I definitely agree this is needed, but I have to imagine it's quite a complex change on GitHub's side. It probably entails changing a lot of their authentication architecture, in terms of what's stateless and stateful and what requires a trip to the DB to check (i.e., it's hard to encode a list of which resources you should have access to in a single token). I'm sure they're aware this is a problem though. Maybe this recent change sets the groundwork for fixing it.

That the only method I have to scope tokens is to break the TOS for GitHub - by creating single-use accounts - is super lame.

Don't forget that creating new accounts costs money.

If you want to take advantage of Environments for example then you need to pay for the Enterprise license which means every account is another $21/month. That adds up for individual and startup use.

I am actually confused why we can't just have tokens assigned to organisations and not users.

What TOS does it violate? They explicitly recommend creating machine users in certain areas of their help docs.

This one? ”you may not have more than one free Account”


But there’s also later more detailed guidance:

”One person or legal entity may maintain no more than one free Account (if you choose to control a machine account as well, that's fine, but it can only be used for running a machine).”

This means that if you create a work account that is separate from your personal account (because you don't use personal credentials on work machines and vice-versa) then you are technically in violation of the TOS...

Yet this is something I, and many others, do because we don't want to mix business with pleasure. In fact I absolutely refuse to do so because of security reasons.

Many Microsoft employees also have separate personal and work accounts. I would be surprised if that was a violation.

It is explicitly a technical violation of the TOS.

In practice this is mostly so GitHub has a reason to act on misuse, like bots, which might not be caught by the anti-spam measures.

They do mind a bit but not too much, and you can get help from support having literally stated that you have multiple accounts. For instance if you're testing an extension or integration with github, and there are specific interactions between different users… you kinda need different users to test it. And mocking github may not be sufficient.

Aren’t the work accounts, paid accounts? The TOS only restricts having multiple free accounts.

No, the work accounts are not paid accounts. My work requires me to interact with Open Source projects and the like. We have our own hosted Gitlab instance for our internal projects.

The work account is strictly to communicate with/provide patches back to upstream projects.

I have multiple personal accounts on Github. One for each employer I have worked at in the past couple of years, and my personal account that is tied to my own identity and is used for my personal time projects/open source work that is not tied to $work.

Yes. I'm talking specifically about the "separate personal and work accounts" case. It may very well be a TOS violation as the TOS is written. I'm saying I'd be surprised if they treated it as one.

I'm actually dealing with something like this right now and am curious what solutions people use for e2e testing of OAuth flows. I'm leaning toward creating a test account at each Identity Provider, but then I have to deal with things like 2FA. I guess it's not so bad if I just use a TOTP generator on the client, but if they want to send an email to verify my account, that's just annoying.

Whatever the letter of this restriction is, that's not its spirit. Our practice at my last company was per-client segregated accounts, and I have a mailbox full of discussions with Github support staff telling us that was OK.

This seems like a classic case of rules with high potential for selective enforcement which generally leads to unfair enforcement. It's fine as long as you don't somehow get on Github's bad side and if you do it's an instant reason to close your accounts.

Well the solution is simple, just make a new LLC for each GitHub account you need!

I'm definitely kidding, but unless there is more in their TOS (which I don't intend to read) I don't see why this wouldn't be a workable loophole.

Presumably then it would be non-violating to have many paid accounts? Sure is pricey for smaller operations but depending on the value it brings to have this more granular scoping it surely might be worth it?

Doesn't remove their need to implement a more reasonable way of scoping tokens, though.

GitHub, for me, is running out of excuses for why the fundamentals of their platform are so poor, and why, compared to GitLab, they deliver improvements at such an anaemic pace. They act like a company with 15 employees, not 1,500+.

Everything from Security, Actions, Containers, Packages, Terraform, JIRA Integration etc is either completely broken or has major outstanding issues that haven't been fixed for years.

Personally, we use GitLab for all the internal repos at our company. We originally migrated because GitLab CI was free and GitHub didn't even have a CI solution at the time. We still use GitHub for our public repos since that's "where the community is." GitHub actions is great, although IMO a bit too prescriptive (I'd rather write a script that can run anywhere rather than spend time building up the mental model of the abstractions that are unique to GHA). But nothing beats GitLab CI + container registry; we put a lot of work into our CI pipeline and now we've got incremental builds with a Docker image for every service tagged per commit. And since GitLab Container Registry supports manifest v2, we can take advantage of BuildKit layer caching (I think GitHub registry supports this now too, but haven't played with it).

That said, GitLab has its fair share of problems too. GitHub UI is way better, community/discussion features are better, and forking/public collaboration workflow is better.

I'm glad there are two big players in the space, though; GitLab really lit a fire under GitHub to finally get them to start pushing new features.

You can already do that today. Create a Github App and add it only to a single repo.

I'd be happy to have a repo read-only scope to start with, it seems a little nutty that in the year 2021 the only repo scope available is read-write https://docs.github.com/en/developers/apps/scopes-for-oauth-...

I completely agree. This change is interesting and all, but GitHub (Enterprise, in my case) tokens aren't granular enough. There's a lot more benefit to be had in fixing that issue.

Would someone comment on this idea in context of JWTs? Not trolling, just curious as I use JWTs and embed this kind of metadata as a custom claim, which accomplishes some but far from all of what GitHub accomplishes here, but then I have no need for the easy scanning. So seeking wisdom from anyone who has thought carefully about whether or not to prefix their JWTs in this way.

JWTs purposely contain information in plain text (unencrypted and not stored in a database); however, it's base64-encoded, so you don't need to worry about URL encoding issues, and it still looks like an opaque token.

You could add a prefix to a jwt. That would make it a token that contains a jwt.

I don't think the tiny prefix is what they want to obscure. So it wouldn't go against the design of JWT to add one.

I would do it. I don't see any issues with it.

It would be something like:
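Presumably something along these lines — a toy HS256 JWT assembled by hand, with a hypothetical `gh` prefix (the key and claims are made up):

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_prefixed_jwt(prefix: str, payload: dict, key: bytes) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = b64url(hmac.new(key, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return f"{prefix}.{header}.{body}.{sig}"  # prefix, then a normal JWT

token = make_prefixed_jwt("gh", {"sub": "1234"}, b"dev-only-key")
print(token)  # the JWT part still starts with "eyJ"
```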


If you wanted to be able to double click to copy and paste, which I don't think is a huge usability improvement, you could replace the . with _, and I think a lot of devs would be able to figure out that it's a representation of a JWT.

A big motivation for such token formats is to quickly and easily identify when they are shared somewhere they shouldn't be. JWTs aren't helpful in that regard, since they always present themselves as a base64 encoded blob.

It’s pretty greppable because `{"` (the JSON opener) always encodes to "ey". So a base64 string that starts with "ey" and has 3 dot-separated sections is a good start for a regex. I'm sure you can go further by looking at the spec.

Yep. eyJ<base64stuff>.eyJ<base64stuff>.<base64stuff> is pretty definitely a JWT - if you see it in a log output you should probably redact at least the first and third blocks of base64.
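A rough redaction pass along those lines (a sketch; "eyJ" only means the segment decodes to JSON beginning with `{"`, so expect occasional false positives):

```python
import re

# Two "eyJ..." segments (header, payload) plus a base64url signature.
JWT_RE = re.compile(r"eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")

def redact_jwts(line: str) -> str:
    """Keep the payload readable; drop the header and signature blocks."""
    return JWT_RE.sub(
        lambda m: "<redacted>." + m.group().split(".")[1] + ".<redacted>", line
    )

print(redact_jwts("Bearer eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxIn0.abc123"))
# → Bearer <redacted>.eyJzdWIiOiIxIn0.<redacted>
```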

But JWTs are generally somewhat ephemeral (they expire) so you’re fairly unlikely to commit one into a source repo in a way that could do actual damage...

This is great. If you start with the same thing in your payload (with whitespace removed and key ordering preserved), you can add that to what you grep. If GitHub used JWT and started the payload (the second of the three dot separated items) with `{"gh":`, they could grep it like this:
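This works because base64 maps each 3-byte group of input to 4 fixed output characters, so a constant JSON prefix yields a constant base64 prefix:

```python
import base64

# 6 input bytes encode to exactly 8 base64 characters, so the
# greppable prefix is fully determined by the JSON prefix.
print(base64.urlsafe_b64encode(b'{"gh":').decode())  # → eyJnaCI6
```

So something like `grep -E '\.eyJnaCI6'` over your logs would flag payloads starting with that key.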


The question is why you couldn't just have "${prefix}-${JWT}" as your format.

Then you can just strip the prefix before parsing. This means you don't need to worry about checksumming or entropy, and you get the ability to embed large amounts of data, as well as plenty of client support and libraries.

Would be curious if this implementation is somehow more performant.

Yeah, what GitHub is doing by including the checksum is reinventing JWT, with HS256 precisely. They could also make it easy to grep (see other comment) by having the payload always start with the same prefix - if it starts with `{"gh":"` the jwt will contain `.eyJnaCI6I`

The subject doesn't need to be a user ID, and it sounds like they don't want that. It could be a session ID.

Totally, but JWT-like blobs can be detected (see sibling comment) and parsing attempted, so for the automated scanning use case, if I understand correctly, it can be done with perfect accuracy, just at a larger computational expense and with worse security exposure due to the complexity of the scanning and the need to parse.

The more interesting side to me is the benefit to humans, from the prefix technique.

That was a really well thought out article.

I loved the use of underscore and the "it'll reliably double click"

I liked that they looked to other companies out in the wild and acknowledged that they're learning from others.

I didn't know what to expect when I opened the link, but I'm glad I read it!

Does anyone know if these new tokens are backwards compatible with software that used the old tokens? By which I mean, I’m using a version of Git Tower from before they switched to a subscription model, and I’m wondering whether regenerating my tokens will make me unable to log in.

The reason we (GitHub) kept the new tokens to 40 characters, matching the length of our old tokens, was to make this change as backwards compatible as possible. We’ve never documented or committed to our tokens being 40 characters long, but they’ve been that way a long time and there may be software out there storing them in fixed-length database columns.

If you use a service you’re worried the change may break I’d recommend minting a new token and trying it, perhaps on a new account on that service, before revoking your old one.

(We do plan to increase the length of our tokens in future, but not before July 2021 at the earliest.)

I imagine they should be, they're still just strings.

It's possible that Git Tower does some local validation that might complain, but I think they are in the same set of characters.

You could always write a crack for the local validation, too.

I wonder why they went with 2 characters rather than 3 or 4 for the company identifier. A stock ticker, for instance, would make sense, though it's not really practical.

This isn't meant to be a standard, just something they picked for themselves. And it doesn't even need to be a company identifier. Slack tokens are prefixed with "xox<token type>-", for example.

Side rant: Github devs, if you're listening, please give us a REST API. Not everyone knows GraphQL or has the motivation to learn it. The industry standard for public-facing APIs is REST, including at companies such as Stripe (widely considered to be the gold standard for public API design and documentation). You can use GraphQL internally.

I remember having to use GraphQL to delete a Docker image that was stuck in my private repo and there was no GUI to clear it. Wasted a couple of hours trying to send a GraphQL query which would have been a 2 minute jobbie using cURL. Github's public REST API didn't have this feature.

A lot of REST APIs are just as hard to grok as GraphQL is as a whole. Companies often lack schemas and documentation, which GraphQL helps with out of the box.
