1. Using sequentially incremented integers as object IDs, and
2. Failing to protect sensitive data using some kind of authentication and authorization check.
This is becoming a trend with data breaches. Several of Krebs' other reports on behalf of security researchers were originally identified by (trivially) walking across object IDs on public URLs.
My cynical take is that Krebs couldn't go public before this afternoon because First American wanted it to hit the news at an opportune time, then get ahead of it with their own messaging. Krebs got in touch with First American on Monday May 19th. The story is only just breaking now on a Friday afternoon at 5 pm; markets are conveniently closed for the weekend.
I expect them to issue a hollow PR statement about valuing security despite being unable to act on security reports until an investigative journalist threatens to go public.
It was an absolute nightmare. Maintenance was a nightmare too: you're constantly having to generate or replicate these things, which adds an extra layer of complexity to everything, almost always unnecessarily.
It's also extremely bad for db performance: it causes massive page fragmentation, indexes become useless almost straight after rebuilding them, etc.
For almost everything, sequential int IDs are fine. It's the things you expose to users that you need to be careful with: don't use the primary key to access those, add a separate unique key to them instead, but keep the int id in there for the db to use and for your own use.
My lesson was to go back to always using int ids, and on a few objects adding a separate unique key column to expose to users for the sensitive stuff.
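A minimal sketch of that pattern, using sqlite3 and made-up names (documents, public_key, owner_id are illustrative, not from the thread):

    import secrets
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE documents (
            id         INTEGER PRIMARY KEY,    -- internal, sequential, used for joins
            public_key TEXT NOT NULL UNIQUE,   -- random, the only id exposed in URLs
            owner_id   INTEGER NOT NULL,
            body       TEXT
        )
    """)

    def create_document(owner_id, body):
        # the public key is a high-entropy random string from a CSPRNG
        public_key = secrets.token_urlsafe(16)
        conn.execute(
            "INSERT INTO documents (public_key, owner_id, body) VALUES (?, ?, ?)",
            (public_key, owner_id, body),
        )
        return public_key

    def fetch_for_user(public_key, user_id):
        # user-facing lookups go through the random key *and* an ownership check
        return conn.execute(
            "SELECT body FROM documents WHERE public_key = ? AND owner_id = ?",
            (public_key, user_id),
        ).fetchone()

Internally the int id keeps doing all the relational work; only the random key ever appears in a URL.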
Most applications don't use UUIDs and many of them are fine and I definitely wouldn't ding an app for using monotonic IDs, but I'm increasingly thinking that it's worth praising UUIDs more.
To illustrate this for you, let me turn it around a bit. Is it security by obscurity if the only thing stopping someone from logging into your account is knowing your password?
Security by obscurity is when you (for example) roll your own cryptosystem and rely (in whole or part) on the secrecy of your new-fangled algorithm to save you. That is unsafe. But if you're saying high-entropy strings shouldn't be the only barrier to authentication, you're throwing out half a century of complexity theoretic cryptography.
I can see your point. If UUIDs are handled in such a way that they are discoverable by anyone, they are not enough to make the references secure.
I think the point tptacek and others are making is that this is an instance of the defence-in-depth principle, though. In scenarios where UUIDs are not simply discoverable, using UUIDs is inherently more secure than using a monotonic ID, simply because the monotonic ID can be easily guessed. Yet they are still not enough in isolation and you should additionally be using proper access control (due to eventual leakage of particular UUIDs in emails and such).
If I can see in this HTML page that your reply is /reply?id=12345, then it doesn't matter if Hacker News uses integers or UUIDs, if there's a bug in /edit?id=12345 that just lets me edit it without the appropriate security. If we say that UUIDs always make everything inherently more secure, we're doing everyone a disservice.
Now, the original discussion was about (1) discovering for read, and not about (2) escalating a read to a write. But if anyone reading this mistakenly takes from it that UUIDs are the way to solve these problems then they will go on optimizing for (1) at the expense of (2).
That's been bouncing around at least since the time I noticed it on /., which was a couple of decades ago.
For an elegant solution to this problem, check out Twitter's Snowflake.
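Roughly, a Snowflake ID packs a millisecond timestamp, a machine ID, and a per-machine sequence into a single 64-bit integer, so IDs sort by time without being a plain counter. A toy sketch of the layout (the epoch constant is the commonly cited Twitter one; the machine ID is made up, and real implementations also handle sequence overflow and clock skew):

    import time
    import threading

    EPOCH_MS = 1288834974657      # commonly cited Twitter epoch; any fixed epoch works
    _lock = threading.Lock()
    _last_ms = -1
    _sequence = 0

    def snowflake_id(machine_id: int) -> int:
        """41 bits of timestamp | 10 bits of machine id | 12 bits of sequence."""
        global _last_ms, _sequence
        with _lock:
            now_ms = int(time.time() * 1000)
            if now_ms == _last_ms:
                _sequence = (_sequence + 1) & 0xFFF   # up to 4096 ids per ms per machine
            else:
                _sequence = 0
                _last_ms = now_ms
            return ((now_ms - EPOCH_MS) << 22) | ((machine_id & 0x3FF) << 12) | _sequence

    print(snowflake_id(machine_id=1))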
> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities (identifiers whose mere possession grants access), for example.
HN discussion: https://news.ycombinator.com/item?id=10631806
In this kind of system you can also generate deterministic UUIDs, which are useful for idempotency (e.g. the same event can be recognised as a duplicate).
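For example, name-based (version 5) UUIDs hash a fixed namespace plus a stable name, so the same input always produces the same ID. A small sketch (the namespace URL and field names are made up):

    import uuid

    # a namespace you pick once and keep; any UUID works
    EVENTS_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/events")

    def event_uuid(source: str, external_ref: str) -> uuid.UUID:
        # the same (source, external_ref) always maps to the same UUID,
        # so a re-delivered event can be recognised as a duplicate
        return uuid.uuid5(EVENTS_NAMESPACE, f"{source}:{external_ref}")

    assert event_uuid("billing", "order-42") == event_uuid("billing", "order-42")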
That being said I'm a little surprised to hear about the complexity. Are you able to share which DB/stack you were using? This functionality should be natively supported at two distinct abstractions: your programming language and your database.
But it's not just the support that's the problem. You're testing and need to switch category: you can't just change a 1 to a 2, you have to go find what random uuid that category was assigned. You can't just go into the DB and add a new row, you have to open a UUID generator. You can't just quickly add a foreign key relationship, you have to look up the UUID. And a ton of other little annoyances.
Actually, categories are an excellent example of something that shouldn't be a UUID; they're supposed to be discoverable.
I think my present project has UUIDs on the user, company, invoice and payments tables, but still ints as the primary key. For everything else it isn't worth it. There's a merchant table, but again, those are all supposed to be discoverable (and aren't editable by the merchants themselves).
I also generally implement controller level security that checks access to the root object being returned by default, so I can't really make a mistake exposing an unauthorised object. There's an occasional controller where I've made a conscious decision not to implement that level, generally actions that allow both authenticated and unauthenticated users (e.g. viewing merchants or categories).
The keys will be easier to guess again, but if all you have to do is guess a primary key to get access to the underlying data, something else isn't right anyways.
It's not about using hard-to-guess UUIDs, but restricting access to the underlying data.
Then of course there is the issue that email is for the most part unencrypted (or encrypted without validating certificates).
And a side note: I wouldn't trust that the prng for your UUIDs is cryptographically secure. That's not part of the spec.
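If unguessability is the point, it's safer to mint the token from a CSPRNG explicitly rather than rely on whatever a given UUID implementation happens to use. A tiny sketch in Python:

    import secrets
    import uuid

    # uuid4() is random, but the UUID spec doesn't require a cryptographically
    # secure generator; whether you get one depends on the library/runtime.
    maybe_fine = uuid.uuid4()

    # an explicitly CSPRNG-backed token (128 bits here) removes the doubt
    definitely_random = secrets.token_urlsafe(16)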
I thought they were a bit of a hack to raise the bar a touch, in which case the crypto security properties of that function aren't interesting; the ergonomics are.
That provides IDs that are both opaque and, if you want, user-friendly.
(disclaimer: I wrote it.)
And yes, Google also posted a sitemaps file (or rather, 50,000 sitemap files) with all profile IDs. But that was last marked updated in March 2017, for some reason. Being able to validate that would have been nice.
But as a mitigation against blind bulk scrapes, a useful tool. I'd consider that one of G+'s good design elements.
The typical web app has the concept of a validated user session per request. How hard is it really to
Select ... From Documents where documentid = ? and userid = ?
Every web framework that I am aware of lets you add one piece of middleware that validates a user session and won't even route the request if the user isn't validated.
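A rough sketch of that kind of middleware, written against Django's middleware conventions (any framework with a per-request hook works the same way):

    from django.http import JsonResponse

    class RequireValidSession:
        """Rejects the request before it reaches any view if there's no valid user."""

        def __init__(self, get_response):
            self.get_response = get_response

        def __call__(self, request):
            # assumes the session/auth middleware earlier in the stack has
            # already populated request.user from the validated session
            if not request.user.is_authenticated:
                return JsonResponse({"error": "authentication required"}, status=401)
            return self.get_response(request)

Registered once in the middleware stack, it applies to every route, which is the point: the check can't be forgotten per-view.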
The other commenter's point about leaking information is also correct. In the finance industry one of the basic tricks to obtaining alternative data is to scrape it from private APIs which expose sequential IDs corresponding to a source of revenue. For example, a publicly traded car company might have its revenue extrapolated from an open API which sequentially increments an ID every time a vehicle is sold. Research groups will reverse engineer mobile apps from companies with only one or two dimensions of revenue, find the private API endpoints (reversing request signing as needed), and then look for object IDs which can be thrown into a timeseries on a quarterly basis.
Generally speaking the risk and compliance department of a hedge fund disallows this kind of data if it's gathered from an actual security vulnerability (e.g. leaks PII). It needs to be "only" a neutral information side channel without sensitive data, so that doesn't really apply in this specific scenario. But it does apply for people considering using integer IDs for user-facing APIs.
I couldn't put it better myself when I'm speaking to management that makes these kinds of decisions. Now I can quote throwawaymath verbatim to drive the detailed point home.
The idea pre-dates web APIs by many decades :-)
- It exposes the count you have of a particular item
- It exposes your growth rate of those items
- If a developer accidentally breaks your authentication (or somebody hacks it), it becomes trivially easy to download all your items very quickly
And it isn't like using a randomized token is hard. In the most common implementation, it is just one additional column that gets filled with a random string and an index on the column.
If they could somehow change your code, all hope is already lost.
But I do agree that it does allow someone to determine rate of growth, which would be more valuable from a business-intelligence standpoint than as a privacy violation.
The larger issue is that a developer forgets to add the “and userid = ?”
I guess the workaround for that is to have a database that ties user authentication to records in the table/object store directly, like DynamoDB or S3.
So the developer may think it is safe to say select value from stock_positions left join accounts on accounts.id = stock_positions.account_id left join user_accounts on user_accounts.account_id = accounts.id left join users on users.id = user_accounts.user_id where users.id = session.userid.
Safe right? We checked the userid. But then clicking on the position to drill in on the position data, they just select * from stock_positions where stock_positions.id = params.stock_id... there's no "and stock_positions.user_id" on that table, and the developer might be too lazy to spin up the entire join again, especially since you don't need account data for this view. Whoops, suddenly a vulnerable query.
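The fix (assuming the join path sketched above and a DB-API style connection) is to make the ownership check part of the drill-down query too, rather than trusting the list view to have done it:

    def get_position(db, session_user_id, stock_position_id):
        # ownership is re-checked on the detail query, not just on the list query
        return db.execute(
            """
            SELECT sp.*
            FROM stock_positions sp
            JOIN accounts a       ON a.id = sp.account_id
            JOIN user_accounts ua ON ua.account_id = a.id
            WHERE sp.id = ? AND ua.user_id = ?
            """,
            (stock_position_id, session_user_id),
        ).fetchone()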
I imagine there are other ways to screw up. Like insecure cookies, and just checking cookie.userid, ah yes, you're the right user. Whoops, didn't realize cookies could be spoofed.
But you don’t do cookie.userid.
You send the username and password to an authentication service which generates a token with a checksum. The token along with the username and permission is cached in something like Redis.
On each request, middleware gets the user information back using the token.
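Sketched with redis-py, leaving out the checksum and the password check itself (the key layout and TTL are just illustrative):

    import secrets
    import redis

    r = redis.Redis()

    def login(username: str, permissions: str) -> str:
        # called after the auth service has verified the password
        token = secrets.token_urlsafe(32)
        r.hset(f"session:{token}", mapping={"username": username, "permissions": permissions})
        r.expire(f"session:{token}", 3600)   # sessions expire after an hour
        return token

    def user_for_request(token: str):
        # middleware runs this on every request; no entry in Redis means no access
        session = r.hgetall(f"session:{token}")
        return session or None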
> Yet another security vulnerability caused by...
I mean, yes, but these are also some of the easiest vulnerabilities to miss even with out-of-the-box static analysis (code scanning and data analysis), automated dynamic analysis (pentests [edit to clarify for tptacek: automated pentests]), and a basic code review process. They're usually identified in live environments during manual penetration tests or, in more security-mature environments, with custom static analysis checks and custom linting rules.
As for best-case prevention: accomplished generally architecturally, e.g. language/framework decisions that enforce secure coding practices by design, or implementing certain patterns in development which whisks away some of the more risky coding decisions from engineers who may not be qualified to be making them, such as mandating authn/z and limiting exceptions only to roles and change processes qualified to make them. Checks including linting for specific privacy defects (direct object referencing using sensitive data or iterative identifiers as opposed to hashes/guids/etc) can help with catching them during development, and as you might've guessed, such checks tend to be custom for a given environment rather than out of the box.
I distinctly recall a card issuer whose name starts with a C in the United States having an http endpoint which allowed for enumerating account details by iterating full PANs (16 digit card numbers)... around a decade ago. Here we are today, and you're seeing the same bugs continue to arise.
Mitigation options in organizations with immature security practices typically rule out remediation, simply because the defects' existence might not even be known. Instead you may need to rely on practices traditionally reserved for defense-in-depth (think monitoring web requests for anomalous behaviors and blocking traffic when detected) rather than trusting that you can fix all the defects, and even then you'll still lose a few records... but that might be the only solution available to you as a CTO, CIO, or CISO, simply because of resource constraints and bureaucracy in an entrenched org, e.g. in the financial or insurance space.
tl;dr: these defects are among the harder ones to catch for legacy applications especially in environments with weaker security postures, and they're as old as time. What I'm saying is that as much as we can call companies out for making these mistakes in hindsight, their existence in larger legacy systems is to some extent inevitable and must be managed in other ways.
This is absolutely not the kind of vulnerability that pentests tend to miss; rather, they're the first thing pentesters check for. You can miss bugs like this when they're in obscure backend features and your client or team didn't document the project adequately --- though you still shouldn't, and that's part of the point of getting an assessment, to find stuff like that --- but you generally don't miss them in an assessment where the bug is literally "edit a number in a URL".
Web scanning tools will miss findings like this. But, regarding web scanners: see static source code security analyzers.
As for code review: a competently constructed application shouldn't be relying on developers to catch every possible instance where numeric ids are used individually. In modern web frameworks, it should be obvious when you're looking an ID up without doing an authorization check; for instance, in a Rails or Django app, you can simply regex for lookups coming off the ORM class rather than the appropriate association instance.
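A crude version of that regex check, as a standalone script: it flags Django-style lookups made straight off the ORM class (e.g. Document.objects.get(...)) instead of through a user-scoped association (e.g. request.user.documents.get(...)). The pattern is only a sketch and will have false positives:

    import re
    import sys
    from pathlib import Path

    # flag lookups made directly on an ORM class rather than a scoped association
    UNSCOPED = re.compile(r"\b[A-Z]\w+\.objects\.(get|filter)\(")

    root = Path(sys.argv[1] if len(sys.argv) > 1 else ".")
    for path in root.rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if UNSCOPED.search(line):
                print(f"{path}:{lineno}: {line.strip()}")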
In sum: I dispute much of this analysis.
People do miss things, even when they're things they shouldn't miss. Put 3 different test teams on the same application and you will get 3 overlapping but distinct sets of vulnerabilities back. But this is not an instance of the kind of vulnerability that is hard to catch.
You're right; they don't. Which is why I called out automated dynamic analysis. I.e. the web scanning tools which you subsequently mentioned:
> Web scanning tools will miss findings like this.
> As for code review: a competently constructed application shouldn't be relying on developers to catch every possible instance where numeric ids are used individually. In modern web frameworks, it should be obvious when you're looking an ID up without doing an authorization check; for instance, in a Rails or Django app, you can simply regex for lookups coming off the ORM class rather than the appropriate association instance.
Right, which I also stated:
> As for best-case prevention: accomplished generally architecturally, e.g. language/framework decisions that enforce secure coding practices by design, or implementing certain patterns in development which whisks away some of the more risky coding decisions from engineers who may not be qualified to be making them, such as mandating authn/z and limiting exceptions only to roles and change processes qualified to make them. Checks including linting for specific privacy defects (direct object referencing using sensitive data or iterative identifiers as opposed to hashes/guids/etc) can help with catching them during development, and as you might've guessed, such checks tend to be custom for a given environment rather than out of the box.
A sibling comment makes the obvious point that no pre-auth endpoint should be touching this kind of data to begin with, which is another layer of "stuff you can just regex for".
That's fine, but I'd appreciate it if you just read the entire analysis next time. It shows that you respect the time people invest into constructing and presenting guidance, even if you don't necessarily respect the guidance itself.
Editing mine to match your edit... as if to make my point about reading the analysis in its entirety:
> A sibling comment makes the obvious point that no pre-auth endpoint should be touching this kind of data to begin with, which is another layer of "stuff you can just regex for".
Correct, something which I'd also stated:
> Checks including linting for specific privacy defects (direct object referencing using sensitive data or iterative identifiers as opposed to hashes/guids/etc) can help with catching them during development, and as you might've guessed, such checks tend to be custom for a given environment rather than out of the box.
> I do want to take every opportunity I can get to disabuse people about the effectiveness of scanners.
This entire exchange is frustrating because it's exactly what I said in my root comment:
> these are also some of the easiest vulnerabilities to miss even with out-of-the-box static analysis (code scanning and data analysis), automated dynamic analysis (pentests [edit to clarify for tptacek: automated pentests]), and a basic code review process.
I'm going to step away from my keyboard a bit; please forgive me.
I really appreciate this as this at least concludes that a miscommunication took place, thank you. I'll accept that there's likely a bit too much flourish to what I write for the sake of targeting nuanced clarity.
> If you think scanners suck too, we might just not have anything worth arguing about.
Largely yes, but I do think they have their place. I view them more as platforms to build upon or add to (e.g. custom data rules or enforcing the use of specific best practices) than generalized security salves, but as you'd pointed out, many of those objectives can also be achieved through much simpler means, e.g. just grep the code for things as a commit test.