1. Using sequentially incremented integer sequences as object IDs, and
2. Failing to protect sensitive data using some kind of authentication and authorization check.
This is becoming a trend with data breaches. Several of Krebs' other reports on behalf of security researchers were originally identified by (trivially) walking across object IDs on public URLs.
My cynical take is that Krebs couldn't go public before this afternoon because First American wanted it to hit the news at an opportune time, then get ahead of it with their own messaging. Krebs got in touch with First American on Monday May 19th. The story is only just breaking now on a Friday afternoon at 5 pm; markets are conveniently closed for the weekend.
I expect them to issue a hollow PR statement about valuing security despite being unable to act on security reports until an investigative journalist threatens to go public.
I once made an app not using sequential integers as object ids, as you suggest.
It was an absolute nightmare. Maintenance was a nightmare, you're constantly having to generate or replicate these things that add an extra layer of complexity to everything, and almost always unnecessarily.
It's also extremely bad for db performance, causes massive page fragmentation, indexes become useless almost straight after rebuilding them, etc.
For almost everything, sequential int IDs are fine. It's the things you expose to users that you need to be careful with. For those, don't use the primary key to access them: add another unique key for external access, but keep the int id in there for the db to use and for your own use.
My lesson was to go back to always using int ids, and on a few objects have a separate unique key column to expose to users for sensitive stuff.
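For anyone who wants to see what that looks like, here is a minimal sketch using SQLite. The table name `invoices`, the column `public_id`, and both helper functions are made up for illustration, not taken from any real app. The int primary key stays internal and only the random key is ever handed out.

```python
import secrets
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE invoices (
        id        INTEGER PRIMARY KEY,   -- internal sequential key, never exposed
        public_id TEXT NOT NULL UNIQUE,  -- random key used in URLs and APIs
        amount    INTEGER NOT NULL
    )
    """
)


def create_invoice(amount: int) -> str:
    """Insert a row and return the random public key for external use."""
    public_id = secrets.token_urlsafe(16)  # 16 bytes (128 bits) from the OS CSPRNG
    conn.execute(
        "INSERT INTO invoices (public_id, amount) VALUES (?, ?)",
        (public_id, amount),
    )
    return public_id


def find_invoice(public_id: str):
    """Look a row up by its public key; returns (id, amount) or None."""
    return conn.execute(
        "SELECT id, amount FROM invoices WHERE public_id = ?", (public_id,)
    ).fetchone()
```

The UNIQUE constraint gives you the index on the public key for free, and joins and foreign keys can keep using the int id.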
I also don't think using UUIDs as a security (by obscurity) strategy is valid. But there are other reasons someone may choose to use UUIDs. For instance, it's convenient to generate identifiers in a decentralized manner. I want to counter your one bad experience with my (equally anecdotal) many-multiple good experiences. Databases do just fine with UUIDs. Though we may be working on different kinds of systems, and optimizing for different things. I don't frown upon using integers (well, longs) for identifiers, but I personally prefer UUIDs.
A securely generated 128 bit UUID isn't security-by-obscurity, but rather security-by-cryptography. It's still bad not to have authorization checks, because UUIDs can "leak" into logs, browser histories, emails, and things like that. But the security benefit of using crypto-random IDs is neither cosmetic nor superficial.
Most applications don't use UUIDs and many of them are fine and I definitely wouldn't ding an app for using monotonic IDs, but I'm increasingly thinking that it's worth praising UUIDs more.
If you know the (integer) identifier, and because the bad application isn't secured with authentication, you get access to something you're not supposed to. If you make the identifier a lot harder to know, and you still have no security, that smells like the obscurity part. I can absolutely see your point that the UUID identifiers are not just a lot harder to guess, they may be impossible to guess. But the security is still bad, and I don't think that the impossible to guess-property of the UUIDs should be a substitute for security. I don't think we really disagree, though.
That's not what people usually mean by "security by obscurity" when they critique the concept. Unfortunately the term is overloaded so it's lost its way over time.
To illustrate this for you, let me turn it around a bit. Is it security by obscurity if the only thing stopping someone from logging into your account is knowing your password?
Security by obscurity is when you (for example) roll your own cryptosystem and rely (in whole or part) on the secrecy of your new-fangled algorithm to save you. That is unsafe. But if you're saying high-entropy strings shouldn't be the only barrier to authentication, you're throwing out half a century of complexity-theoretic cryptography.
Yeah, I think the understandable confusion comes from the idea that a UUID "obscures" the sequential identity of the id in the same way a password mask obscures a password, but the obscurity in security through obscurity refers to reliance on an attacker's ignorance of implementation details to secure the system rather than on a mechanism that is provably secure.
I think context matters here. If someone wants to hand out tokens, for instance via e-mail verification, I'm fine with relying on that being a UUID. When you make it harder (impossibly hard) to guess a "record number" by using UUIDs, which is what we were probably talking about, that's great too. (I already yielded that point.) But let's not lead the general population into thinking that UUIDs make everything safer (probably not what you were saying), because if something is "just an identifier" it may not be handled as safely, which is what this seems to be relying on in the context of security. Same as how user names were traditionally not handled as something secret or confidential. Sometimes UUIDs appear as just identifiers and are not handled with any secrecy, so they just can't always double as a security feature.
> Sometimes UUIDs appear as just identifiers and are not handled with any secrecy, so they just can't always double as a security feature.
I can see your point. If UUIDs are handled in such a way that they are discoverable by anyone, they are not enough to make the references secure.
I think the point tptacek and others are making is that this is an instance of the defence in depth principle, though. In scenarios where UUIDs are not simply discoverable, using UUIDs is inherently more secure than using a monotonic ID, simply because the monotonic ID can be easily guessed. Yet, they are still not enough in isolation and you should be additionally using proper access control (due to eventual leakage of particular UUIDs in emails and such).
I never said they were less secure. I said there are situations where they're not really more secure.
If I can see in this HTML page that your reply is /reply?id=12345, then it doesn't matter if Hacker News uses integers or UUIDs, if there's a bug in /edit?id=12345 that just lets me edit it without the appropriate security. If we say that UUIDs always make everything inherently more secure, we're doing everyone a disservice.
Now, the original discussion was about (1) discovering for read, and not about (2) escalating a read to a write. But if anyone reading this mistakenly takes from it that UUIDs are the way to solve these problems then they will go on optimizing for (1) at the expense of (2).
Note most databases use type 1 UUIDs by default, not randomly generated type 4 UUIDs. There are tons of security holes out there because people are using type 1 UUIDs thinking they can be used as secure tokens.
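To make the difference concrete, here is a small sketch using Python's standard `uuid` module (this shows Python's behaviour, not any particular database's). A version 1 UUID carries a timestamp and a node identifier in its fields, so it is structured rather than random.

```python
import uuid

a = uuid.uuid1()
b = uuid.uuid1()

# Version 1 embeds a 60-bit timestamp and a 48-bit node identifier,
# both of which can be read straight back out of the value.
print(a.version, hex(a.node), a.time)
print(b.version, hex(b.node), b.time)

# Version 4 is random apart from the fixed version and variant bits.
# CPython's uuid4() draws those bits from os.urandom.
r = uuid.uuid4()
print(r.version, r)
```

Two version 1 values generated close together on the same machine will typically differ only in the timestamp portion, which is why they make poor secret tokens.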
I always wondered why databases have not implemented a scheme like Microsoft's Active Directory RID master FSMO role. One server is responsible for handing out chunks of IDs to each server. They request a new block whenever a threshold is reached (50% by default IIRC).
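A rough sketch of that block-allocation idea, purely illustrative: the class names `BlockMaster` and `Node`, the block size, and the refill threshold are all assumptions, and this is not how Active Directory actually implements it.

```python
import threading
from collections import deque


class BlockMaster:
    """Central allocator that hands out disjoint ID ranges (hypothetical)."""

    def __init__(self, block_size: int = 1000):
        self._block_size = block_size
        self._next_start = 1
        self._lock = threading.Lock()

    def next_block(self) -> range:
        # Only this step needs coordination between servers.
        with self._lock:
            start = self._next_start
            self._next_start += self._block_size
        return range(start, start + self._block_size)


class Node:
    """A server that assigns IDs locally and refills from the master."""

    def __init__(self, master: BlockMaster, threshold: float = 0.5):
        self._master = master
        self._threshold = threshold
        self._pool = deque()
        self._refill_at = 0
        self._fetch()

    def _fetch(self) -> None:
        block = self._master.next_block()
        self._pool.extend(block)
        # Request another block once this many IDs (or fewer) remain.
        self._refill_at = int(len(block) * (1 - self._threshold))

    def next_id(self) -> int:
        if len(self._pool) <= self._refill_at:
            self._fetch()
        return self._pool.popleft()
```

Each node hands out unique integers without a round trip per insert, at the cost of IDs no longer being globally ordered by creation time.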
> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities (identifiers whose mere possession grants access), for example.
Sure, security is about taking a layered approach - I don't think anyone would seriously advocate using knowledge of a UUID as enough authorisation on its own. Well, I hope not :)
I find UUIDs very useful for this reason - the IDs can be generated by different parts of a distributed system, and be "guaranteed" to be unique.
In this kind of system you can also generate deterministic UUIDs, which are useful for idempotency (e.g. the same event can be recognised as a duplicate).
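A minimal sketch of the deterministic case using name-based (version 5) UUIDs from Python's standard library. The namespace name `events.example.com`, the `source:sequence` key format, and the in-memory `seen` set are assumptions for illustration only.

```python
import uuid

# Hypothetical namespace for this application's events.
EVENT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "events.example.com")

# In-memory stand-in for whatever store tracks processed events.
seen = set()


def event_id(source: str, sequence: int) -> uuid.UUID:
    """Derive the same UUID every time from the same event key."""
    return uuid.uuid5(EVENT_NAMESPACE, f"{source}:{sequence}")


def handle(source: str, sequence: int) -> bool:
    """Return True if the event was processed, False if it was a duplicate."""
    eid = event_id(source, sequence)
    if eid in seen:
        return False
    seen.add(eid)
    return True
```

Worth noting that these are predictable to anyone who knows the inputs, so they help with idempotency but not with unguessability.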
Sure, that's fine. The context of my point about IDs is for user-facing APIs. Note that user-facing really means "publicly accessible", even in the case of private APIs. As I mentioned elsewhere, market research groups will be happy to extrapolate as many metrics as they can from your API's integer object IDs.
That being said I'm a little surprised to hear about the complexity. Are you able to share which DB/stack you were using? This functionality should be natively supported at two distinct abstractions: your programming language and your database.
In that case C#/EF/SQL Server is what that app was made in. This was like 6 years ago, admittedly, but it didn't feel as if it's really treated as a first class citizen. Everything's in ints in example code, you have to fight the auto-code generators a bit, etc. So in my experience it's never anywhere near as seamless as the int support.
But it's not just the support that's the problem. You're testing, you need to switch category, you can't just change a 1 to a 2. You have to go find what random uuid the categories had added to it. You can't just go into the DB and add a new line, you have to open a UUID generator. You can't just quickly add a foreign key relationship, you have to look up the UUID. And a ton of other little annoyances.
Actually, categories are an excellent example of something that shouldn't be a UUID, they're actually supposed to be discoverable.
I think my present project has UUIDs on the user, company, invoice and payments tables, but still ints as the primary key. Everything else isn't worth it. There's a merchant table, but again, they're all supposed to be discoverable (and aren't editable by the merchant themselves).
I also generally implement controller level security that checks access to the root object being returned by default, so I can't really make a mistake exposing an unauthorised object. There's an occasional controller where I've made a conscious decision not to implement that level, generally actions that allow both authenticated and unauthenticated users (e.g. viewing merchants or categories).
You can generate uuids that play nicer with database storage / indexing. NEWSEQUENTIALID() in MSSQL, for example.
The keys will be easier to guess again, but if all you have to do is guess a primary key to get access to the underlying data, something else isn't right anyways.
It's not really security through obscurity. In this case, as I understand it, the IDs were related to data that the company was making available to users through email links. A cryptographically secure 128-bit UUID is impossible to guess, no less so than a cryptographic access token. Now of course, you would probably rather have an authentication scheme on top of that, but that comes at a support cost in terms of customers losing their passwords, locking themselves out of their accounts, etc. And it is not clear you have increased security, as people re-use passwords.
Then of course there is the issue that email is for the most part un-encrypted (or encrypted without validating certificates).
It's still an access control issue in that case. The user should never be aware of the UUIDs. Only the backend should deal with them. If you have a _public_ API that deals with UUIDs, therein lies the issue.
And a side note: I wouldn't trust that the PRNG for your UUIDs is cryptographically secure. That's not a part of the spec.
Is the point of non-monotonic ID schemes to make them -secure- secure?
I thought they were a bit of a hack to raise the bar a touch. In which case the crypto security properties of that function aren’t interesting. Instead the ergonomics are.
No, the cryptographic security of the identifier matters a lot. A GUID generated from an insecure PRNG can be used to predict other GUIDs. A UUID generated from 16 bytes of /dev/urandom can't be used to get anything but the object to which it refers.
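For reference, building an ID that way takes a couple of lines in Python. This is a sketch; the function name `random_object_id` is made up. `os.urandom` reads from the operating system's CSPRNG, and passing `version=4` to the `uuid.UUID` constructor sets the version and variant bits, leaving 122 random bits.

```python
import os
import uuid


def random_object_id() -> uuid.UUID:
    """Build a version 4 UUID from 16 bytes of OS-provided randomness."""
    return uuid.UUID(bytes=os.urandom(16), version=4)


print(random_object_id())
```

Knowing one such value tells you nothing about the next one, which is the property that matters here.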
Yeah there's nothing wrong with using sequential integer IDs in the database. But objects should be assigned random unique IDs as well, which is how they are referenced by and presented to the outside world. The random ID is what is presented to the frontend/user. I'm not sure what the issue you had with generating random integers for primary keys was, it seems like that should work fine. Is it because the index has to be rebuilt when a value is inserted into the middle of the ordered sequence?
Something I realised looking at Google+ identifiers -- 21 digit numerics, 19 of those significant -- was that it made brute-searching the user profile space infeasible. There were only 4 billion and change legitimate profiles, there was a 4 in 100 billion chance of hitting one by chance on any given random request of the space. And IDs appeared to be randomly distributed.
And yes, Google also posted a sitemaps file (or rather, 50,000 sitemap files) with all profile IDs. But that was last marked updated in March 2017, for some reason. Being able to validate that would have been nice.
But as a mitigation against blind bulk scrapes, a useful tool. I'd consider that one of G+'s good design elements.
There is nothing wrong with using sequential ids in and of themselves.
The typical web app has the concept of a validated user session per request. How hard is it really to
Select ... From Documents where documentid = ? and userid = ?
So even if the user does a
GET /Document/{id+1}
No documents would be returned.
Every web framework that I am aware of lets you add one piece of middleware that validates a user session and won’t even route to the request if the user isn’t validated.
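Spelled out as a runnable sketch with SQLite, the query above looks something like this. The schema, the sample rows, and the `get_document` helper are assumptions for illustration; the user id is meant to come from the validated session, never from the request.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "documentid INTEGER PRIMARY KEY, userid INTEGER NOT NULL, body TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [(1, 10, "alice's deed"), (2, 20, "bob's deed")],
)


def get_document(session_userid: int, documentid: int):
    """Return the document body only if it belongs to the session's user."""
    row = conn.execute(
        "SELECT body FROM documents WHERE documentid = ? AND userid = ?",
        (documentid, session_userid),
    ).fetchone()
    return row[0] if row else None
```

With that in place, user 10 asking for document 2 gets nothing back, sequential IDs or not.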
No, nothing wrong with it intrinsically. But if UUIDs were used instead, the lack of authentication or authorization checks wouldn't be as catastrophic. That would be somewhat comparable to having a reset password token which doesn't expire. Still bad, but not as bad.
The other commenter's point about leaking information is also correct. In the finance industry one of the basic tricks to obtaining alternative data is to scrape it from private APIs which expose sequential IDs corresponding to a source of revenue. For example, a publicly traded car company might have its revenue extrapolated from an open API which sequentially increments an ID every time a vehicle is sold. Research groups will reverse engineer mobile apps from companies with only one or two dimensions of revenue, find the private API endpoints (reversing request signing as needed), and then look for object IDs which can be thrown into a timeseries on a quarterly basis.
Generally speaking the risk and compliance department of a hedge fund disallows this kind of data if it's gathered from an actual security vulnerability (e.g. leaks PII). It needs to be "only" a neutral information side channel without sensitive data, so that doesn't really apply in this specific scenario. But it does apply for people considering using integer IDs for user-facing APIs.
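The extrapolation itself is trivial, which is part of the point. A sketch, with made-up observations (the dates and ID values are not from any real API):

```python
from datetime import date


def estimated_daily_rate(first, second) -> float:
    """Estimate objects created per day from two (date, observed_id) samples."""
    (d1, id1), (d2, id2) = first, second
    days = (d2 - d1).days
    if days <= 0:
        raise ValueError("second observation must be later than the first")
    return (id2 - id1) / days


# Two hypothetical observations of the newest object ID, ten days apart.
rate = estimated_daily_rate((date(2019, 1, 1), 1000), (date(2019, 1, 11), 1500))
print(rate)
```

Repeat the sampling every quarter and you have the timeseries described above.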
Having done a few assessments in the last year where I was forced to downgrade sev:hi findings because nobody is realistically going to guess a 128 bit random number, I have to grudgingly acknowledge that UUID object keys are a meaningful security improvement. Which I hate to admit, because I'm generally of the opinion that "defense in depth" is a design cop-out, and here's a pretty potent counterexample.
I agree with you. Let me emphasize this explicitly: the real failure here is the utter lack of authn and authz. But it is meaningful that the integer IDs are being used.
One reason I <3 HN is that complex scenarios like this get described so clearly and succinctly.
I couldn't say it better myself when I'm speaking to management that makes these kinds of decisions. Now I can quote throwawaymath verbatim to drive the detailed point home.
Nice. This reminds me of the German tank problem in WWII, where the Allies used samples of serial numbers from captured Nazi tanks to estimate their population. The tanks and their parts used sequential serial numbers. It could also be used to determine production rates, I guess.
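For the curious, the usual frequentist estimator for that problem is m + m/k - 1, where m is the largest serial number seen and k is the number of samples. A small sketch (the function name is mine, and it assumes serials start at 1 and are sampled without replacement):

```python
def estimate_population(serials) -> float:
    """Estimate the total count from a sample of sequential serial numbers."""
    k = len(serials)
    if k == 0:
        raise ValueError("need at least one observed serial number")
    m = max(serials)
    return m + m / k - 1


print(estimate_population([19, 40, 42, 60]))  # 60 + 60/4 - 1 = 74.0
```

Swap "tanks" for "invoices" or "orders" and the same arithmetic applies to sequential object IDs.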
Maybe not "wrong", but there are some very obvious downsides to exposing sequential IDs vs a randomized token:
- It exposes the count you have of a particular item
- It exposes your growth rate of those items
- If a developer accidentally breaks your authentication (or somebody hacks it), it becomes trivially easy to download all your items very quickly
And it isn't like using a randomized token is hard. In the most common implementation, it is just one additional column that gets filled with a random string and an index on the column.
In that simple scenario. What are some ways that a hacker could break your front end API to allow it to serve requests for multiple users without having access to multiple account logins? I understand that they could possibly get access to your database but that’s a different threat.
If they could somehow change your code, all hope is already lost.
But I do agree that it does allow someone to determine rate of growth, which would be valuable more from a business intelligence side than a privacy violation.
The larger issue is that a developer forgets to add the “and userid = ?”
I guess the work around for that is to have a database that ties user authentication to records in the table/object store directly like DynamoDB or S3.
In my experience, many tables don't have a userid on the table that would be associated with the user. It would be a table join or two or three away.
So the developer may think it is safe to say select value from stock_position left join account on account.id = stock_position.id left join user_accounts on user_accounts.accountid = account.id left join users on user_accounts.userid = users.id where users.id = session.userid.
Safe right? We checked userid. But then clicking on the position to drill in on the position data, they just select * from stock_position where stock_position.id = params.stock_id... there's no "and stock_position.userid" on that table, and the developer might be too lazy to spin up the entire join again especially if you don't need account data for this view. Whoops, suddenly a vulnerable page query.
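Here is that mistake as a runnable sketch in SQLite. The schema is an assumption (in particular I've given `stock_position` an `accountid` column so ownership can be traced), and the function names are made up. The first lookup is the lazy drill-down; the second carries the ownership join along.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE account (id INTEGER PRIMARY KEY);
    CREATE TABLE user_accounts (userid INTEGER NOT NULL, accountid INTEGER NOT NULL);
    CREATE TABLE stock_position (
        id INTEGER PRIMARY KEY, accountid INTEGER NOT NULL, value INTEGER NOT NULL
    );
    INSERT INTO account VALUES (1), (2);
    INSERT INTO user_accounts VALUES (10, 1), (20, 2);
    INSERT INTO stock_position VALUES (100, 1, 5000), (101, 2, 9000);
    """
)


def position_unscoped(stock_id: int):
    """Vulnerable: returns any position to whoever supplies its id."""
    return conn.execute(
        "SELECT value FROM stock_position WHERE id = ?", (stock_id,)
    ).fetchone()


def position_for_user(session_userid: int, stock_id: int):
    """Scoped: the position must sit in an account linked to the user."""
    return conn.execute(
        """
        SELECT sp.value
        FROM stock_position AS sp
        JOIN user_accounts AS ua ON ua.accountid = sp.accountid
        WHERE sp.id = ? AND ua.userid = ?
        """,
        (stock_id, session_userid),
    ).fetchone()
```

User 10 can read position 101 through the first function but not through the second, and nothing about the first one looks obviously wrong in review.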
I imagine there are other ways to screw up. Like insecure cookies, and just checking cookie.userid, ah yes, you're the right user. Whoops, didn't realize cookies could be spoofed.
If the cookie is spoofed and someone got another client's authorization token, then they would get any documents that user was authorized to see anyway.
But you don’t do cookie.userid.
You send the username and password to an authentication service which generates a token with a checksum. The token along with the username and permission is cached in something like Redis.
On each request, middleware gets the user information back using the token.
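A stripped-down sketch of that flow, with a plain dict standing in for Redis and an opaque random token rather than one with a checksum. All the names here (`issue_token`, `require_session`, `whoami`) are hypothetical, and password checking is assumed to have happened before `issue_token` is called.

```python
import secrets

# In-memory stand-in for the Redis cache mapping token -> user info.
_sessions = {}


def issue_token(username: str, permissions: list) -> str:
    """Called by the auth service after the password has been verified."""
    token = secrets.token_urlsafe(32)
    _sessions[token] = {"username": username, "permissions": permissions}
    return token


def require_session(handler):
    """Middleware: resolve the token to a user before the handler runs."""

    def wrapped(token, *args, **kwargs):
        user = _sessions.get(token)
        if user is None:
            return 401, "unauthorized"
        return handler(user, *args, **kwargs)

    return wrapped


@require_session
def whoami(user):
    return 200, user["username"]
```

The handler never sees a client-supplied user id, only whatever the server stored against the token.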
I'm familiar with that process. I was trying to illustrate a picture of how a poor developer might stumble their way into this situation. It's technically possible to store the userid in the cookie rather than using JWTs, but obviously it's not secure in the slightest.
(It's apparent that my initial reply didn't resonate, so I've made substantial edits to my reply for clarity's sake. If you've read it once, give it another read; it's from the angle of an organization with much in the way of legacy impairment.)
> Yet another security vulnerability caused by...
I mean, yes, but these are also some of the easiest vulnerabilities to miss even with out-of-the-box static analysis (code scanning and data analysis), automated dynamic analysis (pentests [edit to clarify for tptacek: automated pentests]), and a basic code review process. They're usually identified in live environments during manual penetration tests or, in more security-mature environments, with custom static analysis checks and custom linting rules.
As for best-case prevention: accomplished generally architecturally, e.g. language/framework decisions that enforce secure coding practices by design, or implementing certain patterns in development which whisk away some of the more risky coding decisions from engineers who may not be qualified to be making them, such as mandating authn/z and limiting exceptions only to roles and change processes qualified to make them. Checks including linting for specific privacy defects (direct object referencing using sensitive data or iterative identifiers as opposed to hashes/guids/etc) can help with catching them during development, and as you might've guessed, such checks tend to be custom for a given environment rather than out of the box.
I distinctly recall a card issuer whose name starts with a C in the United States having an http endpoint which allowed for enumerating account details by iterating full PANs (16 digit card numbers)... around a decade ago. Here we are today, and you're seeing the same bugs continue to arise.
Mitigation options in organizations with immature security practices typically rule out remediation simply because their existence might not be known, and practices traditionally reserved for defense-in-depth may need to be relied-upon instead (think monitoring web requests for anomalous behaviors and blocking traffic when detected) rather than trusting that one can fix all the defects, and even then you'll still lose a few records... but that might be the only solution available to you as a CTO, CIO, or CISO simply because of resource constraints and bureaucracy in an entrenched org e.g. in the financial or insurance space.
--
tl;dr: these defects are among the harder ones to catch for legacy applications especially in environments with weaker security postures, and they're as old as time. What I'm saying is that as much as we can call companies out for making these mistakes in hindsight, their existence in larger legacy systems is to some extent inevitable and must be managed in other ways.
There are no effective static source code security analyzers. Static analyzers aren't a bad thing to add to a CI pipeline, because why not, but anyone depending on static analysis is playing to lose.
This is absolutely not the kind of vulnerability that pentests tend to miss; rather, they're the first thing pentesters check for. You can miss bugs like this when they're in obscure backend features and your client or team didn't document the project adequately --- though you still shouldn't, and that's part of the point of getting an assessment, to find stuff like that --- but you generally don't miss them in an assessment where the bug is literally "edit a number in a URL".
Web scanning tools will miss findings like this. But, regarding web scanners: see static source code security analyzers.
As for code review: a competently constructed application shouldn't be relying on developers to catch every possible instance where numeric ids are used individually. In modern web frameworks, it should be obvious when you're looking an ID up without doing an authorization check; for instance, in a Rails or Django app, you can simply regex for lookups coming off the ORM class rather than the appropriate association instance.
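As a rough illustration of the regex idea for a Django codebase: flag lookups that start from a model's `objects` manager rather than from a relation on the current user. The pattern, the model name `Document`, and the `request.user.documents` relation in the sample are assumptions; a real check would need tuning per project and will produce false positives.

```python
import re

# Matches e.g. "Document.objects.get(" or "Invoice.objects.filter(".
UNSCOPED_LOOKUP = re.compile(r"\b[A-Z]\w*\.objects\.(?:get|filter)\(")


def find_unscoped_lookups(source: str):
    """Return (line_number, line) pairs that query a model class directly."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), 1)
        if UNSCOPED_LOOKUP.search(line)
    ]


# Hypothetical view code: the first lookup is unscoped, the second is not.
SAMPLE = """
doc = Document.objects.get(pk=doc_id)
doc = request.user.documents.get(pk=doc_id)
"""

for number, line in find_unscoped_lookups(SAMPLE):
    print(number, line)
```

Run something like this as a commit check and every hit becomes a line somebody has to justify.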
In sum: I dispute much of this analysis.
People do miss things, even when they're things they shouldn't miss. Put 3 different test teams on the same application and you will get 3 overlapping but distinctive sets of vulnerabilities back. But this is not an instance of the kind of vulnerability that is hard to catch.
> This is absolutely not the kind of vulnerability that pentests tend to miss
You're right; they don't. Which is why I called out automated dynamic analysis. I.e. the web scanning tools which you subsequently mentioned:
> Web scanning tools will miss findings like this.
---
> As for code review: a competently constructed application shouldn't be relying on developers to catch every possible instance where numeric ids are used individually. In modern web frameworks, it should be obvious when you're looking an ID up without doing an authorization check; for instance, in a Rails or Django app, you can simply regex for lookups coming off the ORM class rather than the appropriate association instance.
Right, which I also stated:
> As for best-case prevention: accomplished generally architecturally, e.g. language/framework decisions that enforce secure coding practices by design, or implementing certain patterns in development which whisk away some of the more risky coding decisions from engineers who may not be qualified to be making them, such as mandating authn/z and limiting exceptions only to roles and change processes qualified to make them. Checks including linting for specific privacy defects (direct object referencing using sensitive data or iterative identifiers as opposed to hashes/guids/etc) can help with catching them during development, and as you might've guessed, such checks tend to be custom for a given environment rather than out of the box.
I'll amend my previous comment to say that I only dispute much of the analysis, not "the whole" analysis.
A sibling comment makes the obvious point that no pre-auth endpoint should be touching this kind of data to begin with, which is another layer of "stuff you can just regex for".
> I'll amend my previous comment to say that I only dispute much of the analysis, not "the whole" analysis.
That's fine, but I'd appreciate it if you just read the entire analysis next time. It shows that you respect the time people invest into constructing and presenting guidance, even if you don't necessarily respect the guidance itself.
---
Editing mine to match your edit... as if to make my point about reading the analysis in its entirety:
> A sibling comment makes the obvious point that no pre-auth endpoint should be touching this kind of data to begin with, which is another layer of "stuff you can just regex for".
Correct, something which I'd also stated:
> Checks including linting for specific privacy defects (direct object referencing using sensitive data or iterative identifiers as opposed to hashes/guids/etc) can help with catching them during development, and as you might've guessed, such checks tend to be custom for a given environment rather than out of the box.
Yeah, no, I think you got this wrong, but more than that I was motivated to comment by the implication you made that these were "easy to miss" vulnerabilities because bullshit security tools that don't work miss them. I don't so much care whether you're right or wrong, but I do want to take every opportunity I can get to disabuse people about the effectiveness of scanners.
> "easy to miss" vulnerabilities because bullshit security tools that don't work miss them
> I do want to take every opportunity I can get to disabuse people about the effectiveness of scanners.
This entire exchange is frustrating because it's exactly what I said in my root comment:
> these are also some of the easiest vulnerabilities to miss even with out-of-the-box static analysis (code scanning and data analysis), automated dynamic analysis (pentests [edit to clarify for tptacek: automated pentests]), and a basic code review process.
[...]
> Checks including linting for specific privacy defects (direct object referencing using sensitive data or iterative identifiers as opposed to hashes/guids/etc) can help with catching them during development, and as you might've guessed, such checks tend to be custom for a given environment rather than out of the box.
---
I'm going to step away from my keyboard a bit; please forgive me.
You "stepped away from the keyboard", and then edited your comment. I read what you wrote differently than you appear to have intended. It is fine if we simply disagree about this. If you think scanners suck too, we might just not have anything worth arguing about.
> I read what you wrote differently than you appear to have intended.
I really appreciate this as this at least concludes that a miscommunication took place, thank you. I'll accept that there's likely a bit too much flourish to what I write for the sake of targeting nuanced clarity.
> If you think scanners suck too, we might just not have anything worth arguing about.
Largely yes, but I do think they have their place. I view them more as platforms to build upon or add to (e.g. custom data rules or enforcing the use of specific best practices) than generalized security salves, but as you'd pointed out, many of those objectives can also be achieved through much simpler means, e.g. just grep the code for things as a commit test.