I see this as an issue of typing. Programs use integers in many different ways, and if you have a function that accepts an integer as an argument, neither static nor dynamic type checking will catch the case where you pass the wrong "type" of integer. The same goes for UUIDs, strings, or whatever other opaque values you use to reference entities in your system.
You can avoid these problems by wrapping the integers in objects that you use solely for referencing entities of the relevant type. For example:
    class UserRef {
        id: int;
    }
Then you define functions like this:
    function ban_account(user: UserRef) {
        // ...
    }
And a static type checker will pick up incorrect uses. For dynamically typed languages, you could instead use a unique field name such as user_id to achieve the same thing (though you get a runtime error instead of a compile-time error).
Obviously you'll still be using ints or strings as an external representation, but as long as you do the conversion at the point where the identifier enters the program, the type system will take care of the rest for you.
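To make the wrapper idea concrete, here is a minimal Python sketch. The names MessageRef and parse_user_ref are hypothetical, added only for illustration; the runtime isinstance check stands in for what a static checker would catch before the code runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserRef:
    """Opaque reference to a user; the raw id is only unwrapped at the storage layer."""
    id: int


@dataclass(frozen=True)
class MessageRef:
    """Hypothetical second reference type, deliberately not interchangeable with UserRef."""
    id: int


def parse_user_ref(raw: str) -> UserRef:
    # Hypothetical entry point: convert the external representation exactly once.
    return UserRef(int(raw))


def ban_account(user: UserRef) -> str:
    # A static checker would reject ban_account(42) or ban_account(MessageRef(42));
    # the isinstance check is the runtime fallback for unchecked callers.
    if not isinstance(user, UserRef):
        raise TypeError(f"expected UserRef, got {type(user).__name__}")
    return f"banned user {user.id}"
```

With this in place, ban_account(parse_user_ref("42")) works, while passing a bare loop index or a MessageRef fails loudly instead of banning whoever happens to have that number.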
Yes, this reminds me of "stringly typed programming", i.e. where the language may offer strong types, but the program just uses `String` everywhere. String injection attacks are examples of this: SQL injection can only occur if it's possible to concatenate "SQL statement" with "user input"; if these are both represented as `String` then it's easy to run into such problems; if they're represented as different types, then the only way to combine them would be with a designated conversion function, which is exactly where we can put the necessary escaping.
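A rough Python sketch of that separation, assuming a hypothetical Sql wrapper type and a quote_literal conversion function (neither comes from a real library). The escaping shown is the standard SQL rule of doubling single quotes; in real code the database driver's parameterized queries are the safer choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sql:
    """A fragment of trusted SQL text. Only Sql values can be concatenated together."""
    text: str

    def __add__(self, other: "Sql") -> "Sql":
        if not isinstance(other, Sql):
            # Plain strings (possible user input) cannot be mixed in directly.
            return NotImplemented
        return Sql(self.text + other.text)


def quote_literal(user_input: str) -> Sql:
    # The designated conversion function: the one place where escaping happens.
    # Assumes standard SQL string literals, where ' is escaped by doubling it.
    return Sql("'" + user_input.replace("'", "''") + "'")


query = Sql("SELECT * FROM users WHERE name = ") + quote_literal("O'Brien")
```

Trying Sql("SELECT ...") + some_plain_string raises a TypeError, so the only route from user input into a query goes through quote_literal.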
It’s what pretty much everyone in the F# or DDD community is advocating for.
Avoiding the first 10000 integers, as she suggested, is not a solution.
And using contiguous integers as user ids is the silliest thing that you can do given that it leaves you also open to enumeration attacks.
> And using contiguous integers as user ids is the silliest thing that you can do given that it leaves you also open to enumeration attacks.
I don't think it's silly, you just need to protect your parameters. There's a way to do this as part of the basic programming framework. See my other post: https://news.ycombinator.com/item?id=16947546
You’re right that it’s an issue of typing, but lots of type systems handle this just fine. Even Go’s type system (which everyone loves to hate) lets you create new types from existing ones, such as a UserID type based on int. This feature is frequently referred to as newtype, and it's even present in Python via typing.NewType! This is a bit nicer than the wrapper approach you describe for performance reasons (in practice, wrapping an int in a class will create unnecessary overhead in many languages).
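For reference, a short sketch of what the typing.NewType approach looks like in Python. MessageID and ban_account are illustrative names; the important caveat is that the mix-ups are only caught by a static checker such as mypy or pyright, not at runtime.

```python
from typing import NewType

UserID = NewType("UserID", int)
MessageID = NewType("MessageID", int)  # hypothetical second id type for contrast


def ban_account(user_id: UserID) -> str:
    return f"banned user {user_id}"


uid = UserID(42)
mid = MessageID(42)

result = ban_account(uid)   # accepted by a static checker
# ban_account(mid)          # rejected by a static checker: MessageID is not UserID
# ban_account(42)           # also rejected statically: a plain int is not UserID

# At runtime NewType is an identity function, so uid is a plain int
# and there is no wrapper object or extra allocation.
```

The trade-off against a wrapper class is that nothing stops the bad call at runtime if the checker is not part of the build.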
Basically this is saying: don't use numbers because somebody might write crap code and it'll get run on your deployment database?!
If getting such dangerously awful code deployed to production is likely then sequential IDs are just one of your many problems!
Sequential IDs for key data can be good to avoid for a few good reasons but awful code isn't one of them. Testing, code reviews and not having bad programmers should be in place to fix that.
All programmers make mistakes. I think the mark of good software design is using practices that make it impossible or much more difficult to make the mistakes that we know are going to be made at some point.
Much of our code review is not about "is this code correct" - that's obviously important, but rather "will this be easy to review changes to in the future" or "if I came back here in a year, would I get it wrong". I think that's just as valuable.
There's always a trade-off with complexity, and I'm not sure whether this one pays off, but designing _for_ programming errors is important for any team/company/product that is growing, and will introduce new developers who weren't around when the decisions were made.
I would reject a PR that introduces code like this. I disagree with the article. You should not introduce "safeguards" as described in the article. In a few years' time, someone will introduce a bug by assuming that user ids start from 1.
You have to assume that bugs are the norm, not the exception.
The more safeguards you put in place to make illegal states unrepresentable and illegal code uncompilable, the better it is.
The author was an SRE at Google IIRC. Google has a pretty decent code review and testing culture. However, even if you have a process that eliminates 99.9% of the moronic mistakes like this, that kind of scale still makes it a certainty that stuff like this will get through occasionally. She still has to try to keep the service running. That's the perspective from which she's writing.
For me the article's negative was that it focused on one edge case reason to not use IDs, when there are others.
While everyone makes mistakes I'd hope that using an array's size instead of its value is nigh on impossible to end up committed, let alone deployed to production.
Since everyone makes mistakes, for me the main reason for not using numeric user ids is that you're much more likely to accidentally expose an API or query string param that can be used to look up other user ids. When that happens, being able to enumerate ids makes for a massive data breach, whereas a GUID stops that.
(There may be some reason why you'd be forced to use numeric IDs in a data store for performance at scale reasons but I imagine that's relatively rare.)
How do you draw the conclusion that the author is saying "don't use numbers because somebody might write crap code" when one of the suggested improvements was "to use a large key space, like the 64 bit integers (or perhaps the subset which can be represented by Javascript, sigh)"?
int64 are numbers, so the author cannot be saying "don't use numbers".
Instead, it's specifically numbers generated by something like an auto-incrementing primary key in the database.
Problem is that randomly assigned ids tend to perform sub-optimally in relational databases due to the sparse distribution screwing up row estimates and causing fragmented writes to the underlying table. For this reason many people have a policy of having (nice sequential) "internal" ids and then (pseudorandom) "external" ids.
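A minimal sketch of the internal/external split, using an in-memory SQLite table purely for illustration. The schema, column names and helper functions are assumptions, not something from the article.

```python
import secrets
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- internal: sequential, used for joins
        public_id TEXT NOT NULL UNIQUE,        -- external: random, safe to expose
        name TEXT NOT NULL
    )
    """
)


def create_user(name: str) -> str:
    # 16 random bytes (128 bits), URL-safe encoded; only this value leaves the system.
    public_id = secrets.token_urlsafe(16)
    conn.execute("INSERT INTO users (public_id, name) VALUES (?, ?)", (public_id, name))
    return public_id


def internal_id_for(public_id: str) -> Optional[int]:
    # Translate at the boundary; everything behind it keeps using the sequential id.
    row = conn.execute("SELECT id FROM users WHERE public_id = ?", (public_id,)).fetchone()
    return row[0] if row else None
```

The clustered primary key stays dense and append-only, while URLs and API responses only ever carry public_id.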
Depends on your database. For example, with Spanner it is better to use random numbers for the primary index, as monotonically increasing ints will cause hotspots on the database.
UUIDs are the way, GUIDs for those in Microsoft's lands.
Not only will you never run out, but you don't need a round trip to the keymaster; horizontal scaling, clustering and even vertical sharding are all way easier because of that. It also removes a single point of failure entirely, since the keymaster is no longer needed.
Using ints as keys for profiles/users and many other things was needed way back when processing, db, disk and memory performance were a problem; that is no longer the case.
Do your part, join the UUID revolution. Also, if you were a piece of data, would you not want to be unique across all the databases, storage and services? You've heard of Roko's Basilisk right? Do not disappoint.
I’ve often wondered where the tipping point is where you do start having to worry about collisions. Is frantic “fix all the uuid breaking code” the jobs boost for programmers 20 years from now? 100?
We've had multiple collisions with uuid v4. I'm not overly familiar with the examples, but in one case, the collision was found because the collision happened in the same account! Customer: "Why do these two events have the same id?" Us: "Wow." I think we determined that it was a limitation in the random number generator in the Perl lib we were using. To avoid that, I believe the solution was to go with uuid v1 plus a nonce of some sort.
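For a rough sense of where the tipping point sits, the birthday bound gives an estimate. This sketch assumes a sound random number generator producing all 122 random bits of a version-4 UUID, which, as the Perl story shows, is not a given in practice. The function names are made up for illustration.

```python
import math

RANDOM_BITS = 122  # a version-4 UUID has 122 random bits; the other 6 are fixed


def collision_probability(n: int, bits: int = RANDOM_BITS) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / (2N)), with N = 2**bits."""
    space = 2 ** bits
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n * (n - 1) / (2 * space))


def ids_for_probability(p: float, bits: int = RANDOM_BITS) -> float:
    """Approximate number of ids needed to reach collision probability p."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))
```

Under that assumption, ids_for_probability(0.5) comes out at roughly 2.7e18 ids, so collisions seen at ordinary scale point at the generator rather than at the size of the id space.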
Holy schmolies, yes. And if you use them as your access keys for, say, a webservice, people aren't really going to be able to fish around the same way they can if you use auto-increment integers.
The upside is that your business is not totally transparent to outsiders who can figure out how many users, carts, products and messages you are adding each month.
There must be some other way. I can often figure out the direct url to a software project in bitbucket, but office365 urls are pages long. I somehow doubt atlassian is as transparent as you say.
True, you absolutely should have security and authorization set up. But if the server replies to the request with a 401 or 403, rather than a 404, you at least know that there is something there.
Looking at the provided example. Nothing like that has ever happened in any code I've ever written, or I've ever seen written by anybody else, and I have absolutely no fear that I will ever see anything like that happen in any code I write for the rest of my career/life.
So, not a great example, and not a very convincing argument to me to stop using integers.
That's interesting. Every project I've worked on has had at least one bug of this form, and typically more than one, where the integer-type indexes into a table of integer-type IDs have got mixed up with the integer-type IDs.
You might be surprised how much of a pain in the arse it is to even realise these bugs exist, because during development and initial tests these tables have a nasty habit in many situations of containing IDs that are the same as the index. And when everything is an int, or similar, fixing them can be quite painful too. It just takes one bug in one function for a set of subtly broken workarounds and/or misunderstandings to spread throughout the code. Lots of places where functions take an "id" and then pass it into a function that takes an "index", or vice versa... just what was the intention here? :(
Quite! It's supposed to be an alternative viewpoint, nothing more.
Thinking about it, maybe I shouldn't have opened with "That's interesting", which here means exactly what it says, but is a phrase sometimes deployed with malicious intent.
What arithmetic operations do you realistically expect to do with user ids? You might not make that specific error, but it seems to me that any arithmetic operation is likely to be a bug, and giving user ids their own type will catch all of them.
1/ stuffing random ints into your user functions : that problem can be solved with typing and tests. I'd hope that any piece of code which could have a drastic effect on a user would get extensive testing before going into production.
2/ ID canary : seems a rather good idea, like stack canaries commonly used when you don't have much stack space and you might get a collision with your heap. It's only a problem for languages for which point 1/ couldn't be a solution.
3/ Using UUIDs to avoid disclosing information about your user count : I think that's a separate problem. You should avoid disclosing unneeded information in general. If you use int IDs, always have an opaque public_id field that you use publicly and for interoperability with third-parties. But it does not mean you have to use UUIDs internally. However they do have a number of advantages, mainly that you don't need a central authority to distribute new sequence numbers, you can just generate UUIDs where you need them which will save you DB round-trips and make sharding of your DB easier. Also will help avoid issues such as this one : https://blog.travis-ci.com/2018-04-03-incident-post-mortem
I think the real problem is the terse coding style. Why does the for loop always use "i"?
When I write code, I don't use "i" or other single-letter variable names. I write long names, like "currentItem", "currentMessage", "currentUser", etc. If I reference an object, I usually name it "thisItem", "thisMessage", "thisUser".
Compilers shrink executables, so it's not a size issue. I don't know why people want code to be shorter; I prefer it to be easy to read and debug.
Naming won't help; sooner or later someone will make a similar mistake that will slip through code review.
The real problem is that it is possible to make this type of mistake by accident. If you can avoid it by leveraging the type system, you should. Relying on humans to never make a mistake is futile.
Since I largely program in C#, I typically define the unique ids on my classes as enums. Enums in C# are open, not closed, so you can define something like a UserId like so:
    public enum UserId { Unset = 0 }
    public class User
    {
        public UserId Id { get; set; }
    }
This works with various ORM and other mapping tools since enums are ints underneath, but you get static checking.
Then there's the issue of protection when passing around ids in web apps as parameters, in cookies, etc. I devised Clavis [1] as an experiment for protecting URL parameters via an HMAC. The idea works pretty well in practice, but its current incarnation is a little too cumbersome to use.
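The general idea of HMAC-protected parameters can be sketched in a few lines of Python. This is a generic illustration, not how Clavis works or what its API looks like; SECRET_KEY, sign_id and verify_id are all hypothetical names.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # assumption: a secret known only to the server


def sign_id(user_id: int) -> str:
    """Return 'id.tag', where tag is an HMAC-SHA256 over the id."""
    payload = str(user_id).encode()
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return f"{user_id}.{tag}"


def verify_id(token: str) -> int:
    """Return the id if the tag matches, otherwise raise ValueError."""
    raw_id, _, tag = token.partition(".")
    expected = hmac.new(SECRET_KEY, raw_id.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(tag.encode(), expected.encode()):
        raise ValueError("tampered or malformed id parameter")
    return int(raw_id)
```

A client can still see the number, but changing 42 to 43 in the URL invalidates the tag, so enumeration no longer works without the key.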
Last week we caught a bug caused by a security check on the user's id (a Java Long) that was accidentally using == instead of .equals(). Since Java caches Longs between -128 and 127, == will pass for any ids < 128, including those from our tests. Our QA stage only caught it because the tester happened to be a later-stage employee, and so had a user ID > 128.
Another risk is that numerical ids for users (as well as for other database tables) can be used to infer the growth rate of a business. If you record the time that a user is created at, and then repeat the process over a time period, you can work out how many users are joining the app over that time period, and potentially work out the total number of users too.
Strongly typed languages help somewhat, but you still often need to store or serialize the data in something less clever (e.g. JSON).
To reduce risk of mixup of different kinds of IDs in the system I used different increment values in Postgresql sequences (e.g. 13 for users, 7 for categories), so the IDs quickly went out of sync and had little overlap.
Whenever I deal with an external API, even if it gives me what looks like an integer for any object it has, I treat it as a string. I never do math on it, I don’t care about saving space, and I never know when they’ll run out of space and need to switch. Remember when Twitter famously stopped working because their IDs were BigInt but in JavaScript you only have 53 bits of space? Yeah, I don’t want that.
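The 53-bit limit is easy to reproduce outside JavaScript, because it comes from the IEEE 754 double format. A small Python sketch, where the JSON payload and the load_with_string_ids helper are made up for illustration:

```python
import json

# 2**53 + 1 is the first positive integer an IEEE 754 double cannot represent exactly.
big_id = 2**53 + 1
lossy = int(float(big_id))  # what a double-only number type ends up holding


def load_with_string_ids(payload: str) -> dict:
    # parse_int=str keeps every JSON integer as its original digit string.
    return json.loads(payload, parse_int=str)


record = load_with_string_ids('{"id": 9007199254740993, "name": "example"}')
```

Here lossy differs from big_id by one, while record["id"] keeps every digit because it is never treated as a number at all.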
Also a stupid number of APIs I deal with like to use one or more leading zeros in their identifiers. The meaning isn’t different, but it gets annoying when trying to do search, because of course the end user typed those in and wants to be able to look up whatever as 021.
By the same argument, Unix file descriptors should not be ints, either.
This argument is, in essence, an argument for strongly typed languages. There are many arguments against this historical argument, and it is by no means a settled issue; on the contrary, it is very slowly, as the years go by, looking more and more like the strongly typed languages are on the way out.
> on the contrary, it is very slowly, as the years go by, looking more and more like the strongly typed languages are on the way out.
What? I'd argue the complete opposite. Above a certain level of complexity, lack of type checking becomes so onerous and bug-inducing that dynamic languages start introducing stronger typing. Typescript is paradise compared to Javascript, and even Python has added an optional type-checker.
It's standard security practice not to expose integer user IDs. Anyone who goes through standard security training knows this.
Granted, one thing languages could do is provide easy type containers so it's hard to misconstrue an ID as referring to the wrong type. I once tried to do this with generics in C# but it wasn't worth the effort.
I think user IDs should be passed around as ints. It's efficient and simple. If you've got critical code that might lock someone out (or worse), you're much better off spending time testing it to ensure it works, rather than building in inefficient paradigms into your data model. For argument's sake, imagine if your data model also stores the sender's current manager per message. You could by mistake do this as well (a little slip of the finger when using code completion):
    ban_senders_of_messages(messages) {
        for (i = 0; i < messages.size(); ++i) {
            ban_account(messages[i].sendersManager);
        }
    }
The only way to catch that is to test, and because the manager and sender will have similar data types, it will compile/execute just fine. The point is, this is probably a more likely error than the one mentioned in the article, and needs manual inspection and testing to correct. If you're carrying out that process anyway, the added inefficiency of a more elaborate data type just for user IDs seems redundant.
Why not define a new data type for each type of ID? Simple and effective. And there's no inefficiency; these things aren't created and destroyed all the time! They're just retrieved from other objects and then passed around.
In languages that support value types, you'll typically make them integer size, so the cost is likely to be the same as passing an integer. In languages that support reference types only, you're just passing around a pointer anyway.
But in my example, surely the data type for a manager's ID will still be the same as that of the user, so I'm not sure it really solves the problem entirely.
If you need to mix them up, there are a bunch of options. It all depends on what you need to do and what language you are using, but usually there should be a simple solution.
For example, if you simply want functions that work for both of them (so that you don't duplicate code), you can either create a function from Manager to User (so that you can reuse functions for users), or use whatever polymorphism your language supports (polymorphic functions, OOP, ...).
If you want to mix User/Manager in the same collection (or have a function that returns either of them), OOP can help too (Manager is a "child" of User). If your language has sum types, you can use those (have an additional type "User or Manager", and an accompanying matching/extraction function).
In some languages you can do this with no runtime overhead (i.e. the additional type will be erased before runtime, as it has already been type checked).
But your example implies you shouldn't be allowed to get them mixed up?
If manager IDs and user IDs truly are the same type of thing, then there's a limited amount the type system can do for you in this respect. Maybe you'll have to stop at this point and just accept that you'll have to exercise a certain degree of care.
But there's a big gap between stopping there, in my view, and what you appear to be advocating: deciding that since manager IDs and user IDs are the same thing then you might as well give up entirely and just decide that they may as well be the same thing as ints while you're at it.
Not quite. If I was to truly critique the example, the problem lies mainly with the ban function, which appears to allow ints to be passed in as arguments. If we have static typing available to us (the example in the article doesn't seem to), then yes, I agree we should ensure that only the relevant type should be allowed as the argument. But still, problems can lurk in the shadows, for example, you could initialise a User type with a message ID by mistake, or even an iterating int, much like the example in the article. I'm not trying to be difficult or unnecessarily contrarian, but my experience tells me that you can put a whole bunch of safeguards into your code, but nothing beats testing at catching bugs, and sometimes, the safeguards are not worth the efficiency hit. Worse still, the safeguards can at times provide a false sense of security.
It shouldn't be an excuse to dispense with testing altogether, but static typing is certainly superior to testing for certain classes of bug. The compile-time checks prove certain types of defect simply don't exist, which is the kind of guarantee no amount of testing can give you in that respect for any useful program.
As for the initialisation problem, it's true that it can't be structs all the way down, and at some point you will have to create one of these objects, probably from a primitive with a non-meaningful type such as int, or string. But my experience is that IDs and the like tend to be created in a small number of places, and then reused, copied and passed around. Far easier to find and check all the places where one is created than all the places where one is used!
I can recommend the ShortUUID[1] Python package I wrote, I use it for all my IDs nowadays. The good thing about it is that it makes nice and short human-readable/typable IDs that you can use for all your objects, so you don't care even if you expose them to the user.
This has nothing to do with user IDs. It's a general problem with antiquated languages that quietly convert/promote compatible integer types and can happen to any other integer types.
Solution: use a modern language like Go or use pointers to structs like everyone else.
Doesn't pretty much every language nowadays have foreach loops that don't require keeping an integer index around when iterating over elements?
That seems like a way better idea.
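As a small illustration of the point, here is a Python sketch with a hypothetical Message type. Iterating over the elements directly means there is no index variable in scope that could be passed along by mistake.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    sender_id: int  # hypothetical field names, for illustration only
    text: str


def senders_to_ban(messages: List[Message]) -> List[int]:
    # No loop counter exists here, so the "passed i instead of the sender" bug cannot be written.
    return [message.sender_id for message in messages]


flagged = senders_to_ban([Message(907, "spam"), Message(33, "more spam")])
```

This removes one source of stray integers, though it does nothing about mixing up two different id fields on the same object.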
If you need a way to uniquely identify something in the universe, there is no reason to get clever. Just use Universally Unique IDentifiers. UUIDs. Done.
Why would you use anything but a string for User IDs?
My understanding of numerical types is that they exist to perform math. User IDs are not used for math, they're a completely arbitrary vanity system to assist with identification, so they should be strings, equally arbitrary.
Personally, I think E-mail addresses are the best user identifiers these days. Back in the day when there were like 5 websites everyone used, having your username was a cool thing. These days there's a billion websites and nobody uses the same ones and there's zero inter-user interaction on most sites. From the perspective of user friendliness, E-mail addresses are the easiest because you kill two birds with one stone (contact method + username + password recovery).
If you want a numerical ID, what about using a hash of the E-mail address? Or perhaps a combination of things, email, full name, sign-up date.
> Oh god no. You don't want all your IDs changing when a user changes their email address.
That's a pretty passionate response, can you explain your logic? What are you doing with your usernames that you can't afford to let users change them?
The article doesn't use "User ID" in the sense of "username" (an externally visible identifier that is likely used for log in), but as in "mostly internal id that's used to reference a user across database tables, services, ...". If you use something that can change in there, you need to make the change across all those things consistently, which is a lot of potential for error.
I got that but I can't imagine why you would even use the User ID for anything if we're talking about the row ID from the database. If you're doing tests against a user's profile why not use their username? There must be some case-examples that I'm not thinking of...
I know that some services have a public-facing "username" and a behind the scenes unique identifier (which is a great UX model), I'm just focusing on the unique identifier. Which I would think should always be its own column, whether it's also used for the public "username" or not.
> Because if you change that, all your relations between tables will break.
Okay, that is not a response to my question, which is why would you ever use the row ID for anything in your program. If you never use it, then it cannot ever be changed. Also SQL allows relationships based on more than one field, so it seems such a disaster could be easily avoided.
I've done an UPDATE without a WHERE on a large database before. Fortunately the table was so large that the UPDATE took a significant amount of time, and I was able to type ^C before it completed, which meant the result of the update never got committed and no damage was done!
Please try to refrain from using language such as "retarded". It's really quite offensive, and there are less offensive words that would likely make your argument stronger.