
User IDs probably shouldn't be passed around as ints - weinzierl
https://rachelbythebay.com/w/2018/04/27/uid/
======
peterkelly
I see this as an issue of typing. Programs use integers in many different kinds
of ways, and if you have a function that accepts an integer as an argument,
neither static nor dynamic type checking will catch the case where you pass
the wrong "type" of integer. The same goes for UUIDs, strings, etc. or
whatever other opaque values you use to reference entities in your system.

You can avoid these problems by wrapping the integers in objects that you use
solely for referencing entities of the relevant type. For example:

    
    
        class UserRef {
            id: int;
        }
    

Then you define functions like this:

    
    
        function ban_account(user: UserRef) {
            // ...
        }
    

And a static type checker will pick up incorrect uses. For dynamically typed
languages, you could instead use a unique field name such as user_id, to
achieve the same thing (though getting a runtime error instead of a compile
time error).

Obviously you'll still be using ints or strings as an external representation,
but as long as you do the conversion at the point where the identifier enters
the program, the type system will take care of the rest for you.
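A minimal Python sketch of the wrapper idea above. The names `UserRef`, `MessageRef`, `parse_user_ref` and `ban_account` are illustrative assumptions, not taken from the article. A static checker such as mypy would flag the wrong kind of reference, and the `isinstance` guard gives the runtime error mentioned for dynamically typed code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserRef:
    """Opaque reference to a user; the raw int never travels alone."""
    id: int


@dataclass(frozen=True)
class MessageRef:
    """Opaque reference to a message, deliberately a distinct type."""
    id: int


def parse_user_ref(raw: str) -> UserRef:
    # Convert once, at the point where the identifier enters the program.
    return UserRef(int(raw))


def ban_account(user: UserRef) -> str:
    # Runtime guard for callers that are not statically checked.
    if not isinstance(user, UserRef):
        raise TypeError(f"expected UserRef, got {type(user).__name__}")
    return f"banned user {user.id}"
```

Passing `MessageRef(42)` or a bare `42` to `ban_account` raises `TypeError` here, even though all three carry the same integer.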

~~~
chriswarbo
> I see this as an issue of typing

Yes, this reminds me of "stringly typed programming", i.e. where the language
may offer strong types, but the program just uses `String` everywhere. String
injection attacks are examples of this: SQL injection can only occur if it's
possible to concatenate "SQL statement" with "user input"; if these are both
represented as `String` then it's easy to run into such problems; if they're
represented as different types, then the only way to combine them would be
with a designated conversion function, which is exactly where we can put the
necessary escaping.

See also: [http://blog.moertel.com/posts/2006-10-18-a-type-based-solution-to-the-strings-problem.html](http://blog.moertel.com/posts/2006-10-18-a-type-based-solution-to-the-strings-problem.html)
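A rough Python sketch of that idea, with hypothetical names (`SqlFragment`, `quote_literal`). The quote-doubling is only a simplified stand-in for real escaping; in practice a driver's parameterized queries are the safer choice:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SqlFragment:
    """Text that is already safe to send to the database."""
    text: str

    def __add__(self, other: "SqlFragment") -> "SqlFragment":
        # Only SqlFragment + SqlFragment is allowed; plain str is rejected.
        if not isinstance(other, SqlFragment):
            return NotImplemented
        return SqlFragment(self.text + other.text)


def quote_literal(user_input: str) -> SqlFragment:
    # The designated conversion function: the one place escaping happens.
    # Doubling single quotes is a simplified illustration, not a full defence.
    return SqlFragment("'" + user_input.replace("'", "''") + "'")


query = SqlFragment("SELECT * FROM users WHERE name = ") + quote_literal("O'Brien")
```

Concatenating a `SqlFragment` with a plain `str` raises `TypeError`, so user input can only reach the query through `quote_literal`.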

~~~
weberc2
CMake is stringly typed, and that’s only part of the reason it’s such an
atrocity.

------
dominicr
Basically it's saying: don't use numbers, because somebody might write crap code
and it'll get run on your deployment database?!

If getting such dangerously awful code deployed to production is likely then
sequential IDs are just one of your many problems!

Sequential IDs for key data can be good to avoid for a few good reasons but
awful code isn't one of them. Testing, code reviews and not having bad
programmers should be in place to fix that.

~~~
danpalmer
All programmers make mistakes. I think the mark of good software design is
using practices that make it impossible or much more difficult to make the
mistakes that we know are going to be made at some point.

Much of our code review is not about "is this code correct" - that's
obviously important, but rather "will this be easy to review changes to in the
future" or "if I came back here in a year, would I get it wrong". I think
that's just as valuable.

There's always a trade-off with complexity, and I'm not sure whether this one
pays off, but designing _for_ programming errors is important for any
team/company/product that is growing, and will introduce new developers who
weren't around when the decisions were made.

~~~
Chyzwar
I would reject a PR that introduces code like this. I disagree with the article.
You should not introduce "safeguards" as described in the article. In a few
years' time, someone will introduce a bug by assuming that user IDs start from 1.

Instead, you can write this as (js):

    
    
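      function banSendersOfMessages(messages) {
        // Iterate over the messages directly; there is no integer index to misuse.
        // The explicit arrow keeps forEach's extra (index, array) arguments
        // away from banAccount.
        messages.forEach(({ senders }) =>
          senders.forEach((sender) => banAccount(sender)));
      }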
    
    

In the Ruby community, there is the principle of least surprise. Your design
should not introduce astonishing things.

------
ris
Problem is that randomly assigned ids tend to perform sub-optimally in
relational databases due to the sparse distribution screwing up row estimates
and causing fragmented writes to the underlying table. For this reason many
people have a policy of having (nice sequential) "internal" ids and then
(pseudorandom) "external" ids.

~~~
kyrra
Depends on your database. For example, with Spanner it is better to use random
numbers for the primary index, as monotonically increasing ints will cause
hotspots on the database.

[https://cloud.google.com/spanner/docs/schema-design#choosing_a_primary_key](https://cloud.google.com/spanner/docs/schema-design#choosing_a_primary_key)

------
alexandernst
Hi, I'm mister UUID and I have been living in your favourite framework since
2001. Use me.

~~~
drawkbox
UUIDs are the way, GUIDs for those in Microsoft's lands.

Not only will you never run out, but you also don't need a round trip to the
keymaster. Horizontal scaling, clustering and even vertical sharding are all
way easier because of that. It also removes a single point of failure entirely,
since the keymaster is no longer needed.
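As a small illustration in Python, using the standard `uuid` module (the helper name `new_user_id` and the two "nodes" are made up for the sketch), each side can mint IDs locally without asking a central sequence:

```python
import uuid


def new_user_id() -> uuid.UUID:
    # uuid4() draws 122 random bits locally; no database round trip is needed.
    return uuid.uuid4()


# Two independent "nodes" mint IDs without coordinating with each other.
ids_from_node_a = {new_user_id() for _ in range(1000)}
ids_from_node_b = {new_user_id() for _ in range(1000)}
```

With 122 random bits per ID, an overlap between the two sets is possible in principle but vanishingly unlikely.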

Using ints as keys for profiles/users and many other things was necessary way
back when processing, DB, disk and memory performance were a problem; no
longer.

Do your part, join the UUID revolution. Also, if you were a piece of data,
would you not want to be unique across all the databases, storage and
services? You've heard of Roko's Basilisk right? Do not disappoint.

~~~
kasey_junk
I’ve often wondered where the tipping point is where you do start having to
worry about collisions. Is frantic “fix all the uuid breaking code” the jobs
boost for programmers 20 years from now? 100?

It seems unthinkable now but who knows...

~~~
drawkbox
We'll be good until at least sentient AI takes over and they can solve it for
us, or them.

------
mike-cardwell
Looking at the provided example. Nothing like that has ever happened in any
code I've ever written, or I've ever seen written by anybody else, and I have
absolutely no fear that I will ever see anything like that happen in any code
I write for the rest of my career/life.

So, not a great example, and not a very convincing argument to me to stop
using integers.

~~~
tom_
That's interesting. Every project I've worked on has had at least one bug of
this form, and typically more than one, where the integer-type indexes into a
table of integer-type IDs have got mixed up with the integer-type IDs.

You might be surprised how much of a pain in the arse it is to even realise
these bugs exist, because during development and initial tests these tables
have a nasty habit in many situations of containing IDs that are the same as
the index. And when everything is an int, or similar, fixing them can be quite
painful too. It just takes one bug in one function for a set of subtly broken
workarounds and/or misunderstandings to spread throughout the code. Lots of
places where functions take an "id" and then pass it into a function that
takes an "index", or vice versa... just what _was_ the intention here? :(
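A contrived Python sketch of how this kind of bug hides, with made-up data and function names: in the small fixture every ID happens to equal its index, so the buggy and the correct version agree until realistic data shows up.

```python
# Hypothetical data: list position (index) -> user ID.
dev_user_ids = [0, 1, 2]          # small fixture: every ID equals its index
prod_user_ids = [1007, 42, 311]   # realistic data: IDs and indexes differ


def ids_to_ban_buggy(user_ids: list[int]) -> list[int]:
    # BUG: returns the loop index where the stored ID was intended.
    return [index for index in range(len(user_ids))]


def ids_to_ban_fixed(user_ids: list[int]) -> list[int]:
    # Iterates over the stored IDs themselves; no index is involved.
    return [user_id for user_id in user_ids]
```

Both functions return the same list for `dev_user_ids`, so a test built on that fixture passes either way.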

This shit is the worst kind of bug.

My usual solution:

    
    
        struct ThingID {uint64_t id;};
        typedef struct ThingID ThingID;
    
        struct OtherThingID {uint64_t id;};
        typedef struct OtherThingID OtherThingID;
    

And that's it. When this gets at all syntactically inconvenient, it's a sign
you're possibly doing the wrong thing.

~~~
mike-cardwell
"Every project I've worked on has had at least one bug of this form, and
typically more than one"

Then we have had completely different experiences in software development and
are unlikely to agree on the importance of the content in this post.

~~~
tom_
Quite! It's supposed to be an alternative viewpoint, nothing more.

Thinking about it, maybe I shouldn't have opened with "That's interesting",
which here means exactly what it says, but is a phrase sometimes deployed with
malicious intent.

------
__s
This can be addressed by using 'opaque ids' which are typed to be incomparable
& non-interchangeable. Then have explicit to/from int conversion.

------
ajnin
There are several points in this article :

1/ stuffing random ints into your user functions : that problem can be solved
with typing and tests. I'm hoping that any piece of code which would have a
drastic effect on a user would get extensive testing before going into
production.

2/ ID canary : seems a rather good idea, like stack canaries commonly used
when you don't have much stack space and you might get a collision with your
heap. It's only a problem for languages for which point 1/ couldn't be a
solution.

3/ Using UUIDs to avoid disclosing information about your user count : I think
that's a separate problem. You should avoid disclosing unneeded information in
general. If you use int IDs, always have an opaque public_id field that you
use publicly and for interoperability with third-parties. But it does not mean
you have to use UUIDs internally. However they do have a number of advantages,
mainly that you don't need a central authority to distribute new sequence
numbers, you can just generate UUIDs where you need them which will save you
DB round-trips and make sharding of your DB easier. It will also help avoid
issues such as this one : [https://blog.travis-ci.com/2018-04-03-incident-post-mortem](https://blog.travis-ci.com/2018-04-03-incident-post-mortem)

------
oftenwrong
Types, as mentioned, make this easy to avoid. For example:

A User has an ID of type UserId.

A Message has an ID of type MessageId and a sender of type UserId.

ban_senders_of_messages would have a parameter of type [Message]

ban_account would have a parameter of type UserId
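One way this could look in Python, as a sketch only. It assumes a static checker such as mypy is run over the code, since `typing.NewType` values are plain ints at runtime; the `banned` set is a stand-in for whatever banning really does:

```python
from dataclasses import dataclass
from typing import NewType

# NewType creates distinct names for a static checker such as mypy;
# at runtime both are plain ints, so this adds no overhead.
UserId = NewType("UserId", int)
MessageId = NewType("MessageId", int)


@dataclass(frozen=True)
class Message:
    id: MessageId
    sender: UserId


banned: set[UserId] = set()  # hypothetical stand-in for the real side effect


def ban_account(user_id: UserId) -> None:
    banned.add(user_id)


def ban_senders_of_messages(messages: list[Message]) -> None:
    for message in messages:
        # ban_account(message.id) would be rejected by the type checker.
        ban_account(message.sender)
```

Iterating over the messages directly also means there is no loop index that could be passed to `ban_account` by accident.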

------
peterburkimsher
I think the real problem is the terse coding style. Why does the for loop
always use "i"?

When I write code, I don't use "i" or other single-letter variable names. I
write long names, like "currentItem", "currentMessage", "currentUser", etc. If
I reference an object, I usually name it "thisItem", "thisMessage",
"thisUser".

Compilers shrink executables, so it's not a size issue. I don't know why
people want code to be shorter; I prefer it to be easy to read and debug.

~~~
mic47
Naming won't help; sooner or later someone will make a similar mistake that
will slip through code review.

The real problem is that it is possible to accidentally make this type of
mistake. If you can avoid it by leveraging the type system, you should. Relying
on humans to never make a mistake is futile.

------
naasking
Since I largely program in C#, I typically define the unique ids on my classes
as enums. Enums in C# are open, not closed, so you can define something like a
UserId like so:

    
    
        public enum UserId { Unset = 0 }
    
        public class User
        {
            public UserId Id {get;set;}
        }
    

This works with various ORM and other mapping tools since enums are ints
underneath, but you get static checking.

Then there's the issue of protection when passing around ids in web apps as
parameters, in cookies, etc. I devised Clavis [1] as an experiment for
protecting URL parameters via an HMAC. The idea works pretty well in practice,
but its current incarnation is a little too cumbersome to use.

[1] A url [http://foo.com?userId=1234](http://foo.com?userId=1234) becomes
[http://foo.com?-userId=1234&clavis=asdbwef67t34rfbs](http://foo.com?-userId=1234&clavis=asdbwef67t34rfbs),
where the 'clavis' parameter is an HMAC of the URL's protected parameters, and
changing any of them causes the request to fail. Unprotected parameters are
also supported, so GET form submissions are still possible. See:
[http://higherlogics.blogspot.ca/2014/01/clavis-rebooted-secure-type-safe-urls.html](http://higherlogics.blogspot.ca/2014/01/clavis-rebooted-secure-type-safe-urls.html)
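The general technique (not Clavis's actual implementation or URL format) can be sketched in Python with the standard `hmac` module. The key, the `mac` parameter name and the helper names are assumptions made for the example:

```python
import hashlib
import hmac
from urllib.parse import parse_qsl, urlencode

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key for the sketch


def _mac(params: dict[str, str]) -> str:
    # Sort so the MAC does not depend on parameter order.
    payload = urlencode(sorted(params.items())).encode()
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def sign_query(params: dict[str, str]) -> str:
    # Append the MAC of the protected parameters to the query string.
    return urlencode({**params, "mac": _mac(params)})


def verify_query(query: str) -> bool:
    params = dict(parse_qsl(query))
    received = params.pop("mac", "")
    # compare_digest avoids leaking information through timing.
    return hmac.compare_digest(received.encode(), _mac(params).encode())
```

A query produced by `sign_query({"userId": "1234"})` verifies, while the same query with `userId` changed to another value does not.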

------
stu_douglas
Last week we caught a bug caused by a security check on the user's ID (a Java
Long) that was accidentally using == instead of .equals(). Since Java caches
boxed Long values between -128 and 127, == will pass for any IDs < 128,
including those from our tests. Our QA stage only caught it because the tester
happened to be a later-stage employee, so had a user ID > 128.

------
paulbjensen
Another risk is that numerical IDs for users (as well as for other database
tables) can be used to infer the growth rate of a business. If you record the
ID and creation time of a newly created user, and then repeat the process
later, you can work out how many users joined the app over that time period,
and potentially work out the total number of users too.
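The arithmetic is simple enough to sketch in Python. The function name and the numbers below are invented purely for illustration:

```python
from datetime import date


def signups_per_day(first_id: int, first_seen: date,
                    second_id: int, second_seen: date) -> float:
    # With sequential IDs, the ID difference is the number of accounts
    # created between the two observations.
    days = (second_seen - first_seen).days
    return (second_id - first_id) / days


# Made-up observations: two sign-ups by the same curious outsider.
rate = signups_per_day(10_000, date(2018, 4, 1), 13_000, date(2018, 5, 1))
```

With these invented figures, 3,000 new IDs over 30 days works out to 100 sign-ups per day, and the latest ID itself approximates the total user count.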

------
pornel
Strongly typed languages help somewhat, but you still often need to store or
serialize the data in something less clever (e.g. JSON).

To reduce the risk of mixing up different kinds of IDs in the system, I used
different increment values in PostgreSQL sequences (e.g. 13 for users, 7 for
categories), so the IDs quickly went out of sync and had little overlap.
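A quick Python check of how much overlap remains, assuming both sequences start at 1 (an assumption for this sketch; the comment does not give the start values):

```python
def sequence_ids(increment: int, count: int, start: int = 1) -> set[int]:
    # Mirrors a sequence that starts at `start` and steps by `increment`.
    return {start + increment * n for n in range(count)}


user_ids = sequence_ids(13, 1000)       # 1, 14, 27, ...
category_ids = sequence_ids(7, 1000)    # 1, 8, 15, ...
shared = sorted(user_ids & category_ids)
```

Under that assumption the two sequences coincide only every 91 values (the least common multiple of 13 and 7), so roughly one category ID in 13 is also a valid user ID, rather than every one of them.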

------
floatboth
Another reason: prevent users from incrementing their ID in the URL bar and
discovering whatever pages...

------
IgorPartola
Whenever I deal with an external API, even if it gives me what looks like an
integer for any object it has, I treat it as a string. I never do math on it,
I don’t care about saving space, and I never know when they’ll run out of
space and need to switch. Remember when Twitter famously stopped working
because their IDs were 64-bit integers, but in JavaScript you only have 53 bits
of integer precision? Yeah, I don’t want that.

Also a stupid number of APIs I deal with like to use one or more leading zeros
in their identifiers. The meaning isn’t different, but it gets annoying when
trying to do search, because of course the end user typed those in and wants
to be able to look up whatever as 021.
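Both failure modes are easy to reproduce in Python, whose `float` is the same 64-bit double that JavaScript uses for numbers. The variable names are just for the sketch:

```python
# 2**53 + 1 is the first integer that a 64-bit double cannot represent exactly.
big_id = 2**53 + 1
round_tripped = int(float(big_id))   # what a double-based parser would keep
id_survives = round_tripped == big_id

# An identifier with a leading zero loses it once it is parsed as a number.
external_id = "021"
leading_zero_survives = str(int(external_id)) == external_id
```

Neither value survives the round trip through a numeric type, which is the argument for keeping external IDs as opaque strings.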

------
based2
[https://github.com/jOOQ/jOOQ/issues/5589](https://github.com/jOOQ/jOOQ/issues/5589)

------
teddyh
By the same argument, Unix file descriptors should not be ints, either.

This argument is, in essence, an argument for strongly typed languages. There
are many arguments against this historical argument, and it is by no means a
settled issue – on the contrary, it is very slowly, as the years go by,
looking more and more like the strongly typed languages are on the way out.

~~~
megaman22
> on the contrary, it is very slowly, as the years go by, looking more and
> more like the strongly typed languages are on the way out.

What? I'd argue the complete opposite. Above a certain level of complexity,
lack of type checking becomes so onerous and bug-inducing that dynamic
languages start introducing stronger typing. Typescript is paradise compared
to Javascript, and even Python has added an optional type-checker.

------
dredmorbius
What you more likely want is a sense of privileged, sensitive, and/or
high-consequence users who aren't trivially compromised or blocked.

E.g.,
[https://en.wikipedia.org/wiki/Politically_exposed_person](https://en.wikipedia.org/wiki/Politically_exposed_person)

That and some sanity in your account-handling ops.

------
gwbas1c
It's standard security practice not to expose integer user IDs. Anyone who
goes through standard security training knows this.

Granted, one thing languages could do is provide easy type containers so it's
hard to misconstrue an ID as referring to a wrong type. I once tried to do
this with generics in C# but it wasn't worth the effort.

~~~
tigershark
C# is ill-suited for this, sadly. On the other hand, in F# you have single-case
discriminated unions, or units of measure if you care about performance.

------
osrec
I think user IDs should be passed around as ints. It's efficient and simple.
If you've got critical code that might lock someone out (or worse), you're
much better off spending time testing it to ensure it works, rather than
building in inefficient paradigms into your data model. For argument's sake,
imagine if your data model also stores the sender's current manager per
message. You could by mistake do this as well (a little slip of the finger
when using code completion):

    
    
      ban_senders_of_messages(messages) {
        for (i = 0; i < messages.size(); ++i) {
          // Slip: sendersManager was picked instead of sender.
          ban_account(messages[i].sendersManager);
        }
      }
    

The only way to catch that is to test, and because the manager and sender will
have similar data types, it will compile/execute just fine. The point is, this
is probably a more likely error than the one mentioned in the article, and
needs manual inspection and testing to correct. If you're carrying out that
process anyway, the added inefficiency of a more elaborate data type just for
user IDs seems redundant.

~~~
tom_
Why not define a new data type for each type of ID? Simple and effective. And
there's no inefficiency; these things aren't created and destroyed all the
time! They're just retrieved from other objects and then passed around.

In languages that support value types, you'll typically make them integer
size, so the cost is likely to be the same as passing an integer. In languages
that support reference types only, you're just passing around a pointer
anyway.

~~~
osrec
But in my example, surely the data type for a manager's ID will still be the
same as that of the user, so I'm not sure it really solves the problem
entirely.

~~~
mic47
Then maybe you should have manager's IDs done as separate type too?

~~~
osrec
But they're all users, where one user can be set as a manager for another. Why
would you have a separate type for each?!

~~~
tom_
But your example implies you shouldn't be allowed to get them mixed up?

If manager IDs and user IDs truly are the same type of thing, then there's a
limited amount the type system can do for you in this respect. Maybe you'll
have to stop at this point and just accept that you'll have to exercise a
certain degree of care.

But there's a big gap between stopping there, in my view, and what you appear
to be advocating: deciding that since manager IDs and user IDs are the same
thing then you might as well give up _entirely_ and just decide that they may
as well be the same thing as ints while you're at it.

~~~
osrec
Not quite. If I was to truly critique the example, the problem lies mainly
with the ban function, which appears to allow ints to be passed in as
arguments. If we have static typing available to us (the example in the
article doesn't seem to), then yes, I agree we should ensure that only the
relevant type should be allowed as the argument. But still, problems can lurk
in the shadows, for example, you could initialise a User type with a message
ID by mistake, or even an iterating int, much like the example in the article.
I'm not trying to be difficult or unnecessarily contrarian, but my experience
tells me that you can put a whole bunch of safeguards into your code, but
nothing beats testing at catching bugs, and sometimes, the safeguards are not
worth the efficiency hit. Worse still, the safeguards can at times provide a
false sense of security.

~~~
tom_
It shouldn't be an excuse to dispense with testing altogether, but static
typing is certainly superior to testing for certain classes of bug. The
compile-time checks prove certain types of defect simply don't exist, which is
the kind of guarantee no amount of testing can give you in that respect for
any useful program.

As for the initialisation problem, it's true that it can't be structs all the
way down, and at some point you will have to create one of these objects,
probably from a primitive with a non-meaningful type such as int, or string.
But my experience is that IDs and the like tend to be created in a small
number of places, and then reused, copied and passed around. Far easier to
find and check all the places where one is created than all the places where
one is used!

------
StavrosK
I can recommend the ShortUUID[1] Python package I wrote, I use it for all my
IDs nowadays. The good thing about it is that it makes nice and short human-
readable/typable IDs that you can use for all your objects, so you don't care
even if you expose them to the user.

------
lazyjones
This has nothing to do with user IDs. It's a general problem with antiquated
languages that quietly convert/promote compatible integer types and can happen
to any other integer types.

Solution: use a modern language like Go or use pointers to structs like
everyone else.

------
olingern
I hope that it's obvious that you shouldn't expose part of your implementation
to your end users a la auto incrementing primary keys.

Obfuscation of how data is queried and stored is pretty low hanging fruit,
security-wise.

------
fma
Besides all the other arguments made in the comments about other solutions for
user IDs...

This code is so easy to write a unit test for. I hope it didn't even get
committed, let alone deployed to prod.

------
alt_
Doesn't pretty much every language nowadays have foreach loops that don't
require keeping an integer index around when iterating over elements? That
seems like a way better idea.

------
all_blue_chucks
If you need a way to uniquely identify something in the universe, there is no
reason to get clever. Just use Universally Unique IDentifiers. UUIDs. Done.

------
foxhop
I use a UUID for the primary key, it protects against this as well as other
issues. Does have some drawbacks though.

------
partycoder
And also you should not recompute the size on each iteration, or ban the same
user more than once.

------
originalsimba
Why would you use anything but a string for User IDs?

My understanding of numerical types is that they exist to perform math. User
IDs are not used for math, they're a completely arbitrary vanity system to
assist with identification, so they should be strings, equally arbitrary.

Personally, I think E-mail addresses are the best user identifiers these days.
Back in the day when there were like 5 websites everyone used, having your
username was a cool thing. These days there's a billion websites and nobody
uses the same ones and there's zero inter-user interaction on most sites. From
the perspective of user friendliness, E-mail addresses are the easiest because
you kill two birds with one stone (contact method + username + password
recovery).

If you want a numerical ID, what about using a hash of the E-mail address? Or
perhaps a combination of things, email, full name, sign-up date.

~~~
StavrosK
> E-mail addresses are the best user identifiers these days

Oh god no. You don't want all your IDs changing when a user changes their
email address.

You probably want your ID (e.g. a UUID) and your user-friendly lookup method
(e.g. an email) to be separate.

~~~
originalsimba
> Oh god no. You don't want all your IDs changing when a user changes their
> email address.

That's a pretty passionate response, can you explain your logic? What are you
doing with your usernames that you can't afford to let users change them?

~~~
detaro
The article doesn't use "User ID" in the sense of "username" (externally
visible identifier, that likely is used for log in), but as in "mostly
internal id that's used to reference a user across database tables, services,
...". If you use something that can change in there, you need to do the change
across all those things consistently, which is a lot of potential for error.

Or am I misunderstanding your perspective?

~~~
StavrosK
That's exactly it, and the article doesn't talk about user-facing usernames,
since they're automatically incrementing ints.

~~~
originalsimba
I got that but I can't imagine why you would even use the User ID for anything
if we're talking about the row ID from the database. If you're doing tests
against a user's profile why not use their username? There must be some case-
examples that I'm not thinking of...

I know that some services have a public-facing "username" and a behind the
scenes unique identifier (which is a great UX model), I'm just focusing on the
unique identifier. Which I would think should always be its own column,
whether it's also used for the public "username" or not.

> Because if you change that, all your relations between tables will break.

Okay, that is not a response to my question, which is why would you ever use
the row ID for anything in your program. If you never use it, then it cannot
ever be changed. Also SQL allows relationships based on more than one field,
so it seems such a disaster could be easily avoided.

~~~
StavrosK
Because if you change that, all your relations between tables will break.

