More and more I'm noticing this narrative of "if I write awful code with obvious mistakes and somehow nobody else notices it in code review, it's never our fault, the language and its typing system or safety features should've stopped me" gaining popularity in the programming world and I really have to ask, what happened to programmers that actually knew what they were doing instead of expecting the computer to tell them what to do?
Because people are realising in general how good type systems can help at preventing such errors? The well was poisoned for a while with terrible type systems that provided either marginal safety or were just horrible to work with (e.g. complete lack of inference) but that is now seriously changing.
Enough of the "hard problems" are solved that developers are now expected to use off-the-shelf type solutions. As a result, they're not getting hands-on experience solving hard problems, or challenging each other to think more deeply about solutions.
Software systems have been able to grow mostly because developers have been able to delegate a lot of diligence to tools.
Of course we could require a developer to "know what they are doing", many work environments do. However, you won't see many posts about that. First of all because it doesn't scale, and secondly because it doesn't make for interesting reading.
I learned never to use autoinc IDs in anything, especially not URLs, where they leak DB info, etc. However, I've seen younger devs do exactly that many times because they're learning from bad online books or tutorials. Every generation of devs needs to relearn the same lessons.
I also suspect that many devs today are learning platforms first and software skills last. This was the reverse for many devs who came from building it the hard way and then used platform tooling to simplify. Newer devs are looking to the tooling to provide the guardrails that core skills used to provide.
2. use typed schemas for APIs, i.e. GraphQL with a custom scalar type for UserID. Other typed schemas for APIs: OpenAPI, AsyncAPI, protobufs/gRPC, etc.
// add GraphQL parsing and serialization code
scalar UserID
3. Make the implicit explicit. Don't use naked primitive types in your code (just as you would not use naked literal constants, e.g. use PI instead of 3.1415...). Use a PL with strong static typing (preferably an algebraic type system). Define a type for UserID which cannot be mixed with integers, e.g.
type UserID = UserID int
4. in dynamic PLs you can use tagged tuples or tagged maps/structs, i.e.
{:user_id, 1234}
or
{"_type": "UserID", "id": 1234}
5. validate all external inputs (even those coming from the DB or Message Broker):
validateUserID: int -> UserID
6. The examples above assume an integer representation of the user ID; here's what changes if we switch to UUIDv4:
type UUIDv4 = UUIDv4 string
type UserID = UserID UUIDv4
validateUserID: string -> UserID
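As a rough TypeScript sketch of steps 3 to 6 (tagged wrapper, validation at the boundary, UUIDv4 representation): the names `UserID`, `validateUserID`, `banAccount` and the regex are assumptions for illustration, and the regex only accepts the common hyphenated text form.

```typescript
// Hypothetical tagged wrapper: the literal `_type` field keeps a UserID
// from being confused with a plain string or another tagged ID.
type UserID = { readonly _type: "UserID"; readonly id: string };

// Assumption: hyphenated hex, version nibble 4, variant nibble 8, 9, a or b.
const UUID_V4 = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

// Validate an external input; returns undefined when it is not a UUIDv4.
function validateUserID(raw: string): UserID | undefined {
  if (!UUID_V4.test(raw)) return undefined;
  return { _type: "UserID", id: raw.toLowerCase() };
}

// Code past the boundary accepts only the wrapper, never a naked primitive.
function banAccount(user: UserID): string {
  return `banned ${user.id}`;
}
```

With this shape, `banAccount("1234")` or `banAccount(1234)` fails to type-check, so a loop index or message ID cannot be passed by accident.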
My one critique of this prescription is that UUIDs are not identifiable as user IDs. IMO you're better off with a format like "user{20 random characters}"
UUID also suffers from a serious issue: in practice there is no single way to encode them as a string or in binary format.
In binary, a UUID can be encoded as big-endian, little-endian, or even mixed-endian, and while the string encoding is usually lowercase with certain hyphenation rules, I've seen variations of that.
The (hexadecimal) text encoding is also quite inefficient compared to more modern standards like ULID or kSUID.
I don't use TypeScript and I didn't find out how to do proper ML-style wrapper types in TypeScript. But I guess one can use type aliases for that:
type UserID = {
  userID: number;
};
let validateUserID = function (x: number): UserID | undefined {
  ...
};
Note, that my original example was missing the fact that not every input value is a valid user ID, i.e. the function should return Maybe<UserID>, Option<UserID>, or Result<UserID,SomeErrorType>.
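In TypeScript one can use a union with undefined instead, e.g. UserID | undefined (a trailing ? only works on properties and parameters, not on return types).
I was deep into ReasonML/OCaml for the last couple of years, so yeah, I missed option/result and the ability to define a new type on top of the primitive ones.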
Your example:
type UserID = { id: number; }
Would result in more allocations. But I think it's the only way at the moment.
your validation function would probably look more like
```ts
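// Assumed rule (positive integer); a type predicate also requires UserId to be a subtype of number.
const validateUserId = (x: number): x is UserId => Number.isInteger(x) && x > 0;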
```
for the `UserId` type declaration, I'm actually a little unsure of the best way to go about it. But I feel like using enums, typed string literals, or classes might be better approach. Would be interested if anyone has any thoughts on this
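One commonly used option for the declaration is a "branded" type: an intersection of `number` with a phantom property that exists only at compile time, so there is no extra allocation. This is a sketch, not the only way to do it; the `__brand` property name, the positive-integer rule, and the helper names are all assumptions.

```typescript
// The brand exists only in the type system; at runtime a UserId is a plain number.
type UserId = number & { readonly __brand: "UserId" };

// Type guard: narrows a number to UserId when it passes the (assumed) check.
const isUserId = (x: number): x is UserId => Number.isInteger(x) && x > 0;

// Only accepts values that went through the guard.
function banUser(id: UserId): string {
  return `banned user ${id}`;
}

// Boundary helper: returns undefined instead of banning on invalid input.
function banIfValid(raw: number): string | undefined {
  return isUserId(raw) ? banUser(raw) : undefined;
}
```

Calling `banUser(42)` directly does not compile, because a bare `number` lacks the brand; `banIfValid(42)` does, since the guard narrows it first.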
And if you use UUIDs for user IDs you're probably using them for other things. The same sort of mistake can happen.
The only real defense here is language-level enforcement. Allow declaration of a subtype that is not assignment-compatible with the parent even though it's identical.
I don't do any web-facing stuff but the only bits of code that know about things like IDs are the database stuff. All the logic works with classes that contain the ID and relevant data--you always pass the class, not the ID.
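If you’re using randomly generated UUIDs, then collisions of any kind should be extremely unlikely, no matter how you’re using them. By the birthday bound, 2^60 random 128-bit values collide with probability <0.2%; a UUIDv4 carries only 122 random bits, so the same <0.2% bound holds up to roughly 2^57 UUIDs.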
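Figures like this can be sanity-checked with the birthday approximation p ≈ 1 - exp(-n^2 / 2^(b+1)), where n is the number of IDs and b the number of random bits. The function below is a hypothetical helper, not part of any library; note that a version-4 UUID has 122 random bits rather than 128, because 6 bits are fixed by the version and variant fields.

```typescript
// Approximate probability of at least one collision among 2^log2Count
// uniformly random IDs that each carry `randomBits` bits of entropy.
function collisionProbability(log2Count: number, randomBits: number): number {
  // n^2 / (2 * 2^bits), computed in the exponent to avoid overflow
  const x = Math.pow(2, 2 * log2Count - randomBits - 1);
  // 1 - e^(-x); expm1 keeps precision when x is tiny
  return -Math.expm1(-x);
}

const fullyRandom128 = collisionProbability(60, 128); // just under 0.2%
const uuidV4 = collisionProbability(60, 122);         // roughly 12%
```

So the sub-0.2% figure at 2^60 IDs holds for 128 fully random bits; for UUIDv4 the same probability is reached at about 2^57 IDs.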
I'll get back on my "tests are better than types" hobby horse and say "or write a single test for this function". The problem here is that an engineer was able to push untested code to prod that irrevocably modifies the database.
You need a test here because types won't tell you that a user was banned. And that test would also have caught this error.
Nice, yeah totally. And I would add you shouldn't put all your risk eggs in one safety basket. If you only have types or tests not only are you not covering all your bases, you're asking way too much from a given technique.
Disagree. UUIDs are a waste of space, and they aren’t sorted.
Use 64-bit ints. Prefer not exposing them to users, but it’s not a problem for most software.
If you ever get so big you need to shard, use Snowflake or a 64bit scheme that encodes the shard.
Don’t overcomplicate things. You’re already using either SQLite or PostgreSQL and it already gives you auto-incrementing integer keys by default, without the need to encode/decode UUIDs in whatever software you’re writing to interface with it.
Take incrementing int32. Extend to 64 bits. Multiply by large prime number (e.g. fnv32 prime). Mod 1 million. Add 1 million. End up with random looking, 7 digits, nicely sequenced, 32 bit integers. Write an exhaustive test to verify.
When near 1 million users (yagni), reset sequence and do the same with 10 million (or one billion).
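A minimal sketch of that recipe, with hypothetical names. It relies on the FNV-32 prime (16777619) being coprime to 1,000,000, so multiplying modulo 1,000,000 is a bijection; in TypeScript the "extend to 64 bits" step is implicit because the largest product (about 1.7e13) stays below 2^53 and is exact in a double.

```typescript
const FNV32_PRIME = 16777619; // 0x01000193
const RANGE = 1000000;        // assumed capacity: one million users

// Map a sequential counter in [0, RANGE) to a 7-digit, scrambled-looking ID.
function publicId(seq: number): number {
  if (!Number.isInteger(seq) || seq < 0 || seq >= RANGE) {
    throw new RangeError(`sequence value out of range: ${seq}`);
  }
  return ((seq * FNV32_PRIME) % RANGE) + RANGE;
}

// Exhaustive check that no two counters map to the same public ID.
function isCollisionFree(): boolean {
  const seen = new Uint8Array(RANGE);
  for (let seq = 0; seq < RANGE; seq++) {
    const slot = publicId(seq) - RANGE;
    if (seen[slot] === 1) return false;
    seen[slot] = 1;
  }
  return true;
}
```

For example, `publicId(1)` is 1777619 and `publicId(2)` is 1555238. The mapping only hides the sequence from casual inspection; anyone who knows or guesses the multiplier can invert it, so it is not a security boundary.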
Doesn't solve the upside of 128-bit random numbers (ala uuid): the ability to generate remotely and expect no collision.
I don’t see how remote generation is an upside. If you’re using UUID as a database key, you’re hitting the DB anyway, so it doesn’t save a trip to the DB.
There are plenty of uses for UUID like peer-to-peer apps where that makes sense, but the article is talking about database keys.
I don't imagine there are many valid use cases where you want a list of users sorted by their database record ID though, and if you're suggesting an auto-incremented int then the creation timestamp will give you the same order anyway.
It was received knowledge that SQL Server (at least) would have problems if you had a massive index on randomly generated UUIDs.
By the time I compared UUIDs to sorted UUIDs (in the special pattern that was supposed to work for SQL Server) and ints, however, I didn’t find any difference in performance.
Perhaps by the time I tested it had been fixed, or perhaps it was only an issue in certain circumstances, but never an issue with a few million rows for me.
In practice it works just fine. Lots of really big and small companies do it every day and somehow manage not to collapse.
In theory there are all sorts of negative consequences to using integers that you’ll run into once in a while and that will require a google for a solution; however, every day all of your queries and all of your inserts will be faster than with anything else.
Also, it will be widely supported by any 3rd party code you use to build your project which is probably more important than anything else.
The mistake has nothing to do with 'int' lacking something. 'int' is totally fine as a user ID or anything else. Everything in a computer is essentially an 'int'. Every character in this text, for example, is an integer.
The problem is that 'ban_account' changes data in a user record. Hence it should be a method of a 'User' class. And the 'Message' class should have a way to fetch an instance of 'User' for the sender. Here's the right way:
ban_senders_of_messages(messages) {
    for (i = 0; i < messages.size; ++i)
        messages[i].sender.ban();
}
One solution, available in most languages, is to not use manual indexing at all. Use a for-each loop, iterator or similar abstraction. Sometimes using an index or counter is unavoidable, of course, but hopefully seeing an indexed loop, especially combined with a comparatively risky operation like banning, would trigger more scrutiny during a code review, more testing, or both.
Avoiding manual indexing doesn't prevent you from sticking other stray integers in place of user IDs (like specifying message IDs instead of the user ID that sent the message), but most of them are not biased toward the start like indexes are.
Still, I think it's prudent to avoid these kinds of errors at all if you can. Perhaps a good reason to switch to UUIDs for all primary keys, even if the normal concerns about enumerability don't apply.
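To illustrate the for-each suggestion, here is a small TypeScript sketch; the `User` and `Message` shapes are assumptions made up for the example. With no index variable in scope, there is nothing index-like to pass to a ban call by mistake.

```typescript
// Hypothetical domain types for illustration only.
class User {
  banned = false;
  constructor(readonly id: number) {}
  ban(): void {
    this.banned = true;
  }
}

interface Message {
  readonly id: number;
  readonly sender: User;
}

// Iterate over the messages themselves; no loop counter exists to misuse.
function banSendersOfMessages(messages: readonly Message[]): void {
  for (const message of messages) {
    message.sender.ban();
  }
}
```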
Is it just user IDs we should discuss, or all kinds of IDs represented as an int? On one side you have convenience and simplicity, and on the other you have a body of experience, best practices, and security considerations to take into account.
It can be overwhelming because the correct way seems obvious: listen to the experts and do as somebody thinks you should do. But it is never as easy as that. You need to balance your structure. Everything you do has a performance hit in some way or another, and you need to consider the impact said practice will have in your environment.
While UUID4 can be good for UserID, ULID is sometimes another good option for avoiding a second index in the DB: not only does it offer the uniqueness of UUID4, it is also sortable.
The part that was last touched in 2019 is the spec + homepage. By this measure, UUID was last touched in 2005 (if we count RFC 4122).
Most ULID implementations are still actively maintained and people are still using them in production. Most implementations also have a liberal license like MIT, so no, you should not worry about the GPL spec (unless you plan to distribute the spec with your app).
Nice! I just hadn't seen it referenced anywhere before - the last big ID wave I can remember was the swath of content about moving to snowflake style IDs
ULID leaks information about when the user was created (while serial IDs leak the order in which users were created and their cardinality).
I'm using ULIDs for cases where the entity is publicly ordered by time: any timestamped event, e.g. a chat message, a log entry, or a sensor measurement/metric.
But you need to be careful that the timestamp resolution is fine enough.
Also, while ULID might be good for optimizing RDBMS indexes, it might create hotspots in NoSQL K/V stores (i.e. all entities will be created on the same node in the cluster).
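To make the timestamp leak concrete: the first 10 characters of a ULID are a 48-bit millisecond Unix timestamp in Crockford base32, so anyone holding the ID can read the creation time. The decoder below is a minimal sketch with a hypothetical function name; it skips the spec's handling of ambiguous characters such as I, L and O.

```typescript
// Crockford base32 alphabet used by ULID (no I, L, O, U).
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Decode the millisecond timestamp embedded in the first 10 characters.
function ulidTimestampMs(ulid: string): number {
  if (ulid.length !== 26) throw new Error("a ULID has 26 characters");
  let ms = 0;
  for (let i = 0; i < 10; i++) {
    const digit = CROCKFORD.indexOf(ulid.charAt(i).toUpperCase());
    if (digit < 0) throw new Error(`invalid ULID character: ${ulid.charAt(i)}`);
    ms = ms * 32 + digit; // stays below 2^53, so the arithmetic is exact
  }
  return ms;
}

const createdAt = new Date(ulidTimestampMs("01ARYZ6S41TSV4RRFFQ69G5FAV"));
```

Here `createdAt.toISOString()` is `2016-07-30T22:36:16.385Z`, recovered from the ID alone.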
> ULID leaks information about when the user was created
How often is this really a bad thing? Are you worried about someone enumerating the entire space of possible ULIDs for every millisecond without ever rate-limiting them? Not many people are building anonymous, privacy-first websites and there's plenty of other ways to determine when a user first started using the site regardless.
Unpredictable user IDs are also an important security feature. At every tech company, developers inevitably write an endpoint that accidentally operates on user data without authorization checks. It is much easier for an attacker to iterate over integers than to figure out a separate way to leak a list of user UUIDs.
On the other hand, legal names change, for lots of reasons. It makes a LOT of sense to store a UID the way *NIX systems often do: as an explicit, indirect lookup table entry.
The question is "are you passing that ID around as an int? or as a byte array?"
It may happen that the data type in the database is an int... because the database does math on it (or its source - seq.nextval type thing).
However, when you get it, you don't need to have it be an int that you can do math with. Even if you keep the same underlying implementation of it, pass it around in a way that doesn't let you add two IDs together or mutate it.
The easiest way to do that is to store it as a String.
One of the advantages of taking it to a String in Java is that it becomes immutable and so can cleanly be used as a key to a Map and prevents the easy math operations on it without taking it from a String to a numeric type (and back). If you see someone doing `Integer.parseInt(someId)` it becomes clear that they're doing something wrong with that.
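The same idea can be sketched in TypeScript with a small opaque class; the class and method names are assumptions for the example, not an existing API. The private constructor forces every ID through one factory, and the TypeScript compiler should reject `a + b` on two such objects.

```typescript
// Hypothetical opaque ID: stored as an immutable string, no arithmetic exposed.
class UserId {
  private readonly value: string;

  private constructor(value: string) {
    this.value = value;
  }

  // The only way in: convert the database integer once, at the boundary.
  static fromDatabase(raw: number): UserId {
    if (!Number.isSafeInteger(raw) || raw < 0) {
      throw new RangeError(`invalid user id: ${raw}`);
    }
    return new UserId(String(raw));
  }

  equals(other: UserId): boolean {
    return this.value === other.value;
  }

  toString(): string {
    return this.value;
  }
}
```

One difference from Java: a JavaScript `Map` compares object keys by identity, so use `id.toString()` as the key rather than the `UserId` instance itself.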
The file format / interface API shouldn't make assumptions about the use case.
A Display Name (what humans interface with) and the 'internal identity in the system' (be that a UUID, a bare number, an exact string, whatever) should BOTH be available.
As an example, a backup / archive of a project is created. Years later it is restored, but now several employees have new names. Maybe some automation accounts got renamed too. Should the restoration fail when no user ID is present? Or should it instead restore the internal / native IDs? What if a user by the same name does exist, but they got re-added or the users migrated to a different authentication method and now have different internal / native IDs?
So you store the lookup table. Maybe even use a third tiny integer column if there aren't that many users. Store the platform ID and also the name. Though also have an option to not store or flatten that information.