Hacker News new | past | comments | ask | show | jobs | submit login
GUIDs Are Not the Only Answer (softwareatscale.dev)
86 points by ublaze on Jan 6, 2021 | hide | past | favorite | 61 comments



I would strongly say strings are actually the _only_ type you should use for IDs. Prevents the vast majority of buggy client behaviour and gives you good flexibility to change how you do things over time.

---

My company ended up with a simple KSUID implementation of our own - https://www.cuvva.com/product-updates/showing-off-our-fancy-... (having originally used UUIDs and Mongo ObjectIDs)

For us, a big part of it was usability with cursor selection etc - in addition to it being immediately obvious what the ID was for.

Once we finally had that rolled out everywhere, we ended up collecting up every other ID we'd ever used and mapped it to its KSUID resource equivalent, so now all our IDs work standalone without type/context info, even across environments (and thankfully we'd never had any collisions on the old IDs)

---

Going back to the typing - the most difficult part of migrating our IDs actually was converting them all to string types. With Postgres this is a little slow but ultimately fine, but with Mongo you have to actually remove and reinsert every document - you cannot (or at least could not) update IDs in place.


Strings have their own issues. One place I worked had a bug where users could take over accounts because of missing/inconsistent unicode canonicalization. Case can be a problem, as can special characters.

There's something to be said for strings, though. Prefixed IDs that mark the type can be nice to work with when it's an otherwise opaque ID, but they're a pain to handle internally.

If you're storing third-party IDs, you probably want strings...unless their clever JSON API returns 1.3E6 as a number. Or the string "null;" that's always an adventure.


IDs should really be their own type: string operations don’t really make sense on them. If you use strings, there should probably be a constraint on the acceptable characters and lengths: e.g. only digits, exactly 10 or something like that.

I personally like the idea of using urls using a domain you control + a path that reflects the type of data it’s an ID for.


To be fair, string operations don't usually make much sense in general.


azure does this for resource IDs and its one of the few things i really enjoy about the platform.


If the main benefit of prefixed IDs is debugging then a dev tool can be created that would just look up the ID in all tables and report what object it corresponds to.


> I would strongly say strings are actually the _only_ type you should use for IDs.

Just got that in the face at work. In a large custom integration we read orders, they have unique order numbers, nice 5 digit things. So in the database, they became integer primary keys, with lots of child tables.

Fast forward 5 years, customer switches ERP system and they ask "hey, the order numbers, you do support 10+ digits right?"

Changing the database is relatively easy, changing the code is a chore, but worst part will be going over all the queries and their parameters, especially in the reports.

So yeah, lesson learned.


I have an honest question. Why would anybody, ever, make an order number that only has 5 digits. Even if it's just a home-hobby project, the cost of changing to 7-10 digits is so small a d negligible that I can't see any reason for choosing anything lower. Like the 640k that was once "enough for the long term future" in DOS. I understand that hindsight is 2020, but can't wrap my head around not starting bigger when there's no extra cost (nearly).

Could you shed some light on that?


Since 10 digits puts us in billions, you're really asking "why would someone do 'CREATE TABLE ... id INT AUTO_INCREMENT' when they could use a BIGINT?" These days, there's rarely a reason not to use a BIGINT, but I also have a little trouble faulting someone for thinking 2B would be enough when they're currently at 10k.


Please don't use BigInt. It is the same size (usually 16 bytes) as GUID/UUID in binary but has lots of drawbacks and no real upside other than they are easier to type in when they are still small.

1) The ids are clustered around the starting point so an invalid join will return data when it shouldn't. MIN_ID exists for almost every table/object type. With a well designed ID this shouldn't return any data.

2) BigInts are different sizes on different platforms. I've worked at a place 1 order of magnitude (base2) from overflowing JS Number and breaking most client code. This was due to bug that consuming a Seq ids quickly but there was no going back smaller.

3) Will often be coded client side as a Int and no one will notice until you grow large enough to overflow INT

Really, just avoid any sequential, numeric IDs and you will be good, IMO.


> Really, just avoid any sequential, numeric IDs and you will be good, IMO.

Sequential ids make better as database indexes. Do you propose using both a uuid for external purposes and a sequential id as the primary key?

Or something like uuid v6? http://gh.peabody.io/uuidv6/ (which isn't widely supported)


Postgres and SQL Server both have 8 byte BigInts. Is there a database that has bigger ones? I haven't heard of it.


Bigint is a 64-but (8 byte) signed integer. It’s half the size of a UUID and using a sequence rather than random ids leads to significantly less WAL ok Postgres when full page writes are enabled.


> Could you shed some light on that?

First off, I had a brainlapse, their order numbers were 6 digits. I'm not entirely sure what they max was in their system, it was just that's where they were at in the series.

And they didn't have that many orders per year, less than 10k, as one order could be for say five containers of goods. So it was not like they'd exceed 9 digits in the foreseeable future.

Or so we thought...


Obvious answer: because they started at 1 and have had less than 100,000 orders.

I know of a company that had enough trouble with vendors after their PO numbers rolled over that they modified their EDI transmissions to add 1,000,000,000 to every order number.


Back in the day, we were taught to use database constraints to validate user inputs.

Memory, storage and compute were also more limited, so there was an extra cost to over-spec.


Isn't this still the best practice? I do it to prevent the possibility of invalid data.


Just had a similar project. Spent loads of time replacing %d with %s in printf and \d to \w in regex. Tedious! All from design choices in 2013.


>Poorly formatted log statements/errors can become harder to debug >UUIDs often need context to aid debugging.

The GUID vs Int problem is incidental to the real problem: poorly formatting logs. An integer or other key type without context is no more helpful.

>At the very least, identifiers should not be allowed to float freely as strings or integers in order to prevent a class of inconsistency bugs.

I'm not familiar with the ergonomics of GUIDs across all languages, but C#/MSSQL makes them pretty easy to handle when they have been chosen as keys.

So the answer, as far as ergonomics go, is not settled depending on your stack.


> The GUID vs Int problem is incidental to the real problem: poorly formatting logs.

This.

The author's arguments boil down to "UUIDs are bad because the way I generate my logs is an unusable mess".

The ints vs UUIDs problem is also specious because a UUID is nothing more than a 16 byte int that's generated in a clever way. It might typically be presented in a particular string representation but the text string is not the value of the UUID.


Yeah, IDs with a sufficiently large dataset will be unintelligible no matter what you do.


Structured logs would help here in order to extract the logged information in an easy way.


The Python package `shortuuid` makes working with UUIDs a little easier by encoding them as strings: https://github.com/skorokithakis/shortuuid#usage (uses base-57 encoding, with alphabet consisting of A-Za-z0-9 with potentially confusing symbols skipped)

The string representation is what you show to users, but under the hood it's still a UUID and compatible/interoperable with any other system that needs UUID-shaped identifiers.

The coolest part is youcan even truncate the string encodings to get shorter IDs, which correspond to UUIDs with lots of leading zeros.


> The Python package `shortuuid` makes working with UUIDs a little easier by encoding them as strings:

This. In my opinion the only issue with UUIDs is that their standard textual representation can be very verbose and cumbersome in some scenarios.

A single ID takes over around 40 string characters in a line. That's half the width of a default terminal's width.

Thus the solution is obviously to use other textual representations. Instead of base16 then let's ramp up the base to shorten up the text dump, and while we are at it let's pick a readable format.


GUID v6 is pretty nice when you need monotonously increasing numbers that are globally unique.

On another note, I worked somewhere that prefixed GUIDs with the environment the app was running on. All of production, staging, even dev machines all used the same connection string.

There was even a stored procedure to copy user accounts, etc of prod for your machine. It was hands-down the best debugging experience when a customer had an issue.


That sounds very scary, but also, extremely freeing.

No more permissions to deal with! But also...no more permissions to deal with...


I'm only aware of versions 1 through 5. Where is v6 defined?


I think this is what he's referring to[1]

[1] http://gh.peabody.io/uuidv6/


> Sometimes, we want a zero inconsistency approach to storing objects, so it might make sense to make the identifier (or part of it) the checksum of the content that is to be stored. This guarantees that the underlying content has not been modified.

No, it does not. I come across this false statement again and again. It seems that a lot of developers do not understand what checksums or hashes guarantee and what they absolutely do not guarantee.

Let's set this once and for all:

1. Differing checksums or hashes guarantee that the content is different. 2. Identical checksums or hashes do not guarantee anything. The content could be identical or not.


If the hash function is collision resistant (e.g. SHA-256) unique hashes are guaranteed in practice.


False.


The main point of the article is that even using a UUID for database objects, when other functions in the backend code query multiple classes of objects, those function signatures are poorly written, ex:

    ban(id: uuid):
        pass
There's no way for the reader to know what object the ID is referring to, and also the wrong object ID can be used in the function. Their solution is to wrap the ID in a type, which is correct, but did not continue writing that the function signatures in the backend code should be updated as well. Something like this should close the loop in the blog:

    ban(id: UserId):
        pass
Where the compiler can now check that a Team.id is not being passed in the method, only a User.id can be passed into the method.


I have stopped using UUID and GUID in favor of https://github.com/ulid/spec


The article mentions KSUID which has a very similar spec, but with a wider random segment.

For comparison: KSUID is 160-bit wide versus ULID sticks to 128-bit wide (which is somewhat more compatible with UUID-like database column types, for instance, so long as they don't try things like UUID version checks). KSUID uses Base62 encoding versus ULID uses a much simpler to (lexicographicly) sort Base32 encoding.

(I've found ULID useful in some of my own projects.)


This is one thing I like about type systems where you can declare a primitive type as MyImportantThing. This ensures the string or what have you is explicitly defined as MyImportantThing. Rust does this pretty well. C/C++ AFAIK will let you pass in the raw string, and so will Java if you "extend String".


You're talking about what is called the newtype pattern in Rust, and the newtype language feature in e.g. Haskell, right?


What does it mean to “extend String” in Java? You can’t, of course, literally do that.


Sorry bad example. Forgot it's final. :)


Pretty obvious stuff. Of course GUIDs are more unweildy to read etc. compared to simple auto-incrementing integers.

I don’t see a reason to prefix an id with something like `task-`. I would rather leave it to the display logic.


I tried many things for making sync work across devices. I tried GUIDs, and partitioning ranges of ints, and several versions of it.

But what worked amazing?

Use NATURAL keys (or their hash) + version field. That is all you need most of the cases. It make sync far easier, easier to trace stuff (thanks to version), immune to problems of timestamps (some computers have their cloks wrong). In short:

    struct Order {
        code: String, //natural key
        version:usize
    }

    struct Location {
        code: Hash //hash of city + country
        city:String,
        country:String,
        version:usize
    }
Natural keys are global if well defined. In some places where it is not obvious, hashing the whole row and put a nice encode is the same.

This also will reveal when something TRULY need a guid or similar. For example, for invoices in my country the law demand partition of ranges with certain characteristics (ie: INV-1-XXX in machine 1, INV-2-XXX in machine 2).

Add another id:i64 become redundant most of the time. If your Order.code is duplicated or whatever it will be the same problem with or without an extra id:i64, so is better to deal with the problems of the ACTUAL data when is need and not mask it with other stuff.

The downside is that the key become repeated in JOINS (like in InvoiceLine) but honestly all rdbms handle triggers, and it actually become very nice to see the Order.code in the child relations (far easier to correlate).


Isn't the obvious difficulty here that such natural keys tend to change? Eg names and boundaries of cities & countries change - relatively - all the time. You don't want your IDs to be changing alongside


Why not?

I work for ERP/eCommerce, so this issue happens. But for real, I need to store historical facts anyway ie: In a invoice/order I must store the ship info at the time of the transaction.

A artificial ID is useless in a lot of cases where "and if the key change, what?" because if that is an issue, is mostly because the data must retain history, and then, you need to store that anyway.

For that, I log the data in a history/log table.


The best answer is the humble integer. The only reasonable arguments I have ever seen against using integer keys universally are as follows:

#1 Integer keys have finite range.

#2 Integer keys betray the identity of other sensitive resources when exposed as a public identity.

#3 Integer keys are "difficult" to sequence in the face of multiple networked participants.

My resolutions and counter-arguments are as follows:

For many systems, #1 is not a concern, because the number of expected entities is well-bounded by a 64 bit integer. For others, #1 can be resolved by usage of more complex types such as BigInteger (C#). If utilized carefully, these can be treated just like normal integers, and quickly converted to/from byte arrays of appropriate length to satisfy the required range. In virtually all SQL implementations, blob columns containing these values can be indexed with the exact same semantics as with a 64-bit integer column. Whether this performs better or worse than GUID keys probably depends on if you can provoke a >120 bit BigInteger representation. This is quite unlikely, even for Google.

#2 is trivially solved by simply applying encryption to sensitive keys as they traverse the boundary between your system and the outside world. AES256 would do the trick here. You could also generate entirely separate keys of any appropriate type for public consumption (i.e. maybe some YT-style identifier format).

#3 is solved by anticipating the maximum possible # of nodes in your system, and then producing a key space in which identities are sharded out by a simple constant factor of that max quantity. This would certainly produce concern regarding all of the skipped identities (assuming you start with a small number of hosts on day 1), but the proposed resolution above for #1 (BigInteger) alleviates these concerns with a practically infinite range of keys. Skipping 10k identities is a non-event when you have all of infinity to pull from.

There are also other considerations with this. GUID keys are a pain to communicate. Integers, even of massive range, are easy for most humans to communicate verbally when appropriate digit grouping and other reasonable measures are undertaken.

Also consider a situation in which you decide to use 1 global integer range to key every single entity in your system. This allows for interesting database structures in which foreign keys are all referring to the same keyspace, so the specific type of a thing is no longer a hard constraint in a relational sense. Some would probably take substantial offense to this proposal, but I have found in many cases this allows for powerful optimizations. Anything can be used irresponsibly.


> Integers, even of massive range, are easy for most humans to communicate verbally when appropriate digit grouping and other reasonable measures are undertaken.

Presumably this is after not taking your own advice of performing AES256 encryption on the key before sending it to the user?

Your three easy steps together seem a lot more complex than just using guids.


Not all communications around keys are of a sensitive nature. Virtually daily, we will communicate keys internally in order to clarify details regarding some activity. Applying AES to keys internally would only serve to hinder our operations & support efforts.


How often are you verbally communicating primary keys/id's internally within an organization?

I've never understood arguments against guid's, except for table performance with databases that might take a performance hit because of either inserts, or because data won't necessarily be partitioned in the "correct" order (e.g. MySQL writes to buffer pools).


> The best answer is the humble integer. (...)

...and then you proceed by suggesting ID code generation methods that replicate UUIDs.

It sounds like you are trying to criticize UUIDs while being totally unaware of what UUIDs are.

I mean, UUIDs are nothing more than integers that are generated following one of half a dozen different methods so that they can be probably unique without relying on a central authority.

One of the UUID versions is a pure random number. Another is a MAC address concatenated with a timestamp complemented with a local counter. There are literally two UUID methods that consist of hashing data.

Your supposed solutions to the alleged problem posed by UUIDs is a reinvention of what UUIDs have always been. But UUIDs have been though through and are standard and ubiquitous.


The size requirement of integers grows with the number of records. Uuid's tend to be a fixed size.

IMO that's a big advantage for integers that uuid's will never be able to overcome. Key size matters when you are dealing with heavily indexed data.


Absolutely everything in this article is true, and I wish the folks who wrote stuff on Azure would read this... But if you actually need an opaque identifier for an arbitrary resource, what actually works better?


Microsoft is obsessed with GUIDs, Windows is also full of them.


I’ve always wondered: what is the history behind hyphens in UUIDs?



I also tried to use custom types for IDs, but in my opinion, it has more cons than pros. I have to write custom serializers and model binders for them, explicitly convert them to other types, write separate validators etc... At the end I finished with even more bugs than before.


I found serialization of typesafe IDs trivial in Kotlin with Jackson. We have a single generic supertype that has the correct annotation to tell Jackson to use the `value` field as the serialized value, and defines helpful methods like equals. Each new ID type is defined as a single line of code, simply inheriting from this base ID type and providing the actual underlying type (int, string, uuid).

Validators likewise are a non-issue. Jackson handles that automatically during the underlying type conversion, and Java won't allow you to construct a UUID that is invalid.

I see explicit conversion between types as a pro, not a con. If I'm going to take one ID and try to use it as an ID for a different type of entity, I'd better have a very good reason.

Again, I did this with a specific set of tools, but the functionality I used should be available in pretty much any language and serialization framework.


Why not just `f"task-{uuid.uuid4()}"`?


I agree, the author is mixing up presentation with the data type itself, and then pinning the blame on the poor presentation on the way he chose to present the data.

If the author's problem is not being able to grep logs easily with UUIDs because they miss context info then the solution to his problem is to output logs that are searchable by providing additional context.

His arguments regarding searchability are also specious. Take for example

> For example, a task ID would look like: `job-123-task-1`. This also helps in ad-hoc database queries to find relevant rows without complex JOINs.

Well, if he's already running SQL queries then running an exact match on two attributes is far easier and elegant and less error prone than running a string pattern search on a derived attribute.


Read the article, but also: sortability.


"At the very least, identifiers should not be allowed to float freely as strings or integers in order to prevent a class of inconsistency bugs."

Tell that to almost every Typescript developer who uses `number` for identifiers.


Typescript doesn't have functionality for non-equivalent types with identical primitives or interfaces (e.g. you can't have string-equivalent type A ≠ string-equivalent type B), though there's ongoing discussion around adding it (https://github.com/Microsoft/TypeScript/issues/202).


Arctic take.


That is why we use things like io-ts. t.Int is a branded type that ensures number is whole and greater than 0.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: