Do Not Recommend: User Provided Primary Keys (cendyne.dev)
81 points by tempodox on Feb 23, 2022 | 72 comments



I don't get this; as far as I understand, there is no reason at all to use a string column for a UUID. Your primary keys are important for performance and extremely important for integrity. Making them slower and less robust by using a string column just doesn't make any sense, so why would you ever do that?

UUIDs are binary data, they should be stored as such.
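To make the size point concrete, here is a small stdlib-Python sketch (mine, not from the article) comparing the textual and binary forms of the same UUID:

    import uuid

    u = uuid.uuid4()
    print(str(u), len(str(u)))          # canonical text form, 36 characters
    print(u.bytes.hex(), len(u.bytes))  # raw value, 16 bytes
    # A 36-character text column costs at least 36 bytes per key (plus collation rules);
    # a native uuid / BINARY(16) column costs 16 and compares as plain bytes.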


Real UUIDs in the DB are definitely the right answer. I couldn't help but think the article has the right idea but reaches the wrong conclusion — it is possible to use user-specified IDs, but make sure you're taking them as UUID and storing them as such too.

That said, I understand how they got there. Although using real UUIDs in the backend is obviously the right path, it's amazing how rare their use is in industry. Developers either (1) don't know about the UUID type, (2) don't understand the advantages of such and therefore use a string instead because it's more familiar, or (3) in hubris, cast off the use of UUID because they know better and are doing their own thing.

We're using real UUIDs where I work now, but after a full ten years of industry experience, it's the first job where we're doing it right, despite previous jobs being at top name Silicon Valley companies who you'd think would know what they're doing.


There is a fourth reason: momentum.

We are just (finally) retiring a product with a large, less-organised-than-we-care-to-admit codebase that has been around in various states longer than our younger staff have been alive, and that stored UUIDs wrong due to an early misunderstanding by a long-gone expert. Several times during its life I made the case for converting its use of UUIDs-in-strings to proper UUID types, even demonstrating significant performance improvements under load in a PoC, and storage savings. Everyone agreed it should be done that way, but there was considerably more work involved than just switching types (changing procs or type conversions could kill good query plans, dealing with the fact that at some point in history a smart person had used other strings as special values in one or two places, finding ad-hoc query preparation everywhere and making sure it declared the right type, ..., and of course the dreaded full regression test), so it never got done.

Especially in the last decade and a bit, because it would be retired in a year or two (ahem) so the time & effort would be better put to use elsewhere.


Yeah, it is essentially storing bytes in an encoded form, which is never the most efficient way. UUID, hex, base64, whatever: when it hits the DB it can be decoded into bytes and stored more efficiently, which is generally a good idea.


From the article:

> If you use an embedded type like PostgreSQL UUID everywhere, then this is not an issue for you.

So only applies where your language/system doesn’t support strong types.


I’m fairly certain I heard someone say that it was inconvenient for the devs, or something like that.

A lot of technical problems are born of laziness.


Well, obviously UUIDs should not be stored as a string but using a proper "uuid" data type (which also makes storing them much more efficient). Then the database will automatically reject an invalid value with an error.


Ah interesting, I didn't even think about how storing them this way makes them more efficient. Do you have any resource on that? :)



In case anyone needs the equivalent for MariaDB: https://mariadb.com/kb/en/uuid-data-type/

Couldn't actually find something similar for MySQL apart from using BINARY as the datatype; historical sources also refer to that, though they are (hopefully?) out of date: https://dev.mysql.com/blog-archive/storing-uuid-values-in-my...

An observation: it seems like PostgreSQL is HN's darling (and maybe the industry's as a whole). I wonder why MariaDB doesn't get similar amounts of attention, apart from maybe not being as actively developed? Or maybe its relative simplicity or history works against it here?


MySQL doesn't have a native uuid column type, but it does provide conversion functions UUID_TO_BIN() [1] and BIN_TO_UUID() [2]. These also have a convenient optional "time swapping" flag, to improve insertion order / btree efficiency.

[1] https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-functi...

[2] https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-functi...
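For anyone curious what that swap flag actually does, here is my rough Python reading of it (an illustrative sketch of the byte rearrangement, not MySQL's implementation):

    import uuid

    def uuid_to_bin_swapped(u: uuid.UUID) -> bytes:
        # Approximates UUID_TO_BIN(u, 1): move the slowly-varying time-high group
        # to the front so v1 UUIDs insert in roughly ascending order.
        b = u.bytes
        return b[6:8] + b[4:6] + b[0:4] + b[8:16]

    def bin_to_uuid_swapped(b: bytes) -> uuid.UUID:
        # Inverse of the above, in the spirit of BIN_TO_UUID(b, 1).
        return uuid.UUID(bytes=b[4:8] + b[2:4] + b[0:2] + b[8:16])

    u = uuid.uuid1()
    assert bin_to_uuid_swapped(uuid_to_bin_swapped(u)) == u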


> i wonder why MariaDB doesn't get similar amounts of attention, apart from maybe not being as actively developed?

I can obviously only speak for myself. But (in my opinion) Postgres simply offers the better overall package. I don't think there is a single "killer feature" that makes it better, but the sum of numerous things (JSON support, indexing, parallel query) that make it more efficient (and easier) to work with.

Note that the UUID type in MariaDB is not yet part of an officially released version. 10.7 is still in preview/alpha


MariaDB 10.7 is now GA. They've recently moved to a once-per-quarter release series model, and 10.8 is the current preview/alpha series.


A UUID = a 128-bit int for a database. The "abc-1234-abc...." form is just a representation of that int.

So, if we think about it as using an int-128 or a string of 22–36 characters (depending on the encoding), one of them takes more space than the other.

So, besides taking up more space, the question then becomes: are larger keys/indexes less efficient? I would say yes.

Other people can probably explain this better than me, but at least for me, it seems intuitive that going through and comparing entries in a list of small things is easier than doing it on a list of large things.


Why wouldn't a dedicated data type be more efficient than an improvisation with CHAR, a pair of BIGINT/INT8 columns, INT16*, or BLOB?

* Is there a SQL database with support for 128 bit integers?


PostgreSQL natively supports a UUID datatype. SQLite provides an official extension to add UUID support.

Since you asked for "SQL database" and not really "OLTP database": ClickHouse natively supports UUID and 128-bit numeric datatypes. I think by now it also supports 256-bit.


You're right, ClickHouse has UInt8, UInt16, UInt32, UInt64, UInt128, UInt256, Int8, Int16, Int32, Int64, Int128, Int256 [1].

[1]: https://clickhouse.com/docs/en/sql-reference/data-types/int-...


What are some use cases for Int256?


Maybe SHA-256 or BLAKE3?


SQL Server also has a dedicated data type for UUIDs


Learned that the hard way recently, when a client changed ERP system and suddenly order numbers went from being 6 digit integers to 20 chars.

I had made their order number the primary key for their integration tables, as I was much more inexperienced at the time and had read so much about using natural keys.

Learned my lesson since, and this project just confirmed it again: natural keys are rarely a good idea.

It wasn't the worst issue ever to convert, given that an ALTER TABLE fixes it, but I had to drop all the FKs first, and it was quite tedious going over the code checking where it made assumptions about the key being an integer.


I personally almost always use surrogate keys, because of having been burned by natural ones once too many times as well. Although sometimes I end up having a unique constraint on a natural column as well, and then the temptation is there to make that the key.

But to keep it consistent I pretty much always end up using a surrogate key anyways. (I think I used natural keys like 2 times in a few years)

Unrelated: the blog's mascot is cute.


In my experience it’s best to work with a surrogate PK and generate a bijective customer-facing natural key.
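One way to read "bijective" here is that the customer-facing code is derived reversibly from the surrogate key, so the two can never drift apart. A hypothetical Python sketch (the prefix and alphabet are made up):

    ALPHABET = "23456789ABCDEFGHJKMNPQRSTUVWXYZ"  # hypothetical: drops 0/O and 1/I/L

    def to_public(surrogate_id: int) -> str:
        # Bijectively encode the internal integer key as a customer-facing code.
        n, out = surrogate_id, ""
        while True:
            n, digit = divmod(n, len(ALPHABET))
            out = ALPHABET[digit] + out
            if n == 0:
                break
        return "C-" + out

    def to_internal(code: str) -> int:
        # Invert to_public(), recovering the surrogate key.
        n = 0
        for ch in code.removeprefix("C-"):
            n = n * len(ALPHABET) + ALPHABET.index(ch)
        return n

    assert to_internal(to_public(123456)) == 123456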


I work mainly with CRUD, and usually people want to search for specific customers, names or similar. While not necessarily bijective on its own (multiple customers, same name, for example), it narrows it down enough for the person to find the record. (Or filtering on multiple fields)

So, I don't really see a need to generate a natural key.


I'm glad to have gotten this confirmed again; in school you learn to use natural keys, because e.g. someone's social security number or e-mail address is unique, right?

I'm currently rebuilding a system where a user-provided 'name' property is used as a primary key (no foreign keys though). I can't change its mechanics much - the name will be used as an ID in a different application as well - but I can add an artificial ID for internal use. Who knows, maybe long-term I can support changing the name property.


> in school you learn to use natural keys, because e.g. someone's social security number or e-mail address is unique, right?

Not sure if this is ironic, but a couple of subtleties about uniqueness aren't always conveyed in school. (Not very subtle subtleties given a few minutes' thought, but still...)

It's not enough to have single-point-in-time uniqueness, i.e. all entities of the type under discussion have unique values of the candidate key now, which is what a uniqueness constraint guarantees.

It is also necessary that the value of the candidate key for each entity also does not change for the expected life of the database (or 250 years, whichever is longer--yes, the life of the database, not merely of the entity concerned), so no using people's names, for instance; and as the population of entities changes, values that disappear cannot later reappear (like SSNs of dead people being re-used for a new person).

If you don't control key allocation, it's only rarely possible to have ironclad, law-of-physics guarantees about these aspects of keyness.


Natural keys are perfectly fine. You just have to realize that you don't have full control over their semantics. But you absolutely require that control to be able to construct the foreign key links in your database schema.

Therefore, by all means feel free to introduce and save any natural keys that the application requires. But do not rely on them for foreign keys.
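In table form, that advice looks roughly like this (a sqlite3 sketch with made-up column names):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customers (
            id            INTEGER PRIMARY KEY,  -- surrogate key, used for joins only
            customer_code TEXT NOT NULL UNIQUE  -- natural/business key, saved but free to change
        );
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id)  -- FK points at the surrogate
        );
    """)
    # Renaming a customer_code now touches exactly one row; no foreign keys care.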


> Did you use a regex to verify the data looks like a uuid before putting it into the database?

Is that really a thing? I have used UUIDs as primary keys many times in my life, and never had a need to do such a thing. I am struggling to even think of a valid use case.


> Did you use a regex to verify the data looks like a uuid before putting it into the database? Is your database (like sqlite) using a case sensitive collation for the column? Then you have a problem.

yeah, it seems like the premise is flawed. I don't know that there's actually a problem here.


Might be related to the use of SQLite, which doesn't really support data types to begin with and allows you to store anything in a column regardless of the declared data type (e.g. you can store 'forty two' in a column declared as integer).

Apparently it's a deliberate decision because they don't really want to be a "database", just something better than "a file".


Strict table support came in last year to help address this: https://www.sqlite.org/stricttables.html
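A quick demo of the difference (needs an SQLite built with STRICT support, i.e. 3.37+; the one bundled with your Python may be older):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE loose (n INTEGER)")
    db.execute("INSERT INTO loose VALUES ('forty two')")   # silently stored as TEXT

    db.execute("CREATE TABLE tight (n INTEGER) STRICT")
    try:
        db.execute("INSERT INTO tight VALUES ('forty two')")
    except sqlite3.Error as e:
        print("rejected:", e)   # STRICT refuses a TEXT value in an INTEGER column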


The only thing I can think of is if the user can export data from one system and import it to another, and for some reason there was a requirement to keep the UUIDs the same.

...or maybe if you're storing data from a remote API that uses UUIDs as primary keys, and you want to make sure you catch it if they change things. Though in that case I would likely have my own primary key and store the remote primary key in a separate column.


This is a really long winded way of describing a few things to be careful about when using UUIDs as primary keys (and mostly due to [ab]using strings and regular expressions to handle them). It _doesn't matter_ who generates them, all the points still stand. Mainly: make sure to canonicalize the things when comparing and use a real UUID type. The leap to broadly not recommending “user provided primary keys” feels hyperbolic.
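The canonicalisation point deserves an illustration: the same UUID can arrive in several textual spellings, and parsing it (instead of regex-matching and comparing raw strings) collapses them to one value. A stdlib-Python sketch:

    from uuid import UUID

    raw_inputs = [
        "0f8fad5b-d9cb-469f-a165-70867728950e",
        "0F8FAD5B-D9CB-469F-A165-70867728950E",           # different case
        "{0f8fad5b-d9cb-469f-a165-70867728950e}",         # braces
        "urn:uuid:0f8fad5b-d9cb-469f-a165-70867728950e",  # URN form
    ]
    assert len({UUID(s) for s in raw_inputs}) == 1
    # One canonical value; naive string comparison would see four different keys.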


It does matter who generates them.

UUIDs are generally assumed not to collide, since that's extremely unlikely. But if you allow a user to generate them, then they can try various shenanigans such as generating duplicated IDs, generating IDs that might have some sort of special meaning (eg, all zeroes), generating non-random IDs that could degrade performance, or storing some sort of data inside the ID.


For that matter, they may not even be trying to do so. Depending on generation method, it can be really easy to generate duplicate uuids from two different processes on the same machine that aren't synchronizing with each other.

In general, the other party generating your identifiers takes all the control out of your hands, so you have to start safeguarding against malicious/antagonistic/non-compliant inputs. And lose out on optimizing opportunities (e.g., opting to use a generation method that offers partial ordering)


So return a 409 and make the user generate a new one. Again, there’s nothing unique to who generates the UUID in this problem provided you validate the UUID which you should be doing anyway.
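A sketch of what "validate the UUID" might look like server-side (a hypothetical helper, not from the article; the 400/409 split follows the suggestion elsewhere in this thread):

    import uuid

    def check_client_id(raw: str) -> uuid.UUID:
        # Parse and sanity-check a client-supplied ID before it goes near the database.
        try:
            u = uuid.UUID(raw)
        except ValueError:
            raise ValueError("400: not a UUID")
        if u.int == 0 or u.version != 4:
            raise ValueError("400: only random (v4) UUIDs are accepted")
        return u  # store u (or u.bytes), never the raw client string
    # If the later INSERT then hits a unique-key violation, answer 409 and let the client retry.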


You can't tell, given a number, whether it was randomly chosen. This could lead to a hash DoS, for example, if you assume your UUID distribution is random.



Yes, this reads like a user who is dead set against all typing in software and is then surprised when that works out badly. There are some real problems with UUIDs as primary keys but they fail to mention any of them (some examples: UUIDs are quite large, which make them an expensive option for large tables, v1 UUIDs don't work well for distributed databases as the timestamps cause hotspotting, etc etc).


I've developed a database normalization form wherein absolutely no business knowledge may ever serve as the identity of a thing or relation key between tables. All things are keyed and related either using a global sequential identifier (in synchronous systems) or Type4 uuids (in distributed/async systems).

Depending on any 3rd party keys to build your relations upon is a huge mistake in my experience. Doesn't matter who gave them to you or if they are of high quality or not. If you relate by way of outside key information, all it would take to fuck your entire weekend is one of your customers saying they are moving to a new vendor who uses an integer key instead of a magic string to identify their customer records.


The example at the top of OP was of using UUIDs though. If they are supplied by a client in a distributed/async system, I guess they count as user-supplied.

The root of the problem with regard to UUIDs, though, seems to be that MySQL doesn't actually provide a UUID type, so the system in question was just storing their hex-dash representation as a string.

There are several things going on in the OP example that deserve more discussion than they get.


Yes, they count.

It doesn't matter that they are UUIDs and you can in theory continue to make them after the client changes something. If any user sees them, they aren't fit for a primary key.

I do go further. Anything the user touches is unfit as a primary key. If they want some number to identify the records (in a spreadsheet, in a customer number somewhere on their work or in a public API), that number is unfit as a primary key, it doesn't matter if you are the one creating it.


There’s a difference between using “natural data” and using a “client generated identifier”. This thread is talking about natural data and the example given amounts to a fussy customer rather than an actual technical problem (not that they don’t exist) with using natural data. Client generated UUIDs are totally fine provided you validate the ID and authorize resource access normally e.g. client A can’t put a resource by UUID owned by client B just by knowing the UUID. Using natural data or not really seems to be a product problem and in the face of inexperienced product managers or changing requirements probably best avoided for the reasons given. It’s not clear which issue the article is addressing, specifically, though.


> the example given amounts to a fussy customer

Just so we are clear, the customer pays our salaries.

Client-issued UUIDs seem fine in theory but then start to impose haunting dread when you consider the total space of all possible client devices, operating systems and browsers. At least some of those scenarios are going to have shitty rng implementation or use the wrong type of UUID. At least one of those problematic scenarios will have multiple concurrent users.

What do you reach for when your client-controlled supposedly un-collideable identities begin colliding? Why even take the chance?


HTTP 400 and 409


Why? The advice to never let users see PKs (basically keep them secret?) is new to me.


Basically, if users see them, they will do "human things" with them, like developing a preference ("hey, that customer is deeply annoyed that we identify them by the number 666, can you change it?"); arguing about the rules ("why do customer numbers go from 1, 2, 3 to 21? this is wrong"); correlating and replacing it with other identifiers that they will always guarantee are 1 to 1 with them, but never really are; and a lot of other stuff you or I will never expect. All of the columns in a table are fit for human manipulation; the only exception is the primary key, which has informational rules, so use some other column if they need a key.

It's not "keep them secret", it's "this data is for developers eyes only", it doesn't matter a lot where the developers are, only that they know the rules.


I'll go against the trend here: I started with natural keys with FoxPro, then used surrogate keys for more than a decade (following dogma...), and reverted to natural ones a few years ago.

To make it more fun, the data come from DOZENS of different ERP sources, with massively divergent ideas about what is data (like: one of them stores money in different STRING formats with different decimal separators... in the SAME column). Plus the normal shenanigans like nulls inside chars, uncommon ideas about what is ASCII, etc.

And I sync the data in near real time, and the app is used concurrently by many users... etc.

---

The major reason: A LOT easier to sync data. A LOT.

UUIDs are one of the biggest anti-patterns to use for this. They mean NOTHING. They demand to be correlated with the REAL thing.

They make it harder to locate stuff later.

And when syncing, it looks like there are 2 different records. You still need to correlate with the actual natural keys.

But what about changes to the key? What about mistakes? The natural keys can and will change, and using UUIDs or integers does NOT CHANGE THAT FACT. So deal with it upfront. It's not that hard.

The only real major drawback is having to alter the rest of the places where the key could appear. Happily, DBs have something called triggers. They work fine.

--

What is far better than UUIDs where natural keys can't fit is to compute a hash of the relevant rows and use that as the ID (see the sketch after this comment). It also works wonders for sync logic.

--

The key here is that the relational model is made for modeling the facts of the "real world".

Embracing this makes everything nicer: it is easier to correlate things ("this line of the invoice already tells me the Invoice Sequence and the TaxId of the customer and the City code!"), making it easier to do reporting, easier for the DB admins, etc.
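A minimal sketch of the hash-as-ID idea above (the column choice and separator are made up; note that any change to the hashed columns produces a new ID):

    import hashlib

    def row_id(*natural_key_columns: str) -> str:
        # Derive a stable ID from the columns that identify the row in the source ERP.
        # A separator that cannot occur in the data avoids 'ab'+'c' == 'a'+'bc' collisions.
        canonical = "\x1f".join(natural_key_columns)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    print(row_id("ERP-7", "INV-2022-0042", "3"))  # same inputs on both sides -> same ID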


The posted article does not go far enough. Due to regulations like GDPR, user ids are considered PII if they're ever transmitted to a device outside the control of the organization. So if you must follow European regulations (and other countries like Canada are catching up too!) then to make your life easier I recommend TWO id columns. One as a random uuid4 generated in server side software (makes sharding easier, and collision is effectively impossible) that can be used for API calls and one internal id in the DB that can be whatever you want.

I know it seems strange to have three effective columns that all do the same thing, but the reality is that we actually have three separate concerns here anyway. One key to our user's data. One key to our (untrusted) API data. One key between the tables. The user's key can change (e.g., someone changes a SIN / SSN after fleeing domestic abuse), the external key can leak (so it's PII which should—and eventually must—be deleted), the internal key can be very difficult to delete or obfuscate due to DB constraints, backups, etc.

If you don't do this you'll be paying a lot more for privacy engineers to write your erasure code. Or you'll be limited to countries without strong privacy protections.
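As a concrete (hypothetical) shape for that, with SQLite standing in for whatever the real database is:

    import sqlite3, uuid

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE users (
            internal_id INTEGER PRIMARY KEY,  -- joins/FKs only, never leaves the database
            external_id BLOB NOT NULL UNIQUE, -- random uuid4, used in API calls and URLs
            email       TEXT NOT NULL UNIQUE  -- the user-facing key; may change or be erased
        )
    """)
    db.execute("INSERT INTO users (external_id, email) VALUES (?, ?)",
               (uuid.uuid4().bytes, "someone@example.com"))
    # Erasure: rotate or delete external_id and the PII columns; internal_id and its FKs stay put.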


Microsoft's SQL Server used to pad a blob that was stored in the database with an arbitrary amount of zeroes. This bit us when trying to verify the hash signature of that blob. I don't know if that's still the case.


I use secure random base-58 16-digit keys as primary keys in my distributed database, as long as you keep collisions rare with enough entropy (58^16 = 16400152899115243850138976256) you don't need to worry too much, generating a new ID/key is easy/fast when you have all the keys locally to match against.
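For reference, roughly how such a key can be generated with the Python stdlib (the alphabet below is the common base-58 set; the commenter's exact scheme may differ):

    import math, secrets

    BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
    assert len(BASE58) == 58

    def new_key(length: int = 16) -> str:
        # Uniform random base-58 key; 16 characters give about 93.7 bits of entropy.
        return "".join(secrets.choice(BASE58) for _ in range(length))

    print(new_key())
    print(16 * math.log2(58))  # ~93.7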


Why not uuids?


UUIDs are pretty big when serialized to text; this makes use cases like QR codes and SMS unwieldy in my experience. YMMV


Usually you want to store them as integers when possible, like in DB or QR. For text formats like SMS I've often seen base-64 or similar with more characters.


You can also choose to encode them there in a less space-hungry format and have the best of both worlds.


Why UUIDs?


Proper, standardized, efficient support across the stack, plus well-known uniqueness guarantees about the IDs generated. Not that an ad-hoc solution can't be just as good, but why go through all the trouble of reinventing the wheel?

What benefits are you getting from the ad-hoc solution that you wouldn't with uuids? They were made for this.


I look up collisions every time I generate an ID, and it's also synchronized when the node is broadcast on the global cluster.

UUIDs use time and origin to make things unique, which is both an advantage and a problem: say, for example, you want to hide the origin and time?!

Also, as danuker said, why add a dependency for no real benefit? Log4J should have proved to you that it's better to write minimalistic implementations yourself.

Finally there is a difference between reinventing the wheel and improving the wheel: http://move.rupy.se/file/wheel.jpg

I made a distributed database in 2000 lines of Java.


> UUIDs use time and origin to make things unique

Some UUID versions do. Version 4 has all bits (except the version and variant bits) drawn from a (P)RNG.


But then is uniqueness guaranteed?


From Wiki originally:

> the annual risk of a given person being hit by a meteorite is estimated to be one chance in 17 billion, which means the probability is about 0.00000000006 (6 × 10^−11), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate. In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.

So in a theoretical sense, no, but in a practical sense, yes. The same is true for any custom ID format like yours as well. 128 bits is enough to never hit a dup though, so you don't need to go crazy.

Your database should be what authoritatively guarantees uniqueness at the end of the day — generate UUIDs assuming no collisions (which will ~always be true), but store in a UNIQUE index so things'll fail in case of a duplicate or a bug that results in trying to store the same ID twice.
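In sqlite3 terms, that "let the database be the authority" pattern is just (sketch):

    import sqlite3, uuid

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE things (id BLOB PRIMARY KEY)")  # PRIMARY KEY implies a unique index

    new_id = uuid.uuid4().bytes
    db.execute("INSERT INTO things (id) VALUES (?)", (new_id,))
    try:
        db.execute("INSERT INTO things (id) VALUES (?)", (new_id,))  # duplicate, won't happen by chance
    except sqlite3.IntegrityError:
        print("collision (or, far more likely, a bug) caught by the database, not by hope")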


> why go through all the trouble of reinventing the wheel?

Because now you depend on an UUID library and its attack surface.

In some environments it is crucial to reduce dependencies.


You can generate the UUID in the database. And UUID generation is something that you can also find in the standard library quite often, in both those cases you don't need to add an additional dependency.


Yeah, UUID generation should be from the DB. Have a stored procedure for creating the record and return the UUID from that, rather than it being provided from outside.


UUID is part of your language stdlib, or at least there will be widely tested dependencies for the formats. Are you that certain about the parser of your own custom format?...


>Because now you depend on an UUID library and its attack surface.

I can appreciate reducing dependencies, but this is bordering on paranoia.


Whenever there's a "parse" method (as shown in this article), there's the possibility of poor implementation and exploitability.


Sorry, I do not get your point. If your inputs come from outside your system, you have to sanitize them, period. It would be the same thing for the original poster of this thread if the IDs are being generated client-side. Besides, you're also running the risk of a poor implementation of an ad-hoc system, which in my opinion is orders of magnitude larger than with a battle-tested algorithm.


No no.. why not uuids :D


I think PostgreSQL's 'citext' datatype gets around this:

> The citext data type allows you to eliminate calls to lower in SQL queries, and allows a primary key to be case-insensitive.

https://www.postgresql.org/docs/9.0/citext.html

TL;DR If users choose their usernames, and 'MyUserName' was registered and someone tried to register 'myusername', it would detect a collision and error, thus not causing any problems (the second user would simply have to choose another username).
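For comparison, here is the same behaviour outside Postgres, using SQLite's NOCASE collation (not citext, but it shows the collision the TL;DR describes):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (username TEXT COLLATE NOCASE PRIMARY KEY)")
    db.execute("INSERT INTO users VALUES ('MyUserName')")
    try:
        db.execute("INSERT INTO users VALUES ('myusername')")
    except sqlite3.IntegrityError:
        print("duplicate username, please choose another one")
    # Note: NOCASE only folds ASCII; citext folds case according to the database locale.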


Note that most of the examples are UUIDs, and postgres has a datatype for that that evades all these problems and is a lot more efficient too.


UUIDs are not text. Most uuid parsers ignore hyphens and braces, so citext will not help you.

Use the native uuid type, or as a fallback, store it in a 16 byte binary blob.



