Ran into this one recently, and worked around it by executing a CLUSTER command after inserting data. By CLUSTERing on a column that contains random data (which your test could inject) you force Postgres to randomise the row order on disk, and thus randomise the return order.
In my case our primary key is a random UUID (I know, it’s a terrible thing, it’s not my choice), which is perfect for the task as CLUSTER requires an index. You might be able to pull the same trick by creating an index in a transaction, CLUSTERing then rolling back the transaction, but I suspect that you can’t call CLUSTER in a transaction.
Not as nice as proper test mode, but gets the job done, and avoids the need to wrap queries deep inside your application, with all the brittleness and peril that comes from using wizard level code reflection that’s normally required for such tricks.
The first and maybe the only time I saw UUID PKs was in a multi-tenant solution. If your customers are potentially competitors to each other, you absolutely do not want people to be able to walk IDs sniffing for other stuff. I wasn't the biggest fan of this solution, but I really didn't have a better suggestion.
On that same project we ran afoul of Javascript's 2^53 integer precision limit (floating point) with IDs, but I think that was a separate issue.
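JavaScript stores every number as an IEEE 754 double, which can represent all integers exactly only up to 2^53. Python floats are the same doubles, so the limit is easy to demonstrate:

```python
# JavaScript's Number.MAX_SAFE_INTEGER is 2**53 - 1; beyond 2**53, some
# integers are no longer representable and get silently rounded.
MAX_SAFE = 2**53 - 1

still_exact = float(MAX_SAFE) == MAX_SAFE        # True: 2**53 - 1 fits exactly
absorbed = float(2**53 + 1) == float(2**53)      # True: the +1 is rounded away
```

This is why 64-bit database IDs can get silently corrupted when they pass through JSON into a JavaScript client: any id above 2^53 may round to a neighbouring value.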
Honestly I think random UUIDs, or just UUIDs in general, make for terrible PKs and generally shouldn’t be used.
To solve the multi-tenant issue, I personally prefer one of:

- flakeIDs/hashIDs that have an ordered component and a random component, to make walking IDs hard;

- aggressively namespacing your data, so a customer ID plus an object ID is always needed as a pair to look something up, and trying to walk object IDs can only ever result in someone accidentally looking up objects that already belong to them;

- edge layers that remap and filter internal identifiers via hashing etc., so externally all IDs are opaque and random;

- strong and careful access control that ensures you can only ever look up IDs that belong to you, with careful consideration for side-channel timing attacks.
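The namespacing option can be sketched in a few lines (all names here are made up for illustration): every lookup requires the (customer_id, object_id) pair, so walking object IDs only ever lands on your own objects.

```python
# Toy tenant-namespaced store: the composite key is the whole point.
store = {
    ("cust_a", 1): "cust_a's document",
    ("cust_b", 1): "cust_b's document",
}

def lookup(customer_id, object_id):
    # The requester's own customer_id is always part of the key, so there
    # is no way to address another tenant's rows no matter what object_id
    # an attacker enumerates.
    return store.get((customer_id, object_id))
```

Even though both tenants use object_id 1, `lookup("cust_a", 1)` can never return cust_b's row.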
Relying on random UUIDs for customer privacy would raise red flags for me. If being able to walk IDs is enough to break your security model, then I kinda wonder if you actually have a security model.
It's defense in depth in this case, as I recall. There are subtleties in how you confirm or deny the existence of a record, and for their particular solution to the problem you would get a 403 vs a 404 for a record you did not have access to.
Knowing how much activity a competitor is adding into a project management system isn't a lot of data, but it's more than zero.
There’s a simple solution to that: always return 403. It’s what you should be doing regardless of your underlying data model.
You should always determine the right to access a record before attempting to look it up. If you can’t determine ACLs without the lookup, then your permissions layer should assume the requester doesn’t have permission to view a non-existent record and return a 403.
Doing anything else is dangerous, and randomising PKs is a sticky plaster over a badly designed access control system.
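The deny-by-default rule above can be sketched like this (data structures are hypothetical): the ACL is consulted before any lookup, and "no permission" and "no such record" are indistinguishable to the caller.

```python
# (requester, record_id) pairs that are allowed; everything else is denied.
ACL = {("alice", "rec-1")}
DB = {"rec-1": {"owner": "alice"}, "rec-2": {"owner": "bob"}}

def fetch(requester, record_id):
    if (requester, record_id) not in ACL:
        return 403, None        # forbidden OR non-existent: same answer either way
    return 200, DB[record_id]
```

Because a record Alice can't see and a record that doesn't exist produce byte-identical responses, walking IDs tells her nothing, regardless of whether the PKs are sequential or random.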
How about solutions like https://hashids.org/, which let you keep auto-increment integer PKs and present the IDs to users in an obfuscated string form?
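In the same spirit (this is not the hashids algorithm, just a toy 4-round Feistel permutation with made-up round keys), a reversible obfuscation over 32-bit auto-increment IDs can be sketched as:

```python
# Toy reversible ID obfuscation: a Feistel network is a permutation, so
# every 32-bit input maps to a unique output and can be decoded exactly.
KEYS = [0x9E37, 0x79B9, 0x85EB, 0xCA6B]  # arbitrary round keys (assumption)

def _round(x, k):
    return ((x * 0x5BD1) ^ k) & 0xFFFF

def obfuscate(n):
    left, right = n >> 16, n & 0xFFFF
    for k in KEYS:
        left, right = right, left ^ _round(right, k)
    return (left << 16) | right

def deobfuscate(n):
    left, right = n >> 16, n & 0xFFFF
    for k in reversed(KEYS):
        left, right = right ^ _round(left, k), left
    return (left << 16) | right
```

Sequential IDs 1, 2, 3 come out as scattered, unrelated-looking values, while the database keeps its small, cache-friendly integer PKs. (Note this hides the sequence but is not cryptography; it shouldn't replace access control.)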
It sounds like GP encountered UUID Version 4 keys.
UUID version 7 features a time-ordered value field derived from the widely implemented and well-known Unix Epoch timestamp source: the number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds excluded. It also has improved entropy characteristics over versions 1 or 6.
If your use case requires greater granularity than UUID version 7 can provide, you might consider UUID version 8. UUID version 8 doesn't provide entropy characteristics as good as version 7's, but it can utilize a timestamp with nanosecond precision.
> Both UUIDv8 and UUIDv4 only specify that the version and variant bits are in their correct location. The difference is that UUIDv4 specifies that the remaining 122 bits be pseudo-randomly generated. UUIDv8 suggests that the generated value still be time-based however the implementation details of how that generation happens are up to the implementor. This means we have 48 bits of custom implementation, four bits for the version (1000), 12 more bits of custom implementation, two bits for the variant, and finally 62 bits as the implementation sees fit. This leaves a lot open to the implementor but it has enough rules around it that it can coexist in the existing UUID environment.
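The UUIDv7 layout described above (48-bit millisecond timestamp, 4 version bits, 12 random bits, 2 variant bits, 62 random bits) can be sketched directly; this is a minimal generator per the RFC 9562 field layout, not a production one (it makes no monotonicity guarantee within the same millisecond):

```python
import os
import time
import uuid

def uuid7(ts_ms=None):
    """Build a UUIDv7: Unix ms timestamp in the top 48 bits, then the
    version nibble (0111), 12 random bits, the variant bits (10), and
    62 more random bits."""
    if ts_ms is None:
        ts_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF           # 12 bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 bits
    value = ((ts_ms & ((1 << 48) - 1)) << 80)  # 48-bit timestamp
    value |= 0x7 << 76                         # version 7
    value |= rand_a << 64                      # rand_a
    value |= 0b10 << 62                        # RFC variant
    value |= rand_b                            # rand_b
    return uuid.UUID(int=value)
```

Because the timestamp occupies the most significant bits, two v7 UUIDs generated in different milliseconds sort in creation order, which is exactly the index-locality property random v4 keys lack.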
What's wrong with UUIDs? Back in the 90s Microsoft strongly encouraged it. You lose a little performance and a little storage (128 vs 64 bits), but gain the ability to merge databases (think corporate acquisitions) with less pain.
At least one distributed datastore (Google Cloud Datastore) works better with random keys than incremental keys; they tend to produce fewer tablet splits. At least, it used to; the technology may have changed since then.
Honestly it seems like the arguments for/against UUIDs as keys are pretty mild on both sides. Why would it be "terrible" to go one way or the other?
Harder to copy and paste IDs. Discourages denormalization, since every reference is 128 ~~bytes~~ bits. Reduced throughput for inserts, since the primary index pages won’t all be in cache if the keys are randomized. Lose the ability to use the id as a fast temporal ordering. Encourages uuids generated client-side.
All in all not enough good reasons to make a lot of stuff slightly worse.
> Reduced throughput for inserts since the primary index pages won’t all be in cache if the keys are randomized. Lose ability to use id as a fast temporal ordering.
You can generate the first few bytes (we use the first 3) of the UUID from a timestamp - that gives you index locality. It's even better than an auto-inc because you can control exactly how much of your index will be used for hot inserts based on how many timestamp bits you use, so you can avoid lock contention around the single latest index page.
> Encourages uuids generated client-side.
It's really an API design choice about whether you allow this, but it can be useful in some circumstances (if you trust the client).
> so you can avoid lock contention around the single latest index page.
That is usually the total opposite of what you want. There are some optimizations for inserting into the last page, but primarily it is because you want to do sequential inserts. So if you want to avoid contention on the last page, you should insert ordered by (connection ID, sequential ID or high-resolution timestamp). That way every connection will do sequential inserts on its own page.
Nice, this makes sense. Seems like storing connection_id in the table would make it difficult to look up data later? It seems like you could use (customer_id, seq_id) for a similar effect.
- A fully random UUID is completely secure and private: you cannot infer anything from it. As soon as you add anything to it (like a timestamp or counter), outsiders can infer things like how fast you are generating new objects

- A fully random UUID cannot be abused by developers, who might skip creating timestamp columns and read the time from the id instead

- UUIDs are slower, but as you can see, not without benefits
Not that much slower than bigint as long as you replace the top one or two bytes in the UUID with a rotating per-minute counter.
You keep 106 bits of entropy (from v4's 122 bits) while largely eliminating page faults. Walkability is eliminated while not leaking too much temporal info.
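A sketch of that scheme (using minute-of-day as the rotating counter, which is an assumption): the top two bytes of a v4 UUID are replaced, keeping 106 of v4's 122 random bits, so same-minute inserts cluster in the index while the absolute timestamp never leaks.

```python
import time
import uuid

def rotating_uuid(minute=None):
    """v4 UUID whose top 2 bytes hold a rotating per-minute counter.
    The counter wraps daily, so no epoch timestamp is recoverable."""
    if minute is None:
        minute = int(time.time() // 60) % 1440   # minute of day: 0..1439
    b = bytearray(uuid.uuid4().bytes)
    b[0:2] = minute.to_bytes(2, "big")
    return uuid.UUID(bytes=bytes(b))
```

Compared with a plain timestamp prefix, an outsider who sees two IDs learns at most that they were minted in the same (or a nearby) minute of some day, not when.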
> Discourages denormalization since every reference is 128 bytes.
This might just be a mistake, but a uuid is 128 bits = 16 bytes not 128 bytes. Of course yes there are still circumstances where the extra space (16 bytes vs 8 or 4 or something else) isn't worth it.
Is there an actual problem with this? I don't mean "it feels impure" - what's the failure case here that's severe enough to discard UUIDs as PKs?
The other reasons all boil down to 'database performance tuning', which is reasonable if you have any prospect of needing it, but most datasets are _tiny_.
The other reasons are more important imo, and they aren’t really all tuning; they’re more about ergonomics and simplicity.
There are a lot more edge cases when you use client-side generated IDs. The main one: you have to check that the ID is actually new, to avoid security issues. Let’s just upsert into the DB and return the results; the uuid is new and randomly generated so it’s fine! So simple! Except now Alice can send someone else’s uuid and read their data. And a hundred endpoints all need to handle this correctly. It’s a huge risk.
If you do offline-first ID generation using global uuids, the possibility of conflicts (due to bugs, patched client code, etc.) opens up a whole rabbit hole of edge cases and problems as well.
It can be done (Bret Taylor used that architecture for Quip), but it’s needlessly tricky, which is the whole story for UUIDs imo: annoying and slightly worse for many normal apps. If you’re building a complex distributed system and want to deal with all the trickiness, go ahead. I’d recommend you use a B64 custom identifier instead of a UUID so that it’s copy-pasteable :-)
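A base64 rendering of the same 128 bits is one way to get that copy-pasteability (function names here are made up): 22 URL-safe characters instead of the 36-character hyphenated UUID form.

```python
import base64
import uuid

def compact_id(u=None):
    """Render a UUID's 16 bytes as 22 URL-safe base64 characters,
    stripping the trailing '==' padding."""
    u = u or uuid.uuid4()
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode()

def parse_id(s):
    # Restore the padding that compact_id stripped, then decode.
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))
```

The round trip is lossless, so the database can keep whatever 128-bit representation it likes while users only ever see the short form.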
The nature of your application and its security and idempotence needs will determine who should (ideally) generate ids for any specific context. The format is irrelevant.
Generate ids on the client when it makes sense, generate them on the server when it makes sense. Server-side is more common but if you choose poorly you'll make bad software either way.
> Postgres can auto generate UUIDs too and can be used for pk
True, but the practice makes it easier to shift to the dark side and insert the child records first, as you can already know the pk of the parent record in advance.
> How can you do that when you have fkeys in your database design like a good engineer, right? Right?
I know you are trolling, and it is ok if done sensibly, but it is not simply purism that makes it a bad idea. Because you have to trust the client, it opens the way to discovery attacks (try inserting a record with a specific PK -> it will fail if the record exists -> now you know that it exists even if you don't have access to that record).
You may also not have access to all client implementations (think of public APIs) so some client libraries might not implement proper (i.e. strongly random) UUID generation.
B-trees are self balancing so I don't think there will be bloat. But if you insert towards the end of an index a lot, the right-side of the index will more likely be in cache which makes it a lot faster. If you inserted randomly then you need to do lots of traversals in different parts of the B-Tree.
When you’ve got 128 bits to play with, I think you can do more useful things with those bits than UUIDs provide.
UUIDs are perfectly fine identifiers in a purely abstract sense. But baking a little more info into those identifiers can make debugging and building operational tools so much easier. It’s just a massive waste to fill those 128 bits with either random data, or the very limited options that other UUID versions give you.
Arguments about leaking information publicly are a little silly as well, I think. There are plenty of ways of preventing that; trading ease of debugging for making your internal identifiers safe in public is a poor trade-off in my view.