Random identifiers are poorly compressible (lemire.me)
22 points by ingve on Sept 12, 2021 | 17 comments



The reason to use UUID column types in a database is so you can skip locking DB sequences or using synchronization strategies like HiLo and just generate collision-free identifiers client-side. Additionally, some databases will use the randomness to give you free uniformly distributed sharding.

But those random identifiers have no intrinsic value. So long as your database is a complete view of the data you wish to compress (i.e. a closed graph with nothing from the outside world pointing in) then one distinct identifier is as good as another. You can consider a compression strategy that replaces uuid x with an (irreversible) one-to-one mapping value y as a form of lossy compression that gives you an equivalent but not identical dataset upon decompression.

Depending on where, how, and why the compression is taking place, that could be good enough. If there is an additional non-constrained channel available for the non-hot path, the mapping could also be exported separately to allow for reversing the mappings out-of-band.
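
A minimal sketch of that kind of remapping in Python (the helper name and record shape are mine, not from the comment): each distinct UUID is replaced by a small dense integer before compression; drop the returned mapping for the lossy variant, or ship it over a separate channel to keep things reversible.

  import uuid

  def remap_ids(records, key="id"):
      """Rewrite the `key` field of each record to a dense small integer."""
      mapping = {}
      remapped = []
      for rec in records:
          # Assign the next small integer the first time we see this UUID.
          new_id = mapping.setdefault(rec[key], len(mapping))
          remapped.append({**rec, key: new_id})
      return remapped, mapping   # export `mapping` out-of-band if reversibility is needed

  rows = [{"id": str(uuid.uuid4()), "value": n} for n in range(3)]
  small_rows, id_map = remap_ids(rows)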


Well, yes. Maximum information = Maximum randomness.

This is why "modem sounds" are like white noise - both are maximum entropy. The probability of ANY state is maximized to the channel capacity. For binary words, that means every combination of bits would exist. That's not going to be as compressible as a situation with redundant bits or states.

Maximum compression is also related to this: data at maximum entropy is already maximally compressed and uses the full channel capacity. You also have the "most perfect" encryption if the data has been encrypted - again, redundancy means encryption can be broken faster/easier than it can with maximum-entropy data.

All nicely related. All commonly dealt with in digital modulation and wireless technologies (I learned all this as an EE long before the digital communications revolution arrived - came in handy for sure).


The math is correct, but I don't fully agree with the explanation. The saved bits don't really come from uniqueness, taking away 10,000 possibilities out of 2**64 barely makes a dent. The combinatorial savings come almost entirely from ignoring order.

The real question is, what does a dataset have to look like for a unique list of identifiers to be such a significant part of it that this is worth worrying about? Many identifier-heavy datasets have lots of cross references (e.g. n:m relation tables), with lots of repeated identifiers that are actually easy to compress. Am I missing something?
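
Putting rough numbers on that (my own back-of-the-envelope arithmetic, not the commenter's): for n = 10,000 distinct 64-bit identifiers, the whole combinatorial saving is about log2(n!) ≈ 118,000 bits, i.e. roughly 11.8 bits per identifier, and the contribution from mere distinctness is negligible.

  from math import comb, lgamma, log, log2

  n, k = 10_000, 64
  naive_bits = n * k                   # 640,000 bits to store the list verbatim
  set_bits = log2(comb(2 ** k, n))     # information content of the unordered set of n distinct values
  order_bits = lgamma(n + 1) / log(2)  # log2(n!): the part saved purely by dropping order

  print(f"total combinatorial saving: {naive_bits - set_bits:,.0f} bits")
  print(f"of which 'ignoring order':  {order_bits:,.0f} bits")   # essentially all of it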


There was an interesting point about this on the Mangadex thread the other day: https://news.ycombinator.com/item?id=28443261

> Do that! If you strip them out, the 529 kB document shrinks to 280 kB, which hardly seems worth the hassle, but when gzipped, this is a miniscule 13 kB! This is because those strings are hashes, which significantly reduces their compressibility compared to general JSON, which usually compresses very well.


>In the extreme case where you have many, many distinct identifiers, then make [sic] become highly compressible

This seems backwards. Each random, incompressible identifier must be stored verbatim exactly once. If you have relatively few identifiers that are mentioned many times, it doesn't matter at all what you call them: a compressor can just put them in a dictionary and refer to them by index.

On the other hand, if you have a large number of identifiers, the cost of that dictionary becomes non-trivial.
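
Both halves of this are easy to see with a toy experiment (my own construction, standard library only): the two lists below contain the same number of identifier mentions, but one has 100 distinct values repeated many times and the other 10,000 distinct values used once each.

  import uuid, zlib

  def compressed_size(ids):
      return len(zlib.compress("\n".join(ids).encode(), 9))

  few_repeated = [str(uuid.uuid4()) for _ in range(100)] * 100   # 100 distinct ids, each seen 100 times
  all_distinct = [str(uuid.uuid4()) for _ in range(10_000)]      # 10,000 distinct ids, each seen once

  print(compressed_size(few_repeated))   # small: only the ~100-entry "dictionary" really costs anything
  print(compressed_size(all_distinct))   # large: each random identifier is stored more or less verbatim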


It makes sense. If the number of identifiers you’re storing exceeds 1/2 the cardinality of the identifier’s underlying key space, then you can simply store the complement at a reduced size. Why store (2^32 - 10) random 4-byte identifiers when I can store 10 instead, and invert the lookup query?
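
A toy version of that complement trick, shrunk from a 32-bit key space to an 8-bit one so it runs instantly (the specific values are illustrative):

  key_space = set(range(2 ** 8))                  # 256 possible identifiers instead of 2^32
  present = key_space - {3, 17, 99, 200}          # 252 identifiers are actually in use
  complement = key_space - present                # store just the 4 missing ones instead

  def contains(identifier: int) -> bool:
      # Inverted lookup: being "present" means being absent from the complement.
      return identifier not in complement

  assert contains(42) and not contains(17)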


This seems like a weird argument, since any time I've encountered a UUID or some other randomized identifier in practice, I care about things beyond set containment, like "what order did I see them in" or "the database row attached to the identifier".


This is quite an interesting point. In data engineering, compression is a key goal -- yet random identifiers are often used.

Part of the reason is, presumably, that over a distributed system it's hard to coordinate choosing a deterministic globally unique identifier without a performance cost.

Does anyone have any thoughts on implementing a globally unique identifier in a compressible/deterministic way?

My initial guess is to choose some low-collision-probability random part and then glue on a timestamp?
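
A rough sketch of that guess (the exact bit split is my own choosing, in the same spirit as ULID or Snowflake): a millisecond timestamp in the high bits makes ids roughly time-ordered so they share prefixes and index/compress well, and random low bits make client-side generation collision-safe without coordination.

  import secrets, time

  def make_id() -> int:
      ts_ms = int(time.time() * 1000)   # ~41 bits of millisecond wall-clock time in the high bits
      rand = secrets.randbits(80)       # 80 random bits to make collisions negligible
      return (ts_ms << 80) | rand       # roughly time-ordered, so prefixes compress and index well

  a, b = make_id(), make_id()
  assert (a >> 80) <= (b >> 80)         # the timestamp prefix never goes backwards (barring clock jumps)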


Snowflake has been mentioned here, but there is another way to achieve this, although I've never seen it used in practice. Imagine that in the DBMS the identifier column is sequential, but whenever rows are queried, the identifiers are run through AES. The server would have a single AES key that is used throughout its lifetime. To the user the identifiers would look random, but on the server side they are sequential and thus highly compressible.

In addition, when querying the rows with a condition on the identifier column, AES decryption would have to be used. For example, "select user_id, first_name from user where user_id = 6744073709551615;" would decrypt 6744073709551615 into the sequential number, say 103 (if it's the 103rd inserted row), and then use 103 to query the table.

There would certainly be a performance overhead when returning large result sets where lots of identifiers need to be encrypted. On the other hand, modern CPUs have specific AES instructions which are really fast, and when the identifiers compress to something like 1.5 bits on average, you can fit a lot more in the CPU cache, which may offset some or even all of the performance loss.

If you have only a single master DBMS server, then 64-bit integers should be sufficient when using this strategy. Otherwise, if you are using multi-master in a distributed environment, you could achieve lock/communication-free identifier generation by instead using a 128 bit integer and for example include the server identifier inside those 128 bits, similarly to snowflake, before encrypting it and exposing it to the user.

Side note: since the identifiers are sequential, there is another benefit in that, like Snowflake, ordering by (created_date, user_id) wouldn't incur any performance cost over just ordering by user_id.

If anyone knows of anyone who uses this, or knows why it would be a bad idea, I would love to hear it.
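
For what it's worth, here is a minimal sketch of the encode/decode step as I read the proposal, using the 128-bit variant and the pyca/cryptography package (single-block ECB is deliberate here: every plaintext block is a unique counter, so a deterministic block-level mapping is exactly what's wanted):

  import os
  from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

  SERVER_KEY = os.urandom(16)   # in practice: one fixed AES key kept for the server's lifetime

  def encode_id(sequential_id: int) -> int:
      """Internal sequential id -> opaque external id (one AES block, deterministic)."""
      enc = Cipher(algorithms.AES(SERVER_KEY), modes.ECB()).encryptor()
      return int.from_bytes(enc.update(sequential_id.to_bytes(16, "big")) + enc.finalize(), "big")

  def decode_id(external_id: int) -> int:
      """Opaque external id -> internal sequential id, for WHERE user_id = ... lookups."""
      dec = Cipher(algorithms.AES(SERVER_KEY), modes.ECB()).decryptor()
      return int.from_bytes(dec.update(external_id.to_bytes(16, "big")) + dec.finalize(), "big")

  external = encode_id(103)            # what the client sees: a 128-bit "random-looking" id
  assert decode_id(external) == 103    # what the table actually stores: 103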


Snowflake IDs may compress well.

https://en.m.wikipedia.org/wiki/Snowflake_ID


Computer index (top 32 bits) + entry index (bottom 32 bits)

Adding more things, such as a timestamp, just adds entropy without being useful. The indexes should start at 0 and count up. The computer index is something that should be handed out once, every time a new computer is brought in.
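
Read literally, that's just two counters packed into one 64-bit word (a sketch, not anyone's production scheme): consecutive ids from the same machine differ only in the low bits, which delta-encodes and compresses extremely well.

  def make_id(computer_index: int, entry_index: int) -> int:
      assert 0 <= computer_index < 2 ** 32 and 0 <= entry_index < 2 ** 32
      return (computer_index << 32) | entry_index

  def split_id(identifier: int) -> tuple[int, int]:
      return identifier >> 32, identifier & 0xFFFFFFFF

  assert split_id(make_id(7, 103)) == (7, 103)   # consecutive entries differ only in the low bits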


> " First, most of the existing UUID versions such as UUIDv4 have poor database index locality. Meaning new values created in succession are not close to each other in the index and thus require inserts to be performed at random locations. The negative performance effects of which on common structures used for this (B-tree and its variants) can be dramatic. As such newly inserted values SHOULD be time-ordered to address this."

-- https://datatracker.ietf.org/doc/html/draft-peabody-dispatch...


Then why not map timestamps into a chronological index either on-the-fly or some time later?


The first comment on the article referenced this article on slightly un-unique identifiers => https://news.ycombinator.com/item?id=28088213


Twitter uses timestamp + machine ID + sequence number.

https://en.wikipedia.org/wiki/Snowflake_ID
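
For reference, a sketch of the layout the Wikipedia page describes (the epoch constant below is the commonly cited Twitter epoch; treat it as illustrative): 41 bits of millisecond timestamp, 10 bits of machine id, 12 bits of per-machine sequence, which is why ids generated close together share long prefixes and compress well.

  import time

  TWITTER_EPOCH_MS = 1288834974657   # 2010-11-04; the custom epoch usually cited for Twitter's scheme

  def snowflake(machine_id: int, sequence: int) -> int:
      # 41 bits of milliseconds since the epoch | 10 bits machine id | 12 bits sequence
      ts = int(time.time() * 1000) - TWITTER_EPOCH_MS
      return (ts << 22) | ((machine_id & 0x3FF) << 12) | (sequence & 0xFFF)

  example = snowflake(machine_id=5, sequence=0)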


Assign a pool of IDs to servers in blocks, e.g. 0..9999, 10000..19999. Then, when an ID block is near exhaustion, ask for a new one.
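
A bare-bones sketch of that block handout (the coordinator callable is a stand-in for whatever central allocator actually hands out ranges, and this simple version waits for full exhaustion rather than re-fetching when "near" it); it's essentially the HiLo idea from the top comment:

  import itertools

  class BlockAllocator:
      """Hands out ids locally from a pre-fetched block; asks the coordinator for the next block when empty."""
      def __init__(self, fetch_block_start, block_size=10_000):
          self._fetch = fetch_block_start    # callable that returns the start of a fresh block
          self._size = block_size
          self._ids = iter(())               # empty until the first block is fetched

      def next_id(self) -> int:
          try:
              return next(self._ids)
          except StopIteration:
              start = self._fetch()
              self._ids = iter(range(start, start + self._size))
              return next(self._ids)

  # Toy coordinator handing out 0, 10000, 20000, ... to whoever asks.
  counter = itertools.count(0, 10_000)
  alloc = BlockAllocator(lambda: next(counter))
  assert [alloc.next_id() for _ in range(3)] == [0, 1, 2]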


That's pretty close to how Vitess handles sequences (distributed auto-increment)



