
How many ways are there to sort GUIDs? - soheilpro
https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=102450
======
amluto
I’m not sure whose fault it is that GUID string representations depend on
endianness, but I blame Microsoft. I’ve never seen a little endian string GUID
anywhere that wasn’t either Windows or a standard that had heavy MS
involvement.

Hey, MS, the string “0001” is the number one, or the bytes 0,1 _in that
order_. It is _not_ the number two hundred and fifty six!

~~~
Sahhaese
Or even 128 or 8

------
pdw
I never understood why GUIDs were defined as

    
    
      typedef struct _GUID {
        DWORD Data1;
        WORD  Data2;
        WORD  Data3;
        BYTE  Data4[8];
      } GUID;
    

rather than simply

    
    
      typedef struct _GUID {
        BYTE Data[16];
      } GUID;
    

I get that traditional GUIDs have an internal structure, but it's entirely
different than the struct (60-bit timestamp, 4-bit version, 16-bit sequence
number, 48-bit MAC address).

~~~
aljarry
It's divided exactly as it is in string representation: 4 bytes time_low, 2
bytes time_mid, 2 bytes time_hi_and_version and then: other stuff - 8 bytes.

~~~
pdw
But the string representation has five fields, not four.

There might be an interesting story in tracing the history of GUIDs through
the ages...

I found some 1987 code dealing with UUIDs here:
[https://stuff.mit.edu/afs/athena/astaff/project/opssrc/quota...](https://stuff.mit.edu/afs/athena/astaff/project/opssrc/quotasrc/src/ncs/nck)

They seem to have originated as a 64-bit "uid"s:

    
    
        struct uid_t {
            u_long clockh;
            u_int clockl: 4;
            u_char extra: 8;
            u_long node: 20;
        };
    

Then the 1987 version seems to have been something like this (with the
reserved field ignored in the string representation!)

    
    
        struct uuid_$t {
            u32 time_high;
            u16 time_low;
            u16 reserved;
            u8 family;
            u8 host[7];
        }
    

And I guess that's pretty close to the current struct...

~~~
pdw
And to complete the story, I'm guessing that that old struct likely got
serialized in the DCE protocol as dword/word/word/bytes, so the struct layout
remained unchanged to maintain compatibility, even as the definition of an
UUID evolved...

------
akersten
I'm struggling to understand the use case. The article offers "keys in an
ordered map" or "binary search" \- but the idea of a binary search is that the
order of the elements is inherently _meaningful_ , like the floors of a
building, weights, rankings, etc. GUIDs are _meaningless_ by design, so there
must be some underlying property that you're actually interested in searching.
And in that case, why not use that property as the key to the collection?

Is there another use case here that I'm missing?

~~~
willvarfar
Often UUIDs are used as keys to things...

And these things are often stored in databases...

And usually the database puts them into a btree internally, because that's how
tables are stored.

The moment you have any kind of load on such a table, your performance goes to
hell. This is because the inserts are going to happen all over the place, and
the way tables are stored definitely prefers appends.

So a common word of wisdom is to have an auto-increment primary key, and the
uuid indexed! Ugh what a work around.

A better way is to discover ULIDs, which are like UUIDs but with the high-bits
including a timestamp. This turns inserts into, approximately, appends. Much
nicer!

[https://github.com/ulid/spec](https://github.com/ulid/spec) is fairly recent,
but the 'trick' has been used for years. I've build gazillion-row databases
where all the data sources have, independently, used this same high-bits-are-
timestamp trick. So even though I didn't see it discussed and called ULID
until fairly recently, its a very old and well-known trick in some circles.

Once you have ULIDs and you know you have UILDs, a lot of database type tasks
become easier because the ID encodes a creation date. I've found myself adding
AND id BETWEEN to queries, and using them as a partition key, and using them
to DELETE old data etc, and various other admin stuff that the original
creators of the table never thought about.

~~~
tempguy9999
> Often UUIDs are used as keys to things...

IMO most data in DBs has a natural key. Why add a fake key? If it's to enforce
uniqueness then that uniquness is spurious because it's a GUID which has no
meaning to the data to which it's attached eg. you could have the same row
10000000 times but falsely uniquified with a different GUID.

Only exceptions I've encountered to my 'natural key' claim are addresses,
which are a sod to canonicalise (in fact, seemingly impossible, I'd welcome
any pointers).

Also UUIDs are bloody big keys, so fattening up tables and impairing
disk/memory bandwidth. This matters if you have gazillion-row DBs, surely? I
dealt with a near-billion row DB and shaving a few bytes off a row would
matter to us bigly.

For the sake of argument I'll assert that you should never use UUIDs as keys,
but please feel free point out where I'm wrong.

> This is because the inserts are going to happen all over the place, and the
> way tables are stored definitely prefers appends.

This is opposite to my understanding in that an append to a table causes
insertion hotspot which can be bottlenecks for insertion-heavy work. If it's
inserting at random places there's no hotspot. I suppose updates scattered all
over the btree could cause much more random disk activity though, do you mean
that?

> So a common word of wisdom is to have an auto-increment primary key, and the
> uuid indexed! Ugh what a work around.

Oh god no! Quite agree.

[stuff about ULIDs] I'd really like to understand what you're doing overall.
It just sounds just wrong, though I'm sure that's down to me not understanding
your use case.

~~~
willvarfar
Its easy to create plausible scenarios where things have UUID IDs.

For example, your classic customer database. Each customer has an ID.

Its hard to find a natural key for something as simple as a customer...

If you have a very narrow customer base then perhaps the company has a VAT
number or something that seems natural. But get an international customer, or
simply have problems getting people to enter those kind of details, gets
unworkable fast.

And then I suppose the 'trading name' is another natural identifier, except
they can change.

Hell, we've all experienced systems that have used the email address as a
natural key, and all the resulting pain as those change.

Just today, in fact, I had trouble with a jetbrains subscription when I used
my company's new domain name to register something and it created a new
account for me. Ugh. Can't work out how to link them, just because I changed
the email address on the pay page instead of the account page. Jetbrains, we
expect better of you!

So, anyway, its normal to go with an auto-increment.

But, then, the modern web site you've just built exposes that id to the
customer, and you've just leaked how many customers you have!

So modern best practice is to only expose UUIDs in public APIs.

And I'm saying, go with ULIDs instead. Your database will thank you.

~~~
tempguy9999
I'm going to clarify something; I said most data has a natural key then when
presented with this and other examples I seemed to have backed away from that.
I hope this makes sense...

In the real world of atoms etc. there's no natural abstraction, including
unique IDs (keys). These are all creations of ours.

In my line(s) of work, someone's invented and allocated a key for most items
by the time it's reached me - people have SSN/NI numbers, planes will have
some identifier, ships have IMOs, machine parts have unique part numbers (when
combined with their suppliers' ID), geographical areas have
postcodes/zipcodes, and all that.

So for me, someone's done that work already; a 'natural key' is available and
I can rely on it. It's always been there, except for a few things such as
addresses.

For you, willvarfar, customer numbers have to be created. So I made a wide
assertion without thinking it through based just on my experience.

------
tracker1
The portions of a UUID/GUID have meaning, there isn't just one version, though
the display as a string is relatively consistent. The sorting often will sort
on the timestamp portion of UUID v1, there are variants of a GUID (COMB or
Combined Guid or Sequencial UUID, etc) that are meant where a portion of the
values are sequencial like v1, and the rest is mostly random like v4.

SQL is optimized for the portion that is timestamp based when using v1
algorithms, because it results in better indexing performance. It's also why
there's a sequencial id option with SQL server.

Other systems/databases will sort the raw bytes differently, other
languages/platforms will use different arrangements. IIRC MongoDB's JS API
actually has like 3-4 different serialization options because of these
inconsistencies between languages/platforms.

Edit:

[https://codingsight.com/combguid-generation-of-sql-server-
fr...](https://codingsight.com/combguid-generation-of-sql-server-friendly-
guid-values-in-net-applications/)

NOTE: sequencial id's are not quite the same, they're part of the auto insert
so that they are ensured sequencial... this is the best indexing option for
SQL Server, though a COMB is probably easiest for applications. Other dbms
will structure them differently, so best do figure out your serialization
strategy based on your db.

~~~
Tornhoof
Here is a more recent post with benchmarks of different GUID algorithms in SQL
Server, the author has a different outcome. In his benchmarks about index
fragmentation, COMBGuid is not really much better than normal GUIDs.
[http://microsoftprogrammers.jebarson.com/benchmarking-
index-...](http://microsoftprogrammers.jebarson.com/benchmarking-index-
fragmentation-with-popular-sequential-guid-algorithms/)

~~~
tracker1
It does look like the sequencial ids (not the combs) does about 50% better
overall, and the comb from nhibernate does about 10% better than just random
uuids .. this is in terms of paging and indexes.. total time to run the test
isn't bad, but that's just the insert time. If you're generating from your
application layer and storing in SQL Server, looks like the best option is
UuidCreateSequential.

------
qwerty456127
Here is a great related article I can recommend:
[https://www.percona.com/blog/2014/12/19/store-uuid-
optimized...](https://www.percona.com/blog/2014/12/19/store-uuid-optimized-
way/)

------
edoceo
Check out ULID, lex-sort by time built in. It's super cool

~~~
4e1a
Or try KSUID [0]. K-Sortable Unique IDs. I'm not affiliated, just easy to code
from scratch and sortable.

[0] - [https://github.com/segmentio/ksuid](https://github.com/segmentio/ksuid)

------
afpx
So, a GUID is not actually a UUID? The article refers to endians, but I don't
see endian referenced in the UUID spec (rfc4122).

