Nanosecond timestamp collisions are common (evanjones.ca)
316 points by ingve 10 months ago | 286 comments



This is why you should use ids that combine both a time component and a sequence.

E.g. UUIDv7 has a millisecond time component, then a field that increments for each event in the same millisecond, and then enough random bits to make collisions between ids generated on different machines astronomically unlikely.

Of course there are only so many bits, so you might generate too many events in the same time slice and overflow the sequence, you might actually get collisions between machines, and you are limiting your event generation speed by forcing your CPU to sync on the increment, etc.

But in practice UUIDv7 works great at scale.
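
For concreteness, here is a minimal Go sketch of that layout (48-bit big-endian millisecond timestamp, version/variant bits, the rest random; the per-millisecond counter field is omitted, and a real library should be preferred):

    package main

    import (
        "crypto/rand"
        "encoding/binary"
        "fmt"
        "time"
    )

    // newV7 builds a UUIDv7-style value: a 48-bit Unix-millisecond
    // timestamp in the high bytes so IDs sort by creation time,
    // followed by random bits and the version/variant markers.
    func newV7() [16]byte {
        var id [16]byte
        ms := uint64(time.Now().UnixMilli())
        binary.BigEndian.PutUint64(id[:8], ms<<16) // bytes 0-5 = timestamp
        if _, err := rand.Read(id[6:]); err != nil { // bytes 6-15 = random
            panic(err)
        }
        id[6] = (id[6] & 0x0f) | 0x70 // version 7
        id[8] = (id[8] & 0x3f) | 0x80 // RFC variant
        return id
    }

    func main() {
        fmt.Printf("%x\n", newV7())
    }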


I’m feeling a bit like an accidental time traveler, because I can recall a conversation at a tech meetup that had to have been at least ten years ago where someone was struggling with unique UUIDs because they were bursting above 1000 UUIDs per millisecond and not happy with the available options.

How old is UUID7? I can’t get the internet to tell me.


UUID7 was first drafted a bit over a year ago: https://datatracker.ietf.org/doc/html/draft-peabody-dispatch...


That's kinda what I figured. So you can see my confusion.


Perhaps one of many on this list[1]? ULID is listed as having 1.21e+24 unique IDs per millisecond. I have no idea though about what may match up with your timeline(s).

[1] https://github.com/swyxio/brain/blob/master/R%20-%20Dev%20No...


UUIDv7 was first proposed just a few years ago, but the RFC contains a good list of well-known ULID-style systems that predate it.

Putting the timestamp in the high bits of random ids is a “trick” I learned from DBAs (those used to be a thing!) in the 90s. Often the drive was to improve DB insert performance, as it was for the other uses of an ID you can order and reason about.


Were they not happy due to possible collisions or something else?


Yeah they had a problem where the amortized ID rate was less than the number of valid UUIDs you could generate in a given time interval, but with a peak rate well above that, so UUIDs weren't quite going to function for naming/numbering them. You gotta be pushing a lot of messages through a system to hit that, but it's possible. And adding more layers of coordination on top of something meant to reduce coordination overhead tends to... make things messy.

I've half-heartedly tried to look up his problem every time I've had to introduce UUIDs to something (most recently fixing agents not generating correlation IDs), and I have not figured out which version of UUID he was talking about. I now suspect it was some proprietary or industry-specific quote-unquote UUID spec rather than an industry-wide one. I may look through the UUID7 spec at some point to see if they mention prior art. Much more plausible than him being a time traveler.


Why do you need the time component anyway? It's just eating up bits in your UUID without contributing much entropy.


> Why do you need the time component anyway?

To sort or filter the records by time. Sure, you can just add an extra column if you need this, but there are cases where this is not convenient. E.g. when you want to be able to export the records into files, naming the files with the IDs, and still be able to sort them.


More importantly, if you have an index on purely random IDs, then each insert will go to some random position in the tree, whereas IDs that increase with time will all land at the end of the tree, which reduces index fragmentation.


depending on scale and architecture, either behavior can be better. it’s easier to shard when writes occur randomly over the overall space. it’s easier to coalesce when writes all happen in a given place (head or tail)


Being random within each shard is still bad for write performance. Going fully random seems like a bad way to accomplish this goal.

Why not keep the timestamp bits to use when appropriate, but use some of the random bits for shard selection?


Only when writing all at once and when you know what the shard boundaries are and the number of shards (and boundaries) is stable. If they're changing, growing, etc., you can't tell where they're at predictably, and random is the least likely to cause problems and allows sub-sharding dynamically.

Very large real world datasets are unlikely to be static long enough, and equipment stable enough, to not consider this effect.


> If they're changing, growing, etc., you can't tell where they're at predictably, and random is the least likely to cause problems and allows sub-sharding dynamically.

I'm confused by your reply, because I never suggested not to use random bits for sharding.

I'm just saying that 60+ random bits should be enough to shard, change, grow, and sub-shard with. You don't need 122.


I never said anything about bit or key length at all? Let alone how much was random or not? Perhaps you’re confused?


Let's start over.

People were talking about the value of time+random UUIDs versus all-random UUIDs, and how those behave.

You said that sometimes the random behavior is preferable.

In response to that, I was saying that even if you want to sort randomly at some particular step, you should use the time+random format, because other steps might not want to sort randomly. You should directly choose to use the random part, instead of indirectly forcing it by making the entire UUID random.

Then you said "Only when writing all at once and when you know what the shard boundaries are and the number of shards (and boundaries) are stable."

I can't figure out how that relates to my post. I thought you were worried about insufficient random bits to use for sharding, but apparently that wasn't your concern. So I have no idea what your concern is. If you have a use case for randomness, use the random half of the UUID.


UUIDv7 has a specific format that doesn’t support that.

For the case I’m describing, you can’t use it.

For situations you want write coalescing it’s fine though.

Not sure if we're agreeing here?


What do you mean it doesn't support that?

There's some flexibility in how you fill in a UUIDv7, but let's go ahead and say that the ones we're worried about have the first 32 bits filled with timestamp and the last 32 bits filled with random.

If you want pure sort-by-time, then use it the normal way. If you want pure sort-by-random, then it's slightly awkward but you can prioritize the random part.

But the additional power is that you can shard by the last 32 bits, then sort by the first 32 bits within a shard. And you don't need weird workarounds like hashing the UUID.

You said "it’s easier to shard when writes occur randomly over the overall space. it’s easier to coalesce when writes all happen in a given place (head or tail)". But you can have both at the same time. You can have easy sharding and easy coalescing.
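
Sketched in Go (shardFor and numShards are hypothetical names, and the "random bytes at the tail" layout is the assumption from above):

    package shard

    import "encoding/binary"

    // shardFor routes an ID by its trailing random bytes, so writes
    // spread evenly across shards, while the leading timestamp bytes
    // keep keys within each shard in roughly insertion order.
    func shardFor(id [16]byte, numShards uint32) uint32 {
        return binary.BigEndian.Uint32(id[12:16]) % numShards
    }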


Except you literally can't do random distribution AND be compliant with UUIDv7 if you use any sort of normal lexical sorting/indexing, as they use the start of the key as the most significant bits. UUIDv7 is literally designed to have stable lexical sorting orders, have the time as the most significant bits, and have the most significant bits of the time as the most significant bits of the key! It's their primary design criterion!

You can't 'prioritize' random parts of a key for sorting without writing a bunch of custom sorting (and key parsing) logic, which is generally undesirable for a number of reasons - and frankly completely unnecessary in these cases. You just wouldn't use UUIDv7 (or probably a UUID in general), and the benefits would pay for themselves very quickly anyway.

To quote the UUIDv7 RFC:

"This document presents new time-based UUID formats which are suited for use as a database key." (as the first line of the abstract)

"Due to the shortcomings of UUIDv1 and UUIDv4 details so far, many widely distributed database applications and large application vendors have sought to solve the problem of creating a better time-based, sortable unique identifier for use as a database key."

"- Timestamps MUST be k-sortable. That is, values within or close to the same timestamp are ordered properly by sorting algorithms.

- Timestamps SHOULD be big-endian with the most-significant bits of the time embedded as-is without reordering.

- Timestamps SHOULD utilize millisecond precision and Unix Epoch as timestamp source. Although, there is some variation to this among implementations depending on the application requirements.

- The ID format SHOULD be Lexicographically sortable while in the textual representation.

- IDs MUST ensure proper embedded sequencing to facilitate sorting when multiple UUIDs are created during a given timestamp.

- IDs MUST NOT require unique network identifiers as part of achieving uniqueness.

- Distributed nodes MUST be able to create collision resistant Unique IDs without a consulting a centralized resource."

[https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...]

I'm pointing out that for some systems, that makes UUIDv7 unsuitable because you WANT the keys to be randomly distributed to avoid hotspots. Using UUIDv7 in these situations will result in a single node receiving all writes (and all reads for a given time range), which in the dataset sizes I'm referring to is usually impossible to handle. No single node can handle that kind of load, regardless of how efficient it may be.

For other types of systems (such as single machine databases or 'tight' clusters of databases without extreme write loads), UUIDv7 and similar is great, as it allows easy/cheap write combining when that is actually possible for a machine to handle the load.


> Except you literally can't do random distribution AND be compliant with UUIDv7 if you use any sort of normal lexical sorting/indexing, as they use the start of the key as the most significant bits. UUIDv7 is literally designed to have stable lexical sorting orders, have the time as the most significant bits, and have the most significant bits of the time as the most significant bits of the key! It's their primary design criteria!

> You can't 'prioritize' random parts of a key for sorting without writing a bunch of custom sorting (and key parsing) logic, which is generally undesirable for a number of reasons - and frankly completely unnecessary in these cases. You just wouldn't use UUIDv7 (or probably a UUID in general), and the benefits would pay for themselves very quickly anyway.

Forget prioritizing, that was about going fully random. Seriously, let's pretend I never said that specific sentence.

Let's focus on just the sharding scenario. None of what you said there conflicts with what I said about sharding.

Unless these database engines are so incompetent that you can't shard on something as simple as id[12:16]?

> I'm pointing out that for some systems, that makes UUIDv7 unsuitable because you WANT the keys to be randomly distributed to avoid hotspots. Using UUIDv7 in these situations will result in a single node receiving all writes (and all reads for a given time range), which in the dataset sizes I'm referring to is usually impossible to handle. No single node can handle that kind of load, regardless of how efficient it may be.

You only want the keys to be randomly distributed at the sharding layer. Once it reaches its home node, you don't want random distribution within that node. At best you begrudgingly accept it.

It's within a node that things like "normal lexical sorting" matter the most, so UUIDv7 does a great job of making that smooth.

You don't need lexical sorting between shards, especially when you're randomizing the shard.


The point I'm making is all these shenanigans are completely unnecessary, don't really help, and make everything extremely hard to manage, reason about, and get performance from - all to try to force usage of a specific key format (UUID) in a situation which it is not designed for, and for which it is not suited.

It's square peg, round hole.

And folks working on exabyte-sized indexed datasets generally already get this. So I'm not sure why I'm even having this discussion? I'm not even getting paid for this!

Ciao!


"it allows easy/cheap write combining" is not "completely unnecessary". What the heck, at least be consistent.

And it's not shenanigans! You could shard based on the first bytes of a key, or you could shard based on the last bytes of the key. Neither one should be harder. Neither one is shenanigans.

> It's square peg, round hole.

Going entirely random is an even worse peg.


Wow a long thread of back and forth and confusion :)

Fwiw I’m with Dylan on this one!

I have direct experience of absolutely humongous data processing using random bits for shard selection, where each shard uses sorted storage and benefits from the sortability of the time bits, so with just the smallest buffering, all inserts are basically super-fast appends.

This is super normal in my experience. And I can’t wait for the new UUID formats to land and get widely supported in libs to simplify discussions with event producers :)


Just explicitly use (time, random uuid) as a key in your sorting, instead of sullying your uuid with time information?


ULID schemes aren't just about big endian sorting advantages, they often better enable time-based sorting mechanisms.


If that is the case then why shouldn't the storage system hash the IDs itself, to spread them as it requires?


Because sometimes you want some data to be collocated, while the rest sharded.

For instance, you might use a random object ID as a prefix value in the index, followed by an attribute ID which isn't random. Or a modified time, so you can have a history of values which can be read out linearly.

If using it directly, that means objects and their data are sharded randomly across the cluster, but when looking for an object's attributes (or an attribute by time), its index entries are always co-located and you can read them out linearly with good performance.

If blindly hashing keys to distribute them, you can't do that. Also, you can't really do a linear read at all, since no data will be 'associatable' with other data, as the index value is randomized and what is stored in the index has no relation to the key provided by the user.

You can only do a straight get, not a read. That is very limiting, and expensive with large data sets, as most algorithms benefit greatly from having ordered data. (Well, you could do a read, but you'd get back entries in completely random order.)

Needless to say, this is ‘advanced’ usage and requires pretty deep understanding of your data and indexing/write/read patterns, which is why random hashing is the most common hash map behavior.


Sounds like it should be an attribute of the index and not require a change in the data. To me, anyway.

    CREATE INDEX ... USING HASH;


I’ve never seen that kind of optimization on a dataset that would fit on a database server of any kind. Tens of PB or EB usually, but sometimes only several hundred TB if it’s high load/in-memory only.


Just swizzle the ID.


Page splits… page splits everywhere.


Or, you could use a graph database and stop having frustrating relational impedance mismatch, nonlocality etc. You can have O(1) lookups instead of O(log N) for almost everything


That will depend on which graph database you use, as a graph database might just store the graph in an underlying relational database. It will also depend on what kind of data you have and what kind of queries you want to perform. A graph database might be faster at navigating along links in the graph, but I would guess you pay a big performance penalty if you have to operate orthogonally to your links, like aggregating across all instances of some entity.


That sounds too good to be true. Is that really true of all graph DBs?

Also, if that's really true, why can't everyone just use graph DBs?


Because of vendor lock-in with the LAMP stack over the years. Every host used MySQL, how many had Neo4j as available?


Graph databases don't solve that. All databases - document, graph, relational - ALL implement indexes to find specific things in exactly the same way: very well-known tree, hash, and other techniques.

The representation (outside of indexing) has properties that make your USE CASE better or worse. EGreg would not be someone to hire to architect a solution. He'll just put your 1-trillion-rows-per-month use case in a graph DB like Neo4j and you'll just watch it fall over when you run billing.


How big is the 1 though?


When you’re talking about data sets so large they dictate what hardware you use, and introduce terms like “cluster”, then 1 = √n

Which is why we need a version 2 of complexity theory, one that doesn't treat memory access or arithmetic on arbitrary-precision numbers as O(1) operations (i.e., as n actually goes to infinity). They aren't. Which every large-system engineer knows but few will talk about.


Why sqrt(n) and not log(n)?

And that complexity theory already exists. Typical whiteboard engineering uses transdichotomous models to gloss over some polylogarithmic factors (as do much of the literature), but more accurate models exist.

The difference isn't usually super relevant when comparing multiple solutions all using the same model of computation though since the extra terms don't tend to bump one complexity class above another when switching models, and if you cared about actual runtime characteristics you wouldn't be relying on log factors in such a crude tool anyway.


Speed of light.

Imagine a data center containing exabytes of data. How long does it take to access an arbitrary bit of that data?

We use clusters because computers cannot contain an infinite amount of memory, storage, or CPUs, because of physics. You see this same thing play out at smaller scales but it's more obvious at the macro scale. More addresses take log n time to sort out, but time to access is measured in radii, not gate depth.

In a world where clusters are rare, Knuth made decent approximations. In a world where clusters are not only de rigueur but hosted on multitenant hardware spread out over upwards of 100 miles, those approximations are bullshit and need to change.


Oooh interesting, thank you. I totally misunderstood this as the "integers never need more than 64 bits, so hash tables are constant" argument.


Integer arithmetic is really quantized logarithmic complexity. If your hardware has a bucket your number fits into, you're calculating n+1 or n×n in constant time. But if your data set doubles in size (especially for multiplication) you may find yourself in a bucket that doesn't exist, or a more expensive one. Contemporary code is more likely to reach for bignum, which is log n, but again stair-stepped by the number of machine integers it takes to represent the entire number. A bigger problem when your data set is sparse, so that values grow faster than population (e.g., UUID).

But on the topic of hash tables, you can only reach 'O(1)' if you can avoid collisions. And to avoid collisions you need a key of length m, which grows as n grows. You cannot put 51 pigeons in 50 pigeonholes. So the key length of your hash keys is m >= log n, which means the time to compare keys is also log n, which means hash tables are never actually O(1) access or insert time. Actual access time for a hash table on non-imaginary hardware is √n · log n, not O(1).

If you consider that we have applications that are just a single hash table occupying the entire RAM of a single machine, then this is not picking nits. It's capacity planning.


> You cannot put 51 pigeons in 50 pigeonholes. So the key length of your hash keys is m >= log n, which means the time to compare keys is also log n, which means hash tables are never actually O(1) access or insert time.

I am not sure I am following this argument. You are not going to have more than 2^64 pigeons and pigeonholes on any system soon, and I almost dare to say you will never ever get to 2^128. And for 64- or 128-bit keys, comparisons and many other operations are for all practical purposes constant time. I guess you could argue that this is sweeping a factor of log(n) under the rug because of things like carry chains, which could be faster for smaller bit sizes, but I am not sure that this really matters on common hardware; an addition will take one clock cycle independent of the operand values.


Are your hash keys 8 bytes long? Mine aren't. Mine aren't even close.


Sure, they can be and often are longer, but not because you were forced to use long keys; it just happens that the thing you want to store in a hash table is a long string. The way you worded it made it sound to me like you were saying that one has to use long keys, not that in practice one often has to deal with long keys. But even then I am not convinced that this should add a factor to the complexity analysis; I think I would argue, at least in some situations, that calculating hashes of long keys should still be considered constant time, at least for the longest keys. I can imagine taking it into account if the key length is not only big but also highly variable, though.


Look, if you have N items related to X, then at insert time you store them in an array and have X point to that array, instead of using foreign keys.

For example, when a user has 7 articles. Do you want to just point to where the articles are stored? Or do you want to do O(log n) lookup for each article?

And if you have many-to-many, do you want to join an Intermediate Table for even more processing, or just follow a pointer to a range of an intermediate node directly and traverse?
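
Roughly this shape, as a Go sketch (names are illustrative, not from any particular database):

    package sketch

    type Article struct{ Title string }

    // User holds direct pointers to its articles, so reading them is
    // one pointer hop per article rather than an index lookup each.
    type User struct {
        Name     string
        Articles []*Article // direct pointers instead of foreign keys
    }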


How is that different from a clustered index?


A clustered index requires O(log N) lookups, since it's still an index.

I'm talking about pointing directly to the location of an array where things are stored. That array isn't a clustered index. Each array is different.


What about when you delete rows? Do you just leave the space unused now? Or if you update a row to be larger? Rewrite the whole array (so possibly O(n) updates)?

How do you deal with data that gets accessed in different orders based on relationships from multiple other entities? Duplicate the data? If so, updates now get amplified and you can fit less data in RAM so you're more likely to require disk IO. If not, you need a layer of indirection (so you have an array of pointers instead of an array of data).

Even with a layer of indirection, updates that grow a row and require a reallocation will cause you to have to go update all pointers (also, those pointers need to be indexed to find who you need to update). To avoid write amplification, you can use an array of ids instead of an array of pointers. Now you want an id <-> pointer lookup, which could be done with a hash map in O(1), or with a tree in O(logN) if you want to also allow efficient range queries.

I'm not exactly sure on the implementation details, but for ballpark numbers, for an 8 kB page size with 80% fill factor and 16 B entries (8 B key + 8 B pointer), with 10E9 rows, log(N) ought to be ~4 (page IOs). Not ideal, but the point is for a lot of real-world engineering, O(logN) is effectively O(1) (with a small enough constant factor that it's fine for general use).

So if your data is not write-once, trees are a well-rounded default data structure.


One benefit of keeping it separate is that you can choose the precision of the timestamp necessary. "Millisecond precision" is arbitrary and commonly insufficient.


it's commonly totally sufficient for IDs to represent a rough sort order

millisecond precision is great for a lot of use cases


I didn't say that it wasn't. Hell even ms is too precise for many use cases (usually where date is used instead).

What I said was that it's useful to be able to select timestamp precision independently of UUID implementation. One size that fits all fits none best.


Lucky for you, they also define UUIDv8 as a free-for-all where you can do whatever you want, and nanosecond precision is one of the examples given in the RFC.


> E.g. when you want to be able to export the records into files, naming the files with the IDs, and still be able to sort them.

You could name them by (time, id)?


Without the time component, failure recovery requires robust storage. With it, upon startup you just wait until your timestamp ticks over, and then you know you're not re-issuing any UUIDs.

With a pure counter-based system, you need robust distributed storage and you need the machine to reserve batches of IDs via committing writes to the robust distributed storage. Otherwise, a disk failure may cause you re-issue UUIDs.

Though, seeding a thread-local AES-256-CTR or ChaCha20 instance via getrandom()/getentropy() is probably a better fit for most situations. If you need sequential IDs for database performance, seed a simple 128-bit counter using getrandom()/getentropy(). If you're going to need more than 2^40 unique IDs, then use IDs larger than 128 bits.

Assuming getrandom()/getentropy() is getting you a high-entropy seed, it's easy to size the IDs to bring the probability of collision much lower than the probability of background radiation flipping a bit and breaking your equality test between IDs. If you're not running redundant calculations and/or on radiation-hardened hardware, it's difficult to argue against randomized IDs.
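
A minimal sketch of the counter option in Go, assuming one generator per thread/goroutine so no locking is needed (crypto/rand reads from getrandom()/getentropy() under the hood on common platforms):

    package ids

    import (
        "crypto/rand"
        "encoding/binary"
    )

    // gen128 is a 128-bit counter seeded once from the OS entropy source.
    type gen128 struct{ hi, lo uint64 }

    func newGen128() (*gen128, error) {
        var seed [16]byte
        if _, err := rand.Read(seed[:]); err != nil {
            return nil, err
        }
        return &gen128{
            hi: binary.BigEndian.Uint64(seed[:8]),
            lo: binary.BigEndian.Uint64(seed[8:]),
        }, nil
    }

    // next hands out sequential IDs, which keeps database inserts append-like.
    func (g *gen128) next() (hi, lo uint64) {
        g.lo++
        if g.lo == 0 { // carry into the high word
            g.hi++
        }
        return g.hi, g.lo
    }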


> With a pure counter-based system, you need robust distributed storage and you need the machine to reserve batches of IDs via committing writes to the robust distributed storage. Otherwise, a disk failure may cause you re-issue UUIDs.

Yes, a pure counter based system is probably worse than one that uses time. Don't use counters, especially not in a distributed setting.


You need it to make database indices perform better.

If you don't need that, but just need a random UUID, UUIDv4 is better.


I feel like v7 is almost a strictly better v4. Assuming you can generate a v7 (you have a time source), what are the disadvantages?


Anyone who can see the uuidv7 can determine at what time it was generated. You might not want that for things exposed to users.


You can hand out uuidv4 to clients without revealing anything.


Entropy. You lose bits to a known source, which reduces the entropy of the UUID.


I do feel 74 random bits per millisecond is enough though. That would require about 2^(74/2) = 2^37 ≈ 137 billion UUIDs generated in the same millisecond for a ~40% chance of a single collision (about 162 billion for 50%).
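
The back-of-envelope birthday bound behind that, as a quick Go check (using the standard approximation p ≈ 1 − exp(−n²/2^(k+1)) for n random k-bit values):

    package main

    import (
        "fmt"
        "math"
    )

    // collisionProb approximates the chance of at least one collision
    // among n uniformly random k-bit values.
    func collisionProb(n float64, k int) float64 {
        return 1 - math.Exp(-n*n/math.Exp2(float64(k+1)))
    }

    func main() {
        n := math.Exp2(37) // ~137 billion IDs in one millisecond
        fmt.Printf("p = %.2f\n", collisionProb(n, 74)) // prints p = 0.39
    }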


128 bits is a lot of entropy to go around, though.


But shouldn't that be a separate field then?


The logistics of combining fields in indexes and identifiers is relatively complex, while the logistics of indexing a single field is comparatively trivial. This is also why you don't ship timestamps using separate fields for second/minute/hour/day/month/year, but a single ISO-string or UNIX timestamp as representation: it makes querying and interpreting the value more consistent.


Disagree on "relatively complex".

    create index on $table ($field1, $field2);
Seems pretty simple to me.


It is not just the DDL of the primary table that needs to care about it, but also all foreign keys, DML, queries, APIs accessing the data, etc. Storing that UUIDv7 is likely going to be cheaper than pushing the cost of keeping a composite identity onto the other components and systems that work with that data.


Well, that depends on how you manage it. An ORM will find it trivial. Custom SQL, not always so much.

That said, PostgreSQL offers https://www.postgresql.org/docs/current/rowtypes.html which gives you the best of both worlds. A single field which contains 2 fields that can later be extracted. So everywhere you can write a single field for joins, etc. But then when you need it broken out...


Create a new column with an MD5 hash of the other columns. Easy.

/s


This is how it works, except it's not MD5 but a less-costly hash function


MD5 is a useless hash function these days.

It's not cryptographically secure (anymore). And there are cheaper non-cryptographically secure hash functions.


But then you have 2 identifiers which complicates everything.


I am curious about this, and might be misunderstanding what you mean.

Can you lay out a demo architecture where you use multiple keys like you propose?


Any pure relational database design will eschew surrogate keys - most real-world systems will (should) add them back - because a surprising number of good natural keys end up changing (names change, phone numbers change, emails change, twitter handles become irrelevant/disappear, location of birth may change subject to geographic regions changing size...).

And on top of all that, there are efficiency concerns.

That said, at least for join tables, AFAIK there isn't often a need for row IDs beyond the involved foreign keys - unless you need to add metadata from other relations/tables...


> location of birth may change

My British passport says I was born in Birr; my Irish passport says I was born in Galway. These are both correct, because they are answering different questions. (I was born in the town of Ballinasloe in County Galway, but my mother's place of residence at the time of my birth was the town of Birr in County Offaly.)


So British place of birth means "mother's residence" and not place of birth?


Official guidances here: https://assets.publishing.service.gov.uk/government/uploads/...

Could it maybe just be a mistake made during the UK registration? It's an easy fix if wanted.


That's my memory of it. My British passport is expired, and is probably somewhere in my parents' house. I don't actually know what it says on it. I do clearly remember that at least one British form of ID I had at one time recorded my mother's place of residence rather than my actual birth place. I thought it was the passport. Maybe I was wrong.


> Could it maybe just be a mistake made during the UK registration? It's an easy fix if wanted.

Changes! lol


There are many DBMSes that combine columns into a singular primary key. Performance tradeoffs vary, especially when it comes to indexing.


I don't know why people use relational databases other than that they were first and “that's the way it's always been done”.

Why not use a graph database? O(1) lookups instead of O(log N). Why need indices if you can just point to the data? Why use JOINs when map-reduce querying is far more flexible?


They were not first.

ISAM and VSAM come to mind. Yes, it says "files" in there a lot, but it got used like a database, with programming interfaces for finding records in a database. If you will, this method is more like NoSQL databases than a relational DB. The I in ISAM was a step towards not having to know the key (true "NoSQL") - kind of like how today's NoSQL databases all also give the ability to add indexes now.

https://en.m.wikipedia.org/wiki/ISAM


You aren't describing a property of a graph database. You're describing a property of some set of key value based systems.

The reason why you want indices is because some query patterns don't have a key to perform a look up on.


I've been interested in learning more about them and how to best utilize them in my company. What graph database and query language would you recommend (regardless of stack)?


1. Memgraph 2. Neo4j

As for query language: definitely Cypher!


You might want to read Codd's paper: https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf


How do you do constant time lookups in graph databases?

My intuition tells me that you cannot get below O(n log n) (the lower limit for comparison-based ordering)


> My intuition tells me that you cannot get below O(n log n) (the lower limit for comparison-based ordering)

That intuition should point at O(log n), shouldn't it?

In any case, it totally depends how your data is stored and how/what you want to look up.

If you already have some kind of id of your node in the graph you want to look up, you can get O(1) lookup and still call it a graph database. If you have to traverse the graph, then it depends on the structure of the graph and where your entry point is, etc.

I'm rather skeptical of graph databases. Whatever they can do, you can do with a relational database and datalog. (Think of datalog as SQL plus recursion, perhaps.) See https://en.wikipedia.org/wiki/Datalog


Because it allows the encoding of information in the id. This makes it, at least in my experience, somewhat sortable.


When you sort them they will be ordered by generation time. This is useful in UIs, and greatly improves performance in databases (how much depends on DB and use case).


UUIDs can serve different purposes. As others have mentioned, database performance on inserts might trump the need for difficult to guess UUIDs.

In other cases, the UUID needs to be as random as possible.

It really depends on the use case.


Yea. This is an easy one. We use both.

For our “session” records, it’s a UUIDv7. This sorts beautifully, and if I wanted to, I could look at a log and easily see all the entries in a particular session.

For a larger db, we just need unique entries and at least in Dynamo, it is an advantage to equally distribute them as much as possible. UUIDv4 there.


IIRC you don't want to spend entropy that you don't need. The advantage of using the time is that you need to spend fewer bits on expensive randomness, and thus generation is cheaper.


I only work in embedded, so I’m not sure about fancy computers - but when we make UUIDv7s, I am certain the random is faster than pulling the time, formatting for milliseconds, then storing in position.


PRNGs are cheap to compute, and 'random' enough.


And they play nicely with sort ordering in popular databases like postgres!

They’re not in tree yet but there are a bunch of awesome pg extensions that provide access to uuidv7 (outside of just using it at the application level)


The problem with UUIDs is they are completely unreadable. Not just that you can't understand them; it is prohibitively hard to even distinguish between them visually. This is why identifiers which do not include any bit of information (or noise) beyond the necessary minimum are useful in some cases.


Thus UUIDv7.


Don't even need the "same millisecond" part to save a few cycles, depending on the use case. An overflowing increment with a counter of any sort plus a few random bits is usually enough.

If you're clever enough, neither needs a branch.
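
One branchless formulation, sketched in Go (the 12-bit width is arbitrary):

    package seq

    import "sync/atomic"

    var counter uint32

    // next12 hands out a wrapping 12-bit sequence number; the mask
    // makes overflow wrap around without a conditional branch.
    func next12() uint32 {
        return atomic.AddUint32(&counter, 1) & 0x0fff
    }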


UUIDv7 has at least two such monotonic strategies. You can use a counter, or you can use those bytes to increase the random entropy.


Isn’t it simpler to use sequence keys then?


Yes, but only on single machines, UUID and co are intended for distributed systems.

Although now I wonder if / how UUID v7 can do sequential keys on distributed systems. Mind you, on those systems "close enough" will probably be good enough, and sorting will be done by date instead of incremental ID.


You could also keep a global incremental index for that, assuming there's some authoritative server that has to be polled sequentially to get them.

Probably too much overhead when conflicting UUIDs are a few orders of magnitude less likely than the clients crashing from some random bug though.


So just prefix the node ID which is assigned when the node joins a swarm


How is that easier or better? Now you're letting your infrastructure details bleed into your database.


They should bleed into your database. The database is part of your infrastructure and it contains details like where to find the replicas and so on


Why? Tight coupling doesn't make things easier.


You’re complaining that the infrastructure details are bleeding into the intrastructure implementation?


Infrastructure details are bleeding into non-infrastructure implementation. Your database content is not an infrastructure implementation.


the content is not, but it can use a database function, such as CURRENT_TIMESTAMP or RANDOM() can it not?


I'm not sure what you're getting at. How are those connected to the topic? Those functions aren't infrastructure implementations either.


Ah but you see, UUIDs are web scale.

And by web scale, I mean they're too big to exchange by any offline channel.


It's 128 bits vs 64 bits - not really that much of a difference.


FWIW… we’re using UUIDv7s over BLE.


Should you, though? BLE itself uses UUIDs, but the ones it uses are in a shortened format just because the BLE frames are small and the protocol is not too fast.

Granted I've even used JSON over GATT so maybe I should keep my mouth shut :).


It’s for an auth use. And at least we’re using CBOR. But I get your point.


> This is why you should use ids that combine both a time component and a sequence.

Computers should run like clockwork. So in this example of using all the cores: in Windows and likely some other OSes, threads are assigned to cores when they are started and you can have many threads per core, ergo the time component should also have the thread and the core number combined with it with multi core systems.

It's possible to write the code in such a way that some variables are bounced out of the CPU cache back to memory to avoid any caching issues, because I think the caches on some CPUs are per core, and some are per CPU.


> threads are assigned to cores when they are started

Do they? I thought the normal behaviour was for cores to pick any available thread to run, so core migration is quite normal.

> ergo the time component should also have the thread and the core number combined with it with multi core systems.

Sorry, how exactly does that follow from the previous? You seem to have omitted the other half of your syllogism. After all, clockworks have neither thread nor core numbers, so I don't quite see how having those in UUIDs will make computers run like clockwork.


>> threads are assigned to cores when they are started

>Do they? I thought the normal behaviour was for cores to pick any available thread to run, so core migration is quite normal.

The CPU and its cores know very little about threads; threads are a figment of the OS.

>Sorry, how exactly does it follow from the previous? You seem to have omitted the other half of your syllogism

Syllogism would perhaps be an intuitive word to use, why is this? What happened to you?


Related

I used to be the program manager who owned the security event log in Windows.

When things happen simultaneously or very closely in time on multi-core systems, thread scheduling can significantly affect observations of those things. For example, your thread quantum might expire before you get to the syscall to get a time stamp, or before you can pass a buffer to queue your event for subsequent time stamping.

In fact, on multiprocessing systems it was very common to see out-of-order event log entries on Windows back in the day (the 2000s). You also could not count on the log timestamp accuracy too precisely; 1 s was pretty much the safe lower bound (some components truncated or rounded timestamps, IIRC).


If you want unique identifiers, use version 4 (random) UUIDs. Problem solved.

The probability of a collision is roughly the same as the probability of a fully grown dinosaur spontaneously manifesting in your bedroom due to quantum fluctuations.


I will take my chances :-)

More seriously, if you can use them, good old increments are probably best. They are fast and cheap, especially in a database. They can have privacy/security issues (you could guess things by the values of ids of stuff). UUIDs are better in those cases, or when you deal with a distributed system.


> They can have privacy/security issues (you could guess things by the values of ids of stuff).

Push them through a secure hash function, and that problem is solved too (assuming you can keep the base counter private).


If you're going to do that then you might as well just use UUID, since you effectively reintroduce its negative aspects (an infinitesimally minuscule chance of collisions, computation involved in the calculation, etc.)


The difference is that you can still use sequential IDs internally, while exposing hashed IDs to the outside. This protects your database from collisions under all circumstances, while in the absolute worst case, a single user might experience bugs because two external IDs collide.


This is a weird proposal. If you're using non-hashed IDs internally and exposing hashed IDs externally, you are going to need to map those (securely hashed) ids back to internal ids when the client hands them to you.

I guess you could do this with complete table scans, hashing the ids and looking for matches, but that would be horribly inefficient. You could maintain your own internal reverse index of hash -> id but now I have to ask what's the point? You aren't saving any storage and you're adding a lot of complexity.

Seems like if you want random unguessable external ids, you're always better off just generating them and using them as primary keys.

Also, you aren't protecting your database "from collisions under all circumstances" - there's no guarantee your hash won't collide even if the input is small.


Yes, it is more reasonable to use encrypted IDs externally from structured/sequential IDs internally, not hashed IDs. Recovering the internal ID from the external ID is computationally trivial since it will fit in a single AES block and you don't have to worry about collisions.
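
A sketch of that in Go with the stdlib AES block cipher (the key must stay secret, and the names are illustrative):

    package extid

    import (
        "crypto/aes"
        "encoding/binary"
    )

    // Encode encrypts an internal sequential ID, padded into a single
    // 16-byte AES block, into an opaque external ID. Distinct inputs
    // always map to distinct outputs, so collisions are impossible.
    func Encode(key []byte, internal uint64) ([16]byte, error) {
        var in, out [16]byte
        binary.BigEndian.PutUint64(in[8:], internal)
        c, err := aes.NewCipher(key) // key must be 16, 24, or 32 bytes
        if err != nil {
            return out, err
        }
        c.Encrypt(out[:], in[:])
        return out, nil
    }

    // Decode recovers the internal ID from an external one.
    func Decode(key []byte, external [16]byte) (uint64, error) {
        c, err := aes.NewCipher(key)
        if err != nil {
            return 0, err
        }
        var out [16]byte
        c.Decrypt(out[:], external[:])
        return binary.BigEndian.Uint64(out[8:]), nil
    }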


Yes, I tend to like this philosophy in database design, of internal sequential ids which are used for joins between tables etc. and an exposed "external reference". But I typically would use a UUID for my external reference rather than a hash of the internal id.


Doesn't that just add a whole lot of unnecessary complexity? If elements have multiple IDs, one of which should not be leaked to the outside, that's just asking for trouble in my opinion.

Is generating UUIDv4 or UUIDv7 really too much effort? I'd assume that writing the row to the database takes longer than generating the UUID.


It also means once your hash function leaks for whatever reason or gets brute forced because of whatever weird weakness in your system, it's game over and everybody will forever be able to predict any future ids, guess neighboring ids, etc., unless you're willing to change the hash and invalidate all links to any content on your site.

If I'm in a scenario where I think I need consecutive ids internally and random ones externally, I'll just have two fields in my tables.


You need 2 fields anyway, unless you want to have to brute force your hash function when you need to invert it.


Store just the sequential id, compute the hash on the edge.

This keeps your database simple and performant, and pushes complexity and work to the backend servers. This can be nice because developers are typically more at home at that layer, and scaling the backend can be a lot easier than scaling your database. But it also comes with the downsides listed in this thread.


That's fine, but when a request comes in referencing only a hash and not an id (because you're not leaking ids to clients), how do you get the id?


Good point. Back when we did that we just used a reversible hash function (some would call it encryption). There are some simple algorithms meant for encrypting single integers with a reasonable key.


I might be misremembering, but didn't YouTube do this in the early days? So yeah, that was what I was thinking of when replying, not a traditional hash function.


If you're hashing for security reasons, I think you should still maintain a salt.


Hum, no. You can easily hash numbers from 1 up to the value you see and guess the next value.

If you want a secure identifier, make a random 64- or 128-bit number (a UUID type 4). And do not use this number as an internal identifier, because identifier performance is all about predictability and low entropy.


If you use a robust salt, such as a randomly generated long salt, an attacker won't be able to guess the next hash.


A block cipher is better, as it's guaranteed to not have collisions.


Unless the RNG itself is the problem, there's no reason not to. All kinds of civilization-ending catastrophes are vastly more likely than a collision in a space of size 2^122.


v7 looks nicer since it solves v4's locality issue and you're still a gazillion times more likely to win the lottery than generate a collision.


Unfortunately, many libraries don't implement it yet.


It is still in draft, and I've watched the spec change at least once.


> The probability of a collision is roughly the same as the probability of a fully grown dinosaur spontaneously manifesting in your bedroom due to quantum fluctuations.

I'd love to see the math for the probability of the second option.


> The probability of a collision is roughly the same as the probability of a fully grown dinosaur spontaneously manifesting in your bedroom due to quantum fluctuations.

But now the probability of bad things happening increased by about a factor of two, which is not acceptable.


I don't see how sudden dinosaur appearance is a bad thing.


How will you sleep?

(Anyways, comparing the probabilities does not necessarily mean that the things being compared are both bad or both good.)


Perhaps it is made of anti-matter.


the dinosaur is now on my bed, what next?


I would just call Randall Munroe, he will know what to do


Close the door and immediately switch to v5.


Enjoy the non heat-death of the universe


Knowing nothing about UUID v4 generation, I have likely a stupid question. What makes you so confident that all implementations and their entropy sources are flawless enough to make actual collision probability close enough to theory?


What makes us so confident that our database implements ACID correctly, the RAM stores bits correctly, and the disk drivers store the data correctly?

In the end we have to make some assumptions about the correctness of (some of the) components.


We are not confident in any of that. Which is why we mitigate it with checksums, ECC etc.

So depending on the consequences you might opt to reduce the risk or help mitigate the consequences.

To just state that it is about as likely as my coffee maker being a portal to the future isn't very helpful. Poor entropy sources or bugs are not uncommon.


but is it higher or lower if the dinosaur is not fully grown?


I would guess slightly higher probability for a not fully grown dinosaur because the mass will be less, I think.


Despite the resolution being nanoseconds, what is the actual precision of computer clocks? I can't imagine it is actually nanoseconds. Takes me back to teaching physics labs where I had to hound students to remember that the accuracy of their measuring device is not identical to the smallest number it displays...


For devices clocked above 1Ghz it's perfectly possible for the clock to increment every ns, although that doesn't make it accurate to that level, and multi core systems may have clocks that are not synchronised to that level.

ARMv8 guarantees that its clock increments at at least 1 GHz; for Intel and earlier ARM it's more complicated.


Cycle counting is one thing, but it gets tricky when you have frequency scaling in play. Another problem: even without frequency scaling, CPU clocks are not designed to be super accurate/exact, and the true frequency might vary significantly even if the nominal frequency is fixed.


I think on newer x86 chips the ‘cycle’ counter increments at a ~constant rate, rather than changing with variable cpu frequencies. There’s a cpuid flag for this and I think Linux exposes it in procfs somewhere too. Older chips do have the problem you describe, as well as problems with cycle counts diverging (or never being close) between cores/sockets. It still isn’t exact in the way you describe. The most reasonable thing you can do is regularly calibrate based on an actual clock (whose rate also gets calibrated based on ntp…) to have a cycles<->ns conversion factor.


It doesn't matter that much that they're not super accurate or exact; they do in fact count ticks and tocks just dandy. I do wonder if both edges are counted equally - interesting question, if I do say so myself about my own question. I know a rising edge is steady with respect to the next rising edge, and a falling edge with a falling edge, but whether the two are balanced with respect to each other is, I think, a specification figure of merit. And Intel, despite its many failures - long ago forecast and long overdue, given the sheer difficulty of their business model and how long they kept it going; they were saying it would end soon in the late seventies already - doesn't make old chips rot like other software companies I could mention. Bits rot, but glass is timeless. Which brings me back to the point: in my analysis the problem is not the clocks being inaccurate but rather the jitter, which means a single run will not suffice to describe a repeatable, exact clock time taken for e.g. an inner loop, which is what is worth bothering with. The minimum jitter attainable currently is 1 cycle, so you run the same code repeatedly and take the minimum, with more repeatability as a consequence of that low jitter.

In the early nineties it was not so; you'd get the same number of clock cycles again and again and again.

And then it gets tricky, because cycle counts are thresholds set within which, if voltages and frequency are proper, the operation will complete deterministically with a soft error rate no greater than the system as a whole - about one per 30 years of hammering on the chip at sea level. Which is not enough for my tastes, and the jitter is a fucking mess.

I much prefer the GA144: least jitter of any platform, bar none. That makes sense because it is fully asynchronous logic - no clock anywhere in sight until you connect it to an oscillator, and even then the vibration doesn't rock the system the way one grandfather clock synchs with another whose pendulum swing is aligned with its own. On the GA144 it's pretty easy to tell the average-case cost of a function down to tens of picoseconds, at which point you have to check that there's no open window letting a draft in. In fact a time trial will tell you such a draft is coming into the house; it happened to me. While not from a parallel universe in spatial dimensions, it is by all means from another universe in the time dimension.


> I can't imagine it is actually nanoseconds

It's nanoseconds.


1? 10? 100? Those are all nanoseconds. When someone is asking about precision it tends to be a good thing to be precise in your answer.


You were similarly not precise in your skepticism. ;)


Erlang/Elixir (BEAM VM) makes this very clear - it's a distinction between monotonic vs strictly monotonic.

https://www.erlang.org/doc/apps/erts/time_correction.html#mo...

> In a monotonically increasing sequence of values, all values that have a predecessor are either larger than or equal to its predecessor.

These are available via the https://www.erlang.org/doc/man/erlang.html#monotonic_time-0 function.

https://www.erlang.org/doc/apps/erts/time_correction.html#st...

> In a strictly monotonically increasing sequence of values, all values that have a predecessor are larger than its predecessor.

Strictly monotonic values imply some synchronization / coordination, with an associated performance impact when there are many concurrent processes. This functionality is available via the https://www.erlang.org/doc/man/erlang.html#unique_integer-1 function. With a warning associated with it:

> Strictly monotonically increasing values are inherently quite expensive to generate and scales poorly. This is because the values need to be synchronized between CPU cores. That is, do not pass the monotonic modifier unless you really need strictly monotonically increasing values.
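
The same idea sketched in Go, which makes the cost visible: every caller contends on one shared atomic, which is exactly the cross-core synchronization the warning refers to:

    package ts

    import (
        "sync/atomic"
        "time"
    )

    var last atomic.Int64

    // StrictlyMonotonicNanos never returns the same value twice, even
    // across goroutines; the CAS loop on a shared word is the price.
    func StrictlyMonotonicNanos() int64 {
        for {
            now := time.Now().UnixNano()
            prev := last.Load()
            if now <= prev {
                now = prev + 1 // bump past the last issued value
            }
            if last.CompareAndSwap(prev, now) {
                return now
            }
        }
    }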


Relatedly, Erlang's own refs aren't created by a strictly-monotonic global generator, but rather are internally a pair of a regular monotonic identifier and the PID of the requesting process. In other words, they're akin to UUIDv1s, or to Snowflake IDs (https://en.wikipedia.org/wiki/Snowflake_ID).

You only really need strictly monotonic global identifiers if you need immediately-consistent first/last-write-wins. If you can instead rely on eventually-consistent first/last-write-wins (i.e. if your write events enter an event store/queue that linearizes them by ID, and then all "simultaneous" writes but the one with highest ID priority can be dropped/ignored, either during processing or on read), then I'd recommend first considering packed (nodeID, seq) pairs. And, if you want global event orderability, considering the snowflake ID formulation (timestampMajor, nodeID, timestampMinor, seq) specifically.
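
For reference, the classic snowflake packing sketched in Go (41-bit millisecond timestamp, 10-bit node ID, 12-bit sequence is the original Twitter layout; field widths vary by implementation):

    package snowflake

    // pack composes a sortable 63-bit ID: timestamp in the high bits,
    // then the node ID, then a per-node sequence number.
    func pack(ms uint64, node, seq uint16) uint64 {
        return (ms&(1<<41-1))<<22 | uint64(node&0x3ff)<<12 | uint64(seq&0xfff)
    }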


Comment out CLOCK_MONOTONIC_RAW because that's not available on FreeBSD, and it looks OK to me? My understanding is that we should see some timestamps repeating if there's a collision, correct? I can't get any collisions...

  > ./clock_gettime_demo/clock_gettime_demo
  clock_getres(CLOCK_REALTIME, ...)=1 ns
  clock_getres(CLOCK_MONOTONIC, ...)=1 ns

  CLOCK_REALTIME 30 samples:
  1689957434662039455
  1689957434662039526 (diff=71)
  1689957434662039566 (diff=40)
  1689957434662039606 (diff=40)
  1689957434662039636 (diff=30)
  1689957434662039676 (diff=40)
  1689957434662039706 (diff=30)
  1689957434662039736 (diff=30)
  1689957434662039766 (diff=30)
  1689957434662039796 (diff=30)
  1689957434662039826 (diff=30)
  1689957434662039856 (diff=30)
  1689957434662039886 (diff=30)
  1689957434662039916 (diff=30)
  1689957434662039946 (diff=30)
  1689957434662039987 (diff=41)
  1689957434662040016 (diff=29)
  1689957434662040047 (diff=31)
  1689957434662040076 (diff=29)
  1689957434662040107 (diff=31)
  1689957434662040137 (diff=30)
  1689957434662040167 (diff=30)
  1689957434662040197 (diff=30)
  1689957434662040227 (diff=30)
  1689957434662040257 (diff=30)
  1689957434662040297 (diff=40)
  1689957434662040327 (diff=30)
  1689957434662040357 (diff=30)
  1689957434662040387 (diff=30)
  1689957434662040417 (diff=30)

  CLOCK_MONOTONIC 30 samples:
  2168262770533974
  2168262770534023 (diff=49)
  2168262770534054 (diff=31)
  2168262770534094 (diff=40)
  2168262770534124 (diff=30)
  2168262770534164 (diff=40)
  2168262770534194 (diff=30)
  2168262770534224 (diff=30)
  2168262770534254 (diff=30)
  2168262770534284 (diff=30)
  2168262770534314 (diff=30)
  2168262770534344 (diff=30)
  2168262770534374 (diff=30)
  2168262770534404 (diff=30)
  2168262770534435 (diff=31)
  2168262770534464 (diff=29)
  2168262770534495 (diff=31)
  2168262770534524 (diff=29)
  2168262770534555 (diff=31)
  2168262770534585 (diff=30)
  2168262770534615 (diff=30)
  2168262770534645 (diff=30)
  2168262770534685 (diff=40)
  2168262770534715 (diff=30)
  2168262770534745 (diff=30)
  2168262770534775 (diff=30)
  2168262770534805 (diff=30)
  2168262770534835 (diff=30)
  2168262770534865 (diff=30)
  2168262770534895 (diff=30)


Do you run it on 4 cores simultaneously as the author did?


I'm running it on real hardware (Ryzen 9 5900X, 12 cores 24 threads) and all I'm doing is executing the program once. My understanding is that the program is using runtime.GOMAXPROCS(0) to ensure it's using all available CPU cores because that falls back to runtime.NumCPU ?


Oh, I was only running the C program not the Go binary which orchestrates it, that's why I wasn't seeing the longer stats. But I'm still getting no duplicate collisions.

  running longer zeros test ...
  sampled 10000000 pairs; 0 time diff zeros = 0.000000%; 0 nano diff zeros = 0.000000%

  starting parallel test 24 goroutines x 10000000 samples ...
  10000000 samples from a thread; 0 collisions inside the thread; 0 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 945466 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 1982688 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 2739919 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 3334361 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 4109772 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 4489881 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 5157178 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 5596508 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 5854763 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 5937583 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 6434076 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 6521917 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 6932626 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 7104428 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 7285076 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 7514904 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 7833317 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 7737265 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 7723545 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 7730441 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 8311161 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 7965117 collisions with other threads
  10000000 samples from a thread; 0 collisions inside the thread; 8460173 collisions with other threads
  102297835 final samples; 137702165 total collisions = 57.375902%; possible duplicate collisions? 0


At some point doesn’t this come down to the ISA? A CPU running at 3GHz gets 3 clock cycles per nanosecond.

I bet there is a fair amount of optimization in the compiler that leads to back-to-back assembly instructions reading the clock register. If subsequent time.Now() calls happen within 3 clock cycles of each other, can you really expect unique nanosecond precision…


Linux will (x86_64) use RDTSC and adjust that against a value read from the VDSO, so it can indeed happen very quickly.
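A quick way to see how fast that path is: a minimal sketch (plain POSIX clock_gettime, nothing from the article's harness) that takes back-to-back CLOCK_MONOTONIC readings and prints the deltas, like the samples quoted earlier in the thread:

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        struct timespec a, b;
        for (int i = 0; i < 10; i++) {
            clock_gettime(CLOCK_MONOTONIC, &a);
            clock_gettime(CLOCK_MONOTONIC, &b);
            /* delta between two back-to-back readings, in nanoseconds */
            long d = (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
            printf("diff=%ld ns\n", d);
        }
        return 0;
    }

On a VDSO-backed clock this never enters the kernel, which is why the deltas come out in the tens of nanoseconds.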


It takes like 20 cycles to read the cycle count register on modern chips. Even if collisions are somewhat unlikely, it’s still much worse to get several a day than ‘almost never going to happen’.


Reminds me of some Lotus Notes folklore. Apparently they used to use timestamps with a resolution of 1 second as unique IDs for things. When there was a collision, they would just add 1 second. Eventually, you'd end up with items being in the future because there would be so many collisions.


Silly Rabbit - absolutely accurate times are a security problem. CPU designers (even way back in the DEC Alpha days) intentionally introduced clock jitter, just to prevent total predictability. For x86, I think if you performed 3-4 timestamp reads back to back, saved the values into registers, and then reported those values upon completion, you would find that the time deltas are NOT exactly the same.


Do you have any sources for this? My googling skills are failing me. I'm surprised early x86 designers (which I assume you're including) were aware of security issues with accurate clocks; I certainly wasn't until this millennium :D I would rather guess observed clock jitter would be explained by interrupts or some such. Not saying you're wrong, I'd just like to learn more.


I’ve met too many people who are surprised by millisecond or microsecond timestamp collisions.

My most memorable and least favorite variety of this is when people try to assemble a timestamp from two system calls, one for the most significant digits and a second for the least.

If the smallest digits roll over from 99x to 00x after you read the large digits, due to process preemption, you can create a timestamp for an entity that appears to happen before the entity that caused it to exist. This breaks some code spectacularly (I've seen infinite loops at least twice).

If you haven’t memorized this as a thing to always avoid, you end up with tests that pass 99.5% of the time, and it can take someone with very good pattern matching skills to catch that the same test has been red once a week for a month and a half, which is a very long time for a logic bomb to live in CI/CD code before getting fixed.


My most memorable example was some support discussion that went something like “we think you have a race condition, <description>”

“It can’t be a race condition, as those two events happened at the exact same time”


Timestamps should probably never be used as a "unique" id.


The problem is achieving locality in cohesion/meaning, which usually involves locality in time, which provokes using timestamps as at least part of the id. But it's a chain of very lazy thinking IMHO, and it's like a peek at the house of cards some large systems are built on.


Would a Snowflake ID work for this? https://en.wikipedia.org/wiki/Snowflake_ID


Why do you want this kind of locality anyway?


Database indices.


I solve this problem with a time check coupled with an atomic test-and-set of a global variable.

If the global is less than the timer, use the timer value and set the global to it.

If the global is greater than or equal to the timer, increment it and use that value.
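
A minimal C11 sketch of that scheme (the names and the choice of clock are mine, just for illustration):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <time.h>

    static _Atomic uint64_t last_stamp;  /* the global being test-and-set */

    uint64_t unique_stamp(void) {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        uint64_t now = (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
        uint64_t prev = atomic_load(&last_stamp);
        for (;;) {
            /* timer ahead of the global: use the timer; otherwise global + 1 */
            uint64_t next = now > prev ? now : prev + 1;
            if (atomic_compare_exchange_weak(&last_stamp, &prev, next))
                return next;
            /* CAS failed: prev now holds the refreshed global, so retry */
        }
    }

The CAS loop keeps the returned values strictly increasing across threads, at the cost of serializing every generator on one cache line.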


Our Go ULID package has millisecond precision + monotonic random bytes for disambiguation while preserving ordering within the same millisecond. https://github.com/oklog/ulid


> On my system, the minimum increment was 32 ns.

Then you aren’t using a nanosecond clock.

If you want actual nanosecond precision then you probably want rdtsc.


If you need unique nanosecond timestamps, keep track of the previously generated one and increment past it if necessary. That would require a global lock or atomic operations, but should be good enough for practical uses.


"Logical Physical Clocks" may be of interest. The timestamps are monotonic and don't require atomic clocks like in google spanner.

HLC captures the causality relationship like logical clocks, and enables easy identification of consistent snapshots in distributed systems. Dually, HLC can be used in lieu of physical/NTP clocks since it maintains its logical clock to be always close to the NTP clock. Moreover, HLC fits into the 64-bit NTP timestamp format, and is masking tolerant to NTP kinks and uncertainties.

https://cse.buffalo.edu/~demirbas/publications/hlc.pdf
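
The send/local-event half of the algorithm is tiny; a rough C sketch of the update rule from the paper (field names are mine):

    #include <stdint.h>

    typedef struct {
        uint64_t l;  /* logical clock: max physical time observed so far */
        uint32_t c;  /* counter that breaks ties while l stands still */
    } hlc;

    /* Called on each local or send event with the current physical clock. */
    void hlc_tick(hlc *h, uint64_t physical_now) {
        if (physical_now > h->l) {
            h->l = physical_now;  /* physical clock advanced: reset counter */
            h->c = 0;
        } else {
            h->c++;  /* physical clock stale or equal: bump the tiebreaker */
        }
    }

The (l, c) pair is compared lexicographically, which is what keeps the result monotonic even when the physical clock stalls or steps backwards.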


One benefit of UUID is that you don't need coordination. In coordinated systems, unique IDs are a non-issue.


Is there any other benefit? That's the raison d'être - Universally.


Depends on what you compare them with.


If you do that, why not just use some Redis server and increment an integer every time you need a new number?


or y'know, use the thing that was designed specifically for this case: UUIDs


UUIDs are designed to solve a slightly different, albeit related, problem: one where you don't have synchronisation of the system as a whole. The problem the GP is solving is for systems that are synchronised, and therefore generating a UUID is additional and unnecessary overhead.


If collisions are seen, that means there would be lock contention at this critical section, so it's not a “good enough” solution.


This is why it scares me a bit to use a raw timestamp as a sort key in DynamoDB. I append a (random) unique ID to the timestamp text to avoid it. Better safe than sorry, I figure.


I have a use case with a similar solution, but for a different reason: it doesn't scare me to use a timestamp as the sort key, since I know my hash keys are unique and I only write every few seconds. But I still add a random number of ms (well under the frequency of writes/updates), because otherwise I'd hit hot-partition issues on the GSI (indexed by timestamp, as you can guess).


We ran into this, though we were using millisecond timestamps with the Number type.

Ended up working around it by adding digits after the decimal point. DynamoDB apparently supports up to 38 digits of precision, and JavaScript (we're NodeJS based) by default encodes numbers with up to 16 significant digits (combined, before and after the decimal point). A UNIX epoch millisecond timestamp is 13 digits, so we were able to use 3 more just to add uniqueness without having to update anything else in our codebase.
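
In C terms the trick looks roughly like this (a sketch, not our actual NodeJS code; names are hypothetical). It works because the scaled value ms * 1000 + seq stays below 2^53, so adjacent keys remain distinct doubles:

    #include <stdint.h>

    static uint32_t seq;  /* per-writer counter cycling through 0..999 */

    double sort_key(uint64_t epoch_ms) {
        seq = (seq + 1) % 1000;
        /* 13-digit ms timestamp + 3 fractional digits = 16 significant
           digits, the most our NodeJS layer will emit for a Number */
        return (double)epoch_ms + (double)seq / 1000.0;
    }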

We certainly now plan to essentially always use string-based (and generically named) partition and sort keys, which would allow us to more easily do what you're describing.


I was going to post "use a UUID", but I was surprised to learn that no UUID uses both a timestamp and a random component. You can either go fully random with UUID4, or have a time + MAC based UUID with UUID1. Strange, I would have thought there would exist a UUID that uses time + random to minimize collisions, like the post describes.


I believe the proposed UUIDv7 standard uses this.

https://datatracker.ietf.org/doc/html/draft-peabody-dispatch...

It's a draft but there's a lot of implementations out there.


ULID (and I think the UUIDv7 draft) has a millisecond timestamp + 80 bits of randomness.


the new UUIDv6, 7 and 8 work the way you describe.


Only v6 and v7, v8 is "whatever you want".


That's not much of a standard.


From my reading, it's standardized because it has a version number embedded, so it won't get confused with other UUIDs.

Probably a good idea to have somewhere to silo bespoke implementations. Just gotta hope they end up being unique.


It's not uncommon for standards to provide a way to do this, because it will happen whether you want it or not. See e.g. HTTP headers starting with X- or MIME types in application/vnd.


KSUID may be what you want. It's pretty much a time-sortable UUID.

Go implementation: https://github.com/segmentio/ksuid


someone else in the thread mentioned UUIDv7


Can't find much info about it but is this UUID v7?


> I was wondering: how often do nanosecond timestamps collide on modern systems? The answer is: very often, like 5% of all samples, when reading the clock on all 4 physical cores at the same time.

A few things to consider:

1. This depends a lot on how often you read the clock: the more often, the more collisions.

2. There may also be some weirdness where threads doing similar tasks will synchronise or desynchronise to maximize throughput.

3. Your system may not offer accurate nanosecond time. For this reason some OSes don't offer anything finer than microsecond time, as the clocks simply cannot offer higher usable accuracy. You're also measuring the time it takes to call the function(s), create the timestamp, etc.

A simple solution I used years ago was to add random extra precision to reduce collisions. That breaks most tie situations. If you have 5% ties at nanosecond resolution, with random picosecond digits your collisions are down to 0.005%. You could also seed the random offset with an ID.
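
A sketch of the idea; since epoch nanoseconds times 1000 would overflow 64 bits, the random digits live in a separate field that is compared last:

    #include <stdint.h>
    #include <stdlib.h>
    #include <time.h>

    /* Compare ns first, then pico; the pico digits are pure tiebreaker. */
    struct stamp { uint64_t ns; uint16_t pico; };

    struct stamp stamp_now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        struct stamp s;
        s.ns   = (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
        s.pico = (uint16_t)(rand() % 1000);  /* fake sub-ns digits */
        return s;
    }

Seeding rand() with a per-node ID, as mentioned above, also makes cross-machine ties unlikely.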


Let me tell you about the time the performance guy figured out that one of the big cycle eaters was every CPU in the SMP system trying to read the time of day from the same place in memory, so they "fixed" that by giving every CPU its own copy of the current time ...


In other news: Water is wet.

The high-resolution time counters are derived from the system base clock, usually running at ~33 MHz, which translates exactly into that observed 30 ns granularity.

If you really want robust time-derived timestamp identifiers, truncate the high-resolution timer to at best 10µs resolution and replace the low bits with a hash of them concatenated with the value of the CPU cycle counter (`RDTSC` on x86).
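
Roughly like this, assuming x86-64 with GCC/Clang; the mix function is a splitmix64-style finalizer standing in for a real hash such as xxHash:

    #include <stdint.h>
    #include <time.h>
    #include <x86intrin.h>  /* __rdtsc() */

    /* splitmix64-style finalizer used here as a stand-in hash */
    static uint64_t mix64(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    uint64_t stamp_id(void) {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        uint64_t ns     = (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
        uint64_t coarse = ns / 10000 * 10000;  /* truncate to 10 µs */
        uint64_t low    = ns % 10000;          /* the digits being replaced */
        return coarse + mix64(low ^ mix64(__rdtsc())) % 10000;
    }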


The HRT patch set for Linux (which is eight years old) claims that Linux had a 10μs clock resolution before the patch, and conversations on SO suggest the resolution is now 1ns, so I believe this information is dated, or at least OS dependent.


I don’t think this is right. I am pretty doubtful of the discussion of the granularity of the results of clock_gettime, but I failed to find any sources and I don’t have a suitable computer to hand to experiment.

But here are two more doubts:

1. On at least some systems, clock_gettime will be implemented by a vdso call that (on x86) uses rdtsc to give you the time, so you should expect its results to be pretty highly correlated with an rdtsc immediately afterwards, so you aren’t really adding useful entropy (whereas you are losing time-locality of your identifiers which may be desirable if they are to end up as primary keys somewhere)

2. On some systems (eg some AMD processors), rdtsc gave me pretty low precision results (eg rounding to the nearest 10 cycles) so you also won’t necessarily get the entropy you would expect from the description of the instruction. I failed to find a reference for this too apart from an offhand mention in a paper about timing attacks.


> replace the low bits with the hash of them

Doesn't this hash-step only increase the probability of collisions?

What is this step intended to solve?


> replace the low bits with the hash of them, concatenated with the value of the CPU cycle counter (`RDTSC` on x86)

you're concatenating two values and then taking the hash of the combination, ie:

hash(low bits + CPU cycle counter)


But a hash() destroys information.

When you are trying to reduce collisions, why would you use a function that is known to introduce collisions?


Because the raw timestamp value will have very low entropy, with the short-scale variation concentrated in just a few bits. A hash doesn't just destroy information, it also creates entropy by mixing that information over all the bits that come out of it. And with a 64-bit hash replacing a 64-bit nanosecond counter that has short-term entropy of less than 16 bits, you're in fact reducing the likelihood of a collision by a factor of 2^48.


The hash, in this case, is just one deterministic way to shorten the CPU counter to a few bits which can then be used to increase the entropy of the timestamp by replacing the timestamp's stale bits. What's being asked here is not why using some compressed bits of the CPU counter increases the entropy of the timestamp overall. Rather, it's why you'd use a hash of the entropy-providing information to do this, since hashes allow for many:1 mappings (i.e. allow for decreases in entropy) while never providing better than 1:1 mappings (i.e. preserving entropy). At best, the hash is preserving the entropy of the timestamp and counter bits and, at worst, it is throwing some away. The question that follows: is there a better way that preserves the entropy of the inputs without the risk of throwing some away and, if so, was there some other reason to still use the hash? This is what I took amelius to be asking.

Knowing the timestamp delta is ~30ns, even a 1 THz processor would only execute 30,000 cycles before the timestamp increases, so the outcome stays unique anyway. From that perspective, taking the low 16 bits of the cycle counter is already guaranteed to produce perfectly unique timestamps, without the risk of introducing hash collisions. It's also significantly easier to compute and is monotonic*, whereas the hash was neither. From that perspective, I too don't see what value the hash is supposed to add to the information being fed to it in this case.

In threaded/distributed/parallel scenarios you may wish to throw the lane/core/node number into those bits as well. In cases where having the same timestamp is supposed to be invalid, this leaves you a simple deterministic way to tiebreak the situation as part of creating the timestamp itself.

*A 10 GHz CPU running for a little over 58 years could risk a 64 bit counter rollover in the middle of a time delta. If that's too close for comfort or you want the code to work on e.g. 32 bit counter systems, you'd need to eat more cycles and registers to set another bit to the outcome of whether current cycle count is < the one at the start of the current timestamp delta.


Yeah, when I was thinking about the hash, I thought of it as stuffing to fill the unused portion of the number that would look better than using zeroes.

TIMESTAMP010111101010101TSC "looks better" than TIMESTAMP000000000000000TSC even though they contain the same information.

I would drop the hash, it's deceptive.

I don't believe it breaks the monotonicity, though? I mean, it would if it weren't already broken. If you're taking the low 16 bits of the TSC, then a rollover in those 16 bits during the same timestamp will already go backwards. TIMESTAMP0...0 follows TIMESTAMPf...f.


I guess it depends on which portion you look at. If you only look at the time-based portion, you do indeed still have a function that never decreases, but that's true even when reading the raw counter whole on its own. If you look at the whole value, including the hashed portion, it's no longer monotonic.

In the cycle-based case, looking at the whole value is the same as looking at a relative timestamp with more precision than the system clock. In this way, it's "truly" monotonic across the entire value, not just monotonic in one part and unique in another.

Side topic: It also comes with an even stronger guarantee of always increasing instead of just "not changing direction". Is there a name for that?


> it also creates entropy

It's a nitpick, but it concentrates the entropy. It doesn't create any.

I do think it will make the answer clearer, as the hash concentrates the fewer-than-64 bits of entropy in those 128 bits of original data into a usable 64-bit package.


Actually hashes do create entropy (every computation creates entropy in some form or another). What's the entropy of a 4 bit number? What's the entropy of a 4 bit number hashed by a 64 bit hash function? The act of computation does in fact create entropy, as per the 2nd law of thermodynamics, a part of which shows up in the hash.


I don't think you understand what this conversation is about. We are considering information theoretic entropy, not thermodynamic entropy from the mechanism of computation itself.

The result of applying a deterministic function on a random variable cannot have more entropy than the underlying random variable. This is a theorem, one that is trivial enough to not have a name. But you can find solution sets to homework that will prove it for you: https://my.ece.utah.edu/~rchen/courses/homework1_sol_rr.pdf


> every computation creates entropy in some form or another

Ok, what is the entropy created by this function that maps a 4-bit number to a 64 bit number:

    0 -> 0
    1 -> 1
    2 -> 1
    3 -> 1
    4 -> 1
    ...
    15 -> 1


60 bits. Yes, I know, you can compress it down very well. But consider that entropy in computation involves not just the bits you store, but also the bits that the processor touches and eventually dissipates as heat into the universe.


What definition of entropy do you use?

(I'm using Shannon entropy.)


Boltzmann. But it doesn't really matter, it's the same thing. Yes, I know that a sequence of, say, 1000 identical bits looks like it's got just 10 bits of entropy after simple RLE compression. But you must not forget the entropy that was also generated in the computation itself, and subsequently dissipated into the universe.


It's not the same thing. If I define a function that always returns 1, then the Shannon entropy is extremely low regardless of whether the Boltzmann entropy of running it on a CPU is high. That the two measures can differ shows they cannot be the same thing. Related in concept, different in definition. In fact, you can even use the same formulas for calculating it - what differs is what you're calculating it on.


> If I define a function that always returns 1…

then its Kolmogorov complexity is also extremely low.

Look, if you have a good enough hash function, its output should be near the Shannon limit and hardly compressible, and ideally contain as much entropy as it has bits. But you can feed in just a single bit or the entire knowledge of humanity; in the end you're going to get a fixed number of bits, with entropy near that, and if you throw any form of lossless compression at it, it will hardly compress.

But quantum mechanics tells us that information cannot be destroyed. So when you feed it more bits than it emits, it's mostly the entropy of the information you feed in that you get out of the hash. But if you feed it just a single bit, the additional entropy comes from the computational process.

I know, this is now getting really philosophical, but here's something to ponder on: How would you implement a hash function for a reversible computing architecture?


Most hashes are really good, but the point was why replace the perfectly unique information in the cycle counter + timestamp combo with "most likely nearly unique" in the first place. After all, if the former isn't unique then neither are the hashes anyway.

Hashes are EXTREMELY compressible, although the known algorithms are extraordinarily slow. E.g. I can compress any SHA256 output to a matter of kilobytes, maybe less, by using the SHA256 algorithm itself as the decompressor and iterating through seeds until I get a match. With true entropy you can't guarantee that for all inputs, regardless of how long you take.

Different types of "information" are at play here, with the different types of entropy as well. If I have a 16-bit hash function and feed it a 64-bit value, 48 bits of computational information are lost (at minimum). What happens with the physical information you used to represent the computation after you get the result is separate from what happens with the computational information.


Why?

    low bits + CPU cycle counter
is enough. No need for the hash().


You don't know precisely at which frequency the cycle counter runs. Depending on the system load it might run either faster or slower than the lowest bits of the HPTC. For what it's worth, this part is more or less nondeterministic, so the sane thing to do is to spread the information out as much as possible (maximize entropy) in order to minimize the probability of collisions.


That's ok, the bits past the low bits are just there to avoid collisions, not an actual measure of high precision time beyond the low bits.

It's not worse than the hash solution, I'm just saying it's not necessary to hash it if the only objective is to reduce collisions.

In fact the hashing solution, if it is replacing the low bits with a hash of low bits plus something else, is actually destroying valuable time information.


That only works if you know that the low bits are constant. Otherwise you may run into the issue that, due to the unsteady rate of RDTSC, two processes/threads that get preemptively unscheduled between reading the HPTC and the RDTSC may again end up with colliding timestamps. And if you took the derivative between timestamps taken in succession, you might even find it to be non-monotonic in places.

The combination of multiple counters incremented by individually unsteady clocks used to be a source of pseudo-random scrambler sequences; these days we prefer LFSRs, but overall this is something that can get weird.

Hence my recommendation: just throw xxHash32 at the concatenation of the HPTC's low bits and the CPU cycle counter, and forgo any pretense of monotonicity in the low bits (because very likely you don't have it anyway).


I thought clock_gettime() usually does use rdtsc(p) on Linux? Possibly depending on the particular clock type (monotonic, realtime, etc). Either way I'd be interested in knowing more.


RDTSC is directly influenced by frequency scaling. So while monotonic, its clock interval is neither constant nor deterministic on modern systems.

Here's a small online visualization of RDTSC average and standard deviation I just hacked together: https://gist.github.com/datenwolf/151486f6d73c9b25ac701bdbde...

On a system with frequency scaling you can see that under higher load the difference between RDTSC in subsequent iterations of a tight loop that does nothing else than reading that register will drop. Here's how it looks on the system I'm currently using: https://www.youtube.com/watch?v=FKKjSJ1JZ78


> RDTSC is directly influenced by frequency scaling

Unfortunate wording: RDTSC itself is not influenced by frequency scaling; it has a constant frequency on modern CPUs, after all. Your video nicely shows that the RDTSC delta per loop iteration is influenced by CPU frequency, as expected, but how does that affect using RDTSC as a clock? On my CPU RDTSC seems to tick at 3 GHz, for example. I wonder how precise it is, though, and how much its frequency can drift from spec.
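
A rough way to check the tick rate (a sketch assuming constant_tsc and GCC/Clang's __rdtsc intrinsic): count TSC ticks across a CLOCK_MONOTONIC interval:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <x86intrin.h>

    int main(void) {
        struct timespec a, b, nap = {0, 100000000};  /* sleep ~100 ms */
        clock_gettime(CLOCK_MONOTONIC, &a);
        uint64_t t0 = __rdtsc();
        nanosleep(&nap, NULL);
        uint64_t t1 = __rdtsc();
        clock_gettime(CLOCK_MONOTONIC, &b);
        double s = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
        /* ticks per second of wall time = effective TSC frequency */
        printf("TSC ~%.4f GHz\n", (double)(t1 - t0) / s / 1e9);
        return 0;
    }

Running it repeatedly under varying load would show how stable the ratio actually is.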


Ah crud, you're right. What'd really be interesting to see is the ratio between d/dt rdtsc and d/dt clock_monotonic.


I did update my program, now it measures the ratio.


Wouldn't those operations reduce the accuracy of the timestamp?


What is this "accuracy" you're talking about? Between the scheduler hanging over a process's head, ready to yank away its compute time for milliseconds between the syscall that reads the HPTC or the CPU cycle counter and the process actually writing the timestamp into a variable/in-memory struct, the HPTC read from the underlying hardware not being super accurate either, and the CPU cycle counter being influenced by frequency scaling, on a bog-standard machine the highest precision you can reasonably expect is on the order of 100µs or so.


Modern CPUs don't really give you accurate nanosecond-scale time stamps anyways. The CPU will randomly speed up or slow down, execute instructions out of order, and even speculatively execute instructions. Not to mention that it'll have a dozen different clocks - which are not guaranteed to be in sync.


This may be coming from a place of ignorance, but if that were the case, then time would drift significantly and constantly due to Intel SpeedStep, for example. And if that were the source of truth for time, then when your computer is off, it wouldn't be able to keep time. I'm pretty sure they all have real-time clock chips on the motherboard.


There's timekeeping (which uses the real-time clock) and there are high-resolution clocks, which are good for relative timers. They're never the same components.


> The CPU will randomly speed up or slow down

constant_tsc has been a thing for more than a decade

> execute instructions out of order, and even speculatively execute

rdtscp serialises

> Not to mention that it'll have a dozen different clocks - which are not guaranteed to be in sync.

synchronised_tsc has been a thing for about 6 years now


None of these are performant, no?

Generally, you can have consistency, speed, or low cost - but not more than two at the same time.


Invariant (constant) TSC is detectable via `cpuid` and applies to `rdtsc`/`rdtscp` by default. In that respect, there's no tradeoff being made (observable to software) AFAICT.


Interesting results! I guess it depends on whether you consider ~35 cycles expensive or not.

[https://community.intel.com/t5/Software-Tuning-Performance/H...]


Are there cheaper ways of getting elapsed time with sub microsecond precision? Interested as I've only ever heard of rdtsc at the lowest level in userspace for x86.


I ran across a random Stack Overflow thread with benchmarks claiming it was about 2x the cost of a naive gettime(). But frankly, it's hard to figure out when you factor in all the various caches, out-of-order execution pipelines, etc.


Integrating a hash might improve (not guarantee) the uniqueness situation but not the monotonicity situation. Right?


Well, you can always attempt to catch the CPU cycle counter overflow (which happens at roughly 10Hz on current machines), add up the carries, and add them to a nanosecond counter shifted up by a few bits. The problem with the CPU cycle counter is that it's not in lockstep with the HPTC, due to dynamic frequency scaling.

If you really, really, really need system wide, nanosecond precision timestamps, you'll have to resort to dedicated hardware incrementing a counter at the desired rate, and with a well known latency for each read access. On the hardware and driver level you'd have some MMIO port mapped into user space with an identity transform between bus address and process address space.

However, this still doesn't solve the problem of the scheduler being able to throw in several milliseconds between reading the high-precision timer value into a register and writing it back into the memory holding the timestamp variable. Seriously, on your typical computer system, software-derived timestamps at more than 100µs resolution are kind of bogus.


Some of this is not current.

Constant-rate TSCs are ~15-20 years old. Synchronized TSCs are at least ten.

Also RDTSC produces a 64-bit result, so it does not overflow at a rate of several Hz.


It's 64-bit on 64-bit systems. In the world of hard realtime applications there's still a huge armada of systems out in the field running on 32 bits (think motion control, CNC machines, industrial robotics). If you're developing software that's concerned with this kind of precision, you might find yourself confronted with such "outdated" systems more often than not.

Also see https://news.ycombinator.com/item?id=36814762


Or UUID7

