
Stormcow


Tornadocalf


I think the problem you'll eventually run into is figuring out intent from the diff. It seems like an easier version of reverse compiling.

When it comes down to semantic diffs I'm more interested in something like the Semantic Patch Language by Coccinelle. Being able to represent mundane refactorings across an entire codebase in a few lines seems great. And it unifies intent with the diff.


And just like that, another GPT-4 wrapper startup was born.


You can start the sequence at -2b, or wrap it around when it gets close to the signed limit. Hopefully you haven't depended on it not wrapping around by that point.

For queue tables you can even use `CYCLE` to do that automatically.
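
Something like this in Postgres, assuming a plain int4 key and a hypothetical jobs table:

    CREATE SEQUENCE jobs_id_seq AS integer
        MINVALUE -2147483648     -- start near the bottom of the signed range
        START WITH -2147483648
        CYCLE;                   -- wrap back to MINVALUE instead of erroring at the max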


Because it's much better for range queries and joins. When you inevitably need to take a snapshot of the table or migrate the schema somehow, you'll be wishing you had something other than a UUID as the PK.


This. Highly recommend using a numeric primary key + UUID. Using UUID relations internally can have some strategic advantages, but when UUIDv4 is used as the only primary key, you completely lose the ability to reliably iterate all records across multiple independent queries.
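
Roughly this shape, with hypothetical names (gen_random_uuid() is built in from PG 13, or comes from pgcrypto before that):

    CREATE TABLE customer (
        id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        external_id uuid   NOT NULL DEFAULT gen_random_uuid() UNIQUE,
        created_at  timestamptz NOT NULL DEFAULT now()
    );
    -- joins, range scans and batch iteration use id; only external_id
    -- ever leaves the database (APIs, Stripe metadata, etc.)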

Also, the external ID isn't just for exposing out to your own apps via APIs, but way more importantly for providing an unmistakable ID to store within related external systems. For example, in your Stripe metadata.

Doing this ensures that ID either exists in your own database or does not, regardless of database rollbacks, database inconsistencies etc. In those situations a numeric ID is a big question mark: Does this record correspond with the external system or was there a reuse of that ID?

I've been burnt taking over poorly managed systems that saved numeric IDs externally, and in trying to heal and migrate that data, ran into tons of problems because of ill-considered rollbacks of the database. At least after I leave, the systems I build won't be subtly broken by such bad practices in the future.


Ha? Please elaborate.


When running a batched migration it's important to batch on a strictly monotonic field so that new rows won't get inserted into an already-processed range.
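
A rough sketch of that kind of batching in Postgres (hypothetical table and column; :last_id is a client-side placeholder that starts at 0 and is advanced to the highest id each batch returns):

    UPDATE orders o
       SET new_col = 'backfilled'        -- whatever the migration computes
      FROM (SELECT id
              FROM orders
             WHERE id > :last_id
             ORDER BY id
             LIMIT 1000) batch
     WHERE o.id = batch.id
    RETURNING o.id;                      -- feed the max back in as :last_id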


It's not even necessarily about it being strictly monotonic. That part does help though, as you don't need to skip rows.

For me the bigger thing is the randomness. A UUID being random for a given row means the index order has no relation to the heap order; any given index entry points to a completely random heap entry.

When backfilling this leads to massive write amplification. Consider a table with rows taking up 40 bytes, so roughly 200 entries per page. If I backfill 1k rows sorted by the id, then under normal circumstances I'd expect to update 6-7 pages, which is ~50 KiB of heap writes.

Whereas if I do that sort of backfill with a UUID, I'd expect each row to land on a different page. That means 1k rows backfilled is going to be around 8 MB of writes to the heap.
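
The rough arithmetic, assuming the default 8 KiB Postgres heap pages:

    8192 B/page / ~40 B/row           ~= 200 rows per page
    1k rows backfilled in id order    ~= 5-7 dirtied pages   ~= 50 KiB written
    1k rows in random (UUIDv4) order  ~= 1,000 dirtied pages ~= 8 MB written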


Isn't that solved because UUIDv7 can be ordered by time?


Yeah pretty much, although ids can still be a little better. The big problem for us is that we need the security of UUIDs not leaking information and so v7 isn't appropriate.

We do use a custom uuid generator that uses the timestamp as a prefix that rotates on a medium-term scale. That ensures we get some degree of clustering for records based on insertion time, but you can't go backwards to figure out the actual time. It's still a problem when backfilling; this is more about helping with live reads.
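
Not our actual generator, but a toy illustration of the idea (PG 13+): use a coarse time bucket, modulo a small cycle, as the prefix so inserts within a window cluster together while the prefix alone can't be mapped back to a real timestamp:

    CREATE FUNCTION gen_clustered_uuid() RETURNS uuid AS $$
      SELECT (
        -- hour bucket mod 256 repeats every ~10.7 days, so it groups recent
        -- inserts without encoding the absolute time
        lpad(to_hex((extract(epoch FROM now())::bigint / 3600) % 256), 2, '0')
        || substr(replace(gen_random_uuid()::text, '-', ''), 3)
      )::uuid
    $$ LANGUAGE sql VOLATILE;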


Are page misses still a thing in the age of SSDs?


Strictly monotonic fields are quite expensive and the bigserial PK alone won't give you that.


PG bigserial is already strictly monotonic


No they're not, even with a `cache` value of 1. Sequence values are issued at insert rather than commit. A transaction that commits later (which makes all updates visible) can have an earlier value than a previous transaction.

This is problematic if you try to depend on the ordering. Nothing is stopping some batch process that started an hour ago from committing a value 100k lower than where you thought the sequence was at. That's an extreme example but the consideration is the same when dealing with millisecond timeframes.
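
A sketch of the race, with a hypothetical table t (id bigserial primary key):

    -- session A:  BEGIN;
    -- session A:  INSERT INTO t DEFAULT VALUES;   -- assigned id 100
    -- session B:  BEGIN;
    -- session B:  INSERT INTO t DEFAULT VALUES;   -- assigned id 101
    -- session B:  COMMIT;                          -- 101 is visible now
    -- session A:  ...long-running batch work...
    -- session A:  COMMIT;                          -- 100 only becomes visible after 101
    -- anything that batched "up to the max visible id" between the two commits
    -- has already moved past 100 and will skip it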


Okay, but in a live DB, typically you won't have only inserts while migrating, will you?


Yes, but updates are covered by updated app code


would creation/lastmod timestamps cover this requirement?


Yes, although timestamps may have collisions depending on resolution and traffic, no? Bigserials (at least in PG) are strictly monotonic (with holes).


I find average leaf density to be the best metric of them all. Most btree indexes with default settings (fill factor 90%) will converge to 67.5% leaf density over time. So anything below that is bloated and a candidate for reindexing.
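
In Postgres that comes from the pgstattuple extension's pgstatindex(); the index name here is hypothetical:

    CREATE EXTENSION IF NOT EXISTS pgstattuple;
    SELECT avg_leaf_density, leaf_fragmentation
      FROM pgstatindex('orders_pkey');
    -- avg_leaf_density well below ~67.5 suggests it's worth a
    -- REINDEX INDEX CONCURRENTLY orders_pkey;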


Because there's a good chance down the line you will need to do some sort of range query. Let's say you want to add and backfill a column. Not too bad: you create a partial index where the column is null and use that for backfilling data.
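
Something like this (hypothetical names):

    CREATE INDEX CONCURRENTLY orders_new_col_todo
        ON orders (id)
        WHERE new_col IS NULL;
    -- each batch takes the next N ids from this index and fills them in;
    -- rows drop out of the index as new_col stops being NULL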

But at a certain scale that starts taking too long and a bigint column would be quicker. Or you decide you need to periodically scan the table in batches for some reason. Perhaps to export the contents to a data warehouse as part of an initial snapshot.

You can skip enumerating these possibilities by having a bigint surrogate key from the get-go. There are other advantages as well, like better joins and temporal locality when the bigint index can be used rather than the UUID.


That's all well and good until something goes down and you need someone knowledgeable to diplomatically shout at a vendor.


It's not really the UK regulator's fault, though; if anything they handled it best, giving their response first (a provisional no). The EU was still investigating and the US DOJ was also preparing similar investigations. The CMA also provided Adobe with a list of changes they could make in order for the application to be approved, so it's not even like they were unwilling to entertain it.

As we saw with the Blizzard acquisition the UK CMA will bend to international pressure if it's the only one holding out.


> the Blizzard acquisition the UK CMA will bend to international pressure if it's the only one holding out.

I read that MS agreed to remedial changes for the purchase to be approved by the CMA, but those changes were the same ones the CMA originally asked for at the start and MS refused. So the CMA got what it wanted, but it's still seen as MS being the eventual winner and the CMA the loser.

