Note that the measurements in the paper were made before a bug was fixed in which bits and bytes were confused. Because of it, SQLite only used 1/8 of the reserved bloom filter space, which increased the false positive rate significantly.
I found and reported the bug because I wanted to understand how the bloom filters in SQLite work for my uni seminar paper. I'm still wondering whether those kinds of bugs can be found with test cases.
On top of that, I don't think it's fair to say it's 10x faster when the btree was only tested on an integer primary key column. Benchmarks making such bold claims should at least include short string (maybe 1-16 chars) and UUID indexes.
I do not know if it is still the case, but the last time I looked into the source code, SQLite hashed all strings to the exact same value, so the bloom filter optimization does not help there.
It had to do with the different ways strings can be compared under collating functions; strings may compare equal even when their bytes differ: https://sqlite.org/forum/forumpost/0846211821
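The underlying problem, sketched in Java rather than SQLite's actual code: under a case-insensitive comparison (like SQLite's built-in NOCASE collation), two strings can be equal while their bytes, and therefore any byte-based hash, differ. A byte-based hash would thus create false negatives, which a bloom filter must never produce.

  public class CollationVsHash {
      public static void main(String[] args) {
          String a = "sqlite";
          String b = "SQLite";
          // Equal under a case-insensitive comparison...
          System.out.println(a.equalsIgnoreCase(b));        // true
          // ...but the bytes differ, so a byte-wise hash lands in different buckets.
          System.out.println(a.hashCode() == b.hashCode()); // false
      }
  }

Hashing every string to one constant value sidesteps that, at the cost of making the filter useless for strings.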
Call me crazy, but I'm simply splitting my UUID into its high and low 64 bits and indexing off those, i.e.:
  CREATE TABLE foo(
    -- SQLite has no unsigned integer type: "UNSIGNED BIG INT" simply gets
    -- INTEGER affinity, i.e. a signed 64-bit integer, which is fine here.
    id_ms UNSIGNED BIG INT NOT NULL, -- most significant 64 bits of the UUID
    id_ls UNSIGNED BIG INT NOT NULL, -- least significant 64 bits
    PRIMARY KEY (id_ms, id_ls)
  ) WITHOUT ROWID;
That works well with UUIDv7, and it stores just 128 bits rather than a full string. In most languages it's pretty trivial to turn two longs into a UUID and vice versa.
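E.g. in Java (a quick sketch; java.util.UUID exposes the two halves directly):

  import java.util.UUID;

  public class UuidSplit {
      // Most/least significant 64 bits, matching id_ms / id_ls above.
      static long[] toLongs(UUID u) {
          return new long[] { u.getMostSignificantBits(), u.getLeastSignificantBits() };
      }

      static UUID fromLongs(long idMs, long idLs) {
          return new UUID(idMs, idLs);
      }

      public static void main(String[] args) {
          UUID original = UUID.randomUUID();
          long[] halves = toLongs(original);
          System.out.println(original.equals(fromLongs(halves[0], halves[1]))); // true
      }
  }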
In SQLite, if you were to define a TEXT column (or anything other than INTEGER, for that matter) with a UUID as the PK, you'd already have two indices, because the table data itself is stored ordered by rowid [0]. So you'd have a level of indirection, where the "PK" index points to the rowid.
You could define the table as WITHOUT ROWID [1], but as the docs point out, the average row size shouldn't exceed 200 bytes for the default 4 KiB page size. Since a UUID in text form is at best 32 chars (hex without dashes), that doesn't leave much room for the rest of the columns.
It's also a problem in machine learning. Your data might be mangled due to a bug but the NN will still extract something useful out of it. Or, on the flip side, if you make a change to the data and things do break (learning stops converging), you never really know if it's the architecture or the data that's the issue.
SQLite only knows nested loop joins, and the bloom filter can just tell us "no need to do a join, there is definitely no matching entry".
If it produces a false positive every time (the worst case), performance is the same as before the bloom filter optimization was implemented (apart from the small bloom filter overhead).
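Conceptually something like this sketch (not SQLite's actual code; it assumes a single hash function and 8 bits per row, which is what the rates below imply):

  import java.util.List;
  import java.util.Map;

  public class BloomJoinSketch {
      // Single-hash bloom filter built over the inner table's join keys.
      static long[] buildFilter(Iterable<Long> innerKeys, int bits) {
          long[] filter = new long[(bits + 63) / 64];
          for (long key : innerKeys) {
              int bit = Math.floorMod(Long.hashCode(key), bits);
              filter[bit >>> 6] |= 1L << (bit & 63);
          }
          return filter;
      }

      static boolean mayContain(long[] filter, long key, int bits) {
          int bit = Math.floorMod(Long.hashCode(key), bits);
          return (filter[bit >>> 6] & (1L << (bit & 63))) != 0;
      }

      public static void main(String[] args) {
          Map<Long, String> inner = Map.of(1L, "a", 2L, "b");
          int bits = inner.size() * 8; // 8 bits per row
          long[] filter = buildFilter(inner.keySet(), bits);
          for (long key : List.of(1L, 3L, 42L)) {    // outer loop of the join
              if (!mayContain(filter, key, bits)) {
                  continue;                          // definitely no match: skip the probe
              }
              String row = inner.get(key);           // false positives land here and find nothing
              if (row != null) System.out.println(key + " -> " + row);
          }
      }
  }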
As the bloom filter size in SQLite depends directly on the table size, I estimated a false positive rate of 63.2% due to this bug, while it could have been just 11.75%:
https://sqlite.org/src/info/56d9bb7aa63043f5
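For anyone checking the numbers: with a single hash function (which is what these rates imply), n entries and m filter bits give a false positive rate of

  p = 1 - (1 - 1/m)^n ≈ 1 - e^(-n/m)

With the intended one byte per entry, m = 8n and p ≈ 1 - e^(-1/8) ≈ 11.75%; with the bug, only n bits were actually used, so p ≈ 1 - e^(-1) ≈ 63.2%.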