[flagged] SQLite vs. ObjectBox vs. Isar (ente.io)
46 points by vishnumohandas 10 months ago | 72 comments



Doesn't pass the sniff test. 19 seconds sounds like the performance of 100,000 separate transactions. I bet a crisp $10 bill they've forgotten to wrap all of this in a transaction. In SQLite it makes a huge difference. I struggle to think of a way that you could make SQLite take 19 seconds to write and read 100,000 rows otherwise.
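
For the record, "wrap all of this in a transaction" means something like the following minimal sketch, assuming the sqlite_async API (writeTransaction/execute) that shows up elsewhere in this thread; the table and column names are made up:

    import 'dart:typed_data';

    import 'package:sqlite_async/sqlite_async.dart';

    // One transaction around all 100,000 inserts, instead of one implicit
    // transaction per INSERT statement.
    Future<void> insertAll(SqliteDatabase db, List<Uint8List> blobs) async {
      await db.writeTransaction((tx) async {
        for (final blob in blobs) {
          await tx.execute(
              'INSERT INTO embeddings (embedding) VALUES (?)', [blob]);
        }
      });
    }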


Way back in the day, like a decade or more ago when I was working on Aperture, we increased the speed of the database by dropping CoreData and moving to an embedded SQLite with tuning set up for the application.

We had a test database of 1,000,000 images, each with large amounts (several hundred) of metadata items - EXIF and the like, all of which could be queried. We had a target time of 2 seconds to get any query back. Most of the time, we hit that mark using the raw SQLite database approach.

So this was:

- On far older hardware, we still used spinning rust for $deity sake...

- Seems like much more data

- Seems like much more metadata

And it was still very fast (relatively speaking - a couple of seconds is still a long wait-time, and then there was all the UI overhead on top).

I think the author might be missing a trick or two when running SQLite; as mentioned in their write-ups, we found it faster than the filesystem as a storage medium...


> when I was working on Aperture

As a long-term Aperture user, I concur with the above sentiments. It was an impressive real-world use of SQLite by Adobe.

Hence I agree with the conclusion that if SQLite is "slow", it's usually a PEBCAK error.


Aperture was Apple’s pro competitor to Lightroom, not made by Adobe.


> Aperture was Apple’s pro competitor to Lightroom, not made by Adobe.

My bad, apologies.

Although in my defence, Lightroom also uses SQLite as a backend, though admittedly I'm not sure how extensive its use is compared to Apple's Aperture.


Author of the sqlite_async package here.

There isn't anything obviously wrong with how the lib is used - the writes are properly batched in a single transaction, but the performance is much slower than expected.

From a quick look, garbage collection appears to be a big overhead - it could be related to how data is passed between isolates. In this case, it may actually be faster to do writes in smaller batches, since the current approach may use a lot of memory for the batch. I'll have to test further to see whether that helps, or whether there are other optimizations for this case.


From some further testing:

1. With a batch size of 10k, there is a negligible difference in write performance on my machine (SQLite takes around 10% longer).

2. Reads are still slower, with the majority of the time spent in decoding. There are probably faster ways to convert a List<double> into a Uint8List, likely using the typed_data library; I haven't tested this yet (sketch below).
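
For reference, this is roughly the kind of typed_data conversion I have in mind, as a minimal, untested sketch (the helper name is made up):

    import 'dart:typed_data';

    // Pack a List<double> of embedding values into a Uint8List of float32
    // bytes, suitable for storing as a single BLOB.
    Uint8List packFloats(List<double> values) {
      final floats = Float32List.fromList(values); // copies, narrowing to float32
      return floats.buffer.asUint8List(floats.offsetInBytes, floats.lengthInBytes);
    }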

With larger batch sizes, memory usage becomes a problem with the current sqlite_async batch implementation. This can probably be improved in the library, but a workaround is to split the writes into smaller batches.

Overall, you should be able to get very similar performance between Isar and SQLite, but there is something to be said for Isar providing better performance for the default/obvious way of doing things in this case.


Hey, the writes were batched[1]. What is expensive here is the serialization.

[1]: https://github.com/ente-io/edge-db-benchmarks/blob/43273607d...

Edit: Updated the permalink


This is the smoking gun, unfortunately; I was right. This code is not correct. You're doing a single insert per transaction (which is the same thing you'd get if you hadn't used an explicit transaction at all). What you want is a single transaction with ALL of the inserts in it. Do it again with a single transaction and you'll find it massively improves the performance. This is a common pitfall for beginners.

Edit: You've changed the link after I wrote this reply. Did you fix the code? Is the blog post using the original code or this updated code?


You're asking too many questions. Didn't you read the title? SQLite isn't enough! /s


Just because you could increase the performance by putting everything into a transaction doesn't make it right to benchmark with.

Most people won't be inserting large amounts in a single transaction. For example, a webserver with many concurrent clients, each inserting.


My bad, I linked to the wrong portion of the code. Here's the updated link[1].

We are executing the entire write in a single batch query.

From what we observed, and have documented, it is the serialization that is causing the bottleneck, not the write itself.

Also, more than writes, it is the latency associated with deserializing the read data into models that pushed us to look at options outside SQLite.

[1]: https://github.com/ente-io/edge-db-benchmarks/blob/43273607d...


Why are you comparing DB engines if the bottleneck is the serialization? You're the one doing the serialization! You could store the embedding directly into SQLite columns if you wanted.


I agree that deserialization is not the responsibility of the database.

But as an end consumer, the overall throughput while using SQLite as the DB engine was not sufficient to serve our use case of reading 100k embeddings.


What people are saying is your premise is wrong, and you’re getting defensive.

Neither of those would be an interesting problem, except that your title is calling out a particular and beloved database as the culprit. Which feels a lot like defamation.

Even if deserialization is the bottleneck, bottlenecks are relative and don’t have to be singular. You can have multiple bottlenecks and being blind to one because of PEBCAK can still skew results and thus conclusions.

The people you’re responding to want you to fix your code and run it again. I think that’s reasonable. And maybe deserves a follow up post or a rework of the existing one.


Sorry if I sounded defensive!

I have nothing against SQLite, but I unfortunately haven't yet understood how to write and read Lists more efficiently. Once I do, I'll definitely post a follow up :)


Update: https://ente.io/blog/tech/sqlite-objectbox-isar#update

TL;DR: Removing Protobuf and directly serializing the list to a blob helped!


If the serialization is really the bottleneck, as you say, then you aren't serializing the data in the Isar evaluation. Which makes the comparison unfair.

Your comment reads a bit like saying you don't like a Toyota car for transporting apples, because it takes too long to bake a pie out of them first.

BTW, if Isar's data format really is "very close to in-memory representation", then why aren't you simply packing the 512 floats into an array (which I guess is how this data is represented anyhow) and using this contiguous block of memory directly as the serialized value?


Hey, since SQLite does not yet support lists, if there's a better way to serialize so that reads can be sped up, do let me know, it would help!


SQLite supports raw binary BLOBs, and I would store embeddings as a simple 512*4 = 2048-byte blob representing just the float32 embeddings. Then you can access the data from the buffer directly in Dart: https://api.dart.dev/stable/2.8.1/dart-typed_data/ByteBuffer...
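
Something like this, as a minimal sketch (it assumes the driver returns the BLOB as a Uint8List whose byte offset is 4-byte aligned; the helper name is made up):

    import 'dart:typed_data';

    // Reinterpret a 2048-byte blob as 512 float32 values, without copying.
    Float32List unpackFloats(Uint8List blob) {
      return blob.buffer.asFloat32List(blob.offsetInBytes, blob.lengthInBytes ~/ 4);
    }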


This[1] has improved the situation considerably!

```
[log] SQLite: 100000 embeddings inserted in 2635 ms
[log] SQLite: 100000 embeddings retrieved in 561 ms
```

Isar is still ~2x faster, but there's room for optimization.

Thanks a bunch for sharing this, I will update the post.

[1]: https://github.com/ente-io/edge-db-benchmarks/commit/51ec496...


Looking at the rest of your code, it looks like you are doing a lot of unnecessary processing on these embeddings.

They even seem to be doubly JSON-encoded at some point:

https://github.com/ente-io/clip-ggml/blob/main/lib/clip_ggml...

In my opinion, they should probably just be a memory buffer representing the raw floats all the way down, from the output of the model to the database. They should never be encoded, neither as JSON nor as a Dart List<double>.


Type-conversions with FFI turned out to be non-trivial, so `{"embedding":[...]}` was a way out at the cost of a small performance hit (when compared to the time spent on inference).

We'll take another look.


The result of clip.[...].call(...) is already a buffer, isn't it? You just have to not touch it at all.

And on the C++ side, remove the JSON encoding and just return a raw buffer.
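
On the Dart side, consuming a raw buffer can look roughly like this minimal sketch (I don't know the clip-ggml bindings, so the pointer and length here are hypothetical):

    import 'dart:ffi';
    import 'dart:typed_data';

    // View `length` float32 values written by the native side, then copy them
    // so the data outlives the native buffer. No JSON round trip involved.
    Float32List readEmbedding(Pointer<Float> out, int length) {
      return Float32List.fromList(out.asTypedList(length));
    }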


Wow, kudos for actually implementing the proposed fix and documenting it here so quickly!


I found this interesting and have been profiling to understand why the increase in performance was so significant. It looks to me like the main culprit was Protobufs + garbage collection. Serializing to protos performs a lot of allocations. Using just the Float32View skips all that.

```
I/scudo ( 641): Stats: SizeClassAllocator64: 572M mapped (0M rss) in 11986660 allocations; remains 257629
I/scudo ( 641): 00 ( 64): mapped: 1024K popped: 506106 pushed: 491660 inuse: 14446 total: 15044 rss: 0K releases: 0 last released: 0K region: 0x7ceae87000 (0x7ceae86000)
I/scudo ( 641): 01 ( 32): mapped: 1024K popped: 92137 pushed: 73047 inuse: 19090 total: 26708 rss: 0K releases: 0 last released: 0K region: 0x7cfae8c000 (0x7cfae86000)
```

I think this is because, during the proto encoding/decoding stage, the protobuf lib ended up creating a bunch of objects to support the process:

``` "Class","Library","Total Instances","Total Size","Total Dart Heap Size","Total External Size","New Space Instances","New Space Size","New Space Dart Heap Size","New Space External Size","Old Space Instances","Old Space Size","Old Space Dart Heap Size","Old Space External Size" _FieldSet,package:protobuf/protobuf.dart,100010,4800480,4800480,0,0,0,0,0,100010,4800480,4800480,0 PbList,package:protobuf/protobuf.dart,108536,3473152,3473152,0,0,0,0,0,108536,3473152,3473152,0 Embedding,package:edge_db_benchmarks/models/embedding.dart,108535,3473120,3473120,0,0,0,0,0,108535,3473120,3473120,0 EmbeddingProto,package:edge_db_benchmarks/models/embedding.pb.dart,100010,1600160,1600160,0,0,0,0,0,100010,1600160,1600160,0 ```

What's missing here is that these have to be copied over to the database isolate as well.


Maybe that's an opportunity for a pull request to the dart protobuf library?


Are batch operations automatically atomic in SQLite? The SQLite documentation makes no mention of batch operations.

Looking at the library, it seems it's just using prepared statements outside of a transaction...

https://github.com/powersync-ja/sqlite_async.dart/blob/f994e...

Seems the other commenter is correct that wrapping that line in a tx will fix it.


https://github.com/powersync-ja/sqlite_async.dart/blob/f994e...

https://github.com/powersync-ja/sqlite_async.dart/blob/f994e...

I think you may be looking too deeply. It does look like that object will wrap in a tx.


Ah, yeah I think you're right.


Within the DB world, “batch” usually means a sequence of queries in a single transaction. In software more generally, a batch is a clump of tasks run together, with some sort of separation (logical, temporal, ...) from other clumps.

If you use one definition while everyone else is using the other, there’s going to be a loud and confused argument.

Since we are talking about databases, we’d all better be using the former, but I suspect maybe the author is not.


> deserializing the read data into models

What does this have to do with SQLite?


You have changed the code snippets in your link.

The original link highlighted this snippet:

  Future<void> insertEmbedding(EmbeddingProto embedding) async {
      final db = await _database;
      await db.writeTransaction((tx) async {
        await tx.execute('INSERT INTO $tableName ($columnEmbedding) values(?)',
            [embedding.writeToBuffer()]);
      });
  }
Now highlighted:

  Future<void> insertMultipleEmbeddings(List<EmbeddingProto> embeddings) async {
    final db = await _database;
    final inputs = embeddings.map((e) => [e.writeToBuffer()]).toList();
    await db.executeBatch(
        'INSERT INTO $tableName ($columnEmbedding) values(?)', inputs);
  }
Did you post the wrong link the first time and correct your mistake? Or did you change the snippet because @electroly was right about the code being inefficient?


The code was always calling the latter[1], I had pasted the incorrect link.

[1]: https://github.com/ente-io/edge-db-benchmarks/blob/43273607d...


Thank you for pointing that out! It does make more sense, too. :)


I second it. I'm regularly using SQLite to look at and format data across millions of rows; it's much faster than that, but you have to use transactions and optimize queries. It sounds to me like they didn't read the manual and are blaming the tool.


Hello, OP here!

The code we used for running the benchmarks is available here[1].

If you have suggestions on how we could speed up the deserialization during reads, please do share. We'd be happy to go back to SQLite.

[1]: https://github.com/ente-io/edge-db-benchmarks


The embeddings are just a flat array of floating point numbers. Try typed arrays/buffers to read and write the typed arrays directly as BLOB data. Protobuf and JSON and other serialization schemes are way overkill for something that should be as simple as a memcpy.


Yes, this[1] did the job, thank you!

[1]: https://news.ycombinator.com/item?id=39291411


Full table scans?


This article doesn't describe any of their methods.

> It was clear that serialization was the culprit, so we switched to Protobuf instead of stringifying the embeddings. This resulted in considerable gains. For 100,000 embeddings, writing took 6.2 seconds, reading 12.6 seconds and the disk space consumed dropped to 440 MB.

Doing what though? What was their schema like? What are their indices? Primary keys? Column types? Query structure?

It's lacking so much information that I find this article neither useful nor interesting. Which is a bit disappointing, since I was hoping to learn something relevant.


Look here: https://github.com/ente-io/edge-db-benchmarks/blob/43273607d...

  SELECT $columnEmbedding FROM $tableName LIMIT 10000 OFFSET $offset
They're not using any indexes, because they're doing offset pagination [1] in a loop. And offset pagination is causing them to scan 550K total rows [2] instead of 100K.

OP, change your code. You're probably wrong about SQLite being slower than Isar, or at least it must be very close.
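
A keyset-style version of the same loop would look roughly like this (a minimal sketch assuming sqlite_async's getAll and an integer id primary key; untested):

    import 'package:sqlite_async/sqlite_async.dart';

    // Page by primary key instead of OFFSET, so each query seeks straight to
    // the next id rather than re-scanning all of the skipped rows.
    Future<void> readAllPaged(SqliteDatabase db) async {
      var lastId = 0;
      while (true) {
        final rows = await db.getAll(
            'SELECT id, embedding FROM embeddings '
            'WHERE id > ? ORDER BY id LIMIT 10000',
            [lastId]);
        if (rows.isEmpty) break;
        lastId = rows.last['id'] as int;
        // ... deserialize this page of rows here ...
      }
    }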

[1] https://use-the-index-luke.com/no-offset

[2] 10K + 20K + 30K + 40K + 50K + 60K + 70K + 80K + 90K + 100K = 550K


Nice catch, thank you!

I was playing around with the offsets incorrectly, while attempting to reduce the memory footprint on a low-end Android device.

I've updated the code[1] to read all embeddings in one shot, and I'm now measuring the time to read and the time to deserialize separately.

On an iPhone simulator running on an Apple M3 Pro, this is what it says:

```
[log] SqliteDB fetch all took: 1572 ms
[log] SqliteDB deserialization took: 4268 ms
...
[log] Isar: 100000 embeddings retrieved in 616 ms
```

[1]: https://github.com/ente-io/edge-db-benchmarks/commit/e02c3d3...


Sorry, we should have linked to the DB Schema.

Here you go: https://github.com/ente-io/edge-db-benchmarks/blob/43273607d...


Autoincrement is usually not needed in SQLite, and it adds a lot of overhead to every insertion, since SQLite needs to read and update the sqlite_sequence table on each insert.

https://www.sqlite.org/autoinc.html


One obvious explanation is the lack of indexing on the column you're searching over. It's currently an opaque blob, but a design that actually uses the database's features will serve you much better.

What your article is really saying is, "SQLite isn't a good fit for our data that we didn't want to serialize differently to be efficient with SQLite." And that's fine, it looks like you found the right tool for your data and requirements, but the title is click-baity and isn't really about SQLite.


I would have appreciated some more details on the benchmark. SQLite has notoriously slow writes in the default journal mode, but proper configuration and WAL/WAL2 mode[0] should be a starting point for any comparison.

It's the first time I've heard of Isar[1] though. I'm always surprised at how many solid-but-underused Apache projects are out there, chugging along.

[0] https://phiresky.github.io/blog/2020/sqlite-performance-tuni...

[1] https://isar.dev/
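
For anyone reproducing the numbers, a typical tuning pass looks something like the sketch below. These are standard SQLite pragmas, but the `runSql` callback is just a stand-in for whatever raw-SQL hook the driver exposes (and, as noted downthread, sqlite_async already enables WAL by default):

    // Common starting points for write-heavy SQLite workloads. The values are
    // illustrative, not a recommendation for every app.
    Future<void> tune(Future<void> Function(String sql) runSql) async {
      await runSql('PRAGMA journal_mode = WAL');   // readers don't block the writer
      await runSql('PRAGMA synchronous = NORMAL'); // fewer fsyncs; still safe in WAL mode
      await runSql('PRAGMA cache_size = -20000');  // ~20 MB page cache
    }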


Agreed. From the look of it, a few things: WAL is not enabled, and insertion is not wrapped in transactions. I did a benchmark a while ago, and if everything is done right it is much faster: https://dflat.io/benchmark/

Also, even if it is done right, inserting 100,000 512-byte objects will be around 50MiB, which is the point that triggers checkpointing the WAL file into the main db, and that can further slow down SQLite.


WAL mode does introduce a number of important usage limitations.

https://www.vldb.org/pvldb/vol15/p3535-gaffney.pdf

“To accelerate searching the WAL, SQLite creates a WAL index in shared memory. This improves the performance of read transactions, but the use of shared memory requires that all readers must be on the same machine [and OS instance]. Thus, WAL mode does not work on a network filesystem.”

“It is not possible to change the page size after entering WAL mode.”

“In addition, WAL mode comes with the added complexity of checkpoint operations and additional files to store the WAL and the WAL index.”

https://www.sqlite.org/lang_attach.html

SQLite does not guarantee ACID consistency with ATTACH DATABASE in WAL mode: “Transactions involving multiple attached databases are atomic, assuming that the main database is not ":memory:" and the journal_mode is not WAL. If the main database is ":memory:" or if the journal_mode is WAL, then transactions continue to be atomic within each individual database file. But if the host computer crashes in the middle of a COMMIT where two or more database files are updated, some of those files might get the changes where others might not.”


Isn't WAL more about concurrent performance? I'd be surprised if it's all that much faster when there is only a single thread doing anything.

Still don't think this is SQLite's fault. Serialization is something outside SQLite and a cost you have to pay no matter what you are using.


More than writes, reads are what we struggled with.

Within reads, the most expensive part was the reconstruction of models from the rows returned by SQLite.

While deserialization is not the responsibility of the database, the overall throughput with SQLite was too low for it to serve our use case.


https://github.com/ente-io/edge-db-benchmarks/blob/43273607d...

If this is how you are doing reads, this is your problem. Limit/offset reads are slow.

You'll get much better results if you do a proper pagination implementation.

This article describes the problem and solution that'd give you much better read results.

https://use-the-index-luke.com/sql/partial-results/fetch-nex...

(don't forget page 2 https://use-the-index-luke.com/sql/partial-results/window-fu... )


Hey yes, thank you for pointing it out!

Removing the offsets has unfortunately not sped things up, since it's the step after reading the entries (deserialization) that is the bottleneck[1].

Since SQLite does not support Lists, Protobufs came across as the best way to serialize the data at hand.

If there's a more native way to solve this problem within SQLite, please let me know!

[1]: https://news.ycombinator.com/item?id=39290784


Yeah code snippets for the benchmark would also be useful.



You don't seem to be using the WAL or any other SQLite optimizations.

https://github.com/ente-io/edge-db-benchmarks/blob/main/lib/...

Any reason for that?


My guess is because they are an SQLite beginner, but for engagement purposes on their blog they had to write something that sounds very authoritative, with a title like "SQLite Isn't Enough".


That's a bit cynical. They may or may not be a newbie, but there is always something for everyone to learn, and saying they did this for engagement is something you don't know.


The sqlite_async lib uses WAL mode by default.



Ha, I guess this title is a play on my submission yesterday “Postgres Is Enough”:

https://news.ycombinator.com/item?id=39273954


I'm sure the database is nice for other reasons, but also without really understanding the problem it sounds like you could represent two separate things here: (1) metadata about photos and their metadata sync state, (2) embeddings.

You could store a dense mapping to integers in (1) and then represent (2) as a separate file that is literally just packed floats. Photo number n has its embeddings start at file offset n * 512 * sizeof(float). Updating any given photo's embeddings is an easy seek+write.

Given that the use case of similarity search is to stream through the embeddings and not any complex seeking/rewriting/indexing/etc, feels like a relatively simple bit of code.
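
A minimal sketch of the read path under that layout, using dart:io (the write side is the mirror image: open writable without truncating, setPosition, writeFrom). The helper name and path are made up:

    import 'dart:io';
    import 'dart:typed_data';

    const dims = 512;
    const recordBytes = dims * 4; // 512 float32 values per photo

    // Read photo n's embedding from a file of densely packed float32 records,
    // stored at byte offset n * recordBytes.
    Future<Float32List> readEmbedding(String path, int n) async {
      final raf = await File(path).open(); // FileMode.read
      await raf.setPosition(n * recordBytes);
      final bytes = await raf.read(recordBytes);
      await raf.close();
      return bytes.buffer.asFloat32List(bytes.offsetInBytes, dims);
    }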

The post talks about serializing 100k photos. With 32-bit floats, each embedding vector is 4*512 = 2048 bytes, so that's ~205 MB of data, much less than the >500 MB quoted in the post.


It seems like the author would've had better luck with https://pub.dev/packages/sqlite_async, which appears to be made for Flutter users who don't necessarily know much about configuring SQLite.


Hey, we're already using sqlite_async, and yes it does a nice job of abstracting away the complexity introduced by isolates.


I'd be curious to know the results you get if you forgo serialization entirely and save the embeddings in a table with 512 columns, one for each embedding value. Some rough calculation, assuming you store doubles, seems to show that the total db size should be less than 500 MB.


> 500ms is an acceptable latency for a search that runs offline without network calls.

Definitely disagree, having fast search makes a huge difference. Apple Photos' search is fast. Especially since CLIP embeddings aren't perfect, you may need to try a few different keywords.

> For 100,000 embeddings, writing took 6.2 seconds, reading 12.6 seconds and the disk space consumed dropped to 440 MB.

Something sounds so wrong here: I read/write 100-500 GiB of data from Python, and it's much faster than this.

In fact I'm seeing ~1.3s to write, and ~700ms to read from trivial Python:

    import pickle
    import sqlite3

    import numpy as np

    IMAGES = 100_000
    EMBEDDING_DIMENSIONS = 1024

    ### write
    db = sqlite3.connect("images.db")
    db.execute("CREATE TABLE IF NOT EXISTS image_embeddings (id INTEGER PRIMARY KEY, embedding BLOB)")

    # generate synthetic embeddings
    embeddings = np.random.normal(size=(IMAGES, EMBEDDING_DIMENSIONS))  # 1.61 s

    # serialize embeddings for storage
    serialized_embeddings = [(pickle.dumps(embedding),) for embedding in embeddings]  # 626 ms

    # insert into db
    db.executemany("INSERT INTO image_embeddings (embedding) VALUES (?)", serialized_embeddings)  # 679 ms
    db.commit()  # 95.3 ms
    db.close()

    #### read
    db = sqlite3.connect("images.db")
    serialized_embeddings = db.execute("SELECT embedding FROM image_embeddings").fetchall()  # 370 ms
    embeddings = [pickle.loads(x[0]) for x in serialized_embeddings]  # 322 ms


For reads, from the original code, part of the problem is that they are doing limit/offset (rather than a fetch-all). Even if they didn't want to read everything, it'd be a lot faster to remove the offset and instead query based on the id. Limit/offset often results in really expensive query plans.


So the problem is with the Dart library sqlite_async.dart [1] and not with SQLite itself.

Kind of a misleading title, TBH.

[1] https://github.com/powersync-ja/sqlite_async.dart

Reference to an HN comment that clarifies this - https://news.ycombinator.com/item?id=39290411


> What I thought would be a short fairytale, did not have a happy ending. Writing and reading 100,000 serialized embeddings took ~19 seconds.

WTF? Are you running a separate transaction per embedding? Using a heavyweight ORM incorrectly? SQLite can handle inserts very, very fast; but if you put lots of layers on top of it (like an ORM), they usually slow things down.

> It was clear that serialization was the culprit, so we switched to Protobuf instead of stringifying the embeddings. This resulted in considerable gains.

Why are you serializing and then inserting into a SQL database? You should have columns for your fields, and directly query on those columns.

Anyway, I think I understand why this article is (unfairly) flagged: You are criticizing SQLite, and clearly do not know how to program with a SQL database.


Thanks for sharing this! This comparison is quite helpful as I was facing write speed issues for my desktop app that uses SQLite (just text, not embeddings).

For working with embeddings on desktop, would you recommend using a vector DB? Or can we get away with SQLite (or Isar)?


> On the brighter side of things, the number of embeddings we've to query over is small. An average user has < 100,000 photos, and running a brute force search over 100,000 items on midrange mobile devices takes < 500ms.

So your embeddings are an index? That's kind of the point of an index.


... in our one case. It probably works fine for you.



