Doesn't pass the sniff test. 19 seconds sounds like the performance of 100,000 separate transactions. I bet a crisp $10 bill they've forgotten to wrap all of this in a transaction. In SQLite it makes a huge difference. I struggle to think of a way that you could make SQLite take 19 seconds to write and read 100,000 rows otherwise.
Way back in the day - a decade or more ago, when I was working on Aperture - we increased the speed of the database by dropping Core Data and moving to embedded SQLite, tuned for the application.
We had a test database of 1,000,000 images, each with several hundred metadata items - EXIF and the like - all of which could be queried. We had a target time of 2 seconds to return any query. Most of the time, we hit that mark using the raw SQLite approach.
So this was:
- On far older hardware; we still used spinning rust, for $deity's sake...
- Seems like much more data
- Seems like much more metadata
And it was still very fast (relatively speaking - a couple of seconds is still a long wait-time, and then there was all the UI overhead on top).
I think the author might be missing a trick or two when running SQLite. As mentioned in their write-ups, we found it faster than the filesystem as a storage medium...
There isn't anything obviously wrong with how the lib is used - the writes are properly batched in a single transaction, but the performance is much slower than expected.
From a quick look, garbage collection appears to be a big overhead - it could be related to how data is passed between isolates. In this case, it may actually be faster to do writes in smaller batches, since the current approach may use a lot of memory for the batch. I'll have to test further to see whether that helps, or whether there are other optimizations for this case.
From some further testing:
1. With a batch size of 10k, there is a negligible difference in write performance on my machine (with SQLite taking around 10% longer).
2. Reads are still slower, with the majority of the time spent in decoding. There are probably faster ways to convert a List<double> into a Uint8List - likely using the typed_data library (see the sketch after this comment). I haven't tested this yet.
With larger batch sizes, memory usage becomes a problem with the current sqlite_async batch implementation. That can probably be improved in the library, but a workaround is to split the writes into smaller batches.
Overall, you should be able to get very similar performance between Isar and SQLite, but there is something to be said for Isar providing better performance for the default/obvious way of doing things in this case.
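On the List<double> to Uint8List point: a minimal sketch using dart:typed_data, assuming the embeddings end up stored as raw float32 BLOBs (the function names here are just illustrative):
```
import 'dart:typed_data';

// Pack 512 doubles into 2048 bytes of float32 data, suitable for a BLOB column.
Uint8List encodeEmbedding(List<double> embedding) =>
    Float32List.fromList(embedding).buffer.asUint8List();

// View the BLOB bytes as floats again, without decoding element by element.
// (Assumes the blob starts at a 4-byte-aligned offset; copy it first if not.)
List<double> decodeEmbedding(Uint8List blob) =>
    blob.buffer.asFloat32List(blob.offsetInBytes, blob.lengthInBytes ~/ 4);
```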
This is the smoking gun, unfortunately; I was right. This code is not correct. You're doing a single insert per transaction (which is the same thing you'd get if you hadn't used an explicit transaction at all). What you want is a single transaction with ALL of the inserts in it. Do it again with a single transaction and you'll find it massively improves the performance. This is a common pitfall for beginners.
Edit: You've changed the link after I wrote this reply. Did you fix the code? Is the blog post using the original code or this updated code?
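For reference, a hedged sketch of the suggested fix - a single write transaction covering all of the inserts - using the same sqlite_async-style calls that appear in the snippets quoted further down (table and column names are the original placeholders):
```
Future<void> insertAllEmbeddings(List<EmbeddingProto> embeddings) async {
  final db = await _database;
  // One transaction for the whole batch, instead of one transaction per row.
  await db.writeTransaction((tx) async {
    for (final e in embeddings) {
      await tx.execute('INSERT INTO $tableName ($columnEmbedding) values(?)',
          [e.writeToBuffer()]);
    }
  });
}
```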
Why are you comparing DB engines if the bottleneck is the serialization? You're the one doing the serialization! You could store the embedding directly into SQLite columns if you wanted.
I agree that deserialization is not the responsibility of the database.
But as an end consumer, the overall throughput while using SQLite as the DB engine was not sufficient to serve our use case of reading 100k embeddings.
What people are saying is your premise is wrong, and you’re getting defensive.
Neither of those would be an interesting problem, except that your title calls out a particular and beloved database as the culprit. Which feels a lot like defamation.
Even if deserialization is the bottleneck, bottlenecks are relative and don’t have to be singular. You can have multiple bottlenecks and being blind to one because of PEBCAK can still skew results and thus conclusions.
The people you’re responding to want you to fix your code and run it again. I think that’s reasonable. And maybe deserves a follow up post or a rework of the existing one.
I have nothing against SQLite, but I unfortunately haven't yet understood how to write and read Lists more efficiently. Once I do, I'll definitely post a follow up :)
If the serialization is really the bottleneck, as you say, then you aren't serializing the data in the Isar evaluation. Which makes the comparison unfair.
Your comment reads a bit like saying you don't like a Toyota car for transporting apples, because it takes too long to bake a pie out of them first.
BTW, if Isar's data format really is "very close to in-memory representation", then why aren't you simply packing the 512 floats into an array (which I guess is how this data is represented anyhow) and using this contiguous block of memory directly as the serialized value?
SQLite supports raw binary BLOBs, and I would store embeddings as a simple 512*4 = 2048 bytes blob representing just the float32 embeddings. Then you can access the data from the buffer directly in dart: https://api.dart.dev/stable/2.8.1/dart-typed_data/ByteBuffer...
In my opinion, they should probably just be a memory buffer representing the raw floats all the way down: from the output of the model to the database. They should never be encoded, neither in json, nor as a dart List<double>.
Type-conversions with FFI turned out to be non-trivial, so `{"embedding":[...]}` was a way out at the cost of a small performance hit (when compared to the time spent on inference).
I found this interesting and have been profiling to understand why the increase in performance was so significant. It looks to me that the main culprit was Protobufs + Garbage collection. Serializing to protos performs a lot of allocations. Using just the Float32View skips all that.
I think this is because during the proto encoding/decoding stage, the protobuf lib ended up creating a bunch of objects to support the process:
```
"Class","Library","Total Instances","Total Size","Total Dart Heap Size","Total External Size","New Space Instances","New Space Size","New Space Dart Heap Size","New Space External Size","Old Space Instances","Old Space Size","Old Space Dart Heap Size","Old Space External Size"
_FieldSet,package:protobuf/protobuf.dart,100010,4800480,4800480,0,0,0,0,0,100010,4800480,4800480,0
PbList,package:protobuf/protobuf.dart,108536,3473152,3473152,0,0,0,0,0,108536,3473152,3473152,0
Embedding,package:edge_db_benchmarks/models/embedding.dart,108535,3473120,3473120,0,0,0,0,0,108535,3473120,3473120,0
EmbeddingProto,package:edge_db_benchmarks/models/embedding.pb.dart,100010,1600160,1600160,0,0,0,0,0,100010,1600160,1600160,0
```
What's missing here is that these have to be copied over to the database isolate as well.
Within the DB world, “batch” usually means a sequence of queries in a single transaction. In software more generally, a batch is any clump of tasks run together, separated from other clumps by some criterion - logical, time-based, and so on.
If you use one definition while everyone else is using the other, there’s going to be a loud and confused argument.
Since we are talking about databases, we’d all better be using the former, but I suspect maybe the author is not.
```
Future<void> insertEmbedding(EmbeddingProto embedding) async {
  final db = await _database;
  await db.writeTransaction((tx) async {
    await tx.execute('INSERT INTO $tableName ($columnEmbedding) values(?)',
        [embedding.writeToBuffer()]);
  });
}
```
Now highlighted:
```
Future<void> insertMultipleEmbeddings(List<EmbeddingProto> embeddings) async {
  final db = await _database;
  final inputs = embeddings.map((e) => [e.writeToBuffer()]).toList();
  await db.executeBatch(
      'INSERT INTO $tableName ($columnEmbedding) values(?)', inputs);
}
```
Did you post the wrong link the first time and corrected your mistake? Or did you change the snippet because @electroly was right about the code being inefficient?
I second it. I'm regularly using SQLite to look at and format data across millions of rows, and it's much faster than that - but you have to use transactions and optimize queries. It sounds to me like they didn't read the manual and are blaming the tool.
The embeddings are just a flat array of floating point numbers. Try typed arrays/buffers to read and write the typed arrays directly as BLOB data. Protobuf and JSON and other serialization schemes are way overkill for something that should be as simple as a memcpy.
This article doesn't describe any of their methods.
> It was clear that serialization was the culprit, so we switched to Protobuf instead of stringifying the embeddings. This resulted in considerable gains. For 100,000 embeddings, writing took 6.2 seconds, reading 12.6 seconds and the disk space consumed dropped to 440 MB.
Doing what though? What was their schema like? What are their indices? Primary keys? Column types? Query structure?
There's so little information that I find this article neither useful nor interesting. Which is a bit disappointing, since I was hoping to learn something relevant.
SELECT $columnEmbedding FROM $tableName LIMIT 10000 OFFSET $offset
They're not using any indexes, because they're doing offset pagination [1] in a loop. And offset pagination is causing them to scan 550K total rows [2] instead of 100K.
OP, change your code. You're probably wrong about SQLite being slower than Isar, or at least it must be very close.
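A hedged sketch of the keyset-pagination alternative, assuming an integer primary key `id` and a sqlite_async-style `getAll` (the table/column names mirror the query quoted above):
```
Future<void> readEmbeddingsInPages() async {
  final db = await _database;
  var lastId = 0;
  while (true) {
    // Seek by primary key instead of OFFSET, so each page starts at the right
    // row directly rather than re-scanning every row that came before it.
    final rows = await db.getAll(
        'SELECT id, $columnEmbedding FROM $tableName '
        'WHERE id > ? ORDER BY id LIMIT 10000',
        [lastId]);
    if (rows.isEmpty) break;
    lastId = rows.last['id'] as int;
    // ... decode this page of embeddings here ...
  }
}
```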
Autoincrement is usually not needed in SQLite, and adds a lot of overhead to every insertion since it needs to read and update the sqlite_sequence table on every insert.
One obvious explanation is the lack of indexing on the column you're searching over. It's currently an opaque blob, but a design that actually uses the database's features will serve you much better.
What your article is really saying is, "SQLite isn't a good fit for our data that we didn't want to serialize differently to be efficient with SQLite." And that's fine, it looks like you found the right tool for your data and requirements, but the title is click-baity and isn't really about SQLite.
I would have appreciated some more details on the benchmark. SQLite has notoriously slow writes in the default journal mode, but proper configuration and WAL/WAL2 mode[0] should be a starting point for any comparison.
It's the first time I've heard of Isar[1] though. I'm always surprised at how many solid-but-underused Apache projects are out there, chugging along.
Agreed. From the look of it, a few things: WAL is not enabled, and the inserts are not wrapped in a transaction. I did a benchmark a while ago and, if everything is done right, it is much faster: https://dflat.io/benchmark/
Also, even if it is done right, inserting 100,000 512-byte objects will be around 50MiB, which is enough to trigger checkpointing the WAL file into the main db, and that can further slow down SQLite.
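For completeness, a minimal sketch of the configuration being suggested (WAL plus relaxed sync), shown here with the sqlite3 Dart package - an assumption, since the driver used in the benchmark may expose this differently:
```
import 'package:sqlite3/sqlite3.dart';

void configureDatabase(String path) {
  final db = sqlite3.open(path);
  // Write-ahead logging: readers don't block the writer and commits are cheaper.
  db.execute('PRAGMA journal_mode=WAL;');
  // NORMAL sync in WAL mode keeps the file consistent across crashes while
  // avoiding an fsync on every commit.
  db.execute('PRAGMA synchronous=NORMAL;');
  db.dispose();
}
```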
“To accelerate searching the WAL, SQLite creates a WAL index in shared memory. This improves the performance of read transactions, but the use of shared memory requires that all readers must be on the same machine [and OS instance]. Thus, WAL mode does not work on a network filesystem.”
“It is not possible to change the page size after entering WAL mode.”
“In addition, WAL mode comes with the added complexity of checkpoint operations and additional files to store the WAL and the WAL index.”
SQLite does not guarantee ACID consistency with ATTACH DATABASE in WAL mode: “Transactions involving multiple attached databases are atomic, assuming that the main database is not ":memory:" and the journal_mode is not WAL. If the main database is ":memory:" or if the journal_mode is WAL, then transactions continue to be atomic within each individual database file. But if the host computer crashes in the middle of a COMMIT where two or more database files are updated, some of those files might get the changes where others might not.”
My guess is because they are an SQLite beginner, but for engagement purposes on their blog they had to write something that sounds very authoritative, with a title like "SQLite Isn't Enough".
That's a bit cynical. They may or may not be a newbie, but there is always something for everyone to learn, and saying they did this for engagement is something you don't know.
I'm sure the database is nice for other reasons, but also without really understanding the problem it sounds like you could represent two separate things here: (1) metadata about photos and their metadata sync state, (2) embeddings.
You could store a dense mapping to integers in (1) and then represent (2) as a separate file that is literally just packed floats. Photo number n has its embeddings start at file offset n * 512 * sizeof(float). Updating any given photo's embeddings is an easy seek+write.
Given that the use case of similarity search is to stream through the embeddings and not any complex seeking/rewriting/indexing/etc, feels like a relatively simple bit of code.
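A minimal sketch of that layout using dart:io (the path and helper names here are made up), writing the whole packed file once and then reading photo n's embedding by offset:
```
import 'dart:io';
import 'dart:typed_data';

const dims = 512;
const recordSize = dims * 4; // one float32 embedding per record

// Pack all embeddings into one contiguous float32 file, in photo order.
// Each inner list must have exactly `dims` elements.
Future<void> writePackedEmbeddings(String path, List<List<double>> all) async {
  final packed = Float32List(all.length * dims);
  for (var n = 0; n < all.length; n++) {
    packed.setAll(n * dims, all[n]);
  }
  await File(path).writeAsBytes(packed.buffer.asUint8List(), flush: true);
}

// Read photo n's embedding with a single seek + read.
Future<Float32List> readEmbedding(String path, int n) async {
  final raf = await File(path).open(mode: FileMode.read);
  try {
    await raf.setPosition(n * recordSize);
    final bytes = await raf.read(recordSize);
    return Float32List.fromList(
        bytes.buffer.asFloat32List(bytes.offsetInBytes, dims));
  } finally {
    await raf.close();
  }
}
```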
The post talks about serializing 100k photos. With 32-bit floats, each embedding vector is 4*512 = 2048 bytes, so that's ~205 MB of data, much less than the >500 MB quoted in the post.
It seems like the author would've had better luck with https://pub.dev/packages/sqlite_async, which appears to be made for Flutter users who don't necessarily know much about configuring SQLite.
I'd be curious to know the results you get if you forgo serialization entirely and save the embeddings in a table with 512 columns, one per dimension. A rough calculation, assuming you store doubles (100,000 rows * 512 * 8 bytes ≈ 410 MB), suggests the total db size should be less than 500 MB.
> 500ms is an acceptable latency for a search that runs offline without network calls.
Definitely disagree - having fast search makes a huge difference. Apple Photos' search is fast. Especially since CLIP embeddings aren't perfect, you may need to try a few different keywords.
> For 100,000 embeddings, writing took 6.2 seconds, reading 12.6 seconds and the disk space consumed dropped to 440 MB.
Something sounds very wrong here; I read/write 100-500 GiB of data from Python, and it's much faster than this.
In fact I'm seeing ~1.3s to write, and ~700ms to read from trivial Python:
```
import pickle
import sqlite3
import numpy as np

IMAGES = 100_000
EMBEDDING_DIMENSIONS = 1024

### write
db = sqlite3.connect("images.db")
db.execute("CREATE TABLE IF NOT EXISTS image_embeddings (id INTEGER PRIMARY KEY, embedding BLOB)")
# generate synthetic embeddings
embeddings = np.random.normal(size=(IMAGES, EMBEDDING_DIMENSIONS))  # 1.61 s
# serialize embeddings for storage
serialized_embeddings = [(pickle.dumps(embedding),) for embedding in embeddings]  # 626 ms
# insert into db
db.executemany("INSERT INTO image_embeddings (embedding) VALUES (?)", serialized_embeddings)  # 679 ms
db.commit()  # 95.3 ms
db.close()

### read
db = sqlite3.connect("images.db")
serialized_embeddings = db.execute("SELECT embedding FROM image_embeddings").fetchall()  # 370 ms
embeddings = [pickle.loads(x[0]) for x in serialized_embeddings]  # 322 ms
```
For reads from the original code, part of the problem is that they are doing limit/offset rather than a single fetch-all. Even if they didn't want to read everything at once, it'd be a lot faster to remove the offset and instead query based on the id. Limit/offset often results in really expensive query plans.
> What I thought would be a short fairytale, did not have a happy ending. Writing and reading 100,000 serialized embeddings took ~19 seconds.
WTF? Are you running a separate transaction per embedding? Using a heavyweight ORM incorrectly? SQLite can handle inserts very, very fast; but if you put lots of layers on top of it (like an ORM), they usually slow things down.
> It was clear that serialization was the culprit, so we switched to Protobuf instead of stringifying the embeddings. This resulted in considerable gains.
Why are you serializing and then inserting into a SQL database? You should have columns for your fields, and directly query on those columns.
Anyway, I think I understand why this article is (unfairly) flagged: You are criticizing SQLite, and clearly do not know how to program with a SQL database.
Thanks for sharing this! This comparison is quite helpful as I was facing write speed issues for my desktop app that uses SQLite (just text, not embeddings).
For working with embeddings on desktop, would you recommend using a vector DB? Or can we get away with SQLite (or Isar)?
> On the brighter side of things, the number of embeddings we've to query over is small. An average user has < 100,000 photos, and running a brute force search over 100,000 items on midrange mobile devices takes < 500ms.
So your embeddings are an index? That's kind of the point of an index.