Our seed round was made up entirely of SAFEs, so VCs did not have the power to force us to do anything.
The sentence in the blog post is a tad misleading. I suspect François is not really talking about VCs that had already invested in Quickwit, but about the usual flow of other VCs who contacted us to learn about the company and be part of our eventual series A.
It just generally felt like we were "at a crossing".
Thanks for the clarification, and sorry for jumping to an incorrect conclusion based on vague wording. (I would edit my comment accordingly but I can't anymore.)
Developer of tantivy chiming in! (I hope that's ok) Database performance is a space where there are a lot of lies and bullshit, so you are 100% right to be suspicious.
I don't know SeekStorm's team and I did not dig much into the details, but my impression so far is that their benchmark results are fair. At least I see no reason not to trust them.
- it does not do vector search. It can rank docs using BM25, but usually people just want to sort by timestamp.
- it does not use an SSD cache. Quickwit reads directly from object storage.
- it is append-only (you can't modify documents)
- it scales really well and typically shines in the 1 TB to 100 PB range
- it has an Elasticsearch-compatible API (quick sketch below)
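To give a feel for that last point, here is a rough, untested sketch of querying the Elasticsearch-compatible endpoint from Python. The host, index id, and field names are placeholders, and Quickwit only supports a subset of the ES query DSL, so treat this as illustrative rather than copy-paste ready:

```python
# Sketch: query a Quickwit index through its Elasticsearch-compatible endpoint.
# Index and field names are made up for illustration.
import requests

QUICKWIT_URL = "http://localhost:7280"  # Quickwit's default REST port
INDEX_ID = "app-logs"                   # hypothetical index

query = {
    "query": {"match": {"message": "timeout"}},
    # Most log users just want the newest matches first.
    "sort": [{"timestamp": {"order": "desc"}}],
    "size": 20,
}

resp = requests.post(
    f"{QUICKWIT_URL}/api/v1/_elastic/{INDEX_ID}/_search",
    json=query,
    timeout=30,
)
resp.raise_for_status()

# The response follows Elasticsearch conventions (hits.hits[]._source),
# though the exact fields returned may differ between versions.
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])
```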
Exactly! Which is again one of the reasons it's confusing that people apply full text search technology to logs. Machine logs are quite a lot less entropic than human prose, and therefore can be compressed a whole lot better. A corollary is that because of the redundancy in the data, "grepping" the compressed form can be very fast, so long as the compression scheme allows it.
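To put a rough number on that redundancy, here is a quick sketch (it assumes the third-party `zstandard` Python package, and the log format is invented):

```python
# Sketch: repetitive machine logs compress far better than prose because the
# per-line entropy is low. Requires the `zstandard` package.
import random
import zstandard

random.seed(42)

# Synthetic access-log-ish lines: mostly fixed structure, a few varying fields.
log_lines = "".join(
    f"2024-06-01T12:{i % 60:02d}:00Z INFO http GET /api/v1/orders/{random.randint(1, 9999)} "
    f"status=200 latency_ms={random.randint(1, 500)}\n"
    for i in range(100_000)
).encode()

compressed = zstandard.ZstdCompressor(level=3).compress(log_lines)
print(f"raw: {len(log_lines):,} bytes, compressed: {len(compressed):,} bytes, "
      f"ratio: {len(log_lines) / len(compressed):.1f}x")
```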
If the query infrastructure operating on this compressed data is itself able to store intermediate results, then we've killed two birds with one stone, because we've also gotten rid of the restrictive query language. That's how cascading mapreduce jobs (or Spark) do it, allowing users to perform complex analyses that are entirely off the table if they're restricted to the Lucene query language. Imagine a world where your SQL database was one giant table and only allowed you to query it with SELECT. That's pretty limiting, right?
So as a technology demonstration of Quickwit this seems really cool--it can clearly scale!--but it's kind of also an indictment of Binance (and all the other companies doing ELKish things out there).
Quickwit (like Elasticsearch/OpenSearch) stores your data compressed with ZSTD in a row store, builds a full text search index, and stores some of your fields in a columnar store. The "compressed size" includes all of this.
The high compression rate is VERY specific to logs.
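For context, here is a rough sketch of what an index config can look like, sent as JSON to the create-index endpoint. The field names and options are illustrative only; check the docs for the exact schema of the version you run:

```python
# Sketch: a minimal index config showing roughly where each part of a document
# ends up. Field names and options are hypothetical.
import requests

index_config = {
    "version": "0.8",
    "index_id": "app-logs",
    "doc_mapping": {
        "timestamp_field": "timestamp",
        "field_mappings": [
            # "fast" (columnar) datetime field -> cheap time-range pruning
            {"name": "timestamp", "type": "datetime", "fast": True},
            # tokenized text field -> goes into the inverted index
            {"name": "message", "type": "text", "tokenizer": "default"},
            # low-cardinality field kept columnar for aggregations
            {"name": "severity", "type": "text", "tokenizer": "raw", "fast": True},
        ],
    },
}

# The original documents are additionally kept ZSTD-compressed in the row
# store, so the reported "compressed size" covers docs + index + columnar.
resp = requests.post("http://localhost:7280/api/v1/indexes", json=index_config, timeout=30)
resp.raise_for_status()
```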
- What happens when you alter an index configuration?
Changing an index mapping was not available in 0.8. It is available on main and will ship in 0.9. Such a change only affects newly ingested data.
- Or add or remove an index?
This is handled since the beginning.
- What about cold storage?
What makes Quickwit special is that everything lives on S3 and we read it directly from there. We adapted our inverted index to make it possible to read straight from S3.
You might think this is crazy slow, but we typically search through TBs of data in less than a second. We have some in-RAM caches too, but they are entirely optional.
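For the curious, a time-bounded search against the native search endpoint looks roughly like the sketch below. Host, index id, and field names are placeholders, and the exact response fields may differ between versions:

```python
# Sketch: a time-bounded search against Quickwit's native search endpoint.
# The data lives on S3; only the index structures needed for this query are
# fetched (plus whatever happens to sit in the optional in-RAM cache).
import time
import requests

QUICKWIT_URL = "http://localhost:7280"
INDEX_ID = "app-logs"  # hypothetical index

now = int(time.time())
params = {
    "query": "severity:ERROR AND message:timeout",  # hypothetical fields
    "start_timestamp": now - 3600,  # last hour only, so most splits get pruned
    "end_timestamp": now,
    "max_hits": 20,
}

resp = requests.get(f"{QUICKWIT_URL}/api/v1/{INDEX_ID}/search", params=params, timeout=30)
resp.raise_for_status()
result = resp.json()
print(result.get("num_hits"), "hits in", result.get("elapsed_time_micros"), "µs")
```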
> 2. Sampled data, generally for debugging. I would generally try to keep this at 10TB or less;
Sometimes, sampling is not possible. For instance, some of Quickwit's users (including Binance) use their logs for user support too. A user might come asking for details about something fishy that happened 2 months ago.