
Yes

A math puzzle, its relationship with the average case complexity of computing top-K using a min heap, and a simple algorithm that performs better.


We used to have one. Maybe we can revive it. What is your use case?


Quickwit co-founder here... I actually agree. For a few GBs, done right, columnar works fine AND is cost efficient.

After all, it does not matter much if a log search query answers in 300ms or 1s. However, there are use cases where a few GB just does not cut it.

The claim that you can always prune your dataset using timestamps and tags simply does not always hold.


Can you share your experience of when columnar fails?

It is possible to scan NVMe at multiple GB/s, scans can run in parallel across multiple disks over compressed data (10 GB of logs ~ 1 GB to scan), and data can be segmented and prefaced with Bloom filters to quickly check whether a segment is worth scanning.


I'm not the person you asked, but say you have 10 TB of logs.

Assuming 3 GB/s per SSD, 10 SSDs, and the 10x compression you suggested, a query searching for a string in the text would take 10,000 GB / 10 / (3 GB/s × 10) ≈ 33 seconds.

With an index, you can easily get it 100x faster, and that factor gets larger as your data grows.

In general it's just that O(log(n)) wins over O(n) when n gets large.
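
A quick back-of-the-envelope sketch of that arithmetic (all figures, including the 1e10 document count, are purely illustrative):

  import math

  data_gb = 10_000       # 10 TB of raw logs
  compression = 10       # ~10x, so only a tenth of the bytes get scanned
  ssd_gb_per_s = 3       # per-SSD sequential read throughput
  num_ssds = 10

  scan_s = data_gb / compression / (ssd_gb_per_s * num_ssds)
  print(f"full scan: {scan_s:.0f} s")  # ~33 s

  # The scan grows linearly with the data; an index lookup only logarithmically.
  for growth in (1, 10, 100):
    docs = growth * 1e10
    print(f"{growth}x data: scan ~{scan_s * growth:.0f} s, log2(n) ~ {math.log2(docs):.0f}")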

I didn't take your Bloom filter idea into consideration because it is not immediately obvious how a Bloom filter can support all the filter operations that an index can. Also, the index gives you the exact position of the match, while the Bloom filter only tells you about existence, so scanning the segment can still incur a large read amplification factor compared to direct random access.


I’m thinking of how a data lake with Parquet files can be structured. Each Parquet file has a header with summary statistics about the data, and it has Bloom filters too. A scanner would inspect the files falling into the requested time range, and for each file check the headers to find the ranges of data to scan. That is the theory, in which the scanner is not that much slower than index access while also allowing for efficient aggregations over log data.
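
Roughly what such a scanner does, sketched with pyarrow (the file path and column names are made up for illustration; pyarrow exposes the per-row-group min/max statistics, while Bloom filter checks are left out here):

  import pyarrow.parquet as pq

  def column_min_max(row_group, name):
    # Return the (min, max) statistics of a column chunk, or None if absent.
    for i in range(row_group.num_columns):
      col = row_group.column(i)
      if col.path_in_schema == name and col.statistics and col.statistics.has_min_max:
        return col.statistics.min, col.statistics.max
    return None

  def scan(path, ts_start, ts_end):
    pf = pq.ParquetFile(path)  # e.g. one file of the lake falling in the time range
    for rg in range(pf.metadata.num_row_groups):
      stats = column_min_max(pf.metadata.row_group(rg), "timestamp")
      if stats and (stats[1] < ts_start or stats[0] > ts_end):
        continue  # row group entirely outside the requested range: skip it
      # Only now do we pay for reading the data and running the actual match.
      yield pf.read_row_group(rg, columns=["timestamp", "body"])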


Very valuable contribution!


What is your frame of reference?


Per-core store bandwidth is at least 14 GB/s on Zen 3, and 35 GB/s for non-temporal stores. Parsing JSON can be done at 2+ GB/s.

It's very healthy to take maximum bandwidth limits into consideration when reasoning about performance. For instance, for temporal stores, the bottlenecks you see are due to RAM latency and memory parallelism, because of the write-allocate. The load/store uarch can actually retire way more data from SIMD registers.

So there's already some headroom for CPU-bound tasks. For instance, 11 MB/s is very slow for a JIT baseline compiler. But if your particular problem demands arbitrary random accesses that regularly exceed L3, maybe that speed is justified.
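
For what it's worth, measuring that kind of baseline on your own data is cheap. A rough sketch (Python's json parser is far from the simdjson-class 2+ GB/s figure, but the methodology is the same):

  import json, time

  def parse_throughput_mb_s(lines, repeat=5):
    # lines: a list of JSON documents (str or bytes); best-of-N json.loads throughput
    total_bytes = sum(len(l) for l in lines)
    best = float("inf")
    for _ in range(repeat):
      start = time.perf_counter()
      for l in lines:
        json.loads(l)
      best = min(best, time.perf_counter() - start)
    return total_bytes / best / 1e6

  # e.g. with a sample of your own logs:
  # docs = open("sample.ndjson").read().splitlines()
  # print(parse_throughput_mb_s(docs), "MB/s")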


What we do is CPU bound and we are not just parsing JSON here.

The largest work we do is building an inverted index. Oversimplified, it is equivalent to this:

  # doc_jsons: an iterable of JSON documents, one per document.
  from collections import defaultdict
  import json

  inverted_index = defaultdict(list)
  for (doc_id, doc_json) in enumerate(doc_jsons):
    doc = json.loads(doc_json)
    for (field, field_text) in doc.items():
      # naive whitespace tokenization; real tokenizers do more work
      for (position, token) in enumerate(field_text.split()):
        inverted_index[token].append((doc_id, position))
  serialize_in_compressed_way_that_allows_lookup(inverted_index)

You can implement it in a couple of hours in the language of your choice to get a proper baseline.

I am sure we can still improve our indexing throughput... but I have never seen any search engine indexing as fast as tantivy.

If someone knows a project I should know of, I'd be genuinely keen on learning from it.


I'm curious, what is your frame of reference with regards to maximum speed of building inverted indices? Like, what is the maximum throughput you'd expect for this type of task, and what is your reasoning for it?


Quickwit is very similar to what is described here.

Unfortunately, the files are not in Parquet, so even though Quickwit is open source, it is difficult to tap into the file format.

We did not pick Parquet because we want to actually be able to search and do analysis efficiently, so we ship an inverted index, a row-oriented store, and a columnar format that allows for random access.

We are planning to eventually add ways to tap into the files and get data out in the Apache Arrow format.


From a quick skim through the docs, it wasn’t clear to me: can I run a stateless Quickwit instance, or even a library, to run queries such that the only data accessed is in the underlying object store? Or do I need a long-running search instance or cluster?


That looks quite promising! Thank you for crediting tantivy in the GitHub README, that's much appreciated! Ping me if I can help with anything.


Quickwit is compatible with S3-compatible object storage, yes. I don't remember the feedback for R2 specifically, but we have users running Quickwit on S3, GFS, Azure, MinIO, Garage, IBM, and all of the major Chinese clouds.

> Secondly, do you see Quickwit being used for analytics, such as tracking daily visits or analyzing user retention?

Excellent question. Quickwit is very fast on Elasticsearch aggregations. We do not support the cardinality aggregation yet, but it is scheduled for version 0.8.
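
For a flavor of what that looks like, a minimal sketch of running an Elasticsearch-style date_histogram aggregation against a local Quickwit instance (the port, endpoint path, index, and field names are assumptions; check the docs for your version):

  import json, urllib.request

  # Hypothetical local instance, index name, and timestamp field.
  url = "http://127.0.0.1:7280/api/v1/_elastic/hdfs-logs/_search"
  query = {
    "size": 0,
    "aggs": {
      "events_per_day": {
        "date_histogram": {"field": "timestamp", "fixed_interval": "1d"}
      }
    }
  }
  req = urllib.request.Request(url, data=json.dumps(query).encode(),
                               headers={"Content-Type": "application/json"})
  print(json.load(urllib.request.urlopen(req)))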

Analyzing user retention and, more generally, running complex analytics will not be possible any time soon. Maybe next year?


Tangent question: what do you think about the vision for the future of storage that the folks at ScyllaDB have?

There is a great presentation from their recent conference [1]. I am asking because I find it quite similar to what you are doing: decoupling storage from compute and utilising S3. I would really appreciate it if you could share your insights.

[1] starts at 30:45 https://m.youtube.com/watch?v=ZX7rA78BYS0


I had a look at this passage of the talk. Generally, S3 is a more obvious fit for OLAP, where inserts are very batchy and reads are large.

For OLTP, the equation is less obvious. I'd be wary of the cost associated with PUT/GET requests. (Pushing once every 30s or so equates to a few dollars per month.)
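
A back-of-the-envelope for that concern (the price per 1,000 PUTs is an assumption taken from the public S3 price list, and the total depends a lot on how many partitions or writers push independently; GET requests on the query path add to it):

  PUT_PRICE_PER_1K = 0.005          # assumed USD per 1,000 S3 PUT requests
  SECONDS_PER_MONTH = 30 * 24 * 3600

  def monthly_put_cost(push_interval_s, num_writers=1):
    puts = SECONDS_PER_MONTH / push_interval_s * num_writers
    return puts / 1000 * PUT_PRICE_PER_1K

  print(monthly_put_cost(30))                  # a single writer pushing every 30 s
  print(monthly_put_cost(30, num_writers=10))  # ten partitions pushing independently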

Since Scylla is based on an LSM-tree (I think), I would have expected the talk to be about saving SSTables on S3 but keeping the WAL on disk... But the slides seem to point to pushing the WAL to S3.


This is great, thanks for your input.


Zookeeper is not used by Quickwit at all. It is used by Kafka however.


I continue to see that dependency almost everywhere I see Kafka used, and yet KIP-500 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A...> and its Jira <https://issues.apache.org/jira/browse/KAFKA-9119> are shown as "Resolved Apr-2021" allegedly in 2.8.0

I wonder if it's going to be the "python2.7" (or I guess more relevant "Java8") of running Kafka :-(

