
I've never been able to understand why log indexing has to build an inverted index. A decent columnar store with partitioning by date should be enough to quickly filter gigabytes of logs.
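
For what it's worth, a minimal sketch of that with pyarrow, assuming a Hive-style date-partitioned layout (the paths and column names are made up):

    # Sketch only: assumes logs laid out as logs/date=2024-01-15/part-0.parquet
    # with a string column "message". Partition pruning happens before any I/O.
    import pyarrow.compute as pc
    import pyarrow.dataset as ds

    dataset = ds.dataset("logs/", format="parquet", partitioning="hive")

    # Only directories matching the date predicate are read at all.
    table = dataset.to_table(filter=pc.field("date") == "2024-01-15")

    # Substring match over the surviving rows, in memory.
    hits = table.filter(pc.match_substring(table["message"], pattern="error abc123"))
    print(hits.num_rows)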



Because you want to find all occurrences of "error abc123" over the last year, immediately?


Quickwit co-founder here... I actually agree. For a few GBs, done right, columnar works fine AND is cost efficient.

After all, it does not matter much if a log search query answers in 300ms or 1s. However, there are use cases where a few GB just does not cut it.

The idea that you can always prune your dataset using timestamps and tags simply does not hold in every case.


Can you share your experience of when columnar fails?

NVMe can be scanned at multiple GB/s, scans can run in parallel across multiple disks and over compressed data (10 GB of logs ~ 1 GB to scan), and the data can be segmented and prefaced with Bloom filters to quickly check whether a segment is worth scanning.
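
A toy sketch of that segment-skipping idea (the bit sizes and hash counts are arbitrary, just to show the shape of it):

    # Toy Bloom filters, one per log segment: consult them to decide which
    # segments are worth decompressing and scanning at all.
    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1 << 20, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, token):
            digest = hashlib.sha256(token.encode()).digest()
            for i in range(self.num_hashes):
                # Derive k positions from disjoint 4-byte slices of the digest.
                yield int.from_bytes(digest[i * 4:(i + 1) * 4], "little") % self.num_bits

        def add(self, token):
            for pos in self._positions(token):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, token):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(token))

    def build_segment_filter(lines):
        bf = BloomFilter()
        for line in lines:
            for token in line.split():
                bf.add(token)
        return bf

    def segments_to_scan(segment_filters, token):
        # False positives are possible, false negatives are not,
        # so a "no" lets us skip the segment entirely.
        return [seg_id for seg_id, bf in segment_filters.items()
                if bf.might_contain(token)]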


I'm not the person you asked, but say you have 10 TB of logs.

Assuming 3 GB/s per SSD, 10 SSDs, and the 10x compression you suggested, a query that searches for a string in the text would take 10000 GB / 10 / (3 GB/s × 10) ≈ 33 seconds.
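
Spelled out, under those assumptions:

    # Back-of-envelope: full scan time for 10 TB of logs on 10 SSDs
    # at 3 GB/s each, with 10x compression.
    total_gb = 10_000                 # 10 TB of raw logs
    compression = 10                  # 10x -> 1,000 GB actually read
    aggregate_bw_gb_s = 3 * 10        # 10 SSDs at 3 GB/s each

    scan_seconds = total_gb / compression / aggregate_bw_gb_s
    print(scan_seconds)               # ~33.3 seconds per full scan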

With an index, you can easily get it 100x faster, and that factor gets larger as your data grows.

In general it's just that O(log(n)) wins over O(n) when n gets large.

I didn't take your Bloom filter idea into consideration, as it is not immediately obvious how a Bloom filter can support all the filter operations that an index can. Also, the index gives you the exact position of the match, whereas the Bloom filter only tells you about existence, potentially still resulting in a large read amplification factor for a scan within the segment versus direct random access.


I'm thinking of how a data lake with Parquet files can be structured. Each Parquet file carries metadata with summary statistics about the data, and Bloom filters too. A scanner would inspect the files that fall into the requested time range; for each file it would check the metadata and find the ranges of data to scan. That's the theory, in which the scanner is not much slower than index access while also allowing efficient aggregations over log data.
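
Roughly this shape with pyarrow, using only the min/max row-group statistics (Parquet does have Bloom filters in the format, but I'm leaving them out here; the file path, column name, and time bounds are made up):

    # Pick the row groups whose [min, max] timestamp range overlaps the query
    # window; everything else is skipped without being read. Flat schema assumed.
    from datetime import datetime
    import pyarrow.parquet as pq

    def row_groups_in_range(pf, ts_column, start, end):
        col_idx = pf.schema_arrow.get_field_index(ts_column)
        selected = []
        for rg in range(pf.metadata.num_row_groups):
            stats = pf.metadata.row_group(rg).column(col_idx).statistics
            if stats is None or not stats.has_min_max:
                selected.append(rg)   # no stats -> have to scan it
            elif stats.max >= start and stats.min <= end:
                selected.append(rg)   # overlaps the requested time range
        return selected

    # Illustrative usage: scan only the matching row groups of one file.
    start, end = datetime(2024, 1, 15), datetime(2024, 1, 16)
    pf = pq.ParquetFile("logs/part-0.parquet")
    table = pf.read_row_groups(row_groups_in_range(pf, "ts", start, end))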



