Just as with fulltext search, there's no shortage of clever approximate algorithms for kNN vector search (with the additional caveat of the "curse of dimensionality").
And just as here, a well-optimized brute force search blows most "clever", widely popular kNN implementations out of the water on large datasets unless those are carefully tuned. There's certainly something to be said for simplicity and good "systems engineering".
Were you referring to the number of libraries the author couldn't get to work? This certainly might be a failure of overcomplicated algos, but it seemed more like an engineering failure than a performance failure to me.
Approximate kNNs are superior if done well, no doubt about that. It's just that "doing them well" is a major challenge, even for well-established and well-thought-out tools such as scikit-learn. So many things can go wrong that, despite the increased complexity and settling for approximate results, you can end up with worse runtime performance than O(n) brute force...
The performance of brute force is pleasantly predictable and stable. I just found the parallels with OP's article (which is on fulltext search, a mostly unrelated field) interesting, that's all.
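For reference, a brute-force kNN is only a few lines with numpy. A minimal sketch (names and sizes are made up, not anyone's production code):

    import numpy as np

    def knn_brute_force(queries, data, k):
        # Exact k nearest neighbours by scanning everything: O(n*d) per
        # query, no index to build, no recall or tuning surprises.
        # Squared distances via |q-x|^2 = |q|^2 - 2*q.x + |x|^2.
        d2 = ((queries ** 2).sum(1)[:, None]
              - 2.0 * queries @ data.T
              + (data ** 2).sum(1)[None, :])
        # argpartition finds the k smallest in O(n); only those get sorted.
        idx = np.argpartition(d2, k, 1)[:, :k]
        rows = np.arange(len(queries))[:, None]
        return idx[rows, np.argsort(d2[rows, idx], 1)]

    data = np.random.rand(100_000, 128).astype(np.float32)
    queries = np.random.rand(10, 128).astype(np.float32)
    print(knn_brute_force(queries, data, k=5))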
They can also leverage the bulk speed to reduce the size of posting lists. Just index each term to a chunk of, say, 1-5GB. And don't even index full terms, just hash them - a 64-bit hash gives you enough room for around 10 billion unique terms without collisions becoming much of a problem. New data from the last hour can be brute forced while the index is being generated.
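A sketch of what that could look like (names and parameters here are assumptions, not from any real system):

    import hashlib

    def term_hash64(term: str) -> int:
        # Store a 64-bit hash instead of the term itself. Even at ~10
        # billion unique terms the birthday bound gives only a handful of
        # expected collisions (~n^2 / 2^65).
        return int.from_bytes(
            hashlib.blake2b(term.encode(), digest_size=8).digest(), "little")

    # Posting "lists" at chunk granularity: term hash -> chunk ids, where
    # each chunk is a 1-5GB slab that gets brute-forced on a hit.
    index: dict[int, set[int]] = {}

    def index_chunk(chunk_id: int, terms: list[str]) -> None:
        for t in terms:
            index.setdefault(term_hash64(t), set()).add(chunk_id)

    def candidate_chunks(term: str) -> set[int]:
        return index.get(term_hash64(term), set())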
OTOH they seem successful, so I suppose it's working quite well.
A couple of terabytes of processing is easy, but yeah, you're limited to only a few queries per day.
A lot of the effort goes into caching, so disk reads and swap file use are minimised. Really, that's where all the overhead is.
All the public stuff for that is at
The interactivity shifts quite a bit when you get close to "instant". I know some companies that are content with even 30s query times, but that's gotta limit flexibility.
There's a number of thresholds you can hit:
5 minutes: this is taking a while, better grab a coffee
1 minute: this is taking a while, I can let my thoughts drift a bit
10 seconds: it will be done soon, I'll just stare at the screen till it's done
1 second: it's not slow, but sluggish and interrupting my flow
200 milliseconds: that result came instantly when I pressed the button
50 milliseconds: the results update instantly when I use this slider
Hitting another of these milestones transforms the user experience and opens the dataset up to a new, more spontaneous kind of exploration.
There is http://accidentallyquadratic.tumblr.com with a nice collection of "it was fast enough at the time" code.
In the end, they did have to do some pre-computes (which is better than indexing, but still not 100% on-the-fly resolution).
The trick is that the CPU isn't actually reading all those bytes: the longer the string being sought, the more bytes can be skipped whenever the last character of the window doesn't match.
I wouldn't call this "brute force" though - it's nowhere near as complex as indexing, but what I consider "brute force" is more like strcmp() in a loop.
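That skipping is the classic Boyer-Moore(-Horspool) trick. A toy version in Python (the real implementations live in optimized C, of course):

    def horspool_find(haystack: bytes, needle: bytes) -> int:
        # Compare right-to-left; on a mismatch, jump ahead by how far the
        # last byte of the window is from the end of the needle. Longer
        # needles mean bigger jumps, so fewer bytes are actually read.
        m = len(needle)
        skip = {b: m - i - 1 for i, b in enumerate(needle[:-1])}
        pos = 0
        while pos + m <= len(haystack):
            if haystack[pos:pos + m] == needle:
                return pos
            pos += skip.get(haystack[pos + m - 1], m)
        return -1

    print(horspool_find(b"one banana two bananas", b"banana"))  # -> 4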
If you read sequentially with gaps, you only have to open each row at most once. So it's like normal sequential reading, except that you may end up skipping a few rows here and there, and you may end up not transferring some parts of the rows you opened.
So the time to completion can only be shorter, even if average throughput is lower.
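In code, assuming a hypothetical fixed-size-row layout, forward-only reads over sorted row numbers could look like this:

    def read_rows(path: str, rows: list[int], row_size: int) -> dict:
        # Each row is opened at most once and the file position only ever
        # moves forward, so gaps are skipped, never revisited.
        out = {}
        with open(path, "rb") as f:
            for r in sorted(set(rows)):
                f.seek(r * row_size)   # a no-op when rows are adjacent
                out[r] = f.read(row_size)
        return out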
From a cost perspective, this system is a bet that most data it writes is rarely or never read from.
So, a regexp like /banana/ can be seen as a search for the trigrams "ban" and "ana" and "nan" and "ana", all in that order. A regexp like /ca[tp]/ can be seen as a search for "cat" or "cap". Once you have simple n-grams like that, it's trivial to index content on the way in (note that indexing doesn't actually need to interpret the content; all you need is to split every n characters) and use that index to restrict potential matches.
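A sketch of the mechanics (essentially the trick Russ Cox described for Google Code Search; the names here are made up):

    from collections import defaultdict

    def trigrams(s: str) -> list[str]:
        # "banana" -> ["ban", "ana", "nan", "ana"]
        return [s[i:i + 3] for i in range(len(s) - 2)]

    index: defaultdict[str, set[int]] = defaultdict(set)

    def add_document(doc_id: int, text: str) -> None:
        # Indexing never interprets the content, it just splits it up.
        for g in trigrams(text):
            index[g].add(doc_id)

    def candidates(literal: str) -> set[int]:
        # Any document missing a required trigram can't match; the real
        # regex engine only has to run over the survivors.
        required = trigrams(literal)
        if not required:
            return set()  # too short to prune anything
        return set.intersection(*(index[g] for g in required))

    add_document(1, "a banana republic")
    add_document(2, "apples and pears")
    print(candidates("banana"))  # -> {1}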
I can't imagine who would have thought that would be useful to program, if it does exist.
Category prefixes and time stamps alone take up enough space that you could benefit from having them stored concisely. Indentation actually takes up space. The lines are delimited, not counted (you can have a trailing count). And so on. Space will always be a bottleneck, and generating the data will always take time; no reason to make that worse. The average log byte will be written once and then never looked at by a human.
Producing ASCII text also requires a lot more code. Entire articles have been written on the correct way to print a float as decimal digits... or, you could memcpy 4 bytes into a buffer, while knowing ahead of time exactly how much space you'll need (and the value will round trip perfectly). Something similar goes for integers (using something like LEB128 encoding). In both cases the encoded data will likely take up less space than most useful text representations, even taking into account any control bytes, while being a lot quicker to produce. "But what if there's a bug?"... well, you'll probably find it in one of your first tests, when you write the first thing that sprang to mind. There's not much space for bugs to sneak in.
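Both are a couple of lines in any language; in Python, for instance:

    import struct

    def encode_f32(x: float) -> bytes:
        # Four bytes, fixed size, perfect round trip - no float-to-decimal
        # printing code anywhere. Effectively a memcpy of the IEEE 754 bits.
        return struct.pack("<f", x)

    def encode_uleb128(n: int) -> bytes:
        # LEB128: 7 payload bits per byte, high bit means "more follows".
        # Small integers (the common case in logs) fit in a single byte.
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            out.append(byte | (0x80 if n else 0))
            if not n:
                return bytes(out)

    assert struct.unpack("<f", encode_f32(1.25))[0] == 1.25
    assert encode_uleb128(300) == b"\xac\x02"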
People don't like this because we have crappy tools. If `cat' supported this kind of encoding as-is, nobody would complain about the idea. You can tell by the popularity of gzip... who complains about gzip? But good luck trying to read a gzip file without the right tools.
And/or zgrep. In the worst-case scenario, you pipe the output of gunzip through whatever tools you really want to use.
The same tools work on compressed and uncompressed logs, and all of that "wasted space" compresses really well. Nothing like watching gigs of log files compress down to a few megs. It's even handled automatically in most Linux distributions via logrotate.
Another case: if you primarily send data as messages between processes - TCP streams or protobuf, for example - you could save those streams as log data. The binary formats will be faster and smaller than converting all log output to text.
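For instance, with length-prefixed framing, "logging" is just appending the wire bytes (a sketch; a real system would add checksums, versioning, etc.):

    import struct

    def write_frame(f, payload: bytes) -> None:
        # Append the serialized message (protobuf, a TCP payload, ...)
        # verbatim, prefixed with its length - a write, not a
        # format-and-escape step.
        f.write(struct.pack("<I", len(payload)))
        f.write(payload)

    def read_frames(f):
        while header := f.read(4):
            (n,) = struct.unpack("<I", header)
            yield f.read(n)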
While these scenarios may not apply to the common case, they certainly exist within the field.
Not saying it's wrong for all cases, but it's not for me.