
Searching 20 GB/sec: Systems Engineering Before Algorithms (2014) - gk1
http://blog.scalyr.com/2014/05/searching-20-gbsec-systems-engineering-before-algorithms/
======
Radim
This reminded me of another, related task that comes up a lot in machine
learning and IR: searching by vectors (rather than keywords), a.k.a. k-NN
(_k_-nearest-neighbor) search.

Just as with fulltext search, there's no shortage of clever approximate
algorithms for kNN vector search (with the additional caveat of the "curse of
dimensionality").

And just as here, a well-optimized brute force search blows most "clever",
widely popular kNN implementations out of the water on large datasets unless
they're carefully tuned [1]. There's certainly something to be said for
simplicity and good "systems engineering".

[1] [http://rare-technologies.com/performance-shootout-of-nearest-neighbours-contestants/](http://rare-technologies.com/performance-shootout-of-nearest-neighbours-contestants/)
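
(For a sense of what that brute-force baseline amounts to, here's a minimal
NumPy sketch - hypothetical names, assuming L2-normalized float32 vectors.
It's one matrix-vector product over the whole dataset plus a partial sort:

```python
# Minimal sketch of a brute-force k-NN baseline (hypothetical names).
# O(n*d) work, but it's a single pass of sequential memory access,
# which is exactly what makes it hard to beat in practice.
import numpy as np

def knn_brute_force(index_vectors, query, k=10):
    """Indices of the k nearest neighbors by cosine similarity.

    index_vectors: (n, d) float32 array, rows assumed L2-normalized.
    query: (d,) float32 array, assumed L2-normalized.
    """
    scores = index_vectors @ query            # one sequential scan
    top = np.argpartition(-scores, k)[:k]     # O(n) partial selection
    return top[np.argsort(-scores[top])]      # sort only the k winners
```

No trees, no hashing, no tuning parameters - which is why its performance is
so predictable.)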

~~~
mattb314
Your linked blog post was a really good read, but I'm not totally sure if the
correct takeaway was that brute force outperforms "clever" algorithms. Unless
I misinterpreted the results, FLANN and Annoy (which implement hierarchical
k-means and a binary space partitioning algorithm, respectively) both ran
about 100x faster than brute force.

Were you referring to the number of libraries the author couldn't get to work?
This certainly might be a failure of overcomplicated algos, but it seemed
more like an engineering failure than a performance failure to me.

~~~
Radim
Well, complex algos and engineering failures go hand in hand :-)

The approximate kNNs are superior if done well, no doubt about that. It's
just that "doing them well" is a major challenge, even for well-established
and well-thought-out tools such as scikit-learn. So many things can go wrong,
and despite the increased complexity and settling for approximate results, you
can end up with worse runtime performance than O(n) brute force...

The performance of brute force is pleasantly predictable and stable. I just
found the parallels with OP's article (which is on fulltext search, a mostly
unrelated field) interesting, that's all.

------
MichaelGG
I think it's astounding what we can get away with these days. But
how does this scale to larger datasets? Say, capturing all signalling traffic
for a telco. It's not hard to have 100s of GB a day - that's only a couple of
tens of Mbps. Suddenly 20 GB/sec means 5-10s queries just for that day.
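
Back-of-envelope, taking 200 GB/day as an assumed concrete figure:

```python
# Sanity check on the numbers above (200 GB/day is an assumed figure).
gb_per_day = 200
mbps = gb_per_day * 8 * 1000 / 86400   # ~18.5 Mbps sustained ingest
query_seconds = gb_per_day / 20        # 10 s to scan one day at 20 GB/s
```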

They can also leverage the bulk speed to reduce the size of posting lists.
Just index each term to a chunk of, say, 1-5 GB. And don't even index full
terms, just hash them - a 64-bit hash gives you enough room for around 10
billion unique terms without many problems from collisions. New data from
the last hour can be brute forced while the index is being generated.
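
A rough sketch of that chunked, hashed index (all names and sizes here are
hypothetical, not Scalyr's design):

```python
# Hypothetical sketch of the coarse hashed index described above:
# map a 64-bit term hash to the set of large chunks containing it,
# then brute-force scan only those chunks.
import hashlib
from collections import defaultdict

CHUNK_SIZE = 2 * 1024**3          # say, 2 GB chunks (illustrative)
index = defaultdict(set)          # 64-bit term hash -> set of chunk ids

def term_hash(term):
    # 64-bit hash; with ~10 billion distinct terms, the occasional
    # collision just means scanning one extra chunk.
    h = hashlib.blake2b(term.encode(), digest_size=8).digest()
    return int.from_bytes(h, "big")

def index_record(byte_offset, terms):
    chunk_id = byte_offset // CHUNK_SIZE
    for t in terms:
        index[term_hash(t)].add(chunk_id)

def candidate_chunks(term):
    return index.get(term_hash(term), set())   # chunks to brute force
```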

OTOH they seem successful so I suppose it's working quite fine.

~~~
busterarm
Having worked at a telco where our tools would take a minute or more for
results to come back, 5-10 second queries that we only do a few (or even a few
dozen) times a day is _fine_.

~~~
MichaelGG
I was thinking more along the lines of SIP tracing for customers, which
happens much more frequently for tech support roles, at least in my
experience. With a quad core and a single 1TB drive I was able to serve 100s
of GB a day of SIP with ~1s response times, using a mix of brute force (for
the temporary log) and indexing in the background.

The interactivity shifts quite a bit when you get close to "instant". I know
some companies that are content with even 30s query times, but that's gotta
limit flexibility.

~~~
windowsworkstoo
I guess it's an interesting difference, right - what a developer would
consider "instant" (let's say <10ms) versus what the end user is satisfied
with as instant (maybe <1min)... it's interesting what tradeoffs you can make
when you loosen your latency requirements.

~~~
adrianN
Thinking like this leads to products like Rational Doors, a requirements
database where searching is dog slow and you need to know where to look if you
want to find anything.

~~~
windowsworkstoo
Really? I've no idea what that product is, but I hardly think it's a
"thinking" problem - more a problem of managing expectations or shit
requirements gathering. If it was necessary to search in a certain way over a
certain size in a certain time, you could make the appropriate tradeoff.
Sounds like you just wanted to bitch about a particular product, pal.

~~~
adrianN
I admit I like bitching about slow software. But if you don't put speed as one
of your top requirements, over time your software will become slower and
slower. Having one operation that takes 100ms instead of 10ms is fine - until
two years later, when somebody else who doesn't bother to read your code uses
it to implement something that now takes one second instead of 100ms, which in
turn gets called a hundred times to complete some query that now takes minutes
instead of seconds. Now imagine you're a company like IBM or SAP that works on
the same code base for decades: the cruft keeps layering and layering until
even finding out which part of the code needs to be tuned becomes too
expensive to explain to your customers.

There is
[http://accidentallyquadratic.tumblr.com](http://accidentallyquadratic.tumblr.com)
with a nice collection of "it was fast enough at the time" code.

~~~
mnkmnk
Why would it be difficult to profile and find the bottleneck after a few
decades? It should actually be even easier with better profiling tools and
practices.

~~~
adrianN
Uniformly shitty code is hard to profile. See, for example, what Chandler
Carruth has to say about writing good code:

[https://www.youtube.com/watch?v=fHNmRkzxHWs](https://www.youtube.com/watch?v=fHNmRkzxHWs)

------
karterk
Also, see the follow-up post:

[http://blog.scalyr.com/2014/07/impossible-problems-often-arent/](http://blog.scalyr.com/2014/07/impossible-problems-often-arent/)

In the end, they did have to do some precomputation (which is better than
indexing, but still not 100% on-the-fly resolution).

------
userbinator
Searching at several times memory bandwidth is definitely possible on
reasonably modern hardware:

[http://www.codeproject.com/Articles/250566/Fastest-strstr-like-function-in-C](http://www.codeproject.com/Articles/250566/Fastest-strstr-like-function-in-C)

The trick is that the CPU isn't actually reading all those bytes: the longer
the needle, the more bytes can be skipped over when its last character doesn't
match.

I wouldn't call this "brute force" though - it's nowhere near as complex as
indexing, but what I consider "brute force" is more like strcmp() in a loop.
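
For illustration, the skipping described above is in the Boyer-Moore family; a
minimal Horspool-style sketch (not the linked article's actual code) looks
like this:

```python
# Horspool-style search: most haystack bytes are never even read,
# because a mismatch on the needle's last character lets us skip
# up to len(needle) bytes at a time.
def horspool_find(haystack: bytes, needle: bytes) -> int:
    m = len(needle)
    if m == 0:
        return 0
    # For each byte in the needle (except the last), how far the
    # window can shift when that byte lines up with the window end.
    shift = {b: m - 1 - i for i, b in enumerate(needle[:-1])}
    i = m - 1
    while i < len(haystack):
        if haystack[i] == needle[-1] and haystack[i - m + 1:i + 1] == needle:
            return i - m + 1
        i += shift.get(haystack[i], m)  # longer needle = bigger skips
    return -1
```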

~~~
zurn
Your needles (search terms) need to be quite long if you want to save on
memory traffic: cache lines are 64-128 bytes, and random access to standard
DRAM is slower than sequential access. It can help if your data is in cache,
though (or if you have an unusual system where memory is faster than the CPU
can keep up with).

~~~
qb45
The high cost of random access to DRAM comes from jumping between rows (blocks
of a few to a few tens of kB of consecutive addresses).

If you read sequentially with gaps, you only have to open each row at most
once. So it's like normal sequential reading, except that you may skip a few
rows here and there, and you may not transfer some parts of the rows you
opened.

So the time to completion can only be shorter, even if average throughput is
lower.

------
fizx
Deciding to do things this way is a tradeoff. Writing data into a search index
is maybe 10x more expensive than writing it into a log, but an actual search
typically runs in something like O(log n) time. Even pathological queries are
still far more efficient than scans.

From a cost perspective, this system is a bet that most data it writes is
rarely or never read from.

~~~
foota
The article also discusses that it makes more complicated queries much easier
to process - for instance, regular expressions, or special-character-aware
searches.

~~~
rakoo
It is possible to index content and then retrieve it with (simple) regular
expressions; see
[https://swtch.com/~rsc/regexp/regexp4.html](https://swtch.com/~rsc/regexp/regexp4.html).
The trick is that each regular expression can be seen as a tree whose leaves
are n-grams to be searched for and whose internal nodes are AND and OR
operations over n-grams and other nodes.

So a regexp like /banana/ can be seen as a search for the trigrams "ban" and
"ana" and "nan" and "ana", all in that order. A regexp like /ca[tp]/ can be
seen as a search for "cat" or "cap". Once you have simple n-grams like that,
it's trivial to index content on the way in (the indexer doesn't actually need
to interpret the content; all it has to do is split it into n-character
windows) and use that index to restrict potential matches.
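
A toy sketch of that trigram index (illustrative only, not Cox's actual
implementation):

```python
# Toy trigram index in the spirit of Cox's article: index every 3-char
# window of each document, turn a literal into an AND of its trigrams,
# and only run the real regex over the surviving candidates.
import re
from collections import defaultdict

trigram_index = defaultdict(set)   # trigram -> set of doc ids

def index_doc(doc_id, text):
    for i in range(len(text) - 2):
        trigram_index[text[i:i + 3]].add(doc_id)

def candidates(literal):
    grams = [literal[i:i + 3] for i in range(len(literal) - 2)]
    sets = [trigram_index.get(g, set()) for g in grams]
    return set.intersection(*sets) if sets else set()

# /banana/ -> AND of {"ban", "ana", "nan"}; /ca[tp]/ -> OR of two literals:
def search_ca_tp(docs):
    maybe = candidates("cat") | candidates("cap")
    return [d for d in maybe if re.search(r"ca[tp]", docs[d])]
```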

~~~
Retric
This quickly falls down for searches like:

/a[0-9][0-9][0-9]k[0-9][0-9][0-9]c/

Each character class multiplies out the possible trigrams, so the index query
becomes an OR over thousands of alternatives and stops narrowing anything
down.

------
zachrose
What does 20 GB/sec mean in the context of a search?

~~~
nomel
The article [1] was about a log search/parsing service, so 20 GB of log
files/second. Log files are usually text.

[1] [http://blog.scalyr.com/2014/05/searching-20-gbsec-systems-engineering-before-algorithms/](http://blog.scalyr.com/2014/05/searching-20-gbsec-systems-engineering-before-algorithms/)

~~~
gravypod
Every few years that I stay in CS I see something crazy. I hope the next crazy
thing I see is a log file that isn't text.

I can't imagine who would have thought that would be useful to program, if it
does exist.

~~~
to3m
Why would you want your log files purely as text? Text tends to take up a lot
more space when you're storing a useful amount of readable information.

Category prefixes and timestamps alone take up enough space that you could
benefit from storing them concisely. Indentation takes up space. Lines are
delimited rather than counted (though you can have a trailing count). And so
on. Space will always be a bottleneck, and generating the data will always
take time; no reason to make that worse. The average log byte will be written
once and then never looked at by a human.

Producing ASCII text also requires a lot more code. Entire articles have been
written on the correct way to print a float as decimal digits... or you could
memcpy 4 bytes into a buffer, knowing ahead of time exactly how much space
you'll need (and the value will round-trip perfectly). Something similar goes
for integers (using something like LEB128 encoding). In both cases the encoded
data will likely take up less space than most useful text representations,
even taking into account any control bytes, while being a lot quicker to
produce. "But what if there's a bug?"... well, you'll probably find it in one
of your first tests, when you write the first thing that springs to mind.
There's not much room for bugs to sneak in.
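
A sketch of what that could look like - the record layout below is invented
for illustration:

```python
# Binary logging sketch; the record layout is made up for illustration.
# A raw copy of the float's bits round-trips exactly, and LEB128
# varints keep small integers to a byte or two.
import struct

def encode_f64(x):
    return struct.pack("<d", x)    # 8 bytes, bit-exact round trip

def encode_uleb128(n):
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))  # high bit = "more follows"
        if not n:
            return bytes(out)

# e.g. varint timestamp delta + varint category id + raw float payload:
record = encode_uleb128(37) + encode_uleb128(3) + encode_f64(0.1)
# 1 + 1 + 8 = 10 bytes, versus dozens of ASCII characters.
```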

People don't like this because we have crappy tools. If `cat' supported this
kind of encoding as-is, nobody would complain about the idea. You can tell by
the popularity of gzip... who complains about gzip? But good luck trying to
read a gzip file without the right tools.

~~~
falcolas
The right tool for a gzipped log file is less (courtesy of lesspipe), and/or
zgrep. In the worst case, you pipe the output of gunzip through whatever tools
you really want to use.

The same tools work on compressed and uncompressed logs, and all of that
"wasted space" compresses really well. Nothing like watching gigs of log files
compress down to a few megs. It's even handled automatically on most Linux
systems via logrotate.

------
makach
Um, what if their logs grow exponentially or faster...? Won't they struggle
with that approach? The article reads like an advertisement to me.

~~~
logicchains
In what circumstances would logs grow exponentially? I imagine any approach
would have trouble with logs that grew in size that quickly.

