Hacker News new | comments | show | ask | jobs | submit login
Searching 20 GB/sec: Systems Engineering Before Algorithms (2014) (scalyr.com)
177 points by gk1 on May 27, 2016 | hide | past | web | favorite | 36 comments

This reminded me of another, related task that comes up a lot in machine learning and IR: searching by vectors (rather than keywords). Aka k-NN (k nearest neighbors) search.

Just as with fulltext search, there's no shortage of clever approximate algorithms for kNN vector search (with the additional caveat of the "curse of dimensionality").

And just as here, without careful tuning, a well-optimized brute force search blows most "clever", widely popular kNN implementations out of the water on large datasets [1]. There's certainly something to be said for simplicity and good "systems engineering".

[1] http://rare-technologies.com/performance-shootout-of-nearest...

Your linked blog post was a really good read, but I'm not totally sure if the correct takeaway was that brute force outperforms "clever" algorithms. Unless I misinterpreted the results, FLANN and Annoy (which implement hierarchical k-means and a binary space partitioning algorithm, respectively) both ran about 100x faster than brute force.

Were you referring to the number of libraries the author couldn't get to work? This certainly might be a failure of over complicated algos, but it seemed more like an engineering failure than a performance failure to me.

Well, complex algos and engineering failures go hand in hand :-)

The approximative kNNs are superior if done well, no doubt about that. It's just that "doing them well" is a major challenge, even for well-established and well thought out tools, such as scikit-learn. So many things can go wrong, and despite the increased complexity & settling for approximative results, you can end up with worse runtime performance than O(n) brute force...

The performance of brute force is pleasantly predictable and stable. I just found the parallels with OP's article (which is on fulltext search, a mostly unrelated field) interesting, that's all.

This is especially true if you can fit all the data on a GPU. GPU is really really good computing cosine distance fast and in parallel.

I think it's astounding how "easy" we can get away with stuff these days. But how does this scale to larger datasets? Say, capturing all signalling traffic for a telco. It's not hard to have 100s of GB a day - that's only a couple 10s of Mbps. Suddenly 20GB/sec means 5-10s queries just for that day.

They can also leverage the bulk speed to reduce the size of posting lists. Just index each term to a chunk of, say, 1-5GB. And don't even index full terms, just hash them - a 64-bit hash gives you enough room for around 10 billion unique terms without causing much problems with collisions. New data in the last hour can be brute forced while the index is being generated.

OTOH they seem successful so I suppose it's working quite fine.

If I read correctly, this was staged out on a single core user machine, then pushed out to an Amazon 16 core machine. Not the most powerful either, and no mention of any tuning. They mentioned looking to hit max scale out by simply adding more machines. Which is why they mention going from 20MG -> 100 -> 1000 not being an issue. In theory, wire speed is their limitation before they have to look at other methods. At 1000MB/s you could start chunking your data into result sets in advance, taking the time it takes a human to read the results visually as always being the bottleneck, and using that time to your advantage, to make the user think the app returns answers instantly all the time. I can't imagine 1000MB/s ever being too slow when you factor in their current dashboard and what it is returning. More complex dashboards and things certainly will need reworking.

I do something similar with the stuff for http://lifecode.co

Couple of terrabytes processing easy but yeah limited to only a few queries per day.

A lot of the effort goes into caching so disk reads and swap file use is minimised. Really that's where all the overhead is.

All the public stuff for that is at


Having worked at a telco where our tools would take a minute or more for results to come back, 5-10 second queries that we only do a few (or even a few dozen) times a day is _fine_.

I was thinking more along the lines of SIP tracing for customers, which happens much more frequently for tech support roles, at least in my experience. With a quad core and a single 1TB drive I was able to service 100s of GB a day of SIP with ~1s response times. Use a mix of brute force (for temporary log) then index in background.

The interactivity shifts quite a bit when you get close to "instant". Though I know some companies that are content with even 30s query times, but that's gotta limit flexibility.

I guess its an interesting difference right - what say a developer would consider "instant" (lets say <10ms) and what the end user is satisfied with as instant (maybe <1min)...it's interesting the tradeoffs you can make when you loosen your latency requirements.

>what the end user is satisfied with as instant (maybe <1min).

There's a number of thresholds you can hit:

5 minutes: this is taking a while, better grab a coffee

1 minute: this is taking a while, I can let my thoughts drift a bit

10 seconds: it will be done soon, I'll just stare at the screen till it's done

1 second: it's not slow, but sluggish and interrupting my flow

200 milliseconds: that result came instantly when I pressed the button

50 milliseconds: the results update instantly when I use this slider

Hitting another of these milestones transforms the user experience and allows the dataset to be experienced in a new way, allowing more spontaneous exploration of the data

Thinking like this leads to products like Rational Doors, a requirements database where searching is dog slow and you need to know where to look if you want to find anything.

Really? I've no idea what that product is but I hardly think it's a "thinking" problem, more a managing expectation or shit requirements gathering problem. If it was necessary to search in a certian way over a certain size in a certain time, you could make the appropriate trade off. Sounds like you just wanted to bitch out about a particular product pal.

I admit I like bitching about slow software. But if you don't put speed as one of your top requirements, over time your software will become slower and slower. Having one operation that takes 100ms instead of 10ms is fine until two years later when somebody else who doesn't bother to read your code uses it to implement something that now takes one second instead of 100ms, which is incidentally used a hundred times to complete some query that now takes minutes instead of seconds. Now imagine you're a company like IBM or SAP that works on the same code base for decades and the cruft keeps layering and layering until even finding out which part of the code needs to be tuned becomes too expensive to explain to your customers.

There is http://accidentallyquadratic.tumblr.com with a nice collection of "it was fast enough at the time" code.

Why would it be difficult to profile and find the bottleneck after a few decades? It should actually be even easier with better profiling tools and practices.

Uniformly shitty code is hard to profile. See for example what Chandler has to say about writing good code


There's an explanatory post there that in particular is very much worth reading: http://accidentallyquadratic.tumblr.com/post/113840433022/wh...

Also, see the follow-up post:


In the end, they did have to do some pre-computes (which is better than indexing, but still not 100% on-the-fly resolution).

Searching at several times memory bandwidth is definitely possible on reasonably modern hardware:


The trick is that the CPU isn't actually reading all those bytes, as the longer the string is to be sought, the more bytes can be skipped over if the last character doesn't match.

I wouldn't call this "brute force" though - it's nowhere near as complex as indexing, but what I consider "brute force" is more like strcmp() in a loop.

Your needles (search terms) need to be quite long if you want to save on memory traffic: cache lines are 64-128 bytes and random access to standard DRAM is slower than sequential access. It can help if your data is in-cache though. (Or if you have an unusual system where memory is faster than what CPU can keep up with).

The high cost of random access to DRAM comes from jumping between rows (blocks of consecutive few -to- few tens of kB).

If you read sequentially with gaps, you only have to open each row at most once. So it's like normal sequential reading, except that you may end up skipping few rows here and there and you may end up not transferring some parts of the rows you opened.

So the time to completion can only be shorter, even if average throughput is lower.

The fastest data to search through is always the data you ignore.

Deciding to do things this way is a tradeoff. Writing data into a search index is maybe 10x more expensive than writing it into a log. But doing actual search typically happens in something like log(n) time. Even for pathological queries, they're still far more efficient than scans.

From a cost perspective, this system is a bet that most data it writes is rarely or never read from.

The articles also discusses that it makes more complicated queries much easier to process. For instance, allowing regular expressions, or special character aware searches.

It is possible to index content and then retrieve it with (simple) regular expressions, see (https://swtch.com/~rsc/regexp/regexp4.html). The trick is that each regular expressions can be seen as a tree where leaves are n-grams to be searched for, and nodes are AND and OR logical operations of n-grams and intermediate nodes.

So, a regexp like /banana/ can be seen as a search for the trigrams "ban" and "ana" and "nan" and "ana", all in that order. A regexp like /ca[tp]/ can be seen as a search for "cat" or "cap". Once you have simple n-grams like that it's trivial to index content on the way in (note that what you index doesn't actually need to interpret content, all you need is to split every n characters) and use that index to restrict potential matches.

This quickly falls down for searches like:


What does 20 GB/sec mean in the context of a search?

The article [1] was about a log search/parsing service, so 20 GB of log files/second. Log files are usually text.

[1] http://blog.scalyr.com/2014/05/searching-20-gbsec-systems-en...

Every few years I stay in CS I see something crazy. I hope that the next crazy thing I see is a log file that isn't text.

I can't imagine who would think that would have been useful to program IF it does exist.

Why would you want your log files purely as text? It tends to take up a lot more space when you're storing a useful amount of readable information.

Category prefixes and time stamps alone take up enough space that you could benefit from having them stored concisely. Indentation actually takes up space. The lines are delimited, not counted (you can have a trailing count). And so on. Space will always be a bottleneck, and generating the data will always take time; no reason to make that worse. The average log byte will be written once and then never looked at by a human.

Producing ASCII text also requires a lot more code. Entire articles have been written on the correct way to print a float as decimal digits... or, you could memcpy 4 bytes into a buffer, while knowing ahead of time exactly how much space you'll need (and the value will round trip perfectly). Something similar goes for integers (using something like LEB128 encoding). In both cases the encoded data will likely take up less space than most useful text representations, even taking into account any control bytes, while being a lot quicker to produce. "But what if there's a bug?"... well, you'll probably find it in one of your first tests, when you write the first thing that sprang to mind. There's not much space for bugs to sneak in.

People don't like this because we have crappy tools. If `cat' supported this kind of encoding as-is, nobody would complain about the idea. You can tell by the popularity of gzip... who complains about gzip? But good luck trying to read a gzip file without the right tools.

The right tools for a gzipped log file is: less (courtesy of lesspipe).

And/or zgrep. In the worst case scenario, you pipe the output of a gunzip through whatever tools you really want to use.

The same tools work on compressed and uncompressed logs, and all of that "wasted space" compresses really well. Nothing like watching gigs of log files compress down to a few megs. It's even supported automatically in most Linux operating systems via logrotate.

Some folks force every line to be Jason serializable. That can be nice for not needing to rewrite a parser for structured log lines.

Another case is if you primarily send data via messages between processes, you could save those streams as log data. For example tcp streams or protobuf. The binary formats will be faster and smaller than converting all log output to text.

While these scenarios may not apply to the common case, they certainly exist within the field.

Doesn't systemd do that with journald?

Yes and it's one of the many reasons I'm migrating to FreeBSD once 11.0 is out (chipset support).

Not saying it's wrong for all cases, but it's not for me.

Um, what if their logs grow exponentially or faster..? Will they not struggle using that approach? The article reads like an advertisement to me.

In what circumstances would logs grow exponentially? I imagine any approach would have trouble with logs that grew in size that quickly.

Applications are open for YC Summer 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact