
Searching 20 GB/sec: Systems Engineering Before Algorithms - snewman
http://blog.scalyr.com/2014/05/06/searching-20-gbsec-systems-engineering-before-algorithms/?updated=true
======
patio11
This comes up in my work modestly frequently, generally with a slightly
different scenario for the tradeoff between "complicated and considered"
versus "cheap and dirty but gets the job done."

e.g. We could spend 3 weeks using feature vectors and backtesting against
prior data to figure out which signals are sent by accounts that are likely
to churn (for the purpose of proactively identifying them and attempting to
derisk them, like by having the customer success team help them out
directly)... or we could write a two-line if statement based on people's
intuitive understanding of what unsuccessful accounts look like. (No login in
30 days and doesn't have the $STICKY_FEATURE enabled? Churn risk!)
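
A minimal sketch of what that two-line check might look like, in Java; the
Account type and field names here are hypothetical, just to make the shape
concrete:

    // Hypothetical churn heuristic: no login in 30 days and the sticky
    // feature is off. Crude, but it ships the same afternoon.
    static boolean churnRisk(Account a) {
        return a.daysSinceLastLogin() > 30 && !a.hasStickyFeatureEnabled();
    }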

The chief benefit of the second approach is that it actually ships, 100% of
the time, whereas priorities change and an emergency develops and suddenly
the 75%-complete analysis gets thrown into a git repo and forgotten about.
Actually shipping is a very useful feature to have.

~~~
lifeisstillgood
I get the feeling this is more "era-defining" than that...

What hit me was the "Processors are so fast now we can Brute force grep over
100GB in a second".

We are entering a world where 20TB on a magnetic disk is viable, but randomly
accessing all of that data could take months. So how we store data on disks
will become vitally important to how we use the data - not unlike the tape
drives of the pre-1980s era, where rewinding to the front of the tape cost
you several minutes of (expensive) wait time.

Imagine a scenario where these guys design how to stream logs to the disk to
maximise streaming reads, then optimise reading that out to SSD thence to RAM
and L2 and so forth. All designed to drive text past a regex running at
bazillions of times a second.

Let's call it the New Brute Force, where it's just as much effort to get out
the door as elegant algorithms, but it is much much much simpler. And of
course, sells more hardware.

Expect to see "Intel Inside" wrapped around the New Brute Force any day now
:-)

Edit: And they used Java! I was expecting low-level C optimisations all over
the place.

~~~
npsimons
_What hit me was the "Processors are so fast now we can Brute force grep over
100GB in a second"._

Ironically, this may only be because grep uses very well tuned, not
immediately straightforward algorithms. See in particular
[http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html](http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html)

~~~
jug6ernaut
Awesome read, thanks for sharing! Almost worth a HN post in itself if you ask
me.

~~~
npsimons
I think it was posted a while back; it was an interesting enough article that
it stuck in my head and I was able to find it pretty easily by googling "why
is grep so fast?".

Ah, here it is:

[https://news.ycombinator.com/item?id=2393587](https://news.ycombinator.com/item?id=2393587)

And again!

[https://news.ycombinator.com/item?id=6813937](https://news.ycombinator.com/item?id=6813937)

And duplicate submissions are listed in those postings.

------
vinkelhake
I wrote a simple code search tool a number of years ago that had the ability
to run arbitrary regex queries on a codebase, much like a recursive grep. I
was working on a medium sized codebase and our primary platform was Windows.
The problem with grep was that opening and scanning lots of small files
apparently wasn't something that Windows was optimized for. If the file cache
was cold, a simple search could turn into minutes of waiting.

My approach was to, essentially, concatenate all files into a single blob. The
search tool would then let the regex loose on the contents of that single
file. I had a simple index on the side to map byte positions to filenames. To
get a line number from the position I would simply count the number of
newlines from the start of the file to the regex match position.
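
A rough Java sketch of that idea (an illustration, not the actual CodeDB
source): the files are concatenated into one string, a sorted map from start
offset to filename serves as the side index, and line numbers are recovered
by counting newlines from the start of the containing file.

    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class BlobSearch {
        private final String blob;                          // every file, concatenated
        private final NavigableMap<Integer, String> files;  // start offset -> file name

        BlobSearch(String blob, NavigableMap<Integer, String> files) {
            this.blob = blob;
            this.files = files;
        }

        void search(String regex) {
            Matcher m = Pattern.compile(regex).matcher(blob);
            while (m.find()) {
                int pos = m.start();
                Map.Entry<Integer, String> file = files.floorEntry(pos); // file containing pos
                int line = 1;
                for (int i = file.getKey(); i < pos; i++) {              // newlines since file start
                    if (blob.charAt(i) == '\n') line++;
                }
                System.out.println(file.getValue() + ":" + line + ": " + m.group());
            }
        }
    }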

Add some smarts to avoid duplicating work, plus compression and
multithreading, and I had a tool that could find all lines that contained
"int(8|16|32|64)_t" in the Linux kernel source in a third of a second. This
was about 20 times faster than grep (with a hot file cache).

It was a simple solution that scaled well enough to handle the codebases I was
working on at the time without problems. I later read Russ Cox's article[1] on
how Google code search worked. I recommend that article to anyone who's
interested in running regular expressions on a large corpus.

Edit: The source[2] is available on github. Tested on Linux and Windows.

[1]
[http://swtch.com/~rsc/regexp/regexp4.html](http://swtch.com/~rsc/regexp/regexp4.html)

[2] [https://github.com/kalven/CodeDB](https://github.com/kalven/CodeDB)

~~~
justin66
> The problem with grep was that opening and scanning lots of small files
> apparently wasn't something that Windows was optimized for.

It's something _hard disk drives_ were not optimized for.

~~~
vinkelhake
Windows is (or at least was) a dog when it comes to this even if the files are
in the cache. Like barrkel said, the number I quoted was for the case when the
OS doesn't hit the drive.

~~~
to3m
Random rant time!

You have to code for Windows just so. Dear POSIX people in general, there is
more to cross platform coding than being able to build it with cygwin.

Every grep I've used on Windows handles wildcards, so it must be doing the
expansion itself. The right way to do this sort of thing on Windows is to call
FindFirstFile/FindNextFile, which is a cross between readdir and stat in that
each call returns not only the name of one directory entry that matches the
pattern (as per readdir+fnmatch) but also its metadata (as per stat). So, if
you were going to call readdir (to get the names) and then stat each name (to
get the metadata), you should really be doing all of it in just this one loop
so that each bit of data is retrieved just the once.

But this is exactly what POSIX programmers would seemingly never, ever do.
Probably because doing it that way would involve calling a Win32 function.
Well, more fool them, because getting this data from
FindFirstFile/FindNextFile seems pretty quick, and getting it any other way
seems comparatively slow.

I cobbled together a native Win32 port of the_silver_searcher a while ago and
it was about twice as fast as the stock EXE thanks to retrieving all the file
data in one pass. In this case the readdir wrapper just needed to fill in the
file type in the dirent struct; luckily it seems that some POSIX-style systems
do this, so there was already code to handle this and it just needed the right
#define adding. (I have absolutely no idea why MinGW32 doesn't do this
already.)

Prior to switching to the_silver_searcher, I used to use grep; the grep I
usually used is the GNU-Win32 one
([http://gnuwin32.sourceforge.net/packages/grep.htm](http://gnuwin32.sourceforge.net/packages/grep.htm)),
and it looks to call readdir to get the names, and then stat to get each
name's details. I checked that just now, and after all those years I can
finally imagine why it's so slow.

GNU-Win32 find is really slow, too.

~~~
shantanugoel
As an aside to the discussion, can you point me to the changes you made? I'd
like to make silver searcher faster on Windows as well.

~~~
to3m
Sure. Here's a binary: [https://github.com/tom-seddon/the_silver_searcher/tree/_vs2012/bin](https://github.com/tom-seddon/the_silver_searcher/tree/_vs2012/bin)

The source code is there too, but I wouldn't look too closely. "Cobbled
together", like I said. Though I've been using it all the time, in conjunction
with ag-mode -
[https://github.com/Wilfred/ag.el](https://github.com/Wilfred/ag.el) \- and
haven't noticed any obvious problems.

------
corysama
A friend of mine does a lot of work that often boils down to neighborhood
searches in high-dimensional spaces (20-200 dimensions). The "Curse of
Dimensionality" means that trees and tree-based spatial data structures are
ineffective at speeding up these searches, because there are too many
different paths to arrive at nearly the same place in 150-dimensional space.

Usually the solutions end up using techniques like principal component
analysis to bit-pack each item in the dataset as small as possible, and then
buying a bunch of Tesla GPUs with as much RAM as possible. The algorithm then
becomes: load the entire dataset into the GPUs' memory once, then brute-force
linear search the entire dataset for every query. The GPUs have enormous
memory bandwidth and parallelism. With a bunch of them running at once, brute
forcing through 30 gigabytes of compressed data can be done in milliseconds.

------
userbinator
_modern processors are really, really fast at simple, straight-line
operations_

Also really tiny loops; what slows them down is having to make lots of
conditional branches/calls.

This related article could be interesting for those who are curious to know
how fast plain C (not even inline Asm) can do string searches:
[http://www.codeproject.com/Articles/250566/Fastest-strstr-like-function-in-C](http://www.codeproject.com/Articles/250566/Fastest-strstr-like-function-in-C)

I do wish Intel/AMD would make the REP SCASB/CMPSB instructions faster, since
they could enable even faster and more efficient string searches limited only
by memory bandwidth. (Or they could add a new instruction for memmem-like
functionality, although I'd prefer the former.)

------
dragontamer

        I’d certainly seen this at Google, where they’re 
        pretty good at that kind of thing. But at Scalyr, 
        we settled on a much more brute-force approach: a 
        linear scan through the log
    

Art of Computer Programming mentions this methodology, and the importance of
learning about it. There are entire sections on "Tape Algorithms" that
maximize the "brute force linear scan" of tape drives, complete with diagrams
of how a hypothetical tape drive works in that situation.

Few "fancy algorithms" books actually get into the low level stuff. But Knuth
knows: modern computers are extremely fast at linear data, and accessing data
linearly (as well as learning algorithms that access data linearly) is a good
idea.

I'll bet you that a B-tree approach would be fastest actually. B-Trees are
usually good at balancing "massive linear performance" with "theoretical
asymptotic complexity". IIRC, there is some research into this area: (look up
cache sensitive search trees)

~~~
TheSOB888
Why would you need a B-tree when you only append to the end of the logs?

~~~
dragontamer

        We use some special tricks for searches that are executed 
        frequently, e.g. as part of a dashboard. (We’ll describe 
        this in a future article.) 
    

And...

    
    
        (You might wonder why we store log messages in this
         4K-paged, metadata-and-text format, rather than
         working with raw log files directly. There are many 
         reasons, which boil down to the fact that internally,
         the Scalyr log engine looks more like a distributed 
         database than a file system. Text searches are often 
         combined with database-style filters on parsed log 
         fields; we may be searching many thousands of logs at 
         once; and simple text files are not a good fit for our 
         transactional, replicated, distributed data 
         management.)
    

It sounds like they're doing more than just "appending to the end of the log".
If you're going to make an index of any kind, the index will likely be fastest
with some sort of B-Tree.

------
j2kun
They're not using any fancy algorithms...except when they implement a
customized version of Boyer-Moore for better string searching in a critical
section of their algorithm. Not to mention all the fancy algorithms optimizing
their brute-force code underneath their immediately visible program.

A better title would be, "How we determined when to use brute force."

~~~
jfasi
Say what you will, but while Boyer-Moore can be tricky to implement, it's not
exactly a fancy algorithm.

~~~
snewman
Exactly; "fancy" is relative. It's a bit tricky, but it's nothing like the
complexity of maintaining and using a keyword index. The reference
implementation given in the article we linked to is thirty-odd lines of code.
What we're using in practice is somewhat larger, in part because Java is more
verbose for this kind of thing, but still reasonable. (If there's interest,
we'd be happy to post the code.)
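
For anyone curious what "thirty-odd lines" looks like, here is a minimal Java
sketch of the Boyer-Moore-Horspool variant (the bad-character rule only). To
be clear, this is not the Scalyr code, just the general shape of the
technique:

    // Boyer-Moore-Horspool: compare the needle right-to-left against a
    // window of the haystack; on a mismatch, skip ahead based on the last
    // byte of the window. Returns the offset of the first match, or -1.
    static int search(byte[] haystack, byte[] needle) {
        int n = haystack.length, m = needle.length;
        if (m == 0) return 0;

        int[] skip = new int[256];
        java.util.Arrays.fill(skip, m);          // bytes absent from the needle: skip its full length
        for (int i = 0; i < m - 1; i++) {
            skip[needle[i] & 0xFF] = m - 1 - i;  // distance from the rightmost occurrence
        }

        for (int pos = 0; pos <= n - m; ) {
            int j = m - 1;
            while (j >= 0 && haystack[pos + j] == needle[j]) j--;
            if (j < 0) return pos;               // matched the whole needle
            pos += skip[haystack[pos + m - 1] & 0xFF];
        }
        return -1;
    }

The longer the needle, the further each mismatch lets you skip, which is why
this tends to beat a byte-at-a-time scan on realistic search terms.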

~~~
victor106
Would love to look at your code...

------
carsongross
_modern processors are really, really fast at simple, straight-line
operations_

Exactly.

Especially given the overhead of the network: for most apps, if the user
notices anything else, you've screwed up.

KISUFI [kiss-you-fee]: Keep it simple, you fantastic individual!

~~~
TheLoneWolfling
Canadian version of KISS?

~~~
carsongross
Or the PG version of what I really mean. ;)

~~~
coopaq
You are referring to the fantastically challenged?

------
noelwelsh
I don't know that this is simple. There is a lot of fancy work going on
getting data in and out fast.

I think it's more that memory is getting slower relative to the CPU, as we
all know, so complicated data structures that involve lots of random access
to memory aren't so efficient anymore compared to linear scans through memory
that can benefit from the cache. It's still engineering; just the tradeoffs
have shifted.

------
happyhappy
Interesting product. Have you considered writing an output plugin for Heka
([https://github.com/mozilla-services/heka](https://github.com/mozilla-services/heka)),
so that people could use the parsers and client-side aggregators etc. written
for Heka with your service?

~~~
snewman
We'd certainly be open to that if asked. We practice Complaint Driven
Development [0] when it comes to API integrations and the like -- we
prioritize what our customers ask for.

For our core experience, we went with a custom agent because that allowed us
to viciously simplify the setup process. But we're very much open to working
with other tools as well.

[0] [http://blog.codinghorror.com/complaint-driven-development/](http://blog.codinghorror.com/complaint-driven-development/)

------
MCarusi
It's taught sometimes that the simple method is never the "good" approach, and
that the fancier and more elaborate you are, the better the solution will be.
I'm not just talking about code, either.

When did "simple" become such a dirty word? Simple isn't synonymous with lazy.

------
ddorian43
Look at their pricing. The best GB/$ public plan is $500/month for (worst
case scenario) 300GB of data at any one time.

At that price, you can keep most of the data in memory or on SSD and not
bother with fancy algorithms.

But what if the pricing were 10x lower?

~~~
jcampbell1
300 GB of RAM is going to cost ~$3000/month.

Reading 300GB from an SSD is going to be way too slow. It will take minutes
per search. In a simple test on Linode, the SSD seems to read at about 1GB/s.

Their pricing seems like a bargain to me.

~~~
Wilya
300GB of RAM might cost $3000/month on AWS. There are a few dedicated server
providers that offer 384GB or even 512GB configs for less than $1000/month.

~~~
herf
lz4 typically compresses logs 6:1 (note: the "high compression" mode allows
fast searching.)

------
babs474
This randomly reminded me of an ancient kuro5hin article that I found
interesting back in the day, where the proposal was to use brute force to
implement comment search on kuro5hin.

[http://www.kuro5hin.org/story/2004/5/1/154819/1324](http://www.kuro5hin.org/story/2004/5/1/154819/1324)

------
stuki
In BI, a similar tradeoff between lots of fancy preaggregation and optimizing
search across the raw, un- (or lightly-) processed base data comes up quite
frequently.

Commonly, the choice of approach is dictated by the number of end users
querying the same data. If it's relatively specific queries run frequently by
many users, the aggregation approach can be made snappier and can save lots
of processing power.

But for power users, it becomes a nuisance real fast to be limited not only by
stored base data, but also by the somewhat arbitrarily chosen aggregate
structures. I'm assuming people doing log analyses of server logs fall within
"power users" in most shops. At least I'd hope so :)

As an aside, the Go language (which I'm currently flip-flopping between
loving and not so much), versus Python et al., seems to be born of somewhat
similar thinking.

------
sorbits
The article’s title is: “Searching 20 GB/sec: Systems Engineering Before
Algorithms”.

Current title on Hacker News is misleading as the “simple code” is not
compared against any “fancy algorithms”. This is about not spending time
devising fancy algorithms when the simple O(n) approach is good enough.

~~~
dang
Thanks. We missed that one.

Submitters: It's against the guidelines to rewrite titles to put your own
editorial spin on things. Please don't.

------
stcredzero
I've implemented a brute-force search with an exponential algorithm, in a
context where the user would want instant results. Basically, I implemented a
bipartite graph producing algorithm that took the "obvious, low hanging fruit"
first, and only worked for a set amount of time. This produced a "sloppy
matching" tool that did most of the user's busy work for matching natural gas
coming from "upstream" and going to points "downstream." Then the user could
eyeball a few optimizations and put those in by hand.

I've also implemented a "web search" that was just a complete scan through a
text file. But this was back in 1997, and the amount of data we had didn't
justify anything more complicated.

------
iadapter
Reminds of the way LMAX achieve high throughput and low latency on their
trading exchange. One of the key points is that they use parallelism but not
concurrency. Keep your threads busy doing one thing only and avoid
synchronising with other threads. No blocking, no queues and they achieve
>100K messages/s.

Where they do need pass messages between threads they use the famous LMAX
Disruptor which keeps data copying and blocking to a minimum.

------
zimpenfish
I did a similar thing when holding a forum in Redis - generating a keyword
index for searching took up 3x the space of the raw text and, whilst the
index was faster for single whole-word searches, fetching each article and
grepping it in Ruby was plenty fast enough for my needs. Plus there were no
overheads of index maintenance (new posts, expiring posts), search limits
("only posts by X", "last 7 days"), etc.

------
serbrech
Why not build on top of existing full-text index/search engines? Why would
I choose this over something like Kibana
([http://rashidkpc.github.io/Kibana/](http://rashidkpc.github.io/Kibana/))
that provides me with all this and more, is built on top of existing capable,
scalable open source products, and is itself free and open source?

------
coldcode
Simple is an algorithm, more or less. Simple and in-memory is something I've
used in a number of cases. The other benefit of simple is reliability. 20-30
years ago this type of solution wasn't really possible, but with today's
64-bit CPUs, tons of RAM, and (if you have to) SSDs, a lot of what used to
require cleverness can now be done with simple.

------
retube
In order to facilitate sub-word or wildcard searches, can't you use a
keycharacter index instead of a keyword index?

~~~
snewman
Yes, you can go down that road [0], and in some applications it's a good
approach. However, it adds quite a bit of complexity, and the cost for
creating and storing the index is substantial. For us, it doesn't pencil out
to a win.

[0]
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.362...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.362.2207&rep=rep1&type=pdf)
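
For illustration only, one way down that road is a character n-gram index.
The toy Java sketch below uses trigrams (it is not the scheme from the cited
paper): each document is indexed under every 3-character window it contains,
a query intersects the posting lists of its trigrams, and the surviving
candidates are verified with a plain substring scan.

    import java.util.*;

    class TrigramIndex {
        private final Map<String, Set<Integer>> postings = new HashMap<>();
        private final List<String> docs = new ArrayList<>();

        void add(String doc) {
            int id = docs.size();
            docs.add(doc);
            for (int i = 0; i + 3 <= doc.length(); i++) {
                postings.computeIfAbsent(doc.substring(i, i + 3),
                                         k -> new HashSet<>()).add(id);
            }
        }

        List<Integer> search(String query) {
            List<Integer> hits = new ArrayList<>();
            if (query.length() < 3) {                  // too short for the index: scan everything
                for (int id = 0; id < docs.size(); id++) {
                    if (docs.get(id).contains(query)) hits.add(id);
                }
                return hits;
            }
            Set<Integer> candidates = null;            // intersect the posting lists
            for (int i = 0; i + 3 <= query.length(); i++) {
                Set<Integer> p = postings.getOrDefault(query.substring(i, i + 3),
                                                       Collections.emptySet());
                if (candidates == null) candidates = new HashSet<>(p);
                else candidates.retainAll(p);
            }
            for (int id : candidates) {                // trigram hits can be false positives
                if (docs.get(id).contains(query)) hits.add(id);
            }
            return hits;
        }
    }

It handles substring (and hence many wildcard) queries, but the index has to
be built, stored, and kept in sync with the data, which is where the
complexity and cost mentioned above come from.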

~~~
sharkbot
I wonder if bitap [0] would be a good fit for the 4K search algorithm. It
would let you do linear-time regexp matching for relatively short patterns (32
characters on a 32-bit machine, 64 characters on 64-bit, etc).

[0]
[http://en.wikipedia.org/wiki/Bitap_algorithm](http://en.wikipedia.org/wiki/Bitap_algorithm)
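
For reference, a small Java sketch of exact-match bitap in its shift-and
formulation (approximate and regexp matching add more state, but the core is
the same); as noted above, the pattern has to fit in one machine word:

    // Bitap (shift-and), exact match only. Bit i of state is set when
    // pattern[0..i] matches the text ending at the current position.
    // Assumes byte-range (Latin-1) characters for brevity.
    static int bitap(String text, String pattern) {
        int m = pattern.length();
        if (m == 0) return 0;
        if (m > 64) throw new IllegalArgumentException("pattern too long for one word");

        long[] masks = new long[256];            // masks[c]: bit i set iff pattern[i] == c
        for (int i = 0; i < m; i++) {
            masks[pattern.charAt(i) & 0xFF] |= 1L << i;
        }

        long state = 0;
        long accept = 1L << (m - 1);             // reaching this bit means a full match
        for (int pos = 0; pos < text.length(); pos++) {
            state = ((state << 1) | 1) & masks[text.charAt(pos) & 0xFF];
            if ((state & accept) != 0) return pos - m + 1;
        }
        return -1;
    }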

------
tantalor
Who is the author of this? Byline please!

------
tantalor
Title should be, "Searching 20 GB/sec: Systems Engineering Before Algorithms"

------
a8da6b0c91d
Herb Sutter gave a talk somewhat recently going over the basics of working
intelligently with the cache. He had some kinda surprising results where
naive algorithms on vectors/arrays outperformed the intuitively better
list/tree approaches. Circumstances where you'd think all the copying of
elements and resizing of std::vector would be expensive turn out not to
matter.

[http://channel9.msdn.com/Events/Build/2014/2-661](http://channel9.msdn.com/Events/Build/2014/2-661)

The basic take-away is to always prefer std::vector and
boost::flat_set/flat_map unless you have evidence to the contrary.

~~~
CHY872
Heard the same thing from Bjarne Stroustrup. The cache properties of vectors
are incredible.

------
wbsun
How fancy and efficient the underlying runtime library, memory management,
process scheduling, network stack, block I/O, and device drivers must have
been designed and implemented, so that someone can just write naive code,
achieve such high performance, and think himself a genius.

------
gailees
Always.

~~~
bnegreve
One can name thousands of examples (literally) where the simple code doesn't
beat the fancy algorithm. What's your point?

------
taeric
Isn't this answered quite simply by "when they get the job done to your
satisfaction?"

------
suyash
@Author: First of all, simple code and fancy algorithms are not opposites.
Actually, very simple code (elegantly and concisely written) can have really
fast performance, on the order of O(1). What you're suggesting is a very
naive way of looking at the problem.

Secondly, you need to understand algorithm analysis in a little more detail.
If you look at the run-time analysis of your brute-force algorithm, you can
determine whether it grows in linear time, logarithmic, exponential, etc.,
and how to improve upon the core algorithm. Anyone can add more machines,
more memory, or a faster processor, which will obviously help performance,
but in comparison to the order of growth in 'n', all of that pales into the
background.

