
How long does it take to linear search 200MB in memory? - jotto
https://www.jonathanotto.com/linear-search-benchmark
======
glangdale
Not wanting to be the 'Old Man Yells At Cloud' figure, but this is a perfect
illustration of how _not_ to do it, particularly multithreading in Go before
using a reasonable algorithm.

Searching for a long string like this is preposterously easy. You can use a
classical algorithm, or, given you are looking for a 64 byte long string, you
could do a strided inspection - every 32 bytes, look to see if you have a
4-gram from the first 35 bytes of the string. More or less _anything_ works as
almost any reasonable string searching mechanism can outrun memory bandwidth.

Or you could go use my old project
([https://github.com/intel/hyperscan](https://github.com/intel/hyperscan)) and
then you could look for regexes too. :-)

~~~
s_ngularity
Depends on how often you need to search. The implementation of the parallel
linear search in Go is a lot easier to get right than what you’re
recommending.

Though I would certainly agree that paying for more vm cores just so you can
do a linear search on each of them is a bit silly

~~~
glangdale
The linear search code should probably be in a library, and in fact often is,
if you have the GNU extension memmem().

I'm also not convinced that it's that hard. This seems to be the new cult;
just because someone buggered up binary search with an integer overflow once,
we're not going to write non-pathological algorithms anymore?

~~~
s_ngularity
I’m not saying it’s a bad idea to implement it, but there’s a greater time
cost, and if the performance gains aren’t worth the effort, then there’s no
reason to make the improvement

~~~
glangdale
Fair enough. I guess you would figure out how long it takes to either clag the
code from somewhere else or write it yourself and plug in the math at the
bottom of the page (as they are trying to quantify the value of doing this).

I strongly suspect dragging in memmem() would be the most cost effective way
of doing this and run in the most places, even if it's a non-standard GNU libc
extension.

------
loser777
0.25s for 200MB (in memory) seems pretty slow for a modern CPU (800MB/s),
which is more than an order of magnitude below what you would expect from main
memory.

There can be a lot of performance left on the table even in standard library
implementations: [http://0x80.pl/articles/simd-
strfind.html](http://0x80.pl/articles/simd-strfind.html)

~~~
Rapzid
This isn't super surprising to me. String comparisons are super slow if not
handled with care, and even then may be surprisingly slow. This does a LOT of
string comparisons.

Depending on the requirements, sliding a window through doing byte comparisons
may be better.
[https://golang.org/src/bytes/bytes_decl.go?s=522:550#L7](https://golang.org/src/bytes/bytes_decl.go?s=522:550#L7)
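A minimal sketch of that sliding-window-with-byte-comparison approach (the function name is mine, leaning on bytes.Equal from the linked file):

```go
package main

import (
	"bytes"
	"fmt"
)

// windowIndex slides a needle-sized window over the haystack and compares
// each window with bytes.Equal, which the Go runtime optimizes.
func windowIndex(hay, needle []byte) int {
	for i := 0; i+len(needle) <= len(hay); i++ {
		if bytes.Equal(hay[i:i+len(needle)], needle) {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(windowIndex([]byte("xxAXAXAYxx"), []byte("AXAY"))) // prints 4
}
```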

EDIT: I believe Papertrail does or did have a limitation on case-sensitive
searching. You can imagine why they might do this if you consider the
performance implications.

------
swerling
The Ruby version looked a bit odd to me. It would be unusual not to use a
regex in this context.

I'm not sure if this is in the spirit of the test, but using regex reduced the
time from 27s to 0.04s on my machine.

    
    
        require 'benchmark'
    
        contents = File.read("/tmp/200mb.txt")
        find = "07123E1F482356C415F684407A3B8723E10B2CBBC0B8FCD6282C49D37C9C1ABC"
        result = nil
        elapsed = Benchmark.realtime do
          result = (contents =~ /#{find}/)
        end
        puts "Found at index #{result} in #{elapsed}s"
    
    

(The search string was placed at the end of a 200mb file)

~~~
sinap
You can reduce the time to 0.02s by just using `String#index`.

------
userbinator
The fastest, Golang at slightly over 2.8GB/s, is still surprisingly slow for
modern (as in, under 10 years old) hardware. I just wrote a similar linear
search in Asm using the not-so-fast REPNZ SCASB/CMPSB instructions and got
6GB/s on an i7 860 from 2009.

If you're just scanning for the string then algorithms like BMH can do up to x
times memory bandwidth, where x is the length of the string, since they start
matching from the end and can skip over chunks of x bytes at once.
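A rough Go sketch of the Horspool variant of that skip-based family (a simplification of full Boyer-Moore; the function name is mine): the last byte of each window decides how far the next window can be placed, so long needles let the search leap over most of the haystack.

```go
package main

import (
	"bytes"
	"fmt"
)

// horspoolIndex: Boyer-Moore-Horspool search. Mismatching windows are
// shifted by a per-byte skip table built from the needle.
func horspoolIndex(hay, needle []byte) int {
	n, m := len(hay), len(needle)
	if m == 0 {
		return 0
	}
	var shift [256]int
	for i := range shift {
		shift[i] = m // bytes absent from the needle allow a full-length skip
	}
	for i := 0; i < m-1; i++ {
		shift[needle[i]] = m - 1 - i // rightmost occurrence wins
	}
	for i := 0; i+m <= n; {
		if bytes.Equal(hay[i:i+m], needle) {
			return i
		}
		i += shift[hay[i+m-1]] // skip based on the window's last byte
	}
	return -1
}

func main() {
	fmt.Println(horspoolIndex([]byte("find the needle in this haystack"), []byte("needle"))) // prints 9
}
```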

This article from someone with a rather... distinctive style shows >10GB/s on
5-year-old hardware with target strings only slightly over 50 bytes:

[https://www.codeproject.com/Articles/250566/Fastest-
strstr-l...](https://www.codeproject.com/Articles/250566/Fastest-strstr-like-
function-in-C)

~~~
gfody
I recall repne scasb was significantly slower than the equivalent loop
unrolled a few times. Not sure if that's still the case but it was something
like 2x slower.

------
drej
I don’t have a computer at hand, but I imagine it would be a lot shorter (and
likely faster, due to algorithmic optimisations) to use bytes.Index. As others
have pointed out, 1 GB/s seems slow; I would expect that performance when
reading from disk.

(Also, iterating over indexes is not too idiomatic in Go. You can iterate both
the index and value using `range`, just like enumerate in Python.)

[https://golang.org/pkg/bytes/#Index](https://golang.org/pkg/bytes/#Index)

~~~
twotwotwo
Turns out bytes.Index is similar in speed or slower at first, but search time
goes down to 40ms as soon as you lop off the last byte.

For strings over (on my machine) 63 bytes, bytes.Index uses a rolling Rabin-
Karp hash. For this data it's ~25% slower than the search loop. However, that
evens out if you make the whole file hex, which gives the search loop some
branch mispredictions. (Also, the simple loop search isn't complete! More
below.) The R-K search is more reasonable than this one comparison makes it
sound; it avoids bad worst-case behavior when there are a lot of partial
matches. (Also, it may be less optimized because the authors think most
bytes.Index searches will be for short bits of data.)

Relatedly, the loop search doesn't back up its position when it discovers a
partial match isn't a complete match, which means it's incorrect; it won't
find a real match that overlaps with a false match (a prefix of a match).
Like, when I made a file that contained
"07123E1F482356C415F684407123E1F482356C415F684407A3B8723E10B2CBBC0B8FCD6282C49D37C9C1ABC",
it didn't find the match starting at the second 071... because it overlaps
with a prefix of a match (at the start of the string) that didn't turn out to
be a full match. (The overlap is two digits here, but same thing would come up
if it were one digit.)

Searching for shorter strings, Index on Intel platforms uses AVX/SSE search.
If you're going to search for long strings and feel lucky, you could search
for a prefix of whatever your platform's max SIMD-searchable length is, then
check if it's a full match. If so, you're done; if not, keep searching
starting at the next byte. In case of pathological data you could bail out to
the safe R-K approach after n false matches, like one of Index's existing
strategies does. It might be something for someone who wants a stdlib patch to
try submitting to Go, though it might also be that they see it as too much of
a corner case to merit complicating things more.
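That prefix-then-verify strategy can be sketched like so (a toy illustration, with bytes.Index standing in for the platform's SIMD-searchable prefix length; names are mine):

```go
package main

import (
	"bytes"
	"fmt"
)

// prefixIndex searches for a short prefix of the needle, then verifies the
// full match at each hit; on a false hit it resumes one byte later, so
// overlapping partial matches are never skipped.
func prefixIndex(hay, needle []byte, plen int) int {
	if len(needle) == 0 {
		return 0
	}
	if plen < 1 {
		plen = 1
	}
	if plen > len(needle) {
		plen = len(needle)
	}
	p := needle[:plen]
	for off := 0; ; {
		j := bytes.Index(hay[off:], p) // fast prefix scan
		if j < 0 {
			return -1
		}
		s := off + j
		if s+len(needle) <= len(hay) && bytes.Equal(hay[s:s+len(needle)], needle) {
			return s
		}
		off = s + 1 // false hit: resume at the next byte
	}
}

func main() {
	fmt.Println(prefixIndex([]byte("AXAXAY"), []byte("AXAY"), 2)) // prints 2
}
```

A production version would also bail out to a safer algorithm after too many false hits, as described above.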

------
kwillets
That search doesn't handle head/tail overlaps, but otherwise it's a good
approximation. I'm surprised it isn't faster; I think I got a few hundred MB/s
on trie-based matching.

Protip: bring benchmarks like this to your meeting with the sales rep.

~~~
Matheus28
It's pretty bad: Say it's searching for "AB". It won't find it in this string:
"AAB"

~~~
baddox
I think the point was just to test each string against the query for equality.
The benchmark could be slightly tweaked to test each string against the query
for substring inclusion or regular expression match, but I suspect the results
would only change by the cost of the test predicate.

~~~
Matheus28
Not necessarily when comparing different languages, as they might allow
different optimizations. For example, you'd be able to compare several bytes
at a time in C if you pad the strings, while node.js' V8 is unlikely to
optimize that far.

------
tbrownaw

        console.log(bulk.length); console.log(new Date()); pos= bulk.indexOf("788bbd6b12d79a4ee18e36a8a264fb496465c71b7932a965ed731397b7c97d14"); console.log(pos); console.log(new Date());
        209715266                           debugger eval code:1:1
        Date 2018-04-28T00:08:30.231Z       debugger eval code:1:27
        209715202                           debugger eval code:1:139
        Date 2018-04-28T00:08:30.252Z       debugger eval code:1:157 
    

That's about 0.021 seconds for however Firefox implements String.indexOf().

    
    
        $ dd if=/dev/urandom of=200mb.dat bs=1M count=200
        200+0 records in
        200+0 records out
        209715200 bytes (210 MB, 200 MiB) copied, 2.98279 s, 70.3 MB/s
        
        $ echo 788bbd6b12d79a4ee18e36a8a264fb496465c71b7932a965ed731397b7c97d14 >>200mb.dat 
        $ time grep -o 788bbd6b12d79a4ee18e36a8a264fb496465c71b7932a965ed731397b7c97d14 <200mb.dat 
        Binary file (standard input) matches
        
        real    0m0.056s
        user    0m0.028s
        sys     0m0.028s
    

That's not too terribly different.

~~~
fao_
Something feels off about this, but I can't put my finger on it.

------
mappu
The multi-core Go benchmark is spawning the goroutines before calling `start
:= time.Now()`.

~~~
rollcat
Go's concurrency model defines that memory writes in one goroutine won't be
observable in another until a context switch can happen (e.g. by send/recv on
a channel). In practice, newly spawned goroutines often don't get the chance
to run at all until that context switch. But yeah, it's not a good idea to
rely on this behavior here.

~~~
zlynx
I'm not sure where you got that information?

Go's goroutines launch on threads when multiple CPUs are available. They have
all the race conditions of C code. Memory writes certainly are visible on
other goroutines.

~~~
rollcat
> I'm not sure where you got that information?

From the very official spec?
[https://golang.org/ref/mem](https://golang.org/ref/mem)

Goroutines do not map to OS threads 1:1. Whether the scheduler decides to
spawn a thread or not is its own, entirely private decision. No action is
guaranteed to be taken until scheduling occurs.

~~~
zlynx
Well, from that spec: "When multiple goroutines access a shared variable v,
they must use synchronization events to establish happens-before conditions
that ensure reads observe the desired writes."

However, that very specifically does NOT say that writes may not be observed
BEFORE synchronization events. They MAY.

Under "Incorrect Synchronization": "Note that a read r may observe the value
written by a write w that happens concurrently with r. Even if this occurs, it
does not imply that reads happening after r will observe writes that happened
before w."

These memory ordering rules are the same as C, C++ or machine code. Go makes
no special effort to avoid the natural machine behavior of x86, ARM or MIPS.

And on goroutine spawning and execution order, I have observed both behaviors.
Sometimes Go will continue on the current goroutine before launching other
goroutines. Other times the new goroutines begin executing immediately. It's
possible that some other, background goroutine is causing a scheduling event,
but if so it's so random that no one could predict it.

------
fallingfrog
A single frame of Quake is 800x600 ≈ 1MB of memory, and Quake ran at 30fps on
a Pentium 233 in software. That's 30MB per second circa 1995. So your times
all seem extremely slow to me. You should be getting a couple orders of
magnitude better speeds. Try it in Rust!

------
piinbinary
I'm actually somewhat surprised it wasn't faster. I may play with this and see
how fast I can get it to go.

Edit: After a number of attempts (such as using Golang's regexp library or
strings.Contains()), I wasn't able to speed it up much. Interestingly, grep
performs the same search in about 5ms on my machine.

~~~
selljamhere
grep is the product of a lot of thought and engineering. There's a great
rundown by GNU grep's author, if you're interested [1].

One big takeaway is that grep doesn't simply traverse the string linearly,
instead using the Boyer-Moore algorithm [2].

1: [https://lists.freebsd.org/pipermail/freebsd-
current/2010-Aug...](https://lists.freebsd.org/pipermail/freebsd-
current/2010-August/019310.html#)

2:
[https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_sea...](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm)

EDIT: Interestingly, there seems to be an internal Boyer-Moore implementation
in Golang's strings package [3], though it only appears to be used in
strings.Replace [4]

3:
[https://golang.org/src/strings/search.go](https://golang.org/src/strings/search.go)

4:
[https://golang.org/src/strings/replace.go?h=stringFinder](https://golang.org/src/strings/replace.go?h=stringFinder)

~~~
Teknoman117
And if you want an even faster version of grep, check out ripgrep (which
happens to be written in Rust).

[https://github.com/BurntSushi/ripgrep](https://github.com/BurntSushi/ripgrep)

------
contravariant
Ignoring the somewhat low performance, unless I'm misreading things, this
version won't be able to find the string:

    
    
       07123E1F482356C415F684407A3B8723E10B2CBBC0B8FCD6282C49D37C9C1ABC
    

in the string:

    
    
        07123E1F482356C415F684407123E1F482356C415F684407A3B8723E10B2CBBC0B8FCD6282C49D37C9C1ABC
    

or to make this a bit easier to read, it won't find:

    
    
        AXAY
    

inside the string:

    
    
        AXAXAY
    

At the start it will match all the way up to "AXA", then fail to match "AXAY"
against "AXAX", and then it will try to search for "AXAY" in the remainder
"XAY", which fails.

The correct way of handling this is either searching from the start for every
possible position, which will usually fail quickly so it's not _that_
inefficient, or taking into account that partial matches might overlap with an
actual match, using something like the KMP algorithm.

------
nartz
Maybe I'm blind, but I'm not seeing where the benchmark results are? Also, is
this code the same code that you actually ran? For Ruby, I notice you are
outputting 'contens', not 'contents', so I imagine you would get a runtime
error?

~~~
revel
I also had a hard time finding the performance numbers. They're listed in the
grey bar at the top of each language.

I'd love to see this comparison expanded to a few other languages
(particularly C/C++ with custom memory allocation).

~~~
quickben
A C++ implementation gives 5.7 ms for searching through 200MB of
vector<string> on a Ryzen 1700x.

EDIT: Reworking the code.

~~~
prewett
I did some benchmarking for a dictionary app a while back, and the fastest I
could get was strstr(), if I recall correctly. It substantially beat
std::string::find(). Using a loop is bound to give terrible performance; for
large data you should be using one of the string-search algorithms (which
strstr() surely does). And given the very large search string, those string-
search algorithms should be able to skip large chunks at a time.

Here's a small version that allocates 200 MB of uninitialized memory.

    
    
      #include <stdlib.h>
      #include <string.h>
      int main(int argc, char *argv[])
      {
    	int size = 200 * 1024 * 1024;
    	char *bytes = malloc(size);
    	bytes[size - 1] = '\0';
    	strstr(bytes, "788bbd6b12d79a4ee18e36a8a264fb496465c71b7932a965ed731397b7c97d14");
    	return 0;
      }
      
      > gcc test.c
      > time ./a.out
      real	0m0.006s
      user	0m0.002s
      sys	0m0.002s
    

(Since the author ignores time to read off disk, I figure uninitialized memory
should be comparable. The chances of strstr() actually finding anything are
pretty slim, but I ran it a number of times just to be sure, with the same
results.)

This is on a 2012 MacBook Pro.

~~~
cozzyd
Note that strstr will stop as soon as you have a 0 byte so this is not a fair
comparison.

Uninitialized (practically often zeroed) also produces different results from
random and explicitly zeroed, since I assume it affects how often it can fail
fast.

    
    
        #define _GNU_SOURCE
        #include <stdlib.h>
        #include <stdio.h>
        #include <string.h>
        #include <time.h>

        int main(int nargs, char ** args)
        {
          struct timespec start;
          struct timespec end;
          int mode = atoi(args[1]);
          const char * needle = "788bbd6b12d79a4ee18e36a8a264fb496465c71b7932a965ed731397b7c97d14";
          const size_t nlen = strlen(needle);
          const size_t N = 200*1024*1024;
          char * haystack = malloc(N + nlen + 1);
          strcpy(haystack + N, needle);

          printf("Mode is %s\n", mode == 0 ? "zeroed" :
                                 mode == 1 ? "random" :
                                 "uninitialized");

          if (mode == 0)
          {
            memset(haystack, 0, N);
          }

          if (mode == 1)
          {
            FILE * urandom = fopen("/dev/urandom", "r");
            size_t nread = fread(haystack, 1, N, urandom);
            fclose(urandom);
            if (nread < N)
            {
              fprintf(stderr, "read returned %zu\n", nread);
            }
          }

          clock_gettime(CLOCK_REALTIME, &start);
          char * found = memmem(haystack, N + nlen, needle, nlen);
          clock_gettime(CLOCK_REALTIME, &end);

          printf("Took %g secs to find string at offset of %td\n",
                 end.tv_sec - start.tv_sec + 1e-9*(end.tv_nsec - start.tv_nsec),
                 found - haystack);
          return 0;
        }
    
    
    
        $ gcc test_find.c -o test_find
        $ ./test_find 0
          Mode is zeroed
          Took 0.0147023 secs to find string at offset of 209715200
        $ ./test_find 1
          Mode is random
          Took 0.129951 secs to find string at offset of 209715200
        $ ./test_find 2
          Mode is uninitialized
          Took 0.0371922 secs to find string at offset of 209715200

~~~
kazinator
Note that if you malloc that big of a block, it's going to come directly from
a MAP_ANONYMOUS _mmap_ , and so it _will_ be zero initialized!

Your uninitialized mode _is_ slower than zeroed, but nowhere near as slow as
random.

The difference between initialized and uninitialized may be due to page
faults: in the uninitialized case, the search is touching the memory for the
first time. Maybe we're seeing TLB misses?

~~~
cozzyd
Yeah I was expecting the uninitialized to be the same as the zeroed case for
that reason, but was surprised to see a (slightly) different result.

Edit: you are almost certainly right. I added a getrusage call before and
after, and we have ~50k soft page faults in the uninitialized case and ~2 in
the zeroed case during the memmem.

------
gerad
[https://blog.burntsushi.net/ripgrep/#literal-
optimizations](https://blog.burntsushi.net/ripgrep/#literal-optimizations)

------
peterburkimsher
I made a program to add spaces into Chinese text, which requires a lot of
string comparisons.

[https://pingtype.github.io](https://pingtype.github.io)

The algorithm works as follows: Read 5 characters. Is that in the dictionary?
No. What about the first 4? No. 3? No. 2? Yes! Copy to the output string, add
a space, slide along 2 characters. Repeat.

Searching a 5 MB dictionary repeatedly like this was too slow. I made it a
little faster by splitting the dictionary based on the number of characters in
the word (fiveWords, fourWords, threeWords, twoWords). It's still slow,
though. I have to run it in local JavaScript, because I don't have a server.
Any suggestions?

On a totally different extreme, the iTunes Store's EPF data (freely
available!) is 55 GB of text. I'd like to sort out Artist names by Producer.
To grep the file takes 30 mins. Is there a faster way?

Third, my algorithm for offline StackOverflow was optimised for disk space. I
search the index of titles as 200MB of plain text to get the post ID. Then I
use dd to get a 4KB chunk of the tar.bz2 compressed file. I can read it using
bzip2recover. Then I check if the post ID is there, and binary search like
that. It's slow (5 seconds to load) but doesn't waste my precious disk space,
and I can be patient when I need offline StackOverflow.

~~~
repiret
I would put your dictionary into a prefix tree of some sort, and the first
implementation I would reach for is a ternary search tree [1]. As you scan
your input, walk the tree. Each time you reach the end of a word in the tree,
insert a space and go back to the root. Once the tree is built, it’s O(length
of input) [2] regardless of dictionary size.

1:
[https://en.m.wikipedia.org/wiki/Ternary_search_tree](https://en.m.wikipedia.org/wiki/Ternary_search_tree)

2: for a fixed alphabet size. It’s really closer to O(dictionary size *
log(alphabet size)) if that increases with dictionary size.

------
cheez
This is actually something I'm trying to figure out. I currently use (and pay
for) Scalyr, which is the best of the bunch (super fast, flexible alerting)
but I hate shipping my log data there. It just seems like a potential attack
surface I am not really happy about.

Is there anything open source that I can run in a docker container somewhere
on the internets that allows me to create custom alerts, extract data, create
charts and dashboards?

~~~
snewman
Scalyr CEO here. I'd love to hear more about how you're thinking about attack
surfaces – drop me a line? My email address is in my profile.

~~~
cheez
For example, I am wary about who has access to my log data. I know that
theoretically, I "invite" your support team, and can revoke it, but without an
understanding of the architecture behind Scalyr, I am afraid this is just lip
service.

------
watmough
Hmmm, the Boyer-Moore indexing in Go takes 2x as long as the naive linear
search.

Now I feel bad for wasting an hour wrestling with Go weirdness to get my
program to compile.

------
LordHeini
Regardless of the time it takes to search through an array, you really don't
want any sort of naive string search combing through your logfiles.

It might depend on your use case, but if you have some constantly running web
app, a service to digest all the logs is very nice to have.

It is true that a lot of those logging services are absurdly expensive, and I
would even raise privacy concerns about sending all your internal data to
someone else.

I really don't get why so many people buy overpriced services, when all it
takes is half an hour and the skill to follow an installation guide.

At our company we use Graylog; I guess there are other good alternatives too.
It's open source, and we slapped it on some cheapo v-server which costs a few
bucks per month. That thing swallows something like 150 messages a second, and
finding things takes no time at all. And if need be, we can scale it to
whatever is needed.

------
stochastic_monk
You’d be much better off with an FSA or (honestly) Boyer-Moore implementation.
For a baseline, a simple fgrep should show you FSA speed.

You might want to read about classical string search algorithms. Your naive
algorithm is straightforward but far from ideal.

------
gerdesj
"But isn't transient log storage a popular use case, and if so why don't any
services optimize for it?"

Well I could start with "define transient"! I also note your unhappiness with
third party logging facilities. I also note you went very DIY.

A three node ES cluster with a Graylog on the front can ingest quite a lot of
logs for a reasonable time outlay in learning how it hangs together. A bit of
tuning might be required.

I suspect I'm not your target market.

------
misterbowfinger
The Ruby implementation looks wrong. He's shoving the file into an array:

    
    
      contents = contents.read.each_byte.to_a
    

When really he should be streaming each byte from the file. I suppose he
should do that for all of the implementations (or just use grep & cat?), but
Ruby objects are annoyingly expensive since there's so much bloat.

------
lowbloodsugar
So, if this is to measure the relative speed of a naive algorithm across
multiple languages then great. If this is an attempt to measure how fast a
single cpu can scan memory for a string then there are better algorithms. This
is just one: [1]

1: [https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm](https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm)
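For reference, the rolling-hash trick behind Rabin-Karp looks roughly like this in Go (a toy sketch with an arbitrary multiplier; this is not the stdlib's implementation):

```go
package main

import "fmt"

// rabinKarp rolls a hash over the haystack and only compares bytes on a
// hash hit, so each position costs O(1) regardless of needle length.
func rabinKarp(hay, needle string) int {
	n, m := len(hay), len(needle)
	if m == 0 {
		return 0
	}
	if m > n {
		return -1
	}
	const base = 16777619 // arbitrary multiplier (happens to be the FNV prime)
	var target, h, pow uint32
	pow = 1
	for i := 0; i < m; i++ {
		target = target*base + uint32(needle[i])
		h = h*base + uint32(hay[i])
		if i > 0 {
			pow *= base // base^(m-1), used to remove the outgoing byte
		}
	}
	for i := 0; ; i++ {
		if h == target && hay[i:i+m] == needle {
			return i
		}
		if i+m >= n {
			return -1
		}
		// Roll the window: drop hay[i], add hay[i+m].
		h = (h-uint32(hay[i])*pow)*base + uint32(hay[i+m])
	}
}

func main() {
	fmt.Println(rabinKarp("hello rabin karp", "rabin")) // prints 6
}
```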

------
vardump
Golang single core version:

    
    
        if (b[n]) == (find[found]) {
    

Isn't indexing an UTF-8 string "find" O(n)? To get 10000th char, you need to
iterate all 9999 previous characters first.

Edit: Clarified I'm talking about string "find" and not byte array "b".

~~~
pbourke
I don't know Go, but b and found appear to be byte arrays, not strings, which
would imply O(1) indexing.

~~~
vardump
Single core version:

    
    
      find := "07123E1F482356C415F684407A3B8723E10B2CBBC0B8FCD6282C49D37C9C1ABC"
    

That's definitely a string.

Otherwise, the (single core) golang code seems to be just a test of whether
the compiler is smart enough to remove the branch in this code:

    
    
        if (b[n]) == (find[found]) {
          found += 1
        } else {
          found = 0
        }
    

I wouldn't be too surprised if the golang compiler generates superfluous range
checks, costing even more performance. At least golang 1.4 seemed to generate
a lot of unnecessary checks in code I optimized a few years ago. Hopefully
newer versions are better.

~~~
pbourke
yep, my mistake - was looking at multicore

------
candiodari
[https://lists.freebsd.org/pipermail/freebsd-
current/2010-Aug...](https://lists.freebsd.org/pipermail/freebsd-
current/2010-August/019310.html)

How to really do it.

------
calibration263
Using the same algorithm I got 0.175 or so seconds in Java.

Sample code: [https://pastebin.com/duYF2QfE](https://pastebin.com/duYF2QfE)

This was on a "Early 2015" rMBP with a 3.1 GHz i7.

~~~
vardump

      if (data[i] == strArr[found]) {
    

Shouldn't that be:

    
    
      if (data[i] == stringToFind.codePointAt(found)) {
    

... if you claim to use the same algorithm. That's what golang needs to do to
index a UTF-8 string, right? (Comparing to golang single core version.)

That way both golang and Java will be performing proper unicode code point
indexing.

Of course the optimization you made makes sense, but I think it'd be fair for
the golang version to do same.

------
jtchang
I wonder how long a simple grep over a 200MB file takes on Linux. I guess if
you put the 200MB file on a ramdisk then you could easily do it from memory.

That would be my quick and dirty first approach.

------
m1el
I'm kinda surprised this task isn't limited by RAM or IO. There should be no
benefit to multi-threading except for blocking I/O.

------
throwaway84742
For such a simple problem you should be shooting for about 25GB/s on a modern
Intel CPU. Anything less is a disgrace.

------
sciencesama
We use Elasticsearch and it is pretty fast

