Optimizations on Linear Search (2016) (acm.org)
73 points by YesThatTom2 on Oct 3, 2020 | 59 comments



Why is he discarding compsci literally in the introduction? Does he really think "big O" is all there is, presumably because those ivory tower compsci people have no clue about computers?

There are some good bits of advice in the article, too, though.

> I conducted some simple benchmarks in Go to find how much data can be processed in 200 ms. A linear search can scan 13 million elements in less than 200 ms on a three-year-old MacBook laptop, and 13 million is no small feat.

It's funny how slow languages have shifted perceptions. Scanning 13 million elements to find the biggest element among 32 bit ints in 200 ms is slow. That's 250 MB/s, or about 50 clock cycles per element. Even the most naive and portable C three-liner is almost 100x faster (there is a difference in single-threaded performance, if you assume a very generous 50 % difference, it'd still be 50x faster).

   int maximum = INT_MIN;   /* INT_MIN comes from <limits.h> */
   for (int i = 0; i < N; i++)
       maximum = elems[i] > maximum ? elems[i] : maximum;
So, when you are using C/C++, "small" for this problem is a billion elements.

Computers are super-fast. Next time you wait two seconds for some application to compute something, ask yourself: is this problem really eight-billion clock cycles hard?


> Why is he discarding compsci literally in the introduction? Does he really think "big O" is all there is, presumably because those ivory tower compsci people have no clue about computers?

Hi! I'm the author. I do have an ivory tower compsci degree. I'm poking fun at myself as much as anyone else.

> There are some good bits of advice in the article, too, though.

Thanks! "Everything Sysadmin" is my 3x/yearly column in ACM Queue. My goal is to bring a bit of operational reality to an organization with a reputation for being ivory tower. Click on the phrase "everything sysadmin" to see the other columns I've written.

> So, when you are using C/C++, "small" for this problem is a billion elements.

I know, right? "Small" has gotten so big that we have to re-think our gut reactions and recalibrate our back-of-the-envelope estimations. Just imagine how big small will be for the next generation of developers?


I just want to say that I really enjoyed this article. I was surprised to see all the negativity here on HN about it! Thank you for writing it, as it really got me thinking. I've done some of these things before, but it's easy to lose the forest for the trees when you're deep down in the middle of solving some problem.


Moore's law is dead. Small won't change nearly as fast as it used to.


>Just imagine how big small will be for the next generation of developers?

About the same, since Moore's law is dead for CPUs, and RAM doesn't seem to be getting much faster either.


Wait, are you saying that C is 50 times faster than Go? I find that very hard to believe. I could buy 3-4 times faster, especially if the C code is vectorized and the Go code is not, but 50-100x seems unrealistic IMO. Do you have any benchmarks to back up your claim?


https://ideone.com/T56tDZ I get close to 5 GB/s with this online IDE and twice that on my 6 year old laptop, which is roughly a factor of 20 to 40 faster than 250 MB/s. A factor of 100 sounds plausible with more modern RAM or larger cache size.


You only benchmarked the C code. In order to get an apples to apples comparison, you have to benchmark the Go code on the same hardware - your comment compares the runtimes of two different implementations on different hardware.


> Wait, are you saying that C is 50 times faster than Go?

Are you saying you need 50 clock cycles to compare two integers?


...no? What made you think that?


It's hard to believe because it's dead wrong. See my benchmark above.


Yeah what OP wrote is complete nonsense. Sad that this stuff bubbles to the front of HN and then gets quoted endlessly...


Go is not nearly as slow as you're making it seem to be. I don't know what the author's implementation was doing, but this code runs at a rate of 7 GiB/s/core on my laptop:

  func BenchmarkFindMax(b *testing.B) {
    ints := make([]int32, 13*1000*100)
    for i := 0; i < len(ints); i++ {
      ints[i] = rand.Int31()
    }

    for i := 0; i < b.N; i++ {
      max := int32(math.MinInt32)
      for _, x := range ints {
        if x > max {
          max = x
        }
      }
      dontOptimize = max // package-level sink so the loop isn't eliminated
    }

    b.ReportMetric(float64(b.N*len(ints)*4/1000/1000/1000), "GiB/s")
  }


  goos: darwin
  goarch: amd64
  BenchmarkFindMax-8  1530   678975 ns/op   7.00 GiB/s

EDIT: Actually my original code was reporting the metric wrong, it should have been:

  b.ReportMetric(float64(b.N)*float64(len(ints))*4/1000/1000/1000, "GiB/s")

which leads to 9-10GiB/s on my laptop


The author was running on a 2013 laptop, which had smaller cache and slower RAM than what you're probably using.

Benchmark comparisons are useless without running on equivalent hardware.


He was probably scanning elements that were larger than int32s; even on a 2013 laptop this code would run at GiB/s rates.

The point of my benchmark was that C is nowhere near 50-100x faster than Go and that the user I'm replying to shouldn't make unsubstantiated claims like that.


My claim that the trivial C code is 100x faster than the number from the article is a) well substantiated b) all there is c) perfectly clear.


> Go is not nearly as slow as you're making it seem to be

It's rather obvious I compared against the quoted figure.


Not just that, in a language with reasonable abstractions it is max(elems), maybe in some external module. His arguments about simpler / less buggy code don't apply.


Someone has to write the algorithm beneath those abstractions. It's not magic. There's a point in the abstraction call stack where the programmer needed to understand how a CPU actually works.


The same applies to his sort example: someone has to write sort. Except that now it also includes sorting.

In a language with lazy stream abstraction, the max() function can be pretty efficient.


> To meet the vendor's requirement, most games first load and display a title screen, then ask users to click a button to start the game. What users don't realize is that while they are sitting in awe of the amazing title screen, the game is finishing its preparations.

I did exactly this in my first video game. It was an N64 game so the load time requirements from Nintendo were very short and we were 1-2 seconds over the limit. Before the level loaded we already had a UI screen that showed you the map of the level you were about to enter and blinked "OK? (A)" waiting for you to be done checking the map.

So, I rendered the map with and without the prompt to the front and back display buffers. Then I installed a VBlank signal handler that just flipped the display buffers without actually re-rendering anything. The loading then started immediately without waiting for the user's OK. (We had no threading back then.) The only thing that changed when you hit (A) was that the VBlank handler would flip to the no-prompt screen for the rest of the load time.


A lot of games will put up little tips while the levels load, and it can be pretty handy. Then I got an SSD and you only have time to read about half the tip!


> Donald Knuth famously wrote that premature optimization is the root of all evil.

It’s funny how often Knuth’s actual point is left out of the re-telling of this quote. The actual point he was making was “we should not pass up our opportunities [to optimize] in that critical 3%.“ He wasn’t suggesting the avoidance of optimization, he was suggesting the avoidance of premature optimization.


It's also Knuth quoting Tony Hoare, a man who I highly recommend everyone familiarize themselves with; he's a wonderful speaker and extremely intelligent.

The context of the quote from both Hoare and Knuth is critical, and is always left out.

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

i.e., 3% of the time you should absolutely care about 'small' efficiencies. Given that most programs spend almost all of their time in a few places, this is hardly surprising, and 3% is actually a fairly sizable amount of code to optimize to the degree which the paper is discussing (most people are rarely dropping to tools like goto for performance).

And, from the same paper (Structured Programming with goto Statements):

"The conventional wisdom shared by many of today's software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by pennywise-and-pound-foolish programmers, who can't debug or maintain their "optimized" programs. In established engineering disciplines a 12 % improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering. Of course I wouldn’t bother making such optimizations on a one-shot job, but when it’s a question of preparing quality programs, I don’t want to restrict myself to tools that deny me such efficiencies."

In other words, we should not write slower code when we can help it. Small efficiencies are well worth attaining in the general case, and the fears of writing 'unreadable' code because it's fast are mostly unfounded.

I believe this has never been truer. It's easier than ever to create highly tuned code that's hidden behind a zero-cost abstraction, and it's well worth it to tune that code for simple and obvious wins - we should never strive for inefficiency, as so many seem to advocate.


The word "premature" is inside your quote.


Right- exactly! If you read the article, it ignored the word premature, and used the quote as a narrative suggestion that people should/would avoid any optimization based on this (incomplete) quote, and based on optimizing taking effort. It was using Knuth as a counter-point, pretending that he was saying not to optimize, and then countering with reasons to consider optimizing anyway.


What gets me about the way people use this quote is that it is tautological: "premature optimization is bad". Yes, it must be, or it would not be premature. But when a kid or new programmer posts on the internet about what is faster, or how to make a thing faster, why do we tell them this? (In most programming communities, it is inevitable that they will be told this.) How do the commenters know whether the optimization is premature? Why are they worried about it, when the person is just learning/curious?

Let them get good!


> How do the commenters know whether the optimization is premature?

> when the person is just learning/curious

Not that the code being bad matters in such a case, but if something is just for learning/curiosity, then it is by definition not critically important. Sometimes trying to optimize it is the terminal goal, but it's basically never a relevant instrumental goal.


A lot of software would be better if more programmers had built up some intuition about performance issues by being encouraged, rather than discouraged, when learning. A lot of kids coming out of college have absolutely no exposure to anything low level, or with regard to constant factor performance improvements.


A fun snarky article. It missed another potential issue: many modern systems use cache prefetching, where they intentionally start loading the "next" memory area. As a result, sequential memory access is often much faster than bouncing around memory.
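
Here's a minimal Go sketch of the effect (a rough micro-benchmark, not a rigorous one; sizes and timings will vary by machine): summing the same slice in order versus in a shuffled order usually shows a large gap once the data no longer fits in cache.

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    func main() {
        const n = 1 << 25 // ~32M int32s (~128 MB), well beyond L2/L3 cache
        data := make([]int32, n)
        order := rand.Perm(n) // a random visiting order

        // Sequential scan: the hardware prefetcher can stay ahead of the loads.
        start := time.Now()
        var seqSum int64
        for i := 0; i < n; i++ {
            seqSum += int64(data[i])
        }
        fmt.Println("sequential:", time.Since(start), seqSum)

        // Same work, but bouncing around memory: mostly cache misses.
        start = time.Now()
        var rndSum int64
        for _, i := range order {
            rndSum += int64(data[i])
        }
        fmt.Println("random:", time.Since(start), rndSum)
    }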

Your mileage may vary.


This is such a mashup of conflicting advice. Don’t optimize! Use SIMD! If you’re reading from disk you’re I/O bound! I am embarrassed for the ACM.


It’s also strange to advocate for using a stdlib-provided sort. It’s hard to imagine a stdlib worth its salt that provides ‘sort’ but not some kind of ‘linear search’ (and the stdlib implementation would likely use SIMD).


Well, Go's standard library has no linear search routine. But it does have both sort and binary search.
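
To make that concrete, a small sketch of what that looks like in practice (sort.Ints and sort.SearchInts are the stdlib pieces; the linearSearch helper is the hand-rolled part):

    package main

    import (
        "fmt"
        "sort"
    )

    // linearSearch is the "missing" stdlib routine: a trivial scan over unsorted data.
    func linearSearch(a []int, x int) int {
        for i, v := range a {
            if v == x {
                return i
            }
        }
        return -1
    }

    func main() {
        a := []int{42, 7, 19, 3, 23}

        // Linear search works on the slice as-is.
        fmt.Println(linearSearch(a, 19)) // 2

        // The stdlib route: sort first, then binary search.
        sort.Ints(a)
        i := sort.SearchInts(a, 19)
        fmt.Println(i, i < len(a) && a[i] == 19) // 2 true
    }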


Standard libraries are only just now starting to use SIMD at all. .NET has been working hard on SIMDifying their core libraries lately. Do any C++ implementations of the std do so? Or do they just lean on llvm/gcc to autovectorize when possible?


There's really no point providing a hand-rolled SIMD implementation for stuff like std::max_element. It's a very trivial loop that any compiler can and will vectorize.

Auto-vectorization in general "just works" for basic stuff. It's only the complex stuff where you can really benefit from hand-rolled SIMD implementations; you'll know it when you see it.


It doesn’t inspire much confidence in a candidate’s ability to write software of even moderate complexity if their reasoning for using sort instead of a 5-liner was that their 5-liner could contain bugs.


I sometimes ask "In the language of your choice write a function to find the maximum of a list of integers" as an interview question. I typically have to follow it with "Really, there's no subtle trick, just find the max...". The number of people who use a 'sort' and take the first element is definitely non-zero. It's also astonishing how many buggy implementations there are. (Well, in fairness, there is one buggy implementation which everyone who gets it wrong typically makes -- initially setting the max to 0.)

I'm not sure if this is an argument for or against coding exercises in interviews -- it was just a surprising experience for me.


You have never had anyone initialize the max to MIN_INTEGER or something like that? I have made similar mistakes, but I've had to write these functions before, so I know better.


I couldn’t remember such a thing as MIN_INTEGER, and what if someone changes the data type?

Why not assume the first element is the max and start your search from the second? Ah, damn. Now you have to check that your vector has at least one element.

It ain’t easy.


> Now you have to check that your vector has at least one element.

Which may lead the candidate to ask the interviewer “What should be the behavior on an empty vector?” Returning MIN_INT or throwing an exception may both be reasonable depending on the application, but I would consider it good signal if the candidate at least acknowledges that this is an edge case.
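
For what it's worth, a minimal Go sketch of that approach, seeding with the first element and making the empty-slice case explicit (the names are illustrative):

    package main

    import (
        "errors"
        "fmt"
    )

    // maxOf returns the largest value in xs, or an error for an empty slice,
    // so the caller decides what an empty input should mean.
    func maxOf(xs []int) (int, error) {
        if len(xs) == 0 {
            return 0, errors.New("maxOf: empty slice")
        }
        max := xs[0] // seed with the first element, not 0 or a sentinel
        for _, x := range xs[1:] {
            if x > max {
                max = x
            }
        }
        return max, nil
    }

    func main() {
        fmt.Println(maxOf([]int{-5, -2, -9})) // -2 <nil>: all-negative input works
        fmt.Println(maxOf(nil))               // 0 maxOf: empty slice
    }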


It is pretty easy; I would expect that at least some people would be able to do it. MIN_INTEGER was an example; it should be obvious you want the minimum for whatever type you are using. Standard C++ has std::numeric_limits<T>::lowest()


Standard C++ also has std::max_element [1] (and incidentally, the possible implementation looks surprisingly familiar). Not that I know much of C++.

[1] https://en.cppreference.com/w/cpp/algorithm/max_element


That returns an iterator to the max element, not the value. I'm not sure which part makes this so difficult that everyone fails on the first try.


> I sometimes ask "In the language of your choice write a function to find the maximum of a list of integers" as an interview question.

And you don't find it important at all to specify whether the code will be in the hot path or not?

Let me ask you-- real quick question here-- how much time would you say you spend each week dealing with these TPS reports? :)


Writing software of moderate complexity requires delegation. One big way to save dev+testing+debug time is to rely on specialized libraries whose priority is to be fast+correct. If profiling later reveals that sort() is the bottleneck, by all means, optimize away.


Even worse when you consider most languages have a built-in way to do linear search too, so really it's just turning a fast one-liner into a more complex, slower one-liner.
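
For instance, in newer Go (the slices package landed in the standard library in 1.21, after this thread), the contrast looks roughly like this sketch:

    package main

    import (
        "fmt"
        "slices"
        "sort"
    )

    func main() {
        xs := []int{4, 8, 15, 16, 23, 42}

        // The one-liner: a linear scan provided by the standard library.
        fmt.Println(slices.Max(xs)) // 42

        // The roundabout version: sort a copy, then take the last element.
        ys := slices.Clone(xs)
        sort.Ints(ys)
        fmt.Println(ys[len(ys)-1]) // 42
    }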


Quite a few comments here already remark on the "computer scientists leaving the room" joke at the beginning of the article.

I'd like to add another computer sciency aspect that hasn't been covered yet. If the original question is "what is the fastest algorithm to find the largest number in an unsorted array" you might just answer: "it depends on what you want to do with it afterwards."

Sure, in a job interview or in a computer science class you might want to just look at any one problem independent of anything else. In the real world, context matters though.

For example, we know that in an unsorted list (which we're dealing with here), an element can be removed by overwriting it with the last element of the list and decrementing the length of the list. So perhaps someone is actually looking at solving a more complex problem, say, processing a list of tasks from highest importance to lowest and removing each task once it has been processed. But they are new to the team, and there is a huge and complicated code base that has been created over years and that they now have to work with. And it just so happens that the actual algorithm is already implemented in the code base, except for one function:

    while list is not empty:
        task = find_most_important_task(list) # your TODO
        process(task)
        list.remove(task)
Now, this is the same question as in the job interview... or is it? Assuming that find_most_important_task is implemented as a linear search, the above algorithm is O(n^2).

However, knowing which task you're optimizing for -- that is, the algorithm as a whole -- you can obviously do better. Let's say, your list data structure gets a new flag that is true iff the list is sorted. Then you could use the following implementation for the max search:

    find_most_important_task(list):
        if list is not sorted:
            list.sort()
        return list[list.length - 1]
With this implementation, a single call to find_most_important_task(list) would be much worse in terms of O-notation than a linear search: O(n log n) vs. O(n). However, amortized over the whole algorithm, the complexity of find_most_important_task is actually only O(log n)!
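
Roughly, in Go (just a sketch mirroring the pseudocode; the type and names are made up, and removal here simply pops the last element, which keeps the slice sorted):

    package main

    import (
        "fmt"
        "sort"
    )

    // taskList mirrors the pseudocode: a plain slice plus a "sorted" flag.
    type taskList struct {
        tasks  []int // importance values; higher = more important
        sorted bool
    }

    // findMostImportantTask sorts lazily, then returns the last (largest) element.
    // Amortized over draining the whole list, the one-time O(n log n) sort
    // works out to O(log n) per call.
    func (l *taskList) findMostImportantTask() int {
        if !l.sorted {
            sort.Ints(l.tasks)
            l.sorted = true
        }
        return l.tasks[len(l.tasks)-1]
    }

    // remove pops the last element; removing the current maximum keeps the
    // slice sorted, so the flag stays true.
    func (l *taskList) remove() {
        l.tasks = l.tasks[:len(l.tasks)-1]
    }

    func main() {
        l := &taskList{tasks: []int{3, 9, 1, 7}}
        for len(l.tasks) > 0 {
            task := l.findMostImportantTask()
            fmt.Println("processing", task) // 9, 7, 3, 1
            l.remove()
        }
    }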

So, yeah, I don't really know why I wrote all this up. Probably because it is late. Certainly not because it contained anything new. It's also contrived. Oh well. Sorry for wasting anyone's time...


I understand the rhetorical dodges here, but as both a career SRE and a student of CS, I don't see anything here which invalidates the theoretical asymptotic behavior. All of the nuance added here is a complement, not a replacement, for traditional theory.

It is certainly true that the number of elements needs to be quite high before the question is theoretically interesting. The author is right to point out that L2 cache is quite large. Today's GPUs are also quite large in onboard RAM and have integer ALUs, so even millions of elements are not a problem for a basic consumer-grade video card. (Yes, I know that that sounds stupid, but GPUs and now TPUs in the datacenter are a real thing, even if they're quite unfair!)


I am really surprised to read this article on the ACM site. I have nothing against the author, but from a technical standpoint I am genuinely disappointed in the ACM for publishing it. I'm pretty sure I'll get downvoted for this, but here's my critique.

1. "Fast enough" means different things in different contexts. If you're building an HFT (high-frequency trading) system, 200 ms is an eternity. At the same time, for a simple CRUD application it's good enough. Context is everything!

2. As much as I love Go, it's not really a good benchmark, as the article suggests. The equivalent C program (without concurrency) will blow those numbers out of the water. This basically tells me that higher-level languages like Go have skewed perceptions of what is considered fast.

The article completely misses one of the most important points: you have to measure/instrument the code before deciding to "optimize" things. Without instrumentation, it's really hard to know where the program is spending most of its time. Once you have that data, then focus on optimizing the slowest parts.
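
To illustrate the kind of minimal instrumentation I mean, here's a sketch using Go's standard runtime/pprof; work() is just a placeholder for whatever you suspect is slow, and the resulting profile is inspected afterwards with "go tool pprof cpu.prof".

    package main

    import (
        "log"
        "os"
        "runtime/pprof"
    )

    func work() {
        // Placeholder for the code under investigation.
        total := 0
        for i := 0; i < 100_000_000; i++ {
            total += i % 7
        }
        _ = total
    }

    func main() {
        f, err := os.Create("cpu.prof")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Profile everything between Start and Stop, then look at the hot spots
        // with: go tool pprof cpu.prof
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()

        work()
    }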


I love this article.

>If N=16,000, then the entire dataset fits in the L1 cache of the CPU, assuming the CPU was made in this decade. This means the CPU can scan the data so fast it will make your hair flip. If N=64,000, then the data will fit in a modern L2 cache, and your hair may still do interesting things. If the computer wasn't made in this decade, I would recommend that my friend reconsider working for this company.

Great stuff.

So much of it speaks directly to how I see my own development process. It's understanding not only what you are doing, but why you are doing it. It's the difference between solving the bridge you're building and fixing the highway system to route the traffic more efficiently. It's seeing the trees, the forest, and all the birds and bats flying around.

There are a million ways to skin the cat, and the difference between ok engineers and great ones is knowing which method to use, and when the cat is good enough with the skin on.


Fun article.

> With four CPU cores, small becomes 4N, or nearly 200 million items.

Memory channels do not scale with number of cores, and indeed, peak DRAM bandwidth is often observed with less than the total number of cores (e.g., one core per CCX on Zen2).
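
A rough way to see this for yourself, as a sketch (not a rigorous benchmark; it assumes the worker count divides the slice evenly, which holds here because both are powers of two):

    package main

    import (
        "fmt"
        "runtime"
        "sync"
        "time"
    )

    func main() {
        const n = 1 << 27 // ~128M int32s (~512 MB), far larger than any cache
        data := make([]int32, n)

        for workers := 1; workers <= runtime.NumCPU(); workers *= 2 {
            results := make([]int32, workers)
            chunk := n / workers

            start := time.Now()
            var wg sync.WaitGroup
            for w := 0; w < workers; w++ {
                wg.Add(1)
                go func(idx int, part []int32) {
                    defer wg.Done()
                    m := part[0]
                    for _, x := range part {
                        if x > m {
                            m = x
                        }
                    }
                    results[idx] = m // keep the result so the loop isn't optimized away
                }(w, data[w*chunk:(w+1)*chunk])
            }
            wg.Wait()

            gbPerSec := float64(n) * 4 / 1e9 / time.Since(start).Seconds()
            fmt.Printf("%2d workers: %5.1f GB/s (max=%d)\n", workers, gbPerSec, results[0])
        }
    }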


Hm. On my 7300HQ with single-channel 2667 MHz DDR4, my hand-crafted assembly code runs through 512 megs looking for a binary needle (8 bytes long for this example) in random data (Mersenne Twister) in 59 milliseconds. The needle is, of course, at the end of the 512 megs. No SSE, on a single core, and it scales nearly perfectly.

Not sure what to make of this article. In the end, what truly matters is knowing how a CPU operates and what not to do if you want to have your code running actually fast, not just "perceived fast".

That being said, I still consider the 59ms to be slow.
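
For anyone who wants to try roughly the same experiment without writing assembly, here is a stdlib-only Go sketch using bytes.Index (not a fair comparison with hand-tuned code, and timings will differ):

    package main

    import (
        "bytes"
        "fmt"
        "math/rand"
        "time"
    )

    func main() {
        haystack := make([]byte, 512<<20) // 512 MiB of pseudo-random data
        rand.Read(haystack)

        needle := make([]byte, 8)
        copy(needle, haystack[len(haystack)-8:]) // plant the needle at the very end

        start := time.Now()
        idx := bytes.Index(haystack, needle)
        fmt.Println(idx, time.Since(start))
    }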


Needs a "(2016)", methinks.


>Anything that takes less than 200 ms is perceived to be instantaneous by the human brain

Negative, ghost rider. What is perceived as instant depends on a lot of things, most studies in the past arrived at a threshold of 100ms, but there are domains where 100ms will feel horrible (first person gaming, VR tracking, conversation audio, etc). We know the brain can see things that are visible for only 13ms:

https://news.mit.edu/2014/in-the-blink-of-an-eye-0116

When I am programming if something that doesn't require IO takes even 1ms I immediately take a look and make sure something isn't wrong.


Very true, ~200 ms is a good assumption for the conscious reaction time of an attentive human, but that doesn't mean we perceive time in 200 ms increments and can't tell the delay between "click - 2 ms delay - something happens" and "click - 200 ms delay - something happens". The latter is quite obvious.


> even 1ms I immediately take a look and make sure something isn't wrong.

Agreed. I didn't realize this when I was newer to programming, and I think it's in part because I was "thinking in milliseconds".

https://gist.github.com/jboner/2841832

latency.txt was a game changer for me, as well as starting to measure things in terms of nanoseconds.

Side note: is there an updated latency.txt?


You might also enjoy “Memory Bandwidth Napkin Math”:

https://www.forrestthewoods.com/blog/memory-bandwidth-napkin...


Thank you, much appreciated.


> Side note: is there an updated latency.txt?

It hasn't changed.



