Hacker News new | comments | show | ask | jobs | submit login
How copying an int made my code 11 times faster (medium.com)
168 points by tdurden on Feb 19, 2017 | hide | past | web | favorite | 51 comments



This reminds me of this classic StackOverflow question:

Why is it faster to process a sorted array than an unsorted array?

https://stackoverflow.com/questions/11227809/why-is-it-faste...


I wrote up a small script to test this out for python (at bottom of this post). The results on my laptop are about 8.0 seconds for unsorted and 7.6 seconds for the sorted version. I'm assuming that the discrepancy for python is much smaller due to the slow nature and high overhead of the language (or at least the way I've used it here), but I would be interested to know: how would one go about finding out what the python interpreter is doing beneath the surface?

Edit: After running with a wider range of parameters, it seems that the difference is always roughly the same order of magnitude. To investigate further, I included the sort into the second timing to double check and for 3276800 elements it's still a bit faster overall when you sort the array.

  #!/usr/bin/env python
  import time
  import numpy as np
  
  def main(n=32768):
      arr = np.random.randint(0, 256, n)
      t0 = time.time()
      sum1 = do_loop(arr)
      t1 = time.time()
  
      arr = np.sort(arr)
      t2 = time.time()
      sum2 = do_loop(arr)
      t3 = time.time()
  
      assert sum1 == sum2
      print(" Unsorted execution time: {} seconds".format(t1-t0))
      print(" Sorted execution time:   {} seconds".format(t3-t2))
  
  def run_many(func):
      def wrapper(arg):
          for t in range(1000):
             func(arg)
          return func(arg) 
      return wrapper
  
  @run_many
  def do_loop(arr):
      tot = 0
      for i in arr:
          if i >= 128:
              tot += i
      return tot
  
  if __name__ == '__main__':
      main()


I tried this on my machine, then tried a pure Python version; I only changed three lines, to:

  import random
  ...
  arr = [random.randint(0, 256) for x in range(n)]
  ...
  arr = sorted(arr)
Here are my times:

  $ time python2.7 hackernews_13682929.py 
   Unsorted execution time: 4.33348608017 seconds
   Sorted execution time:   4.09405398369 seconds
  
  $ time python3.5 hackernews_13682929.py 
   Unsorted execution time: 4.4200146198272705 seconds
   Sorted execution time:   4.188237905502319 seconds
  
  $ time python2.7 hackernews_13682929_purepython.py
   Unsorted execution time: 0.981621026993 seconds
   Sorted execution time:   0.832424879074 seconds
  
  $ time python3.5 hackernews_13682929_purepython.py
   Unsorted execution time: 1.3005650043487549 seconds
   Sorted execution time:   1.157465934753418 seconds
  
  $ time pypy hackernews_13682929_purepython.py
   Unsorted execution time: 0.239459037781 seconds
   Sorted execution time:   0.0910339355469 seconds
As you can see, the pure Python version is faster than the Numpy version, and also has a larger margin between unsorted and sorted. PyPy is of course faster than both, and also has an even greater margin between unsorted and sorted (2.63x faster).


Good call on going pure python. To take this a bit further I made your changes and used numba with @jit(nopython=True, cache=True), for some interesting results. If I do include the sorting into the timing:

    Unsorted execution time: 0.2175428867340088 seconds
    Sorted execution time:   1.133354663848877 seconds
And if I don't:

    Unsorted execution time: 0.21171283721923828 seconds
    Sorted execution time:   0.08376479148864746 seconds


Using python for micro benchmarks usually doesnt work. The mere fact you use python says you dont care about performance, but convenience.

Just for fun add "randomvariable = 1"(you know, 1 one cycle machine op) in a tight loop and watch tenths of a second added to your "benchmark".


I'm not quite sure what you mean by "using python for micro benchmarks usually doesn't work". To the extent that "microbenchmarking" is useful at all, then sure it works. There are just different considerations from other languages, and it depends on which Python implementation and libraries you're using.

Also, while I'll grant you that using Python implies that convenience, programmer time, and/or development speed is a higher priority than performance, that doesn't at all mean that people who use Python "don't care about performance".


How is this article related to branch prediction? I don't see any connection here.


I bet you didnt read the answer for that :)


I read it a while ago, and going on what I remember, I still can't see the connection. Would you mind explaining?


I don't think it's related to branch prediction in particular, just spooky action-at-a-distance where a change makes seemingly unrelated code much slower or faster.


What makes you say this? The author provides an extremely persuasive case, with perf timings. Additionally updated it with compiler enhancements from GCC and Intel that remove the branch mispredictions entirely and do perform as predicted.


I think you misunderstood me. I agree that the StackOverflow post has to do with branch prediction. I just meant that I don't think that's why the earlier poster thinks it's parallel to the situation described in the article.


Ah, my mistake. I thought the OP a bit further up didn't understand the relation of the stack overflow answer to the array question.


This is really surprising. I thought that part of Rust's pitch was that the explicit ownership tracking made optimizations much easier.

Is there a bug filed to fix this?


It does make optimization easier but until MIR landed, many of the best optimizations weren't really possible. The problem is that a lot of type information is lost between Rust and LLVM IR, where the compiler does the really serious optimizations. For example, Rust can't tell LLVM about its pointer aliasing guarantees (immutable and mutable borrows can't be done at the same time without unsafe) so a lot of optimizations like storing a highly used value from the heap in a register are passed over because of conservative heuristics.

Now that MIR has landed, Rust will eventually get much better optimizations from both rustc and the llvm optimization passes but other things are a much higher priority like non-lexical scoping.


> For example, Rust can't tell LLVM about its pointer aliasing guarantees

False.

http://llvm.org/docs/LangRef.html#noalias-and-alias-scope-me...


From what I understand, LLVM still doesn't take as much advantage of that information as it could, given Rust input. It's too geared toward the C family of languages. (But as sibling comment says, the problem was partially the Rust compiler's fault.)


C has a pointer aliasing keyword, "restrict".

Also: "Originally implemented for C and C++, the language-agnostic design of LLVM has since spawned a wide variety of front ends: languages with compilers that use LLVM include ActionScript, Ada, C#,[4][5][6] Common Lisp, Crystal, D, Delphi, Fortran, OpenGL Shading Language, Halide, Haskell, Java bytecode, Julia, Lua, Objective-C, Pony,[7] Python, R, Ruby, Rust, CUDA, Scala,[8] and Swift."

Most of these are nothing like C when it comes to pointers and memory layout.


Again, from what I understand, `restrict` is not enough to convey everything Rust knows.

Further, those other languages may be nothing like C, but they're even less like Rust. So because of its heritage, even given the information Rust knows, LLVM simply doesn't have the optimization passes to take full advantage of it.


Ok. Like what ?


The point was that Rust _can_ (more easily) tell LLVM now that MIR has landed.


It's not only a MIR issue, it's also an issue of how many optimizations should be allowed when you also need to be able to make some assumptions about your data in unsafe blocks. For example if you simply added the obvious attributes everywhere, interior mutability (Cell, RefCell, UnsafeCell) would most certainly break. The exact rules around unsafe rules are still being discussed (see https://github.com/nikomatsakis/rust-memory-model), so until that is done some of these optimizations are very unlikely to be implemented because they can make a lot of unsafe code illegal.


Note that the compiler already adds "obvious" attributes to function calls, being careful to not emit them for things that contain an `UnsafeCell`. This is why that type exists, as the building block for interior mutability that the compiler understand and can optimise around.


I guess having taken a compilers class a decade ago has given me a Dunning-Kruger sense of optimism but the explanation sounds kind of like bullshit to me. So what if LLVM doesn't have the type information? There's no type information at all in the machine codeā€¦ optimize stuff yourself before you hand it to LLVM?


LLVM is literally designed as an production quality optimisation/code-generation backend, so it's seems perfectly reasonable for projects to rely on LLVM's optimisations without doing their own, especially when there are other things people can work on than the large effort to optimise some programs slightly better (and even fewer programs a lot better, but these are often microbenchmarks, or tight loops like the OP that can be diagnosed with a profiler relatively easily and rewritten) by doing language-specific optimisations. It's definitely a good long-term goal, but it requires a lot of infrastructure, and ends up requiring duplicating a lot of the optimisations and analysis already in LLVM (e.g. inlining is a critical transformation for enabling other optimisations) before one can get great results from the custom optimisations.

In any case, compiler optimisations are usually easy to implement on specific styles of internal representations, and this is MIR for Rust, which was only introduced recently, after requiring a very large refactoring of the compiler's internal pipeline.


As I understand it, this is partially the plan. Historically it has just been waiting for a compiler refactor to MIR (which is now complete). I'm not sure why more optimization RFCs haven't been created and prioritized.


I think it's kinda interesting what tiny little tweaks will affect code speed- I recently discovered that in python, if you're only going to use an import for one or two functions, importing it locally shaves off a good bit of time depending on the function, in my case, .2 seconds!


I don't think the function itself matters; what's going on is that if you import it locally, the reference you're using is in the local namespace versus being in a module's namespace, which means it takes less time to get to the function object. I'm guessing the function is being called a fair number of times in a loop or some such similar manner?


You could use `from some_module import some_func` to get the same effect as some_func will be in the local namespace.


That's what the top-level commenter was doing.


Ok. It wasn't clear what "importing it locally" meant. I've worked with some people who would use that phrase to mean "copy paste it into the local file".


Fair point :P. I think I might have seen someone mean it like that as well in the past.


And a common stdlib trick is to bind these via function arguments e.g.

    def foo(a, b, thing=util.thing):
        ...


I agree, seems likely. gp can use dis.dis() to confirm this...


Python is notable for how it harasses the filesystem. If your python program is running off NFS or parallel filesystems (e.g. gpfs, panasas, lustre, etc) then you may get a lot more performance from avoiding imports. Especially if pkg_resource is being used.


    like all generic interfaces in Rust, printing takes arguments by reference, 
    regardless of whether they are Copy or not. The println! macro hides this from 
    you by implicitly borrowing the arguments,
What's the reason for this? Seems a bit inconsistent that for functions you must explicitly pass with &, but for print it's automatic.


The reason I can think of are:

- it'd be annoying to have to write `println!("{}", &s);`,

- this is not an inconsistency one thinks about in practice: ownership is important in Rust, but the compiler does all the annoying checking, so IME one can just let it fade into the background and not think about it until the compiler tells you there's a problem,

- it's an "permissive" inconsistency: writing a & will still compile and give the same output,

- it's a syntax-level inconsistency created by the println! macro packaging up the normal syntax, not some special rules for special functions: there's still a literal & borrow in there (see the macro expansion),

- historical reasons, and no-one was annoyed enough (or even noticed it enough) to change it.


Hmm. If people would think it was annoying to println!("{}", &s) they wouldnt use rust in the first place because you have to do that everywhere else for plain functions.

Beeing permissive makes it even worse as there are now 2 inconsistent ways to do the same thing.

Maybe I need to understand rust macros better. I guess this all boils down to: macros are a very bad idea, as in other programming languages.


And if print can already have a special treatment, it should surely be the one where the value is copied if can fit the basic type size, that is, at least for int like and float like stuff not have automatic pass by reference but automatic pass by value to it?


The problem with this is print is a macro and that macros are expanded before the type checker. This makes it impossible to know if something can be copied or not, because this is type information.


I believe it could be a solvable problem? Imagine that the macro expands every parameter x to some language construct f( x ) where f( x ) has access to the type of x and the type of its result is different depending of the type of x, returning a value for for int and float-like variables, and the reference for others.


Since LLVM does not have the necessary information to do the optimizations, I wonder whether the same problem occurs in C++ code compiled with clang.


It depends on the system. First, the int would be by value anyway, so it wouldn't matter. But assume you handed it a pointer instead.

Some are properly marked readonly in declarations, some are not[1]

But this is a common problem with unmarked logging. The compiler will infer attributes all the way through things it has bitcode for. (IE it can determine it to be readonly, readnone, whatever, as long as it can see calls to either functions marked readonly/readnone, or determine the called functions to be readonly/readnone)

But, usually, if you don't give it the whole program, it can't infer that your logging function does not modify its arguments.

[1] Fun fact: glibc allows printf handler registration: https://www.gnu.org/software/libc/manual/html_node/Customizi...

So sadly, it's printf is not readonly, because it can do anything it wants in these callbacks.

(nobody uses the printf registration, though, so it may be time to just add a compiler option required to make it work, and not screw everyone else using printf over)


In this particular case, you'd likely be using printf or cout, both of which perform a copy of integer parameters.

So you wouldn't see this there.


It may not. It's possible that this may be happening because rustc is failing to mark the reference as both readonly and unaliased, whereas clang might mark the equivalent as both. You'd need to examine the IR from both front ends to really figure out the source of any differences in the generated ASM


Both Rust playgrounds (official[1], my take[2]) allow viewing the LLVM IR of a Rust program.

The call to the print is

      call void @_ZN3std2io5stdio6_print17h690779b3bd8114d5E(%"core::fmt::Arguments"* noalias nocapture nonnull dereferenceable(48) %_3)
The entire chunk of `main` preceding that call:

      %size = alloca i64, align 8
      %_3 = alloca %"core::fmt::Arguments", align 8
      %_8 = alloca [1 x %"core::fmt::ArgumentV1"], align 8
      %0 = bitcast i64* %size to i8*
      call void @llvm.lifetime.start(i64 8, i8* %0)
      store i64 33554432, i64* %size, align 8
      %1 = bitcast %"core::fmt::Arguments"* %_3 to i8*
      call void @llvm.lifetime.start(i64 48, i8* %1)
      %2 = bitcast [1 x %"core::fmt::ArgumentV1"]* %_8 to i8*
      call void @llvm.lifetime.start(i64 16, i8* %2)
      %3 = ptrtoint i64* %size to i64
      %4 = bitcast [1 x %"core::fmt::ArgumentV1"]* %_8 to i64*
      store i64 %3, i64* %4, align 8
      %5 = getelementptr inbounds [1 x %"core::fmt::ArgumentV1"], [1 x %"core::fmt::ArgumentV1"]* %_8, i64 0, i64 0, i32 1
      %6 = bitcast i8 (%"core::fmt::Void"*, %"core::fmt::Formatter"*)** %5 to i64*
      store i64 ptrtoint (i8 (i64*, %"core::fmt::Formatter"*)* @"_ZN4core3fmt3num54_$LT$impl$u20$core..fmt..Display$u20$for$u20$usize$GT$3fmt17hb872170870cc06d9E" to i64), i64* %6, align 8
      %7 = getelementptr inbounds [1 x %"core::fmt::ArgumentV1"], [1 x %"core::fmt::ArgumentV1"]* %_8, i64 0, i64 0
      %8 = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 0, i32 0
      store %str_slice* getelementptr inbounds ([2 x %str_slice], [2 x %str_slice]* @ref.8, i64 0, i64 0), %str_slice** %8, align 8, !alias.scope !1, !noalias !4
      %9 = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 0, i32 1
      store i64 2, i64* %9, align 8, !alias.scope !1, !noalias !4
      %_6.sroa.0.0..sroa_idx.i = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 1, i32 0, i32 0
      store %"core::fmt::rt::v1::Argument"* null, %"core::fmt::rt::v1::Argument"** %_6.sroa.0.0..sroa_idx.i, align 8, !alias.scope !1, !noalias !4
      %10 = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 2, i32 0
      store %"core::fmt::ArgumentV1"* %7, %"core::fmt::ArgumentV1"** %10, align 8, !alias.scope !1, !noalias !4
      %11 = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 2, i32 1
      store i64 1, i64* %11, align 8, !alias.scope !1, !noalias !4
      call void @_ZN3std2io5stdio6_print17h690779b3bd8114d5E(%"core::fmt::Arguments"* noalias nocapture nonnull dereferenceable(48) %_3)
[1]: https://play.rust-lang.org/

[2]: http://play.integer32.com/


Am I reading it right? It looks to me like rustc didn't mark the immutable borrow as readonly, so LLVM went ahead and assumed print could mutate it.


I'm not 100% sure, but I believe LLVM doesn't have a way to understand the (im)mutability of pointers rewritten into memory, like the borrow is here.


If you use printf and cout in the same program then yes you can have dramatic performance problems with cout unless you use `std::ios_base::sync_with_stdio(false);`

It's not the same issue but it's the same area of stupid stuff you have to find out about and roll your eyes.


This is awful. Was just reading how Rust is all shiny and special and much better than awful, naughty C++, and now I read how the print method is magic goo because "developers don't want to type & in front of ints". Back in my day, bah, grumble...


One of the reasons that ! is in macros is to let you know that they can do near-arbitrary things inside of their ()s.




Applications are open for YC Winter 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: