
How copying an int made my code 11 times faster - tdurden
https://medium.com/@robertgrosse/how-copying-an-int-made-my-code-11-times-faster-f76c66312e0f#.bg7ino1f9
======
Cyphase
This reminds me of this classic StackOverflow question:

 _Why is it faster to process a sorted array than an unsorted array?_

[https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array](https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array)

~~~
MooMooMilkParty
I wrote up a small script to test this out for Python (at the bottom of this
post). The results on my laptop are about 8.0 seconds for the unsorted and 7.6
seconds for the sorted version. I'm assuming that the discrepancy is much
smaller for Python because of the language's high interpreter overhead (or at
least the way I've used it here), but I would be interested to know: how
would one go about finding out what the Python interpreter is doing beneath
the surface?

Edit: After running with a wider range of parameters, the relative difference
stays roughly the same. To investigate further, I included the sort in the
second timing as a double check, and for 3276800 elements the sorted version
is still a bit faster overall.

    
    
      #!/usr/bin/env python
      import time
      import numpy as np
      
      def main(n=32768):
          arr = np.random.randint(0, 256, n)
          t0 = time.time()
          sum1 = do_loop(arr)
          t1 = time.time()
      
          arr = np.sort(arr)
          t2 = time.time()
          sum2 = do_loop(arr)
          t3 = time.time()
      
          assert sum1 == sum2
          print(" Unsorted execution time: {} seconds".format(t1-t0))
          print(" Sorted execution time:   {} seconds".format(t3-t2))
      
      def run_many(func):
          # Decorator: run func 1000 times so the timing is long enough to measure.
          def wrapper(arg):
              for _ in range(1000):
                  result = func(arg)
              return result
          return wrapper
      
      @run_many
      def do_loop(arr):
          # Sum only the elements that take the >= 128 branch.
          tot = 0
          for i in arr:
              if i >= 128:
                  tot += i
          return tot
      
      if __name__ == '__main__':
          main()
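
One place to start with the "what is the interpreter doing" question is the
standard-library dis module, which disassembles a function to CPython
bytecode. A minimal sketch using the same loop body (minus the timing
decorator):

```python
import dis

def do_loop(arr):
    # Same branchy loop as in the benchmark above.
    tot = 0
    for i in arr:
        if i >= 128:
            tot += i
    return tot

# Print the CPython bytecode for the loop; the `if` shows up as a
# COMPARE_OP followed by a conditional jump instruction.
dis.dis(do_loop)
```

The output makes it clear how much per-iteration dispatch work the
interpreter does, which is one reason the branch itself is a small fraction
of the total cost.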

~~~
Cyphase
I tried this on my machine, then tried a pure Python version; I only changed
three lines, to:

    
    
      import random
      ...
      arr = [random.randint(0, 255) for x in range(n)]
      ...
      arr = sorted(arr)
    

Here are my times:

    
    
      $ time python2.7 hackernews_13682929.py 
       Unsorted execution time: 4.33348608017 seconds
       Sorted execution time:   4.09405398369 seconds
      
      $ time python3.5 hackernews_13682929.py 
       Unsorted execution time: 4.4200146198272705 seconds
       Sorted execution time:   4.188237905502319 seconds
      
      $ time python2.7 hackernews_13682929_purepython.py
       Unsorted execution time: 0.981621026993 seconds
       Sorted execution time:   0.832424879074 seconds
      
      $ time python3.5 hackernews_13682929_purepython.py
       Unsorted execution time: 1.3005650043487549 seconds
       Sorted execution time:   1.157465934753418 seconds
      
      $ time pypy hackernews_13682929_purepython.py
       Unsorted execution time: 0.239459037781 seconds
       Sorted execution time:   0.0910339355469 seconds
    

As you can see, the pure Python version is faster than the NumPy version
(iterating a NumPy array element by element boxes each value into a Python
object, so a plain list is quicker for this kind of loop), and it also has a
larger margin between unsorted and sorted. PyPy is of course faster than
both, and has an even greater margin between unsorted and sorted (2.63x
faster).
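
For anyone who wants to reproduce this without reassembling the pieces, here
is a self-contained sketch of the pure-Python benchmark (reconstructed from
the script above plus the three changed lines; the repetition count is an
assumption, reduced so it finishes quickly):

```python
import random
import time

def do_loop(arr):
    # Sum only the elements that take the >= 128 branch.
    tot = 0
    for i in arr:
        if i >= 128:
            tot += i
    return tot

def main(n=32768, reps=100):
    arr = [random.randrange(256) for _ in range(n)]

    t0 = time.time()
    for _ in range(reps):
        sum1 = do_loop(arr)
    t1 = time.time()

    # Same data, sorted, so the branch becomes predictable.
    arr = sorted(arr)
    t2 = time.time()
    for _ in range(reps):
        sum2 = do_loop(arr)
    t3 = time.time()

    assert sum1 == sum2
    print("Unsorted execution time: {:.3f} seconds".format(t1 - t0))
    print("Sorted execution time:   {:.3f} seconds".format(t3 - t2))

if __name__ == "__main__":
    main()
```

Under PyPy the gap should be much larger than under CPython, since the JIT
compiles the loop down to machine code where branch prediction actually
matters.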

~~~
MooMooMilkParty
Good call on going pure python. To take this a bit further I made your changes
and used numba with @jit(nopython=True, cache=True), for some interesting
results. If I do include the sorting into the timing:

    
    
        Unsorted execution time: 0.2175428867340088 seconds
        Sorted execution time:   1.133354663848877 seconds
    

And if I don't:

    
    
        Unsorted execution time: 0.21171283721923828 seconds
        Sorted execution time:   0.08376479148864746 seconds

------
jstarks
This is really surprising. I thought that part of Rust's pitch was that the
explicit ownership tracking made optimizations much easier.

Is there a bug filed to fix this?

~~~
akiselev
It does make optimization easier, but until MIR landed, many of the best
optimizations weren't really possible. The problem is that a lot of type
information is lost between Rust and LLVM IR, where the compiler does the
really serious optimizations. For example, Rust can't tell LLVM about its
pointer aliasing guarantees (immutable and mutable borrows can't exist at
the same time without unsafe), so optimizations like keeping a heavily used
heap value in a register are passed over because of conservative heuristics.

Now that MIR has landed, Rust will eventually get much better optimizations
from both rustc and the LLVM optimization passes, but other things, like
non-lexical lifetimes, are a much higher priority.

~~~
aleden
> For example, Rust can't tell LLVM about its pointer aliasing guarantees

False.

[http://llvm.org/docs/LangRef.html#noalias-and-alias-scope-metadata](http://llvm.org/docs/LangRef.html#noalias-and-alias-scope-metadata)

~~~
Rusky
From what I understand, LLVM still doesn't take as much advantage of that
information as it could, given Rust input. It's too geared toward the C family
of languages. (But as sibling comment says, the problem was partially the Rust
compiler's fault.)

~~~
gens
C has a pointer aliasing keyword, "restrict".

Also: "Originally implemented for C and C++, the language-agnostic design of
LLVM has since spawned a wide variety of front ends: languages with compilers
that use LLVM include ActionScript, Ada, C#, Common Lisp, Crystal, D,
Delphi, Fortran, OpenGL Shading Language, Halide, Haskell, Java bytecode,
Julia, Lua, Objective-C, Pony, Python, R, Ruby, Rust, CUDA, Scala, and
Swift."

Most of these are nothing like C when it comes to pointers and memory layout.

~~~
Rusky
Again, from what I understand, `restrict` is not enough to convey everything
Rust knows.

Further, those other languages may be nothing like C, but they're even less
like Rust. So because of its heritage, even given the information Rust knows,
LLVM simply doesn't have the optimization passes to take full advantage of it.

~~~
gens
Ok. Like what?

------
Vexs
I think it's kinda interesting how tiny little tweaks affect code speed. I
recently discovered that in Python, if you're only going to use an import
for one or two functions, importing it locally shaves off a good bit of time
depending on the function; in my case, 0.2 seconds!

~~~
Cyphase
I don't think the function itself matters; what's going on is that if you
import it locally, the reference you're using is in the local namespace
rather than in a module's namespace, which means it takes less time to reach
the function object. I'm guessing the function is being called a fair number
of times in a loop or something similar?
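
A quick way to see the name-lookup effect is the standard-library timeit
module. The two statements below do the same work, but one resolves the
function through the math module's namespace on every call while the other
binds the name directly (math.sqrt here is just an illustrative stand-in,
not the function from the original comment):

```python
import timeit

# sqrt looked up as an attribute of the math module on every call.
via_attribute = timeit.timeit("math.sqrt(2.0)",
                              setup="import math",
                              number=1_000_000)

# sqrt bound directly into the namespace, so each call skips the
# module attribute lookup.
via_binding = timeit.timeit("sqrt(2.0)",
                            setup="from math import sqrt",
                            number=1_000_000)

print("math.sqrt: {:.3f}s  bare sqrt: {:.3f}s".format(via_attribute,
                                                      via_binding))
```

On CPython the directly bound name is typically measurably faster, which is
the same effect as the function-local import described above.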

~~~
fnord123
You could use `from some_module import some_func` to get the same effect as
some_func will be in the local namespace.

~~~
Cyphase
That's what the top-level commenter was doing.

~~~
fnord123
Ok. It wasn't clear what "importing it locally" meant. I've worked with some
people who would use that phrase to mean "copy paste it into the local file".

~~~
Cyphase
Fair point :P. I think I might have seen someone mean it like that as well in
the past.

------
Too

        like all generic interfaces in Rust, printing takes arguments by reference, 
        regardless of whether they are Copy or not. The println! macro hides this from 
        you by implicitly borrowing the arguments,
    

What's the reason for this? Seems a bit inconsistent that for functions you
must explicitly pass with &, but for print it's automatic.

~~~
dbaupp
The reasons I can think of are:

\- it'd be annoying to have to write `println!("{}", &s);`,

\- this is not an inconsistency one thinks about in practice: ownership is
important in Rust, but the compiler does all the annoying checking, so IME one
can just let it fade into the background and not think about it until the
compiler tells you there's a problem,

\- it's a "permissive" inconsistency: writing a & will still compile and give
the same output,

\- it's a syntax-level inconsistency created by the println! macro packaging
up the normal syntax, not some special rules for special functions: there's
still a literal & borrow in there (see the macro expansion),

\- historical reasons, and no one was annoyed enough (or even noticed it
enough) to change it.

~~~
Too
Hmm. If people thought it was annoying to write println!("{}", &s), they
wouldn't use Rust in the first place, because you have to do that everywhere
else for plain functions.

Being permissive makes it even worse, as there are now two inconsistent ways
to do the same thing.

Maybe I need to understand Rust macros better. I guess this all boils down
to: macros are a very bad idea, as in other programming languages.

------
domoritz
Since LLVM does not have the necessary information to do the optimizations, I
wonder whether the same problem occurs in C++ code compiled with clang.

~~~
msbarnett
It may not. It's possible that this is happening because rustc fails to mark
the reference as both readonly and unaliased, whereas clang might mark the
equivalent as both. You'd need to examine the IR from both front ends to
really figure out the source of any differences in the generated ASM.
~~~
shepmaster
Both Rust playgrounds (official[1], my take[2]) allow viewing the LLVM IR of a
Rust program.

The call to the print is

    
    
          call void @_ZN3std2io5stdio6_print17h690779b3bd8114d5E(%"core::fmt::Arguments"* noalias nocapture nonnull dereferenceable(48) %_3)
    

The entire chunk of `main` preceding that call:

    
    
          %size = alloca i64, align 8
          %_3 = alloca %"core::fmt::Arguments", align 8
          %_8 = alloca [1 x %"core::fmt::ArgumentV1"], align 8
          %0 = bitcast i64* %size to i8*
          call void @llvm.lifetime.start(i64 8, i8* %0)
          store i64 33554432, i64* %size, align 8
          %1 = bitcast %"core::fmt::Arguments"* %_3 to i8*
          call void @llvm.lifetime.start(i64 48, i8* %1)
          %2 = bitcast [1 x %"core::fmt::ArgumentV1"]* %_8 to i8*
          call void @llvm.lifetime.start(i64 16, i8* %2)
          %3 = ptrtoint i64* %size to i64
          %4 = bitcast [1 x %"core::fmt::ArgumentV1"]* %_8 to i64*
          store i64 %3, i64* %4, align 8
          %5 = getelementptr inbounds [1 x %"core::fmt::ArgumentV1"], [1 x %"core::fmt::ArgumentV1"]* %_8, i64 0, i64 0, i32 1
          %6 = bitcast i8 (%"core::fmt::Void"*, %"core::fmt::Formatter"*)** %5 to i64*
          store i64 ptrtoint (i8 (i64*, %"core::fmt::Formatter"*)* @"_ZN4core3fmt3num54_$LT$impl$u20$core..fmt..Display$u20$for$u20$usize$GT$3fmt17hb872170870cc06d9E" to i64), i64* %6, align 8
          %7 = getelementptr inbounds [1 x %"core::fmt::ArgumentV1"], [1 x %"core::fmt::ArgumentV1"]* %_8, i64 0, i64 0
          %8 = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 0, i32 0
          store %str_slice* getelementptr inbounds ([2 x %str_slice], [2 x %str_slice]* @ref.8, i64 0, i64 0), %str_slice** %8, align 8, !alias.scope !1, !noalias !4
          %9 = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 0, i32 1
          store i64 2, i64* %9, align 8, !alias.scope !1, !noalias !4
          %_6.sroa.0.0..sroa_idx.i = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 1, i32 0, i32 0
          store %"core::fmt::rt::v1::Argument"* null, %"core::fmt::rt::v1::Argument"** %_6.sroa.0.0..sroa_idx.i, align 8, !alias.scope !1, !noalias !4
          %10 = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 2, i32 0
          store %"core::fmt::ArgumentV1"* %7, %"core::fmt::ArgumentV1"** %10, align 8, !alias.scope !1, !noalias !4
          %11 = getelementptr inbounds %"core::fmt::Arguments", %"core::fmt::Arguments"* %_3, i64 0, i32 2, i32 1
          store i64 1, i64* %11, align 8, !alias.scope !1, !noalias !4
          call void @_ZN3std2io5stdio6_print17h690779b3bd8114d5E(%"core::fmt::Arguments"* noalias nocapture nonnull dereferenceable(48) %_3)
    

[1]: [https://play.rust-lang.org/](https://play.rust-lang.org/)

[2]: [http://play.integer32.com/](http://play.integer32.com/)

~~~
msbarnett
Am I reading it right? It looks to me like rustc didn't mark the immutable
borrow as readonly, so LLVM went ahead and assumed print could mutate it.

~~~
dbaupp
I'm not 100% sure, but I believe LLVM doesn't have a way to understand the
(im)mutability of pointers written into memory, like the borrow is here.

------
lowbloodsugar
This is awful. Was just reading how Rust is all shiny and special and much
better than awful, naughty C++, and now I read how the print method is magic
goo because "developers don't want to type & in front of ints". Back in my
day, bah, grumble...

~~~
steveklabnik
One of the reasons that ! is in macros is to let you know that they can do
near-arbitrary things inside of their ()s.

