
10 thousand times faster Swift - coldcode
https://medium.com/@icex33/10-thousand-times-faster-swift-737b1accd973#.lk21x8xty
======
gilgoomesh
Bad news: the optimizer is moving your functions outside the loop.

    
    
        for _ in 0..<iterations {
            let result = flatuseStruct(outputData)
            assert(result == 8644311667)
            total = total + UInt64(result)
        }
    

Looking at the assembly... the call to `flatuseStruct` is moved _outside_ the
loop in Release builds. You're only measuring 1 thousand iterations of
`flatuseStruct`, not 1 million.

Your red flag should have been this:

> One million times decoding of a small object graph took 0.35ms

That's literally impossible. That's 2.8 billion iterations per second. A
single function call generally takes around 2 nanoseconds, so you can't even
do 1 billion calls per second, let alone 2.8 billion.

~~~
mzaks
I updated the run bench function

[https://gist.github.com/mzaks/e3a2dc7ccdfc2397bc26c55eb6dc8a...](https://gist.github.com/mzaks/e3a2dc7ccdfc2397bc26c55eb6dc8ac3)

the output is now:

    
    
      Eager run
      =================================
      1557 ms encode
      264 ms decode
      34 ms use
      206 ms dealloc
      504 ms decode+use+dealloc
      0,38 ms direct
      0,32 ms using struct
      =================================
      Total counter1 is 8644311667000000
      Total counter2 is 8644311667000000
      Total counter3 is 8644311667000000
      Encoded size is 315 bytes, should be 344 if not using unique strings
      =================================
    
    

As you can see all three counters are equal.

~~~
toth
The function call is not being optimized out, it's being hoisted outside the
loop. I.e., it is as if the code was written as:

    
    
        let result = flatuseStruct(outputData)
        for _ in 0..<iterations {
            assert(result == 8644311667)
            total = total + UInt64(result)
        }
    

The counter will still be correct, but you are not measuring what you think
you are measuring.

~~~
mzaks
This makes sense!

Changed the iteration to:

    
    
       for i in 0..<iterations {
         let result = flatuseStruct(outputData, start:i)
         assert(result == 8644311666 + Int(i))
         total2 = total2 + UInt64(result)
       }
    

now the result is around 43 ms

Thanks for pointing it out. Have to check if there are some other things
involved, but that might be it.

------
phpnode
This is happening because of compiler optimisation technique called Loop
Invariant Code Motion [0], which means that the author is not measuring what
he thinks he's measuring (and this should be obvious from the numbers really),
so the result is meaningless.

0\. [https://en.wikipedia.org/wiki/Loop-invariant_code_motion](https://en.wikipedia.org/wiki/Loop-invariant_code_motion)
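For readers unfamiliar with the effect, here is a minimal sketch (hypothetical code, not the article's benchmark) of an invariant call the optimizer may hoist, and of how feeding the loop index into the call defeats the hoist:

```swift
// Hypothetical example, not the author's benchmark code.
// `work` is cheap and pure, so when its argument is loop-invariant the
// optimizer may hoist the call out of the loop entirely.
func work(_ x: Int) -> Int { return x &* 2 &+ 1 }

let input = 21

// Invariant call: the compiler may evaluate `work(input)` once and
// reuse the result, so the loop no longer measures the call at all.
var total = 0
for _ in 0..<1_000_000 {
    total &+= work(input)
}

// Feeding the loop index into the call makes every iteration distinct,
// so the call cannot be hoisted and really executes each time.
var total2 = 0
for i in 0..<1_000_000 {
    total2 &+= work(input &+ i)
}
```

This is the same trick mzaks applies elsewhere in the thread by passing `start:i` into `flatuseStruct`.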

------
masklinn
So

* Swift's boolean -> integer conversion is oddly slow (should probably be reported)

* allocations are expensive (duh)

* Converting arbitrary binary data to a string only looks simple; the cost is obvious coming from C or C++ but possibly less so coming from higher-level languages:

> String conversion. If I use byte array to string conversion I move from
> 0.35ms to 1774.73ms. And if I do what the test needs to do (get the length
> of the string “s.utf8.count”), I am at 2737.4ms. Which is an additional
> second spend doing factually nothing.

Except it's not doing nothing, it has to

* allocate a buffer for the string (possibly multiple times, depending on how reservation works by default) (and specifically for Swift there's the potential additional issue of NSString bridging)

* validate that the input is decodable and possibly transcode to whatever the internal encoding is if it's not UTF-8

* iterate the string's codepoints and sum the number of UTF-8 bytes necessary to encode each codepoint

That's a shit-ton of work compared to doing literally nothing if you just
check the number of bytes in the original array.
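To make the asymmetry concrete, a small sketch (illustrative values, not from the benchmark): the byte array already knows its length, while the `String` round-trip forces allocation, UTF-8 validation, and a second walk over the contents:

```swift
// Illustrative example; the input bytes are arbitrary valid UTF-8.
let bytes: [UInt8] = Array("héllo".utf8)

// Cheap: an Array stores its count, so this is O(1) with no allocation.
let byteCount = bytes.count

// Expensive: allocate a String, validate the bytes as UTF-8, then walk
// the resulting string again to count its UTF-8 code units.
let stringCount = String(decoding: bytes, as: UTF8.self).utf8.count

// For valid UTF-8 input both counts agree, so the String round-trip
// buys nothing here.
```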

------
openasocket
I might be misreading this, but that number doesn't seem possible. He says he
can do 1 million decodings in 0.35ms, but that means each decoding is done in
less than a third of a _nanosecond_ , which sounds unreasonable. I'm not
familiar with FlatBuffers, but surely there needs to be some sort of
validation step for the data, right?

~~~
stefs
as said above: this is probably a micro-benchmarking compiler-optimization
mistake. i.e.: don't ignore your result or the compiler will probably
optimize it away.

------
tempodox
> ...Swift being as fast or even faster than C...

Rather unlikely. To get “faster than C” you need to hand-code in assembler,
and know your target CPU really well to outsmart the compiler.

One advantage of C is that it's relatively easy to see what the CPU does when
you look at the source. In Swift, this is no longer the case. So, unless you
know Swift really well, being “as fast as C” doesn't come easily. One thing
that helps you here is looking at your compiler's assembly output. Sadly, with
Swift this is not as convenient in Xcode as it is with (Objective-)C.

~~~
protomyth
> To get “faster than C” you need to hand-code in assembler, and know your
> target CPU really well to outsmart the compiler.

Ada compilers can beat C compilers, and languages with some restrictions such
as Fortran can beat C.

~~~
AlisdairO
'restrict' is part of the C specification these days, so Fortran has no
theoretical performance advantage over C. Of course, there is a body of code
in Fortran built up with restrict-semantics applied by default, so there may
well be real-world advantages.

------
protomyth
"We saw that allocating memory and retain release calls where dominating when
profiling. So why not just do the same bare bone thing that C does. Write a
bunch of functions which read data from a byte array without allocating any
objects."

Good, but this can become a problem if you are not properly clearing the
buffer and suddenly start leaking data. OpenSSL is the current big example.
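A toy illustration of that failure mode (hypothetical buffer reuse, nothing to do with FlatBuffersSwift itself): a shorter message written into a reused, uncleared buffer leaves the tail of the previous message readable:

```swift
// Hypothetical scratch buffer reused across messages without clearing.
var scratch = [UInt8](repeating: 0, count: 8)

// First message fills the whole buffer.
let secret: [UInt8] = [1, 2, 3, 4, 5, 6, 7, 8]
for (i, b) in secret.enumerated() { scratch[i] = b }

// A second, shorter message overwrites only a prefix...
let short: [UInt8] = [9, 9]
for (i, b) in short.enumerated() { scratch[i] = b }

// ...so anything that reads past the new message's length sees stale
// bytes from the previous one: [9, 9, 3, 4, 5, 6, 7, 8].
let leaked = scratch

// Zeroing the buffer between uses avoids the leak.
for i in scratch.indices { scratch[i] = 0 }
```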

------
jdright
Doing benchmarks without knowing what the code is doing, without
understanding what is happening, and then talking about "faster than C" only
shows naivety.

Hope you will use this mistake to learn how things work and to investigate
more in detail before announcing misleading and erroneous information.

I can understand the desire to beat C, and that this comes from detaching
software from hardware, obscuring the basis of software and turning it into a
kind of magic driven by faith.

~~~
Tloewald
The writer disbelieved their own results, got others to double and triple
check their work, and when the correct explanation emerged, published a
correction. I think you're being overly harsh (and, in the end, the results
were very close to C).

~~~
archgoon
> The writer disbelieved their own results

No they didn't.

"To be honest it is hard for me to say, how exactly it happened that we got 70
times faster than C, _but it is a measurable fact._ "

You also don't publish a headline "10,000 times faster Swift" if you don't
believe your own results.

At best, this is click-bait.

~~~
mzaks
The good thing about any narrative, it resonates with different people on
different levels.

The blog post is indeed titled "10,000 times faster Swift". I thought it would
be a catchy title, even though 6 seconds to 0.35 ms is not a factor of 10,000.

I thought about renaming the title to "500 times faster Swift", which would be
a rather more accurate reflection of the current findings, but then, what the
heck. It's a blog post. I didn't publish a scientific paper. I just reflected
on my recent work.

The main points of the blog post were anyway about how it is possible to make
low-level optimisations that make Swift programs faster. And as a matter of
fact, loop-invariant code motion was a valid technique to get the same result,
the result being the sum of the payload content. The compiler was smarter than
me. It gave me the same result doing 250 times less work. I find it impressive.

I must be honest, I am not fluent in assembly; this is why I could not figure
it out by myself.

Was I suspicious? Absolutely!!! But the facts were in my face.

Shouldn't I publish an article where I am not sure why I got what I got? If I
hadn't published the article, I would not have figured out the truth and
wouldn't have learned from this experience.

And after all, this post is about performance pitfalls in the Swift language.
The comparison with C was almost accidental. I would have compared it with C++
if I had a Windows machine, as the benchmark for the C++ project has Windows-
specific code. I also consulted with the author of flatcc, who is much more
relaxed about my blog post than you are :)

This blog post is about learning something. I learned something before I wrote
this post, I shared it, and now I have learned even more.

You should try it yourself.

Maybe not as satisfying as criticising, but it also has its moments.

------
askyourmother
Use C or C++ instead?

~~~
b34r
RTFA dude. The end product was _faster_ than C.

~~~
tempodox
> _faster_ than C

How do you know? There is no data on the respective machines and environments.
A number from a tweet and a number from a blog post do not make such data.

~~~
mzaks
Performance test run on Travis CI in a virtual machine: [https://travis-ci.org/mzaks/FlatBuffersSwift](https://travis-ci.org/mzaks/FlatBuffersSwift)

function called for decode+use+dealloc
[https://github.com/mzaks/FlatBuffersSwift/blob/master/FlatBu...](https://github.com/mzaks/FlatBuffersSwift/blob/master/FlatBuffersPerformanceTestDesktop/flatbench.swift#L55)

function called for direct:
[https://github.com/mzaks/FlatBuffersSwift/blob/master/FlatBu...](https://github.com/mzaks/FlatBuffersSwift/blob/master/FlatBuffersPerformanceTestDesktop/flatbench.swift#L112)

function called for using struct:
[https://github.com/mzaks/FlatBuffersSwift/blob/master/FlatBu...](https://github.com/mzaks/FlatBuffersSwift/blob/master/FlatBuffersPerformanceTestDesktop/flatbench.swift#L144)

Everything is on Github, you are welcome to try it out on your own machine.

