

PyPy is faster than C, again: string formatting - Scriptor
http://morepypy.blogspot.com/2011/08/pypy-is-faster-than-c-again-string.html

======
old-gregg
I love PyPy, those guys are amazing. However, every time I see "faster than C"
claims, there's almost always a bit of trickery/wordplay involved.

The examples aren't comparable. The equivalent would be to have the Python
code invoke an external function which sits in a pre-compiled .so.

The bulk of the work is happening inside sprintf(), so why handicap C by not
letting the compiler see that code?

The fair comparison would be to place the source of sprintf() nearby and see
if the C compiler inlines that call and/or unrolls the loop; otherwise it's
just about packaging/linking, not really about code generation.

Edit: I see this became #1 on HN front page today. I want to take advantage of
this and say that <http://mailgun.net>, the programmable email platform, is
looking for an engineer who'd find this discussion interesting. See my
profile. And we're users of PyPy too! :)

~~~
scott_s
But the whole point of the post was to point out that PyPy can optimize in
places that the traditional C model of shared libraries cannot - or, at least,
have great difficulty doing so. This is an inherent advantage to optimizing
the instructions at runtime.

~~~
old-gregg
Shared libraries are an operating system feature and have nothing to do with
C. C just happens to be the most popular systems programming language. The
"traditional C model" is just a bunch of object files, and it's up to you and
your linker to package them any way you want: a shared library, a static
library or an executable.

I'm not a GCC expert, but Microsoft C compilers have been able to inline
cross-module code since the beginning of time. His example may actually
produce different results under MSVC with the /GL flag and static linking
enabled.

 _"...Inline a function in a module even when the function is defined in
another module..."_ [http://msdn.microsoft.com/en-
us/library/0zza0de8%28v=vs.80%2...](http://msdn.microsoft.com/en-
us/library/0zza0de8%28v=vs.80%29.aspx)

~~~
sausagefeet
Is anyone working on a JIT for object code? I'm thinking we could keep all our
.so's as they are now, since they're great for shipping updated versions of
libraries with bug fixes, but have a JIT that takes care of profiling and
inlining functions from other modules.

~~~
fanf2
LLVM is designed to be able to do this.

------
apaprocki
Since we're comparing apples to oranges anyway, how fast _could_ it be if you
really wanted to format "%d %d" on the stack 10 million times without a
function call? Just for fun:

    
    
      int main() {
          static const char* digits = "0123456789";
          int i;
          for (i = 0; i < 10000000; i++) {
              char x[44], *p = x, tmp[20];
      
              /* sign */
              int j;
              if (i < 0) { *p++ = '-'; j = -i; } else { j = i; }
      
              /* number */
              int pos = 0, spos;
              do {
                  tmp[pos++] = digits[j % 10];
                  j /= 10;
            } while (j != 0 && pos < 20); /* tmp has 20 slots */
              spos = pos;
              do { *p++ = tmp[--pos]; } while (pos > 0);
      
              /* space, sign, number again */
              *p++ = ' ';
              if (i < 0) *p++ = '-';
              do { *p++ = tmp[--spos]; } while (spos > 0);
              *p++ = '\0';
          }
      }
      
      $ gcc -O4 -o s s.c
      $ time ./s
      real    0m0.140s
      user    0m0.138s
      sys     0m0.001s

~~~
1amzave
That runs in ~0.4s on my system (Xeon E5520, 2.27GHz), but replacing the
'digits' lookup table with simple arithmetic on the ASCII values ('0' + j%10)
speeds it up to ~0.23s. Yes, L1 caches are pretty fast, but ALUs are still
faster (for integer addition anyway).

Edit: This was with GCC 4.1.2, newer versions probably optimize differently,
so who knows.

~~~
apaprocki
Interesting... I guess I should have mentioned: gcc 4.5.2 on a Xeon X5670,
2.93GHz, with a 12M cache. Changing it to ('0' + j % 10) makes no change in
overall speed for me.

~~~
1amzave
OK, I just tested with gcc 4.6.0, and unless I've screwed something up, it
looks like (at -O4) it actually optimizes these into the _exact_ same code. As
in the generated ELFs are byte-for-byte identical. Impressive.

~~~
apaprocki
Since you showed interest: using this as the loop body speeds it up even a bit
more :)

    
    
      /* temp number */
      char tmp[20];
      int pos = 0;
      int j = i > 0 ? i : -i;
      do {
        tmp[pos++] = '0' + j % 10;
        j /= 10;
      } while (j != 0 && pos < 20); /* tmp has 20 slots */
    
      /* output both numbers simultaneously */
      char x[44], *p1 = x, *p2 = x + 1 + pos + (i < 0); /* extra slot for the second '-' */
      if (i < 0) { *p1++ = '-'; *p2++ = '-'; }
      do {
        int tpos = --pos;
        *p1++ = tmp[tpos]; *p2++ = tmp[tpos];
      } while (pos > 0);
      *p1 = ' ';
      *p2 = '\0';

~~~
1amzave
Well, if you _really_ want to go apples-to-oranges and specialize as much as
possible for the sequentially increasing case we're dealing with here, you
could just do your arithmetic directly on the ASCII string itself, complete
with a little ripple-carry loop... I bet _that_'d be fast. If I'm remembering
correctly, x86 actually has dedicated instructions for ASCII arithmetic
(though they're so unoptimized in a modern microarchitecture that it'd
probably be faster to avoid them).

------
onedognight

        char x[44];
        sprintf(x, "%d %d", i, i);
    

_This is fine, except you can't even return x from this function; a fairer
comparison might be:_

    
    
         char * x = malloc(44 * sizeof(char));
     sprintf(x, "%d %d", i, i);
    

There is a standard (C99) way to do this: asprintf(3).

    
    
        char *x;
        asprintf(&x, "%d %d", i, i);
        return x;

~~~
tedunangst
There is no asprintf function in my copy of the C99 standard.

~~~
onedognight
Sorry, you are correct; I misread the man page. It is, however, in at least
glibc and Darwin/FreeBSD's libc.

------
jperras
The pypy guys continue to do amazing work. This project is one of the reasons
why I believe the python community is one of the best open source communities
out there: let's work on something incredibly difficult and challenging,
something that not long ago was considered nearly impossible, and produce
incredible results.

If you're not running pypy in production already, then you probably should
be[1].

[1]: Yes, there are some obvious exceptions.

edit: formatting.

------
_delirium
Re:

> GCC is unable to inline or unroll the sprintf call, because it sits inside
> of libc.

If I'm understanding [http://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/Other-
Builtins.h...](http://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/Other-
Builtins.html) correctly, sprintf should be handled as a built-in function,
rather than linking the libc version, unless you explicitly specify
-fno-builtin. In theory that should allow gcc to perform various optimizations;
I've seen that happen with _printf_ at least, where e.g. printf-ing a constant
string just gets compiled to _puts_.

~~~
kingkilr
Perhaps it is recognizing it, but it's not doing anything interesting with
that knowledge; on my machine GCC emits a call to "__sprintf_chk".

~~~
pja
There's nothing really stopping gcc from doing as well here, since it ought to
be able to spot that the format string never changes and roll out a custom
sprintf() that doesn't need to parse it every time. However, right now gcc
doesn't do that, so it loses to a language implementation that does.

The really sad thing is that using an ostringstream in C++ is even worse,
despite the fact that C++ has all the types available to it and doesn't need
to parse any format strings at all: not enough template metaprogramming,
clearly!

------
xd
I'm no python developer but this:

    
    
        def main():
            for i in xrange(10000000):
                "%d %d" % (i, i)
    
        main()

Doesn't seem to be copying the result anywhere, whereas the C example is
copying the result to memory... which would explain why it is slower.

~~~
xd
Oh, and does anyone else think the malloc example, in comparison to actual
garbage collection, is incredibly unfair?

~~~
kingkilr
I've long since lost track of what it means for a comparison between languages
to be fair. To put that a different way, how should we have compared them?
Should we have downloaded some GC library for C to implement it?

~~~
xd
I guess the real problem is that the example serves no real-world purpose and
can very easily be biased. See my response to scott_s below.

------
eridius
Sounds like this would be the equivalent of some theoretical

    
    
        exp = CompileSprintfFormat("%d %d");
        for (i = 0; i < 10000000; i++) {
            RunCompiledSprintf(exp, i, i);
        }
    

All I'm really getting out of this is that PyPy now compiles sprintf formats
for you and saves the results, and that there's no equivalent API in libc.

~~~
TylerE
That's not really it. This, at least in my understanding, is a fully _generic_
optimization of operations on constant strings.

~~~
eridius
From the article:

> In the case of PyPy, we specialize the assembler if we detect the left hand
> string of the modulo operator to be constant.

So it's very much an optimization targeted at the modulo operator (which is
Python's equivalent of sprintf).

~~~
kingkilr
I wouldn't say it's particularly targeted; let me show you the code that makes
this happen: [https://bitbucket.org/pypy/pypy/src/unroll-if-
alt/pypy/objsp...](https://bitbucket.org/pypy/pypy/src/unroll-if-
alt/pypy/objspace/std/formatting.py#cl-288)

~~~
pja
Yeah, the point is that by expressing the algorithm in Python, the JIT gets to
go hog wild optimising tight loops like this one.

Which is great: why do all that work writing some kind of custom sprintf
generating function when you can just let the JIT do it for you on the fly?

------
malkia
Why don't you take an example from Mike Pall's LuaJIT, relying on something
more computationally expensive like scimark - he has a Lua version, and also a
few other benchmarks (from the alioth site).

If you really want a fast sprintf(s, "%d %d", ...) - then you might as well
craft something specifically for converting decimal numbers to text.

sprintf() is a convenience function, not a performance one.

~~~
sjs
I'm sure that's one of the larger goals of PyPy: making idiomatic code that is
convenient to write also perform well. The two don't have to be mutually
exclusive.

------
pwpwp
What the example shows is that specializing string operations for known inputs
can be faster than not doing so. Surprise!

But going from this to "PyPy is faster than C" seems quite a stretch, no?

------
ZoFreX
> compiled with GCC 4.5.2 at -O4 (other optimization levels were tested, this
> produced the best performance).

I was under the impression that any number greater than 3 had no effect?

------
schiptsov
Why be so shy? Faster than Assembly language! Faster than Assembly language
with cache size alignment and padding of data structures! ^_^

------
comex
If PyPy were really smart, it would notice the unused result and take 0.00s.
:)

------
aninteger
How did they measure this? How would it compare to C#? I'm assuming they
ignore the startup time, of course.

------
afhof
This title is misleading; instead of `C` it should use `CPython`.

~~~
briancurtin
Why would it use that title? CPython is a part of some of the numbers, but the
comparison is C vs. PyPy.

