
Scalene: a high-performance, high-precision CPU and memory profiler for Python - matt_d
https://github.com/emeryberger/scalene
======
bawr_hszm
Sadly, relying on Python's signal handling is not enough to get robust
profiling information once your code spends a significant chunk of its time
outside of simple Python calls. This is because signals don't get
delivered to the Python level until the interpreter comes back from C land,
and it's possible to get stuck in C land even with pure Python code.

To wit:

    
    
      $ python -m scalene trace.py
      f1 1.3852 312499987500000
      f2 1.2420 2812499962500000
      f3 1.5018 1.5
      trace.py: % of CPU time =  33.66% out of   4.13s.
      Line  | CPU %    |  [trace.py]
         1  |          | #!/usr/bin/env python
         2  |          | 
         3  |          | import time
         4  |          | 
         5  |          | def timed(f):
         6  |          |     def f_timed(*args, **kwargs):
         7  |          |         t = time.time()
         8  |          |         r = f(*args, **kwargs)
         9  |          |         t = time.time() - t
        10  |          |         print('%s %.4f %s' % (f.__name__, t, r))
        11  |          |     return f_timed
        12  |          | 
        13  |          | @timed
        14  |          | def f1(n):
        15  |          |     s = 0
        16  |  17.99%  |     for i in range(n):
        17  |  81.29%  |         s += i
        18  |          |     return s
        19  |          | 
        20  |          | @timed
        21  |          | def f2(n):
        22  |   0.72%  |     return sum(range(n))
        23  |          | 
        24  |          | @timed
        25  |          | def f3(t):
        26  |          |     time.sleep(t)
        27  |          |     return t
        28  |          | 
        29  |          | if __name__ == '__main__':
        30  |          |     f1(25_000_000)
        31  |          |     f2(75_000_000)
        32  |          |     f3(1.5)
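
For anyone who wants to see the effect in isolation, here's a minimal
standalone sketch (nothing to do with Scalene itself): a repeating 10 ms
timer should fire well over a hundred times during the sum() call below,
but the Python-level handler only gets to run once the C call returns, so
the pending deliveries collapse into a single call.

    
    
      import signal
      import time
      
      ticks = []
      signal.signal(signal.SIGALRM, lambda sig, frame: ticks.append(time.time()))
      signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)  # SIGALRM every 10 ms (POSIX only)
      
      t = time.time()
      sum(range(75_000_000))  # one long C-level call; no bytecode runs meanwhile
      elapsed = time.time() - t
      
      signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
      # Expect ~1 handler call, not ~elapsed/0.01: repeated deliveries of the
      # same signal collapse while the interpreter is away in C land.
      print('%.2fs elapsed, %d handler calls' % (elapsed, len(ticks)))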

~~~
emeryberger
Good observation. My personal POV is that the best way to optimize your Python
code is to use native code (whether as libraries or through pure Python code
that is essentially a thin wrapper over C) rather than living in the
interpreter. I want to see the parts of the program that are spending their
time in the interpreter.

In short, a profiler that tells me that a program is spending a lot of time in
C is not generally providing me particularly actionable information.

(In any event, the top-line report is that the Python part of the program only
accounts for 33.66% of the execution time, which looks just about right.)

~~~
bawr_hszm
I can understand this approach, but I fundamentally disagree with it - the
first duty of a profiling tool is _not to mislead the user_.

In this example, the program spends a third of its time just sleeping /
blocked, a third on CPU but at the C level, and the remainder evaluating
Python loops. Unless you already know how the profiler is implemented, that
33.66% is not easy to interpret, and the docs don't say what _exactly_ is
being profiled. Specifically, these samples aren't a % of CPU time; they're
a % of real time during which a Python-level interrupt could actually be
delivered, and even that isn't a great characterization. I think most users
would still expect line 22 to get traced properly.

I very much _do_ want to know about C time, too, because it's very much
actionable for most of the optimizations I end up making in production
systems.

That said, I don't think this line of discussion is super productive for
either of us; we seem to have different goals in mind, which is fine. ;)

So I'll close by saying that I was impressed by your LD_PRELOAD hacks for
memory profiling, which isn't an approach that I've ever seen in other Python
profilers.
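
(For readers who haven't seen the trick: LD_PRELOAD makes the dynamic linker
load a chosen shared library ahead of libc, so its malloc/free definitions
shadow the real ones and can record every allocation. The interposer itself
has to be native code; a hypothetical Python-side launcher might look like
the sketch below, where libsampler.so is a made-up name.)

    
    
      import os
      import sys
      
      # Hypothetical launcher: re-exec the target script with an
      # allocation-interposing library preloaded, so the process's
      # malloc/free calls route through it first. 'libsampler.so' is
      # a placeholder, not a real library.
      env = dict(os.environ)
      env['LD_PRELOAD'] = os.path.abspath('libsampler.so')
      # (On macOS the equivalent variable is DYLD_INSERT_LIBRARIES.)
      os.execvpe(sys.executable, [sys.executable] + sys.argv[1:], env)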

~~~
emeryberger
Re: being impressed by the memory profiling hacks, thanks! :)

And you persuaded me. New version to be uploaded momentarily that attributes
time to line 22.

    
    
      % python3 -m scalene sample-sample.py                                        
      f1 1.8081 312499987500000
      f2 1.7910 2812499962500000
      f3 1.5034 1.5
      sample-sample.py: % of CPU time =  70.53% out of   5.10s.
        Line  | CPU %    |  [sample-sample.py]
           1  |          | #!/usr/bin/env python
           2  |          |  
           3  |          | import time
           4  |          |      
           5  |          | def timed(f):
           6  |          |     def f_timed(*args, **kwargs):
           7  |          |         t = time.time()
           8  |          |         r = f(*args, **kwargs)
           9  |   0.32%  |         t = time.time() - t
          10  |          |         print('%s %.4f %s' % (f.__name__, t, r))
          11  |          |     return f_timed
          12  |          | 
          13  |          | @timed
          14  |          | def f1(n):
          15  |          |     s = 0
          16  |   8.58%  |     for i in range(n):
          17  |  41.34%  |         s += i
          18  |          |     return s
          19  |          | 
          20  |          | @timed
          21  |          | def f2(n):
          22  |  49.76%  |     return sum(range(n))
          23  |          | 
          24  |          | @timed
          25  |          | def f3(t):
          26  |          |     time.sleep(t)
          27  |          |     return t
          28  |          | 
          29  |          | if __name__ == '__main__':
          30  |          |     f1(25000000)
          31  |          |     f2(75000000)
          32  |          |     f3(1.5)

~~~
emeryberger
This discussion inspired me to refine Scalene's output further. Scalene now
separates out CPU time spent in Python from time spent in C (that is, not
executing in the interpreter). I think this is the best of both worlds: all
CPU time is accounted for, and it's easy for users to see what code is running
where. Thanks!

    
    
      sample-sample.py: % of CPU time =  99.58% out of   3.34s.
               | CPU %    | CPU %    | 
        Line   | (Python) | (C)      | [sample-sample.py]
      --------------------------------------------------------------------------------
           1   |          |          | #!/usr/bin/env python
           2   |          |          |  
           3   |          |          | import time
           4   |          |          |      
           5   |          |          | def timed(f):
           6   |          |          |     def f_timed(*args, **kwargs):
           7   |          |          |         t = time.time()
           8   |          |          |         r = f(*args, **kwargs)
           9   |          |          |         t = time.time() - t
          10   |          |          |         print('%s %.4f %s' % (f.__name__, t, r))
          11   |          |          |     return f_timed
          12   |          |          | 
          13   |          |          | @timed
          14   |          |          | def f1(n):
          15   |          |          |     s = 0
          16   |   6.61%  |   0.41%  |     for i in range(n):
          17   |  35.17%  |   3.97%  |         s += i
          18   |          |          |     return s
          19   |          |          | 
          20   |          |          | @timed
          21   |          |          | def f2(n):
          22   |   0.30%  |  53.53%  |     return sum(range(n))
          23   |          |          | 
          24   |          |          | @timed
          25   |          |          | def f3(t):
          26   |          |          |     time.sleep(t)
          27   |          |          |     return t
          28   |          |          | 
          29   |          |          | if __name__ == '__main__':
          30   |          |          |     f1(25000000)
          31   |          |          |     f2(75000000)
          32   |          |          |     f3(1.5)
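
(For the curious: a simplified sketch of how the two buckets can be told
apart from signal timing alone - illustrative only, not the actual Scalene
code. If the handler runs roughly on schedule, the interpreter was executing
bytecode; if it runs late, signals were piling up during a C call, and the
excess can be charged to the current line as C time.)

    
    
      import signal
      import time
      from collections import defaultdict
      
      INTERVAL = 0.01                    # sampling period, in seconds
      python_secs = defaultdict(float)   # (file, line) -> time in the interpreter
      c_secs = defaultdict(float)        # (file, line) -> time outside it
      last = time.time()
      
      def sample(signum, frame):
          global last
          now = time.time()
          gap = now - last
          last = now
          key = (frame.f_code.co_filename, frame.f_lineno)
          # Delivered on schedule: bytecode was running.
          python_secs[key] += min(gap, INTERVAL)
          # Delivered late: the excess accrued inside a C call.
          c_secs[key] += max(gap - INTERVAL, 0.0)
      
      signal.signal(signal.SIGALRM, sample)
      signal.setitimer(signal.ITIMER_REAL, INTERVAL, INTERVAL)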

~~~
gshulegaard
Just wanted to take a moment to appreciate the awesome response to feedback
and the newly added feature.

~~~
kinow
Yeah, love this kind of discussion and interaction on HN!

~~~
gigatexal
Yeah this is some of the best HN back and forth I’ve seen!

Kudos to everyone involved.

------
JackC
I love the idea of fast Python profilers that don't require modified source
code to run.

One profiler I used recently that isn't mentioned in the readme is pyflame
(developed at Uber):

[https://pyflame.readthedocs.io/](https://pyflame.readthedocs.io/)

pyflame likewise claims to run on unmodified source code and be fast enough to
run in production, so it might be worth adding to the comparison. It generates
flamegraphs, which greatly sped up debugging the other day when I needed to
figure out why something was slow somewhere in a Django request-response
call stack.

~~~
albertzeyer
py-spy is similar to PyFlame:
[https://github.com/benfred/py-spy](https://github.com/benfred/py-spy)

> While pyflame is a great project, it doesn't support Python 3.7 yet and
> doesn't work on OSX, Windows or FreeBSD.

I wonder how the CPU profiling in Scalene differs; the readme doesn't mention
PyFlame or py-spy at all. The memory profiler is a nice extra, of course.

~~~
emeryberger
Scalene author here. I was not aware of either tool - many thanks for the
pointers! I just tried py-spy (I will try PyFlame on a Linux box momentarily).
It's pretty cool, though having to run it as root on OS X isn't great (Scalene
runs without the need for root privileges; the CPU-profiling part is pure
Python). Py-spy does appear to track CPU performance efficiently at line
granularity, and it does a lot of things that Scalene does not do, but not
memory profiling. Also, I personally prefer Scalene's line-level display of
annotated source code to flame graphs, but YMMV.

~~~
Znafon
The ability to attach py-spy to an already-running Python program in
production is pretty awesome. Do you think Scalene could do this?

------
spott
Another Python profiler that isn't mentioned is pprofile[0]. It is also a
statistical profiler with line-level granularity. It doesn't have the memory-
profiling abilities (which are pretty slick...), but it also doesn't require
the dynamic library injection.

I don't know how its statistical-profiling speed compares to Scalene's, but
it would be great to see the comparison.

Also, does anyone know how the malloc interposition interacts with PyTorch?

[0][https://github.com/vpelletier/pprofile](https://github.com/vpelletier/pprofile)

------
jeanvalmarc
Another similar sampling profiler with decent performance is pyinstrument:
[https://github.com/joerick/pyinstrument](https://github.com/joerick/pyinstrument)

