
How do Ruby and Python profilers work? - ingve
https://jvns.ca/blog/2017/12/17/how-do-ruby---python-profilers-work-/
======
ekimekim
An amazing part of sampling-based profilers: They're so low overhead, you can
leave them on permanently in production. And since all they collect is
counts, you can combine results from multiple instances very easily. Put this
together and it forms an easy, highly insightful part of your production
monitoring systems. Notice a service being slow? You can instantly see its
live flamegraph, check a function's average runtime over time, etc.

~~~
MuffinFlavored
Could you please share a really high-level pseudocode example of what kind of
data a sampling-based profiler collects and how it does it?

~~~
amorphid
In Ruby, check out
[https://apidock.com/ruby/Kernel/set_trace_func](https://apidock.com/ruby/Kernel/set_trace_func).
I believe Python has something similar, too.
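
Python's closest analog is `sys.settrace`, which invokes a callback on interpreter events like function calls, line executions, and returns. A minimal sketch (the `call_counts` bookkeeping is just illustrative, not any particular profiler's format):

```python
import sys

call_counts = {}

def tracer(frame, event, arg):
    # The interpreter calls this for every 'call', 'line', 'return', etc.
    if event == "call":
        name = frame.f_code.co_name
        call_counts[name] = call_counts.get(name, 0) + 1
    return tracer  # returning the tracer keeps per-line tracing active

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

sys.settrace(tracer)
fib(10)
sys.settrace(None)

print(call_counts["fib"])  # every single call is observed -- hence the overhead
```

Because the callback fires on every event, this is exactly the tracing (not sampling) style of profiler, with the per-call overhead the article describes.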

~~~
ekimekim
I'm referring to a sampling profiler - as mentioned in the article, these
generally work on a timer (either a thread with a loop, or a recurring
signal), not by a per-function callback.
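
A toy version of the thread-with-a-loop approach, sketched in Python using `sys._current_frames()` (a CPython-specific, nominally private API; real sampling profilers also deal with signals, native stacks, and multiple threads):

```python
import collections
import sys
import threading
import time

samples = collections.Counter()

def sample_loop(target_thread_id, interval=0.001, duration=0.5):
    # Wake up on a timer, grab the target thread's current stack,
    # and bump a counter keyed by that stack.
    deadline = time.time() + duration
    while time.time() < deadline:
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            stack = []
            while frame:
                stack.append(frame.f_code.co_name)
                frame = frame.f_back
            samples[tuple(reversed(stack))] += 1
        time.sleep(interval)

def busy():
    x = 0
    for _ in range(10_000_000):
        x += 1
    return x

sampler = threading.Thread(
    target=sample_loop, args=(threading.main_thread().ident,)
)
sampler.start()
busy()
sampler.join()

for stack, count in samples.most_common(3):
    print(" > ".join(stack), count)
```

Each (stack, count) pair is exactly the input a flamegraph renderer consumes, and combining results from multiple instances is just summing the counters.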

------
alex-
It might be interesting to some to know that relatively recently (2016) PEP
523 (
[https://www.python.org/dev/peps/pep-0523/](https://www.python.org/dev/peps/pep-0523/)
) was created which provides a more efficient alternative to PyEval_SetTrace
or PyEval_SetProfile for some use cases.

The idea of this PEP is to add a frame evaluation API to CPython. As the PEP
says, "For instance, it would not be difficult to implement a tracing or
profiling function at the call level with this API".

Elizaveta Shashkova (a PyCharm developer at JetBrains) gave a really good talk
on the subject at this year's PyCon (
[https://www.youtube.com/watch?v=NdObDUbLjdg](https://www.youtube.com/watch?v=NdObDUbLjdg)
).

------
sciurus
This is a really nice blog post! It covers the ecosystem well and goes into
the right level of detail to help someone unfamiliar with the tools understand
how they work.

At PyOhio I gave a hastily-prepared lightning talk on building a statistical
profiler in Python. I'll link to it here because, although it goes into much
less detail than Julia's post, it shows a visualization technique and an
example of the sort of payoff you get from profiling.

[https://www.polibyte.com/2017/08/05/building-statistical-profiler/](https://www.polibyte.com/2017/08/05/building-statistical-profiler/)

------
richardwhiuk

      > The main disadvantage of tracing profilers implemented in this way is
      > that they introduce a fixed amount for every function call / line of code
      > executed. This can cause you to make incorrect decisions! For example, if
      > you have 2 implementations of something – one with a lot of function calls
      > and one without, which take the same amount of time, the one with a lot of
      > function calls will appear to be slower when profiled.
    

I'm not convinced this is the main disadvantage of tracing profilers. The main
disadvantage is that the perf impact tends to be much more severe.

It's theoretically possible to mitigate the problem identified (keep track of
how much time you spend profiling, and subtract that amount from the time
tracked in each function). However, it's harder to have a tracing profiler
which only profiles some functions, and harder to specify which ones.

The nice thing about sampling profilers is that it's very easy to configure the
performance impact and easy to get a good sample by just running for longer.

~~~
unwind
Please don't indent; it doesn't make block quotes. The quote is very
hard to read on mobile.

Use asterisks around the paragraph; that renders as italics.

Thanks.

~~~
gsich
Use landscape mode.

------
viraptor
I'm surprised that no system-wide tracers were included (unless I missed
something?). You can use dtrace / systemtap with both ruby and python (>=3.6).
You still have to extract the bits you're interested in, but the hooks for
function entry/exit and GC are exposed, so you can effectively add a targeted
profiler/tracer at runtime.

------
cdancette
I'm sad there are no drawings in the article :( Julia Evans's drawings are
amazing. If you've never seen them, you should check out the strace zine:
[https://jvns.ca/zines/#strace-zine](https://jvns.ca/zines/#strace-zine)

But it's still a very interesting blog post!

------
trentnelson
I guess this is as good a time as any to make my post-PyParallel pet project
public (holy alliteration Batman):
[https://github.com/tpn/tracer](https://github.com/tpn/tracer).

I started it about 2.5 years ago because I wasn't really happy with existing
Python tracing tools. I didn't like:

a) the performance hit when tracing non-trivial programs,

b) the inability to say: "hey, trace all code living in modules mymodule1 and
mymodule2; ignore everything else", and

c) the inability to say: "screw it, trace everything", and have an efficient
means to persist tens of gigabytes worth of data (i.e. don't separate in-
memory representation from on-disk representation).

So I came up with this tracing project, creatively named: tracer.

It has relatively low overhead, efficiently persists tens of gigabytes worth
of data to disk, and allows for module-based tracing directives so that you
can capture info about your project and ignore all the other stuff (like
stdlib and package code).

Like all fun pet projects, once the easy stuff was done, this one devolved
into writing assembly to do remote thread injection
([https://github.com/tpn/tracer/blob/master/Asm/InjectionThunk...](https://github.com/tpn/tracer/blob/master/Asm/InjectionThunk.asm#L357)),
coming up with a new "string table" data structure as an excuse to use some AVX2
intrinsics
([https://github.com/tpn/tracer/blob/master/StringTable/Prefix...](https://github.com/tpn/tracer/blob/master/StringTable/PrefixSearchStringTable.c)),
and writing a sqlite3 virtual table interface to the flat-file memory maps,
complete with CUDA kernels
([https://github.com/tpn/tracer/blob/master/TraceStore/TraceSt...](https://github.com/tpn/tracer/blob/master/TraceStore/TraceStoreSqlite3ExtInit.c#L228)).

Edit: something pretty fascinating I observed in the wild when running this
code: a tracing-enabled Python run of a large project ran faster than a
normal, non-traced version of Python in a low-memory environment (a VM with
8GB of RAM). My guess is that splaying the PyCodeObject structures on each frame
invocation has the very beneficial side effect of reducing the likelihood the
OS swaps the underlying pages out at a later date
([https://github.com/tpn/tracer/blob/master/Python/PythonFunct...](https://github.com/tpn/tracer/blob/master/Python/PythonFunction.c)).

~~~
chubot
Hm interesting -- what parts of it are tied to Windows?

I recently started playing with uftrace, a tracing framework for C++ using
instrumented binaries. I got it working on CPython and I like it so far.

The CLI seems to be well done. I haven't looked at your tracer in detail (a
README would help), but maybe there are some ideas that can be exchanged.

This thread has a link to the repo and a video:

[https://www.reddit.com/r/ProgrammingLanguages/comments/7djre...](https://www.reddit.com/r/ProgrammingLanguages/comments/7djreg/uftrace_a_function_graph_tracer_for_cc_userspace/)

~~~
trentnelson
> Hm interesting -- what parts of it are tied to Windows?

If I had to guess... I'd say... almost all of it? :-)

I like systems programming in C on NT these days far more than Linux systems
programming. Once you've gotten used to how NT does I/O and threading, it's
tough to look back. Sure is lonely going against the grain, though.

(Just added a README, in that I copied and pasted some notes I wrote a month
or so ago into a text file. I'll add more stuff (you know, like, instructions)
soon.)

------
blt
As far as I can tell, there's not much of a reason to use an instrumenting
profiler over a sampling one. Is there some niche use case? Maybe to
experimentally validate some claim about average case computational
complexity?

~~~
MaulingMonkey
I use _explicit_ instrumenting APIs to generate flamegraphs a lot - useful for
a lot of niches in gamedev - where we care about individual frames exceeding
their frame budget. Being able to capture several thousand frames worth of
flame graphs and then being able to drill down into the individual outliers is
absolutely killer in this context. Sampling profilers are great for getting a
broad idea of where problems might be, but not nearly as good as instrumented
options when you really want detail. For example:

- Finding where we checked a file for modification on the main thread every
few seconds, to consider reloading resources, causing us a 5ms spike that
would sometimes cause us to miss vsync.

- Spotting a framerate hitch caused by the main thread stalling out on a
contested logging lock under some specific set of circumstances for a frame or
two.

- Spotting frames where overly aggressive background jobs simply starved the
main thread out of CPU cycles.

- Which specific subanimations of a set (complete with names because I'm
using an explicit API that lets me annotate them) are causing our optimization
code to exhibit O(scary) behavior.

There's nothing here that you couldn't eventually figure out and solve with a
debugger and a sampling profiler, but instrumenting profilers - and profiler
APIs that let you add your own annotations - make the job way easier in my
experience.
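
That kind of explicit API can be sketched in a few lines of Python (the `span` name and the frame/physics/render labels are made up for illustration; a real gamedev tool would be C++ with far lower overhead):

```python
import time
from contextlib import contextmanager

events = []  # (depth, name, duration) records; a real tool keeps timestamps too
_depth = 0

@contextmanager
def span(name):
    # Explicitly annotate a region of code; nesting gives the flamegraph shape.
    global _depth
    start = time.perf_counter()
    _depth += 1
    try:
        yield
    finally:
        _depth -= 1
        events.append((_depth, name, time.perf_counter() - start))

with span("frame"):
    with span("physics"):
        time.sleep(0.005)
    with span("render"):
        time.sleep(0.002)

# Any frame whose total exceeds its budget can now be drilled into by name.
for depth, name, dur in events:
    print("  " * depth + f"{name}: {dur * 1000:.1f}ms")
```

Because every span is annotated by the programmer, outlier frames carry exact names and nesting rather than whatever stacks a sampler happened to catch.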

~~~
blt
This is a great answer, thanks!

------
wdewind
One of the things that makes Ruby so difficult to profile is that in order to
do anything really fun (like get memory usage, for instance), you need to
shell out, which means you're forking and doubling your memory footprint
immediately, which means you probably can't do this in production. This is
especially frustrating because by far the biggest issue you'll have with a
Ruby (especially Rails) app is memory usage, so trying to pinpoint which of
the many things currently running on your box is blowing up your memory can
be non-trivial.

~~~
jvns
Why do you have to fork to get a memory profile? I don't know anything about
memory profiling yet (I collected a bunch of links in
[https://gist.github.com/jvns/b81eb039f6373bc577d7dbdd978581b...](https://gist.github.com/jvns/b81eb039f6373bc577d7dbdd978581b5)
but haven't read their source yet) but I'd like to know more. It seems like it
should be possible to get a memory profile in production without a prohibitive
amount of overhead though.

Also, I believe forking shouldn't _actually_ double memory usage because of
copy on write.

~~~
wdewind
Ruby can't tell how much memory it is using from within the process; the only
way is to shell out and scrape the output of ps. I would love to be wrong
about this; I did a fair amount of research and all the profiling libraries I
read used this method. One thing I've done instead is use object allocation,
which is available from the builtin benchmarker, as a proxy for memory usage.
I don't have a good sense of how accurate that is, but anecdotally it seemed
useful and correlated with stuff like page load times.

(Edit: ^happily I am wrong about this, though I believe the copy on write
thing below is still correct, see further discussion)

Copy on write does definitely help here, but only for Ruby >= 2.0.

~~~
TheDong
I don't have much experience in ruby profiling, but what you're saying doesn't
make sense.

> the only way to is shell out and scrape the output of ps

All `ps` is doing is reading files in proc. The memory info it prints is in
`/proc/$pid/status`. Ruby can trivially read `/proc/self/status` to get that
very naive view of how much memory it's using.

That isn't a very useful number, though, because it doesn't actually show
where object allocations go.
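
For instance, pulling a field like VmRSS out of that file takes only a few lines; a sketch in Python with a made-up helper name and a canned sample string (the Ruby version is the same idea):

```python
def parse_status_kb(text, field="VmRSS"):
    # Lines in /proc/<pid>/status look like "VmRSS:      51200 kB"
    for line in text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1])  # value in kB
    return None

# On Linux a process can read its own file:
#   with open("/proc/self/status") as f:
#       rss_kb = parse_status_kb(f.read())

sample = "Name:\tirb\nVmPeak:\t  102400 kB\nVmRSS:\t   51200 kB\n"
print(parse_status_kb(sample))  # 51200
```

This is the same naive resident-set number ps reports, just without forking anything.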

> Copy on write does definitely help here, only for Ruby > 2.0.

I believe she meant the operating system's copy-on-write of memory. Forking a
copy of a process is actually very cheap because it's copy-on-write. That has
nothing to do with ruby or its version and has been true in linux for as long
as linux has existed.

I really don't understand what you're trying to say or why you mention ps at
all.

~~~
wdewind
> All `ps` is doing is reading files in proc. The memory info it prints is in
> `/proc/$pid/status`. Ruby can trivially read `/proc/self/status` to get that
> very naive view of how much memory it's using.

Yep, great point, I am simply wrong about this.

> I believe she meant the operating system's copy-on-write of memory. Forking
> a copy of a process is actually very cheap because it's copy-on-write. That
> has nothing to do with ruby or its version and has been true in linux for as
> long as linux has existed.

Re: copy on write, the way Ruby implements GC in < 2.0 screws this up:

> The way ruby creates objects, the GC flag or reserved bit is stored in the
> object itself. So, as you would have guessed by now, when the GC runs in one
> of the processes, the GC flag would be modified in all the objects even if
> they are present in the shared pool. Now, the OS would sense this and
> trigger a copy-on-write making private copies of the objects in each child’s
> memory space.

[https://medium.com/@rcdexta/whats-the-deal-with-ruby-gc-and-...](https://medium.com/@rcdexta/whats-the-deal-with-ruby-gc-and-copy-on-write-f5eddef21485)

~~~
zb
That is a genuine issue with fork(), but if you do fork() followed immediately
by exec() then you won't ever encounter it. (The new process's memory space
completely replaces the copied pages of the child.) So it's a real problem for
e.g. spinning off worker processes, but not at all for e.g. calling out to a
shell.

------
FascinatedBox
I'm currently working on a programming language (Lily; see this name on
GitHub), and this is really helpful. I've been thinking about adding debugging
support, and this will help a lot toward doing that.

------
josefelix
Great information! I am currently learning Python and this helps a lot in my
studies.

