Magic-trace – High-resolution traces of what a process is doing (github.com/janestreet)
831 points by cgaebel on April 22, 2022 | 140 comments



Hi HN! I'm Clark, one of the maintainers of magic-trace.

magic-trace has been submitted here before; our first announcement was this blog post: https://blog.janestreet.com/magic-trace/.

Since then, we've worked hard at making magic-trace more accessible to outside users. We've heard stories of people thinking this was cool before but being unable to even find a download link.

I'm posting this here because we just released "version 1.0", which is the first version that we think is sufficiently user-friendly for it to be worth your time to experiment with.

And uhh... sorry in advance if you run into hiccups despite our best efforts. Going from dozens of internal users to anyone and everyone is bound to discover new corners we haven't considered yet. Let us know if you run into trouble!


Windows has Windows Performance Analyzer, GPUView and PIX so most game devs are covered on that front :)


Do people still use GPUView? It hasn't seen a lot of development in years AFAIK, and I wondered whether it's still working and useful.

PIX is great! It gets regular updates and has an active, responsive Discord channel.


We use it internally, I'm not entirely certain of external usage. It still works and is good for tracking command packet scheduling and inter-process wait chains.

Yup, PIX is THE tool for game developers. The Direct3D team also has a very responsive Discord channel :)


I’m probably missing something, so apologies if this is obvious: does this only work on compiled programs, or could it work on any arbitrary running code? Everything from Firefox to my random Python script?


It works best on compiled programs.

We do try to support scripting languages with JITs that can emit info about which symbol is located where [1]. Notably, this more or less works for Node.js. It'll work somewhat for Python, in that you'll see the Python interpreter frames (probably uninteresting), but you will see any FFI calls (e.g., NumPy) with proper stacks.

[1]: https://github.com/torvalds/linux/blob/master/tools/perf/Doc...
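
For anyone curious, the interface in [1] is just a text file the runtime appends to, /tmp/perf-<pid>.map, with one line per JIT-compiled function. A minimal C++ sketch of a runtime registering a symbol (the helper name here is illustrative, not something from magic-trace):

  // Append one "<start-addr> <size> <symbol>" line (addresses in hex) per
  // JIT-compiled function to /tmp/perf-<pid>.map; perf-based tools then use
  // it to symbolize otherwise-anonymous JIT frames.
  #include <cstddef>
  #include <cstdint>
  #include <cstdio>
  #include <unistd.h>

  void register_jit_symbol(const void* start, std::size_t size, const char* name) {
      static FILE* map = [] {
          char path[64];
          std::snprintf(path, sizeof path, "/tmp/perf-%d.map", (int)getpid());
          return std::fopen(path, "a");
      }();
      if (map) {
          std::fprintf(map, "%lx %lx %s\n",
                       (unsigned long)(std::uintptr_t)start, (unsigned long)size, name);
          std::fflush(map);
      }
  }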


I'm curious about why Standard ML (SML) was chosen for this project, given the track record Jane Street has with OCaml. Do you see an advantage to using the former for this kind of project?


It's all OCaml, GitHub is just misclassifying it as SML :)


Hint[1] in case you’re ever in this situation:

  echo '*.ml  linguist-language=OCaml' >> .gitattributes
  echo '*.mli linguist-language=OCaml' >> .gitattributes
[1] https://github.com/github/linguist/blob/master/docs/override...


Thank you for this. I've made the change, but it looks like it may be several days before GitHub gets around to refreshing the language statistics.


So one needs to upload the trace log to your website to visualize? Any way to do it locally?


Absolutely, check out https://github.com/janestreet/magic-trace#privacy-policy and https://github.com/janestreet/magic-trace/wiki/Setting-up-a-.... With a bit of extra configuration, magic-trace can host its own UI locally. You just need to build the UI from source, and point magic-trace to it (via an environment variable).


Awesome work Clark!

Any plans to support Arm in the future? Thanks!


We don't have plans to add ARM support largely because we have no in-house expertise with ARM. That said, ARM has CoreSight which sounds like it could support something like magic-trace in some form, and we'd definitely be open to community contributions for CoreSight support in magic-trace.


On the website, scrolling doesn't work in mobile safari.


Thanks for sharing this.


why do you guys use Caml?



That seems to be a presentation about language features. I'm mostly interested in the business reasons for using the language within what Jane Street does, how the language offers a competitive advantage, and why it is "good enough" for the highly competitive HFT landscape they work in.


The language features are the competitive advantage.


Because Java is not the be-all and end-all.


why use java when you have c++?


So you can use Clojure. ;-)


Jane Street has been publishing some interesting projects recently. See the Signals and Threads podcast episode "State Machine Replication, and Why You Should Care" [0], posted 2 days ago.

I came across the incr_dom library [1], which efficiently calculates the diff on the projected DOM based on a diff of the model data, sorta like React but more... mathematically grounded (?). incr_dom was then reformulated as Bonsai [2], which refactors and generalizes the idea to work with more than the DOM. There was a Signals and Threads podcast episode about it a few months ago [3].

[0]: https://news.ycombinator.com/item?id=31100023

[1]: https://opensource.janestreet.com/incr_dom/

[2]: https://opensource.janestreet.com/bonsai/

[3]: https://signalsandthreads.com/building-a-ui-framework/


Note that this uses an Intel-specific API, so it doesn’t currently work on AMD or ARM:

https://man7.org/linux/man-pages/man1/perf-intel-pt.1.html

> Intel Processor Trace is an extension of Intel Architecture that collects information about software execution such as control flow, execution modes and timings and formats it into highly compressed binary packets. […] The main distinguishing feature of Intel Processor Trace is that the decoder can determine the exact flow of software execution.


We have a bit more color on compatibility in general up on <https://github.com/janestreet/magic-trace/wiki/How-could-mag...> for those interested.


Can I ask an office hours type question?

I worked on a very similar (if not identical lol) project at a job once upon a time and the biggest problem I had (and one that I never really solved well) was recovering call stacks from trace data. I effectively ended up using DWARF and just simulating execution and keeping a call stack in the decoder. This mostly worked fine for small and simple programs, but I ran into SO MUCH trouble because I found that (at least on my generation of cores) IPT actually overflows and drops packets very frequently if you have too many calls/returns too quickly. This is largely not an issue for C code but once you start getting into more dynamic languages with fancy features, IPT cannot keep up. Once packets get dropped, the entire call stack for the entire rest of the thread is ruined since you have no idea who called/returned in the dropped packets.
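
To make that concrete, the decoder state I'm describing amounts to something like this (a toy C++ sketch, not my actual implementation):

  // Replay a stream of decoded branch events, pushing on calls and popping
  // on returns. One dropped packet desynchronizes the reconstructed stack
  // for the rest of the thread, which is the failure mode described above.
  #include <cstdint>
  #include <vector>

  enum class BranchKind { Call, Return, Other };
  struct BranchEvent { BranchKind kind; uint64_t target; };

  std::vector<uint64_t> replay(const std::vector<BranchEvent>& events) {
      std::vector<uint64_t> stack;  // entry addresses of live frames
      for (const auto& ev : events) {
          switch (ev.kind) {
              case BranchKind::Call:   stack.push_back(ev.target); break;
              case BranchKind::Return: if (!stack.empty()) stack.pop_back(); break;
              case BranchKind::Other:  break;  // plain jumps, etc.
          }
      }
      return stack;  // the call stack as of the end of the trace
  }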

One option that we had but didn't really chase down due to time was maybe combining IPT with low frequency stack traces so that we can both just reset every so often and, if needed, work backwards/apply heuristics in order to arrive at that next callstack.

How did y'all manage this? Your call stacks look totally correct and I'm very impressed :)


I think this answer has several parts:

- I imagine the extra memory bandwidth of newer parts doesn't hurt. The example traces were taken on server-class Ice Lake machines. They just don't overflow for our typical workloads.

- We found the specific IPT configuration matters a lot. Turning off return compression is more liable to result in overflows. We allow varying this in magic-trace via the `-timing-resolution` parameter; more detail is available in the wiki. We don't typically see overflows under the default configuration, even on Broadwell server-class parts.

- Clark spent a week on an Intel NUC (mobile Tiger Lake part) toiling away on decode error recovery. For the most part, the data lost are uninteresting branches, and only one of the call into / return out of a frame needs to survive the decode error for us to be able to construct that frame.

We also considered the periodic stack sampling approach for error recovery, but ended up not implementing it since the decode error recovery we implemented ended up being robust enough in practice.

We ended up having more trouble with runtimes that mess with the stack pointer directly. (The kernel does this for the retpoline Spectre mitigation! But perf is smart and rewrites that part of the instruction stream into a jump for us.) There's code in magic-trace to special-case OCaml exceptions, for instance, and it's likely similar code is necessary for some other runtimes too (we have an open issue for Go's coroutine switching).


You should make it clearer in the magic-trace README that the UI service fork used for viewing and analyzing the traces is also available, should one want to deploy it locally or in a situation without Internet access:

https://github.com/janestreet/perfetto/

It is mentioned in the documentation, but for anyone quickly skimming and expecting a SaaS pricing model underneath (I think many do now), that isn't obvious. On my initial scroll through it wasn't obvious, and that made it significantly less attractive. Looks very interesting!


They mention that in the privacy section:

https://github.com/janestreet/magic-trace#privacy-policy


I stumbled on Google's Perfetto in the last few months. Very, very nice UI. I've been using it with viztracer for Python.


This tool is a fork of Perfetto.


Right, yeah I definitely recognized it in the screenshots.


The magic of PRs :)


Are people generally cool with unsolicited PRs to "editorial" content? I'd have assumed that stating or not stating something in a README is mostly an intentional decision.


I doubt there's a universal answer to that question, but I think most projects tend to overlook important information because the participants have been too deep in the weeds for too long. You tend to forget what newcomers won't find obvious about your tool.


I'm a person who works with FPGAs. For all their difficulties, waveforms from both simulation and real-time tapping are indispensable for tracking what happens before, during, and after an erroneous event occurs. One thing I always miss when going back to software development is some sort of 'waveform' of what the process has done over time, tracking the state of the system over that same course of time. Admittedly, dealing with CPU instructions and a program's function calls and overlapping threads is orders of magnitude more difficult than tracking how a bit changes in a waveform over time, but tools seem to be getting closer and closer to something equivalent, and that's pretty awesome to see.


Ha, this was my experience as well when getting into hardware after spending forever in software. It's SO amazing being able to just shoot a program through simulation and then look at the waveform to see how an instruction propagates down a pipeline. Debugging concurrency issues on hardware (i.e., incorrect re-ordering/concurrent scheduling) is honestly so much easier than debugging software concurrency, where you often can't even see the entire system state. We're starting to see software catch up with things like time-travel debugging on top of instruction tracing (whether Intel Processor Trace or ARM CoreSight), but the analysis tools for these sorts of things have nothing on wave analysis programs. They either force you into a linear interface (GDB time travel), which makes actually finding the issue a pain in the ass, or they simply don't give you the granularity of data that you need.


,,magic-trace.org is a fork of Perfetto, with minor modifications. We'd like to thank the people at Google responsible for it. It's a high quality codebase that solves a hard problem well.''

https://perfetto.dev

I don't like that this comment is hidden on the bottom of the page, as my first impression of the page was that the work of creating the high frequency trace was done by Jane Street (I don't like it in the other direction either, when a big company ,,rebrands'' what a person/small company does).


The trace-producing part of magic-trace is built on top of the more primitive Intel Processor Trace feature, not Perfetto. It's still standing on the shoulders of giants for sure, but few people would be able to effectively utilize Intel Processor Trace without the ergonomic improvements.

Perfetto is "just" the profiler UI.


Thanks, it wasn’t clear what the modification was; now it makes more sense.


Jane Street is a prop trading shop. I'd heard that they take tech seriously, but I'm still impressed to see them at the front of HN with a tool like this.


They are among the most notable users of OCaml in the world, and have taken significant stewardship of the language. They sponsor, attend, and publish at multiple academic conferences, especially in the realm of programming languages.

I don't think I can say I support their company's primary mission, but their commitment to improving the world of software through various means (language influence, academic publication, open-source software releases, etc) is admirable and well worth respecting.


Yeah, quite impressive language to make the backbone of all your systems.


I doubt that it is used for actual trading. Maybe the OP can elaborate on what the live infrastructure and backtesting setup look like?


> I doubt that it is used for actual trading.

Do you have a basis for that claim?

JS have developed tons of libraries and tools for OCaml development, and new developers and quants that they hire go through an OCaml bootcamp to come up to speed. They put lots of work into the OCaml compiler, and in blog posts about that work they talk about why this is useful for trading. Maybe I'm missing something crucial, but I think it's more likely that you just don't know what you're talking about.


Of course they use it for trading and everything around for many years.


They use it for the control plane. It's clearly not used for actually submitting trades.


Do you mean this tool, or OCaml? They are definitely writing their actual trading systems in OCaml, they have been publicizing this for many years.


That is what I thought. The actual trading code is probably in C or C++, or in an FPGA being controlled by the trading code.


Cool, so how about the OP shares the details of how they use it for trading?


That's largely what their excellent Signals & Threads podcast is about.

https://signalsandthreads.com/

> Listen in on Jane Street’s Ron Minsky as he has conversations with engineers working on everything from clock synchronization to reliable multicast, build systems to reconfigurable hardware. Get a peek at how Jane Street approaches problems, and how those ideas relate to tech more broadly.


Do you take issue with the market maker business model?


I think just the idea of a company that does not directly produce things, and instead spends its efforts turning its money into more money via investment, is something that... doesn't entirely sit well with me.

There is, of course, something to be said about the by-products of their work. Jane Street is far from an evil company, and I would not be entirely morally opposed to working for or with them. They do a lot of good in academic research in areas I care about. I just wish that that was their primary purpose instead of direct money-making, if that makes sense.


Fwiw, market makers sometimes say that by providing liquidity they benefit the whole market.


Doesn't that pretty much characterise all for-profit companies?


I think for a lot of people there's a pretty big difference among these business models:

- We use cash to buy circuit boards, screens, enclosures, etc, write software, and sell mobile phones.

- We use cash to rent a building, order pallets of inventory, and sell that inventory locally to walk-in customers.

- We use cash to buy shares, hold onto them for a bit, and sell those same shares and make money off the spread.

I'm not making any kind of comment at all about the value of market makers, just... those three businesses feel like they're different models.


They are the same model. They both buy stuff, do some stuff and then re-sell the same stuff but with the modifications they made.

In the case of a trading shop, the stuff they do is playing the market liquidity, collecting interest, arbitrage, etc...

Sure there are some evil ones, but other businesses have those too.


Yes, this is pretty much it. As I said in another comment, Jane Street does not have customers for whom they provide goods or services. Their primary business model is using the company's own money for investments, and that's a business model that I don't love.


Considering how many high finance shops live off of gambling with other peoples' money, I find the intellectual honesty of a company doing it on their own dime ... refreshing.


Company spends its own money to make money for itself?

Not exploiting teenagers in some 3rd world country?

Not gambling with your pension?

Not manipulating some physical commodity like oil?

What’s the problem here?


> Not exploiting teenagers in some 3rd world country?

> Not gambling with your pension?

> Not manipulating some physical commodity like oil?

There is no reason to believe #1 and #3 aren't true, and I should very much suspect they are. #2 is not possible as far as I can tell, I agree there.


> - We use cash to rent a building, order pallets of inventory, and sell that inventory locally to walk-in customers.

> - We use cash to buy shares, hold onto them for a bit, and sell those same shares and make money off the spread.

Those two sound like pretty much the same thing.


Indeed, but not like the first one.


Does that cover the entire financial industry?


Not necessarily. Companies that facilitate trades (e.g., E-Trade) don't fall in this category, because their business model is about providing a service to regular people.

Jane Street is a proprietary trading firm. They have no external customers to whom they would provide goods or services. Their primary purpose is to invest the company's own money.


I think an economist would say market makers reduce the volatility of a market, which does benefit regular people.


Partly because the cost of switching makes it practically impossible for them.


You think they are platinum-level sponsors at multiple conferences every year, sponsors of PLMW multiple times a year, sponsors for carbon-neutrality at ICFP each year, and continue investing in hiring PhDs and improving the OCaml ecosystem because... the... cost of switching away from OCaml is too great?

I don't think that tracks. They like OCaml, and they are pretty adamant that it is a good tool for the job. Maybe you disagree, but you should not project your opinions on them.


I bet it is also a great way for them to hire and retain great engineers, including the kind that isn't just in it for the money.


Speaking only of what I see (I'm a PhD student in the field of programming languages), quite a few people seek internships and full-time employment with JS specifically because of the tech stack and the kind of problems they (JS) tend to like throwing themselves at. They do some seriously cool stuff, and they've published a lot of great research!


I don't think that's the situation, but if you are irreversibly committed to a technology, you have an interest in seeing that tech continue to advance. Being stuck on a dead tech is even worse.


> I don't think that's the situation

I'm not sure what you mean by this. What is the situation?

> Being stuck on a dead tech is even worse.

Why do you think OCaml is a "dead tech"? Can you justify that? Or is it just based on the notion that if most people don't use it, it must be a bad tool?


I don't think he's saying it's dead tech. He's saying their incentive to invest in OCaml is to make sure it doesn't become dead tech.


Pretty much yes.

Cost is more than just monetary. There are significant indirect costs as well.


> There are significant indirect costs as well.

Can you elaborate on these costs? Do you have knowledge of JS's internal needs and resources to suggest a better alternative?

They are not just hapless consumers of a dead language; they actively maintain it and invest in it because it works well for them. The language itself gives them the kind of guarantees they want in their work, and their work on the language and surrounding tooling (among other things) helps them to acquire high-skill talent. I don't know how you can claim that they would transition to another language if they could without having some pretty firm data to back that claim up. Otherwise, I think you're just projecting your own feelings about OCaml onto them.


You seem eerily passionate about JS all over this thread.

Nowhere did I mention that OCaml was a dead language or a dying language. I simply stated that there are insurmountable switching costs, which incentivizes them to contribute to the larger OCaml community.


I think they are mostly here to hire people. Not to educate.


If by "here" you mean "at academic conferences", then I'd ask you why they bother to actually do primary research and publish it if they have no interest in sharing knowledge.

If all they wanted was to hire people, they... would. You don't have to sponsor a conference to attend or hire from that conference. And you especially also don't need to sponsor additional workshops, or carbon neutrality initiatives, or anything else.

It's genuinely silly to suggest that they spend all this money on things just to hire people. There are so many more effective uses of their money if that is the only goal.


Because they are hoping to run into interesting people to hire. Why would they share knowledge otherwise? If they are interested in doing that, why not share all the details of their tech stack and what exactly they are doing in the market? It's very hard to hire, especially with FAANGs competing with them for talent.


> why would they share knowledge otherwise?

Because they believe in the value of science?

They don't need to publish to compete with top-tier public companies. I don't know of any other trading firm that contributes to open-source development, or to a language infrastructure, or to academic advancement in the way and to the extent that Jane Street do. Most companies keep everything proprietary and highly secret.

But JS chooses to publish. And it's not like that's an easy task that you can just do for fun on a whim; it takes a long time to put together a good paper. They also regularly collaborate with people in academia on long-term projects and evaluations.

I understand the perspective you're suggesting, but I genuinely believe it is wrong, and I also believe that you do not have any evidence to back it up. I think their public contributions speak for themselves, but I've also met some of their more academically inclined engineers (including their CTO), and they come from an academic background and seem to genuinely believe in academic publishing as a goal in itself. It's not totally crazy that there exists one such company out there. (There are actually a couple, but not terribly many, and the others are not relevant in the present discussion.)


Tangentially related, but I really love the podcast from the Jane Street people: Signals and Threads.


Ron Minsky is an excellent communicator and incredible interviewer. Every episode has been phenomenal.


Needs a Skylake or later CPU. Linux users can

    cat /sys/devices/cpu/caps/pmu_name
to find out if they're invited to the party.


One of the maintainers here -- it should work on Broadwell if you're not super keen on the tens-of-nanoseconds timing precision and are okay with microsecond precision (i.e., you only want accurate call stacks),

  grep intel_pt /proc/cpuinfo
should do the trick.


Thank you! I was really disappointed that nothing in the README actually said what the extension name was...


You may also be interested in this wiki page: <https://github.com/janestreet/magic-trace/wiki/Supported-pla...>

Intel PT has a bunch of rough edges that we've tried to paper over in magic-trace, but the gritty caveats are documented in the wiki.


Thanks, good to see a new perf tool (and not just another procfs top clone :-). I've started using processor trace in the cloud, thanks to bare-metal instances, but for years I couldn't touch it (not available in VMs). There's a wealth of new information it provides, and we need better tooling on top of it, like magic-trace.

Glad to see "overhead" mentioned and quantified. I'd put the 2-10% at the top though, as that's heavy-handed for some environments (it can trigger a production fail-over).

I see magic-trace has implemented what some call "flame charts" (time on the x-axis) and not "flame graphs" (alphabet on the x-axis). The best tools do both (e.g., TraceCompass). Please do both! That will make seeing the big picture easy (flame graphs) and zooming into time-based patterns easy too (flame charts).


We are similarly sad about how unavailable Intel PT is in VMs. In 2022, being unavailable on Macs and VMs raises the barrier to entry extraordinarily high for many people in our target audience. Not sure if working outside of work is your cup of tea, but we've found Intel NUCs to be <$1000 and an unobtrusive way to play with these features at home.

Good point about overhead. I've moved the 2%-10% number front and center, and wrote up a bit more detail about where that comes from in a new wiki page: https://github.com/janestreet/magic-trace/wiki/Overhead

We'll think about adding flame graphs. We unfortunately have little experience writing responsive web UIs; the excellent Perfetto developers did all of the heavy lifting on that front. But who knows, maybe an enterprising Open Source Contributor could help us out. I see Matt Godbolt was asking questions in their Discord the other day...


(Perfetto developer here)

The Perfetto UI already supports flamegraphs btw (we use it for memory profiling and CPU stack sampling). We've never bothered to implement it for userspace slices because we've never had high frequency data there to make that a worthwhile view of the data.

Contributions for this upstream are very welcome :)


If you aren't averse to manual instrumentation, there's also Tracy[1].

[1]: https://github.com/wolfpld/tracy


Supports Windows and Mac too, not just Linux. Thanks for mentioning this; it fills a really big need for me.


Looks really cool. I'm sad that I'll never know; Intel + Linux + non-VM excludes literally every computer I have access to. We're all AMD + Windows, and my only access to Intel Linux machines would be a cloud VM.


Windows (AMD or Intel) has Time Travel Debugging: https://docs.microsoft.com/en-us/windows-hardware/drivers/de..., although this has extremely high performance overhead.

There are also the regular performance traces you can capture with wpr and friends. I don't think these provide function-level traces, and I also don't think it's possible to do that (but I could be wrong). You just get sampled call stacks, which may or may not be enough for your needs.

In my experience on Windows you need to instrument applications to get function-level tracing.


You can use PMUs in the cloud. Amazon's c6i.metal instance type is an Intel Ice Lake Xeon with full PMU access, for example.


Also, KVM and VMware (and maybe Xen?) allow PMU access too (but not VirtualBox). Both VMware ESXi and VMware Server/Workstation/Fusion allow PMU access; you just need to make sure it's enabled in the settings.

A quick way to check if PMU access is enabled is this:

  dmesg | grep "Performance Events"
Edit: Oh, unfortunately VMware Fusion 12 (on Intel Mac OS X) does not expose the performance counters to the VM anymore, as it uses the OS X Hypervisor framework instead of its own kernel module.


A simple answer that solves the problem, thank you!



> The key difference from perf is that instead of sampling call stacks throughout time, magic-trace uses Intel Processor Trace to snapshot a ring buffer of all control flow leading up to a chosen point in time[2].

> 2. perf can do this too, but that's not how most people use it. In fact, if you peek under the hood you'll see that magic-trace uses perf to drive Intel PT.

I think this (the first sentence quoted) is a bit misleading. The main feature is not really a "key difference from perf" if the main feature is implemented using perf. From a brief read, it looks like the real key difference is a friendlier and more interactive UI (both when capturing and viewing the trace).

Regardless, I think it looks neat, and will try to take it for a spin sometime soon.


"for Linux", just add that in the title.

We shouldn't need to scroll through pages of text to figure that out.


I suppose it's easy to forget to mention when you're doing open-source development and the target platform is also open source. Especially if it's Linux, which is the default.


Nice. And it satisfies my curiosity about whether trading firms are switching to AMD or not.


In HFT, single-threaded performance is king, so that's why we're all still on Intel. AMD is making progress but just not quite there yet.


Huh. My experience has been that AMD wins that, unless your application is so small that it can fit into Intel's smaller cache. And I thought the new 3D V-Cache parts from AMD would make your developers drool, allowing them to actually inline everything instead of being scared of building apps that are too big to fit into cache.


Not my experience at all and I work across different teams who own different latency sensitive apps. Most of them have unhygienically huge working sets.


To be clear: bitcharmer says "we" to mean "fellow HFTs", not "Jane Street".


Yes. Thanks, should have made that more explicit.


For low-latency strategies, AMD's lack of DDIO [0] makes it a non-starter. The memory latency is a big gap to close.

[0] https://www.intel.com/content/www/us/en/io/data-direct-i-o-t...


Do you know this for a fact? I've done some work in the industry where I needed to make fast software, but never, like, the sub-microsecond tick-to-trade type of fast, so I really don't know.

There was a great presentation from 2017 about some of Optiver's low-latency techniques [1]. I had assumed they released it because they had obviated all of them by switching to FPGAs, but I don't know. Either way, he suggested that if you ever needed to ping main memory for anything, you'd already lost. So I wouldn't have thought DDIO plays into their thinking much.

[1] https://www.youtube.com/watch?v=NH1Tta7purM


The idea is precisely that you want to avoid pinging main memory at all, which is possible (in the happy case) if you do things correctly with DDIO. Not everything is done in hardware where I am. I am wary of saying much because my employer frowns on it, and admittedly I work on the software more than the hardware, but DDIO is certainly important to us.


how do you access this DDIO feature if you are writing a C or C++ application? intrinsics?


DDIO operates mostly transparently to software, with the I/O controller feeding DMAs into a slice of L3. Hardware can opt out by setting PCIe TLP header hints, and you have some system-wide configurability via MSRs, but it's not something a userspace application can take into its own hands.


So is this taken advantage of by the OnLoad drivers of Solarflare cards, for example?


Noticed this just now. It is.


It's configurable via MSR. You can also disable it system-wide or on a PCIe port basis. I detailed it all here:

https://www.jabperf.com/skip-the-line-with-intel-ddio/
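
For the curious, the system-wide knob is just an MSR read away (root required). A C++ sketch of inspecting it from userspace, assuming the IIO_LLC_WAYS register at 0xC8B commonly cited for Skylake-SP -- treat that address as an assumption and verify it against the article and your CPU's documentation:

  // Read a model-specific register via the msr kernel module
  // (/dev/cpu/N/msr; needs root and `modprobe msr`). The pread offset
  // is the MSR address. 0xC8B (IIO_LLC_WAYS) is an assumption here.
  #include <cstdint>
  #include <cstdio>
  #include <fcntl.h>
  #include <unistd.h>

  int main() {
      const off_t IIO_LLC_WAYS = 0xC8B;  // assumed address; verify for your CPU
      int fd = open("/dev/cpu/0/msr", O_RDONLY);
      if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
      uint64_t value = 0;
      if (pread(fd, &value, sizeof value, IIO_LLC_WAYS) != sizeof value) {
          perror("pread");
          return 1;
      }
      printf("IIO_LLC_WAYS = 0x%llx\n", (unsigned long long)value);
      return 0;
  }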


I don’t know that this definitively answers that question. It’s possible to use a different architecture based on cost/performance and keep a small population of Intel machines in service because you want access to their superior PMUs. Most of what you learn on the latter would still apply to the former.


I wonder if there is some support for dumping PT traces only on some condition? It would be useful for debugging spikes in busy-loops.


Absolutely! This is one of the main features of magic-trace, and in fact a primary use-case.

You can select a trigger symbol, and magic-trace will snapshot upon the next call to it. This can be whatever you want, and you can imagine writing code like

  if (something_really_wonky_happened) { take_magic_trace(); }
and asking magic-trace to take a snapshot of the past only when `take_magic_trace` is called.
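
One practical detail (my own sketch of the approach, not magic-trace documentation): the marker function has to survive optimization, or there will be no symbol for the tracer to match against.

  // Hypothetical trigger-symbol marker. Keeping it out-of-line ensures
  // the symbol exists in the binary and a real CALL is emitted for it.
  extern "C" __attribute__((noinline)) void take_magic_trace() {
      asm volatile("");  // prevent the empty body from being elided
  }

  void handle_event(bool something_really_wonky_happened) {
      if (something_really_wonky_happened) {
          take_magic_trace();  // snapshot the ring buffer here
      }
  }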


Sounds great, thank you!


So instead of sampling or hardware perf counters, IPT does tracing? Perf counters can attribute cycles or cache misses to instructions, but only in aggregate. If your program isn't 100% computation, cycles consumed won't necessarily point to the bottleneck. So IIUC, IPT can tell you more than just the statistics and instead tell the real story: the sequence of instructions executed? If so, I can see this painting a much clearer picture. But what are the limitations? Some small buffers that overflow after a few million instructions or pipeline flushes? If the IPT buffer(s) overflow, does magic-trace indicate gaps?

Is it possible to combine PMU or sampling with IPT to get multiple profiling dimensions in the same run? Not just what sequence of instructions were executed but where in time-and-code the branch mispredictions, cache misses, etc. occurred?


You basically understand how this works; you see everything, but there might be gaps in the trace. In our experience they're rare (< 10 per multi-millisecond trace) and short-lived, and magic-trace can mostly infer what happened in that period fairly easily. You'll see these show up in the final trace as a little arrow that says "Decode error: Overflow packet" when you click on it, and the trace might look a little wonky (hopefully not too wonky!) from that point on.

In fact, if you look carefully at the demo gifs in the README, that trace had 5 decode errors! Nonetheless, it was extremely usable.

Snapshot sizes are configurable--you can go back as far as you like. However, the trace viewer tends to crash when the trace files reach the hundreds of MB and you'll need to do some work to set up a trace processor outside of your browser for the UI to connect to. The UI will offer up some docs if you actually run into this.

I'm so glad you asked us about PMU events, we've been thinking a lot about those. These are available in traces of the efficiency cores of Alder Lake CPUs, but nothing else. When we get our hands on a server class part with PMU tracing we'll add support ASAP. We conjecture that it will be absurdly useful to see cache events on a timeline next to call stacks.


I thought perf could also exploit Intel Processor Trace?


It can! You can read more about how to do that yourself here: https://perf.wiki.kernel.org/index.php/Perf_tools_support_fo....

magic-trace uses perf. If you want, you can think of it as a mere "alternative frontend" for the Intel PT decoding offered by perf.


Ah okay, I misunderstood magic-trace to be an alternative to perf.


Jane Street is making me consider using OCaml.


Can you apply it to itself?


Yes, in fact this is how we've been narrowing down performance problems in it and its dependencies :)

- https://github.com/let-def/owee/issues/23

- https://github.com/janestreet/magic-trace/issues/93


This is implemented in ML[1]?

First no-kidding application I've seen in that language.

[1] https://en.wikipedia.org/wiki/ML_(programming_language)


It's actually in OCaml.


If you just want calls and returns, can't you use one of the other PMUs for that? Or is sampling at the "1 sample per event" level higher overhead than IPT?


Do you mean configuring the other PMUs to interrupt the core on every function call / return?

If yes, then yes that is much much higher overhead than processor trace.


It's worth noting that aside from the overhead, function calls/returns are not quite enough to reconstruct the call stack: tail calls are just regular branch instructions.
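
A concrete illustration (assuming typical optimizing-compiler behavior):

  // With optimizations on, the call below is usually lowered to a plain
  // `jmp g` rather than `call g; ret`, so a call/return-based decoder sees
  // one CALL (into f) and one RET (out of g): g never gets its own pair.
  __attribute__((noinline)) int g(int x) { return x * 2; }

  int f(int x) {
      return g(x + 1);  // tail call: compiled as a branch, not a call
  }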


IPT - so I'll need some kind of recent *Lake CPU to have this support? And the tool won't work on EPYC, I guess?


Broadwell works, Skylake or later works better. We go into more detail about what platforms we support and why in https://github.com/janestreet/magic-trace/wiki/Supported-pla...


From the README: Intel only (Skylake or later), Linux only.


This may be a neat use case for bpf at some point.


Is this like Java Flight Recorder?


One time, I wrote a simple set of tools at my company.

1) A program that inserted a macro invocation with a GUID at every single new scope: every "{", except not for switch, struct, class, etc.

2) I made the macro instantiate an object on the stack, passing in the GUID. In the constructor and in the destructor, it called a singleton with thread-local storage holding a file pointer, where it would append a few bytes indicating whether it was entering or exiting scope, what the GUID was, and what the time was. In this way, each thread was writing to its own file. (Sketched below.)

3) I made a program which walked my source, looking up the file name and line number for each of my macro invocations, along with the GUID. If I were more sophisticated, I would have tried to get the function name out of it, too.

4) I made a program which would turn one of the thread's files into a Visual Studio recognized output. Basically "filename(linenumber): [content, such as the time]". Then I set that up as a tool in Visual Studio, and when I would run it, it would output in a Visual Studio window. The reason for that was then I could hit (I think) F4 and Shift-F4 to step forward and backward through the output, and each time it would jump to the source code at that location.

So then I had a forward-and-backward time-travelling debug script. I think I also started manually passing function parameters into the macros, which would format (a: "a") on the debug line, too.
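
The mechanism, as best I can sketch it from memory (names and details are illustrative):

  // TIMER(guid) expands to an RAII object that logs scope entry in its
  // constructor and scope exit in its destructor, appending records to a
  // per-thread file. In Release builds the macro expands to nothing.
  #include <chrono>
  #include <cstdio>
  #include <functional>
  #include <thread>

  #ifndef NDEBUG
  struct ScopeTimer {
      const char* guid;
      static FILE* log() {
          thread_local FILE* f = [] {
              char path[64];
              auto tid = std::hash<std::thread::id>{}(std::this_thread::get_id());
              snprintf(path, sizeof path, "trace-%zu.log", tid);
              return fopen(path, "a");
          }();
          return f;
      }
      static void emit(char tag, const char* guid) {
          auto t = std::chrono::steady_clock::now().time_since_epoch().count();
          if (FILE* f = log()) fprintf(f, "%c %s %lld\n", tag, guid, (long long)t);
      }
      explicit ScopeTimer(const char* g) : guid(g) { emit('>', guid); }
      ~ScopeTimer() { emit('<', guid); }
  };
  #define TIMER(guid) ScopeTimer _scope_timer_(guid)
  #else
  #define TIMER(guid) ((void)0)
  #endif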

We had automated testing of our whole integrated application. I wanted to record my output from each automated test. Then when I was checking in new code, I could see which new GUIDs were never touched by any of our integration tests, to tell me how much coverage we had. And I could tell the testers which automated tests were most likely to exercise my code changes.

I liked that the GUIDs would have been stable, even as code moved. (Unlike file name, line number, or even class and function name.)

And yes, seeing this in the code wasn't great:

  { TIMER("5c7c062f-84a3-40d0-b7cd-77bd9db59f3e");
    // real code
  }

I wanted to teach Visual Studio how to basically ignore those, and if I copied code and pasted it, have it generate new GUIDs when I pasted.

But I could imagine using the output to generate the flame charts and other debugging tools, like in the article.

And it all compiled away to nothing in Release mode.

The payoff of this felt large, and the cost felt small. But the biggest pain was that humans would see these macro invocations, and need to maintain them.

So I chickened out and didn't force my coworkers to see all of this.








