Show HN: 0xTools – Always-On Profiling for Production Systems (0x.tools)
6 points by tanelpoder 4 days ago | 2 comments

From mysqld error log on node 1: 2020-10-19T23:34:30.988023-06:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address node2:3306 has become unreachable.'

From node 2, data missing for 7 seconds:

  SchedLat by Tanel Poder [https://0x.tools]

  PID=1478 COMM=mysqld
  TIMESTAMP              %CPU   %LAT   %SLP
  2020-10-19 23:34:22     0.0    0.0  100.0 
  2020-10-19 23:34:23     0.0    0.0  100.0
  2020-10-19 23:34:24     0.0    0.0  100.0
  2020-10-19 23:34:25     0.0    0.0  100.0
  2020-10-19 23:34:32     0.0    0.0  100.0
  2020-10-19 23:34:33     0.0    0.0  100.0
  2020-10-19 23:34:34     0.0    0.0  100.0
  2020-10-19 23:34:35     0.0    0.0  100.0
  2020-10-19 23:34:36     0.0    0.0  100.0
Now what?

I wasn't able to "install" the tools because policy does not allow gcc to be installed, so no xCapture. I do have a perf file, but I have never used perf before and am not sure what I'm looking at.
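The missing interval in the SchedLat output above can also be found mechanically rather than by eye. The following is only a minimal sketch, assuming the one-row-per-second layout shown in the paste and Python 3.6 or newer; `find_gaps` and `sample` are hypothetical names for illustration and are not part of 0xTools.

```python
from datetime import datetime, timedelta

def find_gaps(lines, interval=timedelta(seconds=1)):
    """Return (previous, current) timestamp pairs that are more than `interval` apart."""
    stamps = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # blank line, PID line or column header
        try:
            stamps.append(datetime.strptime(" ".join(fields[:2]), "%Y-%m-%d %H:%M:%S"))
        except ValueError:
            continue  # banner or other non-sample line
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > interval]

# A few rows copied from the output above (assumed layout: date, time, %CPU, %LAT, %SLP).
sample = """\
  TIMESTAMP              %CPU   %LAT   %SLP
  2020-10-19 23:34:24     0.0    0.0  100.0
  2020-10-19 23:34:25     0.0    0.0  100.0
  2020-10-19 23:34:32     0.0    0.0  100.0
  2020-10-19 23:34:33     0.0    0.0  100.0
"""

for prev, cur in find_gaps(sample.splitlines()):
    print(f"no samples between {prev} and {cur} ({(cur - prev).total_seconds():.0f}s)")
```

On these rows it reports a single gap, from 23:34:25 to 23:34:32, which brackets the 23:34:30 "unreachable" warning in the node 1 error log.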

Author here. I regularly troubleshoot non-trivial performance problems on Linux (mostly running busy Oracle databases, but other things too).

I wrote 0xTools for two main reasons:

1) Have an easy way to report process/thread level top activity, but with the ability to break down what the threads were doing at the time - were they stuck in a system call, spending excessive time (sleeping) in some kernel location, etc.

2) Have an ability to "go back in time" for advanced troubleshooting, so when some intermittent problem happens, you can troubleshoot it "at first occurrence" and not have to wait for the problem to show up again. For that, you must have detailed (but low-overhead) always-on collection of thread activity enabled.

Since I'm using these tools in critical production systems, they also must be safe to use, must not require installation of custom kernel modules, etc - and sampling the Linux kernel-presented /proc filesystem with standard OS tools can go surprisingly far. My tools just make collection and analysis of the data easier.
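To illustrate how far plain /proc sampling can go, here is a rough sketch of the idea. It is not how the 0xTools programs are actually implemented. It assumes Linux and Python 3.6 or newer, and the function names (`parse_stat`, `sample_threads`, `profile`) are made up for this example. It reads each thread's state from `/proc/<pid>/task/<tid>/stat` and its kernel wait location from `wchan`, then counts how often each combination was seen.

```python
import os
import time
from collections import Counter

def parse_stat(text):
    """Extract (tid, comm, state) from the contents of a /proc/<pid>/task/<tid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    left, right = text.rsplit(")", 1)
    tid, comm = left.split("(", 1)
    return int(tid), comm, right.split()[0]

def read_first_line(path):
    try:
        with open(path) as f:
            return f.readline().strip()
    except OSError:
        return "-"  # thread exited or insufficient privileges

def sample_threads(pid):
    """Take one sample of every thread of `pid` as (comm, state, wchan) tuples."""
    task_dir = f"/proc/{pid}/task"
    try:
        tids = os.listdir(task_dir)
    except OSError:
        return []  # process is gone or /proc is not available
    samples = []
    for tid in tids:
        stat = read_first_line(f"{task_dir}/{tid}/stat")
        if ")" not in stat:
            continue  # thread disappeared between listing and reading
        _, comm, state = parse_stat(stat)
        wchan = read_first_line(f"{task_dir}/{tid}/wchan")
        samples.append((comm, state, wchan))
    return samples

def profile(pid, seconds=5, hz=1):
    """Sample `pid` at `hz` samples per second and count each (comm, state, wchan)."""
    counts = Counter()
    for _ in range(seconds * hz):
        counts.update(sample_threads(pid))
        time.sleep(1 / hz)
    return counts

if __name__ == "__main__":
    # Profile this script itself for 2 seconds; pass another PID to look at a real workload.
    for (comm, state, wchan), n in profile(os.getpid(), seconds=2).most_common(10):
        print(f"{n:6d}  {comm:<16} {state}  {wchan}")
```

Threads that show up mostly in state D, or in S with the same wchan value every time, are the ones stuck or sleeping in a particular kernel location. Reading `wchan` for another user's process may need elevated privileges; the sketch prints "-" in that case.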

You may ask why I don't just write an eBPF script for this. Most of the systems I end up troubleshooting are still running RHEL6 (or clones like CentOS 6), with some RHEL5 and RHEL7, on Linux kernels 2.6.32, 2.6.18 and 3.10 respectively. No functioning eBPF there! It will probably take 10 years before 99% of traditional enterprise production systems are running RHEL 8.1 or newer (where Red Hat officially started supporting eBPF).

Give it a try, it's very easy to test - no uncommon dependencies or OS reconfiguration needed. Questions, comments, enhancement requests welcome!
