
Debugging Distributed Systems With Why-Across-Time Provenance [pdf] - petethomas
https://mwhittaker.github.io/publications/wat_SOCC18.pdf
======
phaedrus
I've been working this year on exactly this problem. I "inherited"
responsibility for one of our vendor's Windows MFC codebases, which makes
gratuitous use of threads and every obfuscated/side-channel form of
nonfunctional data flow known to C++. It's like a mirror universe rendition of
Erlang created by a sloppy idiot savant using nothing but Windows messages and
public class member variables.

(For example _reading config values from the INI file_ gets its own
CWinThread-derived custom message pump, so that both the initiator of the read
and the handler of the result value can be different threads in different .cpp
files. I found one CWinThread which was merely being used to RAII one
unrelated member variable and whose constructor constructed a 3rd unrelated
object two stars (pointer dereferences) away. _Its run loop was empty_ (for
gosh sake).)

The reason for this monstrosity was that the program was written by a fresh
college graduate from the peak of the era when CS professors were saying "OMG
concurrency/threads/cores every processor of the 2010's is going to have 128
and doubling weak cores and if you don't multithread all your code you're part
of the problem not part of the solution." Then this fresh college graduate was
thrown into a hardware company whose software department still parties like
it's 1999 with MFC / Win32 in C++. And he created... this cosmic horror.

As you can imagine, stack traces are useless to me here. Logging is nearly
useless (because the app is always doing more than one thing, even when it has
< 1 thing to do). Breakpoints are useless to me (the args mean nothing; public
members and global variables hold all the relevant state - but they could be
anywhere).

What I ended up doing was adding SQLite to the codebase and creating a
"causality log" database. It's sort of like a call tree, but not. It's sort of
like a flow chart - but not. And it's sort of like a UML chart, but not. It
combines aspects of all of these; after testing a feature of the program I run
a post-processor on the log data to turn it into a dotGraph file so I can
render it as a graph.
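
The post-processing step described above can be sketched roughly as follows. This is only an illustration, assuming "dotGraph" refers to Graphviz's DOT format and that each logged event boils down to an (id, parent_id, label) row; the function name `rows_to_dot` and the sample rows are hypothetical, not taken from the actual codebase.

```python
def rows_to_dot(rows):
    """Render (id, parent_id, label) rows as Graphviz DOT text."""
    lines = ["digraph causality {"]
    for event_id, parent_id, label in rows:
        # Escape backslashes and double quotes so the label stays valid DOT.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  e{event_id} [label="{escaped}"];')
        if parent_id is not None:
            # An edge points from the cause to the event it caused.
            lines.append(f"  e{parent_id} -> e{event_id};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical sample rows: (id, parent_id, label)
sample_rows = [
    (1, None, "thread A created"),
    (2, 1, "thread B created"),
    (3, 2, "B.h() handled message from dialog X"),
]
dot_text = rows_to_dot(sample_rows)
```

The resulting text can be written to a file and rendered with Graphviz, e.g. `dot -Tpng causality.dot -o causality.png`.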

For example it might tell me that thread object A created thread object B
which then was given pointers to the following 3 global objects by assigning
B's member variables ex post facto. Later, a windows message came from dialog
X and B handled it in event handler B.h() and passed the data onto dialog Y.

That's the goal anyway - I have 2/3rds of the above paragraph implemented;
what I have left is figuring out how to pass a "causality" tracker value to
shadow Windows message sends.

The way I created this causality logging system is that my log-to-event-
database functions return an id to a row, and I (had to) change the code to
pass these ids (parent_id) through constructor calls and function calls, etc.
I find the non-local destructive assignments of public (should be private)
class member variables, and manually log those occurrences. All of these ids
provide a chain of provenance for causality and reachability. Because they are
database ids rather than call stack frames, they survive the return of any one
call frame.
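
A minimal sketch of that id-passing scheme, written in Python with the standard `sqlite3` module for brevity rather than in the C++ of the real codebase. The table layout and the names `open_log`, `log_event` and `provenance` are assumptions made for illustration, not the actual implementation.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) the causality log; schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " parent_id INTEGER REFERENCES events(id),"
        " description TEXT NOT NULL)"
    )
    return conn


def log_event(conn, description, parent_id=None):
    """Insert one event and return its row id, to be passed on as parent_id."""
    cur = conn.execute(
        "INSERT INTO events (parent_id, description) VALUES (?, ?)",
        (parent_id, description),
    )
    conn.commit()
    return cur.lastrowid


def provenance(conn, event_id):
    """Follow parent_id links back to the root; return descriptions root-first."""
    chain = []
    while event_id is not None:
        row = conn.execute(
            "SELECT parent_id, description FROM events WHERE id = ?",
            (event_id,),
        ).fetchone()
        if row is None:
            break
        event_id, description = row  # continue with the parent next
        chain.append(description)
    return list(reversed(chain))


conn = open_log()
a = log_event(conn, "thread A constructed")
b = log_event(conn, "thread B constructed by A", parent_id=a)
c = log_event(conn, "B.member assigned from global", parent_id=b)
chain = provenance(conn, c)
```

Because each id is just an integer stored in the database, it can be kept in a member variable or handed to another thread and still be used as a `parent_id` long after the function that logged it has returned.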

~~~
pjmlp
You just reminded me of an application for SNMP communication on HP-UX I had
to maintain in 2005. The thread usage per action was so complex that we
had a couple of A4 sheets glued together for the flowchart of the actions
that triggered threads being started and joined, and their synchronization. :\

------
adamcharnock
I would be very grateful if someone could spell out how an implementation of
this may look in a more concrete sense. I don't parse the math very well
myself.

My gut-feel has been that a 'message trace' of some sort would be useful in
debugging a distributed system. For example, every message contains the ID of
the message which caused it to be sent, if any (or a list of message IDs giving
the full causality chain). This is something I'm considering implementing in
Lightbus [1] using Python's contexts.

Is the proposed wat-provenance system here somehow different? To quote the
abstract:

> Given an arbitrary state machine, wat-provenance describes why the state
> machine produces a particular output when given a particular input.

So is this more akin to static analysis rather than runtime debugging?

[1]: https://lightbus.org

