
Debugging Distributed Systems - luu
https://dl.acm.org/doi/pdf/10.1145/2927299.2940294
======
epdlxjmonad
For debugging a distributed system, it may be just fine to use the traditional
way consisting of testing, log analysis, and visualization that everyone is
familiar with. Yes, there are advanced techniques such as formal verification
and model checking, but depending on the complexity of the target distributed
system, it may be practically out of the question or just not worth your time
to try to apply these techniques (unless you are in a research lab or
supported by FAANG). In other words, there may be nothing inherently wrong
with sticking to the traditional way, because distributed systems are hard to
debug by definition and there is (and will be) no panacea.

We have gone through the pain of testing and debugging a distributed system
that has been under development for the past five years. We investigated
several fancy ways of debugging distributed systems, such as model checking
and formal verification. In the end, we decided to use (and are more or less
happy with) the traditional way. The decision was made mostly because 1) the
implementation is too complex (a lot of Java and Scala code) to allow formal
techniques; 2) the traditional way can still be very effective when combined
with careful design and thorough testing.

Before building the distributed system, we worked on formal verification
using program logic and were knowledgeable about a particular sub-field. From
our experience, I would guess it would require a PhD dissertation to
successfully apply formal techniques to debugging our distributed system. The
summary of our experience is at
[https://www.datamonad.com/post/2020-02-19-testing-mr3/](https://www.datamonad.com/post/2020-02-19-testing-mr3/).

~~~
collyw
I wonder if the promise of cheap hardware using distributed systems is offset
by the increased complexity and developer time. Stack Overflow scaled up
rather than out, and I have never seen a problem with their site.

~~~
taude
Is S.O. a good example of a complicated/large distributed system? I couldn't
find any quick googleable results on how many people work on their site.

The reason I ask is that I'm working on a product with 30-ish different
pods/teams (maybe about 200-250 engineers) working on their respective
modules, microservices, etc. From talking to a lot of others at conferences,
my understanding is that our distributed system is fairly small (in terms of
functional modules/teams, transactions, etc.).

Anyway, even at our small-ish scale, I couldn't imagine running our platform
as a single app that we were able to scale up with better hardware.

Also, I think how a company supports multi-tenancy would play a big role in
deciding how this works, too, because you can have a scenario with a monolith
and a DB but partitioned into individual tenant DBs, app servers, etc., and
you still have a huge pile of hardware (real or virtual) you're dancing
around in.

~~~
collyw
My point is that Stack Overflow seems to have kept things as simple as
possible, the opposite of a "complicated distributed system". It seems to be
a classic relational-database-backed app with some additions for specific
parts where it needed to scale. In the end I guess it is distributed, but it
looks like it's based largely around a monolith.

[https://stackexchange.com/performance](https://stackexchange.com/performance)

------
orbifold
This is perhaps one of the few places where software developers could learn
something from hardware development: in hardware description languages,
everything happens in parallel by default. A large chip project has many
components that typically communicate fully asynchronously over some
network/bus. Tools for debugging these systems include:

- A GUI that allows one to trace all state changes down to the single-bit
level.

- Assertions based on linear temporal logic (LTL) to verify invariants (see
the example after this list).

- Conventions around how to model transactions at various levels of
abstraction, as well as how to monitor and extract these transactions from
the "wire protocol".

~~~
marcosdumay
The article talks about all three, but with different names.

At the end of the day, we keep hardware as simple as possible, so it's easier
to verify, and all the extra complexity goes into software, so it's easier to
iterate. As a consequence, the hardware tooling is far from enough for
software debugging. Software has more powerful tools, but those aren't enough
either - distributed systems are a hairy problem.

------
di4na
Aaaaaaaand of course there is nothing about the tools that are in the main
runtimes built around distributed systems.

Their explanation of tracing shows they have never played with an Erlang
system.

As someone who builds distributed systems every day, I would really like all
of our industry to play more with production Erlang systems. That would
answer a lot of these explorations instead of badly reinventing the wheel
over and over again.

~~~
h91wka
As an Erlang systems designer who worked for two of the biggest Erlang
companies in the market, I can assure you that Erlang/OTP is not some magic
pixie dust that suddenly makes all the distributed systems problems go away.
Playing is one thing; making a reliable, fault-tolerant system is another.

OTP doesn't give you _state-of-the-art_ distributed consensus algorithms out
of the box.

~~~
ignoramous
AWS built SimpleDB with Erlang [0] but seemingly hasn't used it much ever
since... I wouldn't be surprised if SimpleDB has since been moved to Aurora
(like DocumentDB [1], Neptune, and Timestream) or DynamoDB [2] behind the
scenes.

[0]
[https://en.wikipedia.org/wiki/Amazon_SimpleDB](https://en.wikipedia.org/wiki/Amazon_SimpleDB)

[1]
[https://news.ycombinator.com/item?id=18869755](https://news.ycombinator.com/item?id=18869755)

[2]
[https://news.ycombinator.com/item?id=18871854](https://news.ycombinator.com/item?id=18871854)

~~~
rubiquity
Disclaimer: Former Erlanger, currently at AWS, but these are disjoint
properties.

I’ve never worked on SimpleDB or even looked at it, but I think one lesson
several database creators learned is that Erlang isn’t high-throughput enough
for certain classes of database systems, especially given the last decade’s
trend of making potential adoption of a given database as wide as possible
and how important benchmarks are for marketing. I still think Erlang is
wonderful for writing control-plane type software, given the culture around
using FSM-like structures.

~~~
ignoramous
> _Erlang isn’t high-throughput enough for certain classes of database
> systems._

Interestingly, one of the early SimpleDB engineers, James Larson [0], later
built _Megastore_ [1] at Google, which seems a lot like a retake of SimpleDB
and a predecessor to Google F1 [2]. I wonder if Megastore uses Erlang.

[0]
[https://www.linkedin.com/in/jilarson](https://www.linkedin.com/in/jilarson)

[1]
[https://research.google/pubs/pub36971/](https://research.google/pubs/pub36971/)

[2]
[https://research.google/pubs/pub41344/](https://research.google/pubs/pub41344/)

------
ram_rar
Similar to crypto libraries, one should seriously rethink building a
distributed system from scratch, unless that's the only goal.

I have worked on this topic in academia and industry for many years, and I
can safely say there is a huge disconnect between real-world tools and
academia. The ROI for model checkers is too low unless you're building
mission-critical systems.

For a CRUD-like application, simple things like the following seem to iron
out most of the issues:

1. Distributed tracing

2. Log analysis

3. Basic concepts like exponential backoff with retries (see the sketch after
this list)
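A minimal Python sketch of point 3, assuming the operation is a zero-argument
callable that raises on transient failure (the function name and parameters
are illustrative, not from the comment):

    import random
    import time

    def call_with_retries(op, max_attempts=5, base_delay=0.1, max_delay=5.0):
        """Retry `op` with exponential backoff and full jitter."""
        for attempt in range(max_attempts):
            try:
                return op()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # out of attempts: propagate the last failure
                # Double the backoff each attempt, cap it, and add jitter
                # so that retrying clients do not synchronize.
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))

A caller would wrap any flaky remote call, e.g.
call_with_retries(lambda: fetch(url)).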

Lately, we started using chaos testing methods, where we have dedicated build
pipelines with fault injection and see if the system can function as
expected. That has helped us uncover and fix many distributed-system bugs. In
terms of ROI and learning curve, chaos testing seems a much better approach
than using model checkers and other sophisticated methods.
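One possible shape of such in-process fault injection, as a Python sketch
(all names hypothetical; real chaos tooling often injects faults at the
network or infrastructure layer instead):

    import os
    import random

    # Zero in normal builds; the chaos pipeline sets it to, say, 0.01.
    FAULT_RATE = float(os.environ.get("FAULT_RATE", "0"))

    def maybe_inject_fault(site):
        """Randomly raise at a named call site to simulate a transient failure."""
        if random.random() < FAULT_RATE:
            raise ConnectionError(f"injected fault at {site}")

    def fetch_user(user_id):
        maybe_inject_fault("fetch_user")  # exercised only under chaos builds
        ...  # the real RPC to the user service would go here

The pipeline then asserts that end-to-end behavior (retries, failover,
timeouts) still meets expectations with a nonzero fault rate.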

------
e79
This dismisses theorem proving as too difficult to use for existing systems.
My experience with old, complex systems is that they’re often old, complex
_and_ not very well understood anymore. Theorem proving is all about producing
a specification and asking software to help you certify its correctness. I get
that it’s often very hard in practice, but wouldn’t this process then be
incredibly beneficial for existing systems? The endeavor would likely
motivate the
creation of an up-to-date spec and prove or disprove its correctness all at
the same time. Whether the existing implementation matches the spec is another
challenge, but this could at least uncover serious design flaws in high
assurance software.

------
cjlovett
I agree with the paper; it has a great description of the problems with
debugging distributed systems. But I think it is missing one type of
solution, which we call "systematic testing". See
[https://microsoft.github.io/coyote/](https://microsoft.github.io/coyote/).
This is a testing tool that integrates into your development process and has
been used successfully by several large, very complex distributed systems
inside of Azure.

------
winrid
Logs and request IDs are so important. One trick I found is to load a page
with some ?devondebug query param, then search the logs. Then look at those
logs to get request IDs to search with again to get the whole picture.
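A minimal Python sketch of attaching a request ID to every log line (the
handler name, header key, and query-param handling are illustrative):

    import logging
    import uuid

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s req=%(request_id)s %(message)s")
    log = logging.getLogger("app")

    def handle_request(params):
        # Reuse the caller's request ID if present, otherwise mint one,
        # and attach it so one grep reconstructs the whole request.
        rid = params.get("x-request-id") or uuid.uuid4().hex
        log.info("handling request", extra={"request_id": rid})
        if "devondebug" in params:  # the debug query param from the trick
            log.info("debug mode: logging extra detail",
                     extra={"request_id": rid})
        return rid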

Unless you have Istio. That's amazing.

------
formalsystem
Does anyone know of a good distributed systems resource? It's one of the rare
areas of CS where I have a lot of trouble finding good tutorials and
textbooks. Is it just something you pick up with pure practice?

~~~
Denzel
[https://pdos.csail.mit.edu/6.824/](https://pdos.csail.mit.edu/6.824/) is a
good start.

~~~
enitihas
Does the link have any content? I can't find anything.

~~~
alexeldeib
Click the schedule on the left navbar for lecture links. Lab links are there
too.

~~~
enitihas
Thanks, didn't notice that at first.

------
neutronicus
For systems using MPI there is the TotalView debugger, which is pretty good.

