Debugging Distributed Systems (acm.org)
201 points by luu 5 months ago | 52 comments

For debugging a distributed system, it may be just okay to use the traditional way consisting of testing, log analysis, and visualizing that everyone is familiar with. Yes, there are advanced techniques such as formal verification and model checking, but depending on the complexity of the target distributed system, it may be practically out of the question or just not worth your time to try to apply these techniques (unless you are in a research lab or supported by FAANG). In other words, there may be nothing inherently inefficient about sticking to the traditional way, because distributed systems are hard to debug by definition and there is (and will be) no panacea.

We have gone through the pain of testing and debugging a distributed system that has been under development for the past 5 years. We investigated several fancy ways of debugging distributed systems, such as model checking and formal verification. In the end, we decided to use (and are more or less happy with) the traditional way. The decision was made mostly because 1) the implementation is too complex (a lot of Java and Scala code) to allow formal techniques; 2) the traditional way can still be very effective when combined with careful design and thorough testing.

Before building the distributed system, we worked on formal verification using program logic and were knowledgeable about a particular sub-field. From our experience, I guess it would require a PhD dissertation to successfully apply formal techniques to debugging our distributed system. The summary of our experience is at https://www.datamonad.com/post/2020-02-19-testing-mr3/.

I wonder if the promise of cheap hardware using distributed systems is offset by the increased complexity and developer time. Stack Overflow scaled up rather than out and I have never seen a problem with their site.

Is S.O. a good example of a complicated/large distributed system? I couldn't find any quick googleable results on how many people work on their site.

The reason I ask is I'm working on a product with 30-ish different pods/teams (maybe about 200 - 250 engineers) working on their respective modules, microservices, etc. My understanding, from talking to a lot of others at conferences, is that our distributed system is fairly small (in terms of functional modules/teams, transactions, etc.).

Anyway, even at our small-ish scale, I couldn't imagine running our platform as a single app that we were able to scale up with better hardware.

Also, I think how a company supports multi-tenancy would play a big role in deciding how this works, too, because you can have a scenario with a monolith and DB, but you have it partitioned by individual tenant DBs, app servers, etc., and you still have a huge pile of hardware (real or virtual) you're dancing around in....

My point is that Stack Overflow seems to have kept things as simple as possible, the opposite of a "complicated distributed system". It seems to be a classical relational-database-backed app with some additions for specific parts where it needed to scale. In the end I guess it is distributed, but it looks like it's based largely around a monolith.


> Anyway, even at our small-ish scale, I couldn't imagine running our platform as a single app that we were able to scale up with better hardware.

Lots of 250+ engineer teams out there working on monoliths.

Distributed systems (and specifically microservices) are oftentimes solutions to organizational problems, not technical ones.

Their developers can't write good stuff on their resume though. How will those poor chaps get another job without writing Kubernetes, NoSQL, distributed database, large scale horizontally scalable systems? /s

AFAIK, writing "Used a large machine to solve customer problems quickly and efficiently" is not really taken well by a lot of people. Most companies would be better off scaling up than out, but out is the new normal for various reasons.

They're using NoSQL and other things. It's mostly Microsoft C# stack though.

What NoSQL? According to their blog they use SQL server.

That’s the problem. Doing the reasonable thing is a career killer.

The notion that stack overflow is small and scaling up is long obsolete. It's running on more than a hundred servers now.

Not according to this:


Where do you get your numbers from?

Yep, and if you look at average CPU load percentage, it's usually in single digits.

When I worked on supercomputer simulations, I always landed on just dumping intermediate results and looking at them over figuring out how to compile something TotalView could look at.

When the latter was possible it saved time but it was always breaking under me and I eventually said fuck it

> it may be just okay to use the traditional way consisting of testing, log analysis, and visualizing that everyone is familiar with.

Not at all. I spent quite a lot of effort on debugging distributed deadlocks in a highly available system that was the best selling product in its category at one of the most famous (and loved) software companies based in SV and the amount of things that could (and will) go wrong is infinite, given every piece of infrastructure has its difficult-to-find/reproduce bugs. Things like sockets stopping responding after a few weeks due to an OS bug, messing up your distributed log, or unexpected sequences of waking up during a node crash and fail-over because some part of the network had its issues, leading to split brain for a few moments that needs to be resolved real-time, a brief out of memory issue that leads to a single packet loss, messing up your protocol etc. We used sophisticated distributed tests and that was absolutely inadequate. You are still in a naive and optimistic mode, though perhaps you as a designer will be shielded from the real-life issues your poor developers would experience as usually original architects move on towards newer cooler things and leave the fallout to somebody else.

Check out FoundationDB's approach:


Using a single thread to simulate everything is cool (as stated in my previous comment on FoundationDB at https://news.ycombinator.com/item?id=22382066). Especially if the overhead of building the test framework is small.

In our case, we use a single process to run everything, in conjunction with strong invariants written into the code. When an invariant is violated, we analyze the log to figure out the sequence of events that triggers the violation. As the execution is multi-threaded, the analysis can take a while. If the execution was single-threaded, the analysis would be much easier. In practice, the analysis is usually quick because the log tells us what is going on in each thread.

So, I guess there is a trade-off between the overhead of building the test framework and the extra effort in log analysis.
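
To make the trade-off a bit more concrete, here is a minimal sketch of single-threaded, seeded simulation combined with an invariant check. It is not FoundationDB's framework and not our test code; `simulate`, the toy two-node protocol, and the delay and drop parameters are all made up for illustration.

```python
import heapq
import random


def simulate(seed, drop_prob=0.1):
    """Run a toy replication protocol on a single thread.

    Node "a" sends the values 1..10 to node "b". All nondeterminism
    (message loss and delivery delay) comes from one seeded RNG, so a
    failing seed replays exactly the same sequence of events.
    """
    rng = random.Random(seed)
    queue = []  # entries: (delivery_time, sequence_number, value)
    log = []
    seq = 0
    for value in range(1, 11):
        if rng.random() < drop_prob:
            log.append(f"drop value={value}")
            continue
        delay = rng.randint(1, 5)  # random delays allow reordering
        heapq.heappush(queue, (value + delay, seq, value))
        seq += 1

    replica = 0  # state of node "b"
    while queue:
        time, _, value = heapq.heappop(queue)
        log.append(f"t={time} deliver value={value}")
        # Invariant written into the code: the replica never goes backwards.
        if value < replica:
            log.append(f"INVARIANT VIOLATED: got {value} after {replica}")
            return False, log
        replica = value
    return True, log


# Search for a seed that breaks the invariant, then replay it at will.
failing = [seed for seed in range(50) if not simulate(seed)[0]]
if failing:
    ok, log = simulate(failing[0])
    print("\n".join(log))
```

Because there is only one thread and one RNG, the log of a failing seed is a single ordered sequence of events, which is the part that makes the analysis easier than in the multi-threaded case.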

This is perhaps one of the few places where software developers could learn something from hardware development: In hardware description languages everything happens in parallel by default. A large chip project has many components that communicate typically fully asynchronously over some network / bus. Tools for debugging these systems are

- A GUI that allows one to trace all state changes down to the single bit level.

- Assertions based on linear temporal logic to verify invariants.

- Conventions around how to model transactions at various levels of abstraction, as well as on how to monitor and extract these transactions from the "wire protocol".

The article talks about all three, but with different names.
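
For the software side, the temporal-logic assertions in the second bullet can be approximated by checking a property over a recorded event trace. A minimal sketch, assuming a hypothetical trace of `(kind, request_id)` tuples; it checks the LTL-style property "every request is eventually followed by a matching response" on a finite trace. Real hardware assertion languages are far richer than this.

```python
def unanswered_requests(trace):
    """Check G(request -> F response) on a finite trace.

    `trace` is a list of (kind, request_id) tuples; the event names
    "request" and "response" are assumptions made for this example.
    Returns the sorted ids of requests with no later response.
    """
    pending = set()
    for kind, request_id in trace:
        if kind == "request":
            pending.add(request_id)
        elif kind == "response":
            pending.discard(request_id)
    return sorted(pending)


trace = [
    ("request", 1),
    ("request", 2),
    ("response", 1),
    ("request", 3),
    ("response", 3),
]
print(unanswered_requests(trace))
```

An empty result means the property holds on that trace; a non-empty one names the requests to look for in the log.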

At the end of the day, we keep hardware as simple as possible, so it's easier to verify, and all the extra complexity goes into software, so it's easier to iterate. As a consequence, the hardware tooling is far from enough for software debugging. Software has more powerful tools, but those aren't enough either - distributed systems are a hairy problem.

Personally I'd love to see an article that could relate those techniques to current cloud technologies (if it's possible?)

I've thought about it and I don't think hardware design can map to software.

Hardware is all about individual blackbox components. Take some bits on the left, give some bits on the right. Chain an infinite amount of components. It's "easy" to layout and analyze as a linear flow.

Software is not a linear flow with defined states. How to represent a syscall or an IO? The minimal state of an application is its entire memory.

Aaaaaaaand ofc there is nothing about the tools that are in the main runtime built around distributed systems.

Their explanation of tracing shows they have never played with an Erlang system.

As someone who builds distributed systems every day, I would really like all of our industry to play more with production Erlang systems. That would answer a lot of these explorations instead of badly reinventing the wheel over and over again.

As a systems Erlang designer who worked for 2 of the biggest Erlang companies in the market, I can assure you that Erlang/OTP is not some magic pixie dust that suddenly makes all the distributed systems problems go away. Playing is one thing, making a reliable, fault-tolerant system is another.

OTP doesn't give you _state of art_ distributed consensus algorithms out of the box.

I meant the debugging tools that OTP ships with. There is nothing in this about tracing messages or functions dynamically, and even less about the kind of introspection you get with it.

I totally agree with your points; I am just pointing out that they do not seem to know that this kind of tooling is even possible.

AWS built simple-db with Erlang [0] but then seemingly haven't been using it much ever since... I wouldn't be surprised if simple-db has since been moved to aurora (like document-db [1], neptune, and timestream) or dynamo-db [2] behind the scenes?

[0] https://en.wikipedia.org/wiki/Amazon_SimpleDB

[1] https://news.ycombinator.com/item?id=18869755

[2] https://news.ycombinator.com/item?id=18871854

Disclaimer: Former Erlanger, currently at AWS but these are disjoint properties.

I’ve never worked on SimpleDB or even looked at it, but I think one lesson several database creators learned is that Erlang isn’t high throughput enough for certain classes of database systems. Especially with the last decade’s trend of making potential adoption of a given database as wide as possible and how important benchmarks are for marketing. I still think Erlang is wonderful for writing control plane type software, given the culture around using FSM-like structures.

> Erlang isn’t high throughput enough for certain classes of database systems.

Interestingly, one of the early simple-db engineers, James Larson [0], later built Megastore [1] at Google, which seems a lot like a retake of simple-db, and a predecessor to Google F1 [2]. I wonder if Megastore uses Erlang.

[0] https://www.linkedin.com/in/jilarson

[1] https://research.google/pubs/pub36971/

[2] https://research.google/pubs/pub41344/

> OTP doesn't give you _state of art_ distributed consensus algorithms out of the box.

Correct. Yet people love to think it does.

From Erlang's perspective, debugging an average distributed application is like being sent to hunt in the forest with primitive tools right after graduating from college.

I mean, there aren't even parameters in the call stack, so people can only guess instead of knowing things for sure.

Of course, there are more important things in Erlang, like remote debugging. Fixing a production issue is like fixing a plane that is malfunctioning and crashing:

For the average language and runtime, the plan is to take a guess, develop a theory, and build a malfunctioning plane according to the theory. If it doesn't have the same issue, repeat the previous steps until you have rebuilt a plane with exactly the same issue as the crashing one. Then fix that plane, and apply the fix to the crashing plane.

For Erlang, the process is: examine the malfunctioning plane to see for sure where the problem is, then patch it directly in place to make it usable again. Then build a real fix and apply it without restarting anything.

Tools may not be the most useful things in the world but they definitely could be game-changers sometimes.

The problem is legacy and how codebases have grown in the past.

How would you deploy erlang given a system that has grown out of a legacy C++ codebase and needs to run on a mix of mobile devices, IoT embedded systems (some with less than 64 kB RAM), PCs and servers?

Some distributed systems are heterogeneous and probably always will be.

Now, getting back to the main topic, any insights from erlang or otherwise how to troubleshoot said systems that are deployed to customers?

I meant the debugging tools that OTP ships with. There is nothing in this about tracing messages or functions dynamically, and even less about the kind of introspection you get with it.

I totally agree with your points; I am just pointing out that they do not seem to know that this kind of tooling is even possible.

Give your operators the ability to explore the system free-form with this kind of tracing, dynamically, on production systems. It helps a lot. And yes, that will need a lot of work in current runtimes.
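
To give a flavor of what turning tracing on and off at runtime means, here is a rough sketch in Python rather than Erlang. It uses the standard `sys.setprofile` hook, which is much cruder than what the BEAM offers (it only covers the current thread and has noticeable overhead); `trace_calls` and `handle` are hypothetical names used only for this example.

```python
import sys
from contextlib import contextmanager


@contextmanager
def trace_calls(function_names, sink):
    """Record calls to the named functions while the block is active.

    Appends (function_name, arguments) tuples to `sink`. Nothing in the
    traced code has to change, and tracing stops when the block exits.
    """
    def profiler(frame, event, arg):
        if event == "call" and frame.f_code.co_name in function_names:
            # At call time, the frame's locals are the call arguments.
            sink.append((frame.f_code.co_name, dict(frame.f_locals)))

    previous = sys.getprofile()
    sys.setprofile(profiler)
    try:
        yield
    finally:
        sys.setprofile(previous)


def handle(x, y=2):
    return x + y


calls = []
with trace_calls({"handle"}, calls):
    handle(1)
handle(5)  # not recorded: tracing is already switched off
print(calls)
```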

No idea about deployment but I would ask couchdb people. https://en.wikipedia.org/wiki/Apache_CouchDB There is an embedded version too

Language/runtime is not the hard part of building distributed systems.

I always find myself holding this view ...

Until someone bumps an MPI version and I can't fix my God damn build for a week

Have you ever used erlang as a distributed system in production?

Any good stuff to read on topic you can recommend?

Hmm I would say try to read Erlang in Anger, test it in a toy app, that is not hard to get. https://www.erlang-in-anger.com/

This is also a good entry point https://medium.com/@mrjoelkemp/jvm-struggles-and-the-beam-4d...

Similar to crypto libraries, one should think twice about building a distributed system from scratch. Unless that's the only goal.

I have worked on this topic in academia and industry for many years, and I can safely say there is a huge disconnect between real-world tools and academia. ROI for model checkers is too low, unless you're building mission-critical systems.

For a CRUD-like application, simple things like

1. Distributed tracing

2. Log analysis

3. Basic concepts like exponential backoff with retries

seem to iron out most of the issues.
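
As an illustration of the third item, here is a minimal sketch of retries with exponential backoff and jitter. The function name, parameters, and defaults are assumptions for this example, not taken from any particular library.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or the attempts run out.

    Between attempts, waits a random time in
    [0, min(max_delay, base_delay * 2**attempt)], so that many clients
    retrying at once do not hit the server in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)


# Usage: an operation that fails twice before succeeding.
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))
```

In real code you would normally retry only on errors known to be transient, and only for operations that are safe to repeat.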

Lately, we started using chaos testing methods, where we have dedicated build pipelines with fault injection and see if the system can function as expected. That has helped us uncover and fix many distributed system bugs. In terms of ROI and learning curve, chaos testing seems a much better approach than using model checkers and other sophisticated methods.
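
A toy version of the idea, as a sketch only: wrap a dependency so that it fails with a configurable probability, then assert that the code under test still produces the expected result. `FlakyTransport` and `deliver_all` are hypothetical names, and real chaos testing injects faults at the infrastructure level (network, disks, processes) rather than in-process.

```python
import random


class FlakyTransport:
    """Wraps a send function and injects failures with a fixed probability."""

    def __init__(self, send, failure_rate, seed):
        self._send = send
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so a failing run can be replayed
        self.injected = 0

    def send(self, message):
        if self._rng.random() < self._failure_rate:
            self.injected += 1
            raise ConnectionError("injected fault")
        return self._send(message)


def deliver_all(transport, messages, max_attempts=20):
    """Code under test: deliver every message, retrying on failure."""
    delivered = []
    for message in messages:
        for _ in range(max_attempts):
            try:
                delivered.append(transport.send(message))
                break
            except ConnectionError:
                continue
        else:
            raise RuntimeError(f"gave up on {message!r}")
    return delivered


# The pipeline step: same expectation as the normal test, but with faults on.
transport = FlakyTransport(lambda m: m.upper(), failure_rate=0.3, seed=42)
assert deliver_all(transport, ["a", "b", "c"]) == ["A", "B", "C"]
print(f"passed with {transport.injected} injected faults")
```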

This dismisses theorem proving as too difficult to use for existing systems. My experience with old, complex systems is that they’re often old, complex and not very well understood anymore. Theorem proving is all about producing a specification and asking software to help you certify its correctness. I get that it’s often very hard in practice, but wouldn’t this process be incredibly beneficial then for existing systems? The endeavor would likely motivate the creation of an up-to-date spec and prove or disprove its correctness all at the same time. Whether the existing implementation matches the spec is another challenge, but this could at least uncover serious design flaws in high assurance software.

I agree with the paper, it has a great description of the problems with debugging distributed systems. But I think it is missing one type of solution which we call "Systematic Testing". See https://microsoft.github.io/coyote/. This is a testing tool that integrates into your development process and has been used successfully by several large very complex distributed systems inside of Azure.

Logs and request ids are so important. One trick I found is to load a page with some ?devondebug query param, then search logs. Then look at those logs to get request ids to search with again to get the whole picture.
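
The two-step search can be sketched like this, assuming a hypothetical log format in which every line carries a `request_id=<id>` field (adjust the regular expression to whatever your logs actually look like):

```python
import re
from collections import defaultdict

# Assumed log format: each line contains "request_id=<id>".
REQUEST_ID = re.compile(r"request_id=(\S+)")


def requests_matching(lines, marker):
    """Return all log lines of every request that has a line containing `marker`."""
    by_request = defaultdict(list)
    for line in lines:
        match = REQUEST_ID.search(line)
        if match:
            by_request[match.group(1)].append(line)
    return {
        request_id: group
        for request_id, group in by_request.items()
        if any(marker in line for line in group)
    }


logs = [
    "web request_id=abc GET /page?devondebug",
    "web request_id=xyz GET /other",
    "db  request_id=abc SELECT took 450ms",
    "web request_id=abc 200 OK",
]
for request_id, group in requests_matching(logs, "devondebug").items():
    print(request_id)
    for line in group:
        print("   ", line)
```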

Unless you have Istio. That's amazing.

Does anyone know of a good distributed systems resource? It's one of the rare areas of CS I have a lot of trouble finding good tutorials and textbooks. Is it just something you pick up with pure practice?

IMO if you develop a good understanding of Paxos then learn TLA+ you'll know everything you need to work on production distributed systems. Learn about CRDTs too, just to see what eventually-consistent systems look like. The real core of distributed systems is less knowing specific algorithms and more knowing what makes the field so hard, what solutions are possible, and the limitations of those solutions.
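
If you want something tiny to start from on the CRDT side, a grow-only counter (G-Counter) is probably the simplest one. A minimal sketch; the class and method names are my own:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge takes the max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments seen from that replica

    def increment(self, amount=1):
        current = self.counts.get(self.replica_id, 0)
        self.counts[self.replica_id] = current + amount

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
b.merge(a)  # merging again changes nothing
print(a.value(), b.value())
```

Both replicas end up with the same value without any coordination, which is the eventually-consistent behavior worth getting a feel for before moving on to consensus.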

For learning Paxos I like this video: https://m.youtube.com/watch?v=JEpsBg0AO6o

Implement it yourself in your language of choice.

For TLA+ there's the Practical TLA+ textbook or Leslie Lamport's video lecture series: https://lamport.azurewebsites.net/video/videos.html

It's pretty amazing how quickly the industry has changed, and the course content has followed along with this. That's a lot of variety of content for 21 lectures. Would love to see a formalized online version of some of this stuff, even just a high-level overview. Most of the stuff I find is pretty specific to a particular cloud-native environment.

I also noticed that the lectures are on YouTube[1]

[1] https://www.youtube.com/watch?v=cQP8WApzIQQ&list=PLrw6a1wE39...

Does the link have any content? I don't find anything.

Click the schedule on the left navbar for lecture links. Lab links are there too.

Thanks, didn't notice that at first.

Check out https://www.cs.rice.edu/~alc/comp520/schedule.html . That set of papers (most of which are not too hard to understand; and most of the interesting discussion is not so theoretical, but rather logical) is an excellent guide to thinking about distributed systems.

For systems using MPI there is the TotalView debugger, which is pretty good
