We have gone through the pain of testing and debugging a distributed system that has been under development for the past 5 years. We investigated several fancy ways of debugging distributed systems, such as model checking and formal verification. In the end, we decided to use (and are more or less happy with) the traditional way. The decision was made mostly because 1) the implementation is too complex (a lot of Java and Scala code) for formal techniques to be feasible; 2) the traditional way can still be very effective when combined with careful design and thorough testing.
Before building the distributed system, we worked on formal verification using program logic and were knowledgeable about a particular sub-field. From our experience, I guess it would require a PhD dissertation to successfully apply formal techniques to debugging our distributed system. The summary of our experience is at https://www.datamonad.com/post/2020-02-19-testing-mr3/.
The reason I ask is that I'm working on a product with 30-ish different pods/teams (maybe about 200 - 250 engineers) working on their respective modules, microservices, etc. From my understanding, having talked to a lot of others at conferences, our distributed system is fairly small (in terms of functional modules/teams, transactions, etc.).
Anyway, even at our small-ish scale, I couldn't imagine running our platform as a single app that we were able to scale up with better hardware.
Also, I think how a company supports multi-tenancy would play a big role in deciding how this works, too. You can have a scenario with a monolith and a DB, but partitioned into per-tenant databases, app servers, etc., and you still have a huge pile of hardware (real or virtual) you're dancing around in...
Lots of 250+ engineer teams out there working on monoliths.
AFAIK, writing "Used a large machine to solve customer problems quickly and efficiently" is not really taken well by a lot of people. The majority of companies would be better served by scaling up rather than out, but out is the new normal for various reasons.
Where do you get your numbers from?
When the latter was possible it saved time, but it was always breaking under me, and I eventually said fuck it.
Not at all. I spent quite a lot of effort debugging distributed deadlocks in a highly available system that was the best-selling product in its category at one of the most famous (and loved) software companies in SV, and the number of things that could (and will) go wrong is infinite, given that every piece of infrastructure has its difficult-to-find/reproduce bugs. Think of sockets that stop responding after a few weeks due to an OS bug, messing up your distributed log; unexpected wake-up sequences during a node crash and failover because some part of the network had its own issues, leading to a momentary split brain that has to be resolved in real time; or a brief out-of-memory condition that causes a single packet loss, messing up your protocol; etc. We used sophisticated distributed tests, and even that was absolutely inadequate. You are still in a naive and optimistic mode, though perhaps you as a designer will be shielded from the real-life issues your poor developers will experience, as the original architects usually move on to newer, cooler things and leave the fallout to somebody else.
In our case, we use a single process to run everything, in conjunction with strong invariants written into the code. When an invariant is violated, we analyze the log to figure out the sequence of events that triggered the violation. As the execution is multi-threaded, the analysis can take a while; if it were single-threaded, it would be much easier. In practice, the analysis is usually quick because the log tells us what is going on in each thread.
So, I guess there is a trade-off between the overhead of building the test framework and the extra effort in log analysis.
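To make the invariant idea concrete, here is a minimal sketch (in Erlang, with a made-up state and invariant; the point is only the pattern of check, log context, then crash loudly):

    %% Sketch: check a (hypothetical) invariant at a critical point.
    %% On violation, log enough context -- including the current
    %% process -- to reconstruct the sequence of events afterwards.
    check_invariant(State) ->
        Committed = maps:get(committed, State),
        Allocated = maps:get(allocated, State),
        case Committed =< Allocated of
            true ->
                ok;
            false ->
                logger:error("invariant violated: committed=~p > allocated=~p in ~p",
                             [Committed, Allocated, self()]),
                error({invariant_violation, State})
        end.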
- A GUI that allows one to trace all state changes down to the single bit level.
- Assertions based on linear temporal logic to verify invariants (a toy sketch of the idea follows this list).
- Conventions around how to model transactions at various levels of abstraction, as well as how to monitor and extract these transactions from the "wire protocol".
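For flavor, here is a toy sketch of what a temporal assertion can reduce to in practice (an Erlang example of my own; real LTL is defined over infinite executions, while this only checks a finite recorded trace):

    %% Toy bounded-trace check of the temporal property
    %% "every request is eventually acknowledged".
    eventually_acked(Trace) ->
        Requests = [Id || {request, Id} <- Trace],
        Acks     = [Id || {ack, Id} <- Trace],
        lists:all(fun(Id) -> lists:member(Id, Acks) end, Requests).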
At the end of the day, we keep hardware as simple as possible, so it's easier to verify, and all the extra complexity goes into software, so it's easier to iterate. As a consequence, the hardware tooling is far from enough for software debugging. Software has more powerful tools, but those aren't enough either - distributed systems are a hairy problem.
Hardware is all about individual black-box components: take some bits on the left, emit some bits on the right, and chain an arbitrary number of components. It's "easy" to lay out and analyze as a linear flow.
Software is not a linear flow with defined states. How do you represent a syscall or an IO? The minimal state of an application is its entire memory.
Their explanation of tracing shows they have never played with an Erlang system.
As someone who builds distributed systems every day, I would really like our whole industry to play more with production Erlang systems. That would answer a lot of these explorations, instead of us badly reinventing the wheel over and over again.
OTP doesn't give you _state-of-the-art_ distributed consensus algorithms out of the box.
I totally agree with your points; I am just pointing out that they do not seem to know this kind of tooling is even possible.
I’ve never worked on SimpleDB or even looked at it, but I think one lesson several database creators learned is that Erlang isn't high-throughput enough for certain classes of database systems, especially given the last decade's trend of making potential adoption of a given database as wide as possible and how important benchmarks are for marketing. I still think Erlang is wonderful for writing control-plane-type software, given the culture around using FSM-like structures.
Interestingly, one of the early SimpleDB engineers, James Larson, later built Megastore at Google, which seems a lot like a retake of SimpleDB, and a predecessor to Google F1. I wonder if Megastore uses Erlang.
Correct. Yet people love to think it does.
I mean, there aren't even parameter values in the call stack; people can only guess instead of knowing things for sure.
Of course, there are more important things, like remote debugging in Erlang. Fixing a production issue is like fixing a plane that is malfunctioning and crashing:
For the average language and runtime, the plan is to take a guess, develop a theory, and build a malfunctioning plane according to that theory. If it doesn't show the same issue, repeat the previous steps until you have rebuilt a plane with the exact same issue as the crashing one. Then fix that plane, and apply the fix to the crashing plane.
For Erlang, the process would be: examine the malfunctioning plane to see where the problem actually is, directly patch it in place to make it usable again, and then build a real fix and apply it without restarting anything.
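For readers who haven't seen it, the built-in dbg module and hot code loading are what make this possible on a live node (the module and function names below are made up):

    1> dbg:tracer().                      % start a tracer on the live node
    2> dbg:p(all, c).                     % trace function calls in all processes
    3> dbg:tpl(flight_ctrl, adjust, x).   % show calls to and returns from flight_ctrl:adjust
    %% ...inspect the offending calls, then stop tracing:
    4> dbg:stop_clear().
    %% after compiling a fixed flight_ctrl, hot-load it without restarting:
    5> l(flight_ctrl).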
Tools may not be the most useful things in the world, but they definitely can be game-changers sometimes.
How would you deploy Erlang given a system that has grown out of a legacy C++ codebase and needs to run on a mix of mobile devices, IoT embedded systems (some with less than 64 kB of RAM), PCs, and servers?
Some distributed systems are heterogeneous and probably always will be.
Now, getting back to the main topic: any insights, from Erlang or otherwise, on how to troubleshoot such systems once they are deployed to customers?
Give your operators the ability to explore the system free-form with this kind of tracing, dynamically, on production systems. It helps a lot. And yes, that will need a lot of work in current runtimes.
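As a taste of what that free-form exploration looks like in Erlang today (using Fred Hebert's recon library, whose tracing is rate-limited so it is reasonably safe on a loaded production node; the module and function names are made up):

    %% print at most 10 calls to any arity of billing:charge on the
    %% live node, then automatically stop tracing
    recon_trace:calls({billing, charge, '_'}, 10).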
Until someone bumps an MPI version and I can't fix my God damn build for a week
This is also a good entry point https://medium.com/@mrjoelkemp/jvm-struggles-and-the-beam-4d...
I have worked on this topic in academia and industry for many years, and I can safely say there is a huge disconnect between real-world tools and academia. The ROI for model checkers is too low unless you're building mission-critical systems.
For a CRUD-like application, simple things like
1. Distributed Tracing
2. Log Analysis
3. Basic concepts like exponential backoff with retries (a minimal sketch follows this list)
etc. seem to iron out most of the issues.
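For instance, a retry-with-exponential-backoff helper can be as small as this (an Erlang sketch; the base delay, cap, and jitter policy are arbitrary choices, not anything prescribed above):

    %% Retry Fun up to Retries times, doubling the delay after each failure,
    %% with random jitter to avoid synchronized retry storms.
    retry(Fun, Retries) ->
        retry(Fun, Retries, 100).                           % 100 ms base delay

    retry(_Fun, 0, _Delay) ->
        {error, retries_exhausted};
    retry(Fun, Retries, Delay) ->
        case Fun() of
            {ok, Result} ->
                {ok, Result};
            {error, _Reason} ->
                timer:sleep(Delay + rand:uniform(Delay)),       % sleep with jitter
                retry(Fun, Retries - 1, min(Delay * 2, 10000))  % cap at 10 s
        end.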
Lately, we started using chaos testing methods, where we have dedicated build pipelines with fault injection and see if the system can function as expected. That has helped us uncover and fix many distributed-system bugs. In terms of ROI and learning curve, chaos testing seems a much better approach than model checkers and other sophisticated methods.
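The core of the fault-injection side is tiny; a sketch of the idea (in Erlang, with illustrative probabilities and fault types, not our actual pipeline):

    %% Wrap a call to a dependency so that, with probability P,
    %% it misbehaves: either it crashes or it responds very slowly.
    maybe_inject_fault(Fun, P) ->
        case rand:uniform() < P of
            true ->
                case rand:uniform(2) of
                    1 -> error(injected_fault);         % simulated crash
                    2 -> timer:sleep(5000), Fun()       % simulated slow dependency
                end;
            false ->
                Fun()
        end.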
Unless you have Istio. That's amazing.
For learning Paxos I like this video: https://m.youtube.com/watch?v=JEpsBg0AO6o
Implement it yourself in your language of choice.
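If you want a starting point, here is a toy sketch of just the acceptor role of single-decree Paxos (in Erlang; no persistence, networking, or proposer/learner logic, and the message shapes are my own invention):

    %% Toy single-decree Paxos acceptor as an Erlang process.
    %% State: highest ballot promised, and the last accepted {Ballot, Value}.
    acceptor() ->
        acceptor(-1, none).

    acceptor(Promised, Accepted) ->
        receive
            {prepare, N, From} when N > Promised ->
                From ! {promise, N, Accepted},    % report prior accepted value, if any
                acceptor(N, Accepted);
            {prepare, N, From} ->
                From ! {nack, N, Promised},
                acceptor(Promised, Accepted);
            {accept, N, V, From} when N >= Promised ->
                From ! {accepted, N, V},
                acceptor(N, {N, V});
            {accept, N, _V, From} ->
                From ! {nack, N, Promised},
                acceptor(Promised, Accepted)
        end.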
For TLA+ there's the Practical TLA+ textbook or Leslie Lamport's video lecture series: https://lamport.azurewebsites.net/video/videos.html