
Use of Formal Methods at Amazon Web Services (2013) [pdf] - yankcrime
http://research.microsoft.com/en-us/um/people/lamport/tla/formal-methods-amazon.pdf
======
gone35
_We are concerned with two major classes of problems with large distributed
systems; 1) bugs and operator errors that cause a departure from the logical
intent of the system, and 2) surprising 'sustained emergent performance
degradation' of complex systems that inevitably contain feedback loops. We
know how to use formal specification to find the first class of problems.
However, problems in the second category can cripple a system even though no
logic bug is involved.

(...) A common example is when a momentary slowdown in a server (perhaps due
to Java garbage collection) causes timeouts to be breached on clients, which
causes the clients to retry requests, which adds more load to the server,
which causes further slowdown. In such scenarios the system will eventually
make progress; it is not stuck in a logical deadlock, livelock, or other
cycle. But from the customer's perspective it is effectively unavailable due
to sustained unacceptable response times.

(...) TLA+ could be used to specify an upper bound on response time, as a
real-time safety property. However, our systems are built on infrastructure
(disks, operating systems, network) that do not support hard real-time
scheduling or guarantees, so real-time safety properties would not be
realistic. We build soft real-time systems in which very short periods of slow
responses are not considered errors. However, prolonged severe slowdowns are
considered errors. We don't yet know of a feasible way to model a real system
that would enable tools to predict such emergent behavior._
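
The retry feedback loop described in that excerpt can be sketched as a toy
discrete-time simulation (all numbers here are invented for illustration; a
brief capacity drop triggers a retry storm that outlives it):

```python
# Toy discrete-time model of a retry feedback loop: a momentary server
# slowdown breaches client timeouts, retries add load, load sustains the
# backlog. All parameters (capacity, timeout, arrival rate) are invented.

def simulate(ticks, arrivals=80, capacity=100, timeout=3,
             slow_from=10, slow_to=15, slow_capacity=25):
    queue = []              # enqueue tick of each pending request
    depths = []
    for t in range(ticks):
        cap = slow_capacity if slow_from <= t < slow_to else capacity
        queue = queue[cap:]                     # serve the oldest requests
        # clients whose requests are older than the timeout retry them,
        # adding load on top of the normal arrival rate
        retries = sum(1 for enq in queue if t - enq >= timeout)
        queue += [t] * (arrivals + retries)
        depths.append(len(queue))
    return depths

depths = simulate(30)
print(depths[5], depths[29])   # backlog before vs. long after the slowdown
```

No logical deadlock anywhere, yet the backlog keeps growing long after
capacity has recovered - exactly the 'sustained emergent performance
degradation' the authors describe.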

Interesting. TLA+ of course allows for hybrid specifications that incorporate
physical constraints (real-time itself being the prime example), so
constraints based on more complicated 'loss functions' like, say, the
weighted average or total variation of response times _in principle_ ought to
be feasible as well. Leslie Lamport himself laments that such specifications
are still 'of only academic interest' [1, §9.5], so I wonder just how
impractical such specifications would really turn out to be.

[1] [http://research.microsoft.com/en-us/um/people/lamport/tla/bo...](http://research.microsoft.com/en-us/um/people/lamport/tla/book-02-08-08.pdf)
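
As a rough illustration of what such a 'loss function' constraint could look
like when checked over a finite trace of response times (the trace, weights,
and thresholds below are all invented):

```python
# Sketch of 'loss function' style constraints over a response-time trace;
# all numbers are invented. A lone spike passes a smoothed bound that a
# hard per-request real-time bound would fail.

def weighted_average(times, decay=0.9):
    # exponentially weighted average, emphasising recent samples
    weights = [decay ** i for i in range(len(times) - 1, -1, -1)]
    return sum(w * t for w, t in zip(weights, times)) / sum(weights)

def total_variation(times):
    # sum of absolute successive differences: how 'jumpy' latency is
    return sum(abs(b - a) for a, b in zip(times, times[1:]))

trace = [12, 11, 14, 90, 85, 13, 12]        # ms; one brief slowdown
assert weighted_average(trace) < 50         # soft bound on typical latency
print(weighted_average(trace), total_variation(trace))
```

The hard part, as the paper notes, is getting a model checker to reason about
such properties over all behaviours rather than over one recorded trace.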

------
saosebastiao
Really cool stuff. I have a couple of questions:

1) How do you model the myriad fault types that could possibly occur in
distributed systems?

2) Is it possible to partially model a system? For example, in SOA, the
behavior of a service can depend on and be greatly affected by the behavior of
other services. Can each service be modeled in isolation with meaningful
results?

3) Is the run time of model checking manageable? The paper mentioned a few
seconds to find bugs, but how long does it take to exhaustively check a model
without bugs?

~~~
hayfield
I currently have about 3 weeks' experience with a few different formal
verification tools (others will know better), but...

1. While large, the system will have a finite number of desired actions
(Read, Write, Send Data, etc.), with set paths/transitions between them
(think state machines). Each of these things-you-want-to-happen may or may
not have corresponding errors. If there are error-handling actions, these
become further things-you-want-to-happen, and so on. This modelling can be
done at a fairly high level, so you only need to say 'there may be an error,
this is how we may handle it' rather than iterating over every possible
error.
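
A minimal sketch of that idea (the states and transitions are invented, not
from the paper): one high-level operation with a single abstract 'error'
transition and its handler, explored exhaustively as a state graph:

```python
# Illustrative sketch: a write operation modelled with one abstract 'error'
# state instead of one state per concrete fault, plus its handler.
from collections import deque

TRANSITIONS = {
    "idle":     ["writing"],
    "writing":  ["done", "error"],   # the write may fail, for whatever reason
    "error":    ["retrying", "aborted"],
    "retrying": ["writing"],
    "done":     [],
    "aborted":  [],
}

def reachable(start):
    # breadth-first exploration of every state the machine can reach
    seen, todo = {start}, deque([start])
    while todo:
        for nxt in TRANSITIONS[todo.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

# every state, including the error handling, is reachable from 'idle'
print(sorted(reachable("idle")))
```

Real checkers like TLC do essentially this over vastly larger state spaces,
with properties checked at each state.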

2. Each part of a system may be modelled independently, with synchronisation
states (waiting for data from another system, etc.). A model checker can then
generate traces to cover the entire reachable behaviour. This means that
verification may occur at multiple levels simultaneously - you could model a
File System separately from a Network Driver, each of which is separate from
an Application making use of these two things.
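
A toy sketch of such compositional modelling (component names and states are
invented): two independently defined state machines joined at a single
synchronisation point, with the composed state space explored exhaustively:

```python
# Toy sketch: an Application modelled separately from a File System,
# composed only at one synchronisation state. All names are invented.
from collections import deque

APP = {"start": ["waiting"], "waiting": [], "running": ["finished"],
       "finished": []}
FS = {"mounting": ["ready"], "ready": []}

def step(app, fs):
    for a in APP[app]:          # the app moves on its own...
        yield a, fs
    for f in FS[fs]:            # ...and so does the file system...
        yield app, f
    if app == "waiting" and fs == "ready":
        yield "running", fs     # ...except at the synchronisation point

def reachable(start):
    seen, todo = {start}, deque([start])
    while todo:
        for nxt in step(*todo.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

states = reachable(("start", "mounting"))
# the app can never be 'running' before the file system is 'ready'
print(("running", "mounting") in states)   # prints False
```

Each machine stays small and checkable on its own; the composition only pays
for the interactions you actually declare.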

3. Exhaustive formal verification is intractable in general - the reachable
state space typically grows exponentially with the number of components.
Model checkers, however, are clever tools and can act far more intelligently
than brute force. They can do things like looking at which parts of the state
space haven't been covered as thoroughly and covering them better; doing
things akin to MC/DC testing; or focusing on rare events (a good checker will
let you customise the kinds of traces you're interested in). Checkers can
also report what they have covered, along with probabilities of finding
certain types of bug. Standard distributions and the Pareto principle will
often hold when guesstimating how long various levels of confidence will
take.

Another thing to note is that hardware verification is light years ahead of
software verification - there are algorithms that have been used on hardware
for decades that are only just being adopted for software verification. You
could argue that a distributed software system is like a modern CPU with
various independent parts, and if the CPU can be verified then surely your
distributed software system can be too.

------
edwinnathaniel
This is VERY interesting and feels like utilizing the very core of Computer
Science.

I'm definitely done reading blogs/articles on how TDD/BDD/ATDD or
Clojure/Haskell makes one more productive without any scientific backing ;)

------
xxpor
404 :(

~~~
hyperliner
It's working for me.

------
jedberg
This explains why it takes so long for them to release new features in AWS.
Fascinating.

------
coldcode
Odd to see something about Amazon from Microsoft.

~~~
dlgeek
Looks like it's a personal directory for Leslie Lamport
([http://en.wikipedia.org/wiki/Leslie_Lamport](http://en.wikipedia.org/wiki/Leslie_Lamport)),
who is cited several times in the paper. Think of it as a professor hosting,
on his university site, research papers in his field that cite him, rather
than as a business site.

~~~
mcguire
And that use his research. Lamport created TLA+.

