
How Amazon Web Services Uses Formal Methods (2015) [pdf] - mindcrime
https://www.cslab.pepperdine.edu/warford/math221/How-Amazon-Web-Services-Uses-Formal-Methods.pdf
======
Twirrim
One of the teams I worked for in AWS used TLA+ on a new back-end feature that
was added. It was one that, if things went wrong, would be extremely bad for
customers. The needs for operating on scale only added to the necessary
complexity. Paranoia was key.

They'd estimated something like an optimistic 4 months for the work.

Two of the more senior developers were assigned the task, and they had to
learn TLA+ and formal methods from scratch. (one dev's main complaint was that
he was forced to use Eclipse, I don't know if integration has now happened
with other IDEs since then).

It took a little bit for them to get up to speed, but they were soon
implementing and catching bugs in the proposed designs, and they had all the
bugs ironed out within a few weeks.

They then found that going from TLA+ to actual Java code was almost a "fill in
the blanks" exercise. The formal model provided almost the exact structure
they needed for their final code. Going from model to code took them very
little time at all. It was in production, running in shadow mode after not
even 2 months, and was fully live not long afterwards.

TLA+ got subsequently used for a number of other features added to the
platform.

There had been general push back among developers that learning TLA+ and doing
formal verification was only going to slow them down, but we found it was
quite to the opposite. TLA+ actually sped up development, with the big bonus
of higher confidence of correctness.

~~~
bordercases
How do you believe using TLA+ compares to the "rich static typing" systems
that pump up the valuations of Scala and Haskell?

~~~
zozbot234
Not very comparable. TLA+ involves writing statements in a custom "logic"
which adds modalities (i.e. monads) to support 'time' (or rather, state
transitions in discrete steps, ala temporal logic) and non-determinism (which
is reused to support refinement) to standard propositional logic.

Nothing like this can be done in a standard programming language (even Scala
or Haskell), and even in a dependently-typed language you would need to write
some support code in order to embed this logic and work with it within the
system. (It wouldn't be _hard_ for sure, since even FOL gives you far more
power than any (multi-)modal logic, but it's a different, perhaps more elegant
take on things than just working directly within TLA).

~~~
bordercases
Thanks for your response.

------
ignoramous
Bryon Cook, a principal scientist at Automated Reasoning Group gives an
overview of various solvers in-use today at AWS:
[https://www.youtube.com/watch?v=UKqVY0SSbus](https://www.youtube.com/watch?v=UKqVY0SSbus)

Here's a presentation by Eric, a distinguished engineer, and Neha, a principal
engineer, on analysing IAM policies (Zelkova?) and Network reachability
(Tiros?):
[https://www.youtube.com/watch?v=x6wsTFnU3eY](https://www.youtube.com/watch?v=x6wsTFnU3eY)

This webpage has many more links to blog articles on the topic:
[https://aws.amazon.com/security/provable-
security/](https://aws.amazon.com/security/provable-security/)

~~~
mjb
Yes, since we published this (I'm one of the authors), the Automated Reasoning
Group have done amazing work applying formal methods to a broad range of
problems at AWS. We're also still using TLA+ in many places for the reasons
the paper describes.

One aspect of TLA+ (and formal specifications in general) that I under
appreciated when we wrote this was how useful they are as a communication
tool. I've found well-commented TLA+ to be an excellent medium for reviewing
and documenting protocols, and having high-bandwidth conversations about them.
Complex protocol interactions can be hard to describe clearly in text, but the
mix of text and TLA+ (or PlusCal) seems to work well.

~~~
amortize_
Does your team hire PhDs with formal verification backgrounds? If so, what is
the interview process like? Do you make them go through the standard leetcode
type interviews?

~~~
exdsq
Side note - did you do a PhD in formal verification? I'd love to hear more
about it? I am _seriously_ considering it myself.

~~~
amortize_
I have not, but like you am considering getting a PhD in this area.

------
lwb
Questions I wondered about that weren't answered until the end of the paper:

Q: Do engineers actually implement AWS systems in TLA+? A: No, TLA+ is called
"exhaustively testable pseudo-code" internally, and is only used for designs.

Q: Which AWS systems use it? A: The first was DynamoDB, followed by S3, and
now there are 8 additional complex systems that use it for designs.

Q: Who uses TLA+ at Amazon? A: All engineers, from entry level to principal.

I thought it was especially interesting that several engineers at AWS were
effectively writing their own model checkers by writing programs that tried to
brute force rare combinations of conditions that would lead to system failure.
Obviously TLA+ does this better and faster so they switched to that.

~~~
mjb
Thanks for the useful summary.

> All engineers, from entry level to principal.

I'd phrase that as "some engineers of all levels". It's certainly not true
that all engineers at AWS use TLA+ :)

~~~
lwb
Yes, thanks for the correction!

------
exdsq
I’ve started using TLA+ in my spare time. There are several good sites and
books to learn more about it:

\- [https://learntla.com/introduction/](https://learntla.com/introduction/)

\- Pracitcal TLA+

It’s much easier to get running with than other formal methods tools like
Isabelle and Coq.

~~~
mjb
Practical TLA+ is indeed a great place to start, and Hillel Wayne's writing
out formal methods is all worth reading.

Lamport's TLA+ book, and "Specifying Systems" are both available online
([https://lamport.azurewebsites.net/tla/book.html](https://lamport.azurewebsites.net/tla/book.html)).
That one's worth a read too. Lamport's approach to teaching TLA+ is different,
and probably more suited to the mathematically-inclined.

If you like spelunking through history, I'd also recommend checking out some
of Lamport's papers on TLA. I particularly like "What Good is Temporal Logic?"
which lays out his early arguments about why temporal logic is useful for
solving some classes of problems. The paper is available here:
[https://www.microsoft.com/en-
us/research/uploads/prod/2016/1...](https://www.microsoft.com/en-
us/research/uploads/prod/2016/12/What-Good-Is-Temporal-Logic.pdf)

------
dang
A thread from 2014, presumably pre-publication:

[https://news.ycombinator.com/item?id=8096185](https://news.ycombinator.com/item?id=8096185)

A bit more:
[https://news.ycombinator.com/item?id=9289890](https://news.ycombinator.com/item?id=9289890)

Related from 2015:
[https://news.ycombinator.com/item?id=9287426](https://news.ycombinator.com/item?id=9287426)

------
_cairn
Does anyone have resources that talk about writing a TLA+ model for an
existing software system as opposed to designing a new system using TLA+ as
the design reference?

~~~
hwayne
Check out Marianne Bellotti's work. She recently gave a talk at Strange Loop
on writing specifications of existing systems:
[https://www.youtube.com/watch?v=oMSmkRGzQ64](https://www.youtube.com/watch?v=oMSmkRGzQ64)

~~~
_cairn
Thank you! I will check this out.

------
agentultra
I introduced TLA+ to help understand underlying components in the OpenStack
ecosystem and at small companies to solve thorny problems. It's well worth
learning as it gives you a very different perspective on the design and
construction of systems.

In addition to the books and sites mentioned here, there's also a video course
with tutorial content produced by the creatore of TLA+:
[https://lamport.azurewebsites.net/video/videos.html](https://lamport.azurewebsites.net/video/videos.html)

~~~
exdsq
Hey! It sounds like you're using TLA+ in a consulting style role within small
companies? How did you manage to get into that?

------
jayd16
How do you go from something like TLA+ to a common language and keep the
correctness? Are assertions the only way?

------
martythemaniak
One thing I'm wondering is whether (or rather, what kind of) bugs arise when
you try to translate a model (described as "exhaustively tested pseudocode"
elsewhere in this thread) into a real system. For example, if there is if-
branch somewhere that gets missed etc?

~~~
lemmster
I am the engineer who translated
[https://github.com/tlaplus/tlaplus/blob/master/tlatools/src/...](https://github.com/tlaplus/tlaplus/blob/master/tlatools/src/tlc2/tool/fp/OpenAddressing.tla)
into Java
([https://github.com/tlaplus/tlaplus/blob/master/tlatools/src/...](https://github.com/tlaplus/tlaplus/blob/master/tlatools/src/tlc2/tool/fp/OffHeapDiskFPSet.java)).
The translation took me about half a day. Later, while running scalability
experiments, a correctness bug intermittently showed up, on which I spent
approximately two weeks to find the root cause. The two weeks included fixing
around a dozen bugs in code that was not specified in TLA+. However, none of
these bugs were correctness bugs and unrelated to the correctness bug.
Eventually, I translated the TLA+ spec into input for Java Pathfinder, which
found the cause of the correctness bug immediately: The Java translation
contained an off-by-one bug in a for loops (TLA+ is one-indexed while Java is
zero-indexed). I am still convinced that a code review would have found the
bug and that JPF was a very heavyweight substitute for a fresh pair of eyes.
In other words, the bugs that get introduced during the translation phase are
shallow and easily corrected.

Note that no other bugs have been reported for the code since.

