(Any images seen here are attributed to the presentation in the video series mentioned above. This blog post is more like watching those videos on fast-forward, and I would definitely encourage you to check out the videos if you would like more clarity.)
Mathematics provides infinitely many ways of expressing the same thing.
Take the number 12. There are infinitely many ways to express it:
expr 1) 6 + 6
expr 2) 3 * 4
expr 3) 141 - 129
expr 4) 4353475 - (4353462 + 1)
All the above expressions evaluate to 12. When the options are infinite, how do you choose how to express something?
The same question applies to writing formal specifications and programming.
The advice from the book is simple:
…, then you can choose the one that you feel makes the specification easiest to understand.
Yes, choose the one that will be the easiest to READ at a future point in time.
Example: someone might find it okay to choose (expr 3) above to express 12, the reasoning being, “come on, it is not that complex!” - especially not as complex as (expr 4). But when others (or the same person) read it at a future point in time, they might wonder, “why didn’t we choose (expr 1) or (expr 2)?”.
I have seen the equivalent of this happening in programming.
My advice for anyone (especially if you are getting started with programming and are in that phase where you get excited about different programming languages and their features) would be:
If there are two ways to express something, choose the one that will be the easiest for a human - not the compiler - to read and understand at a future point in time.
~ ~ ~ ~
Optimize for reads, when writing.
With those good enough reasons, I stumbled upon this awesome GitHub repo which curates various testing strategies for distributed systems. One of the things that stood out for me in that list was “formal methods”, more specifically “TLA+”. It then led me to watch this awesome conference video comparing TLA+ and Jepsen/Maelstrom - the video made me feel excited about both technologies. A quick lesson from the video: TLA+ is apples and Jepsen is oranges - we would ideally want to eat both.
I then decided to learn more about TLA+ since it comes in the earlier stages of the design process. I had previously attempted to learn TLA+ but didn’t succeed - mainly due to a lack of motivation in the middle of the learning process. So, this time, I wanted to be motivated enough before attempting to learn it again and to try using it in my side project or at work. This line of thinking reminded me that AWS had published a paper about TLA+, which I had heard of in the past. So I decided to pick it up and read it.
You can get a copy of it from here.
This paper is an experience report from the engineers who spearheaded the movement of using formal methods to verify the complex distributed systems being built at AWS, such as S3, DynamoDB, etc. At first, they didn’t think of formal methods and were investing in other types of testing. Those tests helped, but there were still edge cases that could cause serious bugs.
They open with the scale that they are dealing with:
As an example of this growth; in 2006 we launched S3, our Simple Storage Service. In the 6 years after launch, S3 grew to store 1 trillion objects [1]. Less than a year later it had grown to 2 trillion objects, and was regularly handling 1.1 million requests per second [2].
Imagine that you were about to design a system for such a high scale and growth - how will you gain confidence about its design and correctness? If you are making any changes to the system at some point, how will you be confident about the effects of your changes?
The first line of defense for gaining that confidence is using formal methods to specify and check your system design. Once we have made sure that the design is correct, we start to implement it and write “tests” which check the correctness of the code (this is the classic software testing that we are used to).
What do most of us do most of the time while designing systems?
… conventional design documents consist of prose, static diagrams, and perhaps pseudo-code in an ad hoc untestable language. Such descriptions are far from precise; they are often ambiguous, or omit critical aspects such as partial failure or the granularity of concurrency (i.e. which constructs are assumed to be atomic).
I have noticed this divergence between reality and the design docs/diagrams in day-to-day engineering. What if, during the process of creating those beautiful diagrams and design docs, we wrote something more detailed - something that helps us down the line when we are trying to alter the system? That something turned out to be TLA+ for AWS.
TLA+ is based on simple discrete math, i.e. basic set theory and predicates, with which all engineers are familiar. A TLA+ specification describes the set of all possible legal behaviors (execution traces) of a system.
TLA+ is intended to make it as easy as possible to show that a system design correctly implements the desired correctness properties, either via conventional mathematical reasoning, or more easily and quickly by using tools such as the TLC model checker [5], a tool which takes a TLA+ specification and exhaustively checks the desired correctness properties across all of the possible execution traces.
TLA+ is accompanied by a second language called PlusCal which is closer to a C-style programming language, but much more expressive as it uses TLA+ for expressions and values. In fact, PlusCal is intended to be a direct replacement for pseudo-code.
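In case you are curious what that looks like, here is a tiny toy PlusCal spec of my own (an hour clock - not an example from the paper). The algorithm lives inside a comment block, and the PlusCal translator turns it into a plain TLA+ specification that TLC can check:

```
---- MODULE HourClock ----
EXTENDS Naturals

(* --algorithm HourClock
variable hr = 1;
begin
  Tick:
    while TRUE do
      hr := (hr % 12) + 1;
    end while;
end algorithm; *)
====
```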
In industry, formal methods have a reputation of requiring a huge amount of training and effort to verify a tiny piece of relatively straightforward code, so the return on investment is only justified in safety-critical domains such as medical systems and avionics. Our experience with TLA+ has shown that perception to be quite wrong.
Excellent, that is exactly what I needed to hear. They also provide this nice table of real-world results:
TLA+ has been helping us shift to a better way of designing systems. Engineers naturally focus on designing the ‘happy case’ for a system
and
Once the design for the happy case is done, the engineer then tries to think of “what might go wrong?”, based on personal experience and that of colleagues and reviewers.
…. Almost always, the engineer stops well short of handling ‘extremely rare’ combinations of events, as there are too many such scenarios to imagine.
and
In contrast, when using formal specification we begin by precisely stating “what needs to go right?”
….
- Safety properties: “what the system is allowed to do”
- Liveness properties: “what the system must eventually do”
After we define those properties, we need to check whether they hold true under the various kinds of things that can happen in the system.
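To make the two kinds of properties concrete, here is a toy TLA+ sketch of my own (not from the paper) - a counter that ticks from 0 to 3, with one safety property and one liveness property:

```
---- MODULE Counter ----
EXTENDS Naturals
VARIABLE x

Init == x = 0
Next == x < 3 /\ x' = x + 1
Spec == Init /\ [][Next]_x /\ WF_x(Next)

\* Safety - "what the system is allowed to do":
\* x never leaves its allowed range.
TypeOK == x \in 0..3

\* Liveness - "what the system must eventually do":
\* x must eventually reach 3.
EventuallyDone == <>(x = 3)
====
```

TLC can check `TypeOK` as an invariant and `EventuallyDone` as a temporal property across all behaviors of `Spec`.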
Next, with the goal of confirming that our design correctly handles all of the dynamic events in the environment, we specify the effects of each of those possible events; e.g. network errors and repairs, disk errors, process crashes and restarts, data center failures and repairs, and actions by human operators.
So there should be a way to model these events in the system too. (The video that I mentioned at the top helped me digest this portion of the paper more easily)
We have found this rigorous “what needs to go right?” approach to be significantly less error prone than the ad hoc “what might go wrong?” approach.
In several cases we have prevented subtle, serious bugs from reaching production. In other cases we have been able to make innovative performance optimizations – e.g. removing or narrowing locks, or weakening constraints on message ordering – which we would not have dared to do without having model checked those changes.
Awesome!
They are interested in two things:
1) bugs and operator errors that cause a departure from the logical intent of the system, and
2) surprising ‘sustained emergent performance degradation’ of complex systems that inevitably contain feedback loops.
(1) is achievable via formal methods but not (2). They give a good example of what (2) would look like and mention that they have other ways to mitigate it.
This and the upcoming sections of the paper are well narrated and I felt like I was watching a documentary movie while reading these sections.
Another option they were considering was Alloy, as they found evidence of its usage:
Zave used a language called Alloy to find serious bugs in the membership protocol of a distributed system called Chord. Chord was designed by a strong group at MIT and is certainly successful; it won a ’10-year test of time’ award at SIGCOMM 2011
But they chose TLA+ over Alloy as it was not as expressive as they needed it to be.
Eventually C.N. stumbled across a language with those properties when he found a TLA+ specification in the appendix of a paper on a canonical algorithm in our problem domain: the Paxos consensus algorithm
The fact that TLA+ was created by the designer of such a widely used algorithm gave us some confidence that TLA+ worked for real-world systems.
Yeah, TLA+ was invented by Leslie Lamport, who has given us some of the coolest research - research that gets used in a lot of the systems we rely on.
T.R. says that, had he known about TLA+ before starting work on DynamoDB, he would have used it from the start. He believes that the investment he made in writing and checking the formal TLA+ specifications was both more reliable, and also less time consuming than the work he put into writing and checking his informal proofs.
Totally love this section. I would use the techniques mentioned here if I were to introduce formal methods and verification to other engineers.
This raised a challenge; how to convey the purpose and benefits of formal methods to an audience of software engineers? Engineers think in terms of debugging rather than ‘verification’, so we called the presentation “Debugging Designs”
and
Continuing that metaphor, we have found that software engineers more readily grasp the concept and practical value of TLA+ if we dub it:
Exhaustively testable pseudo-code
Another thing I saw that I didn’t expect was:
Most recently we discovered that TLA+ is an excellent tool for data modeling, e.g. designing the schema for a relational or ‘No SQL’ database.
Wow, this helped them come up with a better schema!
“How do we know that the executable code correctly implements the verified design?”
We don’t, but
Formal methods help engineers to get the design right, which is a necessary first step toward getting the code right. If the design is broken then the code is almost certainly broken, as mistakes during coding are extremely unlikely to compensate for mistakes in design. Worse, engineers will probably be deceived into believing that the code is ‘correct’ because it appears to correctly implement the (broken) design. Engineers are unlikely to realize that the design is incorrect while they are focusing on coding.
Seems like they published a whole other paper on this topic.
When we found that TLA+ met those requirements, we stopped evaluating methods, as our goal was always practical engineering rather than an exhaustive survey.
I hope you enjoyed this post and got the urge to explore and learn TLA+ - I feel this has the power to change the way we think and reason about our systems. I hope to write up more when I try to use it in real-world situations.
From here, I would like to read this which was one of the references from that paper and try to learn and write TLA+ for something(s).
Formal methods deal with models of systems, not the systems themselves, so the adage applies:
“All models are wrong, some are useful.”
~ ~ ~
oh, and TLA is an acronym for Temporal Logic of Actions
I want to share a particular section which WOWed me. If you know Tamil and have an Amazon Prime Video account, search for “Alex in Wonderland” and go to the 58th minute of the show. For the rest of you, I have typed up the bits that I wanted to share:
Alex says:
Think about this simple instrument
You know the name of this instrument?
It is called the Double Bongos right?
One of the simplest percussion rhythm instruments.
And you can buy this for 700 rupees in Chennai even today.
And will you believe if I say this simple instrument ruled Tamil film music for half a century man?
I am not exaggerating.
For 50 years every other super hit song that came in Tamil film music had only this instrument as the core rhythm instrument.
This sound I’m sure you can all recall…….
(plays the double bongos)
This sound ruled Tamil film music for half a century.
This music director we all adore. He will live forever. He’s living forever.
He made amazing, wonderful, soulful songs.
The melodies will be out of this world.
But the percussion: just bongos and nothing else.
Of course I am talking about the King of Melodies, “M S Viswanathan” (fondly called MSV)
I think MSV wants to tell us one thing very clearly.
Beauty lies in simplicity.
Even on Bongos he wouldn’t complicate.
He would not go into the complex rhythm patterns and all.
Just the 4-beat rhythm for every song ya.
This four beat: one, two, three, four, that’s all.
Whatever may be the situation. Whatever may be the emotion that he has to show. Anything that ever happens in any story, in anybody’s life - MSV has captured anything and everything in this 4-beat rhythm.
one two three four.
(Alex sings some of the super-hit MSV songs by live-playing the four beats on double bongos)
Just WOW. These songs have been heard millions of times by a lot of people. I myself had heard them but never noticed this basic construct. That’s why I thanked Alex at the start of this post. We could not taste the essence of music without people like him.
If you are curious, here is a small playlist of MSV’s songs that Alex performs to demonstrate the double bongos.
Notice how thoughtfully the bongos come in at the start of each song. They continue in harmony throughout every song.
This got me thinking and inspired me. I think the lessons for me (and any of us reading this) are:
We are talking about a legend here. At the core of his compositions lies this touch of simplicity. How beautiful! A little simplicity has a lot of mileage (50 years). MSV retired and didn’t compose songs for movies for over 20 years. But as Alex bets, if he had composed during that time, the magic would still have worked!
I am already a fan of “Simplicity” - the reason I prefer using the Go programming language a lot :D Simplicity is not easy, but trying to get there is well worth it. (Obligatory link to the famous tech talk on this subject here.)
I searched Amazon for the double bongos and it still costs 700 Indian Rupees. That is equal to 8.49 USD.
Crazy, right? MSV was able to produce legendary music with it.
I always advise myself and others not to worry about lacking the money to afford something in order to make progress in an area.
Learning programming? You don’t need that latest expensive MacBook Pro or whatever. All you need is a Raspberry Pi running Linux.
~ ~ ~ ~
oh, and don’t forget the four beats of the double bongos.
one, two, three, four.
I used `context.WithValue` to do it. In retrospect, while reading the Go docs for it, I believe I have gone against every possible rule for using it 😅 Sometimes you have to try things out practically to get a lasting lesson.
This is such a case and I am going to share the lessons that I learned here.
All these lessons come from this single commit - feel free to take a look at it if you are interested.
I have three kinds of packages:

- the `main` package - the starting point of my app
- the `trigger`, `connector`, `scaler` packages - these are called from `main` and accept a context
- the `event` package - initialized in `main` and meant to be used by the packages above

```go
package main
```
Inside the scaler, I would do something like this:

```go
func (s *Scaler) Register(ctx context.Context) error {
	eventBus := ctx.Value("eventBus").(event.Bus)
	eventBus.Subscribe( /* ... */ )
	// ...
	return nil
}
```
This line in `main.go` is what is wrong:

```go
ctx = context.WithValue(ctx, "eventBus", eventBus)
```
While trying to refactor, I accidentally removed that line from `main.go` and ran `go build`. Guess what? The build succeeded without any problem 😱
This is scary because the `eventBus` is at the core of my project. All the packages emit and subscribe to events via it. I would have expected a compiler error when something as obvious as not passing it to these packages was happening.
If we run the passing build, it results in a runtime panic whenever we hit the code path where the value is used. Because we fetch `eventBus := ctx.Value("eventBus").(event.Bus)` at runtime and we missed setting that value via `context.WithValue`, we get back a nil reference. Since that value is used immediately in `eventBus.Subscribe()`, it leads to a runtime panic:
```
panic: interface conversion: interface {} is nil, not event.Bus
```
It is time to visit the Go docs for `context.WithValue`:
WithValue returns a copy of parent in which the value associated with key is val.
Yep, I did want a value associated with my key.
Use context Values only for request-scoped data that transits processes and APIs, not for passing optional parameters to functions.
LOL, I was not even trying to pass an optional parameter, but a mandatory parameter.
The provided key must be comparable and should not be of type string or any other built-in type to avoid collisions between packages using context.
LOL, I was using string type.
Users of WithValue should define their own types for keys.
I did have this idea in mind and wanted to do it as a refactor.
To avoid allocating when assigning to an interface{}, context keys often have concrete type struct{}. Alternatively, exported context key variables’ static type should be a pointer or interface.
Okay, I still don’t fully understand this part because the example in the Go docs seems to use a string type:

```go
type favContextKey string
```

I would have expected it to be something like this, based on that last line from the docs:

```go
type favContextKey struct{}
```
I am guessing `k1` and `k2` will result in memory allocations whereas `s1` and `s2` won’t. Could somebody confirm this for me?
As the docs suggest, it should strictly be used for carrying request-scoped data that ideally lives only during the lifetime of a request.
Example: let us consider an HTTP handler which gets called every time a client makes an HTTP request to us.
```go
func(w http.ResponseWriter, r *http.Request) {
	// ...
}
```
So, here the context is very specific to the handler and lives only throughout the handler’s lifetime. It is used to store a piece of information specific to the request (i.e. the request id) and to pass it to downstream API requests which could make use of it.
Two URLs on the internet helped me in my learning here:
~ ~ ~ ~
I dedicate this to all the people who are faced with the question “should I pass my logger down in my Go context?” in their busy lives. The answer is simple: don’t do it.