Large software systems are by far the most complicated artifacts ever constructed by humans. Think about that for a second. Submarines, rockets, televisions, and the like are simple by comparison (except where they incorporate large software systems, of course). As software developers, we generally repress that knowledge on a day-to-day basis so we can go about our jobs without dissolving into quivering wrecks. In this post, however, I’d like to examine the results of doing the opposite. Let’s see what happens if we confront this reality head-on.


The Confident Engineer

I think a typical engineer enters the field feeling pretty confident. Each of the programming tasks she has taken on prior to her first real job has either been successful or failed because of some clear mistake recognizable in hindsight. If you combine that with a history of being the smartest person in the room for most of her life, you can understand how this happens. Based on past successes, she will likely even take pride in writing “clever” code, disregarding Brian Kernighan’s excellent observation:

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

Even if she ends up on a string of failed projects, each one can usually be hand-waved away as the result of some error that seems obvious in retrospect.

In my experience, it takes a long time to see past this. Until you’ve worked on a variety of successful and unsuccessful projects, there isn’t enough data on which to base any kind of understanding as to what might lead to either outcome. It can take years or even decades to develop sufficient perspective. The problem is made worse by the fact that any reasonably well-designed large system will be constructed in a modular fashion. The small subsystem assigned to a hypothetical young engineer will look a lot like a small standalone program that she has had success with in the past. To a degree, this will always be the case since humans can only process a certain amount of complexity and any amount beyond what she can handle will typically fade into the background when she focuses on a particular task.

To summarize, for a significant portion of an engineer’s career, her mindset is typically something like: “Writing software is challenging but well within my abilities; however, it is necessary to be careful.”

This is patently false. Writing high quality, large-scale software is not “well within” anyone’s abilities. If it weren’t for the extreme malleability of software, few large systems would ever succeed. This is even more true for any kind of concurrent system. The anecdote in the next section should make this point clearer.


Everyone Gets it Wrong

What follows is an extended quote from a paper called The Problem With Threads by Edward A. Lee from U.C. Berkeley. I believe it speaks for itself:

In the early part of the year 2000, my group began developing the kernel of Ptolemy II [20], a modeling environment supporting concurrent models of computation. An early objective was to permit modification of concurrent programs via a graphical user interface while those concurrent programs were executing. The challenge was to ensure that no thread could ever see an inconsistent view of the program structure. The strategy was to use Java threads with monitors.


A part of the Ptolemy Project experiment was to see whether effective software engineering practices could be developed for an academic research setting. We developed a process that included a code maturity rating system (with four levels, red, yellow, green, and blue), design reviews, code reviews, nightly builds, regression tests, and automated code coverage metrics [43]. The portion of the kernel that ensured a consistent view of the program structure was written in early 2000, design reviewed to yellow, and code reviewed to green. The reviewers included concurrency experts, not just inexperienced graduate students (Christopher Hylands (now Brooks), Bart Kienhuis, John Reekie, and myself were all reviewers). We wrote regression tests that achieved 100 percent code coverage. The nightly build and regression tests ran on a two processor SMP machine, which exhibited different thread behavior than the development machines, which all had a single processor. The Ptolemy II system itself began to be widely used, and every use of the system exercised this code. No problems were observed until the code deadlocked on April 26, 2004, four years later.


It is certainly true that our relatively rigorous software engineering practice identified and fixed many concurrency bugs. But the fact that a problem as serious as a deadlock that locked up the system could go undetected for four years despite this practice is alarming. How many more such problems remain? How long do we need [to] test before we can be sure to have discovered all such problems? Regrettably, I have to conclude that testing may never reveal all the problems in nontrivial multithreaded code.

Here we see recognized experts in the field follow every best practice in constructing a concurrent system only to have it fail four years into production. What are the odds that we’re the ones who will get everything right given the typically tighter time constraints on development in industry?
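To make the failure mode concrete, here is a minimal Python sketch of a lock-ordering deadlock. It is not the Ptolemy II code (which was written in Java); the function name `run_pair`, the barrier used as a test hook, and the 0.2 second timeout that stands in for "blocks forever" are all assumptions made for illustration. The same two threads succeed under one schedule and get stuck under another.

```python
import threading


def run_pair(force_interleaving):
    """Run two threads that take the same two locks in opposite order.

    Returns {thread_name: True if that thread obtained both locks}.
    """
    lock_a, lock_b = threading.Lock(), threading.Lock()
    # The barrier is a test hook (an assumption of this sketch): it makes
    # both threads hold their first lock before either asks for its second.
    barrier = threading.Barrier(2) if force_interleaving else None
    outcomes = {}

    def worker(name, first, second):
        with first:
            if barrier is not None:
                barrier.wait()
            # The timeout stands in for "blocks forever" so the demo ends.
            got_second = second.acquire(timeout=0.2)
            if got_second:
                second.release()
            outcomes[name] = got_second

    t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
    t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
    if force_interleaving:
        t1.start()
        t2.start()
        t1.join()
        t2.join()
    else:
        # The "lucky" schedule: the two threads never overlap.
        t1.start()
        t1.join()
        t2.start()
        t2.join()
    return outcomes


lucky = run_pair(force_interleaving=False)
unlucky = run_pair(force_interleaving=True)
print("lucky schedule:", lucky)      # both threads obtain both locks
print("unlucky schedule:", unlucky)  # at least one thread times out
```

Every line of `worker` is executed in the lucky run, so a test built on it would report full coverage and pass. The bug only shows up when the scheduler happens to produce the second interleaving, which is the situation the quote describes.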


Estimates are Notoriously Bad

Anyone who has ever worked on a software project knows that people habitually underestimate the time required for even relatively simple tasks. They continue to do so in spite of prior personal experience — often by wide margins.

Why is this?

In my mind, this is a result of three factors:

  • The belief that developing software is “challenging but well within my abilities” (CBWWMA)
  • The difficulty of predicting the unknown
  • “Imposter syndrome”

If asked to justify an estimate, an engineer can describe the work involved in implementing a solution but she cannot describe the work involved in dealing with the unexpected issues that will arise. Since writing software is something she is ostensibly good at, she is likely to balk at suggesting that “something unforeseen might come up”. She is even less likely to allot extra time for the unexpected if she suffers from imposter syndrome (as many/most of us do). For anyone unfamiliar with the term, it refers to the feeling that one is unqualified or out of one’s depth and, hence, an “imposter” of sorts. Perhaps surprisingly, I’ve never found an engineer who didn’t admit to feeling this way on occasion — even those who always seem from the outside to be incredibly confident and competent. When you feel this way, you don’t want to expose yourself as a fraud so you keep your doubts to yourself. Unfortunately, the other members of your team are likely doing the same.

The result is that an engineer almost always estimates software as if the development effort will follow a “happy path” that aligns closely with the relatively vague conception of the process she holds in her head. The fact that development occasionally does follow a happy path reinforces the behavior. Instead of seeing the successes as lucky breaks, she sees the failures as bad luck.


State is the Biggest Culprit

Though there are a vast number of ways in which software development can go wrong, the area that produces the greatest difficulties is the management of state. Ben Moseley and Peter Marks give a lucid description of the issue in their excellent paper, Out of the Tar Pit. I heartily recommend reading it. I’ll just mention a few of the key points they discuss:

  • The number of possible states of a typical object-oriented software application is astronomically large.
  • In general, the behavior of a system in one state tells you very little about its behavior in a different state.
  • As a practical matter, tests can only be performed in a small subset of the possible system states.
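A little arithmetic shows how quickly the numbers get out of hand. The sketch below is my own illustration, not something taken from the paper: it assumes a system whose state is nothing more than a set of independent fields, and the helper name `state_count` is hypothetical.

```python
import math


def state_count(field_sizes):
    """Number of distinct states for independent fields.

    field_sizes[i] is how many values field i can take.
    """
    return math.prod(field_sizes)


# Assumption for illustration: an application with just 64 boolean flags.
flags_only = state_count([2] * 64)
print(flags_only)  # 18446744073709551616, i.e. 2**64

# A suite of one million tests, each visiting a different state,
# would still cover only a tiny fraction of them (roughly 5e-14).
print(1_000_000 / flags_only)
```

Real applications hold far more than 64 bits of mutable state, so the fraction that tests can visit is smaller still.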

Sounds bad, doesn’t it? Minimizing and isolating state in a software system are the two biggest contributors to software quality and reliability, particularly in the presence of concurrency. The extreme case of minimizing state occurs in a functional program. A purely functional program can have its parts tested in isolation and each one will continue to work correctly regardless of the context in which it’s deployed.
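As a small illustration of what "tested in isolation" buys you, here is a sketch in Python using an immutable record and a pure function. The `Account` type and the `deposit` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Account:
    """Immutable record: fields cannot be reassigned after construction."""
    owner: str
    balance: int  # in cents


def deposit(account: Account, amount: int) -> Account:
    """Pure function: returns a new Account and leaves the input untouched."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return replace(account, balance=account.balance + amount)


before = Account("ada", 100)
after = deposit(before, 50)
print(before.balance, after.balance)  # 100 150
```

Because `deposit` depends only on its arguments, a test that passes once tells you how it behaves everywhere it is called. There is no hidden history that could put it into a state the test never saw.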

Think about this the next time someone dismisses functional programming as “not practical” or “unnecessarily restrictive”. In developing successful software, we need to take advantage of any leverage available to reduce the prevalence and severity of errors. Minimizing state via functional programming and immutable data is a significant lever. That said, please don’t think I’m claiming that functional programming is a silver bullet. It’s not. There are plenty of other ways in which our development can go wrong; however, functional programming can remove whole classes of possible errors. Who wouldn’t want that?

Facing the Problem

So, if CBWWMA is wrong, what is the right perspective? I suggest something along the lines of: “Writing complex software is at or beyond the limits of human capacity. Good software is possible, but not guaranteed. When building any significant system, I will undoubtedly get many things wrong and must be prepared to deal with the results.”

That’s the most important point… I will get many things wrong. Not “might” — “will”. It’s this realization that has led to most of (all of?) the improvements in software development over the last several decades. For example:

  • Assuming we’ll get something wrong in our design leads us to solicit design and code reviews or do pair development.
  • Assuming that specifications (or our understanding of them) will be wrong encourages us to construct good regression tests so that we can make changes without breaking things.
  • Assuming that our understanding of the interfaces between our software and other parts of a system will be wrong motivates continuous integration.
  • Assuming that our estimates will be wrong encourages shorter development cycles so that errors will cost less and be found sooner.
  • Assuming that we will be unable to understand or manage the complexity of a system with significant mutable state leads us to adopt immutable data and functional programming where possible.
  • Assuming that any or all of these things will go wrong encourages us to develop systems with strong introspection and debuggability built in.
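The last point is the easiest one to start on. As a minimal sketch, here is one way to make log verbosity tunable without a code change, using Python's standard `logging` module. The environment variable name `LOG_LEVEL`, the helper `configure_logging`, and the service name are assumptions for illustration.

```python
import logging
import os


def configure_logging(service_name, default_level="INFO"):
    """Return a logger whose level is taken from the LOG_LEVEL variable.

    LOG_LEVEL is a hypothetical name chosen for this sketch. Unknown
    values fall back to INFO instead of failing at startup.
    """
    level_name = os.environ.get("LOG_LEVEL", default_level).upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO

    logger = logging.getLogger(service_name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


log = configure_logging("orders")
log.debug("emitted only when LOG_LEVEL=DEBUG")
log.info("emitted at the default level")
```

When something does go wrong in production, an operator can then raise the verbosity of one service by changing its environment and restarting it, with no new build required.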

Our mindset needs to change from one that treats errors as exceptional conditions to one that emphasizes limiting their scope and maximizing the ability to address them when they occur. Along with the practices mentioned above, two recent related software trends can help with this.


Microservices and Containers

It’s impossible to work in the software field and not be aware of microservice architectures and container technology. Microservices and containers are the two biggest buzzwords in the industry today. Unlike many “hot” technology trends in the past, there are good reasons for their popularity. Let’s see how they can help with the issues we’ve been discussing.

Problems with Monolithic Applications

What can be done when a large, monolithic application encounters a significant error (or even a crash)? The entire functionality of the application may be lost and, once the cause is discovered, correcting it can involve a large-scale retesting effort and a complex deployment process. The deployment may even fail if the runtime environment has changed since the last update. Since we’re assuming there will be errors, this is a less than ideal situation.

Fixing the Problems

What we would like is an application for which:

  • Unrelated subsystems continue to function in the presence of the error.
  • We only need to retest the subsystem containing the error.
  • We can independently deploy a corrected subsystem simply and quickly.

We can get all of these benefits via a microservice architecture using containers. The use of microservices does add complexity (though tools can help) but, done correctly, it can isolate errors within subsystems. These subsystems can be tested and deployed independently, avoiding the loss of the entire application’s functionality. Containers allow us to apply subsystem updates rapidly and consistently across a variety of underlying environments without special case code for different deployments. Additionally, containers often allow developers to test against a complete application rather than just small pieces, allowing integration issues to be discovered and dealt with earlier in the development process.

Together, microservices and containers can assist in discovering errors before deployment and drastically reduce the cost of errors discovered after deployment.
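The isolation property is worth seeing in miniature. The sketch below is only an in-process analogy: real microservices get their isolation from running as separate processes or containers, not from a try/except block. The service names and the `call_isolated` helper are hypothetical.

```python
def call_isolated(services, request):
    """Invoke every service; a failure in one is recorded, not propagated."""
    results = {}
    for name, handler in services.items():
        try:
            results[name] = ("ok", handler(request))
        except Exception as exc:
            # Boundary of the subsystem: contain the error here so that
            # unrelated subsystems still get to run.
            results[name] = ("error", repr(exc))
    return results


def search(request):
    return f"results for {request!r}"


def recommendations(request):
    raise RuntimeError("model file missing")  # simulated bug


def checkout(request):
    return "cart ready"


services = {
    "search": search,
    "recommendations": recommendations,
    "checkout": checkout,
}
for name, outcome in call_isolated(services, "red shoes").items():
    print(name, outcome)
```

The failure in the recommendations subsystem is visible and attributable, while search and checkout keep working. In a monolith without such boundaries, the same exception could take the whole request down with it.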

State of the Art

So, what would an ideal software development model look like? We would construct an application from isolated microservices in which each microservice:

  • Would be deployed in a container.
  • Would be developed in a test-driven fashion with design and code reviews.
  • Would have multiple developers who possess a deep understanding of it (possibly through pair programming).
  • Would go through a complete regression test suite, one that exercises the full integrated application, on each code submission.
  • Would be written in a functional style with immutable data where possible.
  • Would be instrumented for easy debugging and tunable logging.

And we would still get it wrong… but we might have a prayer of fixing it.