Marginally Interesting by Mikio L. Braun

Everyone Is Still Terrible At Creating Software At Scale

I have a hunch that once people saw the economic potential of software, they started looking for ways to “scale it up” and we haven’t stopped searching yet.

There is something peculiar about software that makes it different from other crafts. A long while ago, I read “The Mythical Man-Month” by Frederick Brooks and I think he called it accidental complexity. How is it that we’ve found ways to organize the work around so many other creative disciplines but writing software is still hard?

What do I mean by hard? Things that would take you maybe a couple of days to do in a way that sort of works become projects that span months or years in the enterprise. And by enterprise I mean startups that can have as few as 50 software engineers. But I’ve also seen much, much worse.

A very senior leader at a company once told me that all big companies operate like this and that the big ones “can pull it off.” By “like this” I mean central project management and migration projects that take months to complete with little obvious customer impact. The mental image I get is Napoleon shuffling around his troops on some hills. He needs to have the cavalry up front, but first he has to move that unit across the river, but before that can happen three other units need to be moved around, and so on. Sure, it can be done (like Twitter infamously moved away from Monorail), but it is hardly fun, nor does it look like a good use of everybody’s time and money.

It is interesting that I was thinking of the military at all, but that’s just the way the mind works, right? And we’ve seen many other metaphors for organizing the work to write software “at scale.” There’s the waterfall, a kind of top-down approach that is inspired by engineering projects. Brooks himself suggested a surgeon’s team as a better metaphor. Agile software development doesn’t have an explicit metaphor, apart from it being a bunch of people essentially (“people above processes!”). Don’t ask about SAFe.

I find myself coming back to the film business, because it is also an activity that combines a high degree of creativity with a lot of specialization (does anybody really know what the “best boy grip” does?). They also had a couple more decades to figure it all out, so maybe that’s what’s missing for software development.

Now, what’s so special about software development? Somehow it seems to be an activity that benefits a lot from being done in one mind (as exemplified by this comic). Somehow, the code in front of you is just the tip of the iceberg of a lot of mental representation of what is happening, and it seems that even given the best intentions, you sometimes need to switch it all up and rewrite the whole thing (don’t do that unless you must, of course, and if you must, at least do it quickly).

As soon as you start to involve more than one person, you can either try to have them all work on the same mental representation (like in pair programming), or you need to introduce some restrictions on the area of responsibility, so that everyone can mind their own business, or at least work more independently.

Another book I read a long while ago is called “Notes on the Synthesis of Form” by Christopher Alexander, and if I remember correctly, one of the main insights was that whenever you need to design a system against many constraints, things get exponentially easier if you take problems one piece at a time (I’m paraphrasing of course). So that’s a good idea, right?!

The problem is that by breaking down problems into parts, you severely restrict the solution space. If you do it wrong, the right solution depends on two different parts doing something that may not make sense when each is looked at in isolation.

For some types of programs (e.g. web services) we have found ways to structure them to make work easier, but often when you’re writing that one software system that will power your startup and move you into DECACORN domain, you might only figure out halfway through what the best approach is, and if you don’t have the guts to clean that up, you’ll in fact end up with a division of (mental) labor that does not work without oversight, and that means software architects, program managers, and frequent check-ins.

This separation of work happens every time you decide who works on what within a team, but also once you define teams in a company. This will have very real effects on the software you write, also known as Conway’s law. The book “Team Topologies” by Skelton and Pais (recommended to me by Daniel Trümper) takes this insight and applies the “reverse Conway maneuver” by designing the teams the way you want the system to look. But you still need to know what that should look like.

Agile software development proposed a powerful tactic, and that is to write the simplest thing that will work, and then refactor (that means clean up) your code base constantly. There are books (another book I read a long time ago, I realize now, was “Refactoring” by Martin Fowler) that talk about refactorings, but these are mostly smaller changes like moving fields to the classes where they belong (I’m simplifying!).

At this point it has become pretty clear to me that “writing software at scale” is not a question of effort or will. No amount of “move fast and break things” alone will help you achieve that (and yeah, I know they changed it to “move fast with stable infrastructure”). The way we structure software and teams is very real and will constrain what any single engineer or team can do, and how much work they have to put in to do it.

The metaphor I find myself going back to is that of a city, and I first read about that in the book “Connected Company” by Gray and Vander Wal. Cities are remarkable because they are often quite old. They almost seem immortal even if they have been rebuilt and remodeled over the centuries or decades and are constantly changing.

Cities can be thought of as platforms for human activities. They provide basic infrastructure like roads, electricity, buildings, shop space you can rent, and now Internet. Some of these pieces of infrastructure change very slowly (like roads), while others are much more flexible, like the way apartments or shops are used.

What if software were built in the same way? What if the core parts of our business were like streets, and all that newfangled stuff were something we could build on top, experiment with, and tear down if it does not work? I’ve seen a few e-commerce companies from the inside, and while their systems are marvels of technology able to handle thousands of transactions per second, it does not feel like this. Things like the app and the website are very deeply entangled with the rest. Even if you wanted to, you couldn’t create a completely new app or website.

Conversely, if we built our cities the way we build our software, you would need to enter the shop through the special garage, and exit through the roof to walk a wire to get to another custom made building from scrapped containers to do the checkout. And some of the windows are just painted on because they’re an MVP.

We’re trying to make that work with platform teams that are like the electricity companies in the cities that provide commodity services so that not everyone needs to run their own diesel generator in the basement. This is a good first step. But often, the platform teams are trying to cater to too many disjoint “customers” (like a deployment infrastructure for both backend services and data scientists), and we remove incentives to make it right for the customers by making the use of the platform team mandatory.

Cities are also remarkable because there is only limited central control. Shops open up if they see a market and close if they don’t have enough business. In comparison, most companies are run in a much more top-down fashion, which leads to wrong decisions, or at least not the right ones, being made.

At least, building a new project within a company should be easier than starting from scratch, but my hunch is that many companies fail that test.

As you might have realized by now, I only have pointers, no answers. We’re still not there. Maybe in 50 years we will have found the right roles, the right ways to structure teams, and the right approach to architecting software “cities” that are fun to live in.

Till then, my recommendation is to look at structures and ask yourself: how hard is it for any one “unit” in your “system” to get stuff done? Everything that cuts across areas of responsibility adds complexity. The Team Topologies book suggests favoring teams that are end-to-end, that fully own a problem to be solved, supported by platform teams and teams that manage a very complex piece of technology.

For those teams that need to collaborate in a pipelining fashion, by all means look at the whole system to identify the bottlenecks and focus on improving them. This is the central idea behind the “Theory of Constraints” by Eliyahu Goldratt. You need to limit the work in progress so that it matches your bottleneck; everything else is just busywork.
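To make the bottleneck argument a bit more concrete, here is a toy simulation in Python. It is only a sketch under my own assumptions: the three stages, their daily capacities, and the function names are all made up for illustration and do not come from Goldratt’s book. Work is released into the first stage every day, each stage can finish a fixed number of items per day, and we count what gets finished versus what piles up in queues.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    capacity: int  # items this stage can finish per day (made-up number)


# Hypothetical pipeline; names and capacities are assumptions for illustration.
STAGES = [
    Stage("design", 10),
    Stage("development", 4),
    Stage("review", 8),
]


def bottleneck(stages):
    """The stage with the lowest capacity caps the throughput of the whole pipeline."""
    return min(stages, key=lambda s: s.capacity)


def simulate(stages, release_per_day, days):
    """Release work daily and return (finished items, items still waiting in queues)."""
    queues = [0] * len(stages)
    finished = 0
    for _ in range(days):
        queues[0] += release_per_day
        # Walk the stages back to front so an item advances at most one stage per day.
        for i in reversed(range(len(stages))):
            done = min(queues[i], stages[i].capacity)
            queues[i] -= done
            if i + 1 < len(stages):
                queues[i + 1] += done
            else:
                finished += done
    return finished, sum(queues)


if __name__ == "__main__":
    slowest = bottleneck(STAGES)
    print("bottleneck:", slowest.name, slowest.capacity)
    # Release as much as the first stage can take.
    print("release 10/day:", simulate(STAGES, release_per_day=10, days=20))
    # Release only what the bottleneck can absorb.
    print("release  4/day:", simulate(STAGES, release_per_day=slowest.capacity, days=20))
```

With these made-up numbers, both runs finish 72 items in 20 days, but the first leaves 128 items sitting in queues while the second leaves 8. Pushing more work in than the bottleneck can absorb does not ship anything sooner; it only grows the pile of half-done work.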

Maybe there is even another approach, looking into ways for people to cooperate more effectively. I briefly mentioned pair programming above, and I have to admit I historically believed it cannot be done, or only by a small number of people who know each other very well and have figured out a way to bounce ideas off one another.

Ironically, in my experience, just bouncing ideas off one another is not the way this works well. What you can end up with is people throwing their ideas into the ring just for others to find faults in them. In the worst case, this can turn into a competition over who knows most and who is smartest.

Unless you take care, everyone has a different understanding of the problem, and there is no focus on information gathering and constructive creativity. I’ve recently started to read more about design thinking approaches that use, for example, brainstorming both to collect information and to generate new ideas. I found these things to be surprisingly effective at joint problem solving.

In any case, we definitely haven’t figured out how to write software at scale, but I also would rather not believe this is just how everyone does it.

This post trended on Hacker News in April 2021, and has over 350 comments in the discussion.