More precious than silver

04 Nov 2019

I’ve been doing a lot of reflecting lately on the last project I shipped - what went well, and what didn’t. A while back I tweeted out some half-baked thoughts, one of which was that while the entire engineering organization beyond my team was using a tremendously powerful toolset, we still got bogged down.

My group (> 30 engineers) just delivered a fairly major set of systems, and lemme tell you no amount of leverage to make change within our components was able to save us from the overall complexity of the problem.
— Reid (@arrdem) November 17, 2018

Why? How? The entire reason I got started caring about software engineering and tooling in the first place was trying to find longer levers with which to move more. With which to take on and ship otherwise intractable projects. Have I just been barking up entirely the wrong tree this whole time?

While not particularly well structured or even cohesive, the thread seems to have struck a chord. In particular, Dmitri came in with this observation -

My team is over 30 devs, but we break into sub-teams of < 5. The reason has nothing to do with technology, but rather communication overhead. More people working directly together means more interactions, emails, meetings, and so on. I've yet to see an effective team over 10.
— Dmitri Sotnikov ⚛ (@yogthos) November 17, 2018

And Tim as well -

I want to say you're wrong, but I'm not sure you are. The fact is, most of the time when a project grows to needing 6+ devs, there's a push to break it up into more services, or the project boggs down. So, microservices that communicate via pre-defined protocols <cont>
— Timothy Baldridge (@timbaldridge) November 17, 2018

And somewhat to my surprise Zach -

That could be (generously) interpreted as something I’ve often argued, which is that everyone should have independent ownership of some part of the codebase, and the trick is to figure out who gets what. Obviously you want overlapping knowledge, but not overlapping authority.
— lambda the proximate (@ztellman) November 17, 2018

The question for me started out as: how do I, as a software engineer, optimize myself and my ability to ship code? Okay, so go out and buy a keyboard, learn a power editor, learn a language and ecosystem with leverage… and you’re a super hacker, right?

Six years heavily invested in Clojure and other things widely regarded as power tools later, well maybe not.

You can improve a single person’s output, but only to a point. There’s only so much sleep you can lose, only so much coffee and pizza can do, before you remember there are only 24 hours in the day and do you really want to be spending all of them with a keyboard? Of course not. Maybe there’s an age bracket for that - but even I’m a bit of a whippersnapper and I’ve found it fleeting at best.

The conclusion is that, well, college me misstated the problem. Surprise.

The question isn’t so much how to maximize single developer throughput as it is how to maximize team throughput. You and I don’t scale. We have fixed bounds - and better things to do I hope.

We however, maybe we scale. And the more of us that can be brought to bear on a problem, the more we can get done. So perhaps a better stated problem is this -

How do we make people more effective as engineers building software systems and using software to solve problems?

This is a big and thorny question which has been and must be attacked from many sides.

Coding?

The first and most obvious - the mistake I made - is to optimize the simple act of coding.

Traditionally the act of coding has been regarded something like the act of sculpting clay. One starts with a formless mass and from that creates the world through grand vision. This is somewhat apt. Most software begins life as an empty set of formless buffers and directories on which the programmer must impose their will. But this is a narrative which leaves no room for false starts, rework and learning. It falls afoul of the narrative fallacy, wherein past events correctly viewed presage their future when perhaps they did not.

In fact Parnas 86 “A Rational Design Process” provides a compelling argument that such a perfect forming of software into the world is impossible. In short - the act of creating software, let alone deploying it into the world, changes the world. No matter how perfectly suited the software was for the world into which it was deployed, the act of authoring is an educational one. We as program authors develop greater understanding of problem spaces when we build software. Furthermore anyone exposed to the initial product will refine their understanding of what their needs are. Both of these changes at least partially invalidate the criteria under which the software was to be evaluated and against which it was designed, demanding re-design and iteration.

This suggests that, while coding may be an essential part of the software development process, coding is more than simply building the right thing. It has exploratory aspects, and maybe that’s more the value add.

OODA?

Those of you perhaps more acquainted with fighter jets may be aware of an (arguably overused) term - the OODA loop.

The OODA loop (Observe, Orient, Decide, Act) is a bit of jargon used to refer to the admittedly obvious structure of any decision making process. It was popularized by Boyd - particularly the idea that you can “get inside” another entity’s OODA loop and “beat them”, that is, be more effective at decision making just by being able to iterate faster.

A pilot who can react faster has more time to respond. A command and control system which can detect a possible “bolt from the blue” strike is better able to take measures to protect itself.

Or to choose a more mundane example, a programmer who is able to rapidly explore their intuitions and explore program behavior will be able to at the least test more theories and will probably learn more and certainly more easily. They will be able to develop confidence in their tools and their application through experimentation. They will be able to build up a metis (craft, cunning, skill) with their tools born of familiarity as opposed to techne (skill, technique) of rote understanding.

This is not a new idea. Our ability to think with tools depends in large part upon our ability to understand our tools as extensions of ourselves through rapid feedback. Read, Eval, Print, Loop (REPL) workflows and ideas about interactive programming exist to try and tighten the feedback loops between programmer and machine. And there’s an entire field of study devoted to characterizing the “acceptable” latency of machine interactions in terms of how quickly a human is able to process that feedback.

For instance learning Jenkins will be hard when it takes an hour to update a job because the JJB builder sits behind a slow CI system and you need reviews to make changes. Learning Python is easier - it sits on your computer and responds relatively quickly to inputs.

Likewise there are a ton of opportunities with blue/green deploys, red/green tests, α/β testing, traffic generation, traffic sampling and event based systems to choose architectures which allow for rapid, safe experimentation and tightening the OODA loop of “pushing to prod”. Assuming we can’t make local development sufficiently prod-like to offer similar feedback wins.

Planning?

Communication and coordination processes like Agile exist to try and optimize the OODA loop at the organizational level. They provide a framework for the communications of requirements gathering, prioritization, scoping and delivery. At the end, they give an objective for delivered value which can be re-evaluated and iterated upon.

The OODA decision is just utilitarian evaluation on a short timescale - evaluation on a long one is too hard.

Unfortunately this leaves Agile open to the usual attacks against incrementalism and utilitarianism. Its limited scope of evaluation blinds it to long-term costs or yields.

This workflow excels where the problems to be solved can be incrementally delivered, and where decisions are not expected to have long-term or irreversible consequences. For instance a web application which simply provides views to some backing data store can easily be delivered incrementally. It can be reworked incrementally, and choices in its design can be cheap to re-visit because the application is merely a client with no carried state that must be managed. Re-planning or pivoting is cheap.

This is clearly not a general property of problems, and leaves out many problems for which integrity constraints, fixed investment in data storage or other relative immovables are facts of life. However, as we saw from Parnas, no matter how good our plans may be we will find fault in them and they must be adjusted. Fully gathering requirements and planning is impossible, as is developing “the right thing”, but neither can we blindly A/B test ourselves to product/market fit. Design and plans are required.

Communication?

Designing systems to minimize state, or even be stateless makes it easy to discover implementations of desired data flows and for engineers to build intuitions about the system. Unfortunately however, no matter how easy it is to build up intuitions and insight, they are personal. Even with exploratory tools we need to be able to communicate insights to other people if we really want to be able to scale out the development process. Furthermore in order to plan development effectively we absolutely have to be able to communicate lest we leave out understood factors.

One solution to this problem is simply to retain smarter people and have mostly single ownership as mentioned by Zach above. This works - until it doesn’t. Single ownership effectively optimizes to maximize context. It mitigates some sources of deadlock by assigning a leader for components of a project, while still enabling elective collaboration. However, because collaboration is elective rather than the norm, it’s easy for owners to become single points of failure who are unable to, say, get off on-call or go on vacation without impacting timelines. This will produce long-run org degradation. You won’t be able to have an on-call rotation. That person won’t be able to go on PTO without undue impact. Even on the healthiest teams people get tired and need change, and when they choose to do the next thing you’ll be bereft of the only person who was a domain expert on the component. A rewrite will be the likely result.

Single ownership however still seems like it could be an efficient thing to do because communication is hard and teaching is slower and harder. Communication costs increase with the square of the number of people involved - this is just an obvious corollary to Metcalfe’s law. So what’s the dynamic space here?
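To put rough numbers on that square, here is a minimal sketch. It assumes the simplest possible model, in which every pair of people on a team is a potential communication channel; the function name `channels` is mine and purely illustrative.

```python
def channels(n: int) -> int:
    """Count the distinct pairs among n people: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Assumed model: every pair of teammates may need to talk to each other.
for team_size in (3, 5, 10, 13, 30):
    print(f"{team_size:>2} people -> {channels(team_size):>3} possible pairs")
```

Under this model a team of 5 has 10 pairs to keep in sync, while a group of 30 has 435. Real teams obviously don’t exercise every pair equally, so treat this as an upper bound on the shape of the curve rather than a measurement.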

We all know Brooks’ The Mythical Man-Month, if only for its famous statement that adding more engineers to a late project will make it later, not speed it up. Brooks points out that adding more heads increases the coordination burden on the team, and that the new heads have to be trained before they become productive! Obvious perhaps, but a mistake I’m sad to say I’ve seen folks make.

Okay so we know that coordination costs increase quadratically with headcount, and there’s a cost to bringing new folks up to speed, which together produce Brooks’ slowdown. But how can we put numbers on the ideal size of org?

Well a straw poll works in a pinch, but the sample size is pretty small.

Follow up - what was the real duration?
— Reid D. M. (@arrdem) October 5, 2019

It turns out that no small amount of effort has been spent over the centuries optimizing the size of a combat unit, and perhaps infantry tactics being an applied exercise in group psychology and communication costs will be somewhat reflective of software dynamics.

The history is fascinating, but you’ll forgive me for not recounting all of it here. Powell ‘18 gives a good treatment of the US’s explorations, as does the previous Hughes monograph on which it is built.

The short version is that the US has experimented with a number of squad configurations consisting of several teams under a single leader - occasionally with a delegate. While extremes of two-person pairs and ten-person units under a leader have been tried, current infantry doctrine appears to be based on two five-person teams with two overall leaders, for a squad headcount of twelve. The fundamental advantage of this organization is that - as the minimum size for either team is three - both teams have resiliency built in, as does the leadership. Furthermore, by being organized into teams, decisions can be devolved and the group overall is flexible. Other configurations such as three teams of three are about as good, but have less resiliency.

It’s interesting to note that this conclusion aligns nicely with the vignettes about software team size from Dmitri and Tim, as well as with my straw poll. This should be unsurprising - we’re trying to measure what should be a rough psychological constant - but it is a check to show we’re in the right ballpark.

Planning again? Or is it still communication?

So if the maximum team size is about five, and the maximum effective organization size under one leader is about 13, doesn’t that limit the scope of our designs? A group of 13 can only do so much. There are only 24 hours in the day, of which we only work 8, 5 days a week, 48±4 weeks a year or so. Furthermore, a group of 13 (really 10 or so engineers) can only have so much skillset coverage.
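As a back-of-the-envelope check on “only so much”, the arithmetic can be sketched as follows. The constants come straight from the paragraph above (8 hours, 5 days, 48±4 weeks, roughly 10 engineers); the function name `person_hours` is just an illustrative choice.

```python
HOURS_PER_DAY = 8
DAYS_PER_WEEK = 5


def person_hours(weeks: int, engineers: int = 1) -> int:
    """Working hours available in a year of `weeks` working weeks."""
    return HOURS_PER_DAY * DAYS_PER_WEEK * weeks * engineers


# 48 +/- 4 working weeks a year, for one engineer and for a group of ~10.
for weeks in (44, 48, 52):
    print(weeks, person_hours(weeks), person_hours(weeks, engineers=10))
```

That works out to somewhere between 1,760 and 2,080 hours per engineer per year, or roughly 19,200 hours for ten engineers at 48 weeks, before any of it is spent on meetings and coordination.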

Perhaps of most importance, how do you coordinate design efforts across groups? Conway’s Law (solution architecture will mirror the structure of the organization) is real, and must be resisted. But if you can’t really get more than five minds on a problem, then at best you can get representatives of five 13 person teams in one place. That puts the effective limit of the scope of an engineering project at 65 minds - partitioned into units no larger than the scope of 5 minds.

Maybe you could build arbitrarily large log₅ trees of engineers, but doing so presupposes that your chosen 1 of 5 representative AT EVERY LEVEL is able to capture ALL the context of their four peers. That is these representatives themselves must be Sufficiently Capable Domain Experts. Fundamentally this is the Scrum of Scrums concept.

If it isn’t safe to assume that context can be losslessly compartmentalized in this manner, then you’re faced with decision making costs which grow at least linearly in the size of the organization. The delegate being unable to capture all context and answer all questions, questions must at least be devolved and we’re again exploring the space between O(log₅ N) and O(N²) going back to Metcalfe.
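To get a feel for how far apart those bounds are, here is a small sketch comparing them. It assumes an idealized tree with a fan-in of exactly five at every level for the best case, and all-pairs communication for the worst case; `tree_depth` and `pairwise` are hypothetical helper names, not anything from the sources cited here.

```python
def tree_depth(n: int, fan_in: int = 5) -> int:
    """Smallest number of levels d such that fan_in ** d >= n."""
    depth, capacity = 0, 1
    while capacity < n:
        capacity *= fan_in
        depth += 1
    return depth


def pairwise(n: int) -> int:
    """Number of distinct pairs among n people, which grows as n squared."""
    return n * (n - 1) // 2


# Columns: headcount, delegate-tree depth, linear cost, all-pairs cost.
for n in (5, 13, 65, 325):
    print(f"{n:>4} {tree_depth(n):>2} {n:>4} {pairwise(n):>6}")
```

For 65 minds the idealized tree is only 3 levels deep, while the all-pairs model has 2,080 possible conversations. Where a real organization lands between those two depends on how much context its delegates can actually carry.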

All of this drives me to the conclusion that, well, some things just take a while. Both architecture and software development take time, and the “hammock driven development” meme is more valid than we may suppose. Some things just take time, and even once a rough design is settled upon, they still take time.

In general, problems don’t parallelize, and designing a problem based decomposition requires context and coordination, which limits scope and is serial. Large scale attempts to produce a decomposition will fall into the Conway Trap. Maybe you can even “Reverse Conway” engineer your organization to create a structure which will drive a more acceptable solution architecture - but you still have to come up with that design and this presumes your organization is politically flexible enough for such an outcome to be practicable.

So either we must do only things achievable with our present tools, or we must steel ourselves for the journey.

I guess this is a disappointing conclusion as it merely affirms that some hard things are hard and cannot be made easier. Maybe knowing saves us futile seeking and worrying; and that is more precious than silver.

Thanks to Matt ‘@arachnocapital2’ for reading a draft of this essay and contributing some managerial feedback.