Moneyball for Engineers

[Image: Russian mathematician Andrey Markov appears in lieu of Benjamin Franklin on a baseball made of American money]

The history of software development management is littered with failed and discredited efforts to objectively assess individual performance.

There is little consensus in the industry, organizations vary widely in how they do this, subjectivity and politics are rife, and few (if any) engineering leaders are content with the status quo.

Meanwhile, pro sports figured this out. Moneyball solved it for baseball in the early 2000s. Could the same approach be applied to software development?

At this point I should mention that earlier drafts of this article provoked a lot of spirited and insightful discussion within the EPSD team - a group of executive leaders who act as strategic business technology advisors to leading tech firms. The article is much improved accordingly. There are many potential pitfalls in attempting to apply metrics to individual performance in software delivery; key caveats, and credits to my colleagues at EPSD, are at the end.

The Allure of the Pro Sports Team Analogy

Business organizations often make analogies to pro sports teams, and every now and then engineering orgs are drawn to emulating them. There are quite a few business books written in this vein. Needless to say, no-one has a losing team in mind when they do this. Rather, they’re thinking about NBA/NFL/Premier League/etc. work-rates, coaches as leaders, and most likely, self-servingly, the compensation.

Talent management of “elite” or “rockstar” developers was a notable trend in the early 2010s. But the original 10x research has since been discredited, both for using lines of code written or debugged as its primary metric and, more fundamentally, because software delivery at scale is a collaborative team sport. (Right now we’re seeing a resurgence of this phenomenon with key AI talent.)

Not to say that the sports team analogy doesn’t have value, but some sports are more relevant than others. The more collaborative the team sport, the harder it is to disentangle individual performance from that of the team.

Highly collaborative team sports like football (soccer) resisted statistical analysis of individual performance until a key breakthrough in 2011, when Sarah Rudd won StatDNA’s competition by applying Markov chains to the game. (I highly recommend this talk.)

Sarah Rudd’s key insight was that games can be decomposed into states in a Markov chain, each transition of which can be attributed to an individual player. Each can be scored as a positive or negative contribution to the probability of an eventual positive outcome - in football, a goal. This enabled players to be ranked based on their total net contributions over many games - a ranking that was fair regardless of their role or the position they played.
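
In Markov-chain terms, my paraphrase of that attribution scheme (not necessarily the exact formulation from the talk) is that an action moving the game from state s to state s' is worth

    \Delta v = V(s') - V(s), \qquad V(s) = \sum_{f \in F} \Pr(f \mid s)\,\mathrm{score}(f)

where F is the set of final states and V(s) is the probability-weighted value of the eventual outcome from state s. A player’s rating is then the sum of these deltas over all of their actions across many games.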

Sarah Rudd’s model worked because she designed the states and transitions carefully to model the game while limiting complexity. After watching the talk, I wondered - could this be applied to software development? However, if there is research that tries to apply Markov chains in this field, I couldn’t find it.

A key challenge is that engineering effort is divided between strengthening the organization/platform and delivering software. If the project is urgent, 100% should go into delivering. But over time stakeholders want leverage, and that comes from investing in a higher-performing organization and platform - through coaching, innovation, improvement and platform development.

In sports, talent development happens on the practice field, but individual sports ratings are all about performance in competition.

Engineering activities can be separated into two time horizons - tactical and strategic. Tactical delivery is the engineering equivalent of on-field competitive performance, so that’s where these techniques could be applied. It’s no coincidence that individual assessment of tactical software delivery is highly contentious.

At this point, I imagine quite a few software engineers are teeming with objections and despairing at yet another attempt to go down this path. “Architecture and design are more important than coding.” “What about testing, quality and resilience?” “What about infrastructure roles?” “Coding is the least valuable activity, and AI is automating it anyway.” “Haven’t we learned from the mythical man-month, lines-of-code, numbers of PRs, etc.?” “Metrics are always gamed.” “What about ambiguous requirements?” “Unrealistic deadlines.” “Just use DORA.” (Unfortunately DORA resists individual attribution.)

Snakes & Ladders

… But perhaps most pertinently, “You’re going down a rabbit hole - Lattice, 360 and engineering ladders solved this.” These tools rely on 360 surveying to assess individuals against yardsticks of standardized expectations for the role and level. Every tech firm out there has their own engineering ladder, many of them open source. There are hundreds, but those from GitLab, Square, Monzo, and the Financial Times are worth taking a look at.

Unfortunately, surveying people about other people is notoriously susceptible to gaming and influence. After all, this is how democracies choose their leaders - the process is inherently political. Within organizations, stealing credit, reciprocity bias, and feedback fatigue are huge problems. Engineers with effective sponsors and personal branding amplify perceptions of their contributions, while those with their heads down get penalized. HR business partners enumerate a laundry list of biases to account for, prominent among them reputational and recency biases. None of this instills any confidence.

When used for 360 surveys, the yardsticks themselves are also problematic. Engineering ladders align with professional standards, but they are not designed to track business goals that evolve over time. For any professional standard there will be a low bar and a high bar; the actual height of the bar should be determined by risk acceptance and the pursuit of goals. Negotiating trade-offs within an acceptable range is fundamental to winning, and fixed bars do not take this into account.

Organizations factor OKRs into performance reviews, or use individualized OKRs, but these suffer from numerous problems, including being lagging indicators and mis-attributing wins and losses. In a dynamic business environment, OKRs rarely survive intact three months after they are adopted.

Ensuring that surveys don’t reduce to misaligned popularity contests requires exorbitant expenditures of effort to remove subjectivity and corroborate feedback with work data (PRs, tickets, work product, and incident reports). Org-wide calibrations compound these problems: forced distributions wrongly penalize strong teams and boost weak ones. (Many leading tech firms have quietly dropped calibrations in favor of high-trust systems where managers are coached but not second-guessed.)

These serious problems with ladders and 360 surveys are reminiscent of the bad old days in pro sports, when sentiment determined value in the absence of reliable data. The productivity drain and the unreliability are why engineering leaders continue to struggle with accurately assessing value on an individual basis.

“What Is To Be Done?”

Here’s how I’m thinking Sarah Rudd’s method could be applied to assessing tactical software delivery.

It’s worth recalling that the business value of software engineering efforts is working software in live products used by users. No matter what technologies, platforms, infrastructure, research, architecture, or design activities are involved in its creation, the business value is in delivering the roadmap. Software delivery is fundamentally the creation and flow of code into production. It is therefore legitimate to evaluate this flow as a proxy for all the efforts involved.

Let’s park the usual objections for a minute, and think about appropriate states and transitions for software delivery, from the perspective of a stakeholder. The unit of work for software developers is typically an individual work item - a ticket, feature or story. Again, important non-coding activities such as design and testing exhibit themselves in the resulting code, so they don’t need to be explicitly modeled. Code flows into production via pull requests. Focusing on these essentials, there are four fundamental transitions:

  1. PR submission and acceptance
  2. PR rejection (at any stage - code review, merge, CI/CD/CT, etc.)
  3. Fixing a bug introduced by a PR that caused a production incident
  4. Cancellation of the work

Number three is important, for reasons that will quickly become clear.

Which brings us to the states, and these need to be carefully chosen to be small in number, aligned to the business goal of “working software in production,” and clearly distinct. “Working” in this context means incident-free, i.e. not just functional but also reliable and secure.

In football, there are really only two final states - a goal (positive), or loss of possession or end of period (negative).

But in software there are degrees of success and failure. “Deployed on-time, stable in production without incident” is the ideal outcome. But is a single incident and hotfix that bad? It’s certainly not as bad as multiple incidents.

Meanwhile, late delivery is always worse than on-time delivery, assuming the same number of incidents.

Is a single PR rejection in code review that bad? Probably not - but multiple rejections or CI/CD failures are definitely worse than ideal. PR rejections and re-work are a productivity drain - both developers and reviewers have to repeat their work.

At this point, I’m imagining more protest. “Aren’t incident retros supposed to be blameless?” “Bug-fixes are different work than the PR that introduced the bug.”

Indeed, a basic problem with typical software project management is that post-incident remediation is recorded as different work than the introduction of the defect. But I would advocate, perhaps controversially, that root cause analysis and individual attribution are essential. An incident due to a bug or a vulnerability is a test that was missing and a PR that should have been rejected. I’m going to go out on a limb here and say that remediation work should be recorded as a continuation of the original development.

In reality, even the most “blameless culture” tech firms will exit engineers for causing production incidents with bugs or vulnerabilities.

From a management perspective, there are major qualitative differences between 0-1 PR rejections, 2, and 3 or more. Similarly between no incidents, a single incident, and multiple incidents. With that in mind, here are the initial and intermediate states I came up with.

    working - 0-1 PR rejections, no incidents
    working - 0-1 PR rejections, 1 incident
    working - 2 PR rejections, no incidents
    working - 2 PR rejections, 1 incident
    working - 3+ PR rejections, no incidents
    working - 3+ PR rejections, 1 incident
    working - 2+ incidents (supersedes PR rejections)

The final states (absorbing states, in Markov chain parlance) and their scores:

    1 - deployed on-time, no production incidents
    0.25 - deployed late or with 1 production incident
   -1 - deployed late and with 1 production incident
   -2 - deployed with 2+ production incidents (whether late or not)
    0 - abandoned - rolled back or never deployed, abandoned prior to deadline
   -2 - failed - rolled back or never deployed, abandoned after deadline

The “score” here is a measure of how good (or bad) the outcome is. Again, I predict some skepticism here, but I’ve aligned these scores to a typical stakeholder perspective, emphasizing the value of delivering on-time and incident-free.

-2 scores are used for strongly negative outcomes such as giving up on a feature because it couldn’t be delivered reliably or on time, or multiple incidents causing a botched launch and probable reputational damage to the business.

There could also be a +2 score for deploying early without incident, but in my experience that never happens.
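
To make the model concrete, here is a minimal sketch of how the states and scores above might be encoded. The language (Python), the names, and the helper function are my own choices; the thresholds and scores simply transcribe the lists above.

    # Absorbing (final) states and their stakeholder-aligned scores,
    # transcribed from the list above.
    FINAL_SCORES = {
        "deployed on-time, no production incidents": 1.0,
        "deployed late or with 1 production incident": 0.25,
        "deployed late and with 1 production incident": -1.0,
        "deployed with 2+ production incidents": -2.0,  # whether late or not
        "abandoned prior to deadline": 0.0,             # rolled back or never deployed
        "abandoned after deadline (failed)": -2.0,
    }

    def intermediate_state(pr_rejections: int, incidents: int) -> str:
        """Map a work item's running tallies of PR rejections and production
        incidents onto one of the intermediate ("working") states."""
        if incidents >= 2:
            return "working - 2+ incidents"  # supersedes PR rejections
        rejections = "0-1" if pr_rejections <= 1 else ("2" if pr_rejections == 2 else "3+")
        outcome = "no incidents" if incidents == 0 else "1 incident"
        return f"working - {rejections} PR rejections, {outcome}"

The four transitions (PR acceptance, PR rejection, a bug-fix for an incident introduced by a PR, and cancellation) move a work item between these states until it lands in one of the absorbing states.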

How would this be used?

  1. Ingest PRs and tickets and build a Markov chain transition matrix from the data. It is helpful if PRs are linked to tickets via tagged commit logs.
  2. Use Monte Carlo bootstrapping and k-fold validation to validate the model.
  3. For the initial and all intermediate states, compute the probabilities of arriving at the final states; multiply each probability by the corresponding score, establishing a weighted-average score for each intermediate state.
  4. For each engineer, iterate over their transitions (PR acceptances, rejections, post-incident bug-fixes) and total the delta-scores contributed by that engineer to the eventual outcome.
  5. Rank the engineers based on their scores, and focus on the outliers (a rough code sketch of these steps follows below).
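
Here is a rough sketch of steps 1 and 3-5 in Python, using standard absorbing-Markov-chain algebra (and numpy for the linear algebra). The input formats and function names are my own assumptions, not a prescription, and the validation in step 2 is only noted in a comment.

    from collections import defaultdict
    import numpy as np

    def expected_scores(transition_counts, final_scores):
        """Step 3: probability-weighted average final score for each
        non-absorbing state, given observed transition counts (step 1).

        transition_counts: dict (state, next_state) -> count
        final_scores:      dict absorbing_state -> score
        """
        transient = sorted({s for s, _ in transition_counts} |
                           {t for _, t in transition_counts if t not in final_scores})
        finals = sorted(final_scores)
        t_idx = {s: i for i, s in enumerate(transient)}
        f_idx = {f: i for i, f in enumerate(finals)}

        # Row-normalize counts into Q (transient -> transient) and
        # R (transient -> absorbing).
        Q = np.zeros((len(transient), len(transient)))
        R = np.zeros((len(transient), len(finals)))
        row_totals = defaultdict(float)
        for (s, _), n in transition_counts.items():
            row_totals[s] += n
        for (s, t), n in transition_counts.items():
            p = n / row_totals[s]
            if t in f_idx:
                R[t_idx[s], f_idx[t]] += p
            else:
                Q[t_idx[s], t_idx[t]] += p

        # Expected final score of each transient state: V = (I - Q)^-1 (R x scores).
        score_vec = np.array([final_scores[f] for f in finals])
        V = np.linalg.solve(np.eye(len(transient)) - Q, R @ score_vec)
        return {s: V[t_idx[s]] for s in transient}

    def rank_engineers(events, state_values, final_scores):
        """Steps 4-5: sum each engineer's value deltas over their transitions
        (PR acceptances, rejections, post-incident bug-fixes), then rank.

        events: iterable of (engineer, state_before, state_after) tuples.
        """
        def value(state):
            return final_scores.get(state, state_values.get(state, 0.0))

        totals = defaultdict(float)
        for engineer, before, after in events:
            totals[engineer] += value(after) - value(before)
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

    # Step 2 (Monte Carlo bootstrapping and k-fold validation) is omitted here;
    # it would resample work items and re-fit the transition matrix to check
    # that the state values are stable.

In effect, an engineer is credited whenever their transition moves a work item towards a better expected outcome (an accepted PR) and debited whenever it moves the item towards a worse one (a rejection or a post-incident bug-fix), mirroring the football model.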

I think this would be an interesting piece of research in a large organization, and possibly a useful product. I would not advocate using it in isolation for performance review purposes, but it could usefully accompany 360 surveys. Ideally it would remove the need to exhaustively corroborate them.

Problems and Caveats - Work-Rates

This proposal would really measure effectiveness, not productivity. An engineer who did no work at all would score 0 - probably not the worst score. To address this the score could be multiplied by a measure of work-rate, perhaps PRs per month, BUT…

When it comes to assessing work-rate and PR complexity, there is a big problem. Work items are never created equal. Some work is straightforward, while other work is full of ambiguity that has to be navigated by engineers using their knowledge of business context, imagination, and experience. Some PRs are simple code modifications, while others are algorithmically complex. Cyclomatic complexity scoring might help but would not account for ambiguous requirements.
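
For what it’s worth, cyclomatic complexity is at least measurable. A rough sketch, assuming Python codebases and using the radon library, might weight a PR by the change in complexity of the files it touches - though, as noted, this says nothing about requirement ambiguity:

    from radon.complexity import cc_visit

    def complexity_weight(source_before: str, source_after: str) -> int:
        """Crude PR weight for one touched file: absolute change in total
        cyclomatic complexity (sum over all touched files for a whole PR)."""
        def total_cc(source: str) -> int:
            return sum(block.complexity for block in cc_visit(source)) if source else 0
        return abs(total_cc(source_after) - total_cc(source_before))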

Tech Debt Avoidance and Pitfalls

Good engineers forecast requirements and navigate around tech debt pitfalls. Bad engineers don’t have their heads in the game and code traps for the engineers who come later. Forecasting would not be credited to the right engineers, while climbing out of pitfalls would be debited to the wrong ones.

Arbitrary Deadlines

Some product managers impose unrealistic deadlines, and engineers who are subjected to this will be penalized. In a healthy organization, engineers are empowered to negotiate movement along the Pareto frontier of time/scope/resourcing. In fact it can be argued that such negotiation is an essential engineering skill. In any case, scoring delivery would need to account for the health of the environment that the engineers are working in.

If estimates aren’t provided by empowered engineers, deadline misses aren’t a good gauge of performance.

Ambiguity

If engineers have, or can obtain, good business context, then they can generally resolve ambiguity. Being able to do so is an important skill. As long as this is consistent, engineers will be fairly rated. But if the business context is unavailable, ambiguity will severely impair velocity.

An engineering team with a capable product manager will significantly outperform one subjected to poor product management.

Transparency

If this kind of scoring were introduced, it would be very important to do so in a transparent and consultative manner. No-one will trust the fairness of a metric that is adopted or measured in secret.

Another reason to be transparent about such metrics (as long as they are gaming-resistant) is that they could be highly beneficial in improving time management. Low-scoring engineers will wonder why, and examine how they’re allocating their efforts.

The benefits of improved time management and effort allocation could potentially be more significant to the organization than improvements to the objectivity and accuracy of ratings.

Outliers

As in Sarah Rudd’s talk, the value of objectively scoring a population is in identifying the outliers, particularly the undiscovered stars and the underperformers who’ve been evading scrutiny.

The scores aren’t accurate enough for stack ranking in the middle of the distribution to have merit.

No Silver Bullet

“The perfect is the enemy of the good.” Hopefully this will catalyze some improvement in objectivity. The status quo needs it.

Credits

Many thanks to my colleagues Sarah Wells, Melanie Ensign, and Ken Jenkins at EPSD for feedback and discussion on earlier drafts! This article is massively improved because of their insights.