
Estimating Effort - An Explicitly Implicit Approach
It is difficult to make predictions, especially about the future.
Sage advice.
So why bother estimating the amount of work needed to complete a product backlog item? After all, since estimates are about the future, the probability is high that they will be wrong. Indeed, they may well be guaranteed to be wrong. It's just that some of the guesses will be more accurate than others, and if they happen to match what the effort turns out to be, they merely look like they were "right."
I've written in the past about estimating the effort needed to complete product backlog items, particularly with respect to story points. I believe working to find a relative gauge of how well teams are estimating work is important. Without one, cognitive biases such as the optimism bias and the planning fallacy can significantly distort a project delivery timeline. However, the phrase "story point" is burdened with a lot of baggage. It has been abused and misused to the point that invoking it often causes more harm than good.
I've been experimenting recently with a different approach to estimating effort. The method I'll describe in this post got a bit of a boost after listening to a recent interview with psychologist and Nobel laureate Daniel Kahneman. In this interview, Kahneman describes an experience he had while serving in the Israeli army some sixty years ago. He was assigned the job of setting up an interview process that would determine how well a recruit would do as a combat soldier. For this process, he selected six traits and instructed the interviewers to ask questions designed to evaluate each trait independently and score it. The interviewers were not happy with this approach. As a compromise, Kahneman instructed the interviewers, when they were finished asking about the six traits, to close their eyes and just jot down a number they felt matched how good a soldier the recruit might be. What he discovered:
When we validated the results of the interview, it was a big improvement on what had gone on before. But the other surprise was that the final intuitive judgments added, it was good. It was as good as the average of the six traits, and not the same. It added information, so actually we ended up with a score that was half determined by the specific ratings, and the intuition got half the weight. That, by the way, stayed in the Israeli army for well over 50 years.
This intuitive evaluation made by the interviewers is similar to what Agile methods ask of development teams when determining a value for "story points." T-shirt sizes, planning poker, dot voting, affinity mapping and many similar techniques are all designed to elicit an intuitive sense of the effort involved. If there is a disagreement between team members, then a dialog follows to understand what the discrepancy is all about. This continues until there is alignment on what the team believes the effort to be. When it works, it works well.
So on to the details of the approach I've been experimenting with. (It doesn't have a name yet.) The result of this approach is a number I call the "effort value." The word "value" refers to the actual elementary mathematical value being derived, much like the answer to the question "What value results from adding 2 and 2?" Answer: 4. The word "value" also suggests an intrinsic worth, something beyond a hard number. My theory is that this will help teams think beyond the mere number and also consider the value they are delivering to stakeholders. The word "point," by contrast, connotes a hard number and lacks any association with intrinsic worth or value.
Changing the words introduces a simple and small shift that nonetheless has a significant impact. With the change, teams are more open to considering a different approach to determining estimates.
So how is the effort value derived?
I begin by having the team define four to six characteristics or attributes that, to them, describe what they mean by "effort." It is important for the team to define these attributes themselves. By doing so, they own the definition, and it becomes much harder for them to dismiss the attributes as "someone else's" and thereby object to their use in deriving an effort value. The attributes can be anything that is meaningful to the team. Examples:
Complexity - Is the work straightforward (e.g. code a bubble sort function) or does it involve interrelated systems (e.g. code a predictive inventory control algorithm)?
Dependencies - How dependent is the product backlog item on other backlog items or other teams?
Familiarity - Is this work very similar to work the team has done in the past, or something quite new? Tasking a coder with documenting a piece of straightforward code may actually be a difficult effort: the coding language they spend most of their day with is familiar, while writing clear sentences that non-technical people can understand is not.
Information - Is the detail in the product backlog item complete? Are the acceptance criteria and definition of done clear?
Technical Debt Risk - Does the PBI require any refactoring of related code? Is any technical debt being incurred with the PBI?
Design Stability - Is there a lot of discovery and exploration needed to complete the PBI?
Confidence for Completing a PBI within the Sprint - This attribute may roll up several of the others.
Tedium - Perhaps the effort involves a lot of repetitive copy and paste that nonetheless requires careful attention to avoid simple mistakes.
The team can define any attribute they wish. However, there are a few criteria to consider:
Keep the list limited to 4-6 attributes. More than that risks turning the derivation of an effort value into a navel-gazing exercise for each product backlog item.
Time cannot be one of the attributes.
The attributes should be reasonable. Assessing a product backlog item's effort value by evaluating its "aura" or the current position of the stars is generally not useful. On the other hand, I've listened to arguments against evaluating estimates in terms of "complexity" as being similarly useless. I see the point of those arguments, but my view is that the attributes must first and foremost be meaningful to the entire team. In the end, it's an educated guess, and arguments about the definition of terms like "complexity" are counterproductive to the overall intent of deriving an effort value.
Each of these attributes is then given a scale, the same scale for each attribute - 1 to 10, 1 to 15 - whatever the team feels is most appropriate. The team then goes through the attributes and scores the product backlog item against each one on that scale. (NB: After nine months of Plan-Do-Check-Adapt, a better approach for scoring the attributes has been determined.) The low end of the scale represents very little impact. If dependency, for example, is one of the attributes, then a 1 might mean that the product backlog item is entirely self-contained. A 10 might represent a case where the product backlog item is dependent on several other product backlog items or perhaps on the output from other teams.
When this is done, ask the team where on the modified Fibonacci scale they think this particular product backlog item's effort value should be. If they're struggling, you can do the math: find the average of all the attribute scores and match that number to the modified Fibonacci scale. If the average is a decimal, for example 3.1, match it to the next highest number on the scale; in this case the value would be 5. Then ask the team if they feel that number is a good representation of the effort value for the product backlog item.
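For the technically inclined, the fallback math is easy to sketch in a few lines of code. This is a minimal sketch, not a formal part of the method: the attribute names, the 1-to-10 scale, and the effort_value function are illustrative assumptions, and the sequence used is the commonly seen modified Fibonacci (planning poker) scale.

```python
# A minimal sketch of the fallback calculation, assuming a team has
# defined five attributes and scored a product backlog item on a
# 1-to-10 scale. All names here are hypothetical.

MODIFIED_FIBONACCI = [1, 2, 3, 5, 8, 13, 20, 40, 100]

def effort_value(scores):
    """Average the attribute scores, then round up to the next
    highest number on the modified Fibonacci scale."""
    average = sum(scores.values()) / len(scores)
    for value in MODIFIED_FIBONACCI:
        if value >= average:
            return value
    return MODIFIED_FIBONACCI[-1]

# Hypothetical scores for one product backlog item.
scores = {
    "complexity": 4,
    "dependencies": 2,
    "familiarity": 5,
    "information": 2,
    "tedium": 3,
}

print(effort_value(scores))  # average 3.2 rounds up to 5
```

Keep in mind that whatever the function returns is only a starting point for the conversation; the team's intuitive answer still gets the final say.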
This may seem like a lot of unnecessary gyrations, but for technical people it's a simple process they can understand. The bonus is a number they can calculate. The number isn't what's important here. What's important is the conversation that happens around the attributes and what the team feels about the number that results from the conversation. This exercise is meant to develop their intuitive muscles for considering multiple aspects and dimensions behind the "effort" needed for them to get the work done.
Use this process enough times and eventually the averaging step can be dropped. Continue using it and eventually scoring the individual attributes can be dropped as well. I don't know if it's a good idea to drop the attributes entirely, since they generate the needed conversation around effort, but it will certainly be valuable to reconsider the list from time to time so as to fine-tune it to match what the team feels is important.
With this approach I'm turning the estimation process on its head (or back on its feet, if Kahneman is right). Rather than seeking the intuitive response first (e.g. t-shirt size) and eliciting details later if there is a mismatch between team members, this method seeks to better prime and develop the team's intuition about the effort value: the team explicitly considers a list of self-selected attributes (or traits) for effort first, and only then adds an intuitive evaluation of the effort.
Closing thoughts from Daniel Kahneman:
Don’t try to form an intuition quickly, which was what we normally do. Focus on the separate points, and then when you have the whole profile, then you can have an intuition and it’s going to be better. Because people form intuitions too quickly, and the rapid intuitions are not particularly good. If you delay intuition until you have more information, it’s going to be better.