Systematic management of tech debt
Introduction
Tech debt is a fact of life in software engineering. However, teams typically struggle when it comes to effectively dealing with tech debt. Some view tech debt as something evil that must be avoided at all costs, some ignore tech debt until it can no longer be ignored and some think a periodic focus on tech debt (e.g. tech debt month) will keep it under control.
I argue that a more systematic and continuous approach to managing tech debt that balances borrowing with prioritizing payoff can enable teams to drive short/medium term growth while maintaining long term sustainability.
In order to make my case, I use parallels with the world of finance.
Financial debt is a practical necessity in the modern world that both individuals and businesses frequently leverage to power their growth. For example, individuals use debt to grow their lifestyles (housing loan), grow their careers (education loan), and grow their wealth (small business loan). Similarly, businesses use debt to grow their market presence (hire GTM team), grow their product portfolio (invest in R&D), grow their inventory (stock up), and grow their margins (invest in automation).
However, financial debt can be a double-edged sword. If it is not explicitly tracked (i.e. on the books) and responsibly managed (i.e. regularly serviced), individuals and businesses can take on too much debt and end up in bankruptcy. Responsible management of financial debt requires ongoing, careful attention to factors such as existing debt obligations, cash inflows and outflows, and the cost and amount of any new debt being considered.
Tech debt should be leveraged in a similar manner to financial debt to power the growth of businesses. A less than perfect solution that accrues tech debt by deferring critical work on automation, quality, performance or reliability can be a great source of leverage for a business. It can help a business gain first mover advantage, learn from customers and create a more compelling offering.
However, just like financial debt, ignoring accrued tech debt or continuing to accrue more of it builds significant overhead over time and can bring teams to a grinding halt, where they cannot deliver any new functionality because the cost of keeping up with existing functionality consumes all their cycles. This is the equivalent of bankruptcy in the world of finance.
In the rest of this document I define what I mean by tech debt, talk about the reasons that make it problematic for teams and describe a simple and practical framework for teams to responsibly manage tech debt.
What is Tech Debt?
At its heart, software engineering is about designing, building, testing, operating and supporting systems that meet a given set of requirements based on a set of assumptions. The process of designing, building, testing, operating and supporting these systems can in turn involve other automation artifacts.
Also, for a set of engineers to be able to collaborate and contribute effectively to building, testing, operating and supporting systems and automation, they need knowledge and context about the decisions and choices made, the rationale for these decisions and choices as well as the tools/processes and methods used.
Improper and/or inadequate automation and the lack of easy to find/consume knowledge and context can create overheads and waste in the system — I classify the work needed to remedy these as tech debt.
There are three major drivers for the accumulation of tech debt:
- Implementation compromises are made: to power business growth goals, teams may need to make some compromises on implementation due to lack of sufficient bandwidth or calendar time. These compromises can take many different forms — not taking the time to properly comment and document designs or code, taking on a lot more external dependencies than strictly necessary, not testing code sufficiently, not setting up proper monitoring/alerting, etc. These compromises can result in a higher frequency of problems, greater maintenance time & effort, higher cost to onboard new team members, etc. The work needed to remedy these implementation compromises is tech debt.
- Requirements & assumptions drift over time: over the lifetime of a software artifact, the requirements and assumptions it was based on will almost certainly drift. For example, an assumption may have been made early on that a manual method is acceptable for tasks such as building, testing, backup or deploying because these tasks are performed infrequently and therefore do not warrant the cost of automation. Similarly, a specific design or implementation method may have been chosen because of the set of requirements known at that time. However, these types of decisions often do not get revisited as the artifact matures, even when the underlying assumptions and requirements are no longer valid. As a consequence, the lack of automation or a mismatched design/implementation can end up imposing significant overheads. The work needed to fill automation gaps or remedy design/implementation issues is tech debt.
- Maintenance of knowledge/artifacts is ignored: all software and the knowledge related to it need ongoing maintenance, but more often than not this essential activity is either starved or inadequately resourced due to competing priorities, shifts in focus or staff turnover. Critical bugs or vulnerabilities go unfixed for long periods of time, tests are not kept updated, documentation falls out of sync with implementation, deprecation or EOL notices for dependencies are ignored, releases do not happen frequently, production best practices are not followed, etc. Ignoring maintenance results in either frequent and expensive fires or death by a thousand cuts. The work needed to update automation and/or knowledge in the face of ongoing changes is tech debt.
Why is Tech Debt problematic?
There are three key reasons why teams typically struggle with managing tech debt:
- Lacking visibility: teams don’t maintain a current and consolidated inventory of all their tech debt. As a result they don’t know how much aggregate tech debt they are carrying and how much “principal” and “interest” they owe on each item of tech debt they carry. Imagine suddenly discovering you owed money on a vacation home you’ve not been to in a decade, or a boat you did not know you owned or a college loan you had forgotten about.
- Adding unconsciously: teams are always launching new artifacts and/or modifying existing artifacts — these actions have implications for tech debt (such as addition or reduction/retirement) but are never consciously discussed or considered. As a result, teams keep borrowing more tech debt over time without even realizing they are doing so. Imagine going on a really indulgent and luxurious vacation using your credit card without worrying about the college loan or the mortgage payment that needs to be made every month.
- Repaying randomly: when teams do decide to repay tech debt, it is more often than not done in a somewhat haphazard manner — i.e. they don’t consider all the right trade-offs to ensure they are paying off debt that will yield the greatest benefit. Imagine paying off your super low interest home loan before paying off your really high interest credit card balance.
Taking a systematic approach
Framework
I build on the financial debt analogy introduced earlier in the document and propose a framework based on the following key concepts:
- Debt tracker: this is a tool that is used to keep track of all the technical debt that a team owes. I recommend implementing the tracker using tagging or a similar approach within the issue tracking system the team already uses for tracking all their work rather than introducing a separate tool.
- One-time cost (OTC): this represents the amount of effort needed to get rid of any particular debt item and is very similar to the loan amount involved in a financial setting. I recommend using Person Years (PYs) as the currency to capture OTC.
- Recurring cost (RC): this represents the amount of wasted effort being incurred when dealing with the consequences of the debt item and is very similar to the loan interest involved in a financial setting. Again, I recommend using PYs as the currency to capture RC.
- Leverage: this represents the advantage to be gained by getting rid of the associated item and is the ratio of RC to OTC (leverage = RC / OTC). The higher the leverage, the more advantageous it is to prioritize the particular debt item over others. For two items that offer similar leverage, it is advantageous to prioritize the item with the lower OTC because it will yield savings faster.
The table below illustrates the debt tracker for a hypothetical team.
For example, the row with the “Automate scaling” item says it will take about 3 PYs to fully implement auto scaling, the lack of auto scaling is creating about 1 PY worth of overhead for the team and that the leverage gained from implementing auto scaling will be 0.33.
The item with the highest leverage here is “Build playbooks for oncall”. If a team were to invest any effort in paying off debt, it should choose that item. The items “Fix flaky tests” and “Eliminate false alerts” both offer the same leverage — 0.4. When prioritizing, the team should pick “Fix flaky tests” because it has a lower OTC. The intuition here is that it would be cheaper to implement and free up bandwidth sooner that can then be deployed elsewhere — paying off more debt or working on other items.
I recommend using a lightweight method for estimating the OTC and RC for items. Using the collective brainpower and experience of the team to crowdsource these estimates would likely yield the best results. It is more important to get the relative sizings right as opposed to getting each of the individual sizings accurate. The purpose is to help establish the right priorities more than anything else.
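The prioritization rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation: `DebtItem` and `prioritize` are hypothetical names, and only the "Automate scaling" figures (3 PYs of OTC, 1 PY of RC) come from the example above. The OTC and RC values for the other three items are assumed placeholders, chosen only to be consistent with the leverages mentioned in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtItem:
    name: str
    otc: float  # one-time cost to pay off the item, in person years
    rc: float   # recurring cost of carrying the item, in person years

    @property
    def leverage(self) -> float:
        # Leverage is the ratio of recurring cost to one-time cost.
        return self.rc / self.otc


def prioritize(items):
    # Highest leverage first; ties are broken by lower OTC.
    # Leverage is rounded so that near-equal ratios count as ties.
    return sorted(items, key=lambda i: (-round(i.leverage, 2), i.otc))


tracker = [
    DebtItem("Automate scaling", otc=3.0, rc=1.0),             # from the text
    DebtItem("Eliminate false alerts", otc=1.0, rc=0.4),       # assumed values
    DebtItem("Fix flaky tests", otc=0.5, rc=0.2),              # assumed values
    DebtItem("Build playbooks for oncall", otc=0.25, rc=0.5),  # assumed values
]

for item in prioritize(tracker):
    print(f"{item.name}: OTC={item.otc} RC={item.rc} leverage={item.leverage:.2f}")
```

Because the estimates are rough, rounding the leverage before comparing keeps small estimation differences from dominating the ordering, which matches the advice to focus on relative sizing.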
Methodology
Based on the above framework, the methodology for responsible debt management involves focusing on these two aspects:
- Conscious borrowing: this deals with ensuring that new technical debt is taken on with deliberate care and attention rather than unconsciously or inadvertently. Teams need to establish the right checks and balances within their operating processes, both for planned initiatives and for regular operations and maintenance, to ensure they are asking themselves the right questions about tech debt and making the right trade-offs on when to accept new tech debt. I recommend these two practices be integrated into the way teams work:
a. Discussion of debt in designs: an explicit discussion of debt should be part of every design doc and review. If a design pays off debt, document and discuss that. If a design adds debt, document and discuss whether or not the team can afford to take on the debt. Even if a design is debt neutral, be explicit about documenting that and make sure everyone agrees with that perspective.
b. Discussion of debt before launch: an explicit discussion of debt should be part of launch reviews for major new systems or features. The reason for doing this in addition to the design review is that more often than not, there are differences between the proposed design and actual implementation. These come up as teams discover new constraints or requirements. Having a discussion before the actual launch, in the same vein as the one held at design time, will help uncover any deviations related to debt. These can then be at least documented and potentially even addressed prior to the launch.
- Prioritizing payoff: this deals with ensuring that there is an ongoing and steady investment of effort to pay down technical debt and not let it get out of control. Teams need to commit a portion of their available bandwidth to paying off debt and ensure they add appropriate initiatives into their plans. I recommend these two practices:
a. Budget for technical debt: based on the current debt situation a team faces, it should work with its cross functional partners to commit a portion of available bandwidth to servicing technical debt. The bandwidth dedicated will depend on the level of technical debt the team faces. The higher the level of debt, the more bandwidth the team will need to dedicate.
b. Prioritize based on leverage: as discussed earlier, leverage should be used to prioritize items to pay off. Picking higher leverage items offers greater advantage because of the disproportionate return on investment in terms of reducing recurring costs (RC), which represent wasted effort. Faster reduction of wasted effort frees up available bandwidth which can in turn be used to tackle more technical debt.
One interesting option to consider is a phased approach to payoff. With a phased approach, a debt item may be partially paid off, so that more expensive debt (high leverage) is replaced by less expensive debt (low leverage). This can be a way to gain efficiency sooner.
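To make the phased approach concrete, here is a small worked example in Python. The split into phases is purely hypothetical: it assumes that a first phase costing 1 PY of the 3 PY "Automate scaling" item from the debt tracker example would remove 0.6 PY of its 1 PY recurring cost. The function name `leverage` and all phase figures are illustrative assumptions.

```python
def leverage(rc: float, otc: float) -> float:
    """Ratio of recurring cost to one-time cost (both in person years)."""
    return rc / otc


# Whole item, using the "Automate scaling" figures from the tracker example.
total_otc, total_rc = 3.0, 1.0

# Hypothetical phase 1: a partial fix that removes most of the overhead.
phase1_otc, phase1_rc_removed = 1.0, 0.6

# What remains on the books after phase 1.
remaining_otc = total_otc - phase1_otc
remaining_rc = total_rc - phase1_rc_removed

print(f"whole item leverage: {leverage(total_rc, total_otc):.2f}")
print(f"phase 1 leverage:    {leverage(phase1_rc_removed, phase1_otc):.2f}")
print(f"remaining leverage:  {leverage(remaining_rc, remaining_otc):.2f}")
```

Under these assumed numbers, the first phase has a leverage of about 0.6 while the debt that remains drops to about 0.2, so the team captures most of the savings early and is left carrying a cheaper item.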
Summary
The figure below illustrates the advantages of taking a more systematic approach to tech debt management. The situation on the left is the one teams most often find themselves in. The one on the right is what I believe happens when teams take a more systematic approach to tech debt management.
The consequence of not paying attention to tech debt over long periods of time is bankruptcy: a situation where teams spend north of 80% of their time on wasted effort. Such teams will have poor productivity, flagging morale and sub-par business results. Using the approach outlined above, teams can both avoid bankruptcy and work themselves out of bankruptcy to a healthier and more sustainable state.
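The bankruptcy threshold can be expressed as a simple check against the debt tracker. The sketch below is illustrative only: it assumes RC is recorded in person years per year and that a team's annual capacity equals its headcount. The function names and sample figures are hypothetical.

```python
BANKRUPTCY_THRESHOLD = 0.8  # "north of 80%" of time spent on wasted effort


def wasted_fraction(recurring_costs, team_size):
    """Share of team capacity consumed by recurring costs.

    Assumes each RC is in person years per year and that the team
    has `team_size` person years of capacity per year.
    """
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    return sum(recurring_costs) / team_size


def is_bankrupt(recurring_costs, team_size):
    return wasted_fraction(recurring_costs, team_size) > BANKRUPTCY_THRESHOLD


# Hypothetical recurring costs (PYs per year) for a team of five.
rcs = [1.0, 0.4, 0.2, 0.5]
print(f"wasted: {wasted_fraction(rcs, 5):.0%}, bankrupt: {is_bankrupt(rcs, 5)}")
```

Tracking this fraction over time, rather than only checking the threshold, gives a team an early signal of whether its debt budget is keeping up with new borrowing.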