Sustainable Engineering
Introduction
This document describes a “sustainable engineering” model for teams building products. It is based on a set of observations and experiences I’ve had over the years as an individual contributor and team leader building both B2B and B2C products. I use the term “product teams” in an inclusive manner and mean folks involved across all aspects of delivering products to developers — product management, development, UX, doc, support, quality, operations, evangelism, etc.
From a sustainable engineering perspective, I classify the work that product teams typically handle into four buckets (examples for each bucket appear in the “Work buckets by example” section):
- Keep the lights on (KTLO): work required to sustain the existing customer usage and maintain business at a steady state. This work is low leverage, i.e. it does not provide much long-term return on investment.
- New products & features (NPF): work required to launch new products, add new features to existing products and enhance existing features. With effective prioritization, work in this bucket can deliver high leverage, i.e. provide significant long-term returns on investment. Low leverage work results in this bucket when individual products or features are used only by a relatively small fraction of the customer base (e.g. a feature that only meets one customer’s unique needs).
- Product & engineering excellence (PE/EE): work required to improve customer experience, meet business requirements, deprecate EOL (end of life) features or products, reduce technical debt and improve developer productivity. Again, with effective prioritization, work in this bucket can deliver high leverage. Low leverage work results in this bucket when efforts with low ROI are taken on (e.g. rewriting code because of style misalignments).
- Platformization (PLAT): work required to create reusable frameworks and infrastructure that enables the team to deliver products and features faster and more efficiently. Here as well, with effective prioritization, work in this bucket can deliver high leverage. Low leverage work results in this bucket with premature platformization — i.e. the level of reuse is very low to non-existent.
The need for a sustainable engineering model for product teams is similar to the need for a whole body approach to wellness, which typically involves balancing attention across multiple aspects such as nutrition, exercise, self-care and mental health. Ignoring or underinvesting in one or more aspects of wellness may be tolerable in the short term but will end up creating issues in the long term. Similarly, the prolonged lack of sustainable engineering practices in product teams results in problems such as these:
- Poor user/customer experience: starving critical areas such as product & engineering excellence and platformization can lead to a stagnant product, long standing unresolved quality, security or usability issues as well as problems with reliability, performance, scaling, etc.
- Execution churn: unplanned work is a fact of life when it comes to sustaining and maintaining products. The lack of a proper strategy for understanding the nature of and dealing with unplanned work can lead to engineers having to context switch a lot, PMs having to constantly reprioritize and the team being unable to meet customer/business deadlines.
- Poor team morale: ignoring or underestimating unplanned work leads to overcommitment on planned work. Teams typically try to compensate for overcommitment and deadline pressure by working longer hours, borrowing from the future (i.e. taking on tech debt through quality, documentation and implementation shortcuts), or both. The former leads to poor work-life balance, fatigue and burnout. The latter leads to a lack of pride of ownership and frustration with doing substandard work.
- Poor bandwidth utilization: as tech debt piles up, it can not only add on to the unplanned workload but also create a lot of task overheads for the team. As a result, teams start to spend more and more of their time and energy on low leverage work. Furthermore, ignoring investment in critical areas like platformization can create duplicative work and drive up the costs of feature and product development/maintenance over time; this is again low leverage work. As the volume of low leverage work grows, it takes bandwidth away from high leverage work, leading to poor bandwidth utilization.
Sustainable engineering is about balancing bandwidth investments across the four buckets described earlier. Product teams have little choice in how much bandwidth they invest in the KTLO bucket: this is the unplanned work bucket, and work here is non-negotiable for the most part and usually urgent. Therefore teams really only have a choice in how they invest the remaining bandwidth across the other three buckets.
Work buckets by example
In this section, I provide lots of examples to help illustrate the four work buckets described above.
- New products & features (NPF): examples (not an exhaustive list):
a. Launches: work required to get the initial version of a product or feature in the hands of developers. There may be multiple stages involved in a given launch — EAP, alpha, beta, GA. All products and features must meet the bar for “done” — i.e. have the right “ilities” — usability, quality, reliability, performance, metrics, observability, supportability, etc.
b. Lands: work required to drive product or feature adoption among developers. There may be data collection and analysis involved. There may also be customer interviews/meetings involved.
c. Evangelism: work required to educate developers about products/features such as webinars, blog posts, conference sessions, demos, etc.
d. Compliance: work required to enable products and features to meet various regulatory requirements in the areas of security and privacy.
- Keep the lights on (KTLO): examples (not an exhaustive list):
a. On call: work required to respond to production and non-production alerts and keep systems highly available, scaled and performant.
b. Customer support: work required to deal with customer requests that the support team cannot handle by itself because of a lack of documentation or knowledge required.
c. Customer escalations: work required to handle escalations from developers who may be upset because of failures or being blocked on fixes/patches.
d. Bug fixes: work required to triage and fix customer and internally reported bugs on an ongoing basis.
e. Security/vulnerability fixes: work required to triage and fix security issues either externally reported or internally detected on an ongoing basis.
f. Releases: work required to package/deploy fixes/features and get them to developers.
g. Reviews: work required to conduct requirements, design, code and other types of reviews on an ongoing basis.
h. Unplanned deprecations/EOLs/patches: work required to deal with deprecations/EOLs or emergency patches of service/library dependencies or infrastructure without any advance notice.
- Product & Engineering Excellence (PE/EE): examples (not an exhaustive list):
a. Fit and finish: work that adds the bells and whistles needed to make the product or features delightful for both new and existing developers.
b. Deprecations/EOLs: work required to get developers to stop using features/products. There may be migrations, customer discussions and customer coordinations involved.
c. Self service: work that enables developers to do more with the product themselves and not require help from support such as troubleshooting guides, APIs for workflow integration, APIs/UX for configuration changes, etc.
d. Internal documentation: work required to capture internal knowledge such as design docs, architecture diagrams, troubleshooting/operations playbooks, API references, etc.
e. Metrics: work required to instrument products, features and other assets to give us the necessary insights to drive improvements across user experience and internal processes.
f. Training: work required to adopt new processes, tools and technologies — this would include any necessary training and migrations involved.
g. Availability and performance: changes to the product to address issues causing availability, scaling or latency issues.
h. Quality: work aimed at increasing product quality via addition of testability hooks, increasing test coverage or adding more types of testing such as stress/chaos testing.
i. Developer productivity: work required to make developers more productive through the development, deployment, operations and maintenance lifecycle. This can include efforts to build tools, develop automation and create documentation.
j. Support productivity: work required to make support engineers more productive through the triage, diagnosis, mitigation and root cause analysis process for customer issues.
k. Operational efficiency: work aimed at efforts such as re-architecture, re-implementation, re-factoring, automation that reduce operating/support costs.
l. Margin improvements: work aimed at reducing hardware and software costs that contribute towards product margins in order to drive improvements to profitability.
m. Development practices: work to implement development, operation and support practices that meet the regulatory requirements for security and privacy as well as requirements from standards bodies like the ISO.
n. Planned deprecations/EOLs & maintenance: work required to deal with deprecations/EOLs or maintenance/upgrade of service/library dependencies or infrastructure dependencies.
- Platformization (PLAT): examples (not an exhaustive list):
a. Frameworks: work required to create re-usable and re-configurable frameworks that make it much easier to develop, operate and/or support new features and/or update existing ones.
b. Infrastructure: work required to create and/or adopt infrastructure that speeds up the development, operation and/or support of new features as well as improves user experience by providing higher reliability, better performance and such.
c. Extensibility: work required to make the platform and infrastructure extensible so that new use cases and variants of use cases can be handled with much smaller efforts and in a much shorter time.
KTLO hell
Due to market and customer pressures, product teams tend to favor investment in the NPF bucket while starving the PE/EE and PLAT buckets. Doing so over a long period of time tends to drive up KTLO costs significantly, so much so that product teams can find themselves spending a large percentage of their time on low leverage KTLO work and unable to find sufficient cycles for high leverage work, including new products & features. I call this KTLO hell.
Product teams can literally grind to a halt because of this situation and create major risks to the business. The figure below depicts this situation:
On the other hand, a more deliberate approach that ensures the PE/EE and PLAT buckets are given adequate resources will help keep KTLO under check and ensure it is always at a manageable level. The figure below depicts this situation:
Following sustainable engineering practices is therefore critical to avoiding KTLO hell.
Model
Here is how I’ve approached the implementation of sustainable engineering practices:
- Comprehensive tracking & backlog: maintain work tracking and backlogs that span all four buckets including KTLO work. Tracking KTLO work is especially important because its unplanned nature can lead to it flying under the radar and therefore remaining invisible. Because KTLO work can sometimes happen under emergency circumstances (such as an outage or security incident), tracking the work post completion is perfectly acceptable. Having a paper trail for KTLO enables proper estimation of resource consumption, since this bandwidth is essentially off the table for allocation to any of the other buckets.
- Bucket specific allocations: subtract resources needed for KTLO work from the available resources and allocate a share of the remaining resources to each of the other three work buckets. The relative size of resource investments needed in each of the three buckets can vary across product teams based on their operating context and circumstances. Some important factors to consider are:
a. Product competitiveness: a newly introduced product looking to gain market adoption will require a larger investment in the new products & features bucket as compared with an established product.
b. Product maturity: a more mature product with a large customer base will typically need a much larger investment in the KTLO bucket as compared with a newly launched product.
c. Tech debt: a high tech debt load would require greater investment in the product & engineering excellence bucket.
d. Product fit and finish: a product with a lot of fit and finish issues will need greater investment in the product & engineering excellence bucket.
e. Feature velocity: teams grappling with productivity or other issues will need greater investment in the product & engineering excellence bucket.
f. Duplicative work: teams spending a lot of time and effort in duplicative work will need greater investment in the platformization bucket.
- Bucket specific execution: create bucket specific OKRs/goals and plans during each planning cycle that are based on the needs, priorities and resources available for the bucket. Only the KTLO bucket will not have any OKRs/goals and plans associated with it. The product team will need to determine the priority of KTLO work items on an ongoing basis; teams should implement a lightweight process for KTLO prioritization based on their specific needs.
- Ongoing governance: revisit the resource allocation across work buckets as and when there are material changes in the operating context and circumstances. This ensures that the allocation scheme is always aligned with the current needs of the product team.
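The allocation step above is simple arithmetic, and a small sketch may help make it concrete. The Python below is illustrative only: the function name `allocate_bandwidth`, the use of person-weeks as the unit and the relative weights are all assumptions made for this example, not part of the model itself. It subtracts the KTLO estimate from the total and splits the remainder across the three plannable buckets in proportion to the weights a team chooses.

```python
from typing import Dict, Mapping

# The three buckets a team can actually plan for (KTLO is taken off the top).
PLANNABLE_BUCKETS = ("NPF", "PE/EE", "PLAT")


def allocate_bandwidth(
    total: float, ktlo: float, weights: Mapping[str, float]
) -> Dict[str, float]:
    """Split the bandwidth left after KTLO across the plannable buckets.

    total   -- total team bandwidth for the planning cycle (hypothetical unit:
               person-weeks)
    ktlo    -- bandwidth estimated for KTLO, e.g. from past tracking data
    weights -- relative weights for NPF, PE/EE and PLAT (assumed inputs that
               each team picks based on its own context)
    """
    if not 0 <= ktlo <= total:
        raise ValueError("KTLO estimate must be between 0 and total bandwidth")
    if set(weights) != set(PLANNABLE_BUCKETS):
        raise ValueError("weights must cover exactly NPF, PE/EE and PLAT")
    weight_sum = sum(weights.values())
    if weight_sum <= 0 or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative and not all zero")

    remaining = total - ktlo
    plan = {"KTLO": ktlo}
    for bucket in PLANNABLE_BUCKETS:
        # Each bucket gets its proportional share of what KTLO leaves behind.
        plan[bucket] = remaining * weights[bucket] / weight_sum
    return plan
```

As a usage example, a team with 100 person-weeks, a KTLO estimate of 30 and weights of 5:3:2 for NPF, PE/EE and PLAT would end up with 35, 21 and 14 person-weeks respectively. Revisiting the weights when circumstances change is the governance step described in the last bullet.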
Success metrics
If implemented well, sustainable engineering should deliver positive metrics trends on the following fronts:
- User/customer happiness: developers should see more products/features and have a delightful experience when using them. Could be measured using CSAT, NPS, conversion, retention or other similar metrics.
- Execution velocity: the business should see changes delivered frequently, on time and with high quality. Could be measured using release frequency, turnaround time, backlog burn rate or similar metrics.
- Team engagement: those working on the team should feel productive and quickly see the impact of their work. Could be measured using ESAT, EPS, retention or other similar metrics.
- Bandwidth utilization: the team should be spending the least possible amount of time on low leverage KTLO work and directing the majority of its effort toward high leverage work. Could be measured using bandwidth spent on KTLO, product and feature turnaround time or similar metrics.
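The "bandwidth spent on KTLO" metric can be derived directly from the work tracking described in the Model section. The following Python is a minimal sketch under assumptions: it supposes each tracked work item is recorded as a (bucket, hours) pair, and the function name `bandwidth_share_by_bucket` and the sample log are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def bandwidth_share_by_bucket(
    work_items: Iterable[Tuple[str, float]]
) -> Dict[str, float]:
    """Return the fraction of tracked effort spent in each work bucket.

    work_items -- (bucket, hours) pairs, e.g. exported from a work tracker;
                  this record shape is an assumption for illustration.
    """
    totals: Dict[str, float] = defaultdict(float)
    for bucket, hours in work_items:
        totals[bucket] += hours

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # nothing tracked yet, so no shares to report
    return {bucket: hours / grand_total for bucket, hours in totals.items()}


# Hypothetical work log for one planning cycle.
sample_log = [
    ("KTLO", 10), ("NPF", 20), ("KTLO", 5), ("PE/EE", 10), ("PLAT", 5),
]
shares = bandwidth_share_by_bucket(sample_log)
print(f"KTLO share: {shares['KTLO']:.0%}")
```

Tracking this share over successive planning cycles, rather than looking at a single value, is what would show whether KTLO is staying at a manageable level.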