Playbook for a Product/DevOps transformation
In the late 1970s, the U.S. automotive industry went into a quality crisis and was surpassed by the Japanese, who could assemble a comparable car with 60 percent of the effort and far fewer quality problems. Ford and General Motors eventually realized what was going on in Japan and adopted lean principles to help save themselves.
In the last ten years, tech companies went through a similar change through the adoption of cloud computing and DevOps. The DevOps movement started as a better way for operations and development teams to work together. Since then, it has become an umbrella for all sorts of tools and practices to improve business agility and organizational performance in the Age of Software.
This post is a compilation of ideas from the DevOps community, aimed at companies that have adopted DevOps technologies and practices, but not the culture. Typically, such companies have a functional organizational structure, with role-based teams and people working on many projects. They are on the cloud but struggle with quality, complexity and ownership issues, and have difficulty scaling their teams and aligning on priorities. This post suggests what can be done to improve and lists further reading material.
How functional silos harm quality and stability
Traditional IT organizations follow a functional structure which divides employees into teams based on their role or specialty (e.g. Product, Engineering, Infrastructure, QA and Ops). In this structure, the primary team of an employee is their functional team; their tribal identity is aligned to their department, and each employee is assigned to multiple projects. Such a structure enables local optimizations within each department, but it introduces silos, handovers and ownership issues, and it is less effective when measured by end-to-end value delivery. A functional organization optimizes for cost, not outcomes.
A functional organizational structure harms stability—and quality in general—in the following ways:
- Blind spots: team silos and handovers create blind spots which lead to quality and stability issues. Each handover causes massive information loss due to tacit knowledge, which remains with the creator and is never handed over. As a result, when systems are developed and operated by different teams, first responders have insufficient information to react effectively to incidents.
- Cognitive overload: humans have limited cognitive capacity (e.g. working memory) and make mistakes when dealing with too much complexity. As people are assigned to more projects, they suffer more cognitive overload, make more mistakes, and are more likely to burn out.
- Diffusion of responsibility: humans experience less social pressure and feel less responsibility when their performance is difficult to assess amid a crowd. People feel less responsibility when assigned to multiple projects, and teams feel less responsibility for shared issues.
- Ownership and accountability issues: in a functional structure, no single team is accountable for project outcomes. When something goes wrong, it is usually unclear or ambiguous where the problem was. For example, if a service runs out of memory, it could be because the developers used too much memory, or the infrastructure provided insufficient memory, or monitoring failed, and so forth. In any case, each team feels less ownership of the issue.
Research by DevOps Research and Assessment (DORA, acquired by Google in 2018) shows that a generative culture predicts software delivery and organizational performance in technology. The idea that a culture optimized for information flow is predictive of good outcomes is not new. It is based on research by Dr. Ron Westrum, who studied human factors in system safety in the context of accidents in domains such as aviation and healthcare.
Organize around value streams and product teams
Whenever there is a product for a customer, there is a value stream.
Teams should be structured in ways that empower them to move faster and to get closer to their customers. Organizations structured around value streams are all about customer focus and flow. They optimize for end-to-end value delivery, rather than department costs, and manage software delivery to business results, instead of technical activities.
In this structure, the primary team of an employee is a cross-functional product team aligned to a value stream. Their tribal identity is aligned to the customer, the value stream, and the product, not their job role. Instead of being assigned to projects, they join a stable product team, which succeeds and learns together. Each team has an internal or external customer and some form of product management. Work is brought to the teams in a controlled, pull-based way, which limits the amount of complexity per team.
This product-oriented structure aligns the organization to business outcomes, not functional silos. OKRs become more meaningful and easier to plan, since teams have fewer external dependencies and do not need to seek alignment across functions for every single project. In order to deliver greater value for customers, organizations need to focus on enhanced visibility and clarity of work, greater integration between the functions involved in a value stream, and governance across value streams.
Optimize for end-to-end flow with small teams
It’s not the big that eat the small… it’s the FAST that eat the slow!
Effective product teams are small and autonomous. They have the resources to operate with minimal external dependencies, and are empowered to make key decisions. The small size is important to minimize communication and coordination costs, and maximize ownership and motivation.
Research has repeatedly shown that teams that can ship software quickly and effectively outperform their slower counterparts in terms of company profitability, market share, and just about every business metric that matters. Importantly, it has shown that the teams that focused on quality are the ones that gained speed. In software development, quality is predictive of speed, and speed reduces costs. As a rule, tech companies should focus on quality in order to gain speed, and also to reduce costs.
In a value stream, speed means flow time: the time from customer need identified to customer need met. Because end-to-end flow is difficult to track, teams tend to focus on local output metrics and create local optimizations, instead of improving end-to-end flow. A classic example is water-scrum-fall.
In order to improve flow, we must focus on end-to-end flow metrics and avoid ineffective optimizations. The system of work within a value stream should be pull-based at every level, in order to reveal the capacity of the system, which can then be improved.
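As a rough illustration of the two ideas above, the sketch below measures flow time from "need identified" to "need met" and applies a simple work-in-progress (WIP) limit as a pull rule. The `WorkItem` fields, the sample dates and the `can_pull` helper are hypothetical names chosen for this example; they are not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass
class WorkItem:
    name: str
    need_identified: date  # when the customer need was first recorded
    need_met: date         # when the change actually reached the customer


def flow_time_days(item: WorkItem) -> int:
    """End-to-end flow time: need identified to need met, in days."""
    return (item.need_met - item.need_identified).days


def median_flow_time(items: list[WorkItem]) -> float:
    """A single end-to-end metric for the value stream."""
    return median(flow_time_days(i) for i in items)


def can_pull(in_progress: int, wip_limit: int) -> bool:
    """Pull rule: a team takes new work only when it is below its WIP limit."""
    return in_progress < wip_limit


# Hypothetical sample data for one value stream.
items = [
    WorkItem("checkout bug", date(2024, 3, 1), date(2024, 3, 4)),
    WorkItem("new report", date(2024, 3, 2), date(2024, 3, 12)),
    WorkItem("api change", date(2024, 3, 5), date(2024, 3, 10)),
]

print(median_flow_time(items))              # median of 3, 10 and 5 days
print(can_pull(in_progress=3, wip_limit=3))  # False: finish something first
```

Measuring from the customer's point of view, rather than per department, is what keeps the metric end to end; the WIP limit is what makes the capacity of the system visible.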
Improve quality and stability with full-service ownership
Having to support your own product is a strong motivator to maintain high quality and supportability.
Full-service ownership means that each product team is in charge of running their services in production—i.e. you build it, you run it. This level of ownership brings teams closer to their customers and enables them to build increasingly reliable and efficient services, while also deploying code faster and more frequently. Essentially, it unlocks accountability, reliability, and continuous improvement.
Empowering engineers to own the full lifecycle of their code is a competitive advantage. It benefits the company as well as the engineers, because this accountability ensures high-quality work and promotes mastery and autonomy. Engineers become more skilled, well-rounded and engaged.
Full-service ownership requires a safe environment that accepts failure. Psychological safety is a top predictor of both team and organizational performance. Teams must be allowed to fail fast and often in order to succeed sooner. The greater the psychological safety and the actual safety—thanks to guardrails and a risk-aware culture—the faster the teams can innovate and improve business outcomes.
Align product teams and the software architecture
Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations. (Conway’s Law)
How product teams are structured has a powerful effect on software design, the resulting architecture and business outcomes. Many organizations invest a lot of time and effort in adopting technologies, but fail to achieve critical software delivery outcomes due to limitations imposed by their architecture. As Conway’s Law suggests, good team design leads to better software design and more effective teams. Teams should be deliberately structured to support the emergence of a desired architecture.
Conway’s Law helps us design product team boundaries in the context of our value streams and desired communication patterns. To optimize flow, it encourages us to align the teams to a value stream and minimize the dependencies between teams. To limit cognitive load, the scope of each team’s domain should be small and bounded. The resulting teams should ideally have high cohesion (the team does one thing well) and low coupling (minimal dependencies on other teams).
In a service-oriented architecture, small product teams produce loosely coupled microservices, which enables quicker innovation. Small product teams can also work on parts of a monolithic system, and iteratively migrate the system to a microservices architecture using the strangler pattern. Ultimately, every team should have a clearly identified value stream, customer and product.
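The strangler pattern mentioned above can be sketched as a routing facade that sits in front of the monolith and sends each request either to a migrated service or to the legacy code. This is a minimal in-process illustration, assuming path-prefix routing; `StranglerFacade`, `legacy_monolith` and `payments_service` are hypothetical names, and a real migration would typically do this at a proxy or API gateway.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def legacy_monolith(path: str) -> str:
    # Stand-in for the existing monolithic system.
    return f"monolith handled {path}"


def payments_service(path: str) -> str:
    # Stand-in for a new service owned by one product team.
    return f"payments service handled {path}"


class StranglerFacade:
    """Routes migrated path prefixes to new services, everything else to legacy."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        """Register a new service as the owner of a path prefix."""
        self._migrated[prefix] = handler

    def handle(self, path: str) -> str:
        # The longest matching prefix wins; unmatched paths fall back to legacy.
        matches = [p for p in self._migrated if path.startswith(p)]
        if matches:
            return self._migrated[max(matches, key=len)](path)
        return self._legacy(path)


facade = StranglerFacade(legacy_monolith)
facade.migrate("/payments", payments_service)

print(facade.handle("/payments/42"))  # served by the new service
print(facade.handle("/orders/7"))     # still served by the monolith
```

Each call to `migrate` moves one slice of functionality, so the monolith shrinks iteratively while callers keep using the same entry point.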
Manage the organization by value streams
Most businesses are naturally set up as value streams—it is usually the org structure of the “tech people” that is partially, or not at all, aligned to business value streams. Under a functional structure, as a business grows, tech teams become opaque cost centers; business metrics, such as cost and revenue, become impractical to track per product; and the tech staff loses sight of the customers and the business.
Conversely, organizations structured around value streams, with cross-functional teams, offer a quantum leap improvement in work visibility and clarity. Bottlenecks and pain points are easier to identify. Business metrics, such as cost and revenue, are easy to track per product and per value stream. This enables better management decisions, and places tech in profit centers, rather than cost centers.
A value stream is a flow of value creation, from need identified to need met. It has a value proposition, an external or internal customer, and is genuinely end to end. Importantly, value streams are nested. For example, in the figure above, both Processing and Settlement are business value streams which rely on nested value streams—e.g. Gateway and Reconciliation—for fast flow of value.
Due to product/team relationships, value streams also form a graph, known as a value stream network. Ideally, each value stream should have high cohesion (does one thing well) and low coupling (has minimal dependencies on other value streams). But, in the beginning, we should not worry about getting the value streams right. It won’t be right. We need to continually inspect and adapt—i.e. manage the value stream network itself as a product—to continuously improve the business/software architecture.
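One way to inspect a value stream network is to model it as a dependency graph and count dependencies per value stream. The sketch below reuses the value stream names from the example above, but the specific dependencies between them are assumptions made for illustration, as are the function names.

```python
# Each value stream maps to the set of value streams it depends on.
# The edges below are hypothetical, not taken from a real organization.
depends_on = {
    "Processing": {"Gateway"},
    "Settlement": {"Gateway", "Reconciliation"},
    "Gateway": set(),
    "Reconciliation": set(),
}


def fan_out(network: dict[str, set[str]], stream: str) -> int:
    """How many other value streams this one depends on (its coupling)."""
    return len(network[stream])


def fan_in(network: dict[str, set[str]], stream: str) -> int:
    """How many other value streams depend on this one."""
    return sum(stream in deps for deps in network.values())


def most_depended_on(network: dict[str, set[str]]) -> str:
    """The value stream with the highest fan-in (ties broken alphabetically)."""
    return max(sorted(network), key=lambda s: fan_in(network, s))


for stream in depends_on:
    print(stream, fan_out(depends_on, stream), fan_in(depends_on, stream))
```

A value stream with a high fan-out is a candidate for re-drawing boundaries, while one with a high fan-in behaves like a platform and may deserve to be managed as one.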
BMW’s most important product is not the i8, nor the 3-series… it’s their plant. (Mik Kersten)
Amplify results with platform thinking and enabling teams
There cannot be a more important thing for an engineer, for a product team, than to work on the systems that drive our productivity. (Satya Nadella)
Platform thinking enables organizations to make the most of their investments. A well-designed and well-run platform can be a significant force multiplier for product teams: it can reduce complexity, stimulate innovation, and accelerate software delivery. Platforms build on top of other platforms: for example, a billing platform, on top of a data platform, on top of a delivery platform. Each platform should be managed as a thin, customer-driven product. Steer clear of the “build it and they will come” trap.
A good platform provides standards, templates, APIs, and makes it easy for teams to do the right things, in the right way, for the organization. Ideally, platforms should be self-service—especially in regulated industries, where self-service can automate compliance. For example, a self-service infrastructure (e.g. based on GitOps) provides automatic “separation of duties” compliance and empowers product teams to deploy to production without an infrastructure engineer.
Another pattern to amplify organizational capabilities is the enabling team: a team of specialists in a given technical domain, which cuts across the product teams and partners with them when necessary to bridge a capability gap. For example: a Security Engineering team, to consult on security topics; or a Developer Experience team, to improve developer productivity [1].
Move closer to the SRE model
Hope is not a strategy. (Traditional SRE saying)
One common anti-pattern when introducing DevOps to an organization is to assign someone the role of “DevOps” or to call a team a “DevOps team”. Doing so perpetuates the kinds of silos that DevOps aims to break down. In the full-service ownership model, DevOps is something every team does—potentially, without an infrastructure engineer. In other words: it is a practice, not a role. Traditional infrastructure teams should be converted to platform teams and/or enabling teams (SRE).
Site Reliability Engineering (SRE) is a more prescriptive, customer-focused approach to DevOps. All SRE efforts should result in protecting or improving customer satisfaction, even when the customer is internal. The SRE organization consists of enabling teams that partner with the product teams to define and track SLIs and SLOs for every important service in production:
- SLIs (service level indicators) are monitoring metrics, such as the four golden signals;
- SLOs (service level objectives) are target value ranges for the SLIs, defined with Product;
- SLAs (service level agreements) are legal agreements with repercussions for failing to meet them. They are commercial decisions, between Sales and a customer, usually based on a margin over the SLOs.
Product teams plan their work with their SLOs in mind, to avoid going over their service’s error budgets. Such error budgets help to balance the speed of new features with whatever work is needed to make the service reliable. When a service “runs out of budget”, it goes automatically into a feature freeze and the SRE team may intervene. No budget, no pushes.
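The error budget arithmetic described above can be sketched in a few lines. This example assumes a request-based availability SLI (the fraction of successful requests) and a 99.9% SLO over some measurement window; the function names and the numbers are hypothetical.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates in the window."""
    return (1.0 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        # A 100% SLO leaves no budget at all.
        return 0.0
    return (budget - failed) / budget


def feature_freeze(slo: float, total_requests: int, failed: int) -> bool:
    """'No budget, no pushes': freeze once the budget is used up."""
    return failed >= error_budget(slo, total_requests)


# Assumed example: 99.9% SLO, 1,000,000 requests in the window,
# which tolerates roughly 1,000 failed requests.
print(feature_freeze(0.999, 1_000_000, failed=250))    # False: budget left
print(feature_freeze(0.999, 1_000_000, failed=1_200))  # True: budget exhausted
```

Because the budget is derived directly from the SLO, tightening the SLO shrinks the room for risky releases, which is the trade-off between feature speed and reliability that the budget is meant to make explicit.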
The relationship between an SRE team and a product team changes at different points of a service’s life. Initially, the product team alone builds and runs the service in production, until the scale merits SRE help. During a second stage, as the service usage increases, SRE provides guidance to the product team on how to make the service work better. Later, if the scale merits it, SRE may become fully involved by running and supporting the service. This is the time when SLOs are defined.
Further Reading
- Sooner Safer Happier: Antipatterns and Patterns for Business Agility
- Team Topologies: Organizing Business and Technology Teams for Fast Flow
- Project to Product: How to Survive and Thrive in the Age of Digital Disruption
- Accelerate: The Science of Lean Software and DevOps
- Lean Enterprise: How High Performance Organizations Innovate at Scale
[1] It is good practice to dedicate 3-5% of developers to improving developer productivity (tools, documentation, templates, libraries and frameworks, etc.).