How to Avoid Human Bottlenecks in Production
We’ve all heard the term “bottleneck”: a bottleneck is one process in a chain of processes whose limited capacity reduces the capacity of the whole chain (Wiki).
Generally speaking, running a larger business takes multiple people to handle ideation, design, project management, development, QA, marketing and infrastructure operations. When a single person limits the capacity of a team, that person becomes a Human Bottleneck.
In this post I’d like to highlight two distinct types of Human Bottlenecks, both of which can hurt the productivity of the team from the perspective of Operations and Site Reliability.
Imagine Jane, a seasoned Site Reliability Engineer who has been hired to build infrastructure on AWS for a small startup. She brings a lot of knowledge and expertise to the team, and she knows how to implement all the required processes based on her vast experience in the field.
Jane has mostly worked in fast-paced technology companies where leadership invested heavily in their staff, so she naturally expects all core team members’ skills to be on par with hers. She expects the team to be able to operate and communicate at similar levels of abstraction and complexity.
Having enough knowledge and experience, Jane implemented a CLI toolchain that performs all the tasks listed in the requirements. She chose Golang, as the language has proven very capable for CLI apps and has strong community support. This tool:
- Configures VPC & Networking
- Configures IAM Security
- Configures CI/CD
- Deploys API Service
- Deploys Worker Service
The application is properly designed: it is abstracted, encapsulated and follows good architecture practices. By Jane’s design, a Site Reliability Engineer (or any experienced Software Engineer) can extend it whenever there is a need to add a new feature (let’s say “Configure Logging” or “Deploy new AWS Service via Infrastructure as Code”).
What Jane has not yet accounted for is that the skill level of the startup’s core team is nowhere near what she expected. As a result, every time a non-standard change in the infrastructure code is required, Jane becomes the go-to person, because no one else is capable of building on the framework she has developed, even with the great documentation she provided.
Let’s take a look at what happens when Dylan, a Software Engineer who works closely with the Marketing team, asks Jane to build a new infrastructure feature.
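To make the idea of an extensible toolchain more concrete, here is a minimal sketch of one way such a tool could be organized. It is not Jane’s actual design: her tool is written in Golang, while this sketch uses Python for brevity, and the registry, the task names and the placeholder return values are all hypothetical. No real AWS calls are made.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a task name to the function that performs it.
TASKS: Dict[str, Callable[[], str]] = {}


def task(name: str):
    """Register a function as a named task of the toolchain."""
    def decorator(func: Callable[[], str]) -> Callable[[], str]:
        TASKS[name] = func
        return func
    return decorator


@task("configure-vpc")
def configure_vpc() -> str:
    # Placeholder: a real task would call the cloud provider here.
    return "VPC & networking configured"


@task("deploy-api")
def deploy_api() -> str:
    # Placeholder: a real task would roll out the API service here.
    return "API service deployed"


# Adding a feature means adding one more registered function;
# the rest of the tool does not need to change.
@task("configure-logging")
def configure_logging() -> str:
    return "logging configured"


def run(name: str) -> str:
    """Run a registered task by name, or fail with the list of known tasks."""
    if name not in TASKS:
        raise KeyError(f"unknown task {name!r}; available: {sorted(TASKS)}")
    return TASKS[name]()
```

With a layout like this, extending the tool is a matter of writing one function and registering it, but it still assumes the person extending it is comfortable with the language and the conventions of the codebase.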
Dylan: Jane, we need to start using a new service - AWS Pinpoint.
Jane: Sure. Feel free to extend my AWS package, here’s a link to the AWS API for that service.
Dylan: Ehm, I’m not sure I’d be comfortable with writing Golang, it would take too much time to ramp up.
Jane: Should be nothing complicated, just use the generic package as an example. It would take you a week or so to complete, if you start today.
Dylan: Can you please add the functionality this time? I will get back to it later.
Jane: Well, ok. I’ll have to do it after completing my current priorities. So ETA to start is in a week, and it would take another week to complete.
This way, Jane becomes a Human Bottleneck for all complex changes to the Infrastructure as Code.
The main idea is to set up the team so that Software Engineering can integrate with the Site Reliability team and take on a portion of its tasks as needed, using the tools provided. To do so, we need to consider how well the skills, expertise and motivation of both sides match.
As technology leaders, we need to validate whether Jane is capable of building tools with great Runner Experience, so that all team members can use them naturally while incrementally learning their internals.
Jane may also need to be able to interface with the team at a level below her own standard, as that is how she can build proper bonding and integration. Nevertheless, she still must maintain high standards internally and deliver high-quality operational tools that are not full of bugs, so that they don’t cause terrible operational consequences.
As a last resort, a pure business solution may turn out to be not hiring Jane for this role at all, and instead hiring Mary, who has shown more experience working with less technical teams and is therefore capable of properly laying out the Runner Experience Design.
In the end, the main goal is to provide enough support and empower Software Engineers to use and build on top of Site Reliability and automation tools, instead of blocking and frustrating them.
The next type of Human Bottleneck is indirectly related to the opposite issue: a lack of automation expertise.
Imagine Bob, a so-called DevOps Engineer who works for an AI startup that runs on AWS. Bob’s job is to maintain the infrastructure, but he never had a chance to implement Terraform or CloudFormation properly: it’s there, but no one can actually use it without Bob’s involvement (including CI/CD). Bob has to pass a myriad of obscure TF_VAR_variable values to each terraform run, and the module structure is fragile and needs a rough shake-up for each new infrastructure feature.
Every time a change in infrastructure is required, everyone has to ping Bob to make it, so when Alice, a full-stack developer, needs to scale the capacity of their application, the conversation usually looks like this:
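One way to reduce the dependence on values that only Bob remembers is to keep them in a per-environment variables file that lives next to the code. The sketch below is only an illustration under stated assumptions: the variable names, the environments/ directory layout and both helper functions are hypothetical, and the sketch only builds the command line without running Terraform. It relies on Terraform accepting a JSON variables file through the -var-file option.

```python
import json
from typing import Dict, List

# Hypothetical set of actions the wrapper is willing to build a command for.
ALLOWED_ACTIONS = ("plan", "apply")


def render_var_file(variables: Dict[str, object]) -> str:
    """Render input variables as the text of a *.tfvars.json file."""
    return json.dumps(variables, indent=2, sort_keys=True)


def terraform_command(action: str, environment: str) -> List[str]:
    """Build the Terraform command for one environment.

    The path layout (environments/<name>.tfvars.json) is an assumption
    made for this sketch, not something taken from Bob's setup.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    var_file = f"environments/{environment}.tfvars.json"
    return ["terraform", action, f"-var-file={var_file}"]
```

With the values checked in, a change like Alice’s instance resize becomes an edit to one reviewed file rather than a set of environment variables exported on a single person’s machine.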
Alice: We need to increase the size of the API instance
Bob: Sure. Please submit a ticket, I’ll do it next week.
Alice: Can I do it myself so it happens faster? Our infra is in code, right?
Bob: Yes and no. It’s quite complicated. And you need admin permissions. But I’ll take care of it.
Alice: I’d appreciate it if this could be done faster.
Bob: I’m pretty swamped right now with a new initiative from Marketing. Please submit the ticket and I’ll look into it next week.
Sounds familiar, right? We can call it Pseudo Automation, as Bob doesn’t actually use automation to help the team implement DevOps. He’s a manual kind of person, which makes him a Human Bottleneck. A lack of automation also produces Toil, so we should avoid it.
As in the first case, the main approach is to empower the whole team to use the tools. That is hard for Bob, since he’s not automating things enough, so the first step towards a solution is to increase automation.
Define core processes
Before rushing into the coding part, we need to have an idea of the whole set of processes that we may need to automate. Generally, you can think of how you perform the following:
- Code deployment
- Infrastructure changes & deployment
- Backups & Validation
If your application deployment hasn’t been automated yet, you have a great candidate. Find a way to deploy your software with a minimal number of tools required. If you can deploy your code without extra tooling, it may be your best choice. Shoot for something that can already be used by other team members AND CI/CD.
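As a rough illustration of a single entry point that both team members and CI/CD can call, here is a minimal sketch that uses only the Python standard library. The environment names, the flags and the plan_deploy helper are hypothetical. The sketch also assumes the CI platform sets a CI=true environment variable, as GitHub Actions and Travis CI do. It only describes the deployment it would perform and does not deploy anything.

```python
import argparse
from typing import Dict, List, Mapping

# Hypothetical environment names for this sketch.
ALLOWED_ENVIRONMENTS = ("dev", "staging", "production")


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse the same flags whether a person or a CI job calls the script."""
    parser = argparse.ArgumentParser(description="Deploy the application")
    parser.add_argument("--environment", choices=ALLOWED_ENVIRONMENTS, default="dev")
    parser.add_argument("--version", required=True)
    return parser.parse_args(argv)


def plan_deploy(argv: List[str], env: Mapping[str, str]) -> Dict[str, str]:
    """Describe the deployment that would be performed."""
    args = parse_args(argv)
    return {
        "environment": args.environment,
        "version": args.version,
        # Assumption: the CI platform exports CI=true.
        "triggered_by": "ci" if env.get("CI") == "true" else "human",
    }
```

In practice the script would be called with sys.argv[1:] and os.environ, so a developer on a laptop and a CI job run exactly the same command and differ only in the environment they run in.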
Automation Testing & Validation
You need to be able to trust your automation and have proof for all team members that it can be trusted. The best way to do so is to put an automated process under CI/CD, so everyone can see that it actually works. Don’t deploy Jenkins or GitLab on-prem just for fun; instead, use GitHub Actions, Travis CI or similar cloud CI/CD platforms. Start with low-impact areas like the dev environment and gradually move towards production as trust in the automation increases.
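The gradual move from dev towards production can be made explicit with a simple promotion rule. The sketch below is one possible rule, not a prescribed one: the environment order, the threshold of five consecutive successful runs and the shape of the run history are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical promotion order and trust threshold.
ORDER = ["dev", "staging", "production"]
REQUIRED_GREEN_RUNS = 5


def consecutive_successes(history: List[bool]) -> int:
    """Count the successful runs at the end of the history (most recent last)."""
    count = 0
    for succeeded in reversed(history):
        if not succeeded:
            break
        count += 1
    return count


def allowed_environments(history: Dict[str, List[bool]]) -> List[str]:
    """Return the environments the automation is currently trusted to touch.

    The first environment is always allowed; each later one is unlocked only
    when the one before it has enough consecutive successful runs.
    """
    allowed = [ORDER[0]]
    for previous, candidate in zip(ORDER, ORDER[1:]):
        if consecutive_successes(history.get(previous, [])) >= REQUIRED_GREEN_RUNS:
            allowed.append(candidate)
        else:
            break
    return allowed
```

A rule like this keeps the trust discussion concrete: a failed run in dev resets the count, and the automation falls back to dev only until it has proven itself again.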
Post photo by @ryoji__iwata