
Tired of Documentation? Try Runner Experience Design

I’m an adherent of the Credo of Phoenix approach to infrastructure design: whatever you build should be rebuildable with minimal-to-no effort, over and over again, by anyone or anything with sufficient permissions.

While this poetic name for Idempotent Infrastructure clearly implies a lot of important technical characteristics, this time I’d like to talk about the other side of it: “anyone or anything with sufficient permissions” - runners and their experience.

Runner’s Intent

Yes, someone else will use our infra. Surprise, surprise. We can call them Users, Internal Customers or Runners. It is crucial to remember that whoever (or whatever) runs it may have little to no knowledge about our infra.

Don’t get me wrong, Documentation should certainly remain a cornerstone of your infrastructure. Nevertheless, we should rely on it only loosely on our Runner Experience Design quest. The reason is simple: when business velocity is high, it is unrealistic to expect Runners to use Documentation as their first entry point into the actions they perform. We all know that Documentation gets outdated and stale, and sometimes smells of mothballs, so we should find an additional, more reliable aid for our team and ourselves.

What I’ve discovered is that there is always something a Runner can rely on: an obvious and natural way to perform the desired action, one that is easy to understand in a short amount of time. If you are familiar with User Experience Design, or UX, you know that certain conventions let us predict what the User expects: the “YES” button should be green, not red. It gets confusing otherwise, doesn’t it? We can and should apply similar principles in our systems.

Let’s try to guess what a Runner’s goals could be when first using our Infrastructure as Code:

As a Runner I may want to:
  • Deploy infra: I need to have everything I need to get my infrastructure.
  • Deploy code: I need to quickly see how my code performs in the current / new environment.
  • Destroy all: I want to go home. I don’t want to incur unnecessary charges.

These assumptions help us lay down a proper design with an obvious way to call the deploy.infra, deploy.app and destroy.all actions, as well as aliases like deploy and destroy, which generalize the Runner’s intent even more.
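The action names above could be wired into a small task runner. The following is a minimal sketch in Python, not a definitive implementation: the `ACTIONS` registry, the `ALIASES` table and the `run` function are hypothetical names introduced for illustration, and the action bodies are placeholders rather than real deployment logic.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping runner-facing action names to functions.
ACTIONS: Dict[str, Callable[[], str]] = {}


def action(name: str) -> Callable:
    """Register a function under a runner-facing action name."""
    def register(func: Callable[[], str]) -> Callable[[], str]:
        ACTIONS[name] = func
        return func
    return register


@action("deploy.infra")
def deploy_infra() -> str:
    # Placeholder: a real action would invoke the infrastructure tooling here.
    return "infra deployed"


@action("deploy.app")
def deploy_app() -> str:
    # Placeholder: a real action would ship the application code here.
    return "app deployed"


@action("destroy.all")
def destroy_all() -> str:
    # Placeholder: a real action would tear everything down here.
    return "everything destroyed"


# Aliases expand to one or more concrete actions, executed in order.
ALIASES: Dict[str, List[str]] = {
    "deploy": ["deploy.infra", "deploy.app"],
    "destroy": ["destroy.all"],
}


def run(name: str) -> List[str]:
    """Run an action or alias and return the result of each step."""
    steps = ALIASES.get(name, [name])
    unknown = [step for step in steps if step not in ACTIONS]
    if unknown:
        # Tell the Runner what is available instead of failing silently.
        known = ", ".join(sorted(set(ACTIONS) | set(ALIASES)))
        raise ValueError(f"Unknown action '{name}'. Available: {known}")
    return [ACTIONS[step]() for step in steps]
```

With this shape, a first-time Runner only needs to know `run("deploy")` and `run("destroy")`, and a mistyped action name answers with the list of valid ones.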

By creating a good Runner Experience Design we are simply being nice to other people: a new employee who’s just starting, a colleague who wants to help us, or the boss to whom we’ve said “sayonara”. All of them will be able to get the infra up and running very quickly if we follow these practices carefully.

If you loved/hated the "boss example..."
The Runner Experience Design approach actually brings more value to you as a leader and a team member. The key is to be vocal about what you’re trying to do, and to have the stakeholders understand that you are reducing Human Bottlenecks by implementing self-documenting tools. If you do it right, you will most likely have fewer reasons to say goodbye to your boss or your team.

It is not very hard to guess the typical path that a Runner might want to take. It is not very hard to observe and evolve your Runner Experience Design altogether; you just need to think of ways of being nice to other people.

The Good of Default

Following from the previous thought, we naturally aim at first-time Runners the most.

We must embrace sane defaults that align with our business values and with the environment our action is executed against. Let’s not overestimate a first-time Runner’s desire for customization. Generally, their primary goal is “to get it up”, so each parameter that can have a sane default must have one.

Values that can have sane defaults
  • Instance Size: t3.micro (set in variables.tf - vs no default at all)
  • Environment Name: dev (inherited from the reliable parent source - vs no default at all)
  • AWS Account Number: 123456789012 (derived by aws-cli - vs no default at all)

On the other hand, setting too many defaults can be dangerous, for instance when we make assumptions about the environment name without a reliable source of truth. We could hardcode the dev environment, assuming it would be the first-time Runner’s choice, but that may not be true: they may simply have forgotten to set their environment name to userenv. This can lead to unexpected or even catastrophic consequences, so it is better to find an abstraction layer from which those defaults can be derived more reliably, for example the user’s AWS account.
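The difference between a safe default and a risky guess can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tooling used here: `resolve_settings`, the `INSTANCE_SIZE` and `ENV_NAME` override keys, and the `account_environments` mapping from AWS account ID to environment name are all hypothetical. In practice the account ID could be obtained from the AWS CLI, for example with `aws sts get-caller-identity`.

```python
from typing import Dict, Mapping, Optional

# Safe to hardcode: cheap, harmless and easy to override.
DEFAULT_INSTANCE_SIZE = "t3.micro"


def resolve_settings(
    overrides: Mapping[str, str],
    account_id: Optional[str],
    account_environments: Mapping[str, str],
) -> Dict[str, str]:
    """Fill in defaults, but never guess the environment name."""
    instance_size = overrides.get("INSTANCE_SIZE", DEFAULT_INSTANCE_SIZE)

    # An explicit choice by the Runner always wins.
    env_name = overrides.get("ENV_NAME")

    # Otherwise derive it from a reliable source: the account in use.
    # (Hypothetical mapping; assumes each account hosts one environment.)
    if env_name is None and account_id is not None:
        env_name = account_environments.get(account_id)

    # No reliable source means no default: fail loudly instead of
    # silently deploying to an environment the Runner did not intend.
    if env_name is None:
        raise ValueError(
            "Cannot derive an environment name; set ENV_NAME explicitly."
        )

    return {"instance_size": instance_size, "env_name": env_name}
```

The instance size gets a hardcoded default because a wrong value is cheap to correct, while the environment name is only ever taken from the Runner or derived from the account, since a wrong value there is the catastrophic case described above.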

So please, always think of sane defaults as a way of being nice to people. Trust me, they will appreciate it and, as a result, they will be able to run your tool/infra with much less effort.

Human Bottlenecks

I’ve written about Human Bottlenecks before on this blog. They greatly decrease the quality of Runner Experience, as they block, confuse, and create uncertainty and toil. The fewer of them you have, the better the Runner Experience becomes.

When a Software Engineer needs something done on the infrastructure level, they can follow two paths to get it done (in addition to doing it themselves from scratch).

Reach out to SRE

(or so-called DevOps Engineer)

This path can offer a good Runner Experience in the short term, as the Site Reliability Engineer is the one providing it. Generally it follows this formula: $R_x = Ha$, where $R_x$ is Runner Experience, $H$ is Human and $a$ is availability.

The game changes drastically as soon as the SRE’s availability decreases (not a Highly Available SRE, pun intended). The Software Engineer doesn’t get enough attention, features blocked by infrastructure never get shipped, and frustration intensifies. So if $a=0$, then $R_x = 0$, which is obviously not a great Runner Experience.

Rely on automation

(the right answer)

Software tools are not humans; they should be able to scale indefinitely. Each process that can be automated should be automated. It sounds amazing when it works, although the human factor still cannot be ignored.

If the Software Engineering team can’t efficiently use the tool that the SRE developed, that SRE becomes a Human Bottleneck, as human communication becomes a requirement for a Software Engineer to complete their task.

The bus factor

Velocity of the business is everything. What happens to the team if they are left without a human SRE to support them? Will they continue using the tools for their day-to-day operations? Will they hit the wall and not know what to do?

In any case, good Runner Experience dictates that you should build your architecture solutions with the team’s context in mind. I usually ask myself a question:

If I build an infrastructure for someone this way, would they be able to use and support it if I leave tomorrow for $∞$ days?

The level of dependency on the Site Reliability Engineer for day-to-day operations depends on the answer to the question above.

Can’t operate without an SRE

My team won’t be able to operate without me.

If the answer is “no”, then it’s important to understand why.

  • Is it due to Human Bottlenecks?
  • Is it due to the lack of team’s expertise?
  • Is it due to the unobvious ways of running it (bad Runner Experience Design)?

The only way to know for sure is to set up some pair programming time with the team, ask them to deploy the Infrastructure, and watch. Whatever the reason turns out to be, the Runner Experience Design is clearly not ideal, as your users won’t be able to use the system you’ve designed without your manual intervention.

Can operate without an SRE

My team will be able to operate the infra normally. They already do.

If the answer is “yes”, then everything is good. There is no Human Bottleneck on the Site Reliability side and the team can function without an extra hand for some time, which is a great benefit for both the business and the people in the company.

Not enough data

If the answer is “not sure”, then your observation skills need to be improved. There should always be a binary answer to this question, and you should always keep it in mind.

Conclusion

To summarize, Runner Experience Design can help you and your team maintain a clear context and information pool without putting all the burden on tools like Documentation or Training. Design your system in a human-friendly, intuitive way, and you’ll see current and new team members improve their engagement levels, and business velocity overall.

If you’d like to learn more about Runner Experience Design and how I utilize human-friendly SRE, follow me on Twitter: theAutomationD.

If you found a typo or have something to add, feel free to jump into the comments. I’d love to have a healthy discussion about the pros and cons of the Runner Experience Design approach!


Post photo by dnevozhai