Charity Majors @mipsytipsy CTO @honeycombio, ex-Parse, Facebook, Linden Lab; cowrote Database Reliability Engineering; loves whiskey, rainbows. I test in production and so do you. 🌈🖤 Jun. 04, 2019 7 min read

OH this morning: "drink every time someone quotes charity" (#monitorama channel, honeycomb user slack)

... i miss you all! wish i was in portland with you! *blows kisses*

so i have a treat for y'all! i'm going to live tweet @akvanh/@alainadev's talk on "Creating a Culture of Observability".

I first saw the talk three months ago, and it made my knees tremble. I learned *so much*, and have been meaning to share it ever since.

 https://www.youtube.com/watch?v=yz517w36PLE&feature=youtu.be&t=1960 

Alaina and Alyson are both software engineers at @honeycombio. But they aren't ex-googlers, and they weren't super senior when they joined. They went to school for other careers, then retrained at a hacker school -- I think this is their second programming job?

This talk is entirely their story. How did they learn about observability, how did it help them level up faster? How did they and their teammates build a culture of shipping changes fast, with high confidence? How do they respond to failures?

Without further ado.

First off: what is observability? They define it as encompassing both systems *and users*; o11y is about understanding the intersection of what your users are doing on your production systems.

Users being the tiny agents of chaos that they are, unknown-unknowns are key.

Conversations about o11y tend to focus on tooling. But what if you had one of the engineers on your team instrument your app, and everyone else on the team always went to them to ask questions about how it worked?

Would that team have achieved observability?

No, they say. The team is no more empowered if only one person has access to this critical system information. All you've done is create yet another (human) SPOF.

Cultural practices need to shift in order to achieve and maximize the value of observability.

Our customers often ask us, "how can we get more people on board? how can we get people excited about observability?"

Alyson and Alaina witnessed this firsthand, in their transition from their former company to honeycomb.

In this talk they will describe some of the differences in culture between their old co and here that they believe have created a positive feedback loop, where all team members -- including nontechnical ones -- are excited about observability and use it in their day to day work.

And this shit matters! Because of these cultural shifts, they've witnessed a faster development cycle, faster time to resolution, and improved software quality. At a high level:

* no SPOFs,
* democratizing knowledge,
* increasing thoughtful discussions about our systems.

At their old company,

* Questions weren't encouraged. People did their tasks.
* When a service went down, the ops person would say "just restart it"
* New hires were discouraged from touching backend code
* Senior eng often took on pet projects, got defensive/territorial

There were many teams -- frontend, backend, ops, qa -- and no one knew what was going on outside their team. When somebody left the team, it left a huge hole.

These cultural practices kept them from growing as individuals, progressing as teams, or taking ownership over outcomes.

... despite the fact that there were many smart, experienced, dedicated engineers there who genuinely cared and were doing their best. [ed: yes, it was a pretty high caliber team!]

At @honeycombio, they say, their lives are much better with a culture of observability. 🐝🌈

Questions are encouraged. People lust after knowledge, and dive deep into specific questions.

Curiosity is rewarded, because we can ask any question of our production data.

There is a lot less division between teams, and a lot more emphasis on ✨software ownership✨.

If you write a feature, you are responsible for testing it, deploying it, and making sure it works well in production.

While we do have folks whose affinity is more towards frontend or backend, the lines are very blurry.

Based on our past and present experiences, these are some practices that help develop a company-wide culture of observability.

No SPOFs, drop new hires right into the deep end immediately, learn from each other constantly, and facilitate blameless retrospectives.

We put every single engineer on call. We fundamentally believe in the power of software ownership.

People are always going to have areas of expertise, but we want everyone to have the power to jump in and fix things. If only 1-2 people can fix things, that's a fragile team.

Putting all engineers on call has a powerful leveling effect. Everyone is contributing to the quality of the system -- juniors and seniors, frontend/backend/ops -- and asking questions is normal and expected.

Instead of making production some gated kingdom, we get new hires involved on day one. (There is no substitute!)

We have devised a bunch of puzzles for new hires to solve, using honeycomb to debug honeycomb.

Example 1:

"determine the dataset sending us the most traffic in the last two hours, and figure out if it should be rate-limited or not. which customer is responsible for this dataset?"

Example 2:

"we forgot to instrument a new feature, but we know which endpoints are involved. figure out how many users tried out the new feature in the last week. who was using it most often?"

Example 3:

"find a dataset running slow queries. what qualities of the queries seem to make it slow?"

As you can see, these questions start to get at the beating hearts of both how our system works and how our users experience it, and how each affects the other.
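For a feel of what the first puzzle boils down to, here's a rough sketch in Python. This is not Honeycomb's query engine or API; it just assumes you have request events in memory as dicts, and the field names (`dataset`, `customer`, `timestamp`) and sample values are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def busiest_dataset(events, now, window=timedelta(hours=2)):
    """Return (dataset, customer, count) for the dataset with the most
    events inside the window, or None if the window is empty."""
    cutoff = now - window
    counts = Counter(
        (e["dataset"], e["customer"])
        for e in events
        if e["timestamp"] >= cutoff
    )
    if not counts:
        return None
    (dataset, customer), count = counts.most_common(1)[0]
    return dataset, customer, count


# Hypothetical sample events; the last one falls outside the two-hour window.
now = datetime(2019, 6, 4, 12, 0, tzinfo=timezone.utc)
events = [
    {"dataset": "api-logs", "customer": "acme", "timestamp": now - timedelta(minutes=5)},
    {"dataset": "api-logs", "customer": "acme", "timestamp": now - timedelta(minutes=50)},
    {"dataset": "frontend", "customer": "globex", "timestamp": now - timedelta(minutes=10)},
    {"dataset": "api-logs", "customer": "acme", "timestamp": now - timedelta(hours=3)},
]

result = busiest_dataset(events, now)
```

Whether that dataset *should* be rate-limited is the judgment call the puzzle is really about; the count only tells you where to start looking.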

We assign a buddy to the new hire, and the buddy puts the puzzles in context, explaining the goal of each question and why you might want to ask it or explore it.

(Sometimes new hires point out things that look weird or off, and then we get to investigate together!)

We also give an infrastructure overview to each new hire... and then they give the overview to the next new hire. (This might sound mildly sadistic, but we support them and make it fun.)

Having a solid foundational knowledge of the system is important to us.

We have a distributed team, and everyone has different expertise. So it's very important to us to create the opportunity to learn from each other and shoulder-surf asynchronously.

We add annotations inline as we are debugging, and insert markers when important things happen.

Being able to go back and read what your team members were thinking as they debugged, & see what they saw at that point, is critical.

We talk a lot about bringing everyone up to the level of the best debugger in every corner of our systems, by leveraging each other's expertise.

Blameless post mortems are key too. Instead of focusing on the human making the mistake, we focus on how we can minimize the impact of the error.

At their old company, post mortems revolved around the "5 Whys". Eventually, someone would confess "it was me! i fucked up!"

and they would get drilled with "well WHY did you fuck up!" which was really toxic and stressful. 😱

At Honeycomb, A accidentally erased a field from the entire production db. Worse: a cron job ran every 10m, and if that field didn't exist, it would erase all customer data.

A senior engineer congratulated her. "This is great! This exposes a key vulnerability in how we handle production data!"

Turned out, even a customer could have accidentally triggered this behavior via the UI. They restored the database together and improved the system.

... and then initiated A joyfully into the scarred ranks of systems warriors who have all brought down prod.

A was already beating herself up, so having the senior engineers be cheerfully supportive & focused on actionable lessons was a balm and surprise.

Observability isn't just for engineers, either: the whole company uses honeycomb! It improves not just software quality, but also customer support, sales interactions, etc.

A sales engineer can observe a team's usage patterns and recommend a new workflow or feature, etc.

We use the same instrumentation that we use for debugging and understanding systems, for user behavior and support issues.

Instrument once, understand everywhere. And learn from each other by watching our teammates interact with these systems, too.
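To make "instrument once" a bit more concrete, here's a loose sketch in Python. It is not how Honeycomb actually instruments anything; it assumes one wide event per request, and every field name and sample value below is hypothetical. The point is only that the same events can answer an engineering question and a support question.

```python
from collections import Counter


def build_event(team, user_id, endpoint, duration_ms, status):
    """One wide event per request; all field names are hypothetical."""
    return {
        "team": team,
        "user_id": user_id,
        "endpoint": endpoint,
        "duration_ms": duration_ms,
        "status": status,
    }


def slow_endpoints(events, threshold_ms):
    """Engineering question: which endpoints served a request slower
    than the threshold?"""
    return sorted({e["endpoint"] for e in events if e["duration_ms"] > threshold_ms})


def errors_by_team(events):
    """Support question: which customer teams are hitting server errors,
    and how often?"""
    return Counter(e["team"] for e in events if e["status"] >= 500)


events = [
    build_event("acme", "u1", "/query", 1200, 200),
    build_event("acme", "u2", "/query", 90, 500),
    build_event("globex", "u3", "/ingest", 40, 200),
    build_event("acme", "u1", "/ingest", 35, 503),
]

slow = slow_endpoints(events, threshold_ms=1000)
errors = errors_by_team(events)
```

Nothing about the events changes between the two questions; only the person asking does.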

You too can have a culture of observability. ☺️🐝 No matter where your co is in your o11y journey.

When looking at your processes, ask: does this help empower people on my team to ask questions? Can we leverage each other's learning or do we all have to start from scratch?

And that's the end!

The first question is about how to bootstrap a culture of observability, because at most places this is seen as privileged knowledge held by just a few, and most engineers aren't used to having their curiosity encouraged or rewarded. [ed: sad but true 😕]

Alyson suggests pairing, notes it starts on day one with our onboarding q's. Alaina says curiosity is highly contagious. Most people have a natural inquisitiveness that they've learned to suppress because the tooling is so painful.

People ❤️ finding answers to questions!!

I loved this talk. It was so amazing to hear about their experiences ... starting here at honeycomb, learning our systems, reporting how they've grown. we are truly lucky to have @alaina_dev 🐝 and @akvanh. 🐝

If you'd like to watch for yourself:  https://www.youtube.com/watch?v=yz517w36PLE&feature=youtu.be&t=1960 

