PyData London 2024

5 Things I Learnt from Causing a Cloud Provider Outage
06-16, 15:45–16:25 (Europe/London), Salisbury

Earlier this year, my team caused an outage across Europe for a major cloud provider. The incident response taught me a lot about working with cloud data lake systems at massive scale. How do you make these systems performant, resilient, and easy to maintain? And of course, how do you stop them from behaving like a DDoS attack on the cloud provider?


My team allows cybersecurity researchers to find needles in haystacks. We let researchers dynamically hunt for a single IP, a bunch of hashes, or particular domains across petabytes of telemetry in thousands of different data lakes.
The root cause of the outage was an overload of the cloud provider’s metadata servers. Since we allow users to make dynamic queries - i.e. run arbitrary SQL across thousands of different data lakes - one query triggers thousands of IAM checks. Consequently, the outage was resolved by freezing our team’s account.
For us, this meant we had a huge backlog and many anxious users. We now had a big new problem - how to empty the backlog without getting locked out of our account again? We solved this problem; here are the 5 most important things I learnt from doing so.
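To make the backlog problem concrete, here is a minimal sketch of one generic way to drain queued queries without flooding a provider: a token bucket that paces every per-lake sub-query. This is an illustration only, not necessarily how we solved it. `TokenBucket`, `drain_backlog`, `run_on_lake`, and the rate and capacity values are all hypothetical names and numbers chosen for the example.

```python
import time
from collections import deque
from typing import Callable, Iterable, List, Tuple


class TokenBucket:
    """Allow at most `rate` operations per second, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if rate <= 0 or capacity < 1:
            raise ValueError("rate must be > 0 and capacity must be >= 1")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock  # injectable so the pacing can be tested without waiting
        self._sleep = sleep
        self._tokens = float(capacity)
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last refill, capped at capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        self._refill()
        while self._tokens < 1:
            self._sleep((1 - self._tokens) / self.rate)
            self._refill()
        self._tokens -= 1


def drain_backlog(
    backlog: Iterable[str],
    lakes: List[str],
    run_on_lake: Callable[[str, str], object],  # hypothetical per-lake executor
    bucket: TokenBucket,
) -> List[object]:
    """Run every queued query against every lake, one paced sub-query at a time."""
    results = []
    pending = deque(backlog)
    while pending:
        query = pending.popleft()
        for lake in lakes:
            # Each sub-query is assumed to trigger one permission check upstream,
            # so the limit is applied per sub-query, not per user query.
            bucket.acquire()
            results.append(run_on_lake(query, lake))
    return results
```

The design point is where the limit sits: because a single user query fans out to thousands of lakes, pacing whole queries would still allow thousands of checks in a burst, so the sketch spends one token per lake.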
The Five Lessons (30m, roughly 6m each)
1. How do you do effective Root Cause Analysis (RCA)? I found this was mostly a non-technical issue and more of an emotional one! During the outage, I felt that everyone was hurrying to just get the system back online. To figure out the root cause, I felt we had to resist the pressure to slap a bandaid on and persistently dig a little deeper.
2. Build software with the expectation that things will go wrong. But how do you actually do that on a dynamic, distributed query system? It’s a lot more complex than just returning a status code from an API. There are many different components, everything is asynchronous, and users depend on all of it.
3. How do you create a good user experience? From a user’s point of view, they just want to get the answer to their query as fast as possible. They do not care how much data they are querying or how complicated your fancy distributed system is, nor should they. However, typically we don’t design huge distributed query systems from a UX point of view - we only think about the backend!
4. Eliminate room for user error. Our users often write complex SQL and if they make a mistake, it takes ages for us to debug it. Imagine finding one bracket in the wrong place in 100 lines of SQL - it is slow! How can you let users write dynamic queries while leaving less room for mistakes?
5. Monitoring a crucial system with many different components is not the same as monitoring a system with one or two components. More components mean things break in more places, which means more alerts, and that means more noise. So how do you pick the metrics you actually need to pay attention to?
Conclusion (5m)
A theme that runs through these five lessons is that communication can be more important than technical detail. Even RCA turned out to be more about team communication dynamics than technical acumen! I think we often discount this when we consider what makes a good software engineer.
You learn by breaking things and putting them back together. But typically you don’t break a huge system that tons of people depend on for urgent work. I learnt so much from this experience that it made me want to break other similar systems just to figure out how to put them back together. Obviously, you can’t really do that - but you can pretend! Next time I build something, I will definitely ask more questions like “What happens if this goes down for 2 weeks because of something that is 100% out of our control?”.


Prior Knowledge Expected

Previous knowledge expected

Alexander is a software engineer working in cybersecurity at Palo Alto Networks. He first got into coding because he speaks Russian and wanted to automate reading endless Russian disinformation.

He cares about digital rights, inclusion in tech, writing safe code and contributing to the open source community.

When he isn't coding, he likes to drink at least 5 coffees a day and play basketball.