An IT Ops Detective Story - Mystery of the Website Crash
Updated: Mar 26, 2020
It was a warm and lazy Saturday afternoon. After several weeks of intense release work, the company employees were relaxing with friends and family - at a football game, at a birthday party, hiking outdoors, or simply tinkering around the house.
Alas, these pleasant times were not meant to last for long. The company website decided it was time to stop working and soon the dreaded message from the Crisis Manager was on its way - All hands on deck! The firefight had to begin.
A group of more than 25 subject matter experts and managers duly gathered in the virtual war room, but needless to say the atmosphere was not a happy one. No one was looking forward to losing their weekend to yet another firefight (and weren’t these crisis situations getting a bit too frequent these days?).
To motivate his team, the new VP of IT Ops got actively involved. Outages directly affect the corporate revenue, and he wanted to lead his new team during the first significant crisis on his watch. Let’s get it resolved quickly, he said, so we can get back to our weekend fun.
Unfortunately, the war room seemed to suffer from a loss of direction. Some seemed focused on deflecting blame, while others wanted to help but didn't quite know how. Do we think it is a network problem? Nope, said the Net Admin. Nothing has been changed. How about a database issue? Can’t be, said the DBA. All is well on the DB front.
By now, the initial message from the crisis manager had grown into a thread of more than 50 replies that seemed to lead into an abyss of discussions and sub-discussions of I-say-he-says-she-says-we-say messages. It was becoming a challenge to follow what people were thinking and doing, and where the investigation should be headed.
Then someone suggested that the newly deployed app code could have caused the crash, and a consensus soon formed that it was the most likely culprit.
And so the spotlight shifted to the dev manager of the app. As a newbie on the team, she felt she and her team had been presumed guilty without much evidence. Nonetheless, this was a good opportunity to prove her team's skills and her own worth.
She called up her senior developers, and together they started exploring the evidence to unravel the mystery of the website crash. Frequent updates to management and other stakeholders were necessary, as they were anxious to hear the findings. The chain of messages and replies grew deeper and more complex, with details of various actions taken, their outcomes, and decisions made. The app team had to keep switching between communicating and investigating, and the pressure of the situation became suffocating for all.
As the afternoon rolled over into the evening and it started to darken outside, there they were, still digging for clues to solve the mystery. Finally, they reverted to an older version of the app code - and it made no difference. The website still would not come up. Voila! Innocence proved.
Now the burden of investigation would have to shift to some other team. So, who would be next? It was a moment of nail-biting suspense.
To be continued….
Do we need a better way to handle IT Ops incidents? We at smartQED say YES! With its new cognitive paradigm for collaborative problem solving and ML for augmented intelligence, smartQED OpsSpace is transforming the way teams resolve problems - whether an issue is big or small, urgent or not, and whether it involves 3 people or 300. To learn more about our innovative patent-pending technology, send me a demo request.