In the second article in this series, we will consider how an IT team can create situational awareness during a continuity event.

Awareness is defined as “knowledge or perception of a situation or fact”. Knowing that something happened doesn’t necessarily mean that people understand what is actually happening to their most critical applications, or the far-reaching implications for the business. That’s why it’s important to be patient and assess the entire situation carefully. Many things can go wrong, such as human error, application problems, hardware failures, network interruptions, cyber attacks, natural disasters and data corruption.

Why is having a complete view of the incident so important? Simply put, it provides perspective, because you will soon need to get to the root cause of the incident. Without understanding the whole situation, people tend to go directly into recovery mode and miss the reason why the continuity event happened, so it’s destined to happen again.

What happens if you do have a server outage? Let’s take a use case where a critical application server hits a blue screen. It may not be obvious immediately, but eventually the failure brings down the entire mission-critical application. At a minimum, you find out that your system has gone offline when seriously impacted users start to call the help desk. By then you’re probably many minutes into the event. At worst, you don’t see the downstream consequences of this one event as other systems are affected, causing a much broader outage. It can take minutes to hours to assess the full extent of the outage before any steps can be taken to recover the service. In the meantime, the users and applications are impacted.

If you have ever experienced a system-wide outage with a cloud service, you know that in many cases it takes a bit of time to restore services to normal. That’s because providers are careful to ensure they understand the problem before they bring services back online. This is a carefully orchestrated process. In the long term, this situational analysis helps avoid the issues that caused the problem in the first place and improves overall service delivery and uptime.

Ideally, you would want automation that enables application awareness to test for problems and initiate resolution. This type of intelligence can speed up getting to the recovery process by removing the human element as much as possible. The more humans are involved, the longer it takes to move the process forward to full recovery, so automation is essential! Billions have been spent on infrastructure monitoring software that can detect issues like this. What’s missing is leveraging this data to actively automate recovery for continuity restoration purposes. That said, without automation it still takes human intervention to move to the next step in the Anatomy of an Outage.
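To make the idea of monitoring data driving recovery a little more concrete, here is a minimal sketch in Python. It is an illustration only, not any product’s implementation: the `monitor` function, its `probe` and `recover` callbacks and the `failure_threshold` of three are all hypothetical names and values chosen for this example. The probe stands in for whatever health check your monitoring software already performs, and the recovery hook stands in for a failover or restart action.

```python
import time
from typing import Callable


def monitor(
    probe: Callable[[], bool],
    recover: Callable[[], None],
    checks: int,
    failure_threshold: int = 3,   # assumed value; tune for your environment
    interval_seconds: float = 0.0,
) -> bool:
    """Run up to `checks` health probes.

    Calls `recover` once when `failure_threshold` consecutive probes fail,
    then stops. Returns True if recovery was triggered, False otherwise.
    """
    consecutive_failures = 0
    for _ in range(checks):
        if probe():
            # A healthy response resets the counter, so a single blip
            # does not trigger recovery.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                recover()
                return True
        time.sleep(interval_seconds)
    return False
```

Requiring several consecutive failures before acting is one simple way to avoid reacting to a momentary glitch while still moving to recovery without waiting for users to call the help desk.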

How do you become situationally aware? There are five basic principles involved:

  1. Identify all the possible systems that could be affected by the outage.
  2. Understand the symptoms being reported by users.
  3. Predict what could happen when systems are offline.
  4. Stay vigilant until you know all the facts of the incident.
  5. Draw from similar documented experiences.
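The first principle above, identifying every system that could be affected, can be sketched as a walk over a dependency map. The example below is a hedged illustration: the `affected_systems` function and the system names in `DEPENDENTS` are made up for this post, and it assumes you already maintain a record of which systems depend on which.

```python
from collections import deque


def affected_systems(dependents: dict[str, list[str]], failed: str) -> list[str]:
    """Return every system downstream of `failed`, nearest dependents first.

    `dependents` maps a system to the systems that rely on it directly.
    """
    seen = {failed}
    queue = deque([failed])
    impacted = []
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            if downstream not in seen:
                seen.add(downstream)
                impacted.append(downstream)
                queue.append(downstream)
    return impacted


# Hypothetical dependency map, for illustration only.
DEPENDENTS = {
    "db-server": ["order-api", "reporting"],
    "order-api": ["web-store"],
    "reporting": [],
    "web-store": [],
}

print(affected_systems(DEPENDENTS, "db-server"))
# prints ['order-api', 'reporting', 'web-store']
```

Even a simple map like this helps answer the question “what else breaks if this server is down?” before the downstream effects show up on their own.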

In an outage situation, having the best situational awareness enables organizations to react more efficiently and allows for improved process execution. What comes next? Once you have awareness of a situation, you can start the resolution process. This is the subject of the next post in the series.

If you missed the first part of the series, please read “The Real View of a Disaster – Understanding the Anatomy”. If you would like to read a short eBook on this topic, download “Anatomy of an Outage” by Neverfail. I’m sure you will find it interesting!
