Failover happens when a critical system failure cannot be resolved. It is typically the last resort, used once every other possible resolution has been exhausted. Failover is “a method of protecting computer systems or software from failure, in which standby equipment automatically takes over when the main system fails.” This may sound academic, but in practice it is extremely hard to do.

Failover involves orchestration, a step-by-step process for restoring critical applications and systems. In fact, a failover can represent hundreds of very complex interactions and steps within the application, within the operating system, and between application systems. All of these must be orchestrated perfectly for the process to work smoothly. This is typically too complex and time-consuming for manual procedures, which do not hold up in large-scale continuity events. Executing a manual failover process increases risk tremendously and greatly lengthens the actual recovery time, making it much harder to meet the recovery time objective (RTO). Therefore, to ensure the accuracy and efficiency of the process, automation is required. Diagram 1 illustrates the complexity of a failover process between application systems.

Diagram 1

The failover process can be likened to a high-quality analog wristwatch. Such a watch can have dozens of very small, precision-crafted parts designed to work together harmoniously to achieve one thing: the accurate measurement of time. This is how failover orchestration works! Many small, precise processes must be executed in the right sequence to produce a reliable restoration of business services.
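To make "the right sequence" concrete, here is a minimal sketch of how an orchestration tool might derive a safe execution order from step dependencies. The step names and the dependency map are hypothetical, chosen only for illustration; a real failover plan would contain far more steps. The ordering itself uses Python's standard `graphlib.TopologicalSorter` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical failover steps, each mapped to the steps that must finish first.
# These names are illustrative assumptions, not taken from any product.
DEPENDENCIES = {
    "fence_failed_primary": set(),
    "promote_standby_database": {"fence_failed_primary"},
    "start_application_services": {"promote_standby_database"},
    "redirect_client_traffic": {"start_application_services"},
    "verify_business_services": {"redirect_client_traffic"},
}


def failover_order(dependencies):
    """Return the steps in an order that respects every dependency.

    Raises graphlib.CycleError if the plan contains a circular dependency.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for number, step in enumerate(failover_order(DEPENDENCIES), start=1):
        print(f"{number}. {step}")
```

Declaring dependencies rather than hard-coding a list means a circular or missing prerequisite is caught when the plan is built, not in the middle of a continuity event.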

When you apply automation to this process, you get fast and, more importantly, reliable recovery! Automation standardizes the process to the point where the results become predictable. This is a key point! You must take the human element out of the failover process, because people can make mistakes no matter how well the process is documented. The real work is knowing which continuity technologies optimize the failover process and understand Windows Server or other operating systems well enough to lead to recovery. Recovery is the subject of our next article.
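As a rough illustration of what "standardized and predictable" can look like, the sketch below runs a fixed list of steps the same way every time, stops at the first failure, and compares the elapsed time with the RTO. The function name `run_failover`, the report fields, and the placeholder steps are all assumptions made for this example; real steps would call into whatever continuity tooling is in use.

```python
import time


def run_failover(steps, rto_seconds, clock=time.monotonic):
    """Run each (name, action) pair in order and stop at the first failure.

    Returns a report with the completed steps, the failed step (if any),
    the elapsed seconds, and whether the run finished within the RTO.
    """
    started = clock()
    completed = []
    failed = None
    for name, action in steps:
        try:
            action()
        except Exception as error:
            # Record which step failed and why, then halt the run.
            failed = (name, str(error))
            break
        completed.append(name)
    elapsed = clock() - started
    return {
        "completed": completed,
        "failed": failed,
        "elapsed_seconds": elapsed,
        "within_rto": failed is None and elapsed <= rto_seconds,
    }


if __name__ == "__main__":
    # Placeholder actions; each lambda stands in for a real automated step.
    plan = [
        ("promote_standby", lambda: None),
        ("redirect_traffic", lambda: None),
    ]
    print(run_failover(plan, rto_seconds=900))
```

Because the same plan runs identically on every test and every real event, the recovery time can be measured in rehearsal instead of being discovered during an outage.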

If you missed the third part of the series, please read “The Real View of a Disaster – Resolution”. If you would like to read a short eBook on this topic, download “Anatomy of an Outage” by Neverfail. I’m sure you will find it interesting!
