The other day, a severe weather event in my area caused a lengthy outage of our cable service, which in turn cut off my connectivity to the enterprise. Luckily, I had a backup plan and was able to reconnect using my cellular 3G card. It got me thinking about all of the possible points of failure that can affect us in the enterprise. In a typical enterprise, there are many moving parts that must all work together to provide end users with seamless connectivity and high availability.
Even with an ever-increasing reliance on other modes of communication such as email, instant messaging, and even social networking, the end user still has no tolerance for a voice platform failure. You can probably blame your local telephone company for setting this expectation, since it had decades of experience building highly redundant and highly available voice networks. This expectation carried over into the enterprise with the advent of the PBX. These platforms were hardened appliances engineered for maximum uptime, and many PBX providers offered maintenance services that supplied critical spares to customers willing to pay for them. In short, voice failed only on the rarest of occasions, and when it did fail it was fixed promptly. So it's completely legitimate and logical to ask whether we can obtain the same kind of resiliency in a world where communications have moved from TDM to IP, from hardware to software, and from separate networks to networks converged with all other enterprise traffic, and are even embedded into line-of-business applications. The obvious answer to this question is "of course we can."
In order to achieve the same level of business continuity in this brave new world, it's critical to ensure that you have the right architecture. The attributes of such an architecture are varied but some of the critical ones are listed below.
Active/Active vs. Active/Standby: In an active/active architecture, all registered endpoints are cached on both sides of the cluster. Most enterprise voice platforms today support active/standby, in which a re-registration process must take place when one side of the cluster fails. This is completely unsatisfactory in an enterprise production environment today. With active/active, no re-registration is necessary, and when one side of the cluster fails the end user notices no difference. Users continue to place calls and invoke features. Furthermore, the architecture is built so that one side of the cluster alone is enough to support all subscribers.
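To make the active/active idea a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not how any particular voice platform implements clustering: the `Node` and `ActiveActiveCluster` classes, the node names, the extension, and the contact address are all hypothetical. The point it shows is that when every registration is written to both sides, the surviving side can keep routing calls without any re-registration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One side of a hypothetical two-node call-control cluster."""
    name: str
    alive: bool = True
    # Maps an endpoint (e.g. an extension) to its contact address.
    registrations: dict = field(default_factory=dict)


class ActiveActiveCluster:
    """Writes every registration to all live nodes so either can serve calls."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def register(self, endpoint, contact):
        # Replicate the registration to every node that is currently up.
        for node in self.nodes:
            if node.alive:
                node.registrations[endpoint] = contact

    def route_call(self, endpoint):
        # Any surviving node answers from its own copy of the registrations.
        for node in self.nodes:
            if node.alive and endpoint in node.registrations:
                return node.name, node.registrations[endpoint]
        raise LookupError(f"{endpoint} is not registered on any live node")


# Hypothetical names and addresses, for illustration only.
cluster = ActiveActiveCluster([Node("side-a"), Node("side-b")])
cluster.register("ext-1001", "10.0.0.15:5060")

cluster.nodes[0].alive = False  # simulate losing one side of the cluster
serving_node, contact = cluster.route_call("ext-1001")
print(serving_node, contact)
```

In an active/standby design, the equivalent of `route_call` would fail after the outage until the endpoint had registered again with the standby side, and that gap is exactly what the end user notices.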
Multiple levels of hardware/software redundancy: Simply put, this means no single point of failure in the core call control platform. From a hardware perspective, this is normally handled by replicating all critical hardware interfaces and components in a server (hard drives, processors, network interface cards, and power supplies). The software side is much more complex. Linux as the underlying OS is critical. In addition, clustering functions are required that work in concert with both the OS and the application to optimize the failover process when it becomes necessary.
From Embedded to Standalone: Some UC architectures today have intertwined call control with other communication channels so completely that a failure in the core architecture means all communication channels are affected. The architecture must be able to fall back to a basic voice capability should the application platform be impaired.
Standalone Survivability: As I mentioned previously, there are many moving parts in the enterprise today, so even with the most resilient architecture there is always the possibility that network hardware, or the network in general, will fail. When this happens, a typical customer may not have a backup connection. That means the individual site must be able to stand alone and provide basic services to the end user as required. It must do so in a seamless manner and must restore service to the core platform when the network is available again.
Some architectures provide these standalone functions in combination with the routing functions. This can be a significant issue if that platform fails, because all services in the branch will then be affected.
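The standalone survivability behavior described above can be sketched as a small state machine. Again, this is just a rough illustration under my own assumptions: the `BranchSite` class, the heartbeat check, and the threshold of three missed heartbeats are hypothetical and not taken from any specific product. The branch drops into a basic local mode after several failed reachability checks of the core, and returns to the core as soon as it is reachable again.

```python
from enum import Enum


class Mode(Enum):
    CORE = "core"              # normal operation via the central platform
    STANDALONE = "standalone"  # WAN down: branch provides basic services itself


class BranchSite:
    """Hypothetical branch site that falls back to local call handling."""

    def __init__(self, failures_before_fallback=3):
        self.mode = Mode.CORE
        self.failures_before_fallback = failures_before_fallback
        self._missed = 0

    def on_heartbeat(self, core_reachable):
        """Feed in the result of each periodic reachability check of the core."""
        if core_reachable:
            # Network is back: clear the counter and return to the core platform.
            self._missed = 0
            self.mode = Mode.CORE
        else:
            self._missed += 1
            if self._missed >= self.failures_before_fallback:
                self.mode = Mode.STANDALONE
        return self.mode

    def place_call(self, number):
        if self.mode is Mode.CORE:
            return f"core platform routes {number} with full features"
        return f"branch routes {number} locally with basic voice only"


site = BranchSite()
history = [site.on_heartbeat(ok) for ok in (True, False, False, False, True)]
print([mode.value for mode in history])
```

Requiring several missed heartbeats before falling back is a design choice that keeps a single dropped packet from flipping the site into standalone mode. The right threshold depends on how quickly users need service restored versus how noisy the link is.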
Beyond the attributes mentioned above, I believe it's possible to take resiliency even further. With the right combination of software, hardware, and networking complementing each other, resiliency can be extended and multiplied. Take the example of my cable service outage from the beginning of this post, and extend the principle of using a wireless backup to restore service to the enterprise space. With the advent of 3G, and soon 4G, it would be possible to provide this alternate network connection on demand. Take this concept a step further and you can envision an environment where virtualized software functions can be moved from data center to data center, ensuring that call control is always available.
In my opinion, we are nearing the day when we will raise the bar for resiliency that was originally set by Ma Bell a long, long time ago.