Keeping Cloud Services On Tap

March, 2012

How do providers of high availability cloud-based or hosted contact center services prevent failure and disaster from becoming loss of service?

This month we look at failover, one of the key pillars of delivering high availability cloud-based or hosted contact center services. In particular, how do you prevent failure and disaster from becoming loss of service? And what questions should users/ service providers ask of vendors such as ourselves?

We in the developed world are blessed with utilities on constant supply: clean water, electricity, an Internet connection. And we are surprised and outraged at the inconvenience caused when one of these systems is interrupted: no cup of tea, no light, no instant connectivity. Horrors!

For the call center, interruption of core cloud/ hosted services such as IP bandwidth, telephony, call control, etc, is more than inconvenience; it can mean bad customer service, loss of revenue and loss of reputation.

Although 100% uptime is the ideal, the challenge of real-time processing in the call center makes this impossible. Why? Read on.

Even apart from this, the reality is that without a Department-of-Defence-sized budget, the most users can expect is ultra-high uptime. Scheduled replacement or failure of hardware, network, power, voice carrier, etc, can all cause downtime.
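To put ‘ultra-high uptime’ into numbers, it helps to convert an availability percentage into the downtime it allows each year. The following is a minimal sketch in Python; the availability targets listed are illustrative assumptions, not figures quoted by any particular provider.

```python
# Convert an availability percentage into the downtime it allows per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted at this availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Illustrative targets only; a real service level is a commercial decision.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime a year")
```

On these assumptions, 99.9% uptime still allows nearly nine hours of downtime a year, while 99.999% allows a little over five minutes.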

From a software perspective, downtime is usually caused by some form of outage:

  • planned – If a software platform is not designed to be upgraded on the fly, upgrades can cost minutes, even hours; not good for a ‘high availability’ system.
  • unplanned – the result of a failure somewhere in the system. Some failures can be foreseen, e.g. lack of resources (memory, disk space, etc); with careful planning and appropriate system monitoring, these can be eliminated. Others are unforeseen but inevitable, and can come on any scale, from individual component level (e.g. hard disk, network switch, media gateway) to major disasters (e.g. earthquake, tsunami).
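A foreseeable failure such as running out of disk space can be caught by a simple periodic check. The sketch below, in Python, assumes a warning threshold of 10% free space; the function names and the threshold are hypothetical choices for illustration, and a real deployment would feed the result into its own alerting.

```python
import shutil


def is_low(free: int, total: int, min_free_fraction: float) -> bool:
    """Return True when the free share of a resource is below the threshold."""
    if total <= 0:
        return True  # nothing usable reported: treat as a warning condition
    return free / total < min_free_fraction


def disk_space_low(path: str = ".", min_free_fraction: float = 0.10) -> bool:
    """Check the disk holding `path`; the 10% default is an assumed threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return is_low(usage.free, usage.total, min_free_fraction)


if disk_space_low():
    print("Warning: disk space is running low")
```

The same comparison can be reused for memory or any other resource that reports a free and a total figure.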

The central question for any vendor/ service provider offering high availability is: how do you prevent failure and disaster from becoming loss of service?

The key is to eliminate ‘single point of failure’ by duplication/ replication of services, a.k.a. software redundancy. But this comes with its own challenges.

The ideal is that every service has a ‘hot standby’ – a secondary service that is constantly running and mirrors the state of the primary. On failure, all dependencies and resources are seamlessly switched over. This is the basis of the World Wide Web, and other carrier networks. But while this works for many processes in the call center, it cannot work for real-time processing (e.g. conferencing/ recording of voice traffic, or dialer/ ACD pacing). Because these services run in real time, their state changes too fast to make persisting it to disk practical. So if a processing service fails, resources cannot simply be switched and normal service resumed. There will be some temporary degradation of service as current sessions end and have to be re-established by the backup system, or as the backup dialer service gets up to speed.

The alternative is ‘cold standby’. In this model, a copy of each service is kept on a separate system (maybe a VM, a different server, even on a different continent) ready to be brought into service when necessary.

But how do we know when this is necessary? For high availability, waiting for someone to notice a failure is not good enough. Action must be taken immediately and automatically. This requires a monitoring service continually asking surrounding services “Are you alive?” If the expected answer “Yes” does not come, a control service is also required to tell the secondary service to start. (Incidentally, the monitoring and control services must themselves have a backup. As Juvenal asked: “Who watches the watchers?”)
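The “Are you alive?” loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any product: the class name, the two callbacks and the rule of waiting for three missed answers before acting are all hypothetical. The threshold is there so that one dropped reply does not trigger a needless failover.

```python
from typing import Callable


class HeartbeatMonitor:
    """Counts consecutive missed heartbeats and triggers failover once."""

    def __init__(
        self,
        is_alive: Callable[[], bool],          # asks the primary "Are you alive?"
        start_secondary: Callable[[], None],   # the control action on failure
        max_missed: int = 3,                   # assumed tolerance, not a standard
    ) -> None:
        self.is_alive = is_alive
        self.start_secondary = start_secondary
        self.max_missed = max_missed
        self.missed = 0
        self.failed_over = False

    def poll(self) -> None:
        """Ask once; a scheduler would call this at a fixed interval."""
        if self.failed_over:
            return  # the secondary has already been started
        try:
            alive = self.is_alive()
        except Exception:
            alive = False  # an error or timeout counts as no answer
        if alive:
            self.missed = 0
            return
        self.missed += 1
        if self.missed >= self.max_missed:
            self.failed_over = True
            self.start_secondary()


# Simulated primary: answers once, then goes silent.
answers = iter([True, False, False, False])
started = []
monitor = HeartbeatMonitor(lambda: next(answers), lambda: started.append("secondary"))
for _ in range(4):
    monitor.poll()
print(started)  # ['secondary']
```

In practice the monitor itself would run in at least two places, for the reason given above.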

Another challenge is that the primary will have been in a particular state when it failed. The secondary must be initialised using the same settings, including any security and licensing. These settings could come from a duplicate configuration file on the secondary server, or in the cloud. The secondary must also be brought up to date with the last known state of the primary, perhaps from an up-to-date ‘current status’ file.
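As a rough illustration of this initialisation step, the Python sketch below reads a duplicated configuration file and, if present, a ‘current status’ file. The use of JSON, the file names and the keys shown are all assumptions made for the example; the article does not prescribe a format.

```python
import json
import tempfile
from pathlib import Path


def load_startup_state(config_path: Path, status_path: Path) -> dict:
    """Combine the duplicated settings with the last saved status, if any."""
    settings = json.loads(config_path.read_text())
    # A missing status file means the secondary starts with an empty state.
    status = json.loads(status_path.read_text()) if status_path.exists() else {}
    return {"settings": settings, "status": status}


# Demonstration with throwaway files standing in for the replicated copies.
with tempfile.TemporaryDirectory() as folder:
    config_file = Path(folder) / "config.json"
    status_file = Path(folder) / "current_status.json"
    config_file.write_text(json.dumps({"licence_key": "EXAMPLE", "max_agents": 50}))
    status_file.write_text(json.dumps({"agents_logged_in": 12}))
    print(load_startup_state(config_file, status_file))
```

Keeping settings and status separate reflects the distinction drawn above: settings change rarely and can be copied in advance, while status is only as good as its last update.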

Finally, all active resources and routing must be switched to the secondary.

After a smooth handover, what happens to the primary? If the cause of the failure was a transient glitch, the best outcome is an automatic restart, with the primary reprovisioning and reconnecting its resources to bring itself back into service. If not, the IT department will be getting their hands dirty.
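The choice between auto-restart and calling in the IT department can be expressed as a bounded retry. The Python sketch below is one possible shape, assuming a hypothetical restart callback that reports success or failure; the number of attempts and the delay are arbitrary illustrative values.

```python
import time
from typing import Callable


def try_auto_restart(
    restart: Callable[[], bool],   # hypothetical: returns True if the service came back
    attempts: int = 3,             # assumed limit before escalating to a person
    delay_seconds: float = 5.0,    # assumed pause between attempts
) -> bool:
    """Try to restart a failed primary a few times; report whether it recovered."""
    for attempt in range(1, attempts + 1):
        if restart():
            return True
        if attempt < attempts:
            time.sleep(delay_seconds)  # give a transient fault time to clear
    return False


# Simulated primary that recovers on the second attempt.
outcomes = iter([False, True])
if try_auto_restart(lambda: next(outcomes), delay_seconds=0):
    print("Primary is back in service")
else:
    print("Restart failed: alert the IT department")
```

Limiting the attempts matters: a primary that keeps failing should be left out of service until someone has found the cause.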

High availability for hosted/ cloud-based call center services cannot be taken for granted. Support for failover must be designed into the software at a deep level. It must be planned for and worked toward, so that service is as seamless as possible.

Next time you turn on the water tap, remember to count your blessings and thank the Romans for pioneering a water system with ultra-high availability.