High Availability – Seeing The Bigger Picture

August, 2012

Some vendors' claims for very high uptime (maybe 99.999%) refer to software only. But for cloud-based contact centers, the vast majority of downtime is caused by hardware and network failure. This post looks at the real areas of risk.

For the large-scale, distributed call center, or the hosted or cloud-based service provider, every minute of unscheduled outage costs money in agent wages and missed opportunities. This drives a search for the Holy Grail of software that delivers very high uptime – maybe 99.999%.

Even if we believe the hype of some vendors, if we look at the bigger picture, we see that any such claims refer to software only. But in the new world of virtual, hosted and cloud-based contact centers, the vast majority of downtime is caused by hardware and network failure. And in real life (i.e. outside of the Marketing Dept.) no one is interested in separating out the causes of failure. Downtime is downtime, whatever the cause. So the question end users should be asking software vendors is not “How many 9’s?” but “How well does your software cope with the inevitable infrastructure failures?”
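To put numbers on this, here is a rough sketch of the arithmetic. It assumes the software, hardware and network sit in series (all three must be up for the service to be up) and that they fail independently. The 99.9% figures for hardware and network are illustrative assumptions, not measurements.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(*components: float) -> float:
    """Availability of independent components in series: every one must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


software = 0.99999  # the "five nines" claim
hardware = 0.999    # assumed figure, for illustration only
network = 0.999     # assumed figure, for illustration only

end_to_end = combined_availability(software, hardware, network)

print(f"Software alone: {downtime_minutes_per_year(software):.1f} minutes/year")
print(f"End to end:     {downtime_minutes_per_year(end_to_end) / 60:.1f} hours/year")
```

Under these assumptions, five-nines software accounts for only about five minutes a year, while the service as a whole is down for well over a dozen hours. The weakest links dominate the result.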

Let’s look at the areas of risk and how call center software vendors can (and should) respond to them.

Risk 1 – software failure

Yes, software sometimes fails. Call center software does not operate in a sealed, air-tight, proprietary environment as it once did; IP-based technology is open to the elements, and subject to all the outages and problems that come with distributed components and high-volume network usage.

But software vendors can employ several methods to minimise the impact of software failure:

  • Auto failover – if for some reason the software meets a condition with which it cannot cope, it should restart automatically and carry on where it left off, with minimal impact, and no manual intervention.
  • Segmentation – dividing the product into smaller and smaller discrete chunks so that the impact of any failure can be minimised.
  • Clustering – spreading the processing load across many instances, with automatic load balancing. This allows for the failure of one unit with minimal impact on the system.
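As a minimal illustration of the clustering idea, the sketch below routes work round-robin across several instances and skips any that are marked as failed. The `Cluster` class and its method names are hypothetical, invented for this example; it is not a description of any vendor's implementation.

```python
class Cluster:
    """Hypothetical round-robin dispatcher that skips failed instances."""

    def __init__(self, instance_names):
        self._order = list(instance_names)
        self._healthy = {name: True for name in self._order}
        self._next = 0  # index of the next instance to try

    def mark_failed(self, name):
        self._healthy[name] = False

    def mark_recovered(self, name):
        self._healthy[name] = True

    def route(self):
        """Return the next healthy instance, or raise if none are left."""
        for _ in range(len(self._order)):
            name = self._order[self._next]
            self._next = (self._next + 1) % len(self._order)
            if self._healthy[name]:
                return name
        raise RuntimeError("no healthy instances available")


cluster = Cluster(["node-a", "node-b", "node-c"])
first_round = [cluster.route() for _ in range(3)]

cluster.mark_failed("node-b")  # simulate one unit failing
after_failure = [cluster.route() for _ in range(4)]

print(first_round)
print(after_failure)
```

When one unit fails, its share of the load moves to the survivors without any manual intervention, which is the property the clustering bullet describes.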

Risk 2 – hardware/network failure

Hold on. Why should software vendors like Sytel be concerned about conditions beyond their control? Because any responsible vendor aims to make life as productive and trouble-free as possible for the end-user. This means providing backup and redundancy capabilities to mitigate failures outside of the software.

As with back-up generators that take the strain after a power outage, so each component in the chain – from servers and power supplies, to voice and data networks – should have a back-up ready to take over in the event of hardware or network outage. In the case of servers/virtual machines, this doesn’t necessarily mean ‘1 for 1’ redundancy. As it is unlikely that more than one primary would fail at the same time, a single back-up could be configured to take over for a number of different primaries.
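How unlikely is "more than one at a time"? A back-of-the-envelope check is sketched below. It assumes each primary is independently down with the same small probability at any given moment; the 0.1% figure is an illustrative assumption. Correlated failures, such as a shared power feed or network switch, would break the independence assumption and make a shared back-up riskier.

```python
def prob_more_than_one_down(primaries: int, p_down: float) -> float:
    """Probability that two or more of `primaries` are down at the same moment,
    assuming independent failures with identical probability `p_down`."""
    none_down = (1.0 - p_down) ** primaries
    exactly_one_down = primaries * p_down * (1.0 - p_down) ** (primaries - 1)
    return 1.0 - none_down - exactly_one_down


# Assumed figures for illustration: 10 primaries, each down 0.1% of the time.
risk = prob_more_than_one_down(primaries=10, p_down=0.001)
print(f"Chance that one shared back-up is not enough: {risk:.6%}")
```

With these assumed figures the chance of two simultaneous failures is a few thousandths of one percent, which is why one back-up serving several primaries can be a reasonable trade-off.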

In order to take over automatically, any back-up must be constantly maintained to mirror the state of the primary server. This is where good software design comes in, providing a mechanism which will notice when the primary is no longer in service, and immediately kick-start the back-up.
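One common way to build such a mechanism is a heartbeat: the primary reports in at regular intervals along with its latest state, and the back-up takes over if the reports stop. The sketch below is a simplified, single-process illustration. The `FailoverMonitor` class, its method names and the three-second timeout are all assumptions made for this example, and a real deployment would also have to deal with network partitions and with fencing off the old primary.

```python
import time


class FailoverMonitor:
    """Hypothetical monitor that mirrors primary state and promotes the back-up."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_heartbeat = clock()
        self.mirrored_state = {}
        self.active = "primary"

    def on_heartbeat(self, state):
        """Called when the primary reports in with a snapshot of its state."""
        if self.active != "primary":
            return  # back-up already promoted; ignore the old primary
        self._last_heartbeat = self._clock()
        self.mirrored_state = dict(state)  # keep the back-up's copy current

    def check(self):
        """Promote the back-up if the primary has been silent for too long."""
        silent_for = self._clock() - self._last_heartbeat
        if self.active == "primary" and silent_for > self.timeout:
            self.active = "backup"
        return self.active


# Simulated timeline using a fake clock instead of real waiting.
now = [0.0]
monitor = FailoverMonitor(timeout_seconds=3.0, clock=lambda: now[0])

monitor.on_heartbeat({"calls_in_progress": 5})
now[0] = 2.0
status_while_healthy = monitor.check()  # heartbeat is recent enough

now[0] = 5.5                            # primary has gone silent
status_after_silence = monitor.check()  # back-up takes over with mirrored state

print(status_while_healthy, status_after_silence, monitor.mirrored_state)
```

Because the state is copied on every heartbeat, the back-up starts from the last snapshot it received rather than from scratch. A shorter heartbeat interval narrows the gap between that snapshot and the moment of failure.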

Be prepared to ask hard questions of your supplier, like “What happens if an earthquake hits?” If the answer is “Not our problem!” you might look for a vendor that is willing to try harder and consider the bigger picture.