The Reality of Failure-Proof Systems in the Cloud

February 2019

How often does the server/virtualisation stack fail? We take some examples from our own experience and identify three things you can do to get close to 5 nines uptime.
For us, as an OEM software vendor providing technology to integrators and carriers, failsafe systems are our lifeblood. This gives us some unique insights into how to achieve high uptime in reality. Traditional strategies for High Availability (HA) revolve around redundancy at a process, VM and physical infrastructure level, to minimise the risk of outage and to minimise recovery time when an outage does occur. But how often does the server/virtualisation stack fail? Let’s take some examples from our own experience.

Sytel has around 30 rackmount servers in the computer room at our R&D labs. We manage them well, and over the last 15 years we have had:

  • one server outage because of a memory chip failure
  • one network switch failure

Downtime has been virtually zero. Maybe that’s good management on our part, but it is really an indication of what can be achieved with hardware.
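
It is worth remembering just how small the downtime budget at 5 nines actually is. A quick back-of-the-envelope calculation (assuming a 365-day year):

```python
# Downtime allowed per year at each availability level (365-day year assumed)
MINUTES_PER_YEAR = 365 * 24 * 60

for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{nines} nines ({availability:.3%} up): {downtime:.1f} minutes of downtime per year")
```

At 5 nines that is roughly five minutes a year, which is why every other layer of the stack matters so much.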

On the software side, yes, you can set up services and VMs to fail over to a redundant backup. But unless your contact center software stack does this for you automatically, the management overhead of setup and the likelihood of misconfiguration mean that virtualisation-based HA is just a fig leaf.

There is another reason why the holy grail of 5 nines uptime can be difficult to achieve with software. Even if the software itself is well proven and reliable, the biggest sources of real-world failure in large-scale contact center systems are:

  • Load-related database errors (both relational and NoSQL databases)
  • Server OS handling of extreme network load
  • Too much work being done by one application instance

HA solutions per se don’t solve these problems. But there are three things you can do, in particular, to mitigate them (illustrative sketches of each follow the list):

  1. Provide application-level caching, and rollforward recovery on failure, for all services that write to the contact center database.
  2. Ensure your platform’s internal signalling protocols (the things that control system state) are hardened to deal with temporary network glitches.
  3. Ensure your system architecture is componentised and able to multiplex everything, both to achieve scale and to limit the cost of a component failure.
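
To make the first point concrete, here is a minimal sketch of a journal-first writer: every write is appended to a local journal before it is attempted against the database, so a database outage loses nothing and the service simply rolls the journal forward on recovery. The `db` client, journal format and file name are assumptions for illustration, not any particular product’s implementation.

```python
import json
import uuid


class JournaledWriter:
    """Journal-first database writer: journal locally, then flush to the DB,
    and roll the journal forward after any failure."""

    def __init__(self, db, journal_path="writes.journal"):
        self.db = db                      # hypothetical DB client exposing execute(stmt, params)
        self.journal_path = journal_path

    def write(self, statement, params):
        record = {"id": str(uuid.uuid4()), "stmt": statement, "params": params}
        # 1. Journal first: this copy survives even if the database is down.
        with open(self.journal_path, "a") as journal:
            journal.write(json.dumps(record) + "\n")
        # 2. Then try the database; a failure here is tolerated, not fatal.
        self._try_flush()

    def roll_forward(self):
        """Call on service restart or DB recovery: replay anything still journaled."""
        self._try_flush()

    def _try_flush(self):
        try:
            for record in self._read_journal():
                self.db.execute(record["stmt"], record["params"])
            open(self.journal_path, "w").close()   # everything applied, truncate the journal
        except Exception:
            pass   # DB unavailable or overloaded: records stay journaled for the next attempt

    def _read_journal(self):
        try:
            with open(self.journal_path) as journal:
                return [json.loads(line) for line in journal if line.strip()]
        except FileNotFoundError:
            return []
```

Replay after a partial failure can write some records twice, so the writes (or the unique id carried with each record) need to be idempotent on the database side.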
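For the second point, hardening internal signalling mostly means refusing to let a single lost packet desynchronise system state. A rough sketch, assuming a simple UDP transport and an ACK convention that are purely illustrative:

```python
import socket
import time


def send_state_change(message: bytes, addr: tuple, retries: int = 5, timeout: float = 0.5) -> bool:
    """Send an internal state-change message and insist on an acknowledgement,
    retransmitting across short network glitches rather than silently giving up."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        for attempt in range(retries):
            sock.sendto(message, addr)
            try:
                reply, _ = sock.recvfrom(1024)
                if reply == b"ACK":
                    return True                          # peer confirmed the state change
            except socket.timeout:
                time.sleep(min(0.1 * 2 ** attempt, 2.0))  # back off, then retransmit
        return False                                      # escalate: mark the peer suspect, never just assume
    finally:
        sock.close()
```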
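And for the third point, a componentised, multiplexed architecture spreads work (calls, campaigns, queues) across many small instances, so no single instance does too much and losing one only costs you its share of the traffic. A toy dispatcher, with hypothetical instance names:

```python
import hashlib


class ShardedDispatcher:
    """Routes work deterministically across a pool of service instances."""

    def __init__(self, instances):
        self.instances = sorted(instances)

    def route(self, key: str) -> str:
        # Deterministic hash: the same campaign or call always lands on the
        # same instance while the pool is stable.
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.instances[digest % len(self.instances)]

    def mark_failed(self, instance: str):
        # Shrink the pool; traffic re-routes to the survivors. (A consistent-hashing
        # ring would limit how many keys move when an instance drops out.)
        self.instances = [i for i in self.instances if i != instance]


dispatcher = ShardedDispatcher(["dialler-1", "dialler-2", "dialler-3", "dialler-4"])
campaign_instance = dispatcher.route("campaign-42")     # e.g. 'dialler-3'
dispatcher.mark_failed(campaign_instance)
print(dispatcher.route("campaign-42"))                  # re-routed to a surviving instance
```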

If you ensure your platform does these three things (and destruction test to make sure that it does!), you will get much closer to 5 nines uptime than you will by just implementing redundant services, virtualisation-based failover and redundant hardware.

It’s not received wisdom, but it’s definitely common sense!