The Reality of Failure-Proof Systems in the Cloud

February 2019

How often does the server/virtualisation stack fail? We take some examples from our own experience and identify three things you can do to get close to 5 nines uptime.
For us, as an OEM software vendor providing technology to integrators and carriers, failsafe systems are our lifeblood. This gives us some unique insights into how to achieve high uptime in reality. Traditional strategies for High Availability (HA) revolve around redundancy at a process, VM and physical infrastructure level, to minimise the risk of outage and to minimise recovery time when an outage does occur. But how often does the server/virtualisation stack fail? Let’s take some examples from our own experience.

Sytel has around 30 rackmount servers in the computer room at our R&D labs. We manage them well, and over the last 15 years we have had:

  • one server outage because of a memory chip failure
  • one network switch failure

Downtime has been virtually zero. Maybe that’s good management on our part, but it is really an indication of what can be achieved with hardware.
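
It is worth remembering just how small the downtime budget at 5 nines actually is. A quick back-of-the-envelope calculation (assuming a 365-day year):

```python
# Downtime allowed per year at each availability level (365-day year assumed)
MINUTES_PER_YEAR = 365 * 24 * 60

for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{nines} nines ({availability:.3%} up): {downtime:.1f} minutes of downtime per year")
```

At 5 nines that is roughly five minutes a year, which is why every other layer of the stack matters so much.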

On the software side, yes, you can set up services and VMs to fail over to a redundant backup. But unless your contact center software stack does this for you automatically, the management overhead of setup and the likelihood of misconfiguration mean that virtualisation-based HA is just a fig leaf.

There is another reason why the holy grail of 5 nines uptime can be difficult to achieve with software. Even if the software itself is well proven and reliable, the biggest sources of real-world failure in large-scale contact center systems are:

  • Load-related database errors (both relational and NoSQL databases)
  • Server OS handling of extreme network load
  • Too much work being done by one application instance

HA solutions per se don’t solve these problems. But there are three things you can do, in particular, to mitigate them (illustrative sketches of each follow the list):

  1. Provide application-level caching, and rollforward recovery on failure, for all services that write to the contact center database.
  2. Ensure your platform’s internal signalling protocols (the things that control system state) are hardened to deal with temporary network glitches.
  3. Ensure your system architecture is componentised and able to multiplex everything, both to achieve scale and to limit the cost of a component failure.
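
To make the first point concrete, here is a minimal sketch of a journal-first writer: every write is appended to a local journal before it is attempted against the database, so a database outage loses nothing and the service simply rolls the journal forward on recovery. The `db` client, journal format and file name are assumptions for illustration, not any particular product’s implementation.

```python
import json
import uuid


class JournaledWriter:
    """Journal-first database writer: journal locally, then flush to the DB,
    and roll the journal forward after any failure."""

    def __init__(self, db, journal_path="writes.journal"):
        self.db = db                      # hypothetical DB client exposing execute(stmt, params)
        self.journal_path = journal_path

    def write(self, statement, params):
        record = {"id": str(uuid.uuid4()), "stmt": statement, "params": params}
        # 1. Journal first: this copy survives even if the database is down.
        with open(self.journal_path, "a") as journal:
            journal.write(json.dumps(record) + "\n")
        # 2. Then try the database; a failure here is tolerated, not fatal.
        self._try_flush()

    def roll_forward(self):
        """Call on service restart or DB recovery: replay anything still journaled."""
        self._try_flush()

    def _try_flush(self):
        try:
            for record in self._read_journal():
                self.db.execute(record["stmt"], record["params"])
            open(self.journal_path, "w").close()   # everything applied, truncate the journal
        except Exception:
            pass   # DB unavailable or overloaded: records stay journaled for the next attempt

    def _read_journal(self):
        try:
            with open(self.journal_path) as journal:
                return [json.loads(line) for line in journal if line.strip()]
        except FileNotFoundError:
            return []
```

Replay after a partial failure can write some records twice, so the writes (or the unique id carried with each record) need to be idempotent on the database side.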
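For the second point, hardening internal signalling mostly means refusing to let a single lost packet desynchronise system state. A rough sketch, assuming a simple UDP transport and an ACK convention that are purely illustrative:

```python
import socket
import time


def send_state_change(message: bytes, addr: tuple, retries: int = 5, timeout: float = 0.5) -> bool:
    """Send an internal state-change message and insist on an acknowledgement,
    retransmitting across short network glitches rather than silently giving up."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        for attempt in range(retries):
            sock.sendto(message, addr)
            try:
                reply, _ = sock.recvfrom(1024)
                if reply == b"ACK":
                    return True                          # peer confirmed the state change
            except socket.timeout:
                time.sleep(min(0.1 * 2 ** attempt, 2.0))  # back off, then retransmit
        return False                                      # escalate: mark the peer suspect, never just assume
    finally:
        sock.close()
```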
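And for the third point, a componentised, multiplexed architecture spreads work (calls, campaigns, queues) across many small instances, so no single instance does too much and losing one only costs you its share of the traffic. A toy dispatcher, with hypothetical instance names:

```python
import hashlib


class ShardedDispatcher:
    """Routes work deterministically across a pool of service instances."""

    def __init__(self, instances):
        self.instances = sorted(instances)

    def route(self, key: str) -> str:
        # Deterministic hash: the same campaign or call always lands on the
        # same instance while the pool is stable.
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.instances[digest % len(self.instances)]

    def mark_failed(self, instance: str):
        # Shrink the pool; traffic re-routes to the survivors. (A consistent-hashing
        # ring would limit how many keys move when an instance drops out.)
        self.instances = [i for i in self.instances if i != instance]


dispatcher = ShardedDispatcher(["dialler-1", "dialler-2", "dialler-3", "dialler-4"])
campaign_instance = dispatcher.route("campaign-42")     # e.g. 'dialler-3'
dispatcher.mark_failed(campaign_instance)
print(dispatcher.route("campaign-42"))                  # re-routed to a surviving instance
```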

If you ensure your platform does these three things (and destruction test to make sure that it does!), you will get much closer to 5 nines uptime than you will by just implementing redundant services, virtualisation-based failover and redundant hardware.

It’s not received wisdom, but it’s definitely common sense!