Designing for the Cloud – Challenge #6 – Mitigating Failure

March 2021

The universal law of communications is that things go wrong. Every formal signalling protocol has this fact built into its design. But all too often, this doesn’t extend to the rest of the infrastructure that runs the contact center. So... what happens when things break?

In the 6th and final blog in our Designing for the Cloud series, we take a look at the sources of downtime and how to mitigate failure to protect your CCaaS business.

Of course, if you take a service from a cloud provider, they (in theory) have the infrastructure, redundancy and so on to prevent outages. Some make a great virtue of this, claiming 99.999% uptime and the like.

While a ‘high availability’ solution with redundancy is of course important, it only prevents ‘lights-out’ outages caused by server or OS failure.

But server and OS failure is not a major source of agent downtime in any contact center operating at scale. If everything else in your solution is reliable, this level of redundancy makes only a tiny practical difference to overall reliability: perhaps the difference between 99.99% and 99.999% uptime.
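
To put those figures in context, here is a quick back-of-the-envelope calculation (a minimal Python sketch, nothing platform-specific):

```python
# Quick arithmetic: how much downtime per year each uptime figure permits.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for label, availability in [("99.99%", 0.9999), ("99.999%", 0.99999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label} uptime allows roughly {downtime:.1f} minutes of downtime per year")
```

That is roughly 53 minutes a year versus roughly 5 minutes a year of permitted downtime: a meaningful gap, but a small one next to the failure modes described below.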

The sources of downtime

In real life, the sources of downtime are a bit more mundane:

  1. Misconfiguration leading to unexpected system behaviour.
  2. Poorly designed integrations creating bottlenecks and leading to Denial of Service (DoS).
  3. Database logjams, created by somebody running a ‘get me everything from everywhere’ report.
  4. And, last but not least, software defects.

These issues cause considerably more downtime than server and OS failures, so avoiding them is crucial to delivering a truly ‘high availability’ system.

[Figure: sources of downtime pie chart]

Mitigating failure

The key to mitigating these failures is design, starting with the APIs exposed by the platform.

  1. Misconfiguration – Robust API processing means not just reporting errors, but also reporting on, and mitigating the effects of, misconfiguration (a sketch of this follows the list).
  2. Integration design – The shape of the APIs also determines the shape of any integration with third-party software. Clear separation of concerns, and resource-failure mitigation built into the API contract, are necessary to keep the system available when an external resource misbehaves (see the second sketch below).
  3. Database logjams – Again, design is the solution. Failsafe mechanisms can be built through clear separation of concerns and software that proxies database communication on behalf of the other components in the platform (see the third sketch below).
  4. Software defects – The elephant in the room. Every manufacturer attests to its great software quality; it is a mantra which sadly has no bearing on reality. The next time you’re looking to refresh your contact center platform, maybe the question you should ask is ‘What does your platform do to mitigate the effects of software defects?’ The first key element is guard code that protects against ‘nonsense’ logic (illustrated in the first sketch below); the second is a system that alerts the service provider about potential problems, which we return to at the end of this post.
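
As an illustration of the first point (and of guard code for the fourth), here is a minimal sketch of an API handler that validates a configuration payload before applying it. The payload fields, limits and defaults are invented for the example; the point is that bad values are reported and replaced with safe ones rather than silently accepted.

```python
# A minimal sketch, not a real platform API: validate a hypothetical routing
# configuration, report what is wrong, and substitute safe defaults rather
# than accepting nonsense values.
from dataclasses import dataclass

DEFAULT_MAX_QUEUE_TIME_SECS = 300
DEFAULT_OVERFLOW_QUEUE = "default_overflow"

@dataclass
class RoutingConfig:
    max_queue_time_secs: int
    overflow_queue: str

def apply_routing_config(payload: dict) -> tuple:
    """Return (usable config, warnings describing any misconfiguration
    that was detected and mitigated)."""
    warnings = []

    max_wait = payload.get("max_queue_time_secs", DEFAULT_MAX_QUEUE_TIME_SECS)
    if not isinstance(max_wait, int) or not 0 < max_wait <= 3600:
        warnings.append(
            f"max_queue_time_secs={max_wait!r} is invalid; "
            f"using default of {DEFAULT_MAX_QUEUE_TIME_SECS}"
        )
        max_wait = DEFAULT_MAX_QUEUE_TIME_SECS

    overflow = payload.get("overflow_queue") or DEFAULT_OVERFLOW_QUEUE
    if overflow == payload.get("primary_queue"):
        warnings.append("overflow_queue equals primary_queue; using default overflow")
        overflow = DEFAULT_OVERFLOW_QUEUE

    return RoutingConfig(max_wait, overflow), warnings

# Example: a misconfigured payload still yields a workable configuration.
config, issues = apply_routing_config({"max_queue_time_secs": -5})
print(config, issues)
```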
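
For the second point, a common pattern is to make failure handling part of the client that talks to the third-party system, so a slow or failing integration degrades gracefully instead of stalling call handling. A minimal sketch, assuming the widely used requests library and an invented CRM endpoint:

```python
# A minimal sketch: the timeout and fallback are part of the design,
# so a slow or broken CRM integration cannot stall call handling.
import requests

CRM_TIMEOUT_SECS = 2.0  # fail fast rather than queueing calls behind a slow API

def lookup_caller(phone_number: str) -> dict:
    try:
        response = requests.get(
            "https://crm.example.com/contacts",   # hypothetical endpoint
            params={"phone": phone_number},
            timeout=CRM_TIMEOUT_SECS,
        )
        response.raise_for_status()
        return response.json()
    except requests.RequestException:
        # The lookup failed or timed out: carry on with a minimal record
        # instead of letting the failure cascade into the platform.
        return {"phone": phone_number, "crm_status": "unavailable"}
```

A fuller implementation might add retries with backoff or a circuit breaker, but the principle is the same: the API contract itself defines what happens when the resource is unavailable.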
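
And for the third point, here is a sketch of a query proxy that sits between platform components and the database, capping how many heavy reporting queries can run at once so a ‘get me everything from everywhere’ report cannot starve agent-facing work. Class and parameter names are illustrative only.

```python
# A minimal sketch: a query proxy that platform components call instead of
# hitting the database directly. It caps concurrent reporting queries so one
# oversized report cannot monopolise the connection pool.
import threading

class QueryProxy:
    def __init__(self, run_query, max_concurrent_reports: int = 2):
        self._run_query = run_query  # callable that actually executes SQL
        self._report_slots = threading.BoundedSemaphore(max_concurrent_reports)

    def run_report(self, sql: str, wait_secs: float = 5.0):
        # Reports wait briefly for a free slot; if none appears they are
        # rejected rather than piling up behind one another.
        if not self._report_slots.acquire(timeout=wait_secs):
            raise RuntimeError("reporting capacity exhausted; try again later")
        try:
            return self._run_query(sql)
        finally:
            self._report_slots.release()

    def run_transactional(self, sql: str):
        # Interactive, agent-facing queries bypass the reporting limiter.
        return self._run_query(sql)
```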

Even in a well-designed and tested platform, integration of third-party technology and user configuration can have unintended consequences; for example, agent scripts that run SQL queries which, executed for many agents in parallel, cause database contention.

Being able to catch issues like this before they become operational problems helps keep your clients happy, and your business rolling along.
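
As a final sketch, this is roughly what such early capture can look like in code: timing queries issued by agent scripts and alerting the service provider once one script starts producing a suspicious number of slow queries. The thresholds and the alert hook are assumptions for illustration.

```python
# A minimal sketch: time agent-script queries and alert the service provider
# when one script starts causing widespread slowness, so the problem is
# caught before agents notice.
import time
from collections import Counter

SLOW_QUERY_SECS = 2.0
ALERT_THRESHOLD = 25          # slow executions of one script in a window

slow_counts = Counter()

def send_alert(message: str) -> None:
    # Placeholder: a real platform would page the service provider or raise
    # a ticket via its monitoring system.
    print("ALERT:", message)

def timed_script_query(script_id: str, run_query, sql: str):
    started = time.monotonic()
    try:
        return run_query(sql)
    finally:
        elapsed = time.monotonic() - started
        if elapsed > SLOW_QUERY_SECS:
            slow_counts[script_id] += 1
            if slow_counts[script_id] == ALERT_THRESHOLD:
                send_alert(
                    f"agent script {script_id} has produced "
                    f"{ALERT_THRESHOLD} slow queries; possible contention"
                )
```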

For more information about how our products are designed for the cloud, just talk to us.