Recent Public Cloud Outage Demonstrates the Need to Get Back to IT Basics

Fault Tolerance Requires Avoiding Single Points of Failure

A December public cloud outage was a rare but real example of service disruption. It underscores why IT managers and directors should remind their teams to get back to basics. When managing infrastructure, key tenets include avoiding single points of failure and building fault tolerance into the environment.

Outages at data centers managed by the major public cloud providers are highly visible because many businesses use these platforms. Even customers who don’t deploy applications in the cloud directly are likely to rely on third-party SaaS providers that use these large public clouds behind the scenes for VoIP, email, chat, help desk, call center, website hosting, and other enterprise application services. Avoiding a single point of failure is sound infrastructure design, whether you use public cloud or private cloud platforms.

If any of your third-party services run in the cloud, you may not have sufficient redundancy. It is critically important that you ask your third-party providers a) who their cloud provider is and b) what their back-up plan is if that cloud provider suffers an outage.

Many small business owners as well as IT pros at the enterprise level might be under the impression that they have distributed their workflows sufficiently because they are working with multiple IT infrastructure partners. However, if your partners are using the same underlying cloud provider, you still have a single point of failure. Also, keep in mind that your public cloud options are not limited to the major public cloud providers with wide name recognition. Other organizations, including Multi-Tenant Data Centers, like DataBank, can provide you with public cloud access as well.
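The check described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the inventory, the service names, and the provider labels ("cloud-a", "cloud-b") are hypothetical placeholders standing in for the answers you collect from your own vendors.

```python
from collections import defaultdict


def shared_cloud_dependencies(services):
    """Group services by the cloud provider their vendor reported.

    `services` maps a service name to an underlying cloud provider
    (hypothetical data gathered by asking each vendor).
    Returns only the providers that more than one service depends on.
    """
    by_provider = defaultdict(list)
    for service, provider in services.items():
        by_provider[provider].append(service)
    return {p: sorted(s) for p, s in by_provider.items() if len(s) > 1}


# Hypothetical inventory; every name here is a placeholder.
inventory = {
    "voip": "cloud-a",
    "email": "cloud-a",
    "help-desk": "cloud-b",
    "website": "cloud-a",
}

for provider, dependents in shared_cloud_dependencies(inventory).items():
    print(f"{provider} is shared by: {', '.join(dependents)}")
```

In this made-up inventory, three of the four services sit on the same underlying provider even though they come from different vendors, which is the hidden single point of failure the paragraph above warns about.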

Real Redundancy Requires More Than One Environment

Implementing a failover site that can host your infrastructure should a disaster strike may seem obvious, but this is sometimes forgotten when new infrastructures are deployed or when an existing infrastructure is migrated to a new environment. In fact, the major public cloud providers also emphasize the need for their customers to set up failover infrastructures at secondary data centers. It’s not enough to just deploy a backup site; you also need to test your failover capabilities regularly.

Before the cloud existed, when companies hosted their infrastructure on-premises, they focused on redundancy for all layers of their stack, ranging from the top—the application itself—down to lower levels, including power, cooling, servers, and network devices. As enterprises began to migrate to the public cloud, they stopped worrying about the parts that are outsourced to their cloud provider. They most likely assumed their cloud providers managed all of this. However, for real redundancy, you need more than a single environment.

Make Sure Tolerance Is Part of the Design Conversation

As you design your infrastructures, make sure fault tolerance is part of the conversation. Your system engineers should consider what will happen if services from a public cloud provider are shut off at random. Hypothetical outages could range from single VMs to entire data centers, and could include any SaaS platform. Create a failover plan and then test it to see whether the business can tolerate the shutdown.

Warning: If your system engineers are uncomfortable with this exercise, that’s a sign the current state of your fault-tolerance capabilities may not be up to your business requirements.
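Before running a live drill, the outage exercise can be rehearsed on paper. The sketch below is one possible way to do that in Python, assuming you keep a simple map of which environments can serve each application. The application and environment names are hypothetical, and the code only reasons about the map; it does not shut anything off or replace a real failover test.

```python
def surviving_services(deployments, failed_environment):
    """Return which services remain reachable after one environment fails.

    `deployments` maps a service name to the set of environments that can
    serve it (primary plus any failover sites).
    """
    result = {}
    for service, environments in deployments.items():
        remaining = set(environments) - {failed_environment}
        result[service] = bool(remaining)
    return result


# Hypothetical deployment map; all names are placeholders.
deployments = {
    "order-entry": {"public-cloud", "colocation"},
    "email": {"public-cloud"},
    "help-desk": {"public-cloud", "private-cloud"},
}

# Walk through every environment and report what would go dark.
all_environments = sorted(set().union(*deployments.values()))
for env in all_environments:
    status = surviving_services(deployments, env)
    down = sorted(name for name, up in status.items() if not up)
    print(f"If {env} fails, unavailable: {down or 'nothing'}")
```

Any service that shows up as unavailable has only one environment behind it and is a candidate for a failover site.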

Testing what will happen if public cloud services suddenly become unavailable is critically important. As an alternative, you can build a private cloud or use a colocation data center to get more control over resiliency decisions. You will want to work with a cloud provider who makes decisions based on your best interests, not their own.

For example, you can make sure your data center is not within the blast radius of a strategic city. It’s also a good idea to deploy ample levels of back-up generator capacity along with multiple IP transit providers, diverse fiber optic pathways, and a true A-B server and network infrastructure.

Failover Site Considerations

Before deploying a failover site for an existing IT infrastructure or one you’re about to deploy, evaluate the mission-criticality level of the applications supported by the infrastructure. How long can your business afford for that infrastructure and the applications it supports to be unavailable to end-users? They may be internal users managing operations or external users—such as customers and vendors—trying to transact orders.

From there, determine the IT resources and processes needed to meet the recovery time and recovery point objectives. Another key consideration is just how much control you want to maintain over the primary and failover infrastructures. Public cloud platforms give you the least control while on-premises data centers give you the most. In between those two extremes are private cloud and colocation data centers.
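To make the recovery time and recovery point objectives mentioned above concrete, here is a small Python sketch that compares the result of a failover test against stated targets. All of the numbers are hypothetical, and treating the backup interval as the worst-case data loss is a simplifying assumption rather than a rule.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # recovery time objective: longest tolerable downtime
    rpo: timedelta  # recovery point objective: longest tolerable data-loss window


def meets_objectives(objectives, measured_recovery, backup_interval):
    """Compare a failover test result against the stated objectives.

    Worst-case data loss is approximated by the backup (or replication)
    interval, which is a simplifying assumption.
    """
    return {
        "rto_met": measured_recovery <= objectives.rto,
        "rpo_met": backup_interval <= objectives.rpo,
    }


# Hypothetical targets and test measurements.
objectives = RecoveryObjectives(rto=timedelta(hours=1), rpo=timedelta(minutes=15))
result = meets_objectives(
    objectives,
    measured_recovery=timedelta(minutes=45),
    backup_interval=timedelta(hours=1),
)
print(result)
```

In this example the failover completes within the one-hour recovery time target, but hourly backups cannot satisfy a 15-minute recovery point target, so the backup or replication process would need to change.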

You may, for example, deploy your primary infrastructure in a colocation data center where you have more control. You can then set up the failover environment in the public cloud, where there’s less control—but where you can run the infrastructure for as long as it takes to restore the colocation data center should that infrastructure go down.

There’s also an added benefit to this approach. You can use the public cloud environment for extra capacity if a flash event occurs and the workload on your colocation data center spikes. In this case, you would address scalability and failover requirements with a single environment.

Portability Facilitates Fault Tolerance

Where you deploy the primary and failover infrastructures will vary according to your business requirements. Given all the variables, it’s best to consult with an IT partner who has data center and hybrid environment expertise. It’s also important to realize that as your applications mature, it may make sense to move your infrastructure to a different type of environment. This means you will need to evaluate your failover plan regularly, perhaps as often as you test the plan.

The need to migrate infrastructures is where environment portability comes in handy. If you partner with data center providers that give you flexibility as to where you deploy your infrastructures, you have the freedom to migrate either your primary or failover infrastructures when the business or technical requirements warrant. Your choices can include colocation and on-premises data centers as well as public and private clouds.

The key is never to forget about the basics. Mission-critical applications always need a failover plan that’s tested and evaluated regularly. Your business can’t afford the revenue loss that results when customers can’t place orders or internal teams can’t fulfill them. You also risk tarnishing your brand reputation if customers stop trusting your ability to deliver products and services reliably.

Summary:

Ensure that your workflows are sufficiently distributed to avoid a single point of failure.
Create a failover plan and test it regularly.
Broaden your public cloud options beyond the major providers.
Consider building a private cloud for greater control over resiliency decisions.
Look to IT infrastructure partners who have hybrid environment expertise and accommodate portability.

For more information on how to set up failover environments and build resilience for your mission-critical applications, contact DataBank today.

This article was written by Phil Rosenthal, Vice President of Product and Engineering, DataBank.
