Cloud Outage Resilience

by James G. Barr

Docid: 00018022

Publication Date: 2203

Publication Type: TUTORIAL

Preview

Major cloud outages are more common than cloud providers want to
acknowledge and enterprise cloud users want to believe. These outages can
disrupt critical enterprise business functions, resulting in loss of
customers, loss of income, and even loss of reputation. To help mitigate the
damage, enterprise planners should aim to achieve greater cloud outage
resilience, including developing cloud-agnostic applications, that is,
applications capable of running on multiple cloud platforms.

Executive Summary

Cloud computing has revolutionized business data processing, enabling an
enterprise (either a private sector company or public sector agency) to
outsource many, if not most, of its computer-aided functions to one or
more third-party data centers – collectively, “the cloud.”

Related Faulkner Reports:

  • Cloud Computing Concepts Tutorial
  • Unclouding Tutorial
  • Preparing a Business Continuity Plan Implementation

Cloud computing was – and is – an ideal solution for small-to-medium-sized
enterprises (SMEs), for which the management and maintenance of
information technology is not a “core competency.” By delegating data
processing operations to the cloud, SMEs could avoid:

  • Capital expenses for servers, storage systems, and other "big ticket"
    items
  • Operating expenses for software, salaries, and service contracts

Although the cost of engaging a cloud provider could be high, it was less
than any do-it-yourself option and, perhaps more importantly, it was
predictable.

As cloud computing evolved from software as a service (SaaS) to
infrastructure as a service (IaaS) to today's model, often described as
everything as a service, even large enterprises with well-established,
large data centers started jumping on the proverbial bandwagon,
incorporating the cloud as an extension of their information
infrastructure.

The cloud itself is dominated by three major providers, which
collectively account for 61 percent of cloud services.
They are:

  • Amazon Web Services (AWS) – 32 percent
  • Microsoft Azure – 20 percent
  • Google – 9 percent

Other contributors include Alibaba (6 percent), IBM (5 percent),
Salesforce (3 percent), Tencent (2 percent), Oracle (2 percent), and all
others (21 percent).1

While the fact that the cloud is represented by reputable and
well-resourced firms like Amazon and Microsoft is encouraging, it also
means that the cloud is particularly vulnerable to a major provider
outage, an exposure which, of course, ripples down to individual cloud
users. For enterprise users (that is, companies, agencies, and other
organizations) that rely on cloud services for critical operations, a
measure of resilience is needed. For enterprise planners, the key question
is:

How do we provide for ourselves – and our
customers – if core cloud services are not available?

Or, as analyst Cyril Plisko inquires, “How [do we] architect for
resiliency in a cloud [outages] reality?”2

To achieve cloud outage resilience, enterprise planners can normally elect
one or more of four options:


  • Cloud-Agnostic – Engineer a cloud application to run on
    multiple cloud platforms.
  • Cloud-Native – Engineer a cloud application to run on a
    preferred cloud platform, but permit operations on secondary platforms.
  • Privatization – Redeploy a cloud application from the public
    cloud to an enterprise-managed private cloud.
  • Unclouding – Remove a cloud application from the cloud.

Cloud Outages

Major cloud outages are more common than cloud providers want to
acknowledge and enterprise cloud users want to believe. Consider the
following 2021 incidents:

  • In February, “Google Assistant suddenly stopped working. The outage
    made it impossible to connect to Google Home devices, from smart lights
    to home security tech.”3
  • In March, “[the] largest Europe-based cloud service provider,
    OVHcloud, saw its SBG2 data center in Strasbourg burn down, damaging the
    SBG1 data center as well.”4
  • In June, “[the] AWS EU-Central region experienced a major outage. The
    outage lasted three hours and [was] caused by the failure of a control
    system that disabled multiple air handlers in the [affected]
    Availability Zone.”5
  • In October, “Facebook and its subsidiaries – Messenger, Instagram,
    WhatsApp, Mapillary, and Oculus – became unavailable for six to seven
    hours.”6
  • In November, “Google Cloud went down …, taking services like Home
    Depot, Snapchat, Etsy, Discord, and Spotify down with it.”7
  • In December, “[one] of the mission-critical AWS cloud units,
    US-East-1, was hit with an outage that [disabled] services like Disney+,
    Netflix, Slack, Ticketmaster, stock trading app Robinhood, and the
    crypto exchange Coinbase.”8

The number and severity of cloud outages – and cloud application
outages – will likely increase owing to three factors:


  1. Centralization – A handful of providers, namely Amazon and
    Microsoft, essentially control the cloud. By definition, any outage
    affecting one of these major providers will impact a large number of
    enterprise customers.
  2. Cyber Attacks – The cloud, like any modern information
    infrastructure, is subject to cyber attacks. As cyber attacks escalate
    in frequency and intensity, the risks to cloud infrastructure and cloud
    applications increase concomitantly.
  3. Cloud Fever – Despite the potential for cloud outages, the
    cloud is becoming a more attractive destination for enterprise
    applications. Remote work offers a recent impetus, but, more generally,
    the movement to an everything-should-be-as-a-service philosophy is
    driving higher levels of cloud adoption.

Resilience Options

There are four basic options for reducing exposure to the cloud and cloud
outages.

Cloud-Agnostic

The cloud-agnostic approach refers to engineering cloud applications to
run on multiple clouds, eliminating single-point-of-failure
scenarios. As CloudZero reports, according to Richard Bailey, an expert in
cloud cost management, “A business is cloud-agnostic when the company IT
systems are not locked into a single cloud vendor or do not rely on one
cloud provider’s proprietary services. Typically, services are spread
between multiple cloud vendors to preserve and ensure the uptime of
critical applications.”

Due to the complexity of engineering compatibility and interoperability
between multiple cloud hosts, the cloud-agnostic option is often reserved
for mission-critical applications.
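
One way to picture the engineering work involved is a thin,
provider-neutral interface that application code depends on, with one
adapter per cloud behind it. The following is a minimal sketch under
stated assumptions, not a definitive design: ObjectStore, InMemoryStore,
and ReplicatedStore are hypothetical names introduced for illustration,
and the in-memory class merely stands in for real adapters that would
wrap each vendor's own SDK.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical provider-neutral interface the application codes against."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for a real provider adapter (assumption: illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True  # set to False to simulate a provider outage
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        return self._objects[key]


class ReplicatedStore:
    """Writes to every reachable provider; reads from the first that answers."""

    def __init__(self, stores: list[ObjectStore]) -> None:
        self.stores = stores

    def put(self, key: str, data: bytes) -> None:
        written = 0
        for store in self.stores:
            try:
                store.put(key, data)
                written += 1
            except ConnectionError:
                continue  # tolerate an outage at one provider
        if written == 0:
            raise ConnectionError("no provider accepted the write")

    def get(self, key: str) -> bytes:
        for store in self.stores:
            try:
                return store.get(key)
            except (ConnectionError, KeyError):
                continue  # provider down or missing the object: try the next
        raise ConnectionError(f"no provider could return {key!r}")
```

In this sketch, a write that lands while one provider is down leaves that
replica stale. Reconciling replicas after an outage is deliberately out of
scope here, and it is a large part of the complexity described above.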

While beneficial in terms of uptime, the cloud-agnostic path can lead to
degraded functionality. As CloudZero explains, “With a cloud-agnostic
approach, organizations lock themselves into services that are
transferable across multiple cloud providers. This ‘lowest common
denominator’ approach is necessary if the organization decides to switch
cloud providers – but it also means the engineering team is not able to
take advantage of the best-of-breed services of each of the cloud
providers.

“Not being able to use the innovative services of cloud providers can
slow down the team and put competitors at an advantage.”9

Cloud-Native

The cloud-native approach (or cloud-agnostic lite) refers to engineering
cloud applications to run on a preferred cloud, but permitting them to
operate, as possible and practical, on multiple clouds, either routinely
or on an emergency basis. Not all cloud-native applications can run in this quasi-cloud-agnostic
mode.10
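
The "preferred cloud first, secondary cloud in an emergency" behavior can
be sketched as a small failover wrapper. This is an illustrative sketch
only: call_with_failover and its parameters are hypothetical names, and
the callables stand in for whatever platform-specific operations an
application actually performs.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(
    primary: Callable[[], T],
    secondaries: Sequence[Callable[[], T]],
    retries_on_primary: int = 2,
) -> T:
    """Prefer the primary platform; fall back to secondaries only on failure."""
    last_error: Optional[Exception] = None

    # Give the preferred platform a few chances before leaving it.
    for _ in range(retries_on_primary):
        try:
            return primary()
        except ConnectionError as exc:
            last_error = exc

    # Emergency path: try each secondary platform in the order given.
    for secondary in secondaries:
        try:
            return secondary()
        except ConnectionError as exc:
            last_error = exc

    raise RuntimeError("all platforms failed") from last_error
```

The sketch treats only ConnectionError as a sign of an outage. A real
deployment would need to decide which errors justify leaving the preferred
platform and how to return to it once service is restored.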

Privatization

The privatization approach refers to redeploying cloud applications from
the public cloud to an enterprise-managed private cloud. While not
eliminating the cloud outage exposure, the privatization route enables an
enterprise to reduce – and, otherwise, better manage – this vulnerability.

In terms of enhancing cloud outage resilience – along with enhancing
protection for an enterprise’s most precious digital property – enterprise
planners should implement private clouds for applications that process:

  • Employee data
  • Customer data
  • Financial data
  • Compliance-related data
  • Intellectual property and trade secret information11
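
The list above amounts to a simple placement rule, which can be sketched
as follows. DataCategory, PRIVATE_CLOUD_CATEGORIES, and placement are
hypothetical names, and the PUBLIC_CONTENT category is an assumption added
only so the example has a non-sensitive case to contrast with.

```python
from enum import Enum


class DataCategory(Enum):
    EMPLOYEE = "employee"
    CUSTOMER = "customer"
    FINANCIAL = "financial"
    COMPLIANCE = "compliance"
    INTELLECTUAL_PROPERTY = "intellectual_property"
    PUBLIC_CONTENT = "public_content"  # assumption: a non-sensitive category


# The five categories the list above earmarks for a private cloud.
PRIVATE_CLOUD_CATEGORIES = frozenset(
    {
        DataCategory.EMPLOYEE,
        DataCategory.CUSTOMER,
        DataCategory.FINANCIAL,
        DataCategory.COMPLIANCE,
        DataCategory.INTELLECTUAL_PROPERTY,
    }
)


def placement(categories: set[DataCategory]) -> str:
    """Return 'private' if the application touches any sensitive category."""
    return "private" if categories & PRIVATE_CLOUD_CATEGORIES else "public"
```

For example, an application that processes both public content and
customer data would be placed in the private cloud under this rule,
because a single sensitive category is enough to trigger it.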

Uncloud

The uncloud approach refers to removing cloud applications from the
cloud. While unclouding may seem an extreme action, many enterprise
planners have embraced the concept. According to a 2017-era Datalink/IDG survey of more than 100 IT
professionals, almost 40 percent of organizations with public cloud
experience have migrated systems back from the public cloud to in-house
data centers. The top reasons were:

  • Security – 55 percent
  • Cost/pricing concerns – 52 percent
  • Manageability – 45 percent12

Key steps in the unclouding process include:

  • Selecting a Landing Site – Determining where (i.e.,
    to which environment) the unclouded application will be relocated.
    Remember that the obvious solution – “going home,” or restoring the
    application to its point of origin – may be impractical, particularly if
    the original supporting infrastructure is no longer in place.
  • Creating a Test Site – Deploying the application to
    its future home and conducting extensive functional and performance
    tests. As appropriate, invite select end users to “pilot” the
    application. As analyst Tom Nolle advises, “A cloud exit strategy should
    also emphasize application lifecycle management. [Enterprises] should
    test applications in their new location, and validate workflows. If you
    made significant application changes, or if the applications were in the
    cloud for an extended period, run a pilot test to ensure they run
    correctly before you cut over.”13
  • Developing an Unclouding Plan – Working in
    collaboration with the present and future application hosts, prepare an
    exhaustively detailed unclouding plan, including the specific
    responsibilities of all stakeholders.
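
As one illustration of the functional testing called for in the test site
step, the relocated application can be run side by side with the cloud
version on the same inputs. This is a minimal sketch: find_mismatches is a
hypothetical helper, and the two callables stand in for whatever request
path each deployment actually exposes.

```python
from typing import Callable, Iterable


def find_mismatches(
    cloud_impl: Callable[[str], str],
    landing_impl: Callable[[str], str],
    test_inputs: Iterable[str],
) -> list[str]:
    """Return the inputs on which the relocated application disagrees
    with the version still running in the cloud."""
    mismatches = []
    for item in test_inputs:
        if cloud_impl(item) != landing_impl(item):
            mismatches.append(item)
    return mismatches
```

An empty result is a necessary condition for cutover, not a sufficient
one. Performance tests and the end-user pilot described above would still
be needed.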

Recommendations

To achieve greater cloud outage resilience, enterprise planners should
pursue the following three-step action plan:

  1. Amend their software development protocols to develop cloud-agnostic
    applications.
  2. Reengineer their existing cloud applications – especially their core
    applications – to be cloud-agnostic.
  3. Distribute their cloud-agnostic applications across multiple cloud
    platforms.
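
The benefit of the third step can be estimated with simple arithmetic.
Under the simplifying assumption that platforms fail independently, and
that the application can run on any one of them, the chance that at least
one platform is up is one minus the product of the individual outage
probabilities. The sketch below uses illustrative figures that are not
drawn from this report or from any provider's service level agreement,
and the function names are hypothetical.

```python
from math import prod
from typing import Iterable

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def combined_availability(availabilities: Iterable[float]) -> float:
    """Probability that at least one platform is up,
    assuming platform failures are independent."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def expected_downtime_hours(availability: float) -> float:
    """Expected hours per year during which the application is unreachable."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.99  # illustrative figure only
    dual = combined_availability([single, single])
    print(f"one platform:  {expected_downtime_hours(single):.1f} h/year")
    print(f"two platforms: {expected_downtime_hours(dual):.2f} h/year")
```

Real-world gains are likely to be smaller than this estimate, since
failures across platforms are not fully independent (for example, a shared
dependency or a widespread cyber attack) and the failover mechanism itself
can fail.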

With the knowledge that cloud outages are a significant and ongoing
threat to critical business functions, enterprise planners should update,
as appropriate, their:

  • Incident response (or crisis management) plan
  • Disaster recovery plan
  • Business continuity plan

References

1 Synergy Research Group.

2 Cyril Plisko. “How to Architect for Resiliency in a Cloud
Outages Reality.” InformationWeek | Informa PLC. March 11, 2022.

3 Twain Taylor. “7 Biggest Cloud Outages of the Past Year.”
TechGenix. February 11, 2022.

4 “The Cloud in 2021: 21 Game-Changing Outages, Security
Issues, and Highlights.” CAST AI Group. January 11, 2022.

5 Ibid.

6 Ibid.

7 Twain Taylor. “7 Biggest Cloud Outages of the Past Year.”
TechGenix. February 11, 2022.

8 “The Cloud in 2021: 21 Game-Changing Outages, Security
Issues, and Highlights.” CAST AI Group. January 11, 2022.

9 “Cloud Agnostic: What Does It Really Mean and Why Do You
Need It?” CloudZero. August 13, 2021.

10 Ibid.

11 “Public Cloud vs. Private Cloud: What’s Best?” CoreSite.
October 3, 2016.

12 Esther Shein. “Unclouding: How One Company Reversed the
Cloud Migration Process.” TechTarget. March 2017.

13 Tom Nolle. “Build a Cloud Exit Strategy in Three
Steps.” TechTarget. January 2016.

About the Author

James G. Barr is a leading business continuity analyst
and business writer with more than 40 years’ IT experience. A member of
“Who’s Who in Finance and Industry,” Mr. Barr has designed, developed, and
deployed business continuity plans for a number of Fortune 500 firms. He
is the author of several books, including How to Succeed in Business
BY Really Trying, a member of Faulkner’s Advisory Panel, and a
senior editor for Faulkner’s Security Management Practices.
Mr. Barr can be reached via e-mail at jgbarr@faulkner.com.
