Updated: Oct 24, 2018
Technology is an enabling tool. It has become the lifeblood and a key differentiator for many companies. On the downside, some companies suffer huge losses, or even go out of business, after a prolonged datacenter outage caused by technical failure, natural disaster, hacking, sabotage, or the ever-present human error, among other root causes.
Two years ago, a datacenter outage cost Delta Airlines a whopping $150M after the company suffered major electrical issues that impacted around 2,000 flights. The same thing happened to Southwest Airlines in July 2016, when a single failed router crippled 2,300 flights at an estimated price tag of $175M.
Whether we admit it or not, most companies, or segments of a company, can no longer operate without a fully functional system. We have all become dreadfully dependent on technology; without it, we simply cannot do anything anymore.
The Quest for Uptime
The Uptime Institute uses a four-tier ranking system as a benchmark for determining the reliability of a datacenter according to its design, construction, and performance. This proprietary rating system ranges from a Tier I computer room, which offers minimal power and cooling protection, up to an enterprise-grade Tier IV datacenter, which offers full redundancy plus stand-by power and cooling systems guaranteeing 99.995% uptime.
Basic Datacenter Tier Construction Elements according to Uptime Institute:
Tier 1 – Low-criticality computer rooms, often used by small businesses or by non-critical, non-technology-dependent sites or segments of a company:
Up to 28.8 hours of downtime per year
Tier 2 – Medium-criticality computer rooms, typically used by medium-sized businesses or a segment of a company:
Partial redundancy in power and cooling
Up to 22 hours of downtime per year
Tier 3 – High-criticality datacenters, used by large companies or a large segment of a company:
99.982% uptime
No more than 1.6 hours of downtime per year
N+1 fault tolerance, providing at least 72 hours of power-outage protection
Tier 4 – Ultra-high-criticality datacenters, used by enterprises and multinational companies:
99.995% uptime
2N+1 fully redundant infrastructure (the main difference between Tier 3 and Tier 4 datacenters)
96 hours of power-outage protection
No more than 26.3 minutes of downtime per year
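The downtime figures in the tier table follow directly from the uptime percentages: annual downtime is simply (1 − uptime) × 8,760 hours. A minimal sketch of that arithmetic (the helper function is illustrative, not part of any Uptime Institute tooling):

```python
# Convert an uptime percentage into the allowed annual downtime.
# A non-leap year has 365 * 24 = 8,760 hours.

HOURS_PER_YEAR = 365 * 24

def annual_downtime_hours(uptime_pct: float) -> float:
    """Hours of downtime per year implied by a given uptime percentage."""
    return (1 - uptime_pct / 100) * HOURS_PER_YEAR

print(round(annual_downtime_hours(99.982), 1))       # Tier 3: 1.6 hours
print(round(annual_downtime_hours(99.995) * 60, 1))  # Tier 4: 26.3 minutes
```

Running the two Tier 3 and Tier 4 figures through this formula reproduces the 1.6 hours and 26.3 minutes quoted above.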
Designing for uptime is a mix of science, engineering, math, and art. These ingredients are all essential, but they must be blended in the right proportions to meet the desired availability requirements.
My personal quest for “Uptime” started after I was tasked with managing the Global Datacenter of CEMEX back in 2006, just when the company decided to consolidate 13 regional datacenters worldwide into one. It was perfect timing to evolve a maxed-out Tier II datacenter to at least Tier III and serve the worldwide operations of the company 24x7. It was a long and painful journey, but it was worth it. With meticulous planning, design, execution, adjustments, re-adjustments, calibration and re-calibration, CEMEX‘s Global Datacenter earned the first Uptime Institute Tier III (with elements of Tier IV) certification in all of Latin America.
Kudos to all who helped in this journey: Uptime Institute (Ken Brill & Lars Strong), ATN (Engr. Manuel Victorio and team), friends from ICREA (International Computer Room Experts Association), all the CEMEX datacenter staff, and of course one of the pioneers of Uptime Institute engineering in LATAM, Engr. Pedro Cantu.
Microsoft’s Datacenter downtime – not cool!
In the past year, Microsoft Azure’s uptime was 99.9979%, which is undoubtedly Tier 4-grade performance based on its service-level attainment. However, a cooling problem in Microsoft's South Central U.S. datacenter near San Antonio caused a partial shutdown in one of its segments, apparently associated with a lightning strike according to Microsoft's press release. The downtime caused outages and intermittent access for a number of Microsoft Cloud services users in the U.S. and other countries until late afternoon on Sept. 4. It impacted services related to Office 365, Xbox, OneDrive, Azure Active Directory and Visual Studio Team Services, to name a few.
In Azure‘s web site, Microsoft published that, “Starting at 09:29 UTC on 04 Sep 2018, customers in South Central US may experience difficulties connecting to resources hosted in this region. Engineers have isolated an issue with cooling in one part of the data center, which caused a localized spike in temperature, as the preliminary root-cause…”
Without knowing the real root cause, I can only surmise that there was an engineering design error which had caused this costly downtime for Microsoft, its reputation, and the companies that were impacted by the failure.
A Tier 4 datacenter's 2N+1 design means full redundancy plus a stand-by emergency system for every critical datacenter domain, such as power, cooling, and mechanical components, which in this particular case was evidently not present. I highly doubt the facility even met the protection level of a Tier 3 datacenter with its N+1 redundancy design.
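As a rough illustration of why those redundancy levels matter (a simplified parallel-reliability model, not the Uptime Institute's actual topology-based rating methodology), assume each unit fails independently with availability a; a system that survives as long as any one of n units works has availability 1 − (1 − a)^n:

```python
# Simplified parallel-redundancy model (illustrative only).
# Assumes units fail independently, each with availability `a`,
# and that a single surviving unit is enough to carry the load.

def parallel_availability(a: float, n: int) -> float:
    """Availability of n redundant units where any one unit suffices."""
    return 1 - (1 - a) ** n

# A single cooling unit that is 99% available...
print(parallel_availability(0.99, 1))  # 0.99

# ...with one redundant spare jumps to "four nines"...
print(parallel_availability(0.99, 2))  # 0.9999

# ...and with two spares to "six nines".
print(parallel_availability(0.99, 3))  # 0.999999
```

The model is deliberately naive (real facilities share failure modes such as a common utility feed or, as here, a lightning strike), but it shows why adding independent spares is the backbone of Tier 3 and Tier 4 designs.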
My guess is that Microsoft Azure's 12-month uptime of 99.9979% was based on luck, not on engineering.
Human error is still the main root cause of most downtimes
The weakest link in the datacenter uptime chain is not the infrastructure, but the people who design, implement, and operate it. According to the Uptime Institute, 70% of datacenter downtime is still caused by human error. In my professional opinion, Microsoft‘s downtime on September 4 was caused by a design flaw, and therefore, ultimately, a dreadful human error. But, as they say, "To err is human, to forgive is divine".