99.99% available services, Are Tiers relevant? The Cloud may be more disruptive than misinformed

Uptime Institute has published a post in response to an AFCOM post on Tier Standards.


An abbreviated version of this column was written for Data Center Knowledge in response to an interview with AFCOM Denver Chapter President Hector Diaz, on September 11, 2014.

...

The Tier standards offered by the Uptime Institute can often be confusing at present.

The Tiers are summarized in these four levels.

  • Tier IV - Fault tolerant site infrastructure
  • Tier III - Concurrently maintainable site infrastructure
  • Tier II - Redundant capacity components site infrastructure (redundant)
  • Tier I - Basic site infrastructure (non-redundant)

So when you want a highly available service, you would assume you need a Tier 3 or 4 data center.  But services like Netflix, eBay, and Google run across 3-5 data centers, so any one data center can go down and the services are still available.  I don’t ever hear these guys talking about having built Tier 3 or 4 data centers.  Heck, Netflix proudly says they don’t need data centers, since they use Amazon and Google Cloud Services.

Given that 99% or more of startups are using the cloud to build services and following guidance like the AWS architecture for high availability services, are Tier ratings of data centers relevant?


Instead of talking about tiers, highly available cloud services talk about availability zones.

Regions and Availability Zones

Amazon EC2 is hosted in multiple locations world-wide. These locations are composed of regions and Availability Zones. Each region is a separate geographic area. Each region has multiple, isolated locations known as Availability Zones. Amazon EC2 provides you the ability to place resources, such as instances, and data in multiple locations. Resources aren't replicated across regions unless you do so specifically.

Amazon operates state-of-the-art, highly-available data centers. Although rare, failures can occur that affect the availability of instances that are in the same location. If you host all your instances in a single location that is affected by such a failure, none of your instances would be available.


While AFCOM and Uptime Institute debate Tier Standards, the technology community is moving to Availability Zone practices.
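The reasoning behind spreading a service over several sites, rather than relying on one highly rated site, can be sketched with simple arithmetic. The snippet below is a rough illustration, not a figure from Uptime Institute or AWS: the 99% per-site availability is a hypothetical number, the function names are made up for this example, and the formula assumes sites fail independently, which real regions and zones only approximate.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Probability that at least one of `sites` independent sites is up.

    Assumes failures are independent and that the service can run
    from any single surviving site.
    """
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # The service is down only when every site is down at the same time.
    return 1.0 - (1.0 - site_availability) ** sites


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    # Hypothetical figure: each site is up 99% of the time.
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(f"{n} site(s): {a:.6f} available, "
              f"{downtime_minutes_per_year(a):.1f} min/year down")
```

Under these assumptions, two sites at 99% each already reach 99.99% combined, which is why the conversation shifts from how robust one building is to how many independent locations a service runs in.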

Netflix just survived the unavoidable rebooting of 10% of its Cassandra servers in AWS due to Xen maintenance.  And Netflix has survived a variety of data center outages in AWS as well.

A State of Xen - Chaos Monkey & Cassandra

 
On Sept 25th, 2014 AWS notified users about an EC2 Maintenance where “a timely security and operational update” needed to be performed that required rebooting a large number of instances (around 10%).  On Oct 1st, 2014 AWS sent an update about the status of the reboot and XSA-108.


While we’d love to claim that we weren’t concerned at all given our resilience strategy, the reality was that we were on high alert given the potential of impact to our services.  We discussed different options, weighed the risks and monitored our services closely.  We observed that our systems handled the reboots extremely well with the resilience measures we had in place.  These types of unforeseen events reinforce that regular, controlled chaos and continuing to invest in chaos engineering is necessary. In fact, Chaos Monkey was mentioned as a best practice in the latest EC2 Maintenance update.
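To make the idea concrete, here is a toy simulation of the two events described above: a reboot wave that hits a random 10% of instances, and the loss of a whole zone. It is only a sketch under stated assumptions and is not Netflix's Chaos Monkey or anything from AWS. The zone names, the service count, the round-robin placement, and the rule that a service is "up" while at least one replica survives are all hypothetical.

```python
import random


def place_replicas(num_services, zones, replicas_per_service):
    """Spread each service's replicas round-robin across zones.

    Returns a list of (service_id, zone) tuples, one per instance.
    """
    instances = []
    for service in range(num_services):
        for replica in range(replicas_per_service):
            zone = zones[(service + replica) % len(zones)]
            instances.append((service, zone))
    return instances


def reboot_random_fraction(instances, fraction, rng):
    """Pick a random set of instance indexes to take down (a reboot wave)."""
    count = round(len(instances) * fraction)
    return set(rng.sample(range(len(instances)), count))


def zone_outage(instances, zone):
    """Indexes of every instance in the given zone (a whole-zone failure)."""
    return {i for i, (_service, z) in enumerate(instances) if z == zone}


def services_still_up(instances, down):
    """A service counts as up if at least one of its replicas is not down."""
    return {service for i, (service, _zone) in enumerate(instances)
            if i not in down}


if __name__ == "__main__":
    zones = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names
    instances = place_replicas(100, zones, replicas_per_service=3)

    rng = random.Random(2014)  # fixed seed so runs are repeatable
    rebooted = reboot_random_fraction(instances, 0.10, rng)
    print(len(rebooted), "instances rebooted,",
          len(services_still_up(instances, rebooted)), "of 100 services up")

    lost = zone_outage(instances, "zone-a")
    print(len(lost), "instances lost with zone-a,",
          len(services_still_up(instances, lost)), "of 100 services up")
```

With one replica per zone, losing any single zone always leaves every service running in this model, while a random reboot wave takes a service down only if it happens to hit all of its replicas at once. Running this kind of failure on purpose, on a schedule, is the point of controlled chaos.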