Architecting for Outages: an architect posts on surviving AWS

Everyone wants to survive a data center outage, but as the AWS outage shows, not all do. Here is a post that summarizes best practices in software architecture for surviving an outage like AWS's.

Retrospect on recent AWS outage and Resilient Cloud-Based Architecture

Thursday, June 9, 2011 at 8:19AM

A bit over a month ago Amazon experienced its infamous AWS outage in the US East Region. As a cloud evangelist, I was intrigued by the story of the outage as it unfolded. There were great posts during and after the outage from those who went down. But more interesting to me as an architect were the detailed posts of those who managed to survive the outage relatively unharmed, such as SimpleGeo, Netflix, SmugMug, SmugMug’s CTO, Twilio, Bizo and others.

The main principles, patterns, and best practices are (a small code sketch of the first few follows the list):

  • Design for failure
  • Stateless and autonomous services
  • Redundant hot copies spread across zones
  • Spread across several public cloud vendors and/or private cloud
  • Automation and monitoring
  • Avoiding ACID services and leveraging NoSQL solutions
  • Load balancing
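
To make the first few bullets concrete, here is a minimal sketch in Python of a client calling a stateless service that has redundant hot copies spread across zones. The endpoint URLs are hypothetical; the point is the design-for-failure shape: any healthy copy can answer, and the caller simply moves on when one zone disappears.

    import urllib.request
    import urllib.error

    # Hypothetical redundant copies of the same stateless service, one per zone.
    ENDPOINTS = [
        "https://svc-us-east-1a.example.com/api/titles",
        "https://svc-us-east-1b.example.com/api/titles",
        "https://svc-us-west-2a.example.com/api/titles",
    ]

    def call_service(timeout=2.0):
        """Try each redundant copy in turn; because the service keeps no
        session state, any zone's copy can serve the request."""
        last_error = None
        for url in ENDPOINTS:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.read()
            except (urllib.error.URLError, TimeoutError) as err:
                last_error = err  # design for failure: assume any zone can vanish
                continue
        raise RuntimeError(f"all replicas failed; last error: {last_error}")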

If this seems daunting, new services are emerging to provide scalability and availability for you.

The emerging solution to this complexity is a new class of application servers that offers to take care of the high availability and scalability concerns of your application, allowing you to focus on your business logic. Forrester calls these "Elastic Application Platforms", and defines them as:

An application platform that automates elasticity of application transactions, services, and data, delivering high availability and performance using elastic resources.

Steve Jobs Keynote: serious about data centers, compares Apple, Amazon, and Google

Steve Jobs gave his iCloud keynote: http://events.apple.com.edgesuite.net/11piubpwiqubf06/event/

At minute 115:00 you can see Steve Jobs compare the cost of Apple's, Amazon's, and Google's music cloud services.

To show that Apple is committed to iCloud, he makes the point that Apple is serious about data centers. Steve discusses its third data center, in Maiden, NC, at minute 116:00.

Steve says this data center is as eco-friendly as a data center can be with modern technology.

Steve is a great showman as usual and wows people by showing the scale of the building, then points to two dots on the roof that are actually people, getting laughs from the crowd. When was the last time you heard a crowd laugh at the scale of a data center?

"Full of stuff.  expensive stuff."  More laughs.  Who would ever call millions of dollars of IT equipment stuff?  You won't see Jobs calling an iPhone, iPod, or iPad stuff.  Do you think he is making fun of the other stuff he doesn't make?

It's been over 20 years since I worked at WWDC as an Apple employee, and I never would have thought Steve Jobs would be talking about data centers. A lot has changed in 20 years. Wow, 20 years, and there are people I know who have been there the whole time. This video was probably some of the first pictures they've seen of their mothership data center in Maiden, NC.

Will Automation automatically fix Amazon's outage issues? "automat*" mentioned 9 times in post-mortem

A while ago Amazon posted its post-mortem on the AWS outage. One of the entertaining ways to look at the summary is to count the number of times "automat*" gets used.

Here are a few examples.

For these database instances, customers with automatic backups turned on (the default setting) had the option to initiate point-in-time database restore operations.

RDS multi-AZ deployments provide redundancy by synchronously replicating data between two database replicas in different Availability Zones. In the event of a failure on the primary replica, RDS is designed to automatically detect the disruption and fail over to the secondary replica. Of multi-AZ database instances in the US East Region, 2.5% did not automatically failover after experiencing “stuck” I/O. The primary cause was that the rapid succession of network interruption (which partitioned the primary from the secondary) and “stuck” I/O on the primary replica triggered a previously un-encountered bug. This bug left the primary replica in an isolated state where it was not safe for our monitoring agent to automatically fail over to the secondary replica without risking data loss, and manual intervention was required. We are actively working on a fix to resolve this issue.

So, AWS figured out there was a bug that prevented the monitoring agent from automatically failing over.

This bug left the primary replica in an isolated state where it was not safe for our monitoring agent to automatically fail over to the secondary replica without risking data loss, and manual intervention was required. We are actively working on a fix to resolve this issue.

And, they are going to fix the problem with more automation.

We will audit our change process and increase the automation to prevent this mistake from happening in the future.

Here are a few more areas where automat* is mentioned.

We’ll also continue to deliver additional services like S3, SimpleDB and multi-AZ RDS that perform multi-AZ level balancing automatically so customers can benefit from multiple Availability Zones without doing any of the heavy-lifting in their applications.

Speeding Up Recovery

We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster. We have a number of operational tools for managing an EBS cluster, but the fine-grained control and throttling the team used to recover the cluster will be built directly into the EBS nodes. We will also automate the recovery models that we used for the various types of volume recovery that we had to do. This would have saved us significant time in the recovery process.

With automat* mentioned so many times, it makes you think there is a lot of manual work going on in AWS.
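
The failover passage above also describes, in prose, the decision a monitoring agent has to make: fail over automatically only when it is provably safe, and hand the problem to a human otherwise. Here is a minimal sketch of that decision loop in Python, with stubbed-out probes; none of this is AWS's actual agent, and the function names are assumptions for illustration only.

    import time

    # Stand-ins for the probes a real monitoring agent would use; the names
    # and canned answers here are assumptions for illustration only.
    def primary_healthy() -> bool:
        return False   # pretend the primary just failed

    def secondary_in_sync() -> bool:
        return False   # pretend replication state is uncertain ("stuck" I/O)

    def promote_secondary() -> None:
        print("promoting secondary replica to primary")

    def page_operator(reason: str) -> None:
        print(f"PAGE: manual intervention required: {reason}")

    def watch_and_failover(poll_seconds: float = 5.0) -> None:
        """Fail over automatically only when it is provably safe; otherwise
        escalate to a human -- the trade-off the post-mortem describes."""
        while True:
            if not primary_healthy():
                if secondary_in_sync():
                    promote_secondary()   # safe: no acknowledged writes are lost
                else:
                    page_operator("primary down but replica may be stale")
                return
            time.sleep(poll_seconds)

    if __name__ == "__main__":
        watch_and_failover()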

If you want an automated private cloud, you could learn from some of the original AWS EC2 team, who went on to start the company Nimbula: http://nimbula.com/

Google Infrastructure is more than Data Centers and Servers, it's software

In the data center world, if you hear the word infrastructure you naturally think of data centers and servers. Why not? Infrastructure is defined as:

Infrastructure is the basic physical and organizational structures needed for the operation of a society or enterprise,[1] or the services and facilities necessary for an economy to function.[2] The term typically refers to the technical structures that support a society, such as roads, water supply, sewers, electrical grids, telecommunications, and so forth.

A couple of years ago at a conference I was talking to a Google architect, and I eventually asked what he did. He said, "I work on the infrastructure." When he said infrastructure, I named a few people in the data center group I had run into at data center conferences, but he didn't know any of them. Then he repeated, "I work on THE infrastructure. What we build our applications on - search, storage, compute." Ohhh, you guys get it that your infrastructure is more than physical devices. Software is infrastructure that few think about when building services; most think only of the physical kind.

GigaOm has a post with Google's infrastructure czar, Urs Hölzle. Om Malik says it has been 5 years since he last touched base with Urs. I would never go that long.

Hölzle was the company’s first VP of engineering, and he has led the development of Google’s technical infrastructure.

Hölzle’s current responsibilities include the design and operation of the servers, networks and data centers that power Google. It would be an understatement to say that he is amongst the folks who have shaped the modern web-infrastructure and cloud-related standards.

When you read the GigaOm post don't just think physical infrastructure, think about the software Google has in place to support cloud services.

Others might disagree, but Hölzle believes Google’s common infrastructure gives it a technological and financial edge over on-premise solutions. “We’re able to avoid some of that fragmentation and build on a common infrastructure,” says Hölzle. “That’s actually one of the big advantages of the cloud.”

When will Netflix move to another cloud besides AWS?

I don’t own any Netflix stock, but if I did I would ask, “Do you really think hosting Netflix in streaming-media competitor Amazon.com’s data centers is the best decision? What other cloud providers have you evaluated besides AWS?”

Being in the cloud makes sense, but couldn’t Netflix be in a facility like SoftLayer or Rackspace?

With Netflix’s outage in AWS, should these questions get asked?

Netflix Confirms Outage; Showtime Shows to Be Pulled

By Mark Hachman

Netflix reported problems with its Web site and streaming service on Tuesday night, which the company has yet to explain.

While the Web site was functional at 8 PM PT, newer interfaces such as those used by the Logitech Revue were unable to connect. Other users reported that Netflix streaming was still down via the Roku box and the PlayStation 3.

"We are aware that the website may not work for everyone at this time. We're working to get it fixed as quickly as we can," the NetflixHelps account tweeted about 4 PM Pacific time.

Netflix proudly discusses the 5 reasons why they went to AWS. I wonder what Netflix thinks about their #3 point.

3. The best way to avoid failure is to fail constantly.

We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.

If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.

One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
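
The core idea behind the Chaos Monkey is small enough to sketch. Here is a minimal, hypothetical version in Python using today's boto3 SDK - this is not Netflix's actual tool, and the opt-in tag name is made up - that randomly terminates one running EC2 instance that has explicitly opted in to chaos testing:

    import random
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def candidate_instances(tag_key="chaos-opt-in", tag_value="true"):
        """Find running instances that have explicitly opted in to chaos testing."""
        resp = ec2.describe_instances(
            Filters=[
                {"Name": f"tag:{tag_key}", "Values": [tag_value]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )
        return [
            inst["InstanceId"]
            for reservation in resp["Reservations"]
            for inst in reservation["Instances"]
        ]

    def kill_one():
        """Terminate one randomly chosen opted-in instance, if any exist."""
        ids = candidate_instances()
        if not ids:
            return None
        victim = random.choice(ids)
        ec2.terminate_instances(InstanceIds=[victim])
        return victim

    if __name__ == "__main__":
        print("terminated:", kill_one())

Even something this small, run on a schedule against an opted-in fleet, forces the surrounding services to prove they keep working when instances disappear - which is exactly the habit the quote describes.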