Do you have a Power Hog in your Data Center? AOL saves the bacon moving to the cloud

I wrote about Mike Manos's post on Attacking the Cruft.

What I liked best was the Power Hog part.

Power Hog – An effort to audit our data center facilities, equipment, and the like looking for inefficient servers, installations, and /or technology and migrating them to new more efficient platforms or our AOL Cloud infrastructure.  You knew you were in trouble when you had a trophy of a bronze pig appear on your desk or office and that you were marked.

I took a picture of Pike Place Market's famous brass pig and put it in my post as a placeholder.

Within 30 minutes Mike sent me a picture of the "AOL Power Hog" with the call to action "The Cloud is Calling…Help Save Our Bacon."

[Photo: the AOL Power Hog trophy]

Do you have a Power Hog in your data center?  One way to get them to go on a diet is to give them this trophy.

Maybe data centers need a wall of shame: the top 10 power hogs.

Who would you nominate?  Your HR system? The latest acquisition? The executive pet project that has grown and grown?

 

1000 Genomes, 200+TB of data available in AWS to run compute jobs

Normally when you think of running a compute project in AWS, you need to move your data in and then compute.  AWS now hosts the 1000 Genomes Project, with over 200 TB of data available to run compute jobs against without having to move the data into the environment yourself.

The 1000 Genomes Project

We're very pleased to welcome the 1000 Genomes Project data to Amazon S3.

The original human genome project was a huge undertaking. It aimed to identify every letter of our genetic code, 3 billion DNA bases in total, to help guide our understanding of human biology. The project ran for over a decade, cost billions of dollars and became the corner stone of modern genomics. The techniques and tools developed for the human genome were also put into practice in sequencing other species, from the mouse to the gorilla, from the hedgehog to the platypus. By comparing the genetic code between species, researchers can identify biologically interesting genetic regions for all species, including us.

This is a lot of data.

The data is vast (the current set weighs in at over 200Tb), so hosting the data on S3 which is closely located to the computational resources of EC2 means that anyone with an AWS account can start using it in their research, from anywhere with internet access, at any scale, whilst only paying for the compute power they need, as and when they use it. This enables researchers from laboratories of all sizes to start exploring and working with the data straight away. The Cloud BioLinux AMIs are ready to roll with the necessary tools and packages, and are a great place to get going.

Making the data available via a bucket in S3 also means that customers can crunch the information using Hadoop via Elastic MapReduce, and take advantage of the growing collection of tools for running bioinformatics job flows, such as CloudBurst and Crossbow.
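To make the "no download required" point concrete, here is a minimal sketch of listing objects in the public data set with Python and boto3. The bucket name 1000genomes and anonymous access are assumptions; check the AWS page linked below for the current details.

# Minimal sketch: list a few objects in the public 1000 Genomes S3 bucket
# without downloading the 200+ TB data set. The bucket name "1000genomes"
# and unsigned (anonymous) access are assumptions.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="1000genomes", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

An EC2 instance or Elastic MapReduce job flow in the same region reads these objects over AWS's internal network, which is what lets researchers pay only for the compute they use rather than for moving 200 TB around.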

It is interesting to think that AWS is hosting data that is too expensive for people to move around.

More information can be found here: http://aws.amazon.com/1000genomes/

If you want to get the data yourself, here it is:

Other Sources

The 1000 Genomes project data are also freely accessible through the 1000 Genomes website, and from each of the two institutions that work together as the project Data Coordination Centre (DCC).

Human Errors cause Windows Azure Feb 29th outage

On Mar 1, I met with an out-of-town friend in Bellevue, and one of the other people who joined us was a visiting Microsoft MVP.  In the conversation, he brought up the Feb 29 outage of Windows Azure, shared his views on what had gone on, and asked how Microsoft could make a leap year mistake.  How? Human error is an easy explanation.

Here are a few of the media posts.

Microsoft tries to make good on Azure outage (GigaOm)
In his post, Bill Laing, corporate VP of Microsoft's server and cloud division, said the outage affected Windows Azure Compute and dependent services ...

Microsoft Offers Credit for Azure Cloud Outage (Data Center Knowledge)
Microsoft details leap day bug that took down Azure, refunds customers (Ars Technica)
Microsoft Azure Outage Blamed on Leap Year (CloudTweaks News)

A high-level description is provided by GigaOm's Barb Darrow.

Microsoft tries to make good on Azure outage

Microsoft is issuing credits for the recent Leap Day Azure outage. The glitch, which cropped up on Feb. 29 and persisted well into the next day, was a setback to Microsoft, which is trying to convince businesses and consumers that its Azure platform-as-a-service is a safe and secure place to put their data and host their applications.

But, I want to point out some interesting details in Bill Laing's blog post.

There are three human errors that, had they been avoided, could have prevented the problem (a sketch of the underlying date/time bug follows the excerpt).

Prevention

  • Testing. The root cause of the initial outage was a software bug due to the incorrect manipulation of date/time values.  We are taking steps that improve our testing to detect time-related bugs.  We are also enhancing our code analysis tools to detect this and similar classes of coding issues, and we have already reviewed our code base.
  • Fault Isolation. The Fabric Controller moved nodes to a Human Investigate (HI) state when their operations failed due to the Guest Agent (GA) bug.  It incorrectly assumed the hardware, not the GA, was faulty.  We are taking steps to distinguish these faults and isolate them before they can propagate further into the system.
  • Graceful Degradation. We took the step of turning off service management to protect customers’ already running services during this incident, but this also prevented any ongoing management of their services.  We are taking steps to have finer granularity controls to allow disabling different aspects of the service while keeping others up and visible.
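The testing bullet above points at the root cause: date/time manipulation that breaks on a leap day. The exact Azure code is not public, so the following is only a minimal Python sketch of the classic "add one year to this date" pitfall, with a safe variant that clamps to the last valid day of the month.

# Minimal sketch of the leap-day pitfall, assuming the bug was in
# "add one year to this date" style logic (the actual Azure code is not public).
from datetime import date
import calendar

def naive_one_year_later(d: date) -> date:
    # Raises ValueError when d is Feb 29: there is no Feb 29 the next year.
    return d.replace(year=d.year + 1)

def safe_one_year_later(d: date) -> date:
    # Clamp to the last valid day of the month in the target year.
    year = d.year + 1
    last_day = calendar.monthrange(year, d.month)[1]
    return date(year, d.month, min(d.day, last_day))

leap_day = date(2012, 2, 29)
print(safe_one_year_later(leap_day))      # 2013-02-28
try:
    print(naive_one_year_later(leap_day))
except ValueError as err:
    print("naive version fails on leap day:", err)

A test suite that exercises Feb 29 and other boundary dates is the kind of testing improvement the first bullet describes.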

Another human error: the system took 75 minutes to notify people that there was a problem (a sketch of the fail-fast idea follows the excerpt).

Detection

  • Fail Fast. GA failures were not surfaced until 75 minutes after a long timeout.  We are taking steps to better classify errors so that we fail-fast in these cases, alert these failures and start recovery.
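The fail-fast bullet describes a pattern rather than specific code. Here is a hypothetical sketch of the idea in Python; the function and error names are illustrative, not Azure's actual Fabric Controller interfaces.

# Hypothetical fail-fast sketch: classify errors so a deterministic failure
# is alerted immediately instead of being discovered after a long timeout.
# All names here are illustrative, not Azure's actual components.
import time

FATAL_ERRORS = {"invalid certificate", "incompatible guest agent"}

def alert(message):
    print("ALERT:", message)           # stand-in for paging / the dashboard

def provision_node(start_guest_agent, timeout_s=10, retries=3):
    for attempt in range(retries):
        try:
            return start_guest_agent(timeout=timeout_s)
        except RuntimeError as err:
            if str(err) in FATAL_ERRORS:
                alert(f"fatal guest agent error: {err}; starting recovery")
                raise                   # fail fast: retrying cannot help
            time.sleep(2 ** attempt)    # transient error: back off and retry
    alert("guest agent did not start; marking node for investigation")
    raise TimeoutError("guest agent start timed out")

The point of the sketch is the classification step: a known-fatal error is surfaced in seconds, while only genuinely transient errors consume the retry budget.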

Lack of communication made the problems worse.

Service Dashboard.  The Windows Azure Dashboard is the primary mechanism to communicate individual service health to customers.  However the service dashboard experienced intermittent availability issues, didn’t provide a summary of the situation in its entirety, and didn’t provide the granularity of detail and transparency our customers need and expect.

...

Other Communication Channels.  A significant number of customers are asking us to better use our blog, Facebook page, and Twitter handle to communicate with them in the event of an incident.  They are also asking that we provide official communication through email more quickly in the days following the incident.  We are taking steps to improve our communication overall and to provide more proactive information through these vehicles.  We are also taking steps to provide more granular tools to customers and support to diagnose problems with their specific services.

One of the nice things about a cloud service is the need for transparency on the cause of outages.  This is a marketing exercise that needs to make sense to a critical-thinking technical person.

Conclusion

We will continue to spend time to fully understand all of the issues outlined above and over the coming days and weeks we will take steps to address and mitigate the issues to improve our service.  We know that our customers depend on Windows Azure for their services and we take our SLA with customers very seriously.  We will strive to continue to be transparent with customers when incidents occur and will use the learning to advance our engineering, operations, communications and customer support and improve our service to you.

The Feb 29th outage was like a Y2K bug that caught Microsoft flat-footed.  There was little blame to place on hardware failure.  What caused the problems were human decisions made in error.

Is the Public Cloud a place of refuge from the infighting in Enterprise IT?

There are many reasons why the public cloud is popular.  MSNBC has a post on how executives hate their jobs just as much as lower-level employees.

Execs are just like you: They don't like their jobs, either


By Allison Linn

If you feel stuck in a job you don’t like, maybe you can take comfort in the fact that the big boss may well be in the same boat.

A new global survey of business executives finds that less than half like their jobs, although most don’t plan on leaving.

The Path Forward, a survey of 3,900 business executives from around the world conducted by consulting firm Accenture, found that only 42 percent said they were satisfied with their jobs. That’s down slightly from 2010.

And reading The Power of Habit reminded me of a possible reason for the displeasure: the fact that some companies are in a civil war.

Companies aren’t big happy families where everyone plays together nicely. Rather, most workplaces are made up of fiefdoms where executives compete for power and credit, often in hidden skirmishes that make their own performances appear superior and their rivals’ seem worse. Divisions compete for resources and sabotage each other to steal glory.

Companies aren’t families. They’re potential battlefields in a civil war.

Then it hit me that the data center is the one place where all these families (internal company teams) need to put their information.  What other place besides finance has the whole organization connecting?  The finance scenario is actually probably easier, as it is ultimately a money issue.  But enterprise IT is very complex.

If you accept that getting everyone to get along in enterprise IT is difficult, wearing, and frustrating, then maybe people just want to escape the mental anguish and the feuding between groups.  The lower costs and better service of a cloud environment like AWS could be side benefits when the ultimate reason was frustration with dealing with central enterprise IT.  If you accept this as a potential reason why users have gone to the public cloud, they are not going to be satisfied with a private cloud run by that same central enterprise IT.

 

Oops, Cloud Data Center opened in May 2011 to serve the US Fed gov't closes

The media makes it seem like the cloud is everything, but when you look carefully, the cloud momentum isn't as big as some believe.  Want some proof? Harris Corporation is shutting down its cloud data center operations because it finds its federal customers prefer to keep systems on their own premises.

‘Cloud’ Data Center Closes Because Federal Agencies Prefer Earth

According to Harris, the government prefers to keep its data in-house.

Harris Corporation — an outfit that provides computing infrastructure for government agencies — is selling its super-secure data center in Harrisonburg, Virginia and leaving the “cloud computing” business, saying that both its government and commercial customers prefer hosting “mission-critical information” on their own premises rather than in the proverbial cloud.

...

“[The closure will] allow us to refocus our capital and efforts on the secure, cost-effective communications and IT solutions that our customers are demanding,” read a statement from Harris CEO William M. Brown.