A Lesson from Minority Report, sometimes you want everybody agreeing to be right

Two of my friends and I have been discussing a variety of technical and business decisions that need to be made.  One of the things we have done is to make it a rule that all three of us need to be in agreement on decisions.  Having three decision makers is a good pattern to ensure that a diversity of perspectives is included in the analysis, and decisions can still be made if one decision maker is not available.

Triple redundancy, though, is typically used where, as long as two systems are in agreement, you can make a decision.

In computing, triple modular redundancy, sometimes called triple-mode redundancy (TMR), is a fault-tolerant form of N-modular redundancy, in which three systems perform a process and that result is processed by a voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the fault.
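
The voter itself is simple. Here is a minimal sketch (plain Python, illustrative only) of a three-way voter that masks a single faulty unit and, importantly for what follows, also reports which unit disagreed instead of silently throwing that result away.

    from collections import Counter

    def tmr_vote(results):
        """Majority-vote over the results from three redundant units.

        Returns (voted_value, minority), where minority lists the
        (unit_index, value) pairs for any unit that disagreed with the
        majority. Raises if no two units agree.
        """
        if len(results) != 3:
            raise ValueError("triple modular redundancy expects exactly 3 results")
        value, votes = Counter(results).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("no majority: all three units disagree")
        minority = [(i, r) for i, r in enumerate(results) if r != value]
        return value, minority

    # Example: unit 2 has a fault; the voter masks it but does not hide it.
    voted, minority = tmr_vote([42, 42, 41])
    print(voted)     # 42
    print(minority)  # [(2, 41)] -- the "minority report"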

But an example of the flaw in this approach can be taken from Minority Report and its use of precogs, where a zealousness to come to a conclusion allows a "minority report" to be discarded.

Majority and minority reports

Each of the three precogs generates its own report or prediction. The reports of all the precogs are analyzed by a computer and, if these reports differ from one another, the computer identifies the two reports with the greatest overlap and produces a majority report, taking this as the accurate prediction of the future. But the existence of majority reports implies the existence of a minority report.

James Hamilton has a blog post on error detection.  Errors could be considered the crimes in the data center, and you can falsely assume there are no errors (crimes) because there is error correction in various parts of the system.

Every couple of weeks I get questions along the lines of “should I checksum application files, given that the disk already has error correction?” or “given that TCP/IP has error correction on every communications packet, why do I need to have application level network error detection?” Another frequent question is “non-ECC mother boards are much cheaper -- do we really need ECC on memory?” The answer is always yes. At scale, error detection and correction at lower levels fails to correct or even detect some problems. Software stacks above introduce errors. Hardware introduces more errors. Firmware introduces errors. Errors creep in everywhere and absolutely nobody and nothing can be trusted.

If you think like this, you end up with advice like the following.

This incident reminds us of the importance of never trusting anything from any component in a multi-component system. Checksum every data block and have well-designed, and well-tested failure modes for even unlikely events. Rather than have complex recovery logic for the near infinite number of faults possible, have simple, brute-force recovery paths that you can use broadly and test frequently. Remember that all hardware, all firmware, and all software have faults and introduce errors. Don’t trust anyone or anything. Have test systems that bit flips and corrupts and ensure the production system can operate through these faults – at scale, rare events are amazingly common.
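
In practice, "checksum every data block" can be as simple as storing a digest alongside each block on write and verifying it on every read, no matter what the disk, the filesystem, or TCP already claim to have checked. A minimal sketch in Python, illustrative only and not any particular storage system's on-disk format:

    import hashlib
    import os

    def write_block(path: str, data: bytes) -> None:
        """Write a data block with its SHA-256 digest prepended."""
        digest = hashlib.sha256(data).digest()
        with open(path, "wb") as f:
            f.write(digest + data)

    def read_block(path: str) -> bytes:
        """Read a block and verify it against its stored digest.

        Raises on any mismatch rather than returning silently corrupted
        data; the caller decides how to recover (re-replicate, re-fetch,
        fail the request).
        """
        with open(path, "rb") as f:
            stored = f.read(32)   # a SHA-256 digest is 32 bytes
            data = f.read()
        if hashlib.sha256(data).digest() != stored:
            raise IOError(f"checksum mismatch reading {path}")
        return data

    # Round-trip a block; any bit flip between write and read is caught.
    write_block("block_0001.dat", b"application payload")
    print(read_block("block_0001.dat"))
    os.remove("block_0001.dat")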

Maybe you shouldn't just let the majority rule; listen to the minority.  All it takes is one small system, a system in the minority, to bring down a service.

One Way to Build Real-Time Energy Controls for the Data Center, Inventory all your assets with an IP address

DCIM is a hot topic, and a lot of what people are trying to do is get control over their data center assets.  One of my data center friends heard that the current count is up to 59 companies providing some form of DCIM solution.

The people who have been the early adopters of DCIM have learned that no one tool does it all, and they need to put together multiple products to solve their DCIM problem.  What is lacking is an architecture for DCIM that meets the needs of their enterprise.

What is an architecture?  Here is the Merriam-Webster definition.  Senses 1-4 are what most people think about.  I am referring to sense 5, for computer systems.

Definition of ARCHITECTURE

1 : the art or science of building; specifically : the art or practice of designing and building structures and especially habitable ones
2 a : formation or construction resulting from or as if from a conscious act <the architecture of the garden>  b : a unifying or coherent form or structure <the novel lacks architecture>
3 : architectural product or work
4 : a method or style of building
5 : the manner in which the components of a computer or computer system are organized and integrated

Great architecture solves problems

So what problem are you trying to solve?

The problem is that most don't even know what problems they have that need to be solved.  For example, what knobs and dials exist to control the energy consumption in the data center?  Who gets to turn the knobs and dials?


If you think you want to control the energy consumption of your IT assets, you need to know all your assets and be able to talk to them, to control them.  One approach is JouleX's latest product.  (A small discovery sketch of the inventory step follows the feature list below.)

Other highlights of JEM’s expanded capabilities include:

  • New device control features allowing for integrated power capping, central processing unit (CPU) performance leveling (for Windows, VMware, Linux), and support for VMware vCenter, and Distributed Power Management (DPM).
  • More granular control using power, temperature and utilization metrics to migrate virtual machines and optimize server performance and energy consumption using VMware vMotion.
  • Detailed analytics and reporting for sustainable procurement (analysis of device models for energy efficiency, for device replacement planning, and for measuring energy and carbon savings from virtualization projects)
  • Virtualization and utilization reporting to identify under- and over-utilized devices. This identifies candidates for upgrading, retiring, and virtualizing.
  • Additional device support for rack and floor power distribution units (PDUs) as well as computer room air conditioning (CRAC) units and uninterruptible power supplies (UPSs).
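
Going back to the title of this post, none of these controls work until you have the inventory itself. Here is a minimal sketch (plain Python, my own illustration and not part of any JouleX product) that sweeps a subnet and records which assets answer at an IP address; a real DCIM tool would follow up over SNMP, IPMI, or vendor APIs to learn what each responding device actually is.

    import ipaddress
    import platform
    import subprocess

    def responds(ip: str) -> bool:
        """Send one ping and report whether the address answered."""
        flag = "-n" if platform.system() == "Windows" else "-c"
        return subprocess.run(
            ["ping", flag, "1", ip],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode == 0

    def inventory(subnet: str) -> list:
        """Sweep a subnet and return the addresses that answered.

        This only tells you something is there; identifying what it is
        and what controls it exposes is the next (harder) step.
        """
        return [str(ip) for ip in ipaddress.ip_network(subnet).hosts()
                if responds(str(ip))]

    # Example: sweep a hypothetical management subnet.
    print(inventory("10.0.42.0/28"))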

JouleX CEO Tom Noonan explains the latest version.

“We continue to accelerate our technology development in terms of creating actionable energy intelligence for our customers to make quicker decisions, optimize their existing infrastructure, and reduce their operating expenses,” said Tom Noonan, president and CEO at JouleX.

But I think of it as a real-time energy control system: your energy thermostat for the data center.
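
Thinking of it as a thermostat makes the architecture question concrete: something has to read power on a cycle, compare it to a budget, and turn the knobs. Here is a minimal sketch of that loop in Python; read_power_watts and set_power_cap_watts are hypothetical stand-ins for whatever your PDUs, BMCs, or DCIM tool actually expose.

    POWER_BUDGET_WATTS = 250_000   # the thermostat setpoint for the room
    CAP_STEP_WATTS = 10            # how hard to turn the knob each cycle

    def read_power_watts(asset: str) -> float:
        """Hypothetical: read current draw from a PDU outlet or BMC."""
        raise NotImplementedError("wire this to your metering hardware")

    def set_power_cap_watts(asset: str, cap: float) -> None:
        """Hypothetical: apply a power cap via IPMI/DCMI or a vendor API."""
        raise NotImplementedError("wire this to your control hardware")

    def control_pass(assets, caps):
        """One pass of the energy thermostat.

        Over budget: tighten every cap a little. Comfortably under:
        relax them. A real policy would weigh workload priority, SLAs,
        and thermal zones instead of treating all assets equally.
        """
        total = sum(read_power_watts(a) for a in assets)
        if total > POWER_BUDGET_WATTS:
            step = -CAP_STEP_WATTS
        elif total < 0.9 * POWER_BUDGET_WATTS:
            step = CAP_STEP_WATTS
        else:
            return
        for a in assets:
            caps[a] += step
            set_power_cap_watts(a, caps[a])

    # A scheduler or cron job would call control_pass() on a fixed cycle,
    # for example once a minute, over the inventory gathered above.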

One approach is JouleX, and there are many others.  The hardest question, which almost no one asks, is: what is the architecture of your DCIM solution?  What problems are you trying to solve?

We have a long way to go in DCIM.


A view from 10 ft high of my new kitchen and living space

My Missouri Data Center friends were in town yesterday, and they came over for lunch and a bottle of wine in the evening after they had finished their meetings.  I've made many trips to Missouri to visit their homes, and even went on a trip to Northern Missouri for Deer Camp, which I renamed Beer Camp as I saw way more beer than deer.  One of the comments my guests made is that they have seen pictures of the house on this blog, but the living space and kitchen are hard to grasp until you are in the space.  So let's try to show the space from a different view: a view from 10 ft high.

Here is a view 10 feet in front of my pizza oven at my height.

Here is a closer view, 3 ft away, looking into the oven.

I backed up, got on a ladder, and went 3 ft higher.  I am now at about the same height as the light fixtures, at 8 ft.

From this angle I am now looking down into the oven.  BTW, love my new Canon 24-105 IS F4L lens.

A little better, but let's try higher on the ladder.  I go up to the top rung and my head is now at 10 ft.  FYI, the ceilings are 12 1/2 ft.  When you look closely at this picture, you can see my reflection in the frame glass.

Here is the same picture taken while I am standing on the ground.

Back on the ladder, let's look at my pizza oven in the kitchen through a wide-angle 38mm lens.  The refrigerator to the left is 78" high.


I shoot a level shot across the room from 10 ft in the air.

Coming back to the ground, let's try a shot with my wife and son for some scale.


I have already invited some of my data center friends to come on over for pizza, wine, and beer, and to use it as an excuse to visit the Redmond/Seattle area.  It was good to have my Missouri Data Center friends be among the first to see the house now that we are almost done.

For you mechanical and construction guys, here is a view of the structural steel in the ceiling.


11 price reductions over 4 years, Amazon Web Services' James Hamilton on the pace of innovation

James Hamilton is keynoting at SIGMOD in Athens, and his presentation description has some good ideas to think about.

Keynote 1: James Hamilton, Amazon Web Services

Internet Scale Storage

Abstract

The pace of innovation in data center design has been rapidly accelerating over the last five years, driven by the mega-service operators. I believe we have seen more infrastructure innovation in the last five years than we did in the previous fifteen. Most very large service operators have teams of experts focused on server design, data center power distribution and redundancy, mechanical designs, real estate acquisition, and network hardware and protocols. At low scale, with only a data center or two, it would be crazy to have all these full time engineers and specialists focused on infrastructural improvements and expansion. But, at high scale with tens of data centers, it would be crazy not to invest deeply in advancing the state of the art.

Looking specifically at cloud services, the cost of the infrastructure is the difference between an unsuccessful cloud service and a profitable, self-sustaining business. With continued innovation driving down infrastructure costs, investment capital is available, services can be added and improved, and value can be passed on to customers through price reductions. Amazon Web Services, for example, has had eleven price reductions in four years. I don’t recall that happening in my first twenty years working on enterprise software. It really is an exciting time in our industry.

Here is another thing to keep in mind.  From reading the statement below, it seems Amazon Web Services does not use blades.  If Amazon has determined it shouldn't use blades, why should you?

  • Datacenter Construction Costs
      • Land: <2%
      • Shell: 5 to 9%
      • Architectural: 4 to 7%
      • Mechanical & Electrical: 70 to 85%
  • Summarizing the above list, we get 80% of the costs scaling with power consumption and 10 to 20% scaling with floor space. Reflect on that number and you’ll understand why I think the industry is nuts to be focusing on density. See Why Blade Servers Aren’t the Answer to All Questions for more detail on this point – I think it’s a particularly important one.
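
It is worth doing that arithmetic yourself. Here is a back-of-envelope sketch in Python; the only numbers taken from the list above are the roughly 80/20 split between power-scaling and space-scaling costs, while the dollar figures and the blade premium are made up for illustration.

    # Back-of-envelope: does paying a density premium pay for itself?
    # Per the breakdown above, ~80% of build cost scales with power and
    # ~20% with floor space. All dollar figures below are hypothetical.

    total_build_cost = 100_000_000                     # facility build cost
    space_scaling_cost = 0.20 * total_build_cost       # the part density can shrink

    # Suppose blades double density, halving the space-scaling portion...
    space_saved = 0.5 * space_scaling_cost             # $10M saved

    # ...but cost a modest premium per server across the fleet.
    servers = 50_000
    blade_premium_per_server = 300
    premium_paid = servers * blade_premium_per_server  # $15M spent

    print(f"floor-space savings: ${space_saved:,.0f}")
    print(f"density premium:     ${premium_paid:,.0f}")
    # The savings lose to the premium because the dominant 80% of cost
    # scales with power, not space -- which is Hamilton's point.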

Since 2008, James has discussed blades.

Summary so far: Blade servers allow for very high power density but they cost more than commodity, low power density servers. Why buy blades? They save space and there are legitimate reasons to locate data centers where the floor space is expensive. For those, more density is good. However, very few data center owners with expensive locations are able to credibly explain why all their servers NEED to be there. Many data centers are in poorly chosen locations driven by excessively manual procedures and the human need to see and touch that for which you paid over 100 million dollars. Put your servers where humans don’t want to be. Don’t worry, attrition won’t go up. Servers really don’t care about life style, how good the schools are, and related quality of life issues.

Here is a simple one-liner.

Density is fine but don’t pay a premium for it unless there is a measurable gain and make sure that the gain can’t be achieved by cheaper means.

Architecting for Outages, an architect's post on surviving the AWS outage

Everyone wants to survive a data center outage, but as the AWS outage shows, not all do survive.  Here is a post that summarizes best practices in software architecture for surviving an outage like AWS's.

Retrospect on recent AWS outage and Resilient Cloud-Based Architecture

Thursday, June 9, 2011 at 8:19AM

A bit over a month ago Amazon experienced its infamous AWS outage in the US East Region. As a cloud evangelist, I was intrigued by the history of the outage as it occurred. There were great posts during and after the outage from those who went down. But more interestingly for me as architect were the detailed posts of those who managed to survive the outage relatively unharmed, such as SimpleGeo, Netflix, SmugMug, SmugMug’s CTO, Twilio, Bizo and others.

Here is the list of best practices; a small code sketch follows the list.

The main principles, patterns and best practices are:

  • Design for failure
  • Stateless and autonomous services
  • Redundant hot copies spread across zones
  • Spread across several public cloud vendors and/or private cloud
  • Automation and monitoring
  • Avoiding ACID services and leveraging on NoSQL solutions
  • Load balancing
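
"Design for failure" and "redundant hot copies spread across zones" come down to code that expects a zone to be gone and simply moves on. Here is a minimal sketch in Python; fetch_from_zone is a hypothetical stand-in for your replicated storage or service client, and the zone names are just examples.

    import random

    ZONES = ["us-east-1a", "us-east-1b", "us-east-1d"]

    class ZoneUnavailable(Exception):
        """Raised when a zone cannot serve the request."""

    def fetch_from_zone(zone: str, key: str) -> bytes:
        """Hypothetical read of a hot copy from one availability zone."""
        raise NotImplementedError("wire this to your replicated data store")

    def resilient_read(key: str) -> bytes:
        """Try zones in random order and survive any single-zone outage.

        This only works if the data was already written redundantly to
        every zone; failover logic cannot conjure a copy that was never
        made.
        """
        last_error = None
        for zone in random.sample(ZONES, k=len(ZONES)):  # no fixed favorite zone
            try:
                return fetch_from_zone(zone, key)
            except ZoneUnavailable as err:
                last_error = err          # record it and keep going
        raise RuntimeError(f"all zones failed for key {key!r}") from last_error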

If this seems daunting, there are new services coming that provide scalability and availability for you.

The emerging solution to this complexity is a new class of application servers that offers to take care of the high availability and scalability concerns of your application, allowing you to focus on your business logic. Forrester calls these "Elastic Application Platforms", and defines them as:

An application platform that automates elasticity of application transactions, services, and data, delivering high availability and performance using elastic resources.