Checklist expert, Boeing's Dan Boorman: who from the data center industry will contact him?

July 8, 2011 Dave Ohara

In The Checklist Manifesto, which I posted about last month, there is a discussion of how the author, Atul Gawande, contacted Boeing to find a checklist expert.

Here is the Boeing pdf about Boeing's checklist expertise.

Along with his primary responsibilities, Boorman is the contact for organizations outside of aviation that want to benefit from checklists. He has worked with the FBI, the American Society of Radiation Oncologists, Northwestern Memorial Hospital in Chicago and the Washington State Hospital Association. One of the most important beneficiaries of Boeing’s checklist knowledge is the World Health Organization. Using ideas learned in part from Boorman and the Flight Technical & Safety team, a study of eight hospitals around the world showed that major complications for surgical patients decreased 36 percent after the introduction of checklists. Deaths fell by 47 percent. The World Health Organization now is creating and distributing checklists worldwide.

I think I am going to reach out to some people who would be interested in meeting with Boeing's checklist expert, and organize a meeting.

Living in the Seattle area, I am surrounded by Boeing people and my kids are fans as well. Here is an old picture from their photo shoot for a Boeing Store poster.

Is ITIL being used as an excuse to slow change?

July 7, 2011 Dave Ohara

In many enterprise discussions ITIL is assumed. But when have you ever heard Google, AWS, Facebook, Twitter, or Zynga say their success is built on an ITIL framework?

Here may be a reason why the most agile companies don't use ITIL.

ITIL and other IT management frameworks can take our genetic tendency to say "no" and codify it. "You want a new application installed? Well, you're going to have to go through the Change Management Process." Dilbert's pointy-haired boss couldn't have come up with anything better. Users who ask for the simplest things can be told "no," simply because the Rules support that position. Worse, in many companies, admins who step out of the change management framework to help a user with something small are chastised, written up, and put at the bottom of the list for promotions and interesting projects.

The author doesn't hate ITIL.

No, I'm not trying to beat up on ITIL. It's actually a pretty solid, comprehensive framework for managing IT. Given that most of us weren't doing much better of a job, ITIL offers some universal structure. My problem is that ITIL pretty much abhors change. No, not on paper -- on paper, ITIL manages and controls change. In practice, IT organizations use ITIL as a blunt instrument to halt change.

How many of you have run into people who are all into process, and don't really focus on the business impact?

Reducing Human Error in the Data Center, checklist manifesto

June 14, 2011 Dave Ohara

Domenic Alcaro, VP of Enterprise Sales at Schneider Electric, presented on human error in the data center to a full-room breakout session. Domenic shared his presentation, and with his permission here it is for your viewing.

Breakout B: Case Study - Eradicating Human Error: Lessons Learned from the US Nuclear Navy

Human error continues to be cited as a leading cause of data center downtime. The goal of eradicating this blight from the data center can be advanced by studying the US Nuclear Navy. In fact, the similarities between a mission critical data center and a mission critical nuclear propulsion plant are striking and many. This presentation will demonstrate the operational methodologies utilized by the US Nuclear Navy to reduce human error drawing comparison to a modern day data center every step of the way.

Domenic Alcaro, Vice President, Enterprise Sales, Schneider Electric

I was able to get access to Domenic's presentation and shared it with some other people ahead of time, and we started discussing human error in the data center. One slide I especially liked is this one.

Note the last line: "The Checklist Manifesto" by Atul Gawande is a book suggested to me by a data center executive, and I passed the recommendation on to Domenic. Here is a web site too.

The book’s main point is simple: no matter how expert you may be, well-designed check lists can improve outcomes (even for Gawande’s own surgical team). The best-known use of checklists is by airplane pilots. Among the many interesting stories in the book is how this dedication to checklists arose among pilots.

Can the USN Submarine procedures be applied? Here are Domenic's points on what can be done and obstacles.

Failure Analysis ideas applied to Data Center

May 31, 2011 Dave Ohara

James Hamilton has a post on what went wrong at the Fukushima Nuclear power plant.

What Went Wrong at Fukushima Dai-1

As a boater, there are times when I know our survival is 100% dependent upon the weather conditions, the boat, and the state of its equipment. As a consequence, I think hard about human or equipment failure modes and how to mitigate them. I love reading the excellent reporting by the UK Marine Accident Investigation Board. This publication covers human and equipment related failures on commercial shipping, fishing, and recreational boats. I read it carefully and I’ve learned considerably from it.

James makes the point of how he connects his boating mindset to running IT services.

I treat my work in much the same way. At work, human life is not typically at risk but large service failures can be very damaging and require the same care to avoid. As a consequence, at work I also think hard about possible human or equipment failure modes and how to mitigate them.

In one of my first jobs, at HP, I worked in quality engineering. I spent a lot of time in Palo Alto using their failure analysis facilities, and learned about ESD issues from Dick Moss.

Discussing Reliability Engineering and Data Centers is not common. Running a search on "reliability engineer data center" turned up this job post at Google.

The role: Data Center Reliability and Maintenance Engineer

The Data Center Operations team designs and operates one of the largest and most sophisticated power and cooling systems in the world. You should have extensive experience being involved in the large-scale technical operations, and demonstrable problem-solving skills to lead the RCM program for the Data Center team with limited oversight. You should possess excellent communication skills, attention to detail, and the ability to create work process and procedures to enable the collection of highly accurate field operational data. You will have access to reliability data for one of the largest data center footprints globally and be expected to interact with other reliability and software engineers to holistically address the reliability issues and develop a program wide data acquisition system to continually increase reliability and PUE while lowering TCO.

Responsibilities:

Develop RCM (reliability centered maintenance) program in collaboration with multiple stakeholders.

Perform Reliability Engineering analysis based on field data collected on the critical systems and equipment through the use of proven industry techniques and principles such as RCA (root cause analysis) & FMEA (Failure Modes and Effects Analysis).

Present data based Reliability Predictions and Reliability Block Diagrams.

Collaborate on the selection of the critical equipment vendors based on past operational data on equipment failures.

Spearhead on all RCA effort through collaboration w/equipment vendors.
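The Reliability Block Diagrams mentioned in the job post come down to simple arithmetic: blocks in series multiply their reliabilities, and redundant blocks in parallel fail only if every block fails. Here is a minimal sketch of that calculation. The component names and reliability figures are made up for illustration and are not from Google or any real data center.

```python
from math import prod


def series(reliabilities):
    """All blocks must work, so multiply the individual reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities):
    """Redundant blocks: the group fails only if every block fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical figures, for illustration only
utility_feed = 0.999
ups_pair = parallel([0.99, 0.99])  # two redundant UPS units
cooling = 0.995

# Power feed, UPS pair, and cooling are all required, so they are in series
system = series([utility_feed, ups_pair, cooling])
print(f"UPS pair: {ups_pair:.4f}, system: {system:.4f}")
```

Even in this toy example the redundant UPS pair ends up more reliable than the single blocks around it, which is the kind of result a block diagram is meant to surface.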

Data Center Productivity Plan, lean is better with process improvement

April 6, 2011 Dave Ohara

One of my readers, who shares his hockey passion with my family, sent me this McKinsey article, "A data center goes lean," where average problem resolution is reduced 40 to 60 percent.

A data center goes lean

Lean transformations can help organizations supporting IT infrastructure to increase their productivity, improve service, and lift morale.

MARCH 2011 • Henrik Andersson, Gary Moe, and Lawrence Wong

The support units that manage and maintain key elements of IT infrastructure—such as servers, networks systems, and data storage—are prone to performance breakdowns that stem from complex and disordered workflows.

The report has interactive graphics showing the before and after states.

The ideas are good, but there is one flaw I quickly found in the above picture: they show false alerts and valueless activities as being eliminated and going to the trash.

The false alerts and valueless data should be analyzed.
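As a rough illustration of what analyzing false alerts could look like, here is a small sketch that counts non-actionable alerts per source, so the noisiest monitors stand out instead of being thrown away. The alert log format and source names are hypothetical, not from the McKinsey article.

```python
from collections import Counter

# Hypothetical alert log: (source, was_actionable)
alerts = [
    ("disk-monitor", False),
    ("disk-monitor", False),
    ("ups-monitor", True),
    ("disk-monitor", True),
    ("temp-sensor-12", False),
]


def false_alert_counts(alerts):
    """Count non-actionable alerts per source, most frequent first."""
    counts = Counter(source for source, actionable in alerts if not actionable)
    return counts.most_common()


for source, n in false_alert_counts(alerts):
    print(source, n)
```

A source that keeps showing up at the top of this list is a candidate for a threshold change or a sensor fix, which is information you lose if the false alerts just go to the trash.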