Loggly suffers extended outage after AWS reboot shuts down their service

Loggly is a cloud service that offers, as one of its services, systems monitoring and alerting.

Systems Monitoring & Alerting

Alerting on log events has never been so easy.  Alert Birds will help you eliminate problems before they start by allowing you to monitor for specific events and errors.  Create a better user experience and improve customer satisfaction through proactive monitoring and troubleshooting. Alert Birds are available to squawk & chirp when things go awry.

But Loggly has suffered an extended outage caused by AWS rebooting 100% of its servers, and that was only half of the downtime. The other half was due to Loggly not knowing its service was down.

Loggly's Outage for December 19th

Posted 19 Dec, 2011 by Kord Campbell

Sometimes there's just no other way to say  "we're down" than just admitting you screwed up and are down.  We're coming back up now, and in theory by the time this is read, we'll be serving the app again normally.  There will be a good amount of time until we can rebuild the indexes for historic data of our paid customers. This is our largest outage to date, and I'm not at all proud of it.

...

Loggly uses a variety of monitoring mechanisms to ensure our services are healthy.  These include, but are not limited to, extensive monitoring with Nagios, external monitors like Zerigo, and using a slew of our own API calls for monitoring for errors in our logs.  When the mass reboot occurred we failed to alert because a) our monitoring server was rebooted and failed to complete the boot cycle, b) the external monitors were only set to test for pings and established connections to syslog and http (more about that in a moment), and c) the custom API calls using us were no longer running because we were down.

Combined, these failures effectively prevented us from noticing we were down.  This in and of itself was the cause of at least half our down time, and to me, the most unacceptable part of this whole situation.

The other half of the outage was caused by Loggly not testing for a 100% reboot of all machines.
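A reboot drill like that is not hard to script once you decide to run it. Below is a minimal sketch, in Python with the boto library, of what a 100% reboot test against a staging fleet might look like. The region, the "env=staging" tag, and the per-box health URL are my own illustrative assumptions, not Loggly's actual setup.

```python
# Hypothetical full-fleet reboot drill for a staging environment.
# Assumes boto (the 2011-era Python AWS library), an "env=staging" tag on the
# instances, and a per-box health URL on port 8080 -- all illustrative choices,
# not Loggly's actual configuration.
import time
import urllib2
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

# Collect every staging instance -- the point of the drill is 100%, not a sample.
reservations = conn.get_all_instances(filters={'tag:env': 'staging'})
instances = [i for r in reservations for i in r.instances if i.state == 'running']

print('Rebooting %d instances at once' % len(instances))
conn.reboot_instances([i.id for i in instances])

# Give the fleet time to come back, then verify every box finished booting by
# hitting a service-level health check rather than just pinging it.
time.sleep(300)
for instance in instances:
    url = 'http://%s:8080/health' % instance.public_dns_name
    try:
        urllib2.urlopen(url, timeout=10)
        print('%s ok' % instance.id)
    except Exception as e:
        print('%s FAILED the drill: %s' % (instance.id, e))
```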

The Human Element

The other cause to our failures is what some of you on Twitter are calling "a failure to architect for the cloud".  I would refine that a bit to say "a failure to architect for a bunch of guys randomly rebooting 100% of your boxes".  A reboot of all boxes has never been tested at Loggly before.  It's a test we've failed completely as of today.  We've been told by Amazon they actually had to work hard at rebooting a few of our instances, and one scrappy little box actually survived their reboot wrath.

One of the lessons from Loggly's failure, which some of my SW buddies and I are applying in a SW design, is to add more than one monitoring solution.

The second step is to ensure more robust external monitoring.  With multiple deployments, this issue becomes less of an issue, but clearly we need more reliable checks than what we rely on with Zerigo or other services.  Sorry, but simple HTTP checks, pings and established connections to a box do not guarantee it's up!
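What would a deeper external check look like? One approach is a probe that pushes a uniquely tagged event through the normal ingestion path and then confirms it actually becomes searchable, run from infrastructure independent of the monitored service. The sketch below is a hedged illustration only; the hostnames, port, and search API are placeholders, not Loggly's real interface.

```python
# Hypothetical end-to-end probe, meant to run OUTSIDE the monitored
# infrastructure (e.g. from a different provider). The hostnames, port and
# search API below are placeholders, not Loggly's real interface.
import json
import socket
import time
import urllib2
import uuid

SYSLOG_HOST = 'logs.example.com'                    # placeholder ingestion endpoint
SYSLOG_PORT = 514                                   # assumes TCP syslog
SEARCH_URL = 'https://api.example.com/search?q=%s'  # placeholder search API

def service_is_really_up(timeout=120):
    marker = 'healthcheck-%s' % uuid.uuid4()

    # 1. Send a uniquely tagged event through the normal ingestion path.
    sock = socket.create_connection((SYSLOG_HOST, SYSLOG_PORT), timeout=10)
    sock.sendall('<134>external-probe: %s\n' % marker)
    sock.close()

    # 2. Poll the search API until the event is actually indexed and queryable.
    #    A ping or an accepted TCP connection proves nothing about this path.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            results = json.load(urllib2.urlopen(SEARCH_URL % marker, timeout=10))
            if results.get('total_events', 0) > 0:
                return True
        except Exception:
            pass          # keep polling until the deadline
        time.sleep(10)
    return False          # page someone: ingestion or indexing is broken end to end

if __name__ == '__main__':
    print('UP' if service_is_really_up() else 'DOWN')
```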


Big Data, Hadoop, Dell, and Splunk, where is the connection?

I have been busy working on a Big Data paper, so I have not been blogging as often.  Getting into Big Data technical details has been easy, and then it hit me that Big Data in data centers and IT has a lot in common with monitoring and management systems.  Collecting gigabytes or even terabytes of data a day to monitor operations is a big data center problem.

Researching the Big Data topic, it was interesting to see the intersection of Dell, Hadoop, and Splunk.

Barton George has a post on Splunk.

Hadoop World: Talking to Splunk's Co-founder

Last but not least in the 10 interviews I conducted while at Hadoop World is my talk with Splunk's CTO and co-founder Erik Swan. If you're not familiar with Splunk, think of it as a search engine for machine data, allowing you to monitor and analyze what goes on in your systems. To learn more, listen to what Erik has to say:

Barton references a GigaOm post on Splunk and Hadoop.

Splunk connects with Hadoop to master machine data

Splunk has integrated its flagship product with Apache Hadoop to enable large-scale batch analytics on top of Splunk’s existing sweet spot around real-time search, analysis and visualization of server logs and other machine-generated data. Splunk has long had to answer questions about why anyone should use its product over Hadoop, and the new integration not only addresses those concerns but actually opens the door for hybrid environments.


Dell's Barton George was interviewed himself as well at Hadoop World.

Hadoop World: What Dell is up to with Big Data, Open Source and Developers


Besides interviewing a bunch of people at Hadoop World, I also got a chance to sit on the other side of the camera.  On the first day of the conference I got a slot on SiliconANGLE’s the Cube and was interviewed by Dave Vellante, co-founder of Wikibon, and John Furrier, founder of SiliconANGLE.

-> Check out the video here.

Big Data Webinar, Dec 7, 2011, GigaOm and Splunk

GigaOm Pro has a webinar on Big Data on Dec 7, 2011, from 10 to 11 a.m. PT.

The Big Machine

How the Internet of Things Is Shaping Big Data


Even relatively conservative forecasts predict there will be 50 billion connected devices online by the end of the decade. Over time, the majority won’t be laptops or phones, but rather machine-to-machine connections from network infrastructure, sensors in cars, appliances, healthcare monitors and the like. They’ll produce data that needs to be combined and analyzed alongside structured data, application logs, customer info and social media streams. Already today, companies across multiple industries and government agencies are struggling to harness the sheer volume, complexity and variety of the data generated. In this webinar, we’ll look at the various kinds of machine-driven big data, how to develop an analytics and usage framework for them, and how companies can use these data to run their businesses.

Join GigaOM Pro and our sponsor Splunk for “The Big Machine: How the Internet of Things Is Shaping Big Data,” a free analyst roundtable webinar on Wednesday, December 7, 2011 at 10 a.m. PST.

I'll be on the panel along with other GigaOm analysts and Splunk's VP of Engineering, Stephen Sorkin.

Moderator

GigaOM Pro Cloud Curator, Founder, The Cloud of Data

Panelists

GigaOM Pro Analyst, Executive Director, Zettaforce
GigaOM Pro Analyst, Founder, GreenM3
VP of Engineering, Splunk

ISO 50001 and Data Centers

ZDNet Asia has a post on Singapore, Data Centers, and Green IT.

Pro-biz, green incentives give S'pore datacenter edge

The Singapore Green Data Centre Standard is here, with part of the standard built on ISO 50001.

The Green DC Standard helps organisations establish systems and processes necessary to improve the energy efficiency of their DCs. It provides them with a recognised framework as well as a logical and consistent methodology to achieve continuous improvement in their DC facilities. This standard is modelled after the ISO 50001 standard on energy management (currently under development by ISO) but is specifically tailored to meet the needs of DCs in Singapore. The standard adopts the Plan-Do-Check-Act (PDCA) methodology, an iterative, four step problem-solving process used for continuous process improvement. The PDCA cycle forms the basis for many established management standards, which have successfully stimulated substantial, continual efficiency improvements within organisations around the world.

Here is a press announcement on a Taiwan data center being ISO 50001 certified.

TSMC Leads in ISO 50001 Certification for Data Center


Hsinchu, Taiwan, R.O.C. – November 3, 2011 – TSMC today announced that its Fab 12 Phase 4 data center in the Hsinchu Science Park has completed the ISO 50001 Energy Management System certification, becoming Taiwan’s first company to earn this certification for a high density computing data center.

The ISO 50001 Energy Management System was established by the International Standards Organization (ISO) Energy Management Committee (ISO/PC242), and was announced in the second quarter of this year. The Fab 12 Phase 4 data center which completed certification provides data and control systems for factory automation, and supports both manufacturing and R&D. Adoption of the ISO 50001 Energy Management System is expected to reduce the data center’s power consumption by 8%, conserving 2.21 million kilowatt-hours of electricity and eliminating 1,350 tons of carbon emissions per year. In addition to upgrading existing data centers, TSMC also plans to apply ISO 50001 standards to future data centers and implement the most up-to-date energy-saving designs. TSMC estimates that the company can conserve 59.62 million kilowatt-hours of electricity and eliminate 36,490 tons of carbon emissions per year.
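As a quick sanity check on the numbers (my own arithmetic, not TSMC's), both savings claims imply the same grid emission factor of roughly 0.61 kg of CO2 per kWh:

```python
# Back-of-the-envelope check of the figures in TSMC's announcement
# (my arithmetic, not TSMC's): kg of CO2 avoided per kWh saved.
fab12_factor = 1350.0 * 1000 / 2210000        # Fab 12 Phase 4 data center figures
future_factor = 36490.0 * 1000 / 59620000     # company-wide estimate
print('%.2f kg CO2/kWh vs %.2f kg CO2/kWh' % (fab12_factor, future_factor))
# Both work out to roughly 0.61 kg CO2 per kWh, so the two claims rest on the
# same underlying emission factor for the grid.
```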

The ISO 50001 standard has a video.

ISO 50001 — What is it? ISO 50001:2011, Energy management systems – Requirements with guidance for use, is a voluntary International Standard developed by ISO (International Organization for Standardization). ISO 50001 gives organizations the requirements for energy management systems (EnMS). ISO 50001 provides benefits for organizations large and small, in both public and private sectors, in manufacturing and services, in all regions of the world. ISO 50001 will establish a framework for industrial plants; commercial, institutional, and governmental facilities; and entire organizations to manage energy. Targeting broad applicability across national economic sectors, it is estimated that the standard could influence up to 60% of the world’s energy use.*

Thinking differently about Asset Management, Data Centers, and user-useful reporting

I just spent two days at the International Association of IT Asset Managers conference.  For the first four hours I was absorbing what the attendees were like and what their roles are.  This PDF shows the breadth of subjects presented.


But after two days and lots of discussions, I felt this is a tactical approach addressing the short-term issue, which is important.  But where is the bigger picture that resonates with the CxO and business unit owners?

An analogy: asset management's job is like FedEx tracking individual packages throughout its system. Where are all the packages, and where are the records?  But this approach doesn't answer how effective the business system empowered by the assets is.  To give you an idea, watch this FedEx video on plane movement.  What is important from the CxO's perspective is whether those planes executed as planned, what the utilization of each plane is, whether it was on time, what the costs were to run that plane, and so on.

Asset Management is important, but it is only one piece of the bigger problem of running operations at their best.

I have figured out the data center equivalent of a plane load of assets to track, and so far people like the metaphor for reporting on data center operations.  Next week I am heading to NYC, and two weeks after that to 7x24 Exchange, where I'll test the ideas further on how asset management can fit into an overall system.