I saw this post on GigaOm about Microsoft's outages on Monday.
“Whoops,” says Microsoft Azure: cloud service goes down for many users
AUG. 18, 2014 - 3:19 PM PDT
SUMMARY: Several Microsoft Azure services (virtual machines, cloud services, StorSimple, backup and site recovery) were offline for hours Monday afternoon.
It’s Monday, and it’s already been a pretty bad week for Microsoft Azure. Starting early afternoon Eastern time, the company witnessed partial and full service interruptions to several of its services across multiple regions. The sites were back up again at around 8 p.m. Eastern time, according to Microsoft.
Out of curiosity, I went to Azure History, where you can see the range of issues that have occurred over the past month. At Microsoft’s scale, it looks like there is a constant stream of issues.
August 2014
8/19 SQL Database, Storage, Virtual Machines, Compute - North and West Europe - Partial Service Interruption
From 09:38 - 09:58 UTC on 19 August, 2014 a subset of customers using Virtual Machines, SQL Database, Cloud Services, and Storage in West Europe and North Europe may have been unable to access their resources or perform management operations. This issue is now mitigated.
8/19 Virtual Machines \ Service Management and Cloud Services - East Asia - Partial Service Interruption
From 07:38 - 07:57 UTC on 19 August 2014 a subset of customers using Virtual Machines \ Service Management and Cloud Services in East Asia may have received an error attempting to access their Cloud Services or Virtual Machines. This incident has now been mitigated.
8/19 Websites - North Europe - Advisory (Limited Impact)
From 22:00 UTC on 18 August, 2014 - 02:00 UTC on 19 August, 2014 a small subset of customers using Websites in North Europe may have experienced intermittent connectivity issues to their websites. This incident has now been mitigated.
8/18 Virtual Machines, Cloud Services, Mobile Services, Service Bus, Site Recovery, HDInsight, Websites and StorSimple - Multiple regions - Partial Service Interruption
Starting at 18 Aug 2014 17:49 UTC, a small subset of customers are experiencing connectivity issues to some Azure Services which may include Cloud Services, Virtual Machines, Websites, Automation, Service Bus, Backup, Site Recovery, HDInsight, Mobile Services, StorSimple and possibly other Azure Services in multiple regions. This incident has now been mitigated.
8/18 Virtual Machines, Cloud Services, Mobile Services, Service Bus, Site Recovery, Websites and StorSimple - Multiple regions - Partial Service Interruption
Starting at 18 Aug 2014 17:49 UTC, a small subset of customers are experiencing connectivity issues to some Azure Services which may include Cloud Services, Virtual Machines, Websites, Automation, Service Bus, Backup, Site Recovery, HDInsight, Mobile Services, StorSimple and possibly other Azure Services in multiple regions. This incident has now been mitigated.
8/18 Virtual Machines, Cloud Services, Websites, Service Bus, HDInsight, Mobile Services, Backup, Site Recovery, StorSimple - Multiple Regions - Full Service Interruption
Starting at 18 Aug 2014 17:49 UTC, a small subset of customers are experiencing connectivity issues to some Azure Services which may include Cloud Services, Virtual Machines, Websites, Automation, Service Bus, Backup, Site Recovery, HDInsight, Mobile Services, StorSimple and possibly other Azure Services in multiple regions. This incident has now been mitigated.
8/18 Automation, Backup, Site Recovery - Multiple Regions - Full Service Interruption
Starting at 18 Aug 2014 17:49 UTC we are experiencing an interruption to Azure Services, which may include Cloud Services, Virtual Machines, Websites, Automation, Service Bus, Backup, Site Recovery, HDInsight, Mobile Services and possibly other Azure Services in multiple regions. Customers began to experience service restoration as updates were deployed across the affected environment. Automation, Site Recovery and Backup are mitigated. Next update will be provided in 30 minutes.
8/15 Network Infrastructure - Japan East - Full Service Interruption
From 18:52 to 19:16 UTC on 15 Aug, 2014, customers with resources in Japan East would have seen issues accessing all Azure Services deployed in the Region. Engineers have confirmed that full Service availability has been restored to the impacted Region. This incident is now mitigated.
8/15 Cloud Services - Multiple Regions - Partial Performance Degradation
From as early as 1 August, 2014 to 06:30 UTC on 15 August, 2014 a subset of customers in multiple regions may have experienced that their auto-scaling ability was reduced or disabled. Errors may have included messages regarding the inability to read auto-Scale metrics data. This incident has now been mitigated.
8/15 Management Portal - Multiple Regions - Advisory
From 14 Aug, 2014 22:06 to 23:45 UTC a limited subset of customers attempting to log in to the Azure Management Portal across multiple regions may have experienced issues logging in, or received an Error 500 message. A retry would likely have resolved the issue for any impacted customers. Engineers have verified that full availability has been restored to the Azure Management Portal login services. This incident has now been mitigated.
8/14 Visual Studio Online - Multi-Region - Full Service Interruption
Starting 22:45 13 Aug, 2014 UTC, Visual Studio Online customers may have experienced issues with latency and extended execution times. The initial incident was mitigated at approximately 14:00 UTC. During investigation at 13:52 14 Aug, 2014 UTC, engineering teams began receiving alerts for a separate issue where customers were unable to log in to their Visual Studio Online services. From 13:52 to 19:45 on 14 Aug, 2014 UTC, customers were unable to access their Visual Studio Online resources. Engineering teams have validated their mitigation efforts for both issues and have confirmed that full service has been restored to our Visual Studio Online users. These incidents are now mitigated.
8/12 Cloud Services - East US - Partial Performance Degradation
From 8/5/14 at approximately 16:50 UTC, to 8/12/14 at approximately 13:50 UTC, customers in East US may have experienced issues with proper data flow to their Auto-Scale configurations. Errors may have included messages regarding the inability to read Auto-Scale metrics data. Engineers have verified a reduction in the errors that were causing the improper flow of information and we are resolving this incident at this time. Updates will now be concluded on this incident.
8/12 Network Infrastructure - Brazil South - Partial Service Interruption
From 23:43 UTC on 8/11/14 to 03:23 UTC on 8/12/14 customers may have experienced intermittent connectivity issues when attempting to connect to resources deployed in the Brazil South Region. Engineering teams have validated that this issue has been mitigated and that full Network Connectivity has been restored to the Brazil South Region.
8/11 Virtual Machines and Cloud Services - Japan East and Japan West - Partial Performance Degradation
Starting approximately 8/8/14 00:00 UTC to 8/11/14 19:37 UTC customers using Virtual Machines and Cloud Services in Japan East and Japan West may have experienced issues with their Auto-Scaling functionality. Additionally, customers may have seen issues with their Metrics Reporting Services within their Management Portal. Virtual Machine functionality and Service Management capabilities were unaffected by this incident. This incident has now been mitigated and full functionality has been restored to our Cloud Services and Virtual Machines customers.
8/9 Cloud Services and Cloud Services \ Service Management - West US - Partial Service Interruption
From 23:42 8 August 2014 to 01:40 9 August 2014 customers using Cloud Services and Cloud Services \ Service Management in West US may have experienced inability to complete service management operations for their Cloud Services. This incident has now been mitigated.
8/9 Storage and Azure Backup - West US - Partial Service Interruption
This incident has now been mitigated. From 11:23 UTC on 8 August 2014 to 2:00 UTC on 9 Aug 2014 a subset of customers using Storage in West US may have experienced inability to access their Storage resources. Some Azure Backup service customers in West US would have been unable to perform normal backup and restore operations during this time period. Normal functionality is restored for both services.
8/7 Storage - East US 2 - Partial Service Degradation
From 12:48 UTC to 22:02 UTC on 7 August 2014 a subset of customers using Storage in East US 2 may have intermittently experienced the inability to access their storage resources. This incident has now been mitigated.
8/6 SQL Databases - North Europe - Partial Performance Degradation
From 06:56 - 08:56 UTC on 6 August 2014, customers may have experienced difficulty accessing their resources. Recovery remains ongoing for a very limited subset of customers. Customers that remain impacted will receive further updates directly through their Management Portal. This incident has now been mitigated.
8/6 Websites and mobile services - North Europe - Partial Performance Degradation
Starting at 8/6/2014 6:56 UTC a subset of customers using websites and mobile services in North Europe were unable to access site runtime, management and publishing operations. As of 8:29 UTC on 8/6/2014 the websites and mobile services issue is mitigated and those services are back to normal.
8/5 account.windowsazure.com - Multi-Region - Partial Service Interruption
From 15:30 - 16:32 UTC on 5 August, 2014 customers using account.windowsazure.com to sign up for new service or complete a purchase may have experienced an error. Customers that experienced an error may now retry their purchase transaction. This incident has been mitigated.
8/5 Cloud Services \ Service Management - Auto-scale functionality - Partial Performance Degradation
This incident has now been mitigated. From 7:30 UTC on 8/5/2014 to 9:45 UTC on 8/5/2014 a subset of customers using any Cloud Service deployments with the auto-scale feature enabled may have experienced a problem with the auto-scaling functionality.
8/4 Virtual Machines \ Service Management - Brazil South - Partial Performance Degradation
Our investigation of the alert is complete and we have determined the service is healthy. A service incident did not occur for Virtual Machines \ Service Management in Brazil South.
8/4 Management Portal - Multi-Region - Partial Service Interruption
From 23:56 on 8/3/2014 UTC to 1:15 on 8/4/2014 UTC customers using the Azure Management Portal may have intermittently experienced high latency with their requests. This incident has now been mitigated.
8/2 Cloud Services \ Service Management - Multiple Regions - Advisory (Limited Impact)
From 2 Aug, 13:18 to 20:52 UTC customers may have encountered issues when attempting to configure or change RDP settings for Cloud Services. Running services were not impacted. This incident is now resolved.
8/2 Storage and Websites - South Central US - Partial Service Interruption
From 2 Aug, 2014 13:30 to approximately 15:45 UTC a subset of customers using Storage and Websites in South Central US may have experienced intermittent slow performance, timeouts or errors. If you encounter an error please retry your request. This incident is now resolved.
8/2 Management Portal - Multi-Region - Partial Performance Degradation
From 2 Aug, 2014 00:04 to 00:36 UTC customers may have experienced intermittent timeouts or slow loading times when attempting to log in to the Azure Management Portal. There was no impact to active services. This incident has now been mitigated.
8/1 Azure Preview Portal - Partial Service Interruption
Starting on 1 Aug, 2014 at 6:15 PM UTC customers using the Azure Preview Portal may have encountered error messages when navigating the website. Our engineering team has completed investigation of this incident and determined that there is no impact to core Azure Preview Portal functionality. You may continue to see error messages, however this will not impact your ability to manage your Azure services. Our engineering teams will continue working to remove these error messages.
8/1 Virtual Machines - North Central US - Partial Service Interruption
From 1 Aug, 2014 at approximately 6:30 PM to 7:25 PM UTC a subset of customers using Virtual Machines in North Central US may have experienced intermittent connectivity issues. The incident is now resolved.
8/1 Management Portal - Multi-Region - Partial Service Interruption
From 01 Aug, 2014 at 11:28 to 13:20 UTC customers may have experienced issues when logging in to the Azure Management Portal. Customers may also have experienced issues performing Service Management functions on Azure Services, including Azure Active Directory, HDInsight and SQL Azure. If you encountered an error please retry your request. This incident has now been mitigated.
July 2014
7/31 Management Portal - Multi-Region - Partial Service Interruption
From 31 Jul, 2014 19:53 to 21:36 UTC customers may have experienced issues when logging in to the Azure Management Portal. Customers may also have experienced issues performing Service Management functions on Azure Services, including Azure Active Directory, HDInsight and SQL Azure. If you encountered an error please retry your request. This incident has now been mitigated.
7/30 CDN - Multi-Region - Partial Service Interruption
This incident has now been mitigated. From 30 July, 2014, at 06:00 to 18:28 UTC, engineers received reports of an issue in which CDN customers might experience issues while creating new CDN connections and accessing their existing content. Error messages indicated "Warnings occurred while loading the management data for this resource type" and "Could not retrieve the CDN endpoints in subscription with ID". Preliminary investigations indicate that this incident was caused by a misconfiguration of a network endpoint connecting to the service, and engineers implemented a configuration change to mitigate impact.
7/29 Management Portal - Multi-Region - Partial Service Interruption
From 29 Jul, 2014 19:25 to 23:48 UTC a subset of customers may have experienced intermittent issues when attempting to log in to the Management Portal. If you encounter an error please retry your request. The incident has now been mitigated.
7/26 API Management - All regions - Partial Service Interruption
From 25 Jul 2014 22:26 to 26 Jul 2014 02:50 UTC, engineers identified an issue in which a subset of customers using API Management in all regions might not be able to view and manage their instances of API management services through the Azure management portal or programmatically via the HTTP management interfaces. Existing services are not impacted. Engineers have not observed errors and will continue to monitor the system.
7/25 HDInsight - East US, West US, North Europe, West Europe and Southeast Asia - Partial Service Interru
From 18:29 UTC on 7/25/2014 to 20:03 UTC on 7/25/2014 customers using HDInsight in all regions may have experienced failures when they tried to create a new HDInsight cluster or delete an existing HDInsight cluster. This incident has now been mitigated and HDInsight service is fully restored.
7/24 Management Portal - Multi-Region - Advisory
We have concluded the investigation of the issue with the Management Portal. We have determined the service is healthy. By design, configuring 'verbose' monitoring is required to see the memory availability metrics option.
7/22 Visual Studio Online - Multi-Region - Partial Service Interruption
From 7/22/2014 21:37 UTC to 7/22/2014 22:25 UTC a subset of Visual Studio Online customers may have experienced intermittent login failures and timeouts while accessing their accounts. This incident has now been mitigated.
7/22 Visual Studio Online \ Build - Multi-Region - Partial Performance Degradation
From 12:45 PM UTC on 7/22/2014 to 8:45 PM UTC on 7/22/2014 VSO Hosted Build Service customers may have experienced delays in starting their builds after queuing, and a few may have seen longer build times as well. This incident has now been mitigated.
7/22 Web Sites - West Europe - Advisory (Limited Impact)
From 19:50 UTC on 7/20/2014 to 00:45 UTC on 7/22/2014, a small subset of customers using Web Sites in West Europe might intermittently experience failures (502.3 error) or increased latency in accessing their Web Sites. The issue has now been mitigated, and our engineers continue to monitor the service.
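One way to put a number on the "constant stream" impression is to tally the entry headers by day and by severity label. Here is a minimal Python sketch of that idea. The function names (`parse_header`, `tally`) and the `HEADERS` sample are my own illustration, not anything Azure provides, and the sample holds only a handful of headers copied from the list above rather than the full history. It assumes each header follows the "M/D Title - Region - Severity" shape seen in the listing and treats the last " - " separated part as the severity.

```python
import re
from collections import Counter

# Illustrative sample only: a few entry headers copied from the history above.
HEADERS = [
    "8/19 SQL Database, Storage, Virtual Machines, Compute - North and West Europe - Partial Service Interruption",
    "8/19 Websites - North Europe - Advisory (Limited Impact)",
    "8/15 Network Infrastructure - Japan East - Full Service Interruption",
    "8/15 Cloud Services - Multiple Regions - Partial Performance Degradation",
    "8/15 Management Portal - Multiple Regions - Advisory",
]

# Month/day prefix, optional whitespace, then the rest of the header.
HEADER_RE = re.compile(r"^(\d{1,2})/(\d{1,2})\s*(.+)$")


def parse_header(line):
    """Return ((month, day), severity) for an entry header, or None if it is not one."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    month, day, rest = int(match.group(1)), int(match.group(2)), match.group(3)
    # Assumption: the severity label is the last " - " separated part.
    severity = rest.split(" - ")[-1].strip()
    return (month, day), severity


def tally(headers):
    """Count entries per (month, day) and per severity label."""
    per_day = Counter()
    per_severity = Counter()
    for line in headers:
        parsed = parse_header(line)
        if parsed is None:
            continue  # skip month headings and description lines
        date, severity = parsed
        per_day[date] += 1
        per_severity[severity] += 1
    return per_day, per_severity


if __name__ == "__main__":
    per_day, per_severity = tally(HEADERS)
    for (month, day), count in sorted(per_day.items()):
        print(f"{month}/{day}: {count} entries")
    for severity, count in per_severity.most_common():
        print(f"{severity}: {count}")
```

Feeding it every header line from the list would give a per-day count for the whole month. Note that a count of entries is not a count of distinct outages, since several 8/18 entries describe the same incident.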