Joyent and Adobe had the outages of the month for May.
Joyent posted its post-mortem.
Adobe posted an apology.
Both of these outages occurred during maintenance, when someone typed a command that did exactly what it was supposed to do and, in doing so, took down a service. The human perception problem is that the person who typed the command could not see the system-wide impact of that command.
Adobe doesn’t provide any details on what it plans to do. Joyent does:
We will be taking several steps to prevent this failure mode from happening again, and ensuring that other business disaster scenarios are able to recover more quickly.
First, we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously. We have already begun putting in place a number of immediate fixes to tools that operators use to mitigate this, and we will be rethinking what tools are necessary over the coming days and weeks so that "full power" tools are not the only means by which to accomplish routine tasks.
You can imagine the lengths people will go to in creating safeguards that eliminate the possibility of this type of outage. Unfortunately, those safeguards will most likely also put a burden on day-to-day operations.
Another way to solve the problem is to give people the ability to see the impact of their actions before they take them. No one in their right mind would knowingly execute a command to reboot all the servers at Joyent. And no one in their right mind would knowingly delete a directory with all user records.
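
To make that idea concrete, here is a minimal sketch in Python of a command wrapper that shows the operator what a command will touch before it runs. Everything in it is hypothetical and for illustration only: the `Server` type, the role names, and the `impact_summary`, `format_preview`, and `run_with_preview` functions are assumptions, not anything taken from Joyent's or Adobe's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical inventory record; real inventories carry far more detail."""
    name: str
    role: str  # e.g. "compute" or "control-plane" (assumed role names)


def impact_summary(targets, fleet):
    """Summarize how much of the fleet a command would touch."""
    target_names = {s.name for s in targets}
    hit = [s for s in fleet if s.name in target_names]
    by_role = {}
    for s in hit:
        by_role[s.role] = by_role.get(s.role, 0) + 1
    fraction = len(hit) / len(fleet) if fleet else 0.0
    return {
        "count": len(hit),
        "total": len(fleet),
        "fraction": fraction,
        "by_role": by_role,
    }


def format_preview(action, summary):
    """Render the summary as one line an operator can read at a glance."""
    roles = ", ".join(
        f"{n} {role}" for role, n in sorted(summary["by_role"].items())
    )
    return (
        f"{action} would affect {summary['count']} of {summary['total']} "
        f"servers ({summary['fraction']:.0%}): {roles}"
    )


def run_with_preview(action, targets, fleet, confirm, execute):
    """Show the impact first; call execute only if confirm approves it."""
    summary = impact_summary(targets, fleet)
    print(format_preview(action, summary))
    if not confirm(summary):
        return False
    execute(targets)
    return True
```

With a wrapper like this, a command that matches every server prints something like "reboot would affect 4 of 4 servers (100%)" before anything happens. The tool does not forbid the action. It makes the system-wide impact visible at the moment the operator can still back out, and `confirm` can be an interactive prompt or a simple policy such as refusing anything above a chosen fraction of the fleet.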