In light of the Amazon cloud outages last month, one has to ask what Amazon, or really any cloud provider, should be doing to mitigate such risks. In general, cloud service providers tend to focus on pure IT management and virtualization. But what about the physical infrastructure itself: how is it being managed, and in the event of a catastrophic failure, how can it be brought back online as quickly as possible? Cloud service providers may well be at risk in the way they manage their infrastructure. That is, do they have proper processes, procedures and methodologies in place?
In the most recent Amazon outage, according to the Wall Street Journal, “Generators kicked in but failed to stabilize the load. Power went off to part of the data center. Then a software bug delayed recovery.” A Data Center Infrastructure Management (DCIM) solution could have helped mitigate this. DCIM can track the power chain and can also remind IT and Facilities teams to simulate a generator failure beforehand, under controlled conditions when traffic is known to be low, so that the risk is identified and eliminated ahead of time. Redundancy may be mandated by the company, but only good process management, the kind that a DCIM solution can provide (among other capabilities), can enforce it.
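To make the idea of scheduled, low-traffic failure testing concrete, here is a minimal sketch in Python. It is not how any particular DCIM product works: the function names, the 24-value hourly load profile, and the 90-day test interval are all assumptions made for illustration. It shows two checks a process-oriented tool might run: finding the quietest window of the day for a generator failover drill, and flagging power-chain components whose last test is too old.

```python
from datetime import date, timedelta


def quietest_window(hourly_load, window_hours=2):
    """Return the start hour (0-23) of the contiguous window with the lowest total load.

    hourly_load is a hypothetical list of 24 average load values, one per hour.
    The window is allowed to wrap past midnight.
    """
    if len(hourly_load) != 24:
        raise ValueError("expected 24 hourly values")

    def window_total(start):
        # Sum the load over window_hours consecutive hours, wrapping at midnight.
        return sum(hourly_load[(start + i) % 24] for i in range(window_hours))

    return min(range(24), key=window_total)


def overdue_tests(last_tested, today, max_age_days=90):
    """Return the sorted names of components whose last failover test is too old.

    last_tested maps a component name to the date of its last test.
    The 90-day default is an assumed policy, not an industry rule.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, tested in last_tested.items() if tested < cutoff)
```

Given a load profile that bottoms out in the early morning, `quietest_window` would suggest running the drill then, and `overdue_tests` would produce the reminder list for the IT and Facilities teams. A real deployment would draw both inputs from measured data rather than hand-entered values.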
There has been too much focus lately on DCIM as a monitoring mechanism only. The problem is that real-time monitoring comes too late to solve problems like the one that happened at Amazon. For DCIM to provide real value, it has to be more about establishing good processes and procedures that help run an efficient, lower-cost and lower-risk data center by avoiding problems before they happen.