Managing IT Points of Failure: Separating a Fire Drill from a Fire
May 21, 2014
Featured article by Deepak Kanwar, Senior Manager, Zenoss
A red flag on the IT operations dashboard at many organizations still means “all hands on deck”: a failure has been detected, and it’s time to grab the safety gear and ready the hoses for what could be hours of intense firefighting. Because the alerts carry no context or intelligence, reinforcements in the form of subject matter experts (SMEs) must be deployed instantly. When the issue is finally addressed and normalcy is restored, IT personnel return to their stations and stay at red alert for the next call. Sometimes, between alerts, reality hits: the “fire” was a low-priority one and, given its limited impact on the business, did not justify the cost of the resources allocated. It could simply have been handled as a fire drill. Yes, the issue was addressed, but the price in time and resources was very high. That raises the question: how can organizations arm themselves with the intelligence to distinguish between a fire drill and an actual fire?
As a first step, organizations are leveraging modern datacenters to minimize failures and disruptions. These datacenters are architected to be resilient and fault tolerant. There is a conscious effort to eliminate single points of failure, so that a single component failure does not result in service disruption or even degradation. Through the purposeful adoption of redundancy, one or more backup systems are put in place to ensure the infrastructure can withstand a few hits before the user experiences an issue. For businesses, this means reduced downtime and an improved user experience. And for an IT Ops team, it means that not every alert is a three-alarm fire demanding immediate attention.
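To see why redundancy buys this breathing room, some back-of-the-envelope math helps. The short Python sketch below is purely illustrative; the 99 percent per-component availability figure is an assumption chosen for the example, not data from any real deployment.

```python
# Back-of-the-envelope redundancy math: if a service stays up as long as
# at least one of n identical, independent components is up, service
# availability is 1 - (1 - a)^n, where a is per-component availability.

def service_availability(a: float, n: int) -> float:
    """Availability of a service backed by n redundant components."""
    return 1 - (1 - a) ** n

# Illustrative assumption: each component is 99% available.
for n in (1, 2, 3):
    print(f"{n} component(s): {service_availability(0.99, n):.6%}")

# 1 component(s): 99.000000%
# 2 component(s): 99.990000%
# 3 component(s): 99.999900%
```

Each added spare cuts the chance of a service outage by another factor of 100, which is exactly why one component failure no longer has to mean a disruption.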
The benefits of these highly available datacenters are undeniable; however, some malfunctions and disruptions are inevitable. To guard against the worst-case scenario when issues arise, an organization should have a plan for how to manage them, one that answers the following questions (the short sketch after the list shows one way such triage logic might look):
– Which alerts require a response now, and which ones can wait?
– Has the safety net eroded, and will the next failure result in a disruption?
– Which services (if any) will be affected by the latest alert?
– Most important, are IT priorities aligned with business goals?
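One way to picture these answers is as simple triage logic. The sketch below is only an illustration: the service names, the business-critical set, and the decision thresholds are all invented assumptions, not a description of any particular product.

```python
# Hypothetical triage logic: an alert is an actual "fire" only if the
# affected service matters to the business AND its safety net is gone.
# All names and thresholds here are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Alert:
    component: str      # e.g. "web-node-3"
    service: str        # the service this component supports
    healthy_peers: int  # redundant components still healthy

BUSINESS_CRITICAL = {"checkout", "payments"}  # assumed business mapping

def triage(alert: Alert) -> str:
    if alert.healthy_peers == 0 and alert.service in BUSINESS_CRITICAL:
        return "FIRE: respond now, service disruption imminent"
    if alert.healthy_peers == 0:
        return "Respond soon: safety net gone, but service is low priority"
    return f"Fire drill: {alert.healthy_peers} redundant component(s) remain"

print(triage(Alert("web-node-3", "checkout", healthy_peers=2)))
# Fire drill: 2 redundant component(s) remain
```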
But I Trust My Tools!
Unfortunately, as numerous IT teams are finding out, their legacy monitoring tools can no longer support their dynamic environments efficiently. These tools were designed for an era when environments were static; they have no notion of services that move around fluidly, running on physical infrastructure one day and virtualized infrastructure the next. Because they are deployed in silos, their span of control is limited, and they cannot provide any environment-wide context for an alert. When disruptions do occur, IT teams still find themselves scrambling to address the issues.
A 2013 Forrester Consulting study indicated that 41 percent of respondents spend anywhere from an hour to more than a week identifying the root cause of service problems. With downtime costs easily exceeding $100,000 per hour for many organizations, these hours quickly add up: at that rate, even a four-hour search costs $400,000. The impact on the bottom line is significant, and it invariably draws the wrong kind of attention for IT.
To take full advantage of their shiny new datacenters, IT teams need modern monitoring solutions that provide not only timely alerts but also intelligent context around each event. It is not enough to know that a component is failing; you also need to know whether it is one of 10 redundant components, whose failure will not cause a service disruption, or the last of your safety plugs, where the next failure will.
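As a rough illustration of that distinction, the sketch below shows how the same class of failure event might escalate as a redundancy pool erodes. The pool name, size, and severity labels are all invented for the example and do not describe any particular product’s behavior.

```python
# Sketch: the same type of component failure escalates in severity as
# the safety net erodes. Pool name, size, and messages are assumptions.

class RedundantPool:
    def __init__(self, name: str, size: int):
        self.name, self.healthy = name, size

    def record_failure(self) -> str:
        """Register one component failure and describe what it means."""
        self.healthy -= 1
        if self.healthy == 0:
            return f"CRITICAL: {self.name} has no redundancy left; service disrupted"
        if self.healthy == 1:
            return f"WARNING: {self.name} is down to its last component"
        return f"INFO: {self.name} still has {self.healthy} healthy components"

pool = RedundantPool("db-replicas", size=3)
for _ in range(3):
    print(pool.record_failure())

# INFO: db-replicas still has 2 healthy components
# WARNING: db-replicas is down to its last component
# CRITICAL: db-replicas has no redundancy left; service disrupted
```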
Modernize the Datacenter with Cloud-era Monitoring!
Simply upgrading infrastructure or datacenter architecture is not enough. To really manage services efficiently, organizations must modernize their monitoring tools. That means using tools that provide unified service insight across the entire environment, not just the health of individual components. An ideal solution keeps track of services and their underpinning components, so you know which services can be affected by which failures, which ultimately speeds up root cause analysis. Finally, it is important to keep in mind that today’s datacenter environment is dynamic: a mix of physical, virtual, and even cloud resources. If you are unable to monitor across that entire environment, you will be stuck doing element-level management, which is rarely effective and never efficient!
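One way to picture that unified, service-level view is as a model that maps each service to the components underpinning it, whether physical, virtual, or cloud. The sketch below is a toy illustration under that assumption; the inventory names are made up, and a real solution would discover and maintain such a model automatically rather than hard-coding it.

```python
# Sketch of unified service insight: a service model maps each service to
# the components underpinning it, wherever they run. When a component
# fails, a reverse lookup immediately names the services at risk.
# The inventory below is entirely invented for illustration.

SERVICE_MODEL = {
    "checkout":  {"web-vm-1", "web-vm-2", "db-physical-1"},
    "reporting": {"etl-cloud-1", "db-physical-1"},
}

def impacted_services(component: str) -> list[str]:
    """Return every service that depends on the failed component."""
    return sorted(s for s, parts in SERVICE_MODEL.items() if component in parts)

print(impacted_services("db-physical-1"))  # ['checkout', 'reporting']
print(impacted_services("web-vm-2"))       # ['checkout']
```

With this reverse lookup, root cause analysis starts from the affected service and narrows straight to shared components, instead of wading through element-level alerts one silo at a time.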
Deepak Kanwar is a Senior Manager at Zenoss with product marketing responsibilities for the company’s unified IT monitoring and management solutions. He has over 14 years of IT product marketing and management experience, including leadership roles at BMC Software, Dell, and Mezeo Software. Deepak has an MBA from Rice University and is Information Technology Infrastructure Library (ITIL) v3 certified.