How IT Managers Can Build IT Resilience
November 16, 2018
Featured article by Dave Baker, Managing Consultant, TDS
The pace of change in IT is accelerating. As businesses strive to become more agile and implement digital strategies that drive customer engagement and growth, the pressure on IT grows with it. IT teams must adopt new technologies to keep up with business requirements while maintaining a stable, secure infrastructure that withstands change without disruption in service. In other words, IT organizations need to build resilience.
In my experience working with clients across all industries here at TDS, businesses are actively pursuing digital transformation strategies, with financial services companies at the forefront. A recent Fujitsu survey found that 90 percent of financial services companies already have active digital transformation initiatives underway (Wall Street Journal and Fujitsu). Operational efficiency is cited as the main driver for these projects in finance, healthcare, and manufacturing, while industries like retail are seeking growth and others are responding to competitive threats. Whatever the rationale, the business depends on IT to adapt quickly and demonstrate its value by rapidly implementing technology and solutions that deliver results. IT, in turn, must understand how it all fits together and respond to both planned and unplanned change while retaining the stability and security that keep services running.
Innovation adds complexity to resilience efforts
The technologies that make IT environments scalable, dynamic, and agile, such as distributed computing, containers, software-defined data centers, machine learning, and AI, also make it difficult to build a resilient IT environment. Any change in IT, whether planned or not, must be assessed for its full impact across the entire infrastructure, and each new technology adds complexity that increases the unknowns. Consequently, for most companies, change raises the risk of outages and significant business disruption.
The tools and techniques available to support resilience planning are rapidly evolving as well. According to Gartner’s Market Guide for IT Resilience Orchestration, “ITRO automation software products, originally built to automate disaster recovery runbooks, have evolved to support application resilience, as well as migration from on-premises data centers to public clouds.” However, companies have a number of things to factor into the equation when architecting their “resilience stack” of processes and technologies.
Building IT resilience starts with understanding your current environment
Companies can mitigate disruption and reap the benefits of their investment in new technologies if they build a highly resilient infrastructure that ensures the ongoing performance of central business functions. To do so, they need to think beyond a static disaster recovery plan and begin to build a holistic and current view of their environment.
The first key to building IT resilience is to start your plan with a trustworthy data set. Your data can be out of date as soon as tomorrow. If you establish an automated process for keeping your asset and dependency data current, you reduce the likelihood of stale data or of human error introducing incorrect data. Without this process, you run the risk of your DR plans being less effective, or even worse, obsolete.
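As a minimal sketch of what such an automated freshness check might look like, the Python below pulls asset records from a hypothetical CMDB REST endpoint and flags any whose data has not been verified recently. The endpoint URL, field names, and staleness threshold are all assumptions for illustration, not a reference to any specific product.

```python
import datetime
import requests

# Hypothetical CMDB endpoint and freshness threshold; adjust for your tooling.
CMDB_URL = "https://cmdb.example.com/api/assets"
STALE_AFTER = datetime.timedelta(days=7)

def fetch_assets():
    """Pull current asset and dependency records from the CMDB."""
    response = requests.get(CMDB_URL, timeout=30)
    response.raise_for_status()
    # Assumed payload: a list of {"id": ..., "last_verified": ISO-8601 UTC} dicts.
    return response.json()

def find_stale_assets(assets):
    """Return the IDs of assets whose data has not been verified recently."""
    now = datetime.datetime.now(datetime.timezone.utc)
    stale = []
    for asset in assets:
        last_verified = datetime.datetime.fromisoformat(asset["last_verified"])
        if now - last_verified > STALE_AFTER:
            stale.append(asset["id"])
    return stale

if __name__ == "__main__":
    for asset_id in find_stale_assets(fetch_assets()):
        print(f"Asset {asset_id} has stale data; trigger re-discovery.")
```

Run on a schedule (cron or similar), a job like this turns data freshness from a manual chore into an automated check.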
Most DR plans are designed and tested for very particular scenarios, such as the proverbial ‘smoking hole’ where you lose an entire data center, or the loss of a storage array. In fact, these scenarios comprise just 20 percent of all unplanned outages. Most organizations do not have the processes, technology, or formal plans to address the other 80 percent, which stem from application failures or operational user error. To do so, companies must understand their current environment, its applications and dependencies, the varying RTOs/RPOs, and any compliance requirements associated with each application. This information needs to be actionable and kept up to date so that you make the right decisions and can quickly analyze it case by case. With this complete understanding of both the architecture and the dependencies of their applications, companies can begin to build an agile infrastructure that rapidly adopts new technology.
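To make that actionable, it helps to capture applications, their dependencies, recovery objectives, and compliance tags as structured data rather than tribal knowledge. The Python sketch below is one minimal way to model such an inventory; the schema, application names, and values are invented for illustration, not a prescribed format.

```python
from dataclasses import dataclass, field

@dataclass
class Application:
    name: str
    rto_minutes: int                                      # recovery time objective
    rpo_minutes: int                                      # recovery point objective
    compliance: list[str] = field(default_factory=list)  # e.g. ["HIPAA"]
    depends_on: list[str] = field(default_factory=list)  # upstream applications

# An invented three-tier example: a web front end that depends on an
# orders service, which in turn depends on a database.
inventory = {
    "web-frontend": Application("web-frontend", rto_minutes=30, rpo_minutes=15,
                                depends_on=["orders-service"]),
    "orders-service": Application("orders-service", rto_minutes=60, rpo_minutes=30,
                                  depends_on=["orders-db"]),
    "orders-db": Application("orders-db", rto_minutes=120, rpo_minutes=5,
                             compliance=["HIPAA"]),
}

# With dependencies and objectives in one place, questions like "what sits
# directly downstream of this database?" become simple queries.
downstream = [app.name for app in inventory.values()
              if "orders-db" in app.depends_on]
print(downstream)  # ['orders-service']
```

In practice a discovery tool or CMDB would populate this automatically, but even a simple model like this turns dependency questions into queries rather than archaeology.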
Moving beyond the spreadsheet and leveraging collaborative, real-time methods
Traditional methodologies are insufficient for today’s complexities and result in plans that are too rigid and cumbersome to deal with unpredictable realities. Think about it: in an environment where one change can have a snowball effect, a single unanticipated move can render your entire spreadsheet-based plan unworkable. And when plans are not part of a live, collaborative environment, they are almost immediately outdated.
So, in developing the capability to be more agile and resilient, companies must approach their IT planning differently. They must fully understand their environment and the interplay between applications, hardware, and networks, and be able to:
– Bridge siloed information sources. First, you should have the ability to automate data ingestion and normalization from market-leading ITSM, CMDB, and DCIM tools, giving users an aggregated, consolidated view of applications and their interdependencies across a complex, hybrid IT landscape.
– Create automated runbooks. You also need the ability to automatically generate runbooks that can be executed in the event of an outage. While a runbook provides step-by-step instructions, it should also automatically delegate tasks and ensure they are executed in the proper sequence, reducing risk and overall event time by up to 50 percent.
– Manage both humans and computers. In responding to outages and other changes, organizations often get tripped up by the interplay between tasks that must be carried out by human beings and those handled by systems. In reality, it all has to be coordinated: the ability to assign, manage, and track tasks carried out by both humans and systems is critical.
– Facilitate communication and collaboration. One of the biggest problems with using spreadsheets for resilience planning is that when a spreadsheet is updated in real time, only the person doing the updating has the current version. Ultimately, recovering critical applications may be at the mercy of Excel version control! So it’s imperative that resilience planning leverage a collaborative software platform that facilitates effective communication and doesn’t burden the organization with inefficient, time-consuming meetings.
– Adjust to change. As previously mentioned, one unexpected change can throw a resilience plan into chaos. So it’s critical that you build in the flexibility to quickly identify and analyze unexpected changes on the ground, such as the need to shift the downtime window for an application or to change the event asset inventory (the kind of thing that can upend conventional project management methods). As one of our clients recently admitted, before our software gave them the ability to readily see how changes to one application affected others, they were simply waiting for the inevitable phone call reporting an outage, which is not a pleasant position to be in.
– Plan for the unthinkable. Effective resilience planning requires the ability to explore multiple scenarios, even unthinkable ones, which means continually running tests and conducting ‘what if’ analysis and resource planning.
– Meet changing compliance and regulatory requirements. To comply with new and changing government regulations and business-level SLAs, IT must understand the impact on each application. When the regulations associated with each application are tied into the execution of tasks generated by dynamic runbooks, there is less chance of missing the implications of a change. For example, one client was able to quickly identify assets containing HIPAA data, follow the automated procedure for restoring and securing that data, and issue the relevant notifications.
– Meet ever-changing RTO/RPO requirements. As the IT landscape continually evolves, so do recovery objectives, and you need the ability to quickly analyze whether those limits are still feasible. In another client situation, we were required to analyze upstream and downstream dependencies to identify which applications would not meet RTO requirements based on app-to-app interdependency; a simplified sketch of that kind of analysis follows this list.
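To make that last point concrete, here is a simplified sketch of app-to-app RTO feasibility analysis. It computes each application’s effective recovery time by walking its upstream dependencies (which must be restored first) and flags any application whose stated RTO is no longer achievable. The dependency map, recovery times, and targets are invented for the example.

```python
from functools import lru_cache

# Invented example data: each application's own estimated recovery time
# (minutes), its upstream dependencies, and its stated RTO target.
RECOVERY_MINUTES = {"web": 20, "orders": 30, "db": 45}
DEPENDS_ON = {"web": ["orders"], "orders": ["db"], "db": []}
RTO_TARGET = {"web": 60, "orders": 90, "db": 60}

@lru_cache(maxsize=None)
def effective_recovery(app):
    """Recovery time for an app, including the upstream apps it waits on."""
    upstream = max((effective_recovery(dep) for dep in DEPENDS_ON[app]), default=0)
    return upstream + RECOVERY_MINUTES[app]

for app in DEPENDS_ON:
    total = effective_recovery(app)
    status = "OK" if total <= RTO_TARGET[app] else "AT RISK"
    print(f"{app}: effective recovery {total} min vs RTO {RTO_TARGET[app]} min -> {status}")
```

In this toy inventory the web front end recovers in 20 minutes on its own, but because it cannot come up before the orders service and the database, its effective recovery time is 95 minutes, well past its 60-minute RTO. That gap is exactly what dependency-aware analysis surfaces and what a standalone per-application plan misses.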
With these capabilities in place, you’re well on your way to building resilience into today’s increasingly complex IT infrastructures.
Dave Baker is a managing consultant for TDS (Transitional Data Services).