In the age of exploration, cartographers used to navigate around the world and map the coastlines of unexplored continents. The coastline of IT, and moreover the inner landscapes and features, has become much more complex than a decade ago. The cost and effort needed to perform adequate mapping the old way has gone way upwards, and manual mapping is no longer an affordable endeavor, save for a productive one. Organizations and administrators need a solution to the problem, but where to start from?
To continue on this analogy, explorers of old had a few things to help themselves: maps of the known world, navigation instruments and the stars. They also set sail to discover the vast world and uncover its riches, at the price that most of us know now. Back to our modern world: our goal is to understand which services are critical to a business service, and the reason why we want to understand this is clear. We want to ensure the delivery of IT services with the best possible uptime and performance, without disruptions if possible.
It’s essential to start from the business service view. We need to base ourselves, like explorers of old, on existing maps and features as a reference point. Each organization will have its own way of documenting (hopefully), but the most probable starting point would be a service Business Impact Assessment (BIA). The BIA would give a description of upstream and downstream dependencies of a given service, application platforms (and eventually named systems) involved in supporting the service. From there, we can eventually be led to documentation that describes an application, its components, architecture, and systems.
Creating and maintaining a catalog of business impact assessments diverges from the usual kind of works IT personnel does. It might not even be a purely IT endeavor, as compliance departments in larger organizations may own the process. Nevertheless, it is essential that IT is involved because a BIA is the ideal place to capture criticality requirements. It helps articulate how a given process or service impacts the organization’s ability to conduct business operations, assess how the organization is impacted in case of failure, and determine the steps to recover the service. Capturing adverse impact is a key activity because it helps to classify the criticality of the service itself in case of failure. Impact can be financial (loss of revenue, loss of business), reputational (loss of trust from investors/ customers/partners, press scrutiny), or regulatory (loss of trust from regulatory bodies/legislative authorities, regulatory scrutiny, regulatory audits, and eventually even revocation of license to operate in a given country/region for regulated businesses).
The inconvenience with any BIA or written document is that they are a point-in-time description of a service, which is cast in stone until the next documentation revision date. Therefore it is a necessity to engage with the business process owners, and eventually with application teams, to understand if any changes were introduced. While this allows for a better view of the current state, it has the disadvantage of being a manual process with a lot of back-and-forth interactions. Another challenge we might encounter is that the BIA strictly covers a single process, without mentioning any of the upstream/downstream dependencies, or perhaps mentioning them, but without referring to any document (because there was no BIA done for another service, for example). It might also be impossible to even get one done, because a given process could rely on a third-party service or data source, over which we have no control.
There’s also another challenge looming: Shadow IT. Shadow IT broadly characterizes any IT systems that support an organization’s business objectives, but fall outside of IT scope either by omission or by a deliberate will to conceal the existence of such systems to IT. Because these systems exist outside of a formally documented scope, or are not known to IT organizations, it is very difficult to assert their criticality, at least from an IT standpoint. Portions of business processes or entire business divisions may be leveraging external or third-party services, upon which IT has no oversight or control, and yet IT would be held responsible in case of failure.
How can IT understand the criticality of a given application service in the context of a business service when the view is incomplete or even unknown?
- From a business perspective, the organization leadership should assert or reassert IT’s role in the organization’s digital strategy, by making IT the one-stop shop for all IT related matters. Roles and responsibilities must be well established, and the organization’s leadership (CIO / CTO) should take an official stance on how to handle shadow IT projects.
- From a compliance perspective, clear processes must be established about services & systems documentation. The necessity to document business processes and underlying technical systems / platforms is evident, critical services from a business perspective should be documented via Business Impact Analysis and collected/regularly reviewed in the documentation that covers the organization business continuity strategy (usually a Business Continuity Plan).
- From a technical perspective, the IT organization should be involved into compliance / documentation processes not only for review purposes but also to provide the technical standpoint and provide the necessary technical steps that fall under the Business Continuity/Disaster Recovery strategy.
To encompass these three perspectives, regular checkpoints, meetings or review can help maintain the consistency of the view and the strategy. Is this however sufficient? Unfortunately, not always. Those concepts work perfectly with consistent and stateful processes/systems, but the gradual advent of ephemeral workloads that can be spinned up or scaled down on demand becomes difficult to keep full track.
While a well-defined documentation framework is necessary to establish processes that must be adhered to, and while documented processes with prioritization and criticality levels are essential, it is also necessary to complement this approach with a dynamic and real-time view of the systems.
Modern IT operations management tools should allow the grouping of assets not only by category or location, but also by logical constructs, such as an application view or even a process view. These capabilities have existed in the past, but were always performed manually. Advanced management platforms should leverage traffic flow monitoring capabilities to understand which systems are interacting together, and logically group them based on traffic types. This requires a certain level of intelligence built into the tool. For example, in a Windows-based environment, many systems will communicate with the Active Directory domain controllers, or with a Microsoft Systems Center Configuration Manager installation. The existence of traffic between multiple servers and these servers doesn’t necessarily imply an application dependency. The same could be said on a Linux environment where traffic happens between many servers and an NTP server or a yum repository. On the other hand, traffic via other ports could hint at application relationships. A web server communicating with another server via port 3306 would probably mean a MySQL database is being accessed and would constitute plausible evidence of an application dependency.
Knowing which services are critical to a business service doesn’t require the use of a Palantir. It should be a wise blend of relying on solid business processes and on modern IT operations management platforms, with a holistic view of interactions between multiple systems and intelligent categorization capabilities.