Monitoring Business Plan: Part One
So, you finally decided to start monitoring your environment or you’re considering upgrading your existing monitoring solutions. Maybe you already decided on a specific product, the proof of concept worked out fine, and you got a great offer from the vendor. You managed to get buy-in from other relevant IT teams so all of you can use the same system, access the same data, and everyone’s going to be happy, and there’s a shiny unicorn in front of your office door saying “you shall not pass” to anyone who recently requested a new laptop.
But now you’re facing the most significant challenge:
You need to convince the guys with the money.
Let me help you!
Depending on the situation, you’ll either talk to the IT director, or, in larger organizations, to the CIO. This discussion is usually considered less critical, as these guys most likely worked their way up the ranks in IT, understand what you’re talking about, and, more importantly, what you need.
Still, they’re managing a limited budget, and you’re asking for another slice out of it.
The other possible situation, unfortunately, is a little more complicated, as it might require you to talk to non-IT guys, in finance for example, or the CFO directly, who is, quite often, like the sum of all the villains in the movies. Accountants. Try to avoid.
These people don’t know the difference between TCP and UDP. But that’s OK, as you probably haven’t heard of the Modigliani-Miller theorem. And that’s OK as well; I had to look it up.
It’s a language issue. And while it would be nice to learn a new language for one-time use on a holiday, it’s probably sufficient to understand how to order food or, in this case, how to bring your request to upper management. I think the idea of treating the C-suite as a foreign country is a very valid one; just make sure it isn’t a one-way trip.
Whatever the case is, you should prepare to meet resistance. Simply asking for more money doesn’t cut it, but the more details and data you can provide, the more comfortable and controllable the discussion will be.
Take a step back and consider the following:
You’re asking for a limited resource, and you’re asking the person managing this resource. It’s their job to spend it wisely, but to spend it nonetheless. One of their tasks is “giving,” and that obviously comes with an element of “taking.”
So, what will you be giving them in exchange?
At some point, the following question will be asked:
“Why do you need extra money for something you’re already doing?”
Now, that’s one mean question as it implies so much. But let’s put the grief aside for a moment.
One right answer could be “To save time (saving money), prevent outages (saving money), and minimize the time to repair (saving money).” So, let’s focus on the “why.”
At this stage, it shouldn’t be your job to provide a full business continuity plan, but let me give you some background and ideas.
Keep one thing in mind, and this is your advantage:
It’s your job to deliver a service (IT), and these deliverables could be delayed or stop altogether.
The first thing you need to do is to define the cost of an outage, including direct and indirect loss, and fixed and variable ones.
Let’s say node/application/connection X just went down. Replace X with something business-critical. No, not YouTube.
This element X could be the worst-case scenario, like an online store gone offline or a sales department that can no longer use the phones, meaning the company is unable to function as a business.
In risk management, the calculation starts with a simple formula:
(time) × (number of services/products created in the specified time) × (value of each service/product) = loss caused by the outage.
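As a rough sketch, the formula can be written as a few lines of Python. The function name and the sample figures (a two-hour outage, 40 orders per hour, $75 per order) are made up for illustration and don’t come from any real business.

```python
def outage_loss(hours_down, units_per_hour, value_per_unit):
    """Direct loss = time x units created per unit of time x value per unit."""
    return hours_down * units_per_hour * value_per_unit


# Hypothetical numbers: 2 hours down, 40 orders per hour, $75 per order.
loss = outage_loss(hours_down=2, units_per_hour=40, value_per_unit=75)
print(f"Estimated direct loss: ${loss:,.0f}")  # Estimated direct loss: $6,000
```

Swap in your own figures; the point is only that the three factors multiply.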
But you may also consider the following:
- Your finance department can tell you how much money your company makes in a day
- HR will know the overall cost of employee time
- Both together, plus double the value of the IT team’s time (because you need to fix something) and any optional third-party cost, should give you an idea of how to approach it
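One possible way to combine these inputs is sketched below. It assumes that finance and HR can give you hourly figures, that the staff cost excludes the IT team, and that “double” means counting the IT team’s hours twice. All names and numbers are hypothetical.

```python
def estimate_outage_cost(hours_down,
                         revenue_per_hour,
                         staff_cost_per_hour,
                         it_team_cost_per_hour,
                         third_party_cost=0.0):
    """Rough cost of one outage, built from the inputs listed above."""
    lost_revenue = hours_down * revenue_per_hour       # figure from finance
    idle_staff = hours_down * staff_cost_per_hour      # figure from HR
    # Assumption: IT time is counted twice because the team has to fix things.
    it_repair = 2 * hours_down * it_team_cost_per_hour
    return lost_revenue + idle_staff + it_repair + third_party_cost


# Hypothetical example: a 3-hour outage with a $1,500 consultant invoice.
cost = estimate_outage_cost(hours_down=3,
                            revenue_per_hour=10_000,
                            staff_cost_per_hour=2_500,
                            it_team_cost_per_hour=400,
                            third_party_cost=1_500)
print(f"Estimated outage cost: ${cost:,.0f}")  # Estimated outage cost: $41,400
```

Treat the result as a conversation starter with finance, not as an audited number.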
There are various calculations to provide estimates, but the immediate cost per minute can quickly go beyond $5,000 for an SMB and up to six digits for enterprises. Per minute, so you might even need calc.exe to determine the cost of a full hour!
Other elements to consider, although challenging to quantify, are the loss of reputation with vendors and customers, and possible legal trouble if other businesses depend on yours, up to a level where your organization may not be able to survive.
In times of social media, an outage of a well-known company will be visible until the end of time, or at least until the elders of the internet decide to remove it. The shareholders won’t like it.
Even when you manage to fix the issue, there’s additional downtime involved. What happens in a large office when the phones stop working? Folks go for a coffee break or have a chat somewhere else. At least, that’s what I would do. Anyway, it takes time for everyone to start working again and for order to be restored.
Also, depending on the situation, your company may lose unrecoverable data because it hasn’t been saved yet. We’ve seen it all—backups, lol?
And finally, when the phones stop working, the sales floor will report it to you. Not just one single person. To give you an example: I remember an outage while working for an online gaming company, and we received thousands of tickets stating, “Hey your login servers are no longer working. My online event starts in 20 minutes. FIX IT NOW.” That’s thousands of incidents to reply to until your fingers break from all the copy-and-pasting.
The sum of all of this is the cost for the duration of a single outage, including the mean time to repair.
Don’t forget: the now infamous element X—how often did it fail last year? Always estimate not only the duration, but also the frequency of incidents.
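To show how frequency changes the picture, here is a minimal sketch that scales the cost of a single outage up to a full year. The function name and the figures (four incidents, 1.5 hours mean time to repair, $12,000 per hour) are assumptions for illustration only.

```python
def annual_outage_cost(incidents_per_year, mean_hours_to_repair, cost_per_hour):
    """Yearly cost = number of incidents x average duration x hourly cost."""
    return incidents_per_year * mean_hours_to_repair * cost_per_hour


# Hypothetical: element X failed 4 times last year, 1.5 hours each time.
yearly = annual_outage_cost(incidents_per_year=4,
                            mean_hours_to_repair=1.5,
                            cost_per_hour=12_000)
print(f"Estimated yearly cost: ${yearly:,.0f}")  # Estimated yearly cost: $72,000
```

A number like this is what you put next to the price of the monitoring product.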
All this intelligence should help to prepare for the “why” question. But there’s more to do—there’s the “how.” Read part two to learn “how.”