As IT professionals, we have all been there: the dreaded help desk call, “Is email down?” (Replace email with any business-critical application.) One of the worst things we deal with in this line of work is an outage. Outages of critical systems can affect the business in many ways. Take email, for example, as it is the main communication tool for almost any type of business. Email is used not only for communication but also for business transactions: purchase orders, invoices, and payroll can all be sent via email. When email is down, the business loses more than the ability to communicate; there is also a potential loss of productivity, as well as dollars lost due to missed transactions. What a nightmare!
Put on your fire suit and grab the hose…
I used to compare working in IT to being a firefighter: putting out fires every day with barely any time to breathe. Having spent the majority of my IT career specializing in Exchange, I know a lot about putting out fires. Exchange fires can have a devastating impact on the business. For those of you not aware of what Exchange servers do, in a nutshell, they are the application servers that provide your email.
IT fires, which some may call system outages, blips, or even gremlins, can happen to any server or application. No system is immune. Even your iPhone has blips and hiccups, just like servers do. After we have put out the fire, the first thing management asks is, “What is the root cause, and how do we prevent this from happening again?”
That's the million-dollar question, because historically IT has been a reactive service. By that I mean we often react to issues, aka fires, rather than prevent them. Most IT departments are so short-staffed, with an ever-increasing workload, that they have become firefighters by default, just trying to stay afloat by putting out fires day to day with barely enough time to get innovative projects done. Let’s step back and think about this for a minute. Wouldn't it be more efficient in the long run to be proactive? To prevent the fire before it happens and spend your time on projects that drive innovation? Or, at the very least, to know about the fire before the dreaded help desk call? If we could get a running start on the fire while it’s still just a flicker, rather than a widespread blaze, we could reduce the negative impact on the business.
There is a way to be proactive: use the monitoring and alerting tools currently available on the market. Some of them are even FREE! You can monitor not only your servers but all the way down to the application level. YES!! I can be alerted every time RPC requests take too long on an Exchange Mailbox server or when someone deletes a read receipt without reading it! Each tool has its pros and cons, depending on your budget and scope.
I downloaded the monitoring tool, now what?
Anybody can download monitoring software and set up some alerts. However, to get real value from these tools, you have to do it the right way. What is the right way, you ask? Well, it starts with asking yourself or your application team a few questions.
- What are you monitoring and alerting for? Be specific: is it a server outage, or do you need to catch application issues, such as scheduled SQL queries not running or failed RPC requests to Exchange mailbox servers?
- What key indicators or triggers do you need to be alerted on?
- What are the thresholds, if any?
- Who needs to be notified and how soon and often?
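To make those questions concrete, here is a minimal sketch of how the answers might be written down as an alert rule. It is not taken from any particular monitoring product: the `AlertRule` class, its field names, and the `exchange-oncall` recipient are all hypothetical, and real tools expose these settings in their own way.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertRule:
    """Hypothetical alert rule capturing the answers to the questions above."""
    name: str               # what you are monitoring
    threshold: float        # a sample above this value counts as a breach
    sustained_samples: int  # consecutive breaches required before alerting
    notify: List[str]       # who needs to be notified
    repeat_every: int       # re-notify after this many further breaching samples
    _streak: int = field(default=0, init=False)

    def observe(self, value: float) -> bool:
        """Record one sample; return True if a notification should go out."""
        if value <= self.threshold:
            self._streak = 0  # back to normal, so reset the breach streak
            return False
        self._streak += 1
        if self._streak < self.sustained_samples:
            return False      # a short blip, not worth an alert yet
        # Alert on the first sustained breach, then every repeat_every samples.
        return (self._streak - self.sustained_samples) % self.repeat_every == 0


# Example: RPC latency in milliseconds, sampled at a fixed interval.
rule = AlertRule(
    name="rpc_latency_ms",
    threshold=50.0,
    sustained_samples=3,
    notify=["exchange-oncall"],
    repeat_every=5,
)
samples = [60, 70, 40, 60, 70, 80, 90, 95, 99, 91, 92]
decisions = [rule.observe(s) for s in samples]
```

Requiring several consecutive breaches before alerting, and repeating only every few samples, are two simple knobs for deciding how soon and how often someone gets notified.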
This process can take days or even months to fine-tune. It's not an overnight job if you want to get the most value from it. If you set up all your alerts without any fine-tuning, you will be overwhelmed with alerts, and they become white noise. Trust me, been there, done that! The Outlook rules in my inbox are proof that alerting can cause white noise. On the flip side, with too few alerts you could miss the chance to prevent an issue, and then the monitoring tool provides no value. What’s the point of alerting if you are not paying attention? The key to being a successful proactive IT shop is finding the right balance. Shameless plug moment... my co-host Theresa Miller and I did an episode of The Current Status podcast with Leon Adato on monitoring and alerting a few months back, where we discussed finding this balance and keeping the white noise down. Check it out!
https://youtu.be/dy5qGllGz0s
Monitoring and alerting, let’s take it up a notch…
It's more than making sure that your server is pingable. Monitoring and alerting have stepped up their game: it’s not only about the server anymore, but also about knowing what the application running on that server is doing. It’s about taking a view of the whole picture, inside and out, to help you meet and maintain your SLA.
With the ever-increasing demands of the business and the expectation that systems are always up, it’s absolutely critical for IT departments to know what is going on. We need to understand what our applications are doing, when they're doing it, and why. Why are RPC requests taking too long on a mailbox server? Is it a network issue or a disk issue? Why did my scheduled SQL query fail overnight? With application monitoring, we can be alerted when these issues happen and correlate them with other events that may have happened on the server to prevent a fire or determine a root cause.
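As a rough illustration of that kind of correlation, the sketch below pulls out every event logged in the minutes leading up to an alert. The `events_near` function, the ten-minute window, and the sample log entries are all made up for the example; a real monitoring tool would do this against its own event store.

```python
from datetime import datetime, timedelta


def events_near(alert_time, events, window_minutes=10):
    """Return (timestamp, message) events from the window before the alert, oldest first."""
    start = alert_time - timedelta(minutes=window_minutes)
    nearby = [(ts, msg) for ts, msg in events if start <= ts <= alert_time]
    return sorted(nearby)


# Hypothetical server events; the messages are illustrative only.
events = [
    (datetime(2024, 1, 1, 1, 30), "backup job started"),
    (datetime(2024, 1, 1, 2, 54), "disk queue length high"),
    (datetime(2024, 1, 1, 2, 58), "network interface reset"),
    (datetime(2024, 1, 1, 3, 15), "backup job finished"),
]
alert_time = datetime(2024, 1, 1, 3, 0)  # when the slow-RPC alert fired
suspects = events_near(alert_time, events)
```

Here `suspects` holds the disk and network events from just before the alert, which is a starting point for answering the "network issue or disk issue?" question rather than a root cause in itself.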
Understanding the behaviors and patterns of our servers and applications may even help predict what will happen. For instance, predictive analysis can already detect failing disk drives in servers and storage arrays. There are even monitoring tools that will predict how much your storage will grow and when you will run out of it, based on your usage patterns. Just imagine if predictive analysis could examine workload trends, CPU and hardware utilization, and application errors and warnings, then correlate them to determine whether an outage is coming. Predicting a potential issue can mean fewer unwanted outages and fewer headaches!
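For a feel of how a storage growth prediction could work, here is a minimal sketch that fits a straight line to daily usage samples and extrapolates to the day the volume fills up. This is my own simplification, not how any specific tool does it: the `days_until_full` function is hypothetical, it assumes one sample per day, and it assumes growth is roughly linear, which real usage often is not.

```python
def days_until_full(used_gb, capacity_gb):
    """Estimate days from the last sample until capacity is reached.

    Fits a least-squares line to daily usage samples (one per day, oldest
    first). Returns None if usage is flat or shrinking.
    """
    n = len(used_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")
    days = range(n)
    mean_x = sum(days) / n
    mean_y = sum(used_gb) / n
    # Least-squares slope: growth in GB per day.
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(days, used_gb))
        / sum((x - mean_x) ** 2 for x in days)
    )
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity_gb - intercept) / slope  # day index where the line hits capacity
    return max(0.0, day_full - (n - 1))           # measured from the most recent sample


# Five days of usage on a 200 GB volume, growing about 10 GB per day.
remaining = days_until_full([100, 110, 120, 130, 140], capacity_gb=200)
```

With these sample numbers the estimate comes out at about six days, which is the kind of heads-up that lets you add storage before anyone calls the help desk.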
In summary, this is by no means a foolproof solution. You will still get fires, but far fewer of them, and you’ll be aware of what your systems are doing. You will get a clearer view of where the gremlins are running to, and maybe even catch one or two.