Now that you've configured monitoring and alerting for your applications, what are you going to do with the knowledge you’ve gained? Using some of the data that you receive from your monitoring can help you diagnose issues and discover their root cause.
When troubleshooting any issue, regardless of the application or system, you should always check your application and system logs. These should be the first places you look for warnings and errors. Often, though not every time, application and system errors will write warnings and errors to the event logs. If your monitoring tool has the ability, you can even configure it to alert you on specific event IDs. These event IDs are often the first indication that you may have an issue.
An example:
Event ID: 10027 Source: MSExchangeIS –
There are 5 RPC requests for the mailbox "48016d70-fc4e-417f-a68d-bb3199c82a51" on the database "25ac6881-b8d9-469b-a046-4bfa86a5ec07: / o=DOMAIN/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Configuration/cn=Servers/cn=XXX-XXXXXXX-XX/cn=Microsoft Private MDB" that have taken an abnormally long time to complete. This may be indicative of performance issues with your server.
If you are seeing a lot of these, it can be an indication that you are having performance issues on your Exchange server. Receiving these warnings only periodically is generally harmless. It usually means something is taking longer than Windows® thinks it should to perform an operation remotely on that mailbox. This can be anything from a search to mailbox maintenance operations.
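To make the difference between "a lot" and "only periodically" concrete, here is a minimal Python sketch that checks whether a given event ID shows up in bursts. The record layout, the one-hour window, and the threshold of five events are all assumptions chosen for illustration; they are not values from Exchange or from any particular monitoring tool.

```python
from datetime import datetime, timedelta

def is_frequent(timestamps, window=timedelta(hours=1), threshold=5):
    """Return True if `threshold` or more events fall inside any `window`-long span.

    The window and threshold defaults are illustrative assumptions.
    """
    ordered = sorted(timestamps)
    start = 0
    for end, ts in enumerate(ordered):
        # Move the left edge forward until the span fits inside the window
        while ts - ordered[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False

# Hypothetical event records, for example exported from the Application log
events = [
    {"id": 10027, "source": "MSExchangeIS", "time": datetime(2024, 1, 1, 9, m)}
    for m in (0, 5, 10, 15, 20)
]

hits = [e["time"] for e in events
        if e["id"] == 10027 and e["source"] == "MSExchangeIS"]
print(is_frequent(hits))
```

A burst like the one above would be worth an alert, while a handful of warnings spread across a day would not trip the check.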
Now that we have been alerted, let’s take a look at how to use this data and troubleshoot common issues.
My email is slow!
We’ve all been there. We get the dreaded call from the help desk saying someone’s email is slow. “My email is slow” can mean just about anything is wrong. Is it the local computer? Network latency? A server-side issue?
When troubleshooting a complex issue such as “my email is slow,” always check the performance counters that you are monitoring. These counters will indicate if there are latency or performance issues. Knowing the right key performance counters to look at can help you determine whether the issue is client-side or server-side.
Key counters to look at:
MSExchangeIS\RPC Requests – Should stay lower than 70. Indicates how many RPC threads are currently in use.
MSExchangeIS\RPC Operations/sec – Should always be higher than RPC Requests. This is the number of operations the server received in the past second.
If MSExchangeIS\RPC Requests is climbing quickly while MSExchangeIS\RPC Operations/sec stays flat, the server cannot process client operations fast enough or is having performance issues. This points to the Exchange server as the source of the problem. When all RPC threads have been exhausted, clients are unable to submit new requests to the server until threads are released.
When MSExchangeIS\RPC Requests and MSExchangeIS\RPC Operations/sec are both low or at zero, the Exchange server is probably not the cause of the slowness. More likely, the problem lies with an external source, such as Active Directory®, the network, or even the client. When you are seeing high RPC requests, you can use additional tools such as ExMon. This tool can pinpoint the particular mailbox user who is causing trouble and affecting everybody else. A bad mailbox with corrupted items often shows up in ExMon with extremely high RPC requests.
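The reasoning about RPC Requests and RPC Operations/sec can be sketched as a small Python function. This is only an illustration of the decision logic, not a monitoring tool: the function name, the 10 percent "stable" tolerance, and the cutoff of 5 for "low" are assumptions, and only the limit of 70 comes from the counter guidance above.

```python
def classify_rpc(requests, ops_per_sec, limit=70, low=5, tolerance=0.10):
    """Classify paired counter samples (oldest first).

    requests    -- samples of RPC Requests
    ops_per_sec -- samples of RPC Operations/sec
    `low` and `tolerance` are illustrative assumptions.
    """
    if len(requests) < 2 or len(requests) != len(ops_per_sec):
        raise ValueError("need two or more paired samples")

    # Thread pool at or past the limit: the server is the bottleneck
    if requests[-1] >= limit:
        return "server-side"

    # Both counters low or at zero: Exchange is probably not the cause
    if max(requests) <= low and max(ops_per_sec) <= low:
        return "look elsewhere"

    # Requests climbing while throughput stays flat
    rising = requests[-1] > requests[0] and all(
        later >= earlier for earlier, later in zip(requests, requests[1:])
    )
    mean_ops = sum(ops_per_sec) / len(ops_per_sec)
    stable = (max(ops_per_sec) - min(ops_per_sec)) <= tolerance * mean_ops
    if rising and stable:
        return "server-side"

    return "inconclusive"


print(classify_rpc([10, 25, 40, 60], [200, 205, 198, 202]))
print(classify_rpc([0, 1, 0, 0], [0, 2, 1, 0]))
```

An "inconclusive" result is a cue to bring in the other counters and tools rather than a verdict on its own.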
You will also want to look at additional counters, such as memory and disk activity. For example, if the disk performance counters are above the thresholds below, the Exchange server could be suffering from disk issues. You could have failing hardware or simply not enough spindles.
- Disk
  - Database disks
    - PhysicalDisk\Avg. Disk sec/Read - less than 20ms
    - PhysicalDisk\Avg. Disk sec/Write - less than 20ms
  - Transaction log disks
    - PhysicalDisk\Avg. Disk sec/Read - less than 5ms, spikes no higher than 50ms
    - PhysicalDisk\Avg. Disk sec/Write - less than 10ms, spikes no higher than 50ms
  - Log buffer
    - Database\Log Record Stalls/sec - below 10 per second, spikes no higher than 100 per second
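The disk latency thresholds above can be checked against collected samples with a few lines of Python. This is a rough sketch: the table keys and the function name are made up for the example, and it assumes the samples arrive in seconds, which is how the PhysicalDisk latency counters report, so they are converted to milliseconds first.

```python
# (average limit in ms, spike limit in ms); None means no spike limit is listed
THRESHOLDS_MS = {
    ("database", "read"): (20, None),
    ("database", "write"): (20, None),
    ("log", "read"): (5, 50),
    ("log", "write"): (10, 50),
}

def check_latency(disk_role, operation, samples_sec):
    """Return a list of threshold violations for one counter's samples."""
    avg_limit, spike_limit = THRESHOLDS_MS[(disk_role, operation)]
    samples_ms = [s * 1000 for s in samples_sec]  # counter values are in seconds
    average = sum(samples_ms) / len(samples_ms)

    problems = []
    if average >= avg_limit:
        problems.append(f"average {average:.1f}ms is not below {avg_limit}ms")
    if spike_limit is not None and max(samples_ms) > spike_limit:
        problems.append(f"spike of {max(samples_ms):.1f}ms exceeds {spike_limit}ms")
    return problems


for problem in check_latency("log", "write", [0.004, 0.060, 0.005]):
    print(problem)
```

Keeping the limits in one table makes it easy to adjust them if your storage vendor or Exchange version recommends different values.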
High CPU utilization on your Client Access Servers can also cause slowness. Excessive CPU consumption by the RpcClientAccess.Service.exe process can be caused by many things, from bad iOS code on mobile devices to third-party archiving software that is integrated with Exchange. Sometimes it’s as simple as anti-virus scanning on the Exchange servers consuming the resources.
When you’ve ruled out the client-side configuration, the Exchange server, and network latency as the source of the issue, you should look at Active Directory. I won’t get into too much detail here as I will cover Active Directory monitoring in a later blog post, but I did want to point out some things to consider when troubleshooting.
Exchange depends on Active Directory. It uses the Global Catalog domain controllers. When there is an issue on the Active Directory side, it can negatively impact Exchange. As part of your troubleshooting, you should investigate CPU, disk, and memory bottlenecks on the Active Directory servers, but there are also AD-related counters on the Exchange servers. To determine if there is an Active Directory issue affecting Exchange, use these counters, found in Performance Monitor on the Exchange servers:
- SMTP Server\Categorizer Queue Length should not be greater than 10. This shows how SMTP is processing LDAP lookups against global catalog servers. If the value is greater than 10 and is increasing, this can point to slow global catalog servers. Keep in mind that this value can go slightly higher if large distribution lists are being expanded.
- MSExchangeDSAccess Process\LDAP Read Time (for all processes) – This shows how long an LDAP read request takes to be fulfilled. The average value is around 50ms and should not exceed 100ms.
- MSExchangeDSAccess Process\LDAP Search Time (for all processes) – This shows how long an LDAP search request takes to be fulfilled. As with LDAP Read Time, the average value is around 50ms and should not exceed 100ms.
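As a rough illustration of how these three counters fit together, the sketch below flags a categorizer queue that is above 10 and still growing, and LDAP read or search times that exceed 100ms. The function name and the way samples are passed in (plain lists, oldest first, LDAP times already in milliseconds) are assumptions for the example.

```python
def ad_warnings(categorizer_queue, ldap_read_ms, ldap_search_ms):
    """Return warnings that suggest Active Directory is slowing Exchange down."""
    warnings = []

    # Above 10 and still increasing points to slow global catalog servers
    if categorizer_queue[-1] > 10 and categorizer_queue[-1] > categorizer_queue[0]:
        warnings.append("categorizer queue is above 10 and growing")

    # LDAP read and search times should not exceed 100ms
    for name, samples in (("read", ldap_read_ms), ("search", ldap_search_ms)):
        if max(samples) > 100:
            warnings.append(f"LDAP {name} time exceeded 100ms")

    return warnings


for warning in ad_warnings([3, 8, 14], [40, 55, 60], [45, 120, 50]):
    print(warning)
```

A queue that briefly rises during a large distribution list expansion and then drains would not be flagged once the latest sample drops back to 10 or below.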
Monitoring and alerting are not foolproof, but using some of these tips should guide you in the right direction to resolving common Exchange issues.