Understanding Your SQL Server Monitoring: Part 3 - Memory (RAM)
March 3, 2016
Database
In this series, I’ve tried to help you see the importance of SQL Server® monitoring, and also understand what that monitoring is telling you. So far, we’ve looked at how storage and CPU problems are identified using traditional SQL Server monitoring solutions. I went through how incorrect values for some settings in SQL Server can hurt you as well, like MAXDOP and Cost Threshold for Parallelism. This article will take us to the third major aspect of SQL Server. We’ve talked about storage and CPU, and now we’ll talk about memory, or RAM. As in the previous articles, I am going to show you some of the metrics that help identify instances when your SQL Server needs more memory.
As discussed previously, compilations can be a sign of memory pressure. Query plans need to be compiled when they don’t exist in the query cache. If you remember, compilations are the action that SQL Server takes when it receives a query for the first time, before it creates the query plan. When that plan is created, it is stored in the query cache, which lives in memory. If there is a shortage of memory, SQL Server will keep less query plans in memory than it should. That means that when there are a large number of compilations, it can mean that there is a shortage of memory, and that plans are being compiled because they aren’t living in memory for very long. However, this metric can also show high when users are running queries that don’t exist in the query cache because those queries have never been run before. This metric needs to be viewed as a red flag, but you should also keep in mind that this can alert you to problems that aren’t really there.
Again, as discussed in the previous articles on storage and CPU, SQL Server monitoring is your best friend. It is the same with CPU and storage: wait stats are one of the metrics that are useful for helping to identify memory pressure, it’s just different wait stats that can help identify memory pressure. One wait stat that can show memory pressure is PAGEIOLATCH, though it requires more investigation. PAGEIOLATCH is the tag for the time that SQL Server waits for a page of data to be read into memory. The most common thought is that storage issues or slow storage cause this. It can also be caused by a couple of other problems, such as memory shortage, or poorly written queries that read more data from disk than they should. That said, for now we are focusing on memory. PAGEIOLATCH can highlight a memory problem because SQL Server can be waiting for memory to free up to load a page of data from disk to memory.
Page Life Expectancy (PLE) is another metric that shows memory pressure. PLE is the measure of time that a page of data will stay in the buffer pool without references. The importance of this metric goes back to how memory is managed within SQL Server. When a query is submitted to SQL Server, the pages of data that the query is referencing are loaded into memory for easy access. The data is filtered as per the WHERE statement and data is returned. The data will continue to sit in memory, where it can be referenced again, until it needs to be flushed out to make room for other data that needs to be loaded into memory for another query. If pages are getting flushed from memory pretty regularly, that is usually a red flag signaling one of two issues. One issue could be that the query is written in a way that it pulls too much data from the disk. In this case, either the query needs to be reworked, or there is an opportunity for an index change. The other possible issue, and the one that we are concerned with here, is that there isn’t enough memory to store the data that we need to process with SQL Server. PLE is a standard metric that most SQL Server monitoring software has on their dashboard by default. It can also be pulled using Performance Monitor on the SQL Server. Related to PLE, Buffer Cache Hit Ratio is also a good metric that can point to memory pressure. The Buffer Cache Hit Ratio shows the percentage of pages found in the SQL Server buffer pool of all of the data pages that were requested by queries. This metric can be pulled using Performance Monitor, just like PLE. You can’t look at Buffer Cache Hit Ratio alone to determine if there are problems with your SQL Server. Buffer Cache Hit Ratio can have small drops due to users requesting data that’s already been flushed from memory. The real value of it comes when you pair it with PLE. When you look at Buffer Cache Hit ratio and it shows a drop that PLE doesn’t, there is no problem. When they both drop, that is normally a sign of memory pressure. The thought is that if pages are not living as long, and there are fewer pages in memory when requested, that means there isn’t enough memory to hold all of the necessary data for SQL Server.
When you are experiencing memory pressure on your SQL Server, another metric that will show evidence of it is memory paging. When a computer is out of available memory, but still has data that it has to put into memory, it will put this data on disk. This is called paging. So your next question might be, “Josh, why is memory paging a problem?” There is an easy answer to that: Memory is faster for accessing data than disk, so when memory is paged to disk, there is an enormous performance slowdown that you want to minimize.
So far in this series, we have talked about how to identify hardware pressures at the server level using metrics that most SQL Server monitoring solutions track. We went over ways to identify pressure on storage, CPU, and memory using metrics that you can pull from your server, or from most SQL Server monitoring software packages. In the next post in the series, we’re going to talk about how we can identify the worst performing queries that are in the cache on the server, and then talk through some basic query tuning techniques.