SQL Server's buffer pool is a wonderful thing. It's smart enough to handle all sorts of situations in a fairly intelligent way. Here are a couple of examples where, at first glance, the buffer pool's behavior seems strange but is actually fairly clever.
Consider a 400 GB clustered index on a server with 64 GB of memory available for the buffer pool. If a query scans the entire 400 GB index, how much of the buffer pool should the index use after the query is complete?
Probably not very much. If SQL Server aggressively cached all the data it accessed, maybe the last 50 GB read during the scan would be in memory. However, the next time you run the same query, that 50 GB is useless, because SQL Server will have to read the first 350 GB from disk before it gets to the last 50 GB. This is a common enough scenario that SQL Server handles it intelligently: it is more likely that all of the intermediate-level index pages will be cached, along with a few of the leaf pages.
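You can see which levels of an index are actually cached by querying the buffer pool directly. This is only a sketch against sys.dm_os_buffer_descriptors (run it in the database that owns the index; it assumes the index data lives in ordinary IN_ROW_DATA allocation units):

```sql
-- Cached pages per index, broken down by page level (0 = leaf).
-- Pages are 8 KB, hence the * 8 / 1024 conversion to MB.
SELECT o.name              AS object_name,
       i.name              AS index_name,
       bd.page_level,
       COUNT(*)            AS cached_pages,
       COUNT(*) * 8 / 1024 AS cached_mb
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au
  ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions AS p
  ON p.hobt_id = au.container_id
 AND au.type = 1                 -- IN_ROW_DATA only
JOIN sys.indexes AS i
  ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.objects AS o
  ON o.object_id = p.object_id
WHERE bd.database_id = DB_ID()   -- current database only
GROUP BY o.name, i.name, bd.page_level
ORDER BY o.name, i.name, bd.page_level;
```

After a large scan like the one above, you'd expect the higher page_level rows to show near-complete caching while level 0 shows only a sliver.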
I've got a server with 512 GB of memory, but it never seems to use all of it, or it uses it... strangely?
SQL Server is NUMA aware. On NUMA (Non-Uniform Memory Access) systems, some memory is closer (and faster) to a given CPU than other memory. For example, a four-socket system might have four memory controllers, one per socket. It is faster for a CPU to access memory on its own controller than to access memory through another CPU's controller (I've seen remote access run 1.4-1.7 times slower).
SQL Server knows this and will prefer to allocate local-node memory before remote-node memory. Since requests can run on any CPU (though they don't typically jump between them), and hence allocate memory on any node, the first request to access an index will cache its pages in local memory. In most cases you'll end up with a fairly equal distribution of cached pages; however, each node is managed individually by SQL Server, so your 512 GB is effectively four 128 GB buffer pools. You could have a situation where node 1 is using all 128 GB and discarding pages to make room, while node 2 is using only 80 GB and has lots of free space.
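You can watch this per-node split yourself. A rough sketch using the numa_node column of sys.dm_os_buffer_descriptors (an uneven distribution here is the "four separate buffer pools" effect described above):

```sql
-- Cached pages and memory per NUMA node.
SELECT bd.numa_node,
       COUNT(*)            AS cached_pages,
       COUNT(*) * 8 / 1024 AS cached_mb
FROM sys.dm_os_buffer_descriptors AS bd
GROUP BY bd.numa_node
ORDER BY bd.numa_node;
```

On a non-NUMA (or soft-NUMA-disabled) box this returns a single row for node 0.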
Why would my heavily-read indexes be cached less than indexes that only receive writes?
One possible reason is checkpoints. Depending on your recovery interval, it may be several minutes after a data change occurs before a checkpoint writes the modified page to the data file (the change is already in the log file on disk). During this time, the page can't be discarded to make room in the buffer pool.
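To see how much of the buffer pool is currently pinned by dirty pages, something like the following sketch works; is_modified marks pages that haven't been checkpointed yet:

```sql
-- Dirty (not-yet-checkpointed) pages per database; these pages
-- must stay in the buffer pool until a checkpoint writes them out.
SELECT DB_NAME(bd.database_id) AS database_name,
       COUNT(*)                AS dirty_pages,
       COUNT(*) * 8 / 1024     AS dirty_mb
FROM sys.dm_os_buffer_descriptors AS bd
WHERE bd.is_modified = 1
GROUP BY bd.database_id
ORDER BY dirty_pages DESC;

-- Issuing CHECKPOINT in a database flushes its dirty pages,
-- after which the count above should drop sharply:
CHECKPOINT;
```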
Besides setting the size, are there other ways to change how the buffer pool behaves?
There isn't a whole lot of control over the buffer pool and the decisions it makes. That's fine, because you typically don't need to tweak it. Watching the perfmon counters under "SQLServer:Buffer Manager" and "SQLServer:Buffer Node" can provide insight into what the buffer pool is doing, whether it is sufficient in size, and whether your I/O subsystem can keep up with checkpoints.
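Those same counters are also exposed inside SQL Server through sys.dm_os_performance_counters, so you can sample them with a query instead of perfmon. A rough sketch (the object_name prefix varies by instance, e.g. "MSSQL$INSTANCENAME:", so match loosely):

```sql
-- Buffer Manager / Buffer Node counters, as perfmon would show them.
SELECT pc.object_name,
       pc.counter_name,
       pc.instance_name,   -- node number for Buffer Node counters
       pc.cntr_value
FROM sys.dm_os_performance_counters AS pc
WHERE pc.object_name LIKE '%Buffer Manager%'
   OR pc.object_name LIKE '%Buffer Node%'
ORDER BY pc.object_name, pc.counter_name, pc.instance_name;
```

Note that ratio-type counters (like Buffer cache hit ratio) come as two rows, a numerator and a base, that you have to divide yourself.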
Regarding a couple of items you mentioned:
- Fragmentation on disk doesn't directly affect memory usage. However, if an index has 50% free space on its pages due to deletes, it will take up twice as much room on disk and in memory, but that's related to free space, not fragmentation.
- PAGEIOLATCH waits measure the time it takes for the I/O subsystem to load pages into memory. They are an indicator of disk performance, not memory.
- You are correct: the high page lock waits are due to the writes. The writes make pages dirty, and dirty pages have to stay in the buffer pool until they are checkpointed to disk.
Best Answer
Playing a bit with pg_buffercache, I could get answers to some of your questions.
As pg_buffercache shows, the answer to (1) is a definitive YES. It is worth noting that temporary table data doesn't get cached here.

EDIT:
I've found Jeremiah Peschka's terrific article about table and index storage. With the information from there, I could answer (2) as well. I set up a small test so you can check these yourself.
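The original test didn't survive here, but a sketch of the kind of query involved (assuming the pg_buffercache extension is installed via CREATE EXTENSION pg_buffercache) looks like this; it breaks cached pages down per relation, showing that tables and indexes are buffered independently, page by page:

```sql
-- Cached pages per relation in the current database.
-- relkind: 'r' = table, 'i' = index.
SELECT c.relname,
       c.relkind,
       count(*) AS cached_pages
FROM pg_buffercache AS b
JOIN pg_class AS c
  ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
GROUP BY c.relname, c.relkind
ORDER BY cached_pages DESC
LIMIT 20;
```

A table and its index appear as separate rows with independent page counts, which is what makes the page-by-page caching visible.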
All in all, this shows that indexes and tables can be cached page by page, so the answer to (2) is NO.
And a final one to illustrate temporary tables being non-cached here:
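The original snippet is missing, but a reconstruction of such a test might look like the following (the table name temp_demo is hypothetical):

```sql
-- Create and fill a temporary table.
CREATE TEMPORARY TABLE temp_demo AS
SELECT g AS id, md5(g::text) AS payload
FROM generate_series(1, 10000) AS g;

-- Look for its pages in shared_buffers.
SELECT count(*) AS cached_pages
FROM pg_buffercache AS b
JOIN pg_class AS c
  ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE c.relname = 'temp_demo';
-- Temporary tables are held in backend-local buffers, not in
-- shared_buffers, so pg_buffercache reports no pages for them.
```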