Is there a deterministic relation between the page life expectancy and disk IO throughput?

  • Start with Page Life Expectancy, which should be well above the 300. This tells you how long pages are staying in the buffer pool, and a value of 300 equates to 5 minutes. If you have 120GB of buffer pool and it is churning over 5 minutes, that equates to 409.6 MB/sec sustained disk I/O for the system which is a lot of disk activity to have to sustain.


    The formula used seems to be -> Disk IO throughout = Buffer pool memory * 1024 / PLE

    Is this a correct way to determine the disk IO throughput?

  • It’s pretty far off the mark to suggest such a direct relationship between database data file bytes/s throughput - which can be measured directly over a period of time - and SQL Server page life expectancy - which is an estimate that is updated over time based on a multi-factor model that Microsoft has never publicly disclosed πŸ™‚

    I'll go into more detail here than you probably initially had in mind, but i think it's important for folks to think about what factors may be involved in a composite measure like Page Life Expectancy, or Buffer cache hit ratio, etc in order to have a more robust picture of how workload activity and resources interact - especially when resources are adaptively or predictively provisioned.

    One thing known about the model: on a system with multiple (v)NUMA nodes each having (v)RAM provisioned as expected(and without the intervention of any startup trace flags telling SQLOS to do differently), SQLOS will create one SQLOS memory node for each (v)NUMA node(in addition to the very small ever-present SQLOS memory node with ID 64, created just to accommodate the DAC).

    When there are multiple SQLOS memory nodes each with a different β€œhome” (v)NUMA node, each such SQLOS memory node will be managed based on resource monitor threads, a lazy writer, and its own memory targets. By coordinating memory management of each of the those SQLOS memory nodes, SQL Server governs system-wide SQLOS shared memory according to system-wide memory targets. As part of that package deal, each of those SQLOS memory nodes has its own estimated PLE updated over time, according to the same undisclosed model used to estimate PLE for the one user-accessible SQLOS memory node on systems that have only SQLOS memory nodes 0 and 64.

    But there is one thing we know about PLE on systems with multiple (v)NUMA nodes and multiple SQLOS nodes. The system-wide PLE is calculated as an harmonic mean of the individual PLEs of the constituent SQLOS memory nodes.

    Paul Randal discussed this detail several years ago in this blog post.

    And I worked through an example a few years later, after Paul Randal included the detail that the system-wide PLE was an harmonic mean rather than an arithmetic mean.

    All right. So the detail about the harmonic mean for systems with multiple (v)NUMA nodes and SQLOS memory nodes is really the only detail we know. But we can think about what factors are probably included in calculating PLE for a single SQLOS node and it will be very instructive for thinking about database behavior.

    As a thought experiment, let's imagine a a DBAsCHOICE database system. The database files contain 10 tb of data - 1000 heaps each 10 GB in size. No b-tree indexes or user indexes of any kind in the database. It's on a server with 150 gb RAM, and the database cache stays constantly sized at 100 GB. The database cache is managed with a strict LRUQ. The imaginary workload is read-only: SELECT * FROM x WHERE ID = -1 For each of the 1000 different heap names. Not a single one of those heaps has a row with ID = -1. Each of the thousand tables has a fullscan performed - and then doesn't have another fullscan started until all 999 other heaps have had a fullscan started (with nearly all of them completed by the time the next fullscan of that table occurs). Since I'm making it up, the workload is perfect. No write activity. No sorts or hashes. No temp tables. No dropping tables. All reads, all the time, and looking for something those queries will never find.

    Since the database cache is a constant size, and the only activity is read activity, between any two points in time the number of database pages read from storage into the database cache is equal to the number of database pages evicted from cache. To do a fullscan of a 10GB heap, 10GB worth of the oldest database pages had to be evicted.

    The database cache size is fixed and known. The imaginary workload guarantees the cache will just keep churning and churning. Each database page simply marches monotonically from the MRU end of cache tracking to the LRU end, then gets evicted from cache. In this scenario, there is a precise relationship between the amount of time a database page is held in database cache, and the read bytes/s throughput from the database pages.

    Its just a fairy tale. SQL Server doesn't work like that. πŸ™‚

    First of all, the database cache is not a fixed size. Database cache is adaptively managed as are stolen and free memory with SQLOS Target Server Memory in mind. The following graph shows how memory adapts to use and predicted use over time. A significant reduction in the size of database cache - assuming a near constant rate of database non-disfavoured page reads(and neither disfavoured reads nor write activity - haven’t even gotten there yet LoL) - would result in a much shorter β€œcache hold time” for database pages and also lower PLE.


    Hm. What’s that about disfavoured page reads? Well, let’s talk about additional factors beyond β€œregular” page reads which affect database cache behaviour and management, and we’ll get to disfavouring (and beyond).

    Describing a queue discipline details how a queue is managed - entry and overflow, prioritisation, lifetime, eviction. The SQL Server database cache queue discipline is more robust than a simple least recently used queue where items to be evicted are always those that were most recently touched longest ago. The overarching discipline is an LRU2 - an implementation of a general strategy of LRUk algorithms that primarily rely on not just the single most recent reference for a database page but the k most recent references.

    The bpool LRU2 algorithm is pretty slick i think; it acknowledges some database pages are more valuable in cache than most recent touched-timestamp alone indicates. In general when pages need to be evicted from cache to maintain or achieve target size (as opposed to eviction by heuristic like when a database is dropped), the oldest second-to-last touch is where we expect eviction to take place.

    Huh. That’s definitely starting to depart from easy prediction based on read rate. If there’s some number of gb of database pages from hot tables that are always saved from eviction because those pages are touched so often - that’s going to effect cache hold time for other pages and i expect that to be reflected in PLE (depending on PLE algorithm specifics).

    Ok, let’s get closer to disfavouring. What if Gil from accounting is always running a fullscan of a 2TB heap in an instance that has Max Server Memory set to 100 gb? (Gil’s fullscan query is magic, too, and it always touches each page of that heap twice - as if it were purposely designed to wreak havoc on LRU2 cache management.) Is Gil going to keep churning everything else out of database cache with his doggone 2TB of heap pages?

    When that happens - when a cache is overrun by a large amount of low-value stuff - it’s called cache pollution. Database page disfavouring tweaks the bpool queue disciple, in order to mitigate the risk of cache pollution. It’s a manipulation of the touch timestamps of relevant pages, to get them evicted from cache pretty quickly. Gil’s fullscan of that 2 TB heap isn’t likely to pollute the database cache of that instance with Max Server Memory = 100 gb because those pages will be disfavoured and evicted from database cache fairly quickly - allowing other stuff to stay in cache longer. Fullscan of an HoBT that exceeds a threshold percentage of database cache size is one example where disfavouring manipulates the system and results in a different cache hold time for database pages than would be predicted based on bytes/s read throughput and database cache size alone. And I’d expect PLE to change as a result.

    Above I mentioned that bpool eviction by heuristic can have a significant impact on database cache size, and thus would be expected to have a huge impact on PLE that can’t be predicted by read throughout alone. Dropping of a database is an example, and that includes dropping a snapshot database. Running a dbcc checkdb by default in a regular user database will create an internal snapshot db and perform its reads against the snapshot database. In the bpool those pages are associated with the snapshot db, not the base database. As the checkdb completes, the internal snapshot db is dropped. Every database page associated with the suddenly-dropped snapshot db gets evicted from cache. The resulting database cache will be smaller; cache hold time would be lower at a constant read rate. And I’d expect PLE to be lower, in a way that can’t be tied to disk bytes/s throughput.

    Now let’s talk a little about database page writes, dirty pages in the bpool, and potential affects on PLE. A dirty database page in the bpool might have been read first from disk to be altered - if it had previously been allocated and had contents.

    But imagine in the same SQL Server instance with Max Server Memory = 100 gb, there is no non-system activity at all. Nothing is forcing any evictions from the bpool. So PLE just keeps climbing until it’s pretty high. Add a session just stuffing a newly created uncompressed temp table in tempdb with rows that are just 5000 repeated - dash characters. The table is new. The allocations will be brand new - and SQL Server knows that. No need to read an unallocated and empty database page (or 64 kb extent of eight pages) just to create the pages in the bpool - then they can be written later, right? But pages have now newly entered the bpool. That’s gonna lower the PLE - without any reads (well, maybe a couple of metadata reads from system objects) - and read bytes/s won’t have a thing to say about it.

    Will we catch that bpool activity in tempdb data file writes, which could be connected to the PLE values and changes? Maybe. But only maybe. Compared to to user databases, tempdb now has a delayed write behaviour. If that session drops or truncates the temp table before delayed write kicks in? Voila!! The perfect crime! PLE changes that leave no trace in disk read bytes/s or write bytes/s.

    All right - three more things to mention here as confounders of the relationship between database data file bytes/s traffic and PLE. How about bytes/s data file traffic that bypasses the bpool? I’ll give a read example and a write example - I’m sure there are others, too. For reads - how about backups? Gonna get a lot of read traffic for the data files. Won’t have any direct affect on PLE because that read traffic goes into backup buffers which are totally separate from the database cache.

    Now a write example: imagine instant file initialisation is not enabled in an instance and a database file grows by 50 gb. Gonna be a lot of bytes written to that database file. But the kind of effect on PLE that reading 50 one gb heaps into cache? Or creating and populating 50 new one gb heaps and calling a manual checkpoint? Nope.

    Next let’s consider the algorithm or model for calculating the PLE. I don’t think it’s as straightforward as linear extrapolation of the recent past N minutes. I have a feeling the model is predictive, incorporating estimated accelerating or decelerating future activity.

    For example, 3 minutes ago active request count was c0, bpool-inflow rate was y0, and bpool-eviction rate was z0, and bpool size was x0. 1 minute ago c1 active requests, bpool inflow-rate y1, bpool eviction rate z1, and bpool size x1. c1, y1 and z1 are each greater than their respective previous counterparts c0, y0, and z0. bpool size x1 is smaller than previous counterpart x0. The number of active requests grew. Traffic in and out both grew. Database cache size shrank. i bet the PLE model would predict some continued acceleration - expecting still more concurrent active requests in the near future, higher in and out rates and an even smaller database cache size due to increased pressure from stolen memory. We’d only know for sure if Microsoft divulges more about the model for PLE. But I bet there’s something like that.

    Predictions can and often are wrong, though. At some point, that instance will hit a peak number of concurrent requests, as well as peak in rate, peak out rate, and minimum database cache size. And after the peak, activity will plateau or decline. But if SQL Server had predicted continued acceleration, the PLE will reflect the unreal prediction more than reality. The same thing can happen when a prediction of deceleration meets a surprise acceleration of activity.

    All right, last one. Thanks for sticking around :-). A bug could make PLE simply unexplainable based on bytes/s traffic for database data files. In the following link I explore a bug present in early SQL Server 2016 (since fixed) on systems with multiple (v)NUMA nodes. A significant portion of SQLOS memory could be double-counted and double-accounted. When this bug flexed, the same portion of memory would appear as database cache for node A and as stolen memory for node B. That misrepresentation caused all kinds of difficulty trying to understand activity and behaviour in the instance. And caused PLE to be a bit farther from representative of activity than normal.

    This contemplation of caveats to correlation between PLE and bytes/s traffic to database files is certainly not exhaustive, even if the dear reader has long been exhausted πŸ™‚ But I want to make sure people really think about what’s being measured and represented in various metrics that describe database resources and activity. I personally view PLE and other calculated metrics(such as cache hit ratios) as something to be explained, rather than something that does explain.

    I’d rather look at the factors that go into the PLE as a starting point. I almost always look at the SQLOS memory over time to understand an instances behaviour. A stacked graph of database cache, free, and stolen memory can be very helpful. The database page reads and writes per second, etc.

    Although the bpool broker clerk, like column store object pool broker clerk, has perfmon counters for evictions/s(pressure and periodic iirc) I’ve never seen those counters populated. Maybe they have very high observer overhead to track, in which case they were probably hidden behind a trace flag.

    Wow if those eviction rates were available we’d probably have lots of insight into PLE.

Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2