I was inspired by Aaron's post on a similar topic a while back to dig deeper into how I could automagically adjust the memory allocation for multiple instances of SQL Server in a cluster-failover scenario. So, suppose you have:
- A two-node cluster
- Four SQL Server instances, two intended to run on each physical server (call it "deluxe active-active")
- An environment where it's feasible to run all four instances on one node, in case of an emergency or in a maintenance window
- An expectation that some performance degredation is acceptable in the emergency scenario, and you want to take full advantage of your hardware investment when both nodes are up and running
- Stuff running that is tolerant of reconfiguration of max memory in SQL Server "on the fly"
A typical deployment would have each SQL Server instance's memory limited to about 1/4 the usable RAM on one node (that is, 1/4 the RAM remaining after the overhead required for OS and processes outside what is covered by SQL Server's max memory configuration) so that in a failover, their memory footprint, collectively, would not overwhelm one physical server. This is certainly safe, but you also have to buy a whole lot of extra memory that will sit idle practically all the time.
Instead, I wondered if I could set each instance to use half the usable RAM on each physical host, but in the event of a failover, automatically adjust the max memory setting on each instance to an approriate value for them to share a node and still be "happy." This would be a bit like a new person bellying up to a crowded bar: some of the other customers have to "move over" - have their RAM allocation reduced - to make space.
In pseudo-code, I want to:
-
Detect that a failover has occurred
-
Check the max memory setting on each instance on the cluster against a reasonable value for where they are running
-
If needed, change the max memory setting on each instance, to prevent the total memory allocated from exceeding the usable memory on each host
I'd been mulling this over during the past few months as we design and test a SQL Server consolidation / data center move at work. I finally settled on a design like this:
-
Create a PowerShell script that can check and set the max memory value for all instances on a cluster
-
Create a SQL Server Agent job to call the PowerShell script on Agent startup, running under a proxy account that has access to the other instances
Disclaimer I don't know that this is a great idea, I only have it in test at this point. Your mileage may vary. Never download and use scripts from the internet. If you experience dizziness, nausea or you turn purple or your hair falls out while using this script, see your doctor right away. Some people taking this script have reported dry mouth, memory loss or profuse bleeding from the nose. All bananas shall be placed on top of refrigerators for ripening.
So, the PowerShell script starts with a function to go get the memory configuration and host name for a running SQL Instance:
function Get-SQLInstanceConfig( [string]$SQLInstance, [ref]$hostName, [ref]$maxServerMemory ) {
# Function to establish a connection to a clustered SQL Server instance,
# read max server memory configuration value and current physical host name
$con = New-Object System.Data.SqlClient.SqlConnection( `
"Data Source=$SQLInstance;Initial Catalog=master;Integrated Security=SSPI;")
$q = "SELECT ( SELECT serverproperty('ComputerNamePhysicalNetbios') ), " `
+ "value_in_use FROM sys.configurations " `
+ "WHERE Name = 'max server memory (MB)';"
$cmd = New-Object System.Data.SqlClient.SqlCommand( $q, $con )
$con.open()
$reader = $cmd.ExecuteReader()
if( $reader.read() ) {
$hostName.Value = $reader.GetValue(0)
$maxServerMemory.Value = $reader.GetValue(1)
}
$reader.close()
$con.close()
}
Next, we need a function that can change the memory setting for an instance if we find that it's incorrect:
function Set-SQLInstanceMemory( [string]$SQLInstanceName, [int]$maxMemSetting ) {
# Function to set max server memory on a given SQL instance
write-host " Reconfiguring" $SQLInstanceName to $maxMemSetting
$con = New-Object System.Data.SqlClient.SqlConnection( `
"Data Source=$SQLInstanceName;Initial Catalog=master;Integrated Security=SSPI;")
$q = "EXEC sys.sp_configure N'show advanced options', N'1'; " `
+ "RECONFIGURE WITH OVERRIDE; " `
+ "EXEC sys.sp_configure N'max server memory (MB)', N'" + $maxMemSetting + "'; " `
+ "RECONFIGURE WITH OVERRIDE; " `
+ "EXEC sys.sp_configure N'show advanced options', N'0'; " `
+ "RECONFIGURE WITH OVERRIDE; "
$cmd = New-Object System.Data.SqlClient.SqlCommand( $q, $con )
$con.open()
$cmd.executeNonQuery()
$con.close()
}
Having established those two functions, it's then a matter of "walking" all the SQL instances to check their currently running values, then correcting any that are inappropriate for the location of instances on cluster node one or two. I need a couple of containers to store the results and make decisions about what to change:
# List the instances on the cluster:
$SQLInstances = @( "SQL01", "SQL02", "SQL03", "SQL04" )
# Make a hashtable to sort out which SQL instance is running on which host,
# with how much memory. One table entry for each host, with each entry containing
# an empty hash table we will load with SQL instance configuration info later:
$hostsTable = @{ "NODE1HOST" = @{}; "NODE2HOST" = @{} }
# Allowed memory for SQL Server = total server memory (e.g. 16 GB), less 3 GB to leave for OS
# and SQL tasks outside the SQL configured boundary:
$aggregateMemory = ( 16 - 3 ) * 1024
Working from those variables, we can then use two loops, one to collect the current state from each instance, and a second to validate/correct the values based on what instance is running on what node:
# Get the configs for each SQL Server instance
foreach( $SQLInstance in $SQLInstances ) {
$hostName = $null
$maxServerMemory = $null
Get-SQLInstanceConfig $SQLInstance ([REF]$hostName) ([REF]$maxServerMemory)
# If we really can't see one of the instances, it's best to bail at this point,
# rather than reconfiguring memory settings without complete information
if( $hostName -eq $null ) {
throw ( "Could not connect to one of the SQL instances on the cluster." )
}
# Put the current SQLInstance and its max memory value into the right slot
# in the hosts table, to classify the instance by host
$hostsTable[$hostName].Add( $SQLInstance, $maxServerMemory )
}
# For each physical cluster node, calculate a reasonable memory limit
# per SQL instance, then verify or correct the max memory value for
# each SQL Server instance running on that host
foreach( $hostEntry in $hostsTable.getEnumerator() ) {
write-host
write-host $hostEntry.Name
write-host " Num SQL instances on this host:" $hostEntry.Value.count
if( $hostEntry.Value.count -gt 1 ) {
[int]$maxMemSetting = $aggregateMemory / $hostEntry.Value.count
} else {
[int]$maxMemSetting = $aggregateMemory / 2
}
write-host " Max memory allowed per SQL instance:" $maxMemSetting
write-host
foreach( $SQLInstanceEntry in $hostEntry.Value.getEnumerator() ) {
write-host " " $SQLInstanceEntry.Name is set to $SQLInstanceEntry.Value
if ( $SQLInstanceEntry.Value -ne $maxMemSetting ) {
Set-SQLInstanceMemory $SQLInstanceEntry.Name $maxMemSetting
}
}
}
In implementing this on my test cluster, I went to each SQL Instance, and added a proxy account and then a SQL agent job to call this PowerShell script. The script had to be copied into an identical location on the physical disks in each cluster node, like the SQL Server binaries, so that it could be run by any SQL Agent, regardless of which node the agent is running on.
I then set the schedule for the jobs to execute once on SQL Agent start up.
Lastly, I reasoned that at least one common scenario could make this process fail (among other things): when a four-instance/two node cluster has a node "croak," the other node will attempt to bring two instances online as quickly as possible. If one failed SQL instance comes up before the other one, and this script runs immediately, the script will fail to connect to the second failed SQL instance, and if allowed to complete would set incorrect values. When the second node came up, then it would call the script again and correct the problem, but you've then reconfigured all the instances twice. It's a race condition, but in order to try to get this right most of the time, I put a "wait" step in each of these agent jobs, so that the agent comes up, the job fires, then there's a delay to give the other instances time to come online before the script runs.
As extra insurance, I also trap for a failed connection in the script and have it terminate. That way, if the agent runs it and one of the instances is still down, it should have no effect. When the down instance comes online, then it'll execute the script and succeed.
So far in testing, this seems to work OK. The main concern I have, with what I know about our workload and applications, isn't that we'll have a problem running on one node in an emergency, it's that there could be some side effect from changing max server memory, on the fly, while the instances are running.
I'd welcome comments or suggestions, if you've tried this or perhaps see a hole in my logic.