This is purely a question and not a proposal, as I know how Microsoft is so very thrilled every time I have a suggestion. Could the Parallel Data Warehouse concept work on a single NUMA system?
Technically all multi-socket systems today are NUMA because the processors now have integrated memory controllers. The path to memory is different between local access and memory attached to a
differentanother processor socket. But I am more interested in 8-socket, and potential 16+ socket systems. The single system definition here is that of a single operating system image. Some large systems allow partitioning to support multiple OS images to be running, but that configuration is not discussed here.
Both Windows Server and SQL Server are NUMA aware in that there are distinct memory nodes (that are closer to the local processor cores than remote processor cores). But there is no workable mechanism to fully benefit from knowledge of the underlying hardware organization. Say for example we have a large table resident in the buffer cache evenly distributed across the nodes. We would like for a connection to SQL Server issue a query to this table to be farmed out to one thread on each node with that thread accessing only rows stored locally.
But this is not what happens. More likely, starting from an empty buffer cache, a query accesses the table. The table is initially loaded into memory on the same node as the connection, then switching to other nodes? I have no idea what actually happens so it would be nice if someone would like to investigate this. Now another connection that happens to be assigned to a thread on the different node would have to cross nodes to remote memory for this same query? So the expectation is that there would be apparent inconsistent variations in query time depending on both the node distribution of the participating buffer cache pages and the node affinity of the execution thread.
In the PDW environment, there are multiple compute nodes, each a 2-socket system running its own Windows operating system, and (modified?) SQL Server instance\process with fact tables partitioned by node and dimension tables replicated to all nodes. A front-end sends a query to each node which processes data on its node. The purpose of this architecture is to allow scaling the computational resources (CPU, memory and IO) beyond the limits of a single system. But it also happens to achieve a degree of locality optimization.
The limitation of the PDW concept is that any single query is a more involved process. Furthermore, the query execution plan elements cannot cross nodes without a remote server mechanism. Even with the InfiniBand protocol, communications between nodes are expensive and long latency relative to the processor cycle time scale. This is why PDW is Parallel Data Warehouse and not Parallel Database.
So now what? My question is can we apply the PDW concept to a NUMA system, one big 8-socket NUMA system running a single Windows operating system. Instead of PDW with a SQL Server instance on each system (PDW compute node), there is now a SQL Server instance localized to each node of the NUMA system. Communications between processes on a system is lower latency than across systems over InfiniBand.
Can we go further than this? How about bringing the Parallel concept inside to a single process? Now each “node” is a collection of threads affinitized to a node on the NUMA system. Communication between threads is even faster than between processes. And now, potentially execution plan elements can cross “nodes”?
We might ask why bother. The nature of server system is that the 8-socket system is more than twice as expensive as four 2-socket system. In addition, the 8-socket system may have processors one generation behind the 2-socket system. The fact is that by the time we assemble a collection of systems or one big system, the full costs of hardware are irrelevant to a really important mission. The real question is which meets the objectives better.
Below is a completely separate discussion but is on the topic of NUMA systems.
A Better NUMA System Memory Strategy?
As far as I know, Windows or SQL OS creates 2 memory nodes from processor node 0, then 1 additional memory node on each of the other nodes. One of the memory nodes on node 0 is for generally allocations. All other memory nodes, including the second one on node 0, are for NUMA aware allocations? The immediate problem is that the memory nodes are asymmetric in size.
There is one possibility to consider for system not at maximum memory. This is to put maximum memory in node 0 (good luck figuring out which one this is) and the normal memory in the other nodes. But of course, the nature of component pricing today is that is makes more sense to fill the DIMM sockets with the second largest capacity memory module, today 16GB. Today, Crucial shows 16GB ECC DIMMs at $200 each and 32GB DIMMs at $800 each. By sometime in the next year or so, expect the 32GB DIMM to trends towards $400 and a new 64GB DIMM to introduce at several thousand dollars initially.
HP-UX on the other hand has a very clever memory strategy, which is to configure both interleaved and node memory, or LORA in their terminology. Say for example that there are 8 nodes and 1TB memory total, 128GB per node. Interleaved memory totaling 128GB is allocated from all nodes at 16GB per node. The remaining 114GB on each node are formed into node local memory. The addressing scheme might be something like this. For interleaved portion in the first 128GB, the memory is interleaved in 256 bytes stripes across the nodes. The first 256 bytes is from node 0, the next 256 bytes is from node 1, and so on before wrapping back to node 0. Above this are separate 114GB contiguous chunks on each node.
This arrangement recognizes the fact that not all consumers want local node memory. Now the NUMA local memory nodes are uniform in size. I happen to think this is an excellent idea. I do not know if there is an institutional bias at MS against ideas not invented in happy valley.
Edit based on Lonny Niederstadt's comment
on further thought, I think it might be better if the OS did not interleave a single memory page (4KB on X86) across NUMA nodes, perhaps even for the 2MB large page mode, but not sure on the 1GB page.
The HP-UX scheme allows the sys admin to adjust the interleaved / LORA ratio, with 1/8:7/8 being recommended. This would allow NUMA to be turned off.
The SQL OS could also override the Windows scheme, simply by choosing how it allocates pages.
For NUMA controls, the default should be off, so people do not encounter anomalies unless they make the decision to go for it.
I am thinking on a transaction server, the scheme should be to prefer allocating buffer cache pages from the local node memory (of the thread making the call). The app server would connect over different port values. SQL Server would have affinity set by port. So each group would be likely to have their data in the local node.
For DW, the strategy should be to round-robin extents (or large pages?) by NUMA node. Even better would be if the clustered index could be on a hash key. Then two tables on the hash key, ORDERS and LINEITEM for example, would have the same node as preferred buffer location.
Taking a step further, if storage were also distribute by node, and the FG has on or more file on each node, then the preferred file location matches. And the index rebuild would reset to match this scheme.
There should be some parallelism options beyond MAXDOP. One would be Local node. So parallel execution should use local node cores up to the socket limit. Another would be Global, so allocate 1 thread per node, then round robin. An independent option would be Hyper-Thread or HT, in which both or all logical processors of 1 core would be allocated before moving to the next core. Default of course would be off, but this could be a powerful tool for investigating HT issues.
Perhaps to local node parallelism option should be default for the OLTP scheme, and the Global be default for the DW scheme
I am thinking something needs to be done about processor affinity bit mask. There should be 3 values, which could be done with 2 sets of bitmasks. One is preferred processor affinity. The other is secondary affinity, and finally excluded.
This might help manage multi-instance, each instance could get separate preferred, but the secondary would allow activity on other nodes as well.
Since you mention CPU per node, I will plug my ExecStats tool,
which is PerfMon mode display individual CPU, annotating the node boundaries. Note the 3rd party tools vendors - learn what a real server is, not the developer's desktop!
The high CPU on node zero might be related to this
FIX: CPU spike when there is no load on a server after you install SQL Server 2012 on the server, sp1 cu3.
Now having thought about it some, I think this should be a request,
we could call it SQL Server Data Center Edition