
Joe Chang

SSD and Internal SQL Server Tuning Strategies

Microsoft does not like to expose meaningful tuning parameters in SQL Server. In part, I understand this: in the past I have seen really bad settings used in other products that do expose more tuning parameters, chosen from a seriously mistaken understanding of the inner workings of the database engine. My own investigations have also shown that static settings do not have adequately broad general validity, hence more complex dynamic strategies are necessary to be effective.

With SSD at a sufficient level of maturity and cost viability, it is possible to build immensely powerful storage at an acceptable price for high-value database systems. (Whether storage system vendors will allow this is an entirely different matter.) My preference is to make judicious use of the characteristics of NAND, specifically the lower cost of MLC versus SLC, while accepting the lower write endurance of MLC. (SLC NAND also has much better write performance than MLC.) In certain situations, the database will hold data for an extended period of time, which is another way of saying the write load relative to capacity is low, hence MLC write endurance is perfectly adequate.

What could potentially exceed MLC write endurance is tempdb activity. Internally, SQL Server employs rules on how much memory to allocate for hash and sort operations before spilling to tempdb. In SQL Server 2008 or so, the estimated execution plan would show IO cost beyond a certain point determined by system memory and degree of parallelism, following a formula based on the excess size. However, I did not ascertain whether the actual execution followed the spill rules evident in the estimated plan.
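A rough way to watch the grant-versus-spill behavior is to look at the memory grant DMV while a large hash or sort is running. A minimal sketch (it only shows what the engine granted versus what the query actually used; spills themselves still have to be confirmed from the Hash Warning / Sort Warnings events or the actual execution plan):

-- Sketch: active memory grants while a large hash/sort query runs.
-- If used_memory_kb reaches granted_memory_kb and the query keeps going,
-- the operator has probably spilled to tempdb.
SELECT session_id,
       dop,                  -- degree of parallelism factors into the grant
       requested_memory_kb,
       granted_memory_kb,
       ideal_memory_kb,
       used_memory_kb,
       max_used_memory_kb
FROM sys.dm_exec_query_memory_grants
ORDER BY granted_memory_kb DESC;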

For a database on MLC SSD storage, the preferred tuning strategy might then be to favor discarding cached permanent data and prioritizing memory for temporary data, to avoid a write that will just be discarded afterwards. Reloading permanent data for subsequent queries is fast for both random and sequential IO. Of course, this strategy is specific to MLC SSD storage. Expensive SLC can support either strategy, and some future nonvolatile storage might not have restrictive endurance limits at all. On hard disk storage, the strategy should be to discard pages taken from a contiguous sequence, i.e., a scan, with retention priority given to the index upper levels and randomly loaded pages.
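To get a feel for how much of that tradeoff is in play on a given system, one can look at how the buffer pool is currently split between tempdb and the permanent databases. A quick sketch:

-- Sketch: current buffer pool usage per database, in MB (8KB pages).
-- The MLC strategy described above would, in effect, bias this split
-- toward tempdb to avoid writing short-lived spill data to flash.
SELECT DB_NAME(database_id) AS database_name,
       COUNT(*) * 8 / 1024  AS buffer_pool_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY buffer_pool_mb DESC;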

None of these should be normal exposed tuning parameters. However, the SQL Server engine should understand the nature of the different types of storage media. It may be possible, via some future mechanism, for the OS to know whether the storage media is HDD or SSD, but it may not know whether it is MLC or SLC, so we might need file location directives specifying whether HDD or MLC performance strategies should be employed.

There is one other option. Samsung described a trick for MLC, allowing certain internal operations to use a portion of the device as SLC! Recall that the NAND storage cells are just that, suitable for either SLC or MLC; something in the logic portion of the NAND chip must understand the SLC or MLC voltage levels. Apparently there is little overhead to add SLC logic to an MLC chip? This could be incredibly useful if it were also exposed to the OS. Perhaps we could create two partitions on an MLC/SLC-capable SSD, one partition as MLC and the second as SLC? When the device is near the MLC write endurance limit, the entire device could then be set to SLC-only mode.

Published Monday, September 09, 2013 11:42 PM by jchang


Comments

 

Theron Knapp said:

By entire device, do you mean the proposed SLC partition?  Wouldn't the SLC mode reduce the storage by half and require the MLC data to be read into cache first?

September 9, 2013 11:25 PM
 

jchang said:

Why are you so locked-in on capacity?

Do you not know from the HDD days that by the time you get performance right, capacity is a non-issue?

The storage vendor idiots talk capacity, and that is why if you let a storage vendor do your configuration, you end up with really crappy performance, especially after spending $1M++

Consider two options: 1) buy MLC SSD at $3K/TB or 2) SLC at $6K/TB. Suppose your workload is such that the MLC write endurance would put the storage system at risk after 1 year of operation.

Do you want to change out the storage after 1 yr with MLC, or do you opt for the 2X cost (or more) of SLC?

Now consider that the majority of writes come from tempdb. With the split MLC/SLC option, we can operate as MLC for low-write data, and SLC for the high-write tempdb.

Let's suppose the storage system consists of 64 SSDs (plus additional hot-spares), each rated at 256GB as MLC, 128GB as SLC, or some mixed combination

such that 1*(MLC) + 2*(SLC) = 256GB

Suppose we elect for 192GB of each device to be MLC, leaving 64GB of remaining MLC capacity to be configured as 32GB of SLC. (The device itself may reserve some SLC area for internal operation?)

With 8 RAID groups in 7+1 RAID 5, the MLC capacity is 8 x 7 x 192GB = 10.7TB

and the SLC capacity is 8 x 7 x 32GB = 1.79TB. (1TB = 1000GB for simplicity)
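For anyone who wants to plug in different device sizes or RAID layouts, the arithmetic reduces to a couple of lines. A throwaway sketch with the numbers from above:

-- Sketch of the capacity arithmetic above (decimal TB, 1TB = 1000GB).
DECLARE @raid_groups      int = 8,    -- 8 RAID groups of 7+1 RAID 5
        @data_ssd_per_grp int = 7,    -- 7 data drives per group
        @mlc_gb_per_ssd   int = 192,  -- 192GB of each 256GB device run as MLC
        @slc_gb_per_ssd   int = 32;   -- remaining 64GB of MLC run as 32GB SLC

SELECT @raid_groups * @data_ssd_per_grp * @mlc_gb_per_ssd / 1000.0 AS mlc_capacity_tb,  -- 10.75
       @raid_groups * @data_ssd_per_grp * @slc_gb_per_ssd / 1000.0 AS slc_capacity_tb;  -- 1.79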

So why would there be a matter of first reading MLC data into cache, when we have not put any data on the device yet?

Note that I explicitly do not want the alternative configuration nitwits feel compelled to suggest, that is, 48 x 256GB MLC SSDs and 16 x 128GB SLC SSDs. This is a database! When we access data, we want to hit all 64 SSDs simultaneously; when we access tempdb, we want all 64 SSDs contributing. At the end of the day, we might find that 75% of IO went to data files and 25% to temp, but at any given point in time, from a single DW query, it is either all data or all temp.

September 10, 2013 10:54 AM
 

jchang said:

My apologies for being snappy. Every time I hear capacity, I get chills; on so very many occasions the storage vendor told my client to buy on capacity and performance was just a disaster, so I automatically recoil on hearing the word.

Oh, never mind. I see you are talking about the very end.

My bad again.

After years of use, it is possible the MLC SSD is near end-of-life in terms of write endurance. For us DW types, being smart, we could use these EOL MLC SSDs as archival storage. So we wipe the drive clean, then do one last write for the archive data.

The other possibility I mentioned in the post is to wipe it clean, convert it to SLC mode at half capacity, and then continue to use it. So there are two options depending on what you need. Note that even when the NAND can no longer cleanly resolve the 4 voltage levels needed for MLC, it can still resolve the 2 voltage levels needed for SLC operation.

There is no talk of an online operation: back up the database or file/filegroup, restore to the new device, then reuse the EOL MLC SSD.

ps - consider ordinary MLC, say a 128GB SSD rated for 75TB write endurance, i.e., about 600 cycles. You could use the tool provided by the SSD vendor to read the wear info. (Oh yeah, we need server/RAID controller vendors to enable us to read SSD info through the RAID controller.) Do you replace the SSD at 75%, 90%, or 100% of rated life? I don't know. It might help if there were some tool that could check the actual voltage/charge accuracy of the NAND cells to make a replacement recommendation.
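For reference, the cycle count is just the rated endurance divided by the capacity. A quick sketch, where the 100GB/day write rate is purely an assumed number for illustration, not something measured:

-- Sketch: rated endurance as full-device write cycles, and rough years of
-- life at an assumed daily write volume (the 100GB/day is illustrative only).
DECLARE @capacity_gb       int = 128,
        @endurance_tb      int = 75,
        @writes_gb_per_day int = 100;

SELECT @endurance_tb * 1000.0 / @capacity_gb             AS rated_write_cycles,    -- ~586, call it 600
       @endurance_tb * 1000.0 / @writes_gb_per_day / 365 AS years_at_assumed_load; -- ~2.05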

September 10, 2013 3:00 PM
 

Theron Knapp said:

Thanks for the clarification.  Yes, I only misunderstood the end and was not trying to counter the benefit of performance over capacity.  :-)

September 10, 2013 7:47 PM
 

Manson said:

Great idea on read/write, IO tier/partition on storage

September 10, 2013 9:37 PM
 

byron f said:

very interesting post.  i particularly like the idea (in your second comment) of sending MLC to the data warehouse for one last wipe/write cycle...very clever.

do you have a link/citation/more information for this:

"Samsung described a trick for MLC, allowing certain internal operations to using a portion as SLC! Recall that the NAND storage cells are just that, suitable for either SLC or MLC. Something in the logic portion of the NAND chip must understand SLC or MLC voltage levels. Apparently there is little overhead to add SLC logic to a MLC chip?"

September 11, 2013 5:51 AM
 

jchang said:

Start here, Anandtech

New Elements to Samsung SSDs: The MEX Controller, Turbo Write and NVMe

http://www.anandtech.com/show/7152/new-elements-to-samsung-ssds-the-mex-controller-turbo-write-and-nvme

Anandtech has several more Samsung tech articles too

September 11, 2013 9:57 AM
 

adopilot said:

Does that mean that we should wait longer before we start building our infrastructure on SSDs for small offices (datacenters)?

By the way, great text and explanation.

September 13, 2013 11:06 AM
 

jchang said:

adopilot: Let me expand the general concept of your question to both software and hardware, because I do not want people to expand what I say.

On software, i.e., the OS: I was never a fan of waiting for SP1. Sure, there might be issues in brand-new major releases, but software is very complex and operator error is a bigger source of problems. So my view is the IT pro needs to start evaluating at RTM, but it's OK if deployment happens around SP1. I raise this because people slant toward not bothering to learn until after SP1.

On SQL Server specifically, the new features in each major release are hugely valuable, more so on the DW/BI side. So my view is to jump on SQL Server RTM, or even CTP, for DW. Of course MS is saying DW is now a critical enterprise resource, which is taken to mean that enterprises should apply stupid "high availability + reliability" principles - typically this means putting the DW on the big SAN, controlled by someone other than the DBA. Personally, I evaluated such principles as high CYA rather than HA.

But it is hard to argue with MS or corporate IT, so I say get a private DW with direct-attach storage and start learning right away.

Now to hardware. I think there is something to waiting a few months after brand-new hardware comes out; occasionally there are PITA firmware issues. I am more familiar with large scale, and I fight with corporate IT all the time. Their gut principle, from having been burned so very many times, is standardization: pick a mature product, test it, and stick with it for several years.

My preference is rolling acquisitions. Suppose the product life-cycle is 4 years.

Pick a new standard each year, replacing 25% of the old equipment every year.

This way you don't miss out on new technology by more than 1 year, yet there is still some standardization.

One other thing: in the old days, people talked about investing in computer/storage hardware. Back then HW was so expensive it had to be amortized over 10 years. You were not investing in HW, you were depreciating it! Technically, the value of hardware depreciates about 40% per year per Moore's law, in that each year you can get new HW that is roughly 40% better at approximately the same cost, more or less.
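As a rough illustration of that rate over the 4-year cycle above (only the 40%-per-year figure cited here is used):

-- Sketch: residual value of hardware losing roughly 40% of its value per year.
SELECT n.years_old,
       CAST(100 * POWER(0.6e0, n.years_old) AS decimal(5,1)) AS pct_of_original_value
FROM (VALUES (1), (2), (3), (4)) AS n(years_old);
-- returns 60.0, 36.0, 21.6, 13.0 percent of the original value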

So I treat HW as consumable/disposable. Use it, then lose it. HW is a great consumable - like food and fuel, but a crappy investment.

I think you should go ahead evaluate how SSD should be used. Don't be afraid to experiment, don't be afraid to fail - learning is more valuable - assuming you try to keep the scale of failures small.

Hope this helps.

September 13, 2013 3:37 PM


