Welcome to SQLblog.com - The SQL Server blog spot on the web

Linchi Shea

Checking out SQL Server via empirical data points

  • Performance impact: diminishing marginal return on the degree of parallelism

    In commenting on my previous post, Greg Linwood and GrumpyOldDBA raised questions about various implications of parallelism. In this post, I'll look at the impact of different degrees of parallelism on query performance. I'll limit my examination to the same query that uses a Cartesian product temp table, as discussed in the previous two posts ([1] and [2]). The details of the test setup can be found in [1]. The SQL Server instance is SQL Server 2008 R2 (10.50.1600) Enterprise X64 Edition. I'll look at the performance impact of concurrent executions in a future post.

    The law of diminishing marginal return dictates that as you increase the number of processors in a parallel plan, sooner or later you’ll reach a point after which you’ll see a decrease in the performance gain from adding another processor to the parallel plan. Note that it is the decrease in the performance gain, not the decrease in the performance itself.

    The question is, for a specific query, does SQL Server know where that inflection point is and therefore make the right decision in picking the optimal degree of parallelism? SQL Server Books Online thinks so. It says:

    “When SQL Server runs on a computer with more than one microprocessor or CPU, it detects the best degree of parallelism, that is, the number of processors employed to run a single statement, for each parallel plan execution.”

    But how does that work out in practice? This is a valid question to ask because determining the optimal degree of parallelism is an incredibly difficult problem to solve. We obviously can't answer this question in its full generality, nor should we even attempt to.

    So let's take a look at the following query (reproduced from the previous post; see that post for more details):

    select id_server_config,
              id_thread_name,
              id_tran_type,
              id_tran_status,
              id_users,
              id_cpus,
              COUNT(*) as cnt
      from tran_stats t1 join #tmp t2 
           on t1.server_config = t2.server_config and
              t1.thread_name = t2.thread_name and
              t1.tran_type = t2.tran_type and
              t1.tran_status = t2.tran_status and
              t1.users = t2.users and
              t1.cpus = t2.cpus 
    group by id_server_config, 
             id_thread_name,
             id_tran_type,
             id_tran_status,
             id_users,
             id_cpus

    Let's run this query on a server with 80 processors with the max degree of parallelism (maxdop) set to different values, and see how it performs. (Note that the tran_stats table has 200,000,000 rows, and the query runs with all the data already cached in memory.) The following chart summarizes the result:

    [Chart: degrees_of_parallelism_DL580_80cpus]

    A couple of things stand out on this chart.

    First of all, the diminishing marginal return on parallelism is unmistakable. After maxdop = 8, the additional performance gain from increasing the degree of parallelism is rather small. And after maxdop = 24, the gain is virtually non-existent. Arguably, the optimal degree of parallelism for this query on this server is somewhere between 8 and 24.

    The second salient point is that when maxdop is set to 0, SQL Server sets the actual degree of parallelism to 64 when running the query. As per Books Online, 64 is the maximum value SQL Server would set maxdop to.

    Why is setting maxdop to 64 not optimal for this query? Let’s look at the following chart that shows the % total CPU usage on the server (with 80 processors) at various maxdop settings:

    [Chart: CPU_usage_Degrees_of_parallelism]

    For this query, as we increase the maxdop setting, the total CPU usage steadily goes up. Even though we gain practically no additional performance after maxdop = 24, the total CPU usage still climbs. We just end up wasting a tremendous amount of CPU cycles for nothing.

    Obviously we need to gain more experience to be sure, but it doesn’t look like a good idea to leave maxdop at 0 on a box with a lot of processors.
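
    For reference, here's a sketch of the two standard knobs for controlling maxdop (either would do for tests like these): sp_configure changes the instance-wide default, while the OPTION (MAXDOP n) hint caps parallelism for a single statement.

    -- Instance-wide default (a sketch; not necessarily how the tests above were run).
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 8;
    RECONFIGURE;

    -- Or cap parallelism for just the test query with a hint:
    select id_server_config, id_thread_name, id_tran_type,
           id_tran_status, id_users, id_cpus, COUNT(*) as cnt
      from tran_stats t1 join #tmp t2
           on t1.server_config = t2.server_config and
              t1.thread_name = t2.thread_name and
              t1.tran_type = t2.tran_type and
              t1.tran_status = t2.tran_status and
              t1.users = t2.users and
              t1.cpus = t2.cpus
     group by id_server_config, id_thread_name, id_tran_type,
              id_tran_status, id_users, id_cpus
    option (maxdop 8);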

  • Performance impact: parallelism comes to the rescue

    For the specific scenario discussed in my previous post,  we observed that using an intermediate Cartesian product was about three times faster than doing joins with multiple lookup tables. To keep the discussions manageable, we limited the max degree of parallelism to 1. In other words, we did not allow query parallelism. For any query that touches a large amount of data, that was silly, especially when it’s on a server with 80 logical processors.

    To be more realistic, I then looked at the following four scenarios:

    1. No parallelism, using the Cartesian product materialized into a temp table
    2. No parallelism, using joins with the six lookup tables
    3. Maxdop set to 20, using the Cartesian product materialized into a temp table
    4. Maxdop set to 20, using joins with the six lookup tables

    The following chart summarizes the test results:

    [Chart: cartesian_parallelism]

    It’s clear that the queries can benefit greatly from parallelism. In fact, in the two cases where a Cartesian product was used, setting maxdop to 20 sped the query up ~13 times. The query duration dropped from 248 seconds to 19 seconds!

    It’s intuitive that enabling parallelism would give us a boost.

    It’s also interesting to note that the query with multi-table joins did not receive the same kind of performance boost from parallelism as did the query with the single table join using the Cartesian product temp table. The query duration dropped from 740 seconds to 80 seconds when parallelism was enabled on the query joining directly with the six lookup tables, about ~9 times speed up (compared with ~13 times speed up in case of the Cartesian product).

    Even when maxdop was set to 20, going from multi-table joins to a single table join using the Cartesian product temp table still gave us a 4 times performance boost. The duration went from 80 seconds to 19 seconds.

    At least for this specific scenario, parallelism plus the Cartesian product temp table gave us the best performance. In the next post, I’ll look at the impact of parallelism more closely, in particular, how different degrees of parallelism affect performance.

  • Performance impact: Cartesian product comes to the rescue

    Although Cartesian product as a concept is essential to the relational database theory, it is often a dirty phrase – something that is associated with bad performance and therefore should be avoided. But there are cases where a Cartesian product is highly desirable. This post shows you an example query—from the real world although presented here with artificial data—that can be significantly optimized with a manually created Cartesian product.

    Here is the query in question:

    select id_server_config,
              id_thread_name,
              id_tran_type,
              id_tran_status,
              id_users,
              id_cpus,
              COUNT(*) as cnt
      from tran_stats t1
           join Dim_server_config t2 on t1.server_config = t2.server_config
           join Dim_thread_name t3 on t1.thread_name = t3.thread_name
           join Dim_tran_type t4 on t1.tran_type = t4.tran_type
           join Dim_tran_status t5 on t1.tran_status = t5.tran_status
           join Dim_users t6 on t1.users = t6.users
           join Dim_cpus t7 on t1.cpus = t7.cpus
     group by id_server_config,
              id_thread_name,
              id_tran_type,
              id_tran_status,
              id_users,
              id_cpus

    This query joins a large table called tran_stats with six small lookup tables (or dimension tables). The dimension tables hold the dimension IDs and their corresponding values. The large table, which stores transaction statistics, is sort of a fact table, but it includes the actual dimension values instead of the dimension IDs. There is no index on any of these six dimension columns in the large tran_stats table. The following is a summary of the cardinalities of these tables:

    Table Name          Row Count      Comment
    tran_stats          200,000,000    Not very wide; the total size of the table is about 36GB
    Dim_server_config   9              Corresponding to the 9 distinct values in tran_stats.server_config
    Dim_thread_name     100            Corresponding to the 100 distinct values in tran_stats.thread_name
    Dim_tran_type       3              Corresponding to the 3 distinct values in tran_stats.tran_type
    Dim_tran_status     3              Corresponding to the 3 distinct values in tran_stats.tran_status
    Dim_users           5              Corresponding to the 5 distinct values in tran_stats.users
    Dim_cpus            8              Corresponding to the 8 distinct values in tran_stats.cpus
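
    For orientation, here's a minimal sketch of what one dimension table and the relevant columns of tran_stats might look like (the data types here are placeholders; the actual DDL is in the scripts attached below):

    -- Sketch only: each Dim_* table pairs a surrogate id with the raw value,
    -- while tran_stats stores the raw values themselves (and is not indexed on them).
    create table Dim_server_config (
        id_server_config int          not null,
        server_config    varchar(100) not null   -- data type is a placeholder
    );

    create table tran_stats (
        server_config varchar(100),
        thread_name   varchar(100),
        tran_type     varchar(50),
        tran_status   varchar(50),
        users         int,
        cpus          int
        -- ... plus the actual transaction statistics columns
    );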

    In this post, I’ll remove query parallelism from consideration. I’ll cover how parallelism contributes to the discussion in a follow-up post. So for now let’s assume that max degree of parallelism has been set to 1.

    When I run the above query on my test server, it comes back in about 740 seconds (or just over 12 minutes) with all the data already cached. The query plan is a series of hash matches shaped like a straight line going from the lower right corner to the upper left corner:

    [Query plan screenshot: hash_match_cascades]

    Question: Can we do better than this?

    Answer: Yes! One approach is to create and materialize the Cartesian product of all the dimension tables first, and then join the large table with this Cartesian product.

    On the test server, the following Cartesian product query takes about 1 second and populates #tmp with 324,000 rows (that is 9*100*3*3*5*8).

    select * into #tmp
      from Dim_server_config,
           Dim_thread_name,
           Dim_tran_type,
           Dim_tran_status,
           Dim_users,
           Dim_cpus

    Then, the following join takes about 248 seconds to complete.

    select id_server_config,
              id_thread_name,
              id_tran_type,
              id_tran_status,
              id_users,
              id_cpus,
              COUNT(*) as cnt
      from tran_stats t1 join #tmp t2
           on t1.server_config = t2.server_config and
              t1.thread_name = t2.thread_name and
              t1.tran_type = t2.tran_type and
              t1.tran_status = t2.tran_status and
              t1.users = t2.users and
              t1.cpus = t2.cpus
     group by id_server_config,
              id_thread_name,
              id_tran_type,
              id_tran_status,
              id_users,
              id_cpus

    The query plan is a single hash match:

    [Query plan screenshot: hash_match_single]

    To help drive home the performance difference between these two queries, let me re-present their elapsed times in the following bar chart:

    [Chart: hash_match_multi_vs_single]

    In this particular scenario, using a materialized Cartesian product results in a query that is three times faster than the query that directly joins with each of the lookup tables.

    I have uploaded the DDL scripts and the script for loading the data into the large table and into each of the dimension tables. You can download and run the attached scripts and check out the performance difference for yourself.

  • Performance impact: synchronous audit can trash your database performance – III

    This is the third installment of my posts on the performance impact of enabling SQL Server 2008 R2 Audit. My previous two posts (part I and part II) show that whether it’s on a two-processor machine or an eight-processor machine, enabling an audit with the queue_delay property set to 0 can destroy the database performance. In particular, those two posts report the test results in the worst case scenario when the system was already under severe processor stress.

    In this post, I highlight the results of running the tests when the system was under moderate load on the same eight-processor machine. The workload was the same modified version of the TPC-C benchmark with the two read-only transactions, OrderStatus and StockLevel, and with the entire database cached in memory. Note that in the test scenarios of the two previous posts, the simulated users did not wait before submitting the next transaction, and that forced the system under test into a severe processor bottleneck. To achieve a moderate load level, I used the same test database and simulated 100 users, but with each user waiting for 10ms after the completion of the current transaction before submitting the next one.

    This 10ms wait between two consecutive transactions resulted in the following performance behavior on the server:

    • The total CPU usage fluctuated between 2% and 50% with an average of roughly 20%.
    • The transaction throughput in terms of the batch requests per second was steady at about 3200.
    • There was no processor queue and no signal wait on the schedulers.

    And when I pushed the system harder either with more simulated users or with a reduced wait time between consecutive transactions, I obtained higher transaction throughput until it saturated at about 14,000 transactions per second. So the load level on the system was very moderate with a lot of capacity to spare.

    Again the same three test cases were examined:

    1. Without any audit setup.
    2. With an audit tracking all the SELECT statements on the three tables; the queue delay was set to 1000ms (the default value).
    3. With an audit tracking all the SELECT statements on the three tables; the queue delay was set to 0.

    The following table summarizes the test results for each of the three test cases:

    Test Case   Throughput (transactions/second)   Total CPU usage
    1           3200                               2% ~ 50% (avg =~ 20%)
    2           3200                               2% ~ 50% (avg =~ 20%)
    3           410                                ~3%

    The bad news is that we still observed a severe performance impact when we enabled the audit with the queue_delay property set to 0. The throughput dropped from 3200 transactions per second to 410 transactions per second.

    However, the good news is that when the queue_delay property was left at the default value of 1000ms, we did not see any visible performance degradation at all when we enabled the audit. And since we expect the default setting to be used in most, if not all, real-world environments, this observation is the most relevant and reassuring. In fact, this should be the key takeaway.

  • Performance impact: synchronous audit can trash your database performance – II

    In my previous post on this subject, I showed the test results from a two-processor machine. Naturally, you may wonder: do we expect to see the same dramatic throughput drop when we enable a synchronous audit (i.e. with the queue_delay property set to 0) on a more powerful machine?

    The answer is yes.

    The following table shows the test results for the same three test cases except that (1) the test database was on an eight-processor machine, and (2) the workload was generated with 50 concurrent users.

    Test Case   Throughput (transactions/second)   Total CPU usage
    1           14,500                             ~98%
    2           12,300                             ~98%
    3           500                                ~4%


    To refresh your memory, the three test cases are:

    1. Without any audit setup.
    2. With an audit tracking all the SELECT statements on the three tables; the queue delay was set to 1000ms (the default value).
    3. With an audit tracking all the SELECT statements on the three tables; the queue delay was set to 0.

    The audit was on the SELECT statements applied to the three tables accessed by the test workload. With the queue_delay property set to 0 in the audit object, we observed a dramatic impact on the transaction throughput when we enabled the audit, a whopping 96% drop from 14,500 transactions/second to 500 transactions/second. With the two-processor machine, the throughput drop was about 95% from 3500 transactions/second to 160 transactions/second.

    We also observed a much smaller decrease in throughput when the audit was configured to use a 1000ms queue delay. With the two-processor machine, the performance drop was about 11%. With the eight-processor machine, the decrease was about 15% (from 14,500 transactions per second to 12,300 transactions per second).

    So the behavior in terms of the performance impact was similar whether it's on a two-processor machine or on an eight-processor machine.

    I don't know of any practical requirement that demands a synchronous audit. But since the feature is there in SQL Server, and it is highly unlikely that it's there by accident, I suspect such a requirement probably exists. I'm curious about it.

    In both this post and the previous post, we are exploring the worst case scenario with the system already under heavy processor stress. In a future post, we’ll look at the performance impact of enabling the same audit setups when the processors are not the bottleneck. This would be a more realistic scenario.

  • Multiple independent results from a single SELECT statement

    You cannot output multiple independent results from a single SELECT statement. But sometimes I wish that could be done. An extremely simple case is when you need to build the initial dimensions from a very large table.

    Typically, these dimension columns have very low selectivity and are often not indexed. When you want to get all the distinct values from each of these columns, you may want to do the following:

    SELECT DISTINCT c1 FROM VeryLargeTable; 

    SELECT DISTINCT c2 FROM VeryLargeTable; 

    SELECT DISTINCT c30 FROM VeryLargeTable; 

    The problem is that this would require scanning the table 30 times (assuming you have 30 such columns to query). And if each scan takes 30 minutes, these queries would take 30*30 = 900 minutes, or 15 hours, to finish.

    Wouldn’t it be nice to be able to say something like (I’m making up the syntax) the following?

    (SELECT DISTINCT c1) ||

    (SELECT DISTINCT c2) ||

    (SELECT DISTINCT c30)

      FROM VeryLargeTable; 

    And have it produce 30 independent results, one for each column, with a single scan through the large table. This could be done in 30 minutes instead of 15 hours.

    Yes, there are tricks you can play to reduce the amount of time it takes to get these 30 resultsets, but none of them is as elegant as producing the results with a single scan.
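
    One such trick is GROUPING SETS, which collects the distinct values of several columns in a single scan, though it returns them interleaved in one resultset rather than as independent results (a sketch, not a full solution):

    -- Distinct values of c1, c2, and c30 from one scan of the table.
    -- Each output row belongs to exactly one grouping set; GROUPING_ID tells you which.
    SELECT GROUPING_ID(c1, c2, c30) AS grp, c1, c2, c30
      FROM VeryLargeTable
     GROUP BY GROUPING SETS ((c1), (c2), (c30));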

    By the way, I didn't just make up the above example. Scanning the very large table multiple times, once for each column, is how SQL Server Analysis Services does the initial dimension processing (although there are options to tweak it).

  • Performance impact: synchronous audit can trash your database performance

    The SQL Server Audit feature introduced in SQL Server 2008 is useful in many scenarios. Although it still has much room for improvement, it has made SQL Server auditing easier to manage, and in particular, it has reduced the performance impact as compared with a similar audit setup using SQL Trace.

    However, this does not mean that you should ignore performance considerations when configuring SQL Server Audit. In this post, I give you some empirical evidence to show that if you configure it incorrectly, you can still trash your database performance. In particular, I’ll focus on the impact of configuring the audit property called “queue delay”.

    When you configure an audit object using SSMS Object Explorer, you can set the property in the textbox labeled as “Queue delay (in milliseconds)”. If you use the CREATE SERVER AUDIT statement, the option in the WITH clause is called Queue_Delay. By default, it is set to 1000ms. Generally speaking, the default setting is good enough, and you should not change it. At least, I have not found a case where a different value offers visible advantage.

    But you can do serious damage if you set the queue delay property to 0, i.e. setting the audit to log data synchronously without any wait.
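
    For reference, here's roughly what such an audit setup looks like in T-SQL (a sketch; the audit and specification names and the file path are placeholders, and the three tables are the workload tables described below):

    USE master;
    CREATE SERVER AUDIT Audit_Select_Queue_Delay
    TO FILE (FILEPATH = N'D:\AuditLogs\')
    WITH (QUEUE_DELAY = 1000, ON_FAILURE = CONTINUE);   -- 1000ms is the default
    -- The damaging variant discussed in this post (the audit must be off to alter it):
    -- ALTER SERVER AUDIT Audit_Select_Queue_Delay WITH (QUEUE_DELAY = 0);

    USE MyTestDb;   -- the test database name is a placeholder
    CREATE DATABASE AUDIT SPECIFICATION Audit_Select_On_Workload_Tables
    FOR SERVER AUDIT Audit_Select_Queue_Delay
    ADD (SELECT ON dbo.customer BY public),
    ADD (SELECT ON dbo.orders   BY public),
    ADD (SELECT ON dbo.stock    BY public)
    WITH (STATE = ON);

    USE master;
    ALTER SERVER AUDIT Audit_Select_Queue_Delay WITH (STATE = ON);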

    To demonstrate the damage, I ran a series of tests on a small server with two processors running SQL Server 2008 R2. The workload was OLTP, primarily on three user tables: customer, orders, and stock. I’ll skip the details about the workload as they do not change the general message of this post. If you are interested, I can give you all the details over email.

    I tested three cases, all with 10 simulated users submitting transactions non-stop (i.e. without any wait between two consecutive transaction submissions for each user):

    1. Without any audit setup.
    2. With an audit tracking all the SELECT statements on the three tables; the queue delay was set to 1000ms (the default value).
    3. With an audit tracking all the SELECT statements on the three tables; the queue delay was set to 0ms.

    The following table summarizes the test results:

    Test Case   Throughput (transactions/second)   Total CPU usage
    1           3500                               ~97%
    2           3100                               ~98%
    3           160                                ~8%

    The first thing to note is that when the audit was configured and enabled with no queue delay at all, the negative performance impact was dramatic, causing the throughput to drop from ~3500 transactions/second (when no audit was enabled) to a mere ~160 transactions/second. In fact, when I first ran the test with no audit, the throughput was steady at 3500 transactions/second; the moment I enabled the no-delay audit, I saw the throughput drop like a stone on the chart.

    When there was no audit, the system was completely CPU bound with the total CPU usage pegged at ~97% and a processor queue was observed. When the no-queue-delay audit was enabled, the waits were observed to be primarily on the wait type XE_SERVICES_RWLOCK, and the total CPU usage dropped to ~8%. The system spent most of its time waiting on some extended event locks.

    By the way, when I checked the exposed SQL Server performance counters under SQLServer:Wait Statistics, I saw all waits to be 0. In other words, there was no wait as far as the perfmon counters were concerned. Apparently, the XE waits were not reported under SQLServer:Wait Statistics in SQL Server 2008 R2.
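
    For reference, a simple way to see these waits is the sys.dm_os_wait_stats DMV, which does include the extended event wait types:

    -- Cumulative waits since the instance started (or since the stats were last cleared).
    SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
      FROM sys.dm_os_wait_stats
     WHERE wait_type LIKE 'XE%'
     ORDER BY wait_time_ms DESC;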

    From the above result table, it's clear that when an audit with the queue_delay set to 1000ms was enabled, the transaction throughput dropped from 3500 to 3100, an 11% decrease in performance. This can be considered significant. However, it's worth noting that I was testing a rather extreme performance scenario in which the total CPU usage was already driven to almost 100% and there was a clear and sustained processor bottleneck.

    In real-world scenarios, you probably should not turn on any audit to track busy activities when the system is already terribly bottlenecked on the processors. The good news is that even when the system is under extreme CPU load, enabling an audit (with the queue_delay set to 1000ms) that does not actually track a lot of data appears to have very little adverse performance impact. For instance, I re-configured the database audit specification to track all the DELETE statements (instead of SELECT) on the same three tables. Since the test workload did not perform any DELETE, the audit did not actually record any audit data. The throughput did seem to drop slightly to around 3450 transactions/second when this audit was enabled, a rather insignificant drop within the margin of observation error.

    I’ll follow up with the test results that show the performance impact when the system under test is not bottlenecked on the processors.

    I hope the message is loud and clear: do not configure an audit with the queue-delay property set to 0.

     

  • Creating features for 1% of the users

    While at PASS Summit 2011 in Seattle, I had a conversation with an attendee from one of the largest US companies. As expected and like any of the largest US companies, this company has multiple DBMS platforms.

    The conversation then turned to how SQL Server was being used, and the attendee told me that they target SQL Server primarily for their second tier applications and Oracle is the platform of choice for the more critical tier 1 applications. The key reason, according to this person, is that SQL Server does not support active-active clustering and therefore when a server fails, there can be significant down time to fail over all the connections, a situation not acceptable to some critical applications.

    Although it is arguable (1) how much down time a SQL Server failover may incur, (2) to what extent that may not be acceptable, and (3) to what extent an active-active cluster may reduce the down time, the attendee was not the first one to express this view (or bias if you will), and this did remind me of the comments that I constantly hear from the SQL Server community that Microsoft should not focus too much on creating features just for the largest enterprise customers, which probably account for only 1% of its customer base (although revenue-wise the percentage is probably much, much higher).

    I have always thought the comments on creating SQL Server features for only 1% of the customers to be very misleading and not helpful.

    I don't see Microsoft focusing too much on the largest enterprise customers, not to the extent that it is detrimental to its smaller customers. In fact, I think Microsoft needs to focus more on the large enterprise customers in terms of creating the features they want. These features may not be needed by smaller customers today. But the large customers lead the way, and those same smaller customers very often will find their requirements change and will start to use the higher-end enterprise features.

    More importantly, these features are a must to narrow the gap between SQL Server and more established DBMS platforms, neutralizing any argument against its adoption on critical applications in large enterprise data centers.

    Most importantly, these high-end and more challenging enterprise features (such as the aforementioned active-active clustering for a single instance) help make SQL Server a better product overall. After all, if it can meet the more stringent challenges of the largest enterprises, it should have no problem easily serving as a platform for smaller customers.

    I hope that in the not-too-distant future, when this same attendee reviews his DBMS platform strategy, he'll still use not-having-active-active-clustering as the key differentiating factor between a tier 1 DBMS platform and a tier 2 DBMS platform, except that he can no longer classify SQL Server as a tier 2 DBMS platform.

  • Is the SELECT permission checked in an UPDATE statement?

    I was at PASS Summit 2011 in Seattle and had a chat with Jack Richins of Microsoft on the SQL Server audit feature after his presentation. A question came up with regard to whether the SELECT permission should be checked when an UPDATE statement is being executed, and whether the behavior that I discussed in my previous blog post (i.e. you may get UPDATE statements in the audit log even if you only want to audit SELECT and the UPDATE statement does not have an explicit SELECT subquery) was a bug or expected.

    I don’t know the answer to the ‘should’ part of the question because I don’t know the specifications for this feature. But I became curious about how it actually behaves – is the SELECT permission checked in an UPDATE statement?

    Here are some tests to see how it actually behaves in SQL Server 2008.

    I granted login test_user permission to update table TestDB.dbo.TestTable, but did not grant the SELECT permission on the table.
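
    Here's a minimal sketch of that setup (the column types and password are placeholders; the point is simply that test_user has UPDATE but not SELECT on the table):

    CREATE LOGIN test_user WITH PASSWORD = 'StrongPassword!1';
    GO
    USE TestDB;
    CREATE TABLE dbo.TestTable (i int, j int);
    CREATE USER test_user FOR LOGIN test_user;
    GRANT UPDATE ON dbo.TestTable TO test_user;
    -- deliberately no GRANT SELECT ON dbo.TestTable
    GO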

    The test_user was able to execute:

    UPDATE TestTable

       SET j = 2;

     

    But the user failed on the following statement with error 229 (the SELECT permission was denied on the object):

    UPDATE TestTable

       SET j = 2

     WHERE i = 1;

     

    If SELECT was audited in the database audit specification, the audit log had a record for the second statement, but had no record for the first one (i.e. the one without the WHERE clause). So apparently no SELECT permission is checked if an UPDATE statement does not have a WHERE clause. Well, it's not just the mere presence of the WHERE clause, as the next examples show.

    The following UPDATEs would trigger a SELECT permission check and cause a record to be logged in the audit log:

    UPDATE TestTable

       SET j=2

     WHERE i=i;

     

    UPDATE TestTable

       SET j = 2

     WHERE exists (SELECT * FROM TestTable);

     

    The following UPDATE statements do not trigger any SELECT permission check on TestTable and won’t leave any audit trail if SELECT on TestTable is the only audited action type.

    UPDATE TestTable

       SET j=2

     WHERE 2=2;

     

    UPDATE TestTable

       SET j = 2

     WHERE exists (SELECT * FROM sysobjects);

     

    By the way, if we determine whether the SELECT permission is checked by the presence of an audit record for the SELECT audit action type, then the SELECT permission is always checked, even for a sysadmin. The difference is that even if you explicitly deny the SELECT permission to a sysadmin, the check will always succeed, and an audit record is still left behind in the audit log.

  • Fully transparent table partitioning for OLTP

    No, this is not something announced at PASS Summit 2011. But it’s something that has been on my wish list for a long time.

    Table partitioning is a great feature, and overwhelmingly, table partitioning is touted as a data warehousing feature for both sliding a partition into a table as a means of loading new data quickly and for sliding a partition out of a table as a means of purging old data quickly.

    The pattern of the sliding-window data access, however, is not limited to data warehousing applications. For many OLTP applications, new data is constantly being inserted into a table from the upstream systems (e.g. mid-tier application servers). The focus of the application is mostly on the more recent data, and the older data in the table may be purged or archived to somewhere else. For an OLTP table, we probably would never slide any pre-populated partitions into the table. But we could benefit greatly from being able to slide a partition out of the table, relieving us of having to purge old data with a series of DELETE statements, a slow and painful process.

    In the real world, the table partitioning feature is not widely used on OLTP tables, at least not on the systems I'm aware of. The main deterrent seems to be the amount of maintenance overhead in managing new partitions.

    Come to think of it, table partitioning is really a physical-level concept. From the application or logical level, we want to have its benefits, but rarely if ever do we want to go through the laborious process of maintaining it. In other words, we'd like the feature to be transparent at the application and logical level.

    For instance, why can’t we just say something like:

    CREATE TABLE foo (c1 datetime …) WITH PARTITION ON c1 BY WEEK;

    And expect the SQL Server engine to do the right thing in creating the partitions as necessary and putting data into the correct partitions. As long as we can completely express our intent in how the data should be partitioned in that table, we really don't care how SQL Server does it internally, provided all the other table operations continue to behave as is. Of course, we'd like to be able to execute:

    ALTER TABLE foo DROP PARTITION for <week>;

    And expect the partition for that week to be dropped as a metadata operation.

    As a new week approaches, SQL Server knows that it needs a new partition to accommodate the data coming into the table, and will just automatically create one under the hood without bothering the human user. That's total transparency, and it would make life a lot easier when it comes to purging old data from an OLTP table.

    Yes, I know you can automate the process of managing new partitions with a script, and you can probably pre-create enough partitions to be sufficiently safe. But building that transparency into the database engine as a feature relieves you from having to worry about how robust your scripted automation may be.
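
    For contrast, here's a rough sketch of the manual plumbing that this wished-for syntax would replace (the boundary dates, filegroup, and archive table are placeholders):

    -- Today: the partition function and scheme must be created and maintained by hand.
    CREATE PARTITION FUNCTION pf_week (datetime)
    AS RANGE RIGHT FOR VALUES ('20111003', '20111010', '20111017');

    CREATE PARTITION SCHEME ps_week
    AS PARTITION pf_week ALL TO ([PRIMARY]);

    CREATE TABLE foo (c1 datetime NOT NULL /* ... */) ON ps_week (c1);

    -- Every new week: add a boundary before the data arrives...
    ALTER PARTITION SCHEME ps_week NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pf_week() SPLIT RANGE ('20111024');

    -- ...and purging old data means switching the oldest partition out to a staging
    -- table (foo_archive must have an identical structure) and merging the boundary.
    ALTER TABLE foo SWITCH PARTITION 1 TO foo_archive;
    ALTER PARTITION FUNCTION pf_week() MERGE RANGE ('20111003');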

  • A quick look at SQL Server Audit in Denali CTP3

    I have been using the SQL Server 2008 audit feature on a regular basis and generally like it a lot. No doubt there is still room for improvement to make it practically more useful. Here are three changes I would love to see.

    • You can't really just audit the SELECT statements. Yes, you can pick SELECT as an audit action type. But because of the way the audit feature is implemented, you most likely would also end up with a lot of UPDATE or DELETE statements in the audit log. Note that SQL Server audit logs its data at the points when permissions are checked, and a SELECT permission check is almost always performed when an UPDATE or DELETE is executed. In other words, practically you can't specify that you only want to audit SELECT queries on a table but don't care about any DML on the same table.

    Why is this even an issue? One of the primary reasons I use the audit feature is to track usage for deprecation purposes. On one reporting server, we suspect that a table is not really being used and would like to confirm that is the case and drop the table. So I set up the table for audit on all SELECT queries by specifying SELECT as the audit action type on the table, expecting to find no or very few audit records. But I end up with millions of records in the audit log, and virtually all of them are updates. Note that this is a reporting server, and the table is being replicated from an OLTP server, on which the table is heavily updated. These updates are faithfully replicated by a transactional replication distribution agent. For our purpose, we know the table is being updated by the distribution agent and we don't care about that. We just want to know if anyone is querying the table for reporting. But we don't seem to be able to do that efficiently with the SQL Server 2008 audit feature.

    • When you have a server with a lot of stored procedures calling each other, it can be difficult to find out where a statement in an audit record comes from. To me, the top-most stored procedure is a key piece of audit information in addition to the exact statement being applied to the table directly. It should be captured, but it is not.

    • Much of the useful session-level information is not captured in the audit record, making it difficult to correlate who or what (e.g. the program name) did what (i.e. the info in the audit record) from where (e.g. the hostname). There may be a perfectly good reason for not providing more or tighter correlation with the session-level information. But this makes the audit feature less useful.

    So the question is: do we have any good news from Denali CTP3? The answer is yes and no. Denali CTP3 has some nice improvements, but in my opinion they don't go far enough.

    The good news is that in Denali CTP3, (1) we can now selectively retain audit records by using any one or a combination of the columns in the audit records to set up a filtering predicate, and (2) the Additional Information column of the audit log may now include the T-SQL stack, which has information on the upper-level calling stored procedures.

    The not-so-good news is that these changes are not as effective as they could be, or as I'd love them to be.

    I see two deficiencies in the audit filter capability. First, it is an attribute of the server audit object. It would give us more flexibility if it could also be part of a database audit specification so that, for instance, we could create filters on which action types to include and which to filter out. The second issue with the current filter design is that it's difficult to create an effective filter that retains only the SELECT queries that are actually submitted, instead of the 'bogus' SELECTs that result from permission checks on UPDATE or DELETE.
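
    For illustration, this is roughly what a CTP3 filter predicate looks like (a sketch; the audit name, file path, and filter values are placeholders). Because the predicate lives on the server audit object, it applies to everything that audit captures:

    CREATE SERVER AUDIT Audit_With_Filter
    TO FILE (FILEPATH = N'D:\AuditLogs\')
    WITH (QUEUE_DELAY = 1000)
    WHERE database_name = 'test' AND object_name = 'foo';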

    The inclusion of the T-SQL stack information is excellent. But my testing with CTP3 shows that it only provides information going up one level (I don’t know if this will change in the final release). For instance, if you audit SELECT on table foo, and you have the following two stored procedures:

    CREATE PROC p_foo_1

    As

       SELECT * FROM foo

    go

    CREATE PROC p_foo_2

    AS

    EXECUTE p_foo_1

    go

     

    When you execute p_foo_1 and p_foo_2, the statement captured in the audit records will be the same:

    SELECT * FROM foo

    The T-SQL stack for p_foo_1 is as follows, clearly identifying the calling proc:

    <tsql_stack>

    <frame nest_level = '1' database_name = 'test' schema_name = 'dbo'

           object_name = 'p_foo_1'/>

    </tsql_stack>

     

    And the T-SQL stack captured for p_foo_2 is:

    <tsql_stack>

            <frame nest_level = '2' database_name = 'test' schema_name = 'dbo'

                   object_name = 'p_foo_1'/>

    </tsql_stack>

     

    In both cases, the stored procedure is identified as p_foo_1, the one doing SELECT on the table directly. In the case of p_foo_2, you’ll need to pay attention to the value in the nest_level attribute. The nest_level value is 2, and you know that you must search one level up to find the calling stored procedure. This definitely helps. But if the nest_level value is 5, you end up having to do a lot of code search. So why not just identify the top calling stored procedure and save the user from having to search through the code?

    With respect to the third issue I mentioned earlier, I have not found any change in CTP3 to better integrate/correlate the audit records with the SQL Server session level attributes.

     

  • Transactional replication and massive data updates

    SQL Server transactional replication has been a rather solid feature and works well for delivering data to reporting servers (among other things) in near real time. That said, it may not work too well when you need to perform massive updates on a published table, for instance, when you need to archive a large amount of data. It can take a long time for all the changes to be replicated to the subscribers.

    For some requirements, a perfect solution for this scenario would be to temporarily disable the transactional replication, perform the large scale data archiving concurrently on both the publisher and all the subscribers, and then resume the replication. That is, this would work very well if you can keep the published table from being modified by any other process for the duration.

    But wait! This won’t work because there is no built-in SQL Server feature to disable/enable transactional replication so that it becomes completely transparent to data updates while it’s disabled.

    A potential workaround is to remove the replication completely before making massive data changes and then re-create it afterwards. Before I actually started using this approach, it sounded to me like a rather dicey way of getting better performance.  A number of things could go horribly wrong to help ruin your otherwise peaceful weekend:

    • The replication you re-create afterwards may not be identical to what you removed.
    • Your table(s) could get out of sync among the servers before you re-create replication.
    • Coordinating concurrent updates on multiple servers is inherently more complex than just updating the publisher.

    But now that I have put this approach into practice and battle-tested it for many months, I can report that it can be a very effective way of gaining performance in your archiving process. Here are some of the tips I'd like to share in order to achieve that effectiveness:

    • You must have a robust mechanism to quiesce all the processes that update your published data. If your data is updated (by a non-archiving process) on the publisher while replication is removed, your subscribers will be out of sync with the publisher and you must re-sync them with the publisher, which defeats the purpose of removing replication in the first place.
    • Once the data update processes are quiesced, you must verify that the tables are indeed in sync among the publisher and the subscribers. If this fails, there is no point of proceeding further.
    • You'll need a program or a script to generate the T-SQL scripts for deleting and creating the replication. I do this with a C# program via RMO, but a PowerShell script will do just fine. The T-SQL scripts need to be generated immediately before you remove the replication so that any change to the replication setup is captured.
      • If you are archiving a single table or a few tables, you can just script out the replication setup on the table(s). In many cases, you are archiving a database and you should script out all the replication setup for that database.
      • Among other things, your replication script generator should take the name of the table or database as an input and automatically find out all the replication setup that needs to be scripted. This ensures that you don’t need to change the script generator as publications and/or subscriptions are being added, dropped, or modified.
      • You'll also need a program to execute the generated replication delete scripts and another program to execute the generated replication create scripts. These programs need to be aware of where a given script must be executed. For instance, a script to create a pull subscription needs to be executed on the subscriber. My solution is to have the replication script generator record this information. As it goes through each publication and pull subscription to generate scripts, the script generator records the generated T-SQL script locations and the target server names in a table. The programs that execute the generated scripts are driven by this table to find each script and the server on which that script needs to be executed.
    • Once the tables are verified to be in sync and the replication is removed (by executing the delete scripts), you can start to archive data concurrently on all the servers. This step is best driven by a common list of what needs to be archived so that you are guaranteed that archiving will be done on all the servers on exactly the same data, and at the end of archiving, you are guaranteed that the tables will still be in sync.
    • Since archiving is done on separate servers, things can obviously go wrong and crash it on one server but not on the others. Regardless of what may go wrong, before you execute the generated T-SQL scripts to re-create the replication, you must verify that:
      • The archiving processes are indeed done on all the servers, and
      • The data is still in sync across all the servers
    • All the subscriptions need to be re-created with no-sync. Otherwise, you have just archived the subscribers for nothing. This means that the replication script generator needs to modify the generated T-SQL scripts so that the value of the @sync_type parameter for the sp_addsubscription procedure is set to 'none' regardless of how it is currently set (see the sketch after this list).
    • To ensure that the re-created replication is the same as the original replication (before it’s removed), you can take a snapshot of the syspublications, sysarticles, and syssubscriptions tables before the replication is removed, and take another snapshot of the same three tables after the replication is re-created. By comparing the before and after replication configurations recorded in these three tables, you can determine whether you have changed any replication setup.
    • I would also suggest that you run a test change and see it get replicated from the publisher to the subscribers. Nothing gives you a more comforting feeling than seeing it in action and seeing that it works.
    • The distribution agents should run under the security context of the SQL Agent service account. Otherwise, because the generated scripts don’t include any password text, the script containing sp_addpushsubscription_agent or sp_addpullsubscription_agent will fail and no distribution agent would be created.

    As mentioned, this approach has been proven to be effective in practice, and it is also quite maintainable. Still, this amounts to regular schema changes with a lot of moving parts. It would be much better if replication could be disabled with a single DDL statement, just like you can disable an index or a constraint.

  • Multi-victim deadlocks – a simple example

    Recently, I ran into a series of deadlocks that had multiple victims in each deadlock. In other words, SQL Server chose more than one spid to terminate in resolving a deadlock.

    Maybe I need to brush up my web search skills, but I have not found much information on multi-victim deadlocks. About the only information I could find is an excellent blog post by SQL Server MVP Jonathan Kehayias from more than two years ago. And in his post, Jonathan provided a demo to create a multi-victim deadlock.

    The multi-victim deadlocks that I ran into appear to be a simpler case because they involve only a single table and do not require setting the deadlock priority to different levels. In addition, I think this is a rather prototypical multi-victim deadlock in that more than one victim must be killed in order for the deadlock to be removed. Here are the steps to create such a multi-victim deadlock.

    First, create a table. Any table will do. Let’s make it real simple:

    use tempdb

    go
    create table junk(i int)

     

    Then, open three connections to the SQL Server 2008 instance and run the following in each of the three connections in tempdb:

     

    begin tran

      select * from junk with (tablock, holdlock)

     

    Note that it is tablock, not tablockx, in the hint.

    Now, in each of the three connections, run this UPDATE statement:

    update junk

       set i = 1

    You must run the above statement in all three connections fast enough to see a multi-victim deadlock. If you run it in two connections and wait, you’ll get a deadlock between those two connections before you have a chance to run the UPDATE statement in the third connection.

    Essentially, you need to run the UPDATE statement in the third connection before SQL Server’s deadlock detection mechanism has a chance to run. And by the time the deadlock detection mechanism runs, it’ll see that two connections must be killed to get rid of the deadlock. Killing any one of the connections would still leave the deadlock in place.
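
    If you want to inspect the resulting deadlock graph while reproducing this, trace flag 1222 writes the deadlock details, including every victim, to the SQL Server error log:

    DBCC TRACEON (1222, -1);   -- enable globally
    -- ... reproduce the deadlock, then read the error log:
    EXEC sp_readerrorlog;
    DBCC TRACEOFF (1222, -1);  -- turn it back off when done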

  • Performance impact: Don’t assume they are the same

    When you deploy a production cluster, you want any failover to be as transparent to your users as possible. It follows that the nodes within the cluster should be identical in terms of their hardware and software configurations. It further follows that the nodes are expected to perform the same or virtually the same.

    In practice, performance parity among the nodes is often implicitly assumed, and this implicit assumption is rarely, if ever, checked. After all, if everything is the same among the nodes (other than the fact that they are different machines), why shouldn't you expect them to perform the same or similarly? And since these machines are made of literally millions of elements (hardware components, software code lines, and program configurations), it is actually very difficult to verify whether two machines indeed perform the same or similarly enough. Furthermore, in many cases when we compare different computers, we are not even sure what performance should be measured or whether the performance we are measuring is relevant to what really matters at the database application level. Consequently, implicitly assuming identical or similar performance among the nodes may not be a lazy cop-out, but rather a practically cost-effective decision.

    That is all fine until the performance difference among the nodes starts to hurt. And it turns out that identically spec’ed machines can perform differently more often than you may expect.

    In fact, I have run into enough cases where the nodes had significantly different performance that I believe each/every node should always go through a standard benchmarking test before the cluster is deployed into production. The benchmark is not expected to spot all the performance difference, but it can give us a higher level of confidence that when a SQL Server instance fails over from one node to another, the application performance will not degrade.

    The question then is: what kind of benchmark should you run? Different people would offer different answers. None is necessarily wrong, nor is any necessarily the best in all scenarios.

    Here is my own experience.

    Since we are talking about SQL Server, it needs to be a SQL Server specific benchmark. I have run into many cases where multiple generic hardware tests show no difference between two nodes, but the application database behaves differently after a failover.

    In addition, if we are going to run a benchmarking test regularly, it needs to be sufficiently simple to administer and must not require an excessive amount of time before we can determine whether the nodes do in fact perform the same. Ideally, we should be able to make that determination within an hour. Obviously, there is a tradeoff between simplicity and coverage, and it is not realistic to expect a benchmark to be both simple and able to uncover all the differences that your application may be sensitive to.

    So it makes sense to focus on common or likely scenarios. A common set of scenarios that I have run into is related to differences in memory modules. Memory modules that are spec’ed and expected to be the same are in fact significantly different in performance, resulting in database performance being different on different nodes. In many of these scenarios, I tried straightforward memory benchmarks without much success. Note that these memory benchmarks typically perform repeated memory copies/writes, either single threaded or multi-threaded. When I ran them, they reported no difference between the nodes. But the database applications suggested otherwise.

    Increasingly, I have found these simple memory performance benchmarks to be less than effective in predicting how databases performance may vary with different memory modules. I think the following factors may be in play here:

    • Servers now have more and more memory modules, so the chance of some of them not performing well, or not performing well with the other modules, increases.
    • Simple memory benchmarks may not exercise all the modules, or may have no option to configure how much memory to exercise.
    • Simple memory benchmarks may not sufficiently exercise the interactions among the memory modules, making it very different from how SQL Server may exercise them.

    My solution is to use an entirely cached database with a sufficient number of concurrent users randomly querying the cached data. This forces all the processors to be used ~100%, doing pure page lookups covering all the cached pages.

    I have been using this approach for many years to benchmark server performance, and found it to be sufficiently effective. Note that this is not strictly a memory test. Rather, it tests the entire host computer (including CPU, memory, memory controller/chipset). Most recently, I went through a series of test cycles with this benchmark in both controlled and uncontrolled environments to check the impact of memory modules on the SQL Server performance, and I’m happy to report that the benchmark was able to report the difference that simple memory copy tests often failed to see.
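
    As a rough sanity check that a benchmark like this really is driving page lookups against the cache, you can watch the Buffer Manager counters (the values are cumulative, so take two samples a known interval apart and diff them):

    SELECT object_name, counter_name, cntr_value
      FROM sys.dm_os_performance_counters
     WHERE counter_name IN ('Page lookups/sec', 'Buffer cache hit ratio')
       AND object_name LIKE '%Buffer Manager%';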

    One of the nodes in your cluster may already be different (due to differences in memory modules or any other components), yet when your database fails over between the nodes, you simply assume it will not only perform fine but perform exactly the same. Although the performance difference may never show up at the database level, it often shows up when it really hurts, i.e. when the load has reached a certain capacity limit. It's far better to find out any difference in node performance in advance, before it becomes visible at the database level.

  • Find the distribution agent jobs

    There may be times when you want to look up the job names for a group of distribution agents. For instance, you may want to be able to stop all the distribution agent jobs for a given publication. How do you look them up in a script?

    A quick web search on finding the distribution agent job name comes back with suggestions similar to the following query:

    SELECT a.publication, j.name  
      FROM distribution.dbo.MSdistribution_agents a
           JOIN msdb.dbo.sysjobs j ON a.job_id = j.job_id

    Unfortunately, this doesn’t quite work. Well, it works only if the distribution agents are running on the distributor, i.e. when the subscriptions are push. It does not work with any pull subscription.

    The following is a demo script that works regardless whether the subscriptions are pull or push or a mix of both. This script is written to stop all the distribution agents on demand for a given publication named MY_PUBLICATION:

    declare @publication sysname,
            @local_job int,
            @publisher_db sysname,
            @job_id uniqueidentifier,
            @job_name sysname,
            @subscriber sysname,
            @sql varchar(2000)

    if object_id('tempdb..#sql') is not NULL
      drop table #sql

    create table #sql (s varchar(2000))

    declare cr cursor for
    select d.publication, d.local_job, d.publisher_db, d.job_id, s.name
      from distribution.dbo.MSdistribution_agents d
          join master.sys.servers s on d.subscriber_id = s.server_id
     where d.publication = 'MY_PUBLICATION'
    order by d.publication, s.name

    open cr

    fetch next from cr into @publication, @local_job, @publisher_db, @job_id, @subscriber

    while @@fetch_status = 0
    begin
       if @local_job = 0
         select @sql = '
            select ''EXECUTE ' + @subscriber + '.msdb.dbo.sp_stop_job '''''' + j.name + ''''''''
              from ' + @subscriber + '.' + @publisher_db + '.dbo.MSsubscription_properties sp
                  join ' + @subscriber + '.msdb.dbo.sysjobsteps js on sp.job_step_uid = js.step_uid
                  join ' + @subscriber + '.msdb.dbo.sysjobs j on js.job_id = j.job_id
             where sp.publication = ''' + @publication +
            ''' and publisher = ''MY_PUBLISHER'' and publisher_db = ''' + @publisher_db + ''''
       else
         select @sql = '
            select ''EXECUTE msdb.dbo.sp_stop_job '''''' + j.name + ''''''''
              from distribution.dbo.MSdistribution_agents d join msdb.dbo.sysjobs j
                  on d.job_id = j.job_id
                  join master.sys.servers s on d.publisher_id = s.server_id
             where d.publication = ''' + @publication +
              ''' and s.name = ''MY_PUBLISHER'' and d.publisher_db = ''' + @publisher_db + ''''

       --print @sql
       insert #sql
       EXECUTE (@sql)

       fetch next from cr into @publication, @local_job, @publisher_db, @job_id, @subscriber
    end

    close cr
    deallocate cr

    declare cr_sql cursor for
    select s from #sql

    open cr_sql

    fetch next from cr_sql into @sql

    while @@fetch_status = 0
    begin
          print @sql
          print ' '
          EXECUTE (@sql)
          fetch next from cr_sql into @sql
    end

    close cr_sql
    deallocate cr_sql

    The script assumes that the publisher is named MY_PUBLISHER and the distribution database is named distribution. In addition, the script must be run on the distributor. Obviously, you can modify the script to apply other search patterns (e.g. find all the distribution agent jobs for a published database) and/or apply other actions on the found jobs.

    One quick note on the replication system table MSsubscription_properties. If you look up this system table in Books Online, you won't find the column job_step_uid in the documented column list. But the column is there. I don't know if that is a documentation bug, or if it is left out on purpose. This table is the key to finding the distribution agent jobs of the pull subscriptions.

    By the way, for a pull subscription, I have no idea what the value in the job_id column of the MSdistribution_agents table is for. If you know, I'd certainly appreciate your sharing the info.
