
SELECT Hints, Tips, Tricks FROM Hugo Kornelis WHERE RDBMS = 'SQL Server'

  • Bleeding Edge 2012 – session material

    As promised, here are the slide deck and demo code I used for my presentation at the Bleeding Edge 2012 conference in Laško, Slovenia. Okay, I promised to have them up by Tuesday or Wednesday at worst, and it is now Saturday – my apologies for the delay.

    Thanks again to all the attendees of my session. I hope you enjoyed it, and if you have any questions, please don’t hesitate to get in touch with me.

    I had a great time in Slovenia, both during the event and after hours. Even if everything the tour guide said during the tour of the Laško brewery was lost on me (in his defense, he did offer to translate the Slovenian explanations to Russian), I still liked it – especially the part where we got to sample some of the produce!

    I truly hope that there will be another Bleeding Edge conference next year. And if there is, I definitely want to speak there again!

  • SQLRally Nordic 2012 – session material

    As some of you might know, I have been to SQLRally Nordic 2012 in Copenhagen earlier this week. I was able to attend many interesting sessions, I had a great time catching up with old friends and meeting new people, and I was allowed to present a session myself.

    I understand that the PowerPoint slides and demo code I used in my session will be made available through the SQLRally website – but I don’t know how long it will take the probably very busy volunteers to do so. And I promised my attendees to make them available through my blog as well.

    So, here they are, for everyone to download, see, and play with. Though, in all honesty, I expect people who were not present at my session to get very little out of this material.

    (I’m not sure, but I *think* that the SQLRally organization will also make a recording of my session available. If that is true, then once it’s up the slides and demo code should make a lot more sense to anyone who is able to view the recording).

    I know I’ve said it before, but I can’t repeat this often enough – thanks to everyone involved with the organization of SQLRally for creating a truly wonderful event, and thanks to all the attendees who took the time and energy to come to my session and listen to me.

  • T-SQL User-Defined Functions: the good, the bad, and the ugly (part 4)

    Scalar user-defined functions are bad for performance. I already showed that for T-SQL scalar user-defined functions without and with data access, and for most CLR scalar user-defined functions without data access, and in this blog post I will show that CLR scalar user-defined functions with data access fit into that picture.

     

    First attempt

     

    Sticking to my simplistic example of finding the triple of an integer value by reading it from a pre-populated lookup table, and following the standard recommendations and templates that Microsoft publishes for CLR scalar user-defined functions, my first attempt for this function looked like this:

     

    [Microsoft.SqlServer.Server.SqlFunction(DataAccess=DataAccessKind.Read,
        IsDeterministic=true,SystemDataAccess=SystemDataAccessKind.None,IsPrecise=true)]
    public static SqlInt32 CLRTripFromTbl1(SqlInt32 Value)
    {
        SqlInt32 Result;
        using (SqlConnection conn = new SqlConnection("context connection=true"))
        {
            conn.Open();

            SqlCommand cmd = new SqlCommand("SELECT Triple FROM dbo.Triples WHERE Value = @Value;", conn);
            cmd.Parameters.Add(new SqlParameter("Value", Value));

            Result = (SqlInt32)((int)cmd.ExecuteScalar());
        }

        return Result;
    }

     

    Below is the code I used to test the performance of this version in comparison to the T-SQL scalar user-defined function and the version without UDF I used earlier (in part 2 – where I also explain why the query without UDF is more complex than needed). Note that I added a WHERE clause that limits the test to just 10% of the table (one million rows instead of all ten million) to limit the testing time.

     

    SET STATISTICS TIME ON;

    -- T-SQL UDF
    SELECT MAX(dbo.TripFromTbl(DataVal)) AS MaxTriple
    FROM   dbo.LargeTable
    WHERE  KeyVal <= 1000000;

    -- CLR UDF
    SELECT MAX(dbo.CLRTripFromTbl1(DataVal)) AS MaxTriple
    FROM   dbo.LargeTable
    WHERE  KeyVal <= 1000000;

    -- No UDF
    WITH  TheTrips
    AS   (SELECT (SELECT t.Triple
                  FROM   dbo.Triples AS t
                  WHERE  t.Value = l.DataVal) AS TheTrip
           FROM   dbo.LargeTable AS l
           WHERE  l.KeyVal <= 1000000)
    SELECT MAX(TheTrip) AS MaxTriple
    FROM   TheTrips;

    SET STATISTICS TIME OFF;

     

    If you run this and check the CPU and elapsed time, you’ll find a huge (and very nasty) surprise. I expected the CLR version to be about as bad as the T-SQL user-defined function with data access, but I was wrong – it is more than ten times slower! For processing just these one million rows (with the data already in cache), the version without UDF took about 0.2 seconds elapsed, 0.8 seconds CPU (the CPU time being higher than the elapsed time is because of parallelism); the version with T-SQL UDF took about 26 seconds (elapsed and CPU – no parallelism in this execution plan) – and the CLR version took over 350 seconds! This is also the reason why I added the WHERE clause; without it, the CLR version would probably have taken more than half an hour to finish.

     

    For completeness’ sake, I also checked the execution plan and the SET STATISTICS IO output of the CLR versions. All the issues I saw with the T-SQL scalar user-defined function with data access plague the CLR versions as well – no indication at all of the cost of the function in the execution plan, and no report of the amount of I/O in the SET STATISTICS IO output (though it is still visible in a profiler trace). Also, like the T-SQL scalar function with or without data access (but unlike the CLR scalar function without data access), you will never get a parallel plan on a query that invokes a CLR scalar user-defined function with data access. And, unlike a T-SQL UDF with data access, even the estimated execution plan will not show any hint of the data access performed in the function.

     

    The parallel loopback misunderstanding

     

    When I discussed these results with some people I consider more knowledgeable on CLR than myself, someone told me that by using a loopback connection instead of the context connection, it is possible to forgo the DataAccessKind.Read switch, which impedes parallelism.

     

    Well – if I could give a prize for the worst advice ever (okay, it later turned out to be the most misunderstood advice ever – more about that later), this would definitely qualify. I created and tested this UDF:

     

    [Microsoft.SqlServer.Server.SqlFunction(DataAccess = DataAccessKind.Read,
    IsDeterministic = true, SystemDataAccess = SystemDataAccessKind.None, IsPrecise = true)]
    public static SqlInt32 CLRTripFromTbl2(SqlInt32 Value)
    {
        SqlInt32 Result;
        using (SqlConnection conn = new SqlConnection(@"Data Source=perFact\SQL2012; Initial Catalog=udf; Integrated Security=True;"))
        {
            conn.Open();

            SqlCommand cmd = new SqlCommand("SELECT Triple FROM dbo.Triples WHERE Value = @Value;", conn);
            cmd.Parameters.Add(new SqlParameter("Value", Value));

            Result = (SqlInt32)((int)cmd.ExecuteScalar());
        }

        return Result;
    }

     

    You may note that the DataAccessKind.Read switch is still present. That’s because, when testing this with DataAccessKind.None, I got a run-time error. After restoring the DataAccessKind.Read switch, I at least got this code to run – but not very fast! When testing this UDF, I had to reduce the number of rows to process even further – processing just 1% of the table (a hundred thousand rows) took 168 seconds elapsed (though “only” 54 seconds CPU), so the one million row test I used earlier would have taken almost half an hour, and if I had used the full test set, the test would have run for almost 5 hours! That makes this test with the loopback connection almost five times slower than the one with the context connection – which was already about ten times slower than the T-SQL UDF (which, in turn, was well over a hundred times slower than the version without UDF).

     

    It later turned out that I had misunderstood the advice. For the record, this advice was given by Adam Machanic, who is a true crack at everything CLR. It was not his advice that was bad; it was my ability to understand what he meant. He later clarified that his suggestion was to move the data access to a separate thread. He even pointed me to the library he created for this – see this link.

    I have to be honest. I looked at the code. And looked at it again. Then I looked at the sample code. And again. And yet another time. And then I admitted defeat. My understanding of CLR is simply too limited to enable me to adapt my “triples” example to use this technique, so I will not be able to include it in performance comparisons. If anyone is willing to give it a try, then by all means do – and please let me know the results!

     

    Cache as cache can

     

    Another tip, also from Adam (and this time not misunderstood by me!), is to avoid multiple lookups of the same value by implementing a cache in the CLR code. This is very easy to do for anyone with a fair bit of CLR skills – and I think that I have proven that you can even replace those CLR skills with a judicious amount of Google (or Bing). Here is how my code (after several failed attempts) eventually looked:

     

    // This is the initial size of the cache; it will grow when needed.
    private const int InitialSize = 40;
    public static TripleCache tc = new TripleCache(InitialSize);

    [Microsoft.SqlServer.Server.SqlFunction(DataAccess = DataAccessKind.Read,
    IsDeterministic = true, SystemDataAccess = SystemDataAccessKind.None, IsPrecise = true)]
    public static SqlInt32 CLRTripFromTbl3(SqlInt32 Value)
    {
        SqlInt32 Result = tc[Value].Value;
        return Result;
    }

    public class TripleCache
    {
        // Dictionary to contain the cache.
        static Dictionary<SqlInt32, WeakReference> _cache;

        public TripleCache(int count)
        {
            _cache = new Dictionary<SqlInt32, WeakReference>();
        }

        // Access a data object from the cache.
        public Triple this[SqlInt32 index]
        {
            get
            {
                Triple tr;
                // Try to read the triple from the cache.
                try
                {
                    tr = _cache[index].Target as Triple;
                }
                catch
                {
                    // Triple not yet in cache; read from table and add to cache
                    tr = new Triple(index);
                    _cache.Add(index, new WeakReference(tr));
                }
                if (tr == null)
                {
                    // Triple has been in cache but was evicted - read from table again and renew cache
                    tr = new Triple(index);
                    _cache[index] = new WeakReference(tr);
                }
                return tr;
            }
        }
    }

    // This class reads the triple value from the table
    public class Triple
    {
        private SqlInt32 _triple;

        public Triple(SqlInt32 single)
        {
            using (SqlConnection conn = new SqlConnection("context connection=true"))
            {
                conn.Open();

                SqlCommand cmd = new SqlCommand("SELECT Triple FROM dbo.Triples WHERE Value = @Value;", conn);
                cmd.Parameters.Add(new SqlParameter("Value", single));

                _triple = (SqlInt32)((int)cmd.ExecuteScalar());
            }
        }

        // Simple property.
        public SqlInt32 Value
        {
            get
            {
                return _triple;
            }
        }
    }

     

    With the test data used so far, this method really performs very well. The test with one million rows finished in 3.5 seconds (again, both elapsed and CPU – we still do not get a parallel plan for this query). That is a lot faster than all other UDF versions tested so far – though still nowhere near the performance we can get by not using a UDF at all.

     

    To be fair, the test data is *very* nice for this specific method. The lookup table used holds two hundred thousand rows, but the sample data is skewed to use only ten distinct values. So this last CLR version reads those ten values into its cache and then keeps serving them from cache over and over again. While using only a minimal amount of memory for its cache, it reduces the amount of actual I/O to only ten queries. In short, my test case was the ideal use case for the caching technique.
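
    If you want to verify that skew for yourself, a quick aggregation over the test table shows it (just a minimal check, using the tables from this series):

    SELECT   DataVal, COUNT(*) AS NumberOfRows
    FROM     dbo.LargeTable
    GROUP BY DataVal
    ORDER BY DataVal;

    This should return just ten rows, one per distinct DataVal, with roughly a million rows behind each of them.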

     

    I wanted to know how the cached version holds up when the circumstances are less ideal, so I also executed this query:

     

    SELECT MAX(dbo.CLRTripFromTbl3(KeyVal % 90000)) AS MaxTriple
    FROM   dbo.LargeTable
    WHERE  KeyVal <= 1000000;

     

    By using “KeyVal % 90000” instead of “DataVal”, I now ensure that 90,000 different rows from the lookup table are accessed, each 11 or 12 times. This means that, even with the cache, a lot of database queries are still needed – though still fewer than without the cache. The results were devastating! The above query ran for almost seven MINUTES (415 seconds, to be exact, using 405 seconds CPU time). At this point, I started to think that maybe the cache didn’t work properly with this number of entries. The WeakReference class I used is supposed to allow cached objects to be garbage collected in order to free up memory; maybe that was happening so aggressively that the cached data was evicted every time before it could be reused? To test this, I ran the same query another time; now the elapsed time was down to “only” 326 seconds, almost 100 seconds less. So it looks like the cache is still working, saving the need to fetch the same data multiple times – but the overhead of creating and using the cache costs more than the saved database calls; the simple CLR UDF that just calls the database every time is much faster!

    (To be fair, all methods see some performance loss when tested with the equivalent of the above query, just not as much. The T-SQL UDF takes 27 seconds, the CLR UDF takes 364 seconds, and the version with inlined logic instead of a UDF now takes almost 0.5 seconds – slower, but still comfortably running rings around all the competitors!)

     

    Now obviously, the two tests I used are both extreme cases. In many cases where you might consider implementing a CLR UDF with data access, the number of distinct rows read will be somewhere between the ten of my first test and the ninety thousand of my second – so you really should run your own tests, on your own system and with your own typical data distribution.

     

    There are also a few other caveats you should consider before deciding to implement this caching technique for your scalar CLR user-defined functions:

    ·         The cache persists across query executions. So if I execute the same query again, I will get even better performance, because all the data is still available in the CLR cache. If you run a profiler trace while executing this query a second time, you will see no data being read at all. For performance, this can be a good thing – but beware! It also means that you will have to find a way to deal with stale data. For triples, this is not really an issue. But there are also many lookup tables where the data may change. Like, for instance, a table with exchange rates for foreign currencies. How would your bank do if, even after entering the new exchange rates for the failing European currencies, lookup queries still returned the old values?

    ·         The assembly that implements the cache has to be loaded with PERMISSION_SET = UNSAFE. Your DBA might not like that. And for good reasons. If you consider implementing techniques such as this cache, or Adam’s parallelizer, be aware that these techniques trade safety for speed. You may crash your AppDomain, which may cause the loss of all data in the cache or even kill executing queries. You may have to implement code to detect and repair this – if performance is really important, and you really have to use a UDF with data access, you could consider accepting the risks and downsides. But I won’t. Even after discussing this with Adam, I still don’t feel that I really understand the risks – and if I don’t understand the risks, then there is no way I will allow the code to run on my machine (except in test environments).
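
    For reference, loading such an assembly looks roughly like this. The assembly name and path are placeholders, and in a real environment signing the assembly with a certificate or asymmetric key is generally preferred over the TRUSTWORTHY shortcut shown here:

    -- Placeholder names and path; UNSAFE assemblies require TRUSTWORTHY ON or a signed assembly.
    ALTER DATABASE udf SET TRUSTWORTHY ON;
    GO
    CREATE ASSEMBLY TripleCacheDemo
    FROM 'C:\Assemblies\TripleCacheDemo.dll'
    WITH PERMISSION_SET = UNSAFE;
    GO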

     

    Bottom line

     

    Out of the box, CLR user-defined functions will not perform well when data access is involved. They can be up to ten times slower than their T-SQL counterparts, and those were already pretty bad.

     

    Previously, we have seen that CLR user-defined functions can offer a good performance improvement when heavy computations are involved. So, what to do if your required functionality includes a combination of data retrieval and heavy computations? My recommendation would be to try to separate the two parts. Retrieve all the data in the T-SQL query, without using any user-defined function, then pass the required arguments to a CLR user-defined function without data access to do the computation. Or do the computation inline – even with heavy computations, inlining the logic is sometimes still faster!
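
    To sketch what I mean (dbo.CLRComplexCalc is a hypothetical computation-only CLR function, declared with DataAccessKind.None): do the lookup in the query itself, and feed only the retrieved values to the function.

    -- Data access stays in the query; the CLR function only computes.
    SELECT     MAX(dbo.CLRComplexCalc(t.Triple)) AS MaxResult
    FROM       dbo.LargeTable AS l
    INNER JOIN dbo.Triples    AS t
          ON   t.Value = l.DataVal;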

     

    One way to speed up a CLR user-defined function with data access is to implement a cache, to prevent unnecessary roundtrips to the server. This does require you to think very carefully about the risk of serving stale data, and you may also find it very hard to convince your DBA to allow you to load an unsafe assembly – but when implemented successfully, it can result in a tremendous boost in performance, especially if the typical use of the function involves repeatedly reading a small set of data. But beware – when the number of distinct rows that are used (and will hence be read into the cache) increases, and the number of cache hits decreases, caching can easily become a bane rather than a boon, because of the overhead involved.

     

    This concludes my discussion of scalar user-defined functions. In the next episodes of this series, we will look at table-valued user-defined functions. And we will find that those come in different flavors, with VERY different performance characteristics. Stay tuned!

  • Upcoming speaking engagements – want to meet me?

    I have a very busy time ahead of me, with lots of travel, lots of speaking engagements, and hence lots of opportunity to meet and catch up with what has become known as the SQL Family. (An excellent term, by the way – it describes exactly how it has always felt to me!)

    So, for everyone who wants to know when and where they can meet me (as well as for everyone who wants to make sure to stay as far away from me as possible), here is my schedule for the rest of the year, in chronological order:

    · September 8, SQL Saturday #162, Cambridge, England. I will be presenting on Columnstore Indexes. I have given that presentation before, among others at SQL Bits, in a 75-minute time slot – and I had enough material to fill two of those slots. In Cambridge, I have only 50 minutes for my presentation, so that will be a challenge!

    · September 15, SQL Saturday #170, Munich, Germany. Here I will deliver a presentation on the MERGE statement. Also a session that I have presented before (in Portugal and in the Netherlands), but I plan to invest some time to improve and perfect the session.

    · October 1-3, SQL Rally Nordic, Copenhagen, Denmark. In Copenhagen, I will present a session on the dangers of user-defined functions. This session is of course inspired by my current blog-series-in-progress. For the title of the session, I had two ideas. The final title is “The evils of user-defined functions”, but the other contender was “User-defined functions, or how to kill performance in one easy step”. I hope that gives you an idea of what to expect!

    · November 3, SQL Saturday #172, Portland, OR, United States of America. The schedule has not yet been announced, so I have no idea if one of my sessions will be chosen. But even if I am not presenting, I will be there, meeting old and new friends, attending other people’s sessions, and making a nuisance out of myself by asking just the right question at just the wrong time. Or, if one of my sessions is picked, I will be presenting myself, and you all can make a nuisance out of yourselves by asking the wrong question at just the right time.

    · November 5-9, PASS Summit 2012, Seattle, WA, United States of America. The biggest of the bunch. After a few years where I was either absent or a regular attendee, PASS has decided to try their luck with me as a speaker again this year. They, too, have chosen to let me speak about the MERGE statement. Obviously, I will improve my slides and demo code even further after delivering this session in Munich, so I hope to really rock the stage!

    · November 10, SQL Saturday #166, Olympia, WA, United States of America. Since I am in the neighborhood anyway, I decided to stick around one more day and go to this fine SQL Saturday event. I submitted the same sessions I also submitted for the Portland event, and they have given me the same feedback (i.e., none yet). In other words, I don’t know yet if I’ll be speaking or what I’ll be speaking about, but I will be there, even if I am not elected as a speaker.

    For those of you who desperately want to see me in real life (honestly? why??) but are not able to make it to any of the above events, there is still hope! Because, long as it already is, the list above is still incomplete. There are two more events that will take place in 2012 and where I might be speaking. Yes, “might be” – in one case, it is almost certain, but I’ll still keep quiet until the schedule is officially announced. In the other case, I first have to work out some issues in my own and my family’s schedule before I can commit myself to going there – and once that is worked out, I will still keep quiet about that one too until the speakers for the event are officially announced.

    So with at least six and possibly eight (or more? my mail is open to anyone who is organizing events!) speaking engagements for the rest of the year, there is every chance for you to run into me. Are you going to one or more of the listed events? In that case, I hope you’ll come to my sessions. But whether you do, or pick a competing session, I hope even more that you will take the time to step up to me, introduce yourself, and chat.

    I am looking forward to meeting you all!

  • T-SQL User-Defined Functions: the good, the bad, and the ugly (part 3)

    I showed why T-SQL scalar user-defined functions are bad for performance in two previous posts. In this post, I will show that CLR scalar user-defined functions are bad as well (though not always quite as bad as T-SQL scalar user-defined functions).

    I will admit that I had not really planned to cover CLR in this series. But shortly after publishing the first part, I received an email from Adam Machanic, which basically said that I should make clear that the information in that post does not apply to CLR functions. So I dutifully added a comment to that post, and included a similar disclaimer in the second part – but I also planned to run some tests, for it is my strong belief that you should never take anything for granted, no matter how knowledgeable the person telling you it is. And if I am running those tests anyway, why not share my findings with you?

    No data access

    The first part of this series focused on scalar user-defined functions (UDFs) that don’t include any data access; only computations, string handling, etc. I used a very unrepresentative example: multiplying an integer value by 3. Here is a CLR equivalent (using C#) of that function:

    [Microsoft.SqlServer.Server.SqlFunction(DataAccess=DataAccessKind.None,
        IsDeterministic=true,SystemDataAccess=SystemDataAccessKind.None,IsPrecise=true)]
    public static SqlInt32 CLRTriple(SqlInt32 Value)
    {
        return Value * 3;
    }
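
    (In case you want to reproduce these tests: after compiling the C# code into an assembly, the assembly has to be registered in the database and a T-SQL wrapper has to be created for the function. A minimal sketch, with placeholder assembly, class, and path names:)

    EXEC sp_configure 'clr enabled', 1;
    RECONFIGURE;
    GO
    CREATE ASSEMBLY UdfDemo                    -- placeholder assembly name
    FROM 'C:\Assemblies\UdfDemo.dll'           -- placeholder path
    WITH PERMISSION_SET = SAFE;
    GO
    CREATE FUNCTION dbo.CLRTriple(@Value int)
    RETURNS int
    AS EXTERNAL NAME UdfDemo.UserDefinedFunctions.CLRTriple;
    GO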

    In order to test the performance of this version of the function and compare it to the alternatives (T-SQL function and inline logic), I ran this batch against the LargeTable table, which is still populated with ten million rows.

    SET STATISTICS TIME ON;

    SELECT MAX(dbo.Triple(DataVal)) AS MaxTriple
    FROM   dbo.LargeTable;

    SELECT MAX(dbo.CLRTriple(DataVal)) AS MaxTriple
    FROM   dbo.LargeTable;

    SELECT MAX(3 * DataVal) AS MaxTriple
    FROM   dbo.LargeTable;

    SET STATISTICS TIME OFF;

    The execution plans immediately show a huge advantage of CLR UDFs over their T-SQL counterparts: they don’t inhibit parallelism!

    [Screenshot: execution plans of the three test queries]

    But other than that, this execution plan is doing pretty much the same as all other execution plans in this series of blog posts: lying until it is black in the face. Because that is what execution plans do when UDFs are involved, and CLR functions are no exception. For the real comparison, I ran the same batch without outputting the execution plan (to eliminate its overhead) and checked the amount of CPU time used and the time elapsed. For the T-SQL function, CPU time was 35,709 ms, and elapsed was 38,621 ms. The CLR version used less CPU time, only 13,072 ms – a huge saving. And thanks to the parallelism, the saving in elapsed time was even greater: only 2,071 ms.

    However, the version that avoids any UDF and instead places the logic inline still wins, with a CPU time of only 5,741 ms, and an elapsed time of 768 ms – almost three times as fast as the CLR version, and over 45 times as fast as the T-SQL version.

    Given the total lack of complexity in the UDF, the huge performance difference between the CLR and T-SQL versions of the function cannot be explained by CLR code being faster than T-SQL code (though that is definitely the case – see below). The only explanation I can come up with is that invoking a T-SQL function involves a lot more overhead than invoking a CLR function. However, the CLR function is still not entirely free of overhead, otherwise it would have performed about the same as the version with the logic inline.

    Complex calculations

    Up until now, the conclusion is that, even though scalar CLR functions definitely perform much better than scalar T-SQL functions, the best performance is still achieved by placing the logic inline. But there are exceptions to this rule. As already mentioned above, CLR code executes a lot faster than T-SQL code – so if you have a function that performs extremely complicated logic, it may well be worth your time to do a thorough comparison between inlined logic and a CLR function.

    Because I wanted to use a realistic example, I decided to dig up the code I wrote many years ago, when I needed to compute the distance between points on the earth on SQL Server 2005 (before spatial data types were introduced). The formula used to calculate the distance between two points, given their latitude and longitude, is as follows:

    distance (in radians) = 2 * arcsin( sqrt( sin²((Lat1 - Lat2) / 2) + cos(Lat1) * cos(Lat2) * sin²((Lon1 - Lon2) / 2) ) )

    This formula assumes that the input and result are all measured in radians. The data set I used for testing has locations with latitude and longitude measured in degrees, and I prefer to see the distance reported in kilometers, so we’ll have to add some unit conversions to get the correct results.

    I first implemented this calculation as a T-SQL scalar user-defined function:

    CREATE FUNCTION dbo.Distance
                   (@Lat1 FLOAT,
                    @Lon1 FLOAT,
                    @Lat2 FLOAT,
                    @Lon2 FLOAT)
    RETURNS FLOAT
    AS
    BEGIN;
    DECLARE @Lat1R FLOAT,
            @Lon1R FLOAT,
            @Lat2R FLOAT,
            @Lon2R FLOAT,
            @DistR FLOAT,
            @Dist FLOAT;

    -- Convert from degrees to radians
    SET @Lat1R = RADIANS(@Lat1);
    SET @Lon1R = RADIANS(@Lon1);
    SET @Lat2R = RADIANS(@Lat2);
    SET @Lon2R = RADIANS(@Lon2);

    -- Calculate the distance (in radians)
    SET @DistR = 2 * ASIN(SQRT(POWER(SIN((@Lat1R - @Lat2R) / 2), 2)
                                  + (COS(@Lat1R) * COS(@Lat2R) * POWER(SIN((@Lon1R - @Lon2R) / 2), 2))));

    -- Convert distance from radians to kilometers
    -- Explanation: Distance in radians = distance in nautical miles * (pi / (180 * 60)), so
    --                 distance in nautical miles = distance in radians * 180 * 60 / pi
    --              One nautical mile is 1.852 kilometers, so
    --                 distance in km = (distance in radians * 180 * 60 / pi) * 1.852
    --              And since 180 * 60 * 1.852 = 20001.6, this can be simplified to
    --                 distance in km = distance in radians * 20001.6 / pi
    SET @Dist = @DistR * 20001.6 / PI();
    RETURN @Dist;
    END;
    GO

    To test its performance, I used a table that Microsoft included as part of the product samples with SQL Server 2005 (I believe these samples are still available on CodePlex). This table includes geographic data for 22,993 places in the United States of America. For my test, I chose to calculate the distance between each pair of places in the state Texas; since there are 1,276 places in this state, the function will be invoked 1,276 * 1,276 = 1,628,176 times. To make sure that my test is not influenced by the time to send results to the client or render them on my screen, I included the MAX function, so that only a single number is sent to the client.

    SET STATISTICS TIME ON;

    SELECT     MAX(dbo.Distance(a.Lat, a.Lon, b.Lat, b.Lon))
    FROM       dbo.Place AS a
    CROSS JOIN dbo.Place AS b
    WHERE      a.State    = 'TX'
    AND        b.State    = 'TX';

    SET STATISTICS TIME OFF;

    This query took over half a minute to finish: 31,449 ms CPU time, and 34,392 ms elapsed time. The time per execution of the user-defined function is decent enough (only about 21 microseconds), but because of the sheer number of executions, the total running time is still pretty long.

    Before moving to the CLR version of the distance calculation, let’s first try what happens if I avoid the user-defined function and instead place the calculation inline. After all, this proved to be a useful technique in the first part of this series. For this query, the downside of inlining the logic is that it results in a pretty much unmaintainable query – but it does indeed result in an impressive performance boost!

    SET STATISTICS TIME ON;

    SELECT     MAX(2 * ASIN(SQRT(POWER(SIN((RADIANS(a.Lat) - RADIANS(b.Lat)) / 2), 2)
                              + (COS(RADIANS(a.Lat)) * COS(RADIANS(b.Lat))
                                * POWER(SIN((RADIANS(a.Lon) - RADIANS(b.Lon)) / 2), 2)
                                ))) * 20001.6 / PI())
    FROM       dbo.Place AS a
    CROSS JOIN dbo.Place AS b
    WHERE      a.State    = 'TX'
    AND        b.State    = 'TX';

    SET STATISTICS TIME OFF;

    The CPU time is way down, from over 31 seconds to less than 8: 7,956 ms. Elapsed time is down even more, because the optimizer can parallelize this plan – only 1,557 ms! That is a very nice reduction, but still no reason to stop. As the amount of data increases, even this version can run into performance problems. For instance, when I simply omitted the WHERE clause to make SQL Server find the maximum distance between any of the almost 23,000 places in the database (involving over 500 million distance computations), the query ran for almost five minutes, with an elapsed time of 282,580 ms, and a CPU time of 2,155,995 ms! That’s a good reason to try to optimize this further.

    And that brings us back to the subject of this post: CLR scalar user-defined functions. I have implemented the same distance calculation as a C# function:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true)]
        public static SqlDouble DistCLR
                     (SqlDouble Lat1, SqlDouble Lon1,
                      SqlDouble Lat2, SqlDouble Lon2)
        {
            // This is just a wrapper for the T-SQL interface.
            // Call the function that does the real work.
            double Dist = SpatialDist(Lat1.Value, Lon1.Value,
                                      Lat2.Value, Lon2.Value);
            return new SqlDouble(Dist);
        }

        public static double SpatialDist
                     (double Lat1, double Lon1,
                      double Lat2, double Lon2)
        {
            // Convert degrees to radians
            double Lat1R = Lat1 * Math.PI / 180;
            double Lon1R = Lon1 * Math.PI / 180;
            double Lat2R = Lat2 * Math.PI / 180;
            double Lon2R = Lon2 * Math.PI / 180;

            // Calculate distance in radians
            double DistR =
                2 * Math.Asin(Math.Sqrt(Math.Pow(Math.Sin((Lat1R - Lat2R) / 2), 2)
                 + (Math.Cos(Lat1R) * Math.Cos(Lat2R)
                  * Math.Pow(Math.Sin((Lon1R - Lon2R) / 2), 2))));

            // Convert from radians to kilometers
            double Dist = DistR * 20001.6 / Math.PI;

            return Dist;
        }
    }

    The SQL for testing this is almost identical to the SQL for testing the T-SQL scalar user-defined function – only the function name is changed.

    SET STATISTICS TIME ON;

    SELECT     MAX(dbo.DistCLR(a.Lat, a.Lon, b.Lat, b.Lon))
    FROM       dbo.Place AS a
    CROSS JOIN dbo.Place AS b
    WHERE      a.State    = 'TX'
    AND        b.State    = 'TX';

    SET STATISTICS TIME OFF;

    The results show that this version is even faster than the T-SQL query with the computation inline, although only marginally. The CPU time is 7,798 ms, and the elapsed time is 1,459 ms. This means that the maximum-distance calculation across all the places in the United States should also become slightly faster, and indeed it does – the elapsed time is 257,081 ms, and the CPU time is 2,010,774 ms. Still very slow, so if I had to tackle this problem for a real client I would start looking for other optimizations – but in the context of this blog post, I only wanted to show that using CLR scalar user-defined functions for complex computations can sometimes perform better than trying to do those computations in T-SQL only.

    Bottom line

    In the first two parts of this series, I have shown that T-SQL scalar user-defined functions, both with and without data access, are terrible for performance. In this part, I have looked at CLR scalar user-defined functions without data access. As a rule of thumb, those are bad for performance too – but unlike their T-SQL counterparts, when complex calculations are involved, the performance hit of having to invoke the function for each individual row processed by a query can sometimes be offset by the performance gain by moving the complex calculation to the faster CLR environment. If you consider embedding a complex calculation in a CLR scalar user-defined function, I recommend extended performance testing of both the UDF and the equivalent query with the same logic in inlined T-SQL.

    Of course, performance is not always the only deciding factor. Maybe you already have the CLR function because it’s also called from your .Net code; in that case, you can save a lot of development and maintenance effort by reusing that component in your T-SQL code. Maybe you need to do something that is only possible from CLR code (like calling a web service – as long as you are aware that this will have a REALLY detrimental effect on performance!), or something that is simply so much easier in CLR code (like using standard methods for regular expression pattern matching, which would be extremely complicated to code in T-SQL). Or maybe your UDF will only be called five times per day, and never in a query that processes more than a single row. At the end of the day, you will have to make your decision, based on all the factors that are relevant in your specific scenario.

    The next part of this series will look at CLR scalar user-defined functions with data access; after that, we can finally start to cover the various kinds of table-valued user-defined functions available in SQL Server (the good and the ugly).

  • T-SQL User-Defined Functions: the good, the bad, and the ugly (part 2)

    In a previous blog post, I demonstrated just how much you can hurt your performance by encapsulating expressions and computations in a user-defined function (UDF). I focused on scalar functions that didn’t include any data access. In this post, I will complete the discussion on scalar UDFs by covering the effect of data access in a scalar UDF. Note that, like the previous post, this all applies to T-SQL user-defined functions only. SQL Server also supports CLR user-defined functions (written in a .Net language like C# or VB.Net); those are not in the scope of this blog post.

    Data access

    So far, I have been looking at an example of a UDF that only does a calculation. A very simple calculation, but this UDF served as a model for other, probably more complex UDFs that manipulate their arguments without reading additional data. However, in a database system, these are not the most common UDFs – in my experience, the majority of the UDFs in any database will read some data from some tables. Are these UDFs that do data access as bad as those that don’t? Let’s find out. In the code below, I create a new UDF that doesn’t compute the triple value, but instead looks it up in a table that holds the triple values of all numbers between -100,000 and 100,000. This mimics what I think is a rather common data access pattern in a UDF: reading a single row from a lookup table, based on the primary key value.

    CREATE TABLE dbo.Triples
       (Value int NOT NULL PRIMARY KEY,
        Triple int NOT NULL);

    -- Triples of numbers 1 - 100,000
    WITH Digits
    AS (SELECT d FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS d(d))
    INSERT INTO dbo.Triples (Value, Triple)
    SELECT 10000 * tt.d + 1000 * st.d
         + 100 * h.d + 10 * t.d + s.d + 1,
          (10000 * tt.d + 1000 * st.d
         + 100 * h.d + 10 * t.d + s.d + 1) * 3
    FROM   Digits AS s,  Digits AS t,  Digits AS h,
           Digits AS st, Digits AS tt;
    -- Triples of numbers -1 - -100,000
    INSERT INTO dbo.Triples (Value, Triple)
    SELECT -Value, -Triple
    FROM dbo.Triples;
    -- And add the number 0
    INSERT INTO dbo.Triples (Value, Triple)
    VALUES (0, 0);
    go
    CREATE FUNCTION dbo.TripFromTbl(@Input int)
           RETURNS int
           WITH SCHEMABINDING
    AS
    BEGIN;
      DECLARE @Result int;
      SELECT @Result = Triple
      FROM   dbo.Triples
      WHERE  Value = @Input;
      RETURN @Result;
    END;

    To compare the performance of this UDF versus the inline equivalent, I would normally replace the UDF call with a correlated subquery. But my test query has embedded the call to the UDF in an aggregate function, and SQL Server apparently doesn’t like to have a subquery as the argument to an aggregate function – so I have to make the code a bit more complex to work around that limitation. (I normally would have rewritten the query as a join, but that would also be an optimization, and in this case I wanted to stay as close as possible to the original and see what the optimizer does – the CTE is the direct replacement of SELECT dbo.TripFromTbl(DataVal) FROM dbo.LargeTable;).
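
    For illustration, this is the direct replacement that SQL Server refuses to accept, because an aggregate function is not allowed to contain a subquery:

    -- Rejected by SQL Server: subquery inside an aggregate function.
    SELECT MAX((SELECT t.Triple
                FROM   dbo.Triples AS t
                WHERE  t.Value = l.DataVal)) AS MaxTriple
    FROM   dbo.LargeTable AS l;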

    SELECT MAX(dbo.TripFromTbl(DataVal)) AS MaxTriple
    FROM   dbo.LargeTable;

    WITH  TheTrips
    AS   (SELECT (SELECT t.Triple
                  FROM   dbo.Triples AS t
                  WHERE  t.Value = l.DataVal) AS TheTrip
           FROM   dbo.LargeTable AS l)
    SELECT MAX(TheTrip) AS MaxTriple
    FROM   TheTrips;

    Let’s first take a look at the execution plans. More lies? Yup! But in this case, the worst culprit is not the execution plan; we’ll see far worse lies in a bit. But let’s not get ahead of ourselves. I selected the two queries, hit the execute button, and after waiting 7½ minutes, I was able to make this screenshot of the execution plans for the queries above:

    [Screenshot: actual execution plans for the two queries]

    The plan for the query without UDF is pretty smart – though not optimal. The optimizer has reversed the order in which tables are accessed. It will first read the Triples table, and if you drill down in the properties of the Clustered Index Seek operator, you will see that it has used its knowledge of the CHECK constraint on the DataVal column used in the join – only rows with Value >= 1 and <= 10 are read. In the parallel part of the plan, the large table is then read entirely and joined to the rows from the Triples table. A hash match join iterator is used, so that the table is read only once, at the cost of some CPU and memory overhead. The iterator uses right outer join semantics, to ensure that rows from the LargeTable are not lost if a value is missing from the Triples table – we know this is not the case, but the optimizer doesn’t. The results are then aggregated (to find the maximum triple in each thread), and then those maximums are aggregated again to find the maximum across all threads.

    This is not an optimal plan. The value coming from the Triples table will be the same for all rows with the same DataVal in LargeTable, so instead of reading them all, then aggregating them back to a single row per Triple value, it would suffice to read only a single row to check for the existence of a particular DataVal value in LargeTable. But that is apparently not a transformation that the optimizer is currently able to make – yet? (Are you reading, Conor?)
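
    To make that idea concrete, this is a hand-written query that expresses the plan shape I would have liked the optimizer to find: check each row of the Triples table for existence in LargeTable, instead of aggregating all ten million rows.

    SELECT MAX(t.Triple) AS MaxTriple
    FROM   dbo.Triples AS t
    WHERE  EXISTS (SELECT *
                   FROM   dbo.LargeTable AS l
                   WHERE  l.DataVal = t.Value);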

    The execution plan for the query with UDF is a lot simpler. It is, in fact, exactly the same as it was for the earlier versions of the UDF. And even though SQL Server cannot take advantage of parallelism for this query, it is still considered cheaper than the version without UDF: 44% of the total batch. Lies, of course – to see how bad the lies are in this case, I disabled the actual execution plan option and added the SET STATISTICS TIME ON statement, but in this case I also added a SET STATISTICS IO ON statement to see how much I/O the two alternatives use. I then clicked the Execute button and waited for another 7½ minutes.

    The output of the statistics IO is clear. For the version without UDF, there are 21,394 logical reads on the LargeTable (slightly more than its 21,088 pages; the parallelism causes a few pages to be read more than once), 2 reads on the Triples table, and it also mentions two WorkTables, but they have no logical reads. Pretty good. But the version with UDF again appears to win hands down: it uses 21,088 logical reads on LargeTable (obviously) – and nothing more! According to this output, the query with UDF manages to read all those triple values from the Triples table without ever accessing it! More lies? You bet! In this case, the lie is that the statistics io output never includes any I/O that is done within a UDF. So if I rewrote the UDF to do twenty table scans on a table with a billion rows each time it is called, the statistics io output would still insist that that table wasn’t read at all – though the execution time of the UDF would tell a quite different story.

    To really see the data access done by a UDF, you’ll have to have a profiler trace running while executing the query that invokes it. If you repeat this query with an active profiler trace that includes the relevant events, you’ll get an enormous amount of output – so please don’t do this on a production box! In the collected output, you’ll see a block that starts with an SP:Starting event (the function is invoked), then has two sets of SP:StmtStarting and SP:StmtCompleted events (the two executable statements in the function), and finally an SP:Completed event (the function has ended) – and this block is then repeated ten million times! In the Reads column of SP:StmtCompleted, you can also see the number of logical reads: only 2 per execution of the function, since all it does is a lookup on the clustered index key in a small table. And the best news is that you don’t even have to do any of the math yourself, because when the query finally finishes, the SQL:StmtCompleted event will show you the total I/O in the relevant columns – in this case the Reads column, which shows a grand total of 20,021,088 reads. That’s 20 million for the UDF (2 reads in each of the 10 million executions) that STATISTICS IO doesn’t report, plus the 21,088 reads on the LargeTable table that STATISTICS IO does report.

    (Note: If you need to do serious data collection for trouble-shooting a potential UDF issue, I recommend only including the SQL:StmtCompleted event with the Reads column and not including SP:StmtStarting, SP:StmtCompleted, SP:Starting, and SP:Completed. That makes the trace much more lightweight, and still gives you the information you need. Only include the other events if you need to drill down to the level of statements within the UDF, and if you do, then only run tests on very small numbers of rows).

    So, time for the ultimate measure (and the only measure that we can get a truthful report on without having to go outside Management Studio): the actual execution times. For the version without UDF, the CPU time is 7021 ms, and the elapsed time is 1406 ms. And the version with UDF took 488,470 ms CPU and 494,533 ms elapsed – a 70x increase in CPU, and even a 350x increase in elapsed time, from 1½ seconds to over 8 minutes! Do you now understand why database consultants have trouble suppressing their smile when their new customer turns out to have littered the database code with UDFs? It’s the ultimate low-hanging fruit!

    Lies, damned lies, and estimates

    The original version of the UDF, which didn’t use data access, already profited enormously from a rewrite that capitalized on the fact that there are only 10 distinct values for the argument with which it is called. This version profits even more – rewrite the query to the one below, and the execution time is down to 3838 ms CPU, 3871 ms elapsed – still no parallelism anywhere in the plan, but nevertheless a great improvement.

    SELECT MAX(dbo.TripFromTbl(DataVal)) AS MaxTriple
    FROM  (SELECT DISTINCT DataVal
           FROM   dbo.LargeTable) AS d;

    The UDF does not contain any elements that would make it non-deterministic, and indeed, the IsDeterministic property of this UDF is 1. So that is not the reason why the optimizer would not consider this transformation. So what is the reason? This time, it is not a zero-cost assumption for the UDF – take a look at the simple query below and the associated estimated execution plan:

    SELECT dbo.TripFromTbl(3);

    [Screenshot: estimated execution plan for this query]

    If you drill down the properties of the iterator, you will see that the estimated subtree cost for the UDF is 0.0032831. Not entirely accurate – it is the same cost I get for a query that directly does the corresponding select from the Triples table, so the overhead for invoking the UDF is not taken into account; but I think we can safely assume that this overhead is minimal in relation to the I/O cost anyway.

    Since the UDF is executed once from the main query, I would expect the estimated operator cost for the Compute Scalar in the query to be about the same number – but it isn’t. This cost shows up as only 0.0000001. According to the query plan, the iterator that invokes the UDF costs less than 1/100th of a percent of the UDF itself. Clearly, in this case the optimizer does know something about the cost of the UDF, but still refuses to take this into account. “Yes, I know that this UDF is not free, but for the purpose of optimization, I’ll assume it is anyway”. Why? I have no idea – but I did report it as a bug on the Connect website.
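
    (As an aside: the IsDeterministic property I mentioned above is easy to check yourself with a metadata query, for instance:

    SELECT OBJECTPROPERTY(OBJECT_ID('dbo.TripFromTbl'), 'IsDeterministic') AS IsDeterministic;

    which returns 1 for this UDF.)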

    The remedy

    In the previous post, I told you to avoid using scalar UDFs that don’t do any data access. In this post, I showed that the performance hit of scalar UDFs that do include data access is even far worse, so you won’t be surprised when I tell you that you should avoid these as well, maybe even more – with the same exceptions (a SET statement without subquery, so that the UDF is processed only once; maybe a SELECT statement on a small table in a part of your application that is not on a time-critical path).
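
    To be clear about that exception: in an assignment like the one below, the function is invoked exactly once, so the per-call overhead is irrelevant.

    DECLARE @SingleTriple int;
    SET @SingleTriple = dbo.TripFromTbl(42);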

    If the logic in the UDF is relatively simple, you can often replace the call to the UDF with a simple subquery, possibly with a CASE expression somewhere to mimic an IF … ELSE in the UDF. This is not possible if the UDF is called inside an aggregate function, as we saw at the start of this post. And if the UDF is complex, replacing it with a single subquery may also prove to be too hard, or it may result in a monster query that nobody will be able to maintain, or that due to its sheer complexity manages to throw off the optimizer. In cases like that, the best advice is to bite the bullet and do the work of completely rewriting the original query without using the UDF. Depending on the complexity of the problem, this may eventually result in breaking up the query into multiple steps, using temporary tables or table variables to store intermediate results. Not my first choice, but almost bound to perform better than any solution that calls a scalar user-defined function with data access.

    For the query I used in this blog post, my actual workaround would not be the complicated construction above, but a simple rewrite with a join:

    SELECT     MAX(t.Triple)  AS  MaxTriple
    FROM       dbo.LargeTable AS  l
    INNER JOIN dbo.Triples    AS  t
          ON   t.Value         =  l.DataVal;

    This rewrite has several advantages over the original version. The most obvious is that it’s much easier to understand and maintain. A second advantage is that the execution plan does not have to take into account the possibility that values may be missing from the dbo.Triples table (and indeed, the only difference between the plan for this query and the plan for the first query without UDF is that the Hash Match join iterator has been changed from outer to inner join). This also gives the optimizer more options to find better plans – it doesn’t in this case, and both queries perform equally; but if you add a nonclustered index on the DataVal column in the LargeTable, this query will benefit the most.
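
    (For completeness, the index I have in mind would look something like this; the index name is my own invention:

    CREATE NONCLUSTERED INDEX ix_LargeTable_DataVal
    ON dbo.LargeTable (DataVal);
    )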

    Or, if performance is really important, I would use a version that forces the optimizer to first do a DISTINCT on all the DataVal values before joining to the Triples table – less obvious to read and maintain, but in this specific case (because there are so few distinct DataVal values and so many rows in LargeTable) it is about 50% faster than the alternative:

    SELECT     MAX(t.Triple) AS MaxTriple
    FROM      (SELECT DISTINCT DataVal FROM dbo.LargeTable) AS l
    INNER JOIN dbo.Triples    AS  t
          ON   t.Value         =  l.DataVal;

    And a completely different way to tackle the performance impact of a scalar UDF with data access is to rewrite it as an inline table-valued function. As already promised in the first part of this series, I will cover this technique and discuss its benefits and drawbacks later.
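
    To give a small preview of that technique, an inline table-valued version of the lookup could look something like the sketch below (how to use it, and how it performs, is for that later post):

    CREATE FUNCTION dbo.TripFromTbl_Inline(@Input int)
    RETURNS TABLE
    WITH SCHEMABINDING
    AS RETURN
      (SELECT Triple
       FROM   dbo.Triples
       WHERE  Value = @Input);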

    What’s next?

    I have not yet decided on the next part. My original idea was to limit this series to T-SQL user-defined functions (hence the title). But a comment I received by email has made me reconsider this plan; I am now considering the possibility of discussing the effects of CLR user-defined functions as well (though the title of the blog post series would then be slightly incorrect).

    If I do decide to cover CLR functions too, then the next part will be on the scalar variety of CLR functions (or, if you wish, the CLR variety of scalar functions) to wrap up the discussion on scalar functions. If I decide to leave out CLR, my next post will be about multi-statement table-valued functions. In that post, I will show that, while scalar functions may be bad, multi-statement table-valued functions are downright ugly!

  • T-SQL User-Defined Functions: the good, the bad, and the ugly (part 1)

    So you thought that encapsulating code in user-defined functions for easy reuse is a good idea? Think again!

    SQL Server supports three types of user-defined functions. Only one of them qualifies as good. The other two – well, the title says it all, doesn’t it?

    The bad: scalar functions

    A scalar user-defined function (UDF) is very much like a stored procedure, except that it always returns a single value of a predefined data type – and because of that property, it isn’t invoked with an EXECUTE statement, but embedded in an expression where the returned value is immediately used. I won’t explain all the basics, but assume that you are either already familiar with the concept, or that you at least have read the description in Books Online. It is allowed to read (but not modify!) table data from within a scalar UDF, but in this blog post I will focus on scalar UDFs that include computations and other expressions only, without doing any data access.

    The code below defines and then uses a very simple scalar UDF that simply triples the input:

    CREATE FUNCTION dbo.Triple(@Input int)
           RETURNS int
    AS
    BEGIN;
      DECLARE @Result int;
      SET @Result = @Input * 3;
      RETURN @Result;
    END;
    go
    SELECT DataVal, dbo.Triple(DataVal) AS Triple
    FROM   dbo.LargeTable;

    This example is obviously overly simple – even the most enthusiastic devotee of isolating and reusing code will never bother to define and use a function for something so simple. But if the calculation in the function is actually very complex, it’s easy to see how code that defines the calculation once and then simply invokes the function every time it’s needed is easier to build, understand, debug, and maintain than code that repeats the complex expression at several locations. In traditional programming languages, like C# or VB.Net, it’s easy to see why using functions to encapsulate and reuse common computations is considered a best practice.

    But SQL Server isn’t a traditional programming language. It’s a declarative language, with an optimizer that has been designed to optimize the execution order within queries to make sure that the results are returned as fast as possible – without impacting correctness, of course. And that optimizer simply cannot optimize code that calls scalar UDFs as well as it can optimize code that has the same logic inlined. Let’s first take a look – in order to test the performance, I’ll create a table with 100,000 rows with random values from 1 to 10 in the DataVal column.

    CREATE TABLE dbo.LargeTable
      (KeyVal int NOT NULL PRIMARY KEY,
       DataVal int NOT NULL CHECK (DataVal BETWEEN 1 AND 10)
      );

    WITH Digits
    AS (SELECT d FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS d(d))
    INSERT INTO dbo.LargeTable (KeyVal, DataVal)
    SELECT 10000 * tt.d + 1000 * st.d
         + 100 * h.d + 10 * t.d + s.d + 1,
           10 * RAND(CHECKSUM(NEWID())) + 1
    FROM   Digits AS s,  Digits AS t,  Digits AS h,
           Digits AS st, Digits AS tt;

    The code may be a bit complex and you may be tempted to write a simple loop to insert 100,000 rows. But that would take a lot more time – the code above runs in less than 1 second on my laptop, whereas a loop takes almost five seconds. When we need more rows (later), this difference becomes even more noticeable.
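
    For comparison, the loop version would look roughly like the sketch below; I only include it to show the alternative I decided against, the set-based version above is the one I actually used.

    -- Much slower row-by-row alternative (sketch only).
    DECLARE @i int = 1;
    WHILE @i <= 100000
    BEGIN;
      INSERT INTO dbo.LargeTable (KeyVal, DataVal)
      VALUES (@i, 10 * RAND(CHECKSUM(NEWID())) + 1);
      SET @i = @i + 1;
    END;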

    The first test

    Now it’s time to test. Below are two queries. The first query is designed to calculate the triple of each DataVal value in the table. (Note that I added the MAX aggregate to ensure that the actual performance would not be impacted by the overhead of returning 100,000 rows to SSMS and rendering them in the result display). The second query is exactly the same, except that it doesn’t use the scalar UDF, but includes (“inlines”) the actual formula in the query.

    SELECT MAX(dbo.Triple(DataVal)) AS MaxTriple
    FROM   dbo.LargeTable;

    SELECT MAX(3 * DataVal) AS MaxTriple
    FROM   dbo.LargeTable;

    If you select the two queries, activate the option to include the actual execution plan, hit execute, and then switch to the execution plan tab, you may be pretty happy:

    image

    I must add that I didn’t get these plans all the time. In my first tests, the plans were equal, with the only difference being the actual content (visible only in the Defined Values property) of the Compute Scalar iterator – the arithmetic formula 3 * DataVal vs. invoking dbo.Triple; in those cases, both plans were said to have a cost of 50%. In later tests, the plans changed to the above; the call to dbo.Triple is now hidden several levels deep in the Defined Values property of the Stream Aggregate iterator, and though the same work is still done, the first query is now suddenly said to be slightly cheaper than the second. But either way, whether 50% each or 49% vs 51%, the scalar UDF seems to be a very good choice.

    However, you may not be aware that the “Actual Execution Plan” is a dirty rotten liar. Or maybe I should say that the terms “Actual Execution Plan” and “Estimated Execution Plan” are misleading. There is only one execution plan, it gets created when the queries are compiled, and then the queries are executed. The only difference between the “Actual” and the “Estimated” execution plan is that the estimated plan only tells you the estimates for how many rows flow between iterators and how often iterators are executed, and the actual plan adds the actual data for that. But no “actual operator cost” or “actual subtree cost” is added to the corresponding estimated values – and since those costs are the values that the percentages are based on, the percentages displayed in an actual execution plan are still based only on the estimates.

    To get a better idea of the actual query cost, let’s run these queries again – this time without the execution plan, but with the actual duration displayed. You can enable this display with the option Query / Query Options / Advanced / SET STATISTICS TIME. Or you can simply add the statement SET STATISTICS TIME ON; at the start of the batch (and SET STATISTICS TIME OFF; at the end). Now, if you run the queries, you’ll get some extra information returned (in the Text tab of the results pane) that tells you exactly how long the query took (elapsed time) and how much CPU time it used (CPU time). On my laptop, the query that uses the UDF takes 889 ms CPU and 900 ms elapsed, and the query with the logic inlined takes only a fraction of that: 47 ms CPU and 52 ms elapsed! Not 49% versus 51%, but 95% versus 5%.
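
    For reference, the timed batch looks like this – the same two queries as above, wrapped in the SET STATISTICS TIME statements:

    SET STATISTICS TIME ON;

    SELECT MAX(dbo.Triple(DataVal)) AS MaxTriple
    FROM   dbo.LargeTable;

    SELECT MAX(3 * DataVal) AS MaxTriple
    FROM   dbo.LargeTable;

    SET STATISTICS TIME OFF;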

    Why?

    This huge performance difference is caused by the overhead of calling and executing a scalar UDF. The computation of 3 * DataVal in the second query is entirely executed inside an iterator (the Compute Scalar), which is very efficient. The computation of dbo.Triple(DataVal) in the first query is also executed in an iterator (the Stream Aggregate, in this case) – but since this is a call to a separate module, SQL Server will have to invoke this module for each of the 100,000 rows flowing through that iterator. Each of those 100,000 calls introduces some overhead: start up the execution, step through the two executable statements of the function body, and clean up afterwards. Some sources claim that the function text is interpreted (compiled) on each call; I found that this is –at least on SQL Server 2012– not the case; when executing this code with a profiler trace running, only a single cache hit (or cache insert if the function is not in the procedure cache) event is fired.

    This overhead is invisible in the “Actual Execution Plan”, but the execution time and the profiler trace tell a different story. And so does the “Estimated Execution Plan” – if I select the query that uses the function and then request an “Estimated Execution Plan”, I get two execution plans: one for the query that we saw before, and one for the function body, with two iterators that represent the executable statements: a SET with a calculation, and a RETURN.

    image

    But note that, again, the execution plan is lying. First, it implies that the UDF is invoked only once, which is not the case. Second, look at the cost. You may think that the 0% is the effect of rounding down, since a single execution of the function costs so little in relation to the cost of accessing and aggregating 100,000 rows. But if you check the properties of the iterators of the plan for the function, you’ll see that all operator and subtree costs are actually estimated to be exactly 0. This lie is maybe the worst of all – because it’s not just the plan lying to us, it is SQL Server lying to itself. This cost estimate of 0 is actually used by the query optimizer, so all plans it produces are based on the assumption that executing the function is free. As a result, the optimizer will not even consider optimizations it might use if it knew how costly calling a scalar UDF actually is.

    Determinism

    You may have wondered why SQL Server even bothers to invoke the UDF 100,000 times. Based on the CHECK constraint, there can never be more than 10 distinct values in the DataVal column – so why doesn’t the optimizer transform the execution plan to first get the distinct values of DataVal, then call the UDF only for those? Surely, that would be more efficient? Yes, it would, as we can easily verify by making that transformation ourselves:

    SET STATISTICS TIME ON;
    SELECT MAX(dbo.Triple(DataVal)) AS MaxTriple
    FROM  (SELECT DISTINCT DataVal
           FROM   dbo.LargeTable) AS d;

    SELECT MAX(3 * DataVal) AS MaxTriple
    FROM  (SELECT DISTINCT DataVal
           FROM   dbo.LargeTable) AS d;
    SET STATISTICS TIME OFF;

    If you check the returned execution times, you will see that this technique even helps the performance of the query without function, if only by a little bit – 47 ms CPU and 50 ms elapsed on my laptop. For the version with scalar UDF, the saving is significant, as it is now almost as efficient as the version without scalar UDF: 62 ms CPU and 51 ms elapsed.

    So why does the optimizer not make this transformation by itself? There are two reasons for that. The first is that with this version of the UDF, it can’t guarantee that this transformation won’t change the results, because of a property called “determinism”. If a function is deterministic, we can be sure that when it is invoked multiple times with the same arguments, it will always return the same result. If a function is not deterministic, it might return different results, even when the same parameters are passed in. Our scalar UDF is not deterministic, as this query shows:

    SELECT OBJECTPROPERTY(OBJECT_ID('dbo.Triple'), 'IsDeterministic');

    You can check Books Online for a list of all the requirements a function has to meet to be deterministic. In our case, the only problem is that the UDF is not schemabound, so let’s remedy that:

    ALTER FUNCTION dbo.Triple(@Input int)
          RETURNS int
          WITH SCHEMABINDING
    AS
    BEGIN;
      DECLARE @Result int;
      SET @Result = @Input * 3;
      RETURN @Result;
    END;
    go
    SELECT OBJECTPROPERTY(OBJECT_ID('dbo.Triple'), 'IsDeterministic');

    The function is now marked as deterministic – but if you rerun the previous tests, you’ll see that this does not affect plan choice for these queries at all! The optimizer still won’t shuffle the plan of the first query to match that of the second, even though they are now (with the function marked deterministic) guaranteed to be equivalent. That is because there is a second reason why the optimizer won’t make this change – and that is that the optimizer thinks that invoking the function has a zero cost. Why would it even consider a complicated plan transform that saves 99,990 function calls if it thinks that those calls are free? After all, zero multiplied by 99,990 is still zero. Unfortunately, whereas we can affect determinism of a function, we cannot influence the cost estimate the optimizer uses for it.

    This same zero cost estimate leads to more bad plan choices. For instance, in the query below, the optimizer will happily invoke the scalar UDF two times for each row: once for the WHERE and once for the SELECT:

    SELECT 1 - dbo.Triple(DataVal)
    FROM   dbo.LargeTable
    WHERE  dbo.Triple(DataVal) > 20;
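
    For comparison, here is what the same query looks like with the formula inlined (the remedy I’ll discuss below); this version also evaluates the expression twice per row, but without any of the module-call overhead:

    SELECT 1 - 3 * DataVal
    FROM   dbo.LargeTable
    WHERE  3 * DataVal > 20;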

    It gets worse

    Unfortunately, these two issues (the per-row call overhead, and the bad cost estimate affecting plan choice) are not the only problems with scalar UDFs. There is a third problem: SQL Server will never use parallelism in a plan that uses scalar UDFs. This only becomes visible with larger tables, so let’s get rid of our 100,000 test rows and replace them with ten million fresh ones:

    TRUNCATE TABLE dbo.LargeTable;

    WITH Digits
    AS (SELECT d FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS d(d))
    INSERT INTO dbo.LargeTable (KeyVal, DataVal)
    SELECT 1000000 * sm.d
         + 100000 * ht.d + 10000 * tt.d + 1000 * st.d
         + 100 * h.d + 10 * t.d + s.d + 1,
           10 * RAND(CHECKSUM(NEWID())) + 1
    FROM   Digits AS s,  Digits AS t,  Digits AS h,
           Digits AS st, Digits AS tt, Digits AS ht,
           Digits AS sm;

    If we now execute our original queries again, we will see two changes over the first time, when we used 100,000 rows. The first change is that now the plans are not the same; the plan for the query with scalar UDF is still the same, but the plan for the query without scalar UDF introduces parallelism.

    image

    The second change is maybe not really a change – it’s the observation that the percentages in the plan are still way off. On my laptop, the query with UDF takes 40108 ms CPU and 43760 ms elapsed to process all ten million rows; the query without UDF does the same in 4397 ms CPU and 808 ms elapsed. Based on CPU usage, the UDF version takes 90% of the batch – the difference is slightly smaller than in the non-parallel version, because of the overhead of synchronizing all the threads and combining their results. Based on elapsed time, it’s even 98%, because all cores in my laptop are now working on the parallel query instead of just one.

    I made another interesting (and slightly disturbing) observation when I looked at the execution plans of the queries that force the optimizer to first find the distinct values of DataVal and then compute the triple of only those distinct values:

    image

    The version without UDF uses the parallel part of the plan to read all rows and find the distinct DataVal values, then computes the triple for those distinct values in the serial part. I would have expected a similar plan for the version with UDF (since the UDF would only be called in the serial part), but apparently, the mere presence of a scalar UDF in the query prevents any form of parallelism for the entire execution plan!

    The remedy

    If you care about performance, you should avoid the use of scalar UDFs, except in situations where their performance hit doesn’t hurt. For instance, in a SET statement without subquery, a UDF does not hurt you, because it will be invoked only once. And in a SELECT statement that processes only a very small table and is not part of a time-critical part of your system, the performance hit doesn’t really matter much (but don’t think that it’s safe to use a scalar UDF in a query that returns only a few rows from a large table – sometimes the optimizer will produce a plan where the evaluation of the UDF is pushed down to a part of the plan that is executed before the majority of the rows is filtered out!)
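
    For example, in an assignment like the sketch below (the variable is just for illustration), the function body is executed only once, so the per-call overhead is irrelevant:

    DECLARE @Tripled int;
    -- Single invocation of the UDF; the call overhead does not matter here.
    SET @Tripled = dbo.Triple(14);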

    The obvious workaround is to not use a scalar UDF at all, but instead inline the code. For the Triple function I used here, this is dead simple. If you have a UDF that contains multiple statements, calculating and storing intermediate results in variables, doing conditional logic based on IF … ELSE blocks, etc. – this can be quite hard. You may have to use complicated CASE expressions, and you may have to repeat expressions multiple times (or use CTEs to avoid that duplication). But the performance gain will make up for the effort! Just don’t forget to carefully comment and document the sometimes hideous queries this may result in. As an example of what I mean, look at this scalar UDF and the corresponding inline rewrite (and if you want to know what the use of this UDF is, there is none; it’s just some nonsense I made up).

    CREATE FUNCTION dbo.Nonsense(@Input int)
           RETURNS int
           WITH SCHEMABINDING
    AS
    BEGIN;
      DECLARE @Result int,
              @BaseDate date,
              @YearsAdded date,
              @WeekDiff int;
      SET @BaseDate = '20000101';
      SET @YearsAdded = DATEADD(year, @Input, @BaseDate);
      IF @Input % 2 = 0
      BEGIN;
        SET @Result = DATEDIFF(day, @YearsAdded, @BaseDate)
                    - DATEDIFF(month, @YearsAdded, @BaseDate);
      END;
      ELSE
      BEGIN;
        SET @WeekDiff = DATEDIFF(week, @BaseDate, @YearsAdded);
        SET @Result = (100 + @WeekDiff) * (@WeekDiff - 100);
      END;
      RETURN @Result;
    END;
    go

    SELECT KeyVal, DataVal,
           dbo.Nonsense(DataVal)
    FROM   dbo.LargeTable
    WHERE  KeyVal <= 100;

    WITH MyCTE
    AS (SELECT KeyVal, DataVal,
               CAST('20000101' AS date) AS BaseDate,
               DATEADD(year, DataVal, CAST('20000101' AS date)) AS YearsAdded
        FROM   dbo.LargeTable)
    SELECT KeyVal, DataVal,
           CASE WHEN DataVal % 2 = 0
                THEN DATEDIFF(day, YearsAdded, BaseDate)
                   - DATEDIFF(month, YearsAdded, BaseDate)
                ELSE (100 + DATEDIFF(week, BaseDate, YearsAdded))
                   * (DATEDIFF(week, BaseDate, YearsAdded) - 100)
           END
    FROM   MyCTE
    WHERE  KeyVal <= 100;

    The query with the logic inlined may not look pretty – but if I execute the query for all ten million rows in the table (again using the MAX aggregate to reduce I/O and rendering overhead), the performance difference makes it worth it: over a minute for the UDF version, versus less than 4 seconds for the inlined version!
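
    If you want to repeat that measurement, use a batch along these lines – the same two versions as above, but without the KeyVal filter and with the result wrapped in MAX:

    SET STATISTICS TIME ON;

    SELECT MAX(dbo.Nonsense(DataVal)) AS MaxNonsense
    FROM   dbo.LargeTable;

    WITH MyCTE
    AS (SELECT DataVal,
               CAST('20000101' AS date) AS BaseDate,
               DATEADD(year, DataVal, CAST('20000101' AS date)) AS YearsAdded
        FROM   dbo.LargeTable)
    SELECT MAX(CASE WHEN DataVal % 2 = 0
                    THEN DATEDIFF(day, YearsAdded, BaseDate)
                       - DATEDIFF(month, YearsAdded, BaseDate)
                    ELSE (100 + DATEDIFF(week, BaseDate, YearsAdded))
                       * (DATEDIFF(week, BaseDate, YearsAdded) - 100)
               END) AS MaxNonsense
    FROM   MyCTE;

    SET STATISTICS TIME OFF;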

    What’s next?

    In the next part of this series, I’ll look at how data access in a scalar UDF changes the situation – for the worse! After that, I’ll check the two types of table-valued functions. Where scalar UDFs are bad, table-valued UDFs can be downright ugly – or they can be good, depending on the type. I will also present a way to replace scalar functions with inline table-valued functions, so that you can encapsulate and reuse code without paying the gigantic performance penalty of scalar UDFs – but at the cost of complicating your query. So stay tuned!

  • Principles of Modeling: Avoid Redundancy

    In 1994, I learned a method for data modeling that is based on three principles. I immediately knew that these principles should be embraced by anyone who does any data modeling or process modeling. Or almost any other job, for that matter. I have described these principles in three previous blog posts: the Jargon Principle, the Concreteness Principle, and the Reproducibility Principle. But I have since found that there are more principles and guidelines that are important to keep in mind when modeling.

    Avoid Redundancy

    I almost hear you think: “Yes, avoid redundancy. Duh! Do you have any more open doors to kick in?”

    But there is actually more to this than meets the eye. I’ll start with how I word this guideline, and then comment on it. Note that there are actually three parts to this guideline.

    “No data model shall contain any unchecked redundancy.

    “Process models should avoid making the processor do redundant actions.

    “The modeler shall strive to avoid redundant work for the domain expert.”

    As you see, the three parts apply to three different parts of the work of a modeler: the data model, the process model, and the process to create these models.

    Redundancy in the data model

    Many sources say that there should never be any redundancy in a data model (in an OLTP workload, that is; data warehouses are a completely different game). As you can see in the first part of my guideline to avoid redundancy, my opinion is a bit more nuanced than that – I say that no data model should contain any unchecked redundancy. And yes, that means that I believe redundancy in a data model can be okay – as long as the developer is aware that it exists, has a good reason to introduce or keep it, and makes sure that this is all thoroughly documented.

    With the Concreteness Principle in mind, I will immediately give an example of a situation where I think redundancy is not only acceptable, but even necessary. Just think of the database that a bank uses for your checking account. It will probably include a table that has a single row for each account, including information such as the date it was opened, a reference to the account holder, the currency used for the account, etc. And it will also include a table with a row for each transaction, that includes the transaction date, the amount transferred, an optional description, and of course a reference to the account.

    Would you include a column for the current balance in the account table? I bet you would! And yet, this column would be redundant – after all, the current balance can easily be calculated by adding together all the values in the transaction table. So if you believe that there should never be any redundancy in a data model, you should not include this current balance column, but instead compute it whenever it’s needed. Nice in theory – but in practice, that would probably cost too much performance. Which is why every sane data modeler would (and should) include the current balance in the account table. But (s)he must also document this, and any other redundancy that (s)he decides to introduce or keep in the data model – for accountability, for review at a later time, and (maybe the most important) to ensure that the QA department will include some thorough tests to thwart the risk that this redundancy leads to inconsistent data.
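
    As a minimal sketch of what I mean (all names are made up for this example), the checked and documented redundancy could look like this:

    CREATE TABLE dbo.Account
      (AccountID       int           NOT NULL PRIMARY KEY,
       AccountHolderID int           NOT NULL,
       DateOpened      date          NOT NULL,
       CurrencyCode    char(3)       NOT NULL,
       -- REDUNDANT BY DESIGN: always equal to SUM(Amount) of this account's rows
       -- in dbo.AccountTransaction. Kept for performance; QA must test for drift.
       CurrentBalance  decimal(19,4) NOT NULL DEFAULT (0)
      );

    CREATE TABLE dbo.AccountTransaction
      (TransactionID   int           NOT NULL PRIMARY KEY,
       AccountID       int           NOT NULL REFERENCES dbo.Account (AccountID),
       TransactionDate date          NOT NULL,
       Amount          decimal(19,4) NOT NULL,
       Description     nvarchar(200) NULL
      );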

    Redundancy in the process model

    The process model is the blueprint for the program. When the process model contains redundancy, the computer has to do the same work more than once. And that is always wrong, right? Well, no. Not always.

    For example, imagine a table with information about items that have to be shipped. Three of the columns in that table would be Net weight, Tare (packaging) weight, and Gross weight. If you store all three of them as normal columns, you avoid making the computer do extra work, but at the cost of redundancy in the data model. If you don’t include the Gross weight column, or define it as a computed column, you remove the data redundancy – but now, the computer has to compute it every time you query the table. That’s redundancy in the process model. And in this case, unlike in the previous example (the checking account), the cost to compute the Gross weight when the table is queried has so little effect on performance that I (and I hope all other modelers) would immediately choose to use a computed column for Gross weight, or to not include it at all.
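
    A minimal sketch of that choice (again, the names are made up): Gross weight as a computed column, so the redundancy sits in the process model instead of the data model:

    CREATE TABLE dbo.ShipmentItem
      (ShipmentItemID int          NOT NULL PRIMARY KEY,
       NetWeight      decimal(9,3) NOT NULL,   -- kilograms
       TareWeight     decimal(9,3) NOT NULL,   -- kilograms
       -- Recomputed on every query: cheap, documented process-model redundancy
       -- instead of storing (and having to keep consistent) a third weight column.
       GrossWeight    AS (NetWeight + TareWeight)
      );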

    This is just one example to illustrate that redundancy in the process model can be okay. But (just as with redundancy in the data model) it should be a conscious and well-documented decision, not an accident. So the modeler “should avoid making the processor do redundant actions” – but not at all cost.

    Redundancy for the domain expert

    The third part of the guideline to avoid redundancy concerns itself with redundant work for the domain expert, the dialog partner of the modeler. A simpler way to say this is that you shall not ask the same question more than once. Nor any different but equivalent question.

    But it is no coincidence that this guideline is the last of the three, and that the wording is quite weak (“shall strive”). The correctness of the data and process model is always the most important. Sometimes, the developer already has the answer to a question, but the wording is a bit ambiguous. In such a case, (s)he should always go to the domain expert and make sure to get an unambiguous answer – even if the domain expert may feel this to be redundant.

    But don’t go overboard with this! There is no faster way to unemployment than to keep bugging the domain expert with countless superfluous questions. When working without any method, always keep in mind how you expect the domain expert to experience the questions you ask. And when using a method, choose a method that ensures no redundant questions are asked.

    Redundancy for the modeler

    Maybe you have been wondering why there is no fourth element to the guideline to avoid redundancy, one that applies to the modeler him- or herself. Does this mean that I think that a modeler should just repeat each task two, three, five, or even twenty times?

    Sure! Knock yourself out, go crazy!

    Well, okay, not really. If you repeat everything twenty times, you will take so much time for your work that you will pretty fast find that you no longer have any work to do.

    And yet, I do not include this in the guideline to avoid redundancy. For good reason. Two good reasons, actually. First, it’s not needed – everyone hates repeating the same work over and over again, so no guideline to avoid doing that is ever needed. And why should I include redundancy in a guideline about avoiding redundancy?

    But more important – as a result of the Reproducibility Principle, there is actually a good reason to duplicate your work. If you have finished a task and then do it a second time, the result should be the same. If it’s not, you have obviously done something wrong – either the first time you did it, or the second time. If the results are the same, that’s still no guarantee that you made no mistakes, but the chance is a lot lower. So doing a task twice can be a valuable check for errors, rather than redundant work.

    Bottom line

    The bottom line of this post is pretty simple. Redundancy is pretty bad in almost all situations, but there are always exceptions. Avoid redundancy, but not at all costs.

  • The Curious Case of the Optimizer that doesn’t

    The optimizer is the part of SQL Server that takes your query and reorders and rearranges your query to find the optimal execution plan. In theory.

     

    In practice, that doesn’t always work out well. Often, the optimizer manages to come up with brilliant ways to execute a complex query very efficiently – but sometimes, it misses an option that appears to be so simple that you can only stare in utter amazement at the execution plan before going to the Connect site.

     

    Here is an example I recently ran into. I tested it on SQL Server 2012 and on SQL Server 2008 R2, and it reproduces on both. Execute the query below in the AdventureWorks sample database, with the option to Include Actual Execution Plan enabled (Ctrl+M), or request an Estimated Execution Plan (Ctrl-L).

     

    SELECT   TerritoryID,
             Name,
             SalesLastYear,
             SalesYTD,
             RANK() OVER (ORDER BY SalesLastYear) AS Rank1,
             RANK() OVER (ORDER BY SalesYTD) AS Rank2
    FROM     Sales.SalesTerritory
    ORDER BY SalesLastYear;

    When following the flow of data (by reading the execution plan from right to left), we see that the data is first read (with an unordered clustered index scan, since there is no better index available), then sorted by SalesLastYear, so that Rank1 can be computed (using two Segment operators and a Sequence Project operator – don’t ask). After that, the rows are sorted again, now by SalesYTD, and we see another combination of two Segment and one Sequence Project, for the calculation of Rank2. And then, finally, the rows are re-sorted by SalesLastYear so that they can be returned in the requested order.

     

    Now the big question is: why does the plan include two sort operators that both sort the data in the same (SalesLastYear) order? If the task of the optimizer is to find smart ways to rearrange computation order for better performance, why doesn’t it simply compute Rank2 first and Rank1 after that? In that case, the rows are already in the SalesLastYear order after the last Sequence Project, so no extra sort is needed. The execution plan of the query below confirms this suspicion:

     

    SELECT   TerritoryID,
             Name,
             SalesLastYear,
             SalesYTD,
             RANK() OVER (ORDER BY SalesYTD) AS Rank2,
             RANK() OVER (ORDER BY SalesLastYear) AS Rank1
    FROM     Sales.SalesTerritory
    ORDER BY SalesLastYear;

    Indeed, the execution plan of this query includes only two Sort operators instead of the three we had in the first execution plan. If you include both queries in a single batch, you’ll see an estimated cost of 59% for the first query, and 41% for the second. (Some people think that the percentages shown in an Actual Execution Plan are an indication of the actual cost; that is not the case – the percentages are computed from the Estimated Subtree Cost property of the left-most SELECT operator). The SalesTerritory table is too small to measure any actual performance differences, but I tried queries with a similar pattern on the SalesOrderDetail table (121,317 rows), and on FactProductInventory in AdventureWorksDW (776,286 rows) and I did see an actual difference in execution time. No surprise, but now I know for sure!
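
    If you want to repeat those tests, the pattern on Sales.SalesOrderDetail can be reproduced with a sketch like the one below (the column choice is mine – any two different ORDER BY subclauses plus a matching ORDER BY on the query will do):

    SELECT   SalesOrderID,
             SalesOrderDetailID,
             OrderQty,
             UnitPrice,
             RANK() OVER (ORDER BY OrderQty)  AS Rank1,
             RANK() OVER (ORDER BY UnitPrice) AS Rank2
    FROM     Sales.SalesOrderDetail
    ORDER BY OrderQty;

    -- Rewritten version: the RANK() whose ORDER BY matches the query's ORDER BY
    -- is listed last, which removes one of the Sort operators from the plan.
    SELECT   SalesOrderID,
             SalesOrderDetailID,
             OrderQty,
             UnitPrice,
             RANK() OVER (ORDER BY UnitPrice) AS Rank2,
             RANK() OVER (ORDER BY OrderQty)  AS Rank1
    FROM     Sales.SalesOrderDetail
    ORDER BY OrderQty;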

     

    So, we have seen that simply reordering the two columns that use an OVER clause reduces the query cost by about 30%. How is it possible that such a simple, basic reordering strategy is not considered by the optimizer? Surely, this can only be a bug?

     

    That’s what Fabiano Neves Amorim thought when he filed this bug on Connect. But, as you can see, the bug has been closed as “By design”. That probably doesn’t mean that someone wrote a design document telling the optimizer team to make sure that OVER clauses must always be evaluated in the order in which they appear in the query, even if a different order would be cheaper. I rather think of it as “missing an optimization opportunity is not a bug; the results are still correct, just a bit slower – so we’re going to close this “bug” report”. In this case, maybe the optimization opportunity was not identified during the design phase, or it was just too hard to implement. The latter statement may sound ridiculous at first (how can such a basic rewrite be too hard?), but you have to be aware that the optimizer does not operate on the original query text, but on an internal representation that is based on mathematics and set theory. Rewrites that may be complex in the query text may be obvious in this representation, but the reverse can also be true – so I’m prepared to accept the comment that Andrew Richardson made on behalf of Microsoft to the above Connect item: that it would be fairly complicated for the Query Optimizer.

     

    That does not mean I agree with the rest of Andrew’s comment. He suggests that this is a case where we should not rely on the optimizer, but rewrite our queries ourselves, especially since it’s such an easy rewrite in this case. Well, I would agree with that, except that: (a) this missed optimization opportunity is not documented, so how are developers supposed to know that they may need to reorder columns in a SELECT clause for optimal performance? (that is one of the reasons for this blog post); and (b) the behavior of the optimizer in this situation is not documented, so it can change at any time; I’d hate to rewrite all my queries and then find that the sysadmin just installed a patch and now the optimizer always starts with the last instead of the first OVER clause (or, worse, I don’t find it and just get all the bad performance right back).

     

    However, Andrew is right in so far that, at this time, rewriting queries does consistently improve performance in all my tests. So at this time, rewriting does seem to be the right thing to do. Just keep in mind that you have to test all your queries, not only on your test server but also on your production hardware, and that you’ll have to repeat these tests on a regular basis (at least after each patch, CU, service pack, or other change).

     

    The basic rewrite pattern is simple – for each query that uses OVER clauses with different ORDER BY subclauses as well as an ORDER BY clause on the query that matches one of the ORDER BY subclauses, make sure that the OVER clause that uses the matching ORDER BY comes last in the SELECT list. If you have a client that expects the columns in a particular order, the rewrite becomes a bit more complex – in that case, you have to use a CTE that includes the OVER clauses in the optimized order, and then you can reorder the columns in the query that references the CTE – as shown in this example:

     

    WITH MyCTE
    AS (SELECT   TerritoryID,
                 Name,
                 SalesLastYear,
                 SalesYTD,
                 RANK() OVER (ORDER BY SalesYTD) AS Rank2,
                 RANK() OVER (ORDER BY SalesLastYear) AS Rank1
        FROM     Sales.SalesTerritory)
    SELECT   TerritoryID,
             Name,
             SalesLastYear,
             SalesYTD,
             Rank1,
             Rank2
    FROM     MyCTE
    ORDER BY SalesLastYear;

    These rewrites are indeed the best option for the time being – but I still think that the optimizer should be improved to do these rewrites internally. So I decided to file a new item on Connect, this time labeling it as a suggestion instead of a bug. If you agree with me that this would be a great investment of Microsoft’s engineering time, then please add your vote to this item. (Or vote it down if you think I’m just being ridiculous). But don’t stop there! Microsoft knows me; they know I’m a geek who plays around with options like this and then runs into this issue. No real production databases were hurt during the production of this blog. And if I am the only one, then, frankly, I myself will say that they have better things to do with their engineering time. However, if I know that this affects real people, I can make a much stronger case to Microsoft for getting this fixed.

     

    So – go out to your production servers and find if you use queries with this pattern (two OVER clauses with different ORDER BY and an ORDER BY on the final query), then check to see if you should rewrite them. And then report back – add a comment here or on the Connect item; share if this affected you, and how much performance you were able to gain as a result of the rewrite.

     

    If Microsoft knows that their customers would actually benefit, they’ll be much more inclined to add this improvement to the optimizer than if it’s just about me, a geek moaning about an edge case that no one in the real world would ever run into.

  • Slides and demo code for Columnstore Index session

    Almost a week has passed after SQLBits X in London, so I guess it’s about time for me to share the slides and demo code of my session on columnstore indexes. After all, I promised people I would do that – especially when I found out that I had enough demos prepared to fill two sessions!

    I made some changes to the demo code. I added extra comments, not only to the demos I could not explain and run during the session, but also to the rest, so that people who missed the session will also be able to benefit. I also found and fixed the error that caused one of my demos to fail. It turned out to be as embarrassing as it was unspectacular – somewhere along the way, I must have accidentally fat-fingered the backspace button while the cursor was on the name of an index. And if the index name doesn’t match, queries against index-related DMVs tend to produce no results. <sigh>

    After fixing this typo, I re-ran all demos and they now worked flawlessly.

    One major catch (and those who were in my session already know this). I ran my demos on a database that I got from within Microsoft, and I have no permission to redistribute this database. That means that people can only study the code, but not run it – well, okay, they can run it, on the “small” version of the database and table (change database names AdventureWorks2008DWXL and AdventureWorks2008DWBig to AdventureWorks2008DW, and change table names FactResellerSalesXL and FactResellerSalesPart to FactResellerSales), but with the size of that table, I expect the optimizer to make completely different choices for the execution plans. So while you can’t see the actual performance benefit by running the code yourself, you can still learn the patterns to use to work around the many limitations of columnstore indexes and batch mode.

    I normally prefer to use demo code that any attendee can replicate on their own test databases, but in this case, I simply did not have the time to make a realistic 100+-million row table, and I did not want to demonstrate columnstore indexes on an unrealistic and heavily skewed table or on a table that is too small to show the performance benefit.

  • 80% off for SQL Azure!

    I have spent the last three days at SQLBits X in London – a truly great experience! There were lots of quality sessions, but I also enjoyed meeting new people and catching up with old friends. One of these friends (and I hope he’s still a friend after I post this) is Buck Woody. Not only a great and humorous speaker, but also a very nice fellow – for those who don’t mind being teased every now and then.

    When we were chatting, he told me that he was planning to announce a special access code to allow attendees of the SQLBits X conference to buy a SQL Azure subscription at a true bargain price. He told me how he had worked with various departments to make sure everything was set up, so that attendees who got the code would be able to sign up for a limited time (expiring today, at midnight GMT), and how he even made sure to pass the truly Buck-style access code by Legal, just in case – and then, when he was already in London, he got an email from higher management telling him he could not do it, so he had to remove this slide from his deck and change his talk.

    However, since Buck told me that everything was already set up for the offer, and he had told me the funny special offer code, I decided today to just try and see what happens if I put it in. Who knows, maybe someone in Microsoft HQ just forgot to disable the discount code – and lo and behold, the offer actually worked! And since Buck (probably expecting the code to be disabled) has not asked me to keep this information secret, I decided to write this blog post.

    So here are the details of the offer – for a very limited time only, you can get a 6-month subscription to a 50 GB SQL Azure database for the price of £9.99/month instead of the normal price (£64.23/month) – that is a whopping 84% discount!

    To use this offer, simply go to the SQL Azure pricing page, drag the database bar to exactly 50 GB, change the currency to “British Pound (£)” (This is essential! The offer will not work otherwise. You can change the actual currency back to your currency later), and pick the “6-Month Plans” option. The price should now display “£64.23/mo”. Then, click “review + purchase”. A new popup window will now open, asking if you have a special offer code. In this popup window, enter the code “BuckWoodyIsTrulyAwesome!!!” (using no spaces, upper- and lowercase as indicated, and with all three exclamation marks), then click “Ok”, and the quoted price will change to “£9.99/mo”. After this, you can continue the setup process.

    Final note: After sharing this information with a few of my friends, I heard that for some people, the popup window did not appear. I don’t know why, and since I don’t think Buck ever intended me to share this code that he was not allowed to mention in his sessions, I am not planning to ask. If you don’t get a popup after following the steps above, check that your popup blocker is disabled; if that still doesn’t help, cancel the subscription process, unless you are willing to pay the full price. You can always try again in a year or so; maybe you’ll be luckier then.

    Finally, remember that this offer IS time-limited. You only have until midnight GMT to use it; on April 2nd, it should no longer work.

    EDIT: To prevent confusion by people who land on this page later - this post was published on April 1st, 2012, and it was an April Fools prank. The first paragraph is true: I did go to SQLBits, I did have a great time there, I did manage to catch up with Buck Woody, and he is a really nice fellow. But everything else in this post is pure fiction. There is, unfortunately, no huge savings offer for SQLBits - at least not one I am aware of.

  • Busy months ahead

    Almost two months have passed since my last blog post. And while it’s true that I’ve had (much) longer breaks, I do have a good reason now. All the time that I would normally at least in part spend on preparing new blog posts is now reserved for preparing presentations for a few upcoming events. I’ll give you an overview – who knows, maybe you’ll have a chance to attend one of them and meet me there? I’m looking forward to it!

    On Saturday, February 25, I’ll present my session on “Advanced Indexing” at SQL Saturday 108, in Redmond, WA. This session covers included columns, filtered indexes, indexed views, tools for finding the right indexes, and the dangers of over-indexing. I see several top speakers on the agenda, so this is definitely an event you don’t want to miss out on – especially considering the price! Unfortunately, I hear that the event is fully booked, but you can still get yourself added to the waiting list. If other people have to cancel, you may get a chance to get admitted.

    I’ll stay in the Redmond area the whole next week, attending the MVP Summit. Not as a presenter, but to soak up all the information the product team wishes to share with us, and to give feedback on current and planned functionality. And to have lots of fun with my fellow MVPs, of course!

    Shortly after that, I’ll be heading to SQL Saturday 115, in Lisbon, Portugal. I will first present a full-day training session on database design on Friday, March 16. The design method I’ll teach is one that observes all the principles of modelling I presented in some previous blog posts. The next day, I’ll present two sessions: the one on advanced indexing I already mentioned above, and one about the MERGE statement, that covers not only the intended use but also several other interesting possibilities of this statement – and warns about some caveats that you need to know if you want to use MERGE. This event has an awesome line-up of speakers, so if you see a chance to go to Lisbon, do so!

    And finally, I’ll head to London for SQLBits X, which lasts from Thursday, March 29 until Saturday, March 31. On Thursday, I’ll present a training day on database design, so if you can’t attend in Lisbon, you get a second chance in London. I then have a day off on Friday, which gives me a chance to attend some of the superb sessions that are scheduled for that day; on Saturday, I will present a session on columnstore indexes, a new feature introduced in SQL Server 2012 that has the potential to yield enormous performance benefits for queries that have to process large tables of read-only (or mainly read-only) data. In this session, I will peek under the hood and disclose a bit about how this new kind of index works, and why that results in such enormous performance benefits.

    So, if you were waiting for my next blog post with actual technical content, I have to disappoint you. That will probably have to wait until after these busy conference weeks are over.

  • Principles of Modeling: the Reproducibility Principle

    A year or so ago, I watched a few episodes of a Dutch television program that had an interesting format. The name of the series was (or is, I have no idea if it still runs) “Sterren op het doek” (“Stars on Canvas”). Every episode featured a Dutch celebrity, three painters, and an interviewer. For the program, the three painters each paint a picture of the celebrity (who is interviewed while posing – and because this takes a long time, the interviews were typically quite deep, and had enough material for an interesting first 15 minutes of the program). After that, the painters were given two weeks to finish their paintings. And at the end of the program, the painters each revealed their painting to the celebrity, who then got to pick one to keep for him- or herself; the remaining two portraits were auctioned off and the proceeds were given to a charity.

    What made this program intriguing was the enormous differences between the finished paintings. Not only did each painter have his or her own unique style, they also all chose to capture and emphasize different aspects of the physical appearance and different personality traits of the celebrity. They all made portraits that captured the celebrity really well, and yet they were as different as one could imagine. Even though the painters all received the same “input” (watching the celebrity pose and listening to the interview), they produced radically different “output” (paintings), while still all following the same task. It was also interesting to discuss the results with my wife. We often disagreed on which painting we liked best, and the celebrity would then sometimes pick the one we both disliked! Again, three people who, with the same input, generated vastly different output. Because when it comes to art, picking the “best” is a matter of taste.

    But what’s good in art, and in this television format, is not necessarily good elsewhere. When I fly somewhere, I don’t want the result of the pre-flight security check to depend on the engineer assigned to that task. For a given “input” (the current technical state of the plane), I want the “output” (‘clear to go’ or ‘needs repairs’) to be the same, regardless of the engineer who inspects the plane. And luckily, the airline companies are aware of this, so their safety inspection engineers use checklists that list exactly what details of the engine have to be inspected, and what value ranges are or are not acceptable.

    Data models, like paintings, often end up as decoration on office walls. But they serve a very different purpose. For a painting, hanging on the wall is the end goal, pleasing the eye of all beholders. For a data model, being stuck to the wall is only a further step in a longer process to the end product: a working database application that serves some business need. If a painting fails to please the beholder, it’s simply a difference in taste. If a data model fails to serve the business needs, a mistake has been made – and mistakes like this can cost companies large sums of money, or, in some cases, even cost lives.

    I want data models to be like the safety inspection on planes: dependent on the input only, not on the person who carries out the task. That brings me to the third principle of modeling, after the Jargon Principle and the Concreteness Principle: The Reproducibility Principle.

    The Reproducibility Principle

    “The analyst shall perform each of his or her tasks in such a way that repeating the same task with the same input shall invariably yield the same result.”

    Note that the wording of the Reproducibility Principle (which, again, is put in my own words because I don’t recall the original wording, only the meaning) does not state who repeats the task, and that is on purpose. This principle should still hold if someone else repeats the same task; the result of an information analysis should not depend on who does the analysis.

    Easier said than done?

    While writing this, I almost heard the angry outcry from all over the world: “Yeah, right! I can work in a way that almost always ensures consistent results, but some of my co-workers are blithering idiots who simply don’t get it – how can you seriously expect them to produce the same quality analysis that I make?” That’s a fair point, and maybe a weak spot in the wording of the Reproducibility Principle – but not a weak spot in the principle itself! Hang on, I’m sure it will become more clear after a few paragraphs.

    But even with coworkers who are level with your experience and knowledge, you’ll still often have different opinions about a data model. I experienced this once, many years ago, when I attended a class on data modeling. The students had to draw a data model for a given scenario, and then present it in class. To my surprise, several models that were very different were all praised by the teacher. That just didn’t make sense to me. How can two models that don’t represent the same information both be correct? Clearly, the teacher of that class did not know or care about the reproducibility principle! If that is how information modeling is taught, it’s impossible to expect the modelers to suddenly all start producing the same results on a given input.

    Does this mean that the Reproducibility Principle is good in theory but impossible to achieve in practice for information and process analysis? No! It just means that the currently accepted methods of doing information and process analysis are not equipped for this principle. They lack recipes. Not recipes for pie, as you’d find in a cookbook, but recipes for the steps in making a model – but these modeling recipes should be just like the cookbook recipes, in that they tell you exactly what ingredients you need, and how and in what order you process and combine these ingredients to get the required result: a delicious apple pie, or a completely correct data or process model.

    The only way to implement the Reproducibility Principle is to have strict recipes that govern each and every conclusion that an analyst infers from information given to him or her by the subject matter expert. With recipes, I know for sure that if I give the same information to my co-worker, the result will be either the same – or it will be different, and then we’ll sit together and go over the steps in the recipe until we see where one of us made a mistake.

    And those blithering idiots who just don’t get it? Having recipes will not suddenly change them to superstar modelers. They’ll still be idiots, they’ll still not get it. But with recipes, we no longer have to confront their bad insights with our good insights, and still disagree. Now, we can simply point out exactly where they failed to apply the recipe in the correct way, and even our bosses will agree. The blithering idiot coworkers will still be a royal pain, but fighting them will be easier if your company agrees on implementing the Reproducibility Principle by supplying recipes for everything you have to do in your role as information or process analyst.

    Putting my money where my mouth is

    Look at me, all blathering in abstract terms about how information modeling does not follow the Reproducibility Principle – and all this time, I myself am not applying the Concreteness Principle that I advertised in an earlier blog post. Time to change that!

    First, I’d like to start on the positive side. The job of creating a data model is not all up to the experience and insight of the analyst; some of the steps involved actually have very good recipes for them. Normalizing a data model is a fine example of this. For a given set of attributes (columns) and functional dependencies, the steps are clearly described. Simply apply them one by one, and the end result is always a nice, normalized data model. Unless you made a mistake somewhere (which does happen; I never said that the steps are easy, I just say that they are well prescribed!), your teacher (if still on course) or a coworker can tell you exactly where you made a mistake.

    But where do that list of attributes and their functional dependencies come from? Now we get into the realm of the bad and the ugly – none of the mainstream analysis methods have strict recipes for this. If I ask how to find the attributes for my model, the answer is always a variation of “talk to the subject matter expert”. Yeah, sure, I get that. But what questions do I ask? How do I ensure that no important attributes are missed in our conversation? And how do I separate the wheat from the chaff, so that I don’t waste time on attributes that are not relevant at all? The only answer I have ever got to that question is “experience”. And that is not an answer that I, as a fan of the Reproducibility Principle, like.

    For functional dependencies, there is a similar problem. Given a list of columns, how do I find the dependencies? Again, there is no answer that satisfies me. Most people will tell me that these dependencies are “obvious”, that you “just see” them. And indeed, in 99.9% of all cases, people will agree on the dependencies. It’s not those 99.9% that I’m concerned about; my concern is for the remaining 0.1% – those few cases where the functional dependencies are not obvious is where errors are introduced. Errors that sometimes are caught quickly, costing only a few man months of work. Or sometimes, they are not caught until it’s too late, bringing down entire companies –or, worse, killing people!– as a result.

    Bottom line

    After reading the above, you’ll understand that I only consider a modeling method to be good if it includes strict “recipe” procedures. And thinking back about my previous posts on the principles of modeling, you’ll also understand that I think those procedures should tell you not only what questions to ask the domain expert, but also tell you to use concrete examples in the jargon of the domain expert when asking them – and tell you how to do just that.

    Unfortunately, most design methods fail to embrace these principles. The only method I ever encountered that does fully embrace all these principles is NIAM – a method that, as far as I know, has only been documented in a Dutch book that has never been translated. (There are some more or less closely related methods that are documented better, but they all either lack full support for the Modeling Principles, or focus on the wrong details). Uptake of this method in the Netherlands is minimal, and virtually non-existent in the rest of the world.

    However, that has never stopped me from using it, extending it, and improving it. With all the changes I made, the method I am now using can be said to have evolved from NIAM into a personal (though obviously still NIAM-derived) method. And I am proud to announce that this personal method will be the subject of a seminar that I will be giving on March 29 in London, on the first day of SQLBits X – a conference that anyone who can should attend. Thanks, SQLBits, for giving me a platform where I can share my experience with this modeling method with others.

  • Bin packing part 6: Further improvements

    In part 5 of my series on the bin packing problem, I presented a method that sits somewhere in between the true row-by-row iterative characteristics of the first three parts and the truly set-based approach of the fourth part. I did use iteration, but each pass through the loop would use a set-based statement to process a lot of rows at once. Since that statement is fairly complex, I am sure that a single execution of it is far from cheap – but the algorithm used is efficient enough that the entire input set is processed after only a few repetitions of that statement.

    This approach runs rings (performance-wise) around all the other solutions I had previously tried, and it also produces a more efficient packing than most (but not all) of the slower alternatives. After adding the set-based iteration algorithm to the table of test results and removing all alternatives that are both slower and less efficient (except the baseline, of course), I now only have a few options left:

    image

    Based on the measurements in this table, your choices when you need to implement a bin packing algorithm appear pretty clear – either go for a very fast algorithm that wastes very few bins, or use an algorithm that’s over 5 times as slow but wastes even fewer bins, or sacrifice another 35-40% performance to save yet another 0.01% of space. I have to note, though, that for a true efficiency comparison the algorithms have to be tested with many different sets of data instead of the single set I have used in this series. If you really have to implement a bin packing algorithm and efficiency is more relevant than speed, I’d advise you to conduct those tests, using all the algorithms in all the parts I wrote, using many different sets of sample data that match the distribution of your actual data as closely as possible.

    On the other hand, if you can live with a fairly high (but not the highest) packing efficiency and need the process to finish fast, go for set-based iteration. The performance difference between this version and all the others is significant enough to safely conclude that even with other data distributions, it will still be the fastest of them all. The only caveat here is how the algorithms scale. If execution time for one algorithm grows linearly with the amount of input and another scales exponentially, then there will probably be some point where another algorithm becomes the winner. I will take a look at that in a later blog post – but first, I want to see if I can improve the existing algorithm a bit more.

    How to abuse integer division

    When I was busy trying out the variations I’ll describe later in this post, I suddenly realized that one of the queries in the original version of my algorithm uses computations with fractions instead of integers. Since integer computation is generally cheaper than computation with fractional precision, changing this should improve performance a bit. I refer to the expression that calculates the minimum amount of sessions as the total number of candidates divided by the maximum session size, rounded up. I implemented this very straightforwardly with the expression CEILING(1.0 * SUM(NumCandidates) / @MaxCandidatesPerSession). I multiply by 1.0 to convert from integer to numeric (basically a trick to save the few extra keystrokes that CEILING(CAST(SUM(NumCandidates) AS numeric(10,2)) / @MaxCandidatesPerSession) needs). Otherwise the division would be done as integer division, truncating the fractional part; the CEILING function would then have no further effect. However, it’s also possible to use a trick that effectively changes the truncation of integer division into rounding up – add the divisor minus 1 to the dividend before dividing. So I changed the formula for the computation of the minimum required number of sessions to (SUM(NumCandidates) + @MaxCandidatesPerSession - 1) / @MaxCandidatesPerSession and saved this version as SetBasedIter1. Now this is not the most executed query in the procedure, but it’s still executed a few times, and this change did indeed give me a performance improvement – albeit a tiny one.
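
    To see both variants side by side with concrete numbers (a throwaway sketch – in the real query the total is SUM(NumCandidates)):

    DECLARE @TotalCandidates         int = 18,
            @MaxCandidatesPerSession int = 10;

    SELECT CEILING(1.0 * @TotalCandidates / @MaxCandidatesPerSession)
             AS RoundedUpNumeric,     -- 2: numeric division, then round up
           (@TotalCandidates + @MaxCandidatesPerSession - 1) / @MaxCandidatesPerSession
             AS RoundedUpInteger;     -- 2: integer division made to round up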

    Improving the first estimate

    There are scenarios where the set-based iteration algorithm starts with an amount of bins that every human being will immediately recognise as insufficient. For example, if the bin size is 10 and there are three packages of size 6, everybody will immediately see that you need three bins. But my algorithm will calculate the total size of all packages (18), divide by bin size and round up to 2, and then attempt to squeeze all packages into those bins. The result will be that the outer loop will have to be executed a second time; the entire process will thus be far less efficient than it would have been if the algorithm had started with three bins.

    To profit from this observation, I created yet another version of the algorithm and called it SetBasedIter2. This one uses two methods to calculate the amount of bins needed: the already familiar one (total size divided by bin size and rounded up), but also a simple count of the number of packages that are over half the bin size. The actual amount of bins used is then the highest of those two numbers. First tests (I did more extensive tests later) indicated that this version did indeed run a bit quicker while using the exact same amount of bins.

    Encouraged by this success, I went even further. After all, we don’t need to limit ourselves to just looking at the number of packages over half the bin size. The next step is packages over one third of the bin size. Let’s for example assume that the bin size is still 10, and we now have five packages of size 4. You will immediately see that only two of those packages fit in a bin, so we need three bins. But even with the improved algorithm, we still start with one bin too few – the total size of all packages is 20, which divided by 10 and rounded up yields 2. And the number of packages over size 5 is 0, so we do indeed start with two bins, and we’ll need another iteration of the outer loop to get all packages into a bin. So SetBasedIter3 expands on SetBasedIter2 by also including a count of packages over one third of the bin size, divided by 2 and rounded up. And I then also added SetBasedIter4, which adds yet another step to get a good first estimate of the number of bins needed: the number of packages over a quarter of the bin size, divided by three and rounded up. I could have gone on like that, but I decided to stop here and do some more extensive testing first.
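
    As a sketch of where this ends up (not the actual procedure code), the SetBasedIter4 estimate boils down to taking the highest of four lower bounds. I’m assuming a dbo.Registrations table with a NumCandidates column here, and I’m ignoring the per-quarter grouping that the real algorithm performs.

    DECLARE @MaxCandidatesPerSession int = 100;    -- illustrative value
    DECLARE @NumSessions int;

    WITH Bounds AS
    ( SELECT
        -- total size divided by session size, rounded up (the integer-division trick)
        Bound1 = (SUM(NumCandidates) + @MaxCandidatesPerSession - 1) / @MaxCandidatesPerSession,
        -- registrations over half the session size each need a session of their own
        Bound2 = COUNT(CASE WHEN 2 * NumCandidates > @MaxCandidatesPerSession THEN 1 END),
        -- at most two registrations over one third of the session size fit together
        Bound3 = (COUNT(CASE WHEN 3 * NumCandidates > @MaxCandidatesPerSession THEN 1 END) + 1) / 2,
        -- at most three registrations over one quarter of the session size fit together
        Bound4 = (COUNT(CASE WHEN 4 * NumCandidates > @MaxCandidatesPerSession THEN 1 END) + 2) / 3
      FROM dbo.Registrations
    )
    SELECT @NumSessions = ( SELECT MAX(Bound)
                            FROM (VALUES (Bound1), (Bound2), (Bound3), (Bound4)) AS b(Bound) )
    FROM Bounds;

    The CASE expressions have no ELSE branch, so they return NULL for registrations below the threshold and COUNT only counts the ones above it; the “+ 1) / 2” and “+ 2) / 3” endings are the same round-up trick as before.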

    The proper way to test

    In the previous posts, I always tested with the same set of data, to enable easy comparison of the efficiency of the various algorithms. But I realised that the effect of the new versions of the algorithm depends strongly on the actual data distribution. If, for example, there are no packages over half the bin size, then the extra code to count them is just overhead that adds some CPU cycles but doesn’t change the outcome in any way. But if there are a lot of those large packages, then the extra CPU invested can pay off by saving an extra execution of the outer loop. So the only way to get a fair comparison between the different versions of the algorithm would be to run it lots of times, with different data all the time.

    I set up an automated test script that generates a new set of test data, using a different seed but the same distribution, then executes each of the algorithms ten times in a row, inserting the elapsed time and the number of bins used into a table. (So you could argue that my test method is still far from perfect – and indeed, I recommend anyone who wants to implement any bin packing algorithm in a production environment where speed and/or packing efficiency matter to set up their own tests, including ALL versions of the various algorithms, while making sure that the amount of data and its distribution match the expected production data as closely as possible.) I put this in an endless loop, tested that it worked the way I wanted, then hit the execute button and went to enjoy myself at the snooker club, followed by a good night’s sleep. The next morning, a total of 215 different sets of test data had been processed. I cancelled execution and noted the average execution time of each of the algorithms.
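
    In outline, the harness looked something like the sketch below. This is not the actual script: the procedure and table names (dbo.GenerateTestData, dbo.SetBasedIter2, dbo.TestResults, and the SessionNo column in dbo.Registrations) are assumptions for illustration only, and the real script of course timed every version of the algorithm rather than just one.

    DECLARE @Seed int, @Run int, @Start datetime2(3);

    WHILE 1 = 1                                        -- endless loop; cancel manually
    BEGIN
        SET @Seed = CHECKSUM(NEWID());                 -- new test data, same distribution
        EXEC dbo.GenerateTestData @Seed = @Seed;

        SET @Run = 1;
        WHILE @Run <= 10                               -- each algorithm ten times in a row
        BEGIN
            SET @Start = SYSDATETIME();
            EXEC dbo.SetBasedIter2;                    -- repeated for every version under test
            INSERT INTO dbo.TestResults (Algorithm, Seed, Run, ElapsedMS, NumSessions)
            SELECT 'SetBasedIter2', @Seed, @Run,
                   DATEDIFF(ms, @Start, SYSDATETIME()),
                   COUNT(DISTINCT SessionNo)           -- number of sessions (bins) used
            FROM   dbo.Registrations;
            SET @Run += 1;
        END;
    END;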

    [Table: average execution times of the SetBasedIter versions over the 215 sets of test data]

    The most surprising result was that of SetBasedIter. When I tested this same algorithm for my previous post, it took 6,741 ms. This time, the exact same code resulted in an average execution time of 2,962 ms – over 50% less. Even the highest measured execution time of this algorithm, at 4,463 ms, is much faster than my previously measured average execution time. I don’t know how to explain this. Maybe the distribution of test data in my original test case just happens to be extremely unlucky for this algorithm, or maybe I had forgotten to shut down some other program that was running on my laptop? Maybe even a bit of both? If you have any explanation to offer, please let me know!

    Other than that, the results confirmed both of my suspicions: that the integer calculation of SetBasedIter1 is a tiny bit faster than the numeric computation of the original version, and that at some point the overhead of all the extra computations in the other versions starts to exceed the average savings from the reduced number of executions of the outer loop. It turns out that this cut-off point comes pretty quickly – counting the number of packages over half the maximum bin size pays off, but also counting the number of packages over a third of the maximum bin size already makes the average execution time go up.

    How about efficiency?

    Though not included in the table above, I also monitored the number of sessions used by each of the algorithms. In most test cases, all versions of the algorithm used the same number of bins, but there were a few exceptions. Out of 215 sets of test data, 7 used 1 extra session with the algorithms SetBasedIter2, SetBasedIter3, and SetBasedIter4, and 3 others used 1 extra session with only the algorithms SetBasedIter3 and SetBasedIter4.

    The explanation for this difference is partly simple, partly not. The simple part is that, when extra bins are introduced in the first pass, the division of packages across bins will be totally different, so that bins will have different amounts of empty space remaining. And if, for instance, the remaining empty space for two bins is divided as 2+2 in one case and as 3+1 in another case, that remaining package of size 3 will take up an extra bin in the former case, but not in the latter.

    I had expected this difference to show up (that’s why I kept counting sessions in my fully automated overnight version of the test script). I had no idea beforehand how frequently I would see these differences, or which algorithms would benefit from them. So I was surprised to see how few sets of test data were actually impacted, and equally surprised that in the cases where there was a difference, it was always to the disadvantage of the versions with a better first estimate. Apparently, starting with more sessions (even if we know they will definitely be needed) is never good for the packing efficiency; for some reason the packing is less tight and more empty room remains. Again, if you have any explanation for this, please let me know!

    More optimizations

    If you have read the explanation of the set-based iteration method in my previous post on this topic, you’ll remember that at one point I wrote: “The threshold calculation, the ranking of bins by remaining capacity, and the test that a package fits the bin with the same rank may all seem pretty pointless when assigning the first bunch of packages to the bins. But the same code is reused for later iterations and then these are all important ingredients, as you will see in a bit”. But if that extra complexity is only required for the later executions of the code, then why not use separate code paths for the first and the subsequent executions? That way, the first execution can use a less complicated (and hence, I hope, faster) query.

    I decided to put that theory to the test. I created another version of the code, SetBasedIter2B, starting from SetBasedIter2 (the fastest version so far). I duplicated the code of the complex query that creates the correct number of sessions and the even more complex query that assigns registrations to sessions (which also required me to reorder some of the code), then looked at possible optimizations.

    For the query that creates new sessions, the first execution could be simplified because I don’t have to find the highest used session number for each quarter. The subsequent executions could also be simplified by changing the join type for those highest used session numbers to an inner join (the outer join was only needed for the first execution), and by removing the extra line that counts registrations over half the maximum session size – the algorithm guarantees that those will all be assigned in the first pass of the outer loop, so there’s no need to count them on subsequent passes.

    For the query that assigns registrations to sessions, the first execution of the inner loop can be simplified a lot. If it’s also the first execution of the outer loop, all sessions are still completely empty. But it’s also possible that it’s a later iteration of the outer loop, in which case there may be sessions with some empty space (but not enough for any of the remaining registrations) as well as new sessions that are completely empty. So instead of filtering on SpaceLeft > 0, I now filter on SpaceLeft = @MaxCandidatesPerSession. Since all qualifying sessions have the same amount of empty space, the ordering for the ROW_NUMBER() function is not relevant; I made this explicit in the query (it’s unfortunately not allowed to leave out the ORDER BY clause), so that the optimizer doesn’t have to do unnecessary work. But the best simplification is that I can leave out the entire threshold calculation – since all (new) sessions are still completely empty, the threshold is not relevant.
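
    A minimal sketch of those two simplifications, not the actual assignment query; the dbo.Sessions table and its SessionNo column are assumed names, and SpaceLeft stands for however the remaining capacity is stored or computed.

    DECLARE @MaxCandidatesPerSession int = 100;        -- illustrative value

    SELECT SessionNo,
           -- every qualifying session is equally empty, so the ranking order is irrelevant;
           -- ordering by a constant expression makes that explicit without a real sort
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS SessionRank
    FROM   dbo.Sessions
    WHERE  SpaceLeft = @MaxCandidatesPerSession;       -- completely empty sessions only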

    I could also have added an extra code path to separate the first execution of the inner loop during the first pass of the outer loop from the first execution of the inner loop during a later pass of the outer loop. In that case, I could have omitted the WHERE clause in both CTE expressions for the former code path. But since the columns used in these WHERE clauses are not indexed, I don’t expect any saving from this, whereas the logic to choose the right code path would increase complexity (and maybe even execution time).

    I included this version as well in the huge test on 215 sets of test data. The number of sessions for SetBasedIter2B was always the same as for SetBasedIter2, but the performance was now a bit better – 2,789 ms on average instead of the 2,884 ms for SetBasedIter2; a 3.3% saving.

    Sacrificing speed for efficiency

    At this point I had run out of ideas to optimize the algorithm any further. But I did want to try something else. When testing the versions that sometimes add a few extra sessions based on the number of registrations over half, one third, or one quarter of the maximum, I found that these would usually produce the same number of sessions, but sometimes produce one extra session. And though I could not explain it, I decided to check whether the reverse would also hold. In other words, if I deliberately start with too few sessions, will the packing efficiency increase?

    In order to test this, I created two more versions of my algorithm: SetBasedIter1BMin10 and SetBasedIter1BMin25. Both are based on SetBasedIter1, but with the performance enhancement of the SetBasedIter2B version (duplicating the queries and simplifying them for the first pass). The change I made was that the number of new sessions used on each pass is now set to 10% or 25% less than the number needed (rounded up, to prevent endless repetition of the outer loop). This guarantees that we will always need extra iterations of the outer loop, so performance will definitely suffer. But I hoped this would be offset by increased efficiency.
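
    In expression form, the change amounts to something like this sketch, where @SessionsNeeded stands in for the normal first estimate computed by the algorithm and the value is made up:

    DECLARE @SessionsNeeded int = 40;   -- illustrative value of the normal estimate

    SELECT CEILING(0.90 * @SessionsNeeded) AS NewSessionsMin10,   -- 36 instead of 40
           CEILING(0.75 * @SessionsNeeded) AS NewSessionsMin25;   -- 30 instead of 40

    -- Rounding up guarantees at least one new session per pass, so the outer loop always makes progress.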

    These two versions were included in the test with 215 sets of random data as well. The speed was as expected: 4,171 ms for SetBasedIter1BMin10 (almost 50% slower than the fastest algorithm), and 5,903 ms for SetBasedIter1BMin25 (over 111% slower). But the packing efficiency did indeed increase. For the 215 sets, the number of sessions was decreased by 0.57% on average with the Min10 version, and by 1.55% with the Min25 version. For the original set of test data, the Min10 version required 19,182 sessions; the Min25 version needed 18,986. Still not as good as the most efficient (but slow!) cursor-based versions, though; it seems my algorithm cannot beat those versions for packing efficiency.

    Overview

    After all these measurements, it’s high time for a new overview of all the relevant versions of the bin packing algorithms. I consider a version irrelevant if there is another version that is both faster and more efficient. Leaving those versions out (except the baseline), I now have the following overview. Note that the execution time for all the “SetBasedIter” versions is based on testing with 215 different sets of random data, where each test is repeated 10 times and the fastest and slowest times are discarded; the execution time for the other algorithms and the number of sessions for all algorithms are based on a single execution with the standard set of data used in the previous parts of this series.

    [Table: execution time and number of sessions for the baseline and all remaining relevant algorithm versions]

    Notes:

    1 I expect the SetBasedIter1B version to be 3-3.5% faster (~2,840 ms), but I did not include this version in my tests. It’s quite easy to write this version – simply start with the SetBasedIter2B version and remove the extra code required to count the number of registrations over half the maximum session size.

    2 Though the number of sessions for SetBasedIter2B is equal to that for SetBasedIter1 in this specific test case, I observed it to be one more in 3.2% of all my test cases. So this version is (slightly) less efficient.

    What’s next?

    When I finished the previous part of this series, I expected to be able to cover both the final improvements to the set-based iteration algorithm and the scaling tests in this part. But as I was working, I found many more improvement possibilities than I expected. This post is, again, much longer than I think a blog post should be.

    So I will have to postpone testing how the various algorithms scale to the next post in this series. Will that post conclude this series? Frankly, I don’t know yet. At this time, I have no ideas for further improvements or other approaches. However, if someone else has a good idea, feel free to let me know (in the comments or by mail). I’ll look at your ideas and test them – and if they are good, I will include them in yet another episode in this series (that was originally planned to span 5 posts at most – oops!).

  • Principles of Modeling: the Concreteness Principle

    In an earlier post, I talked about the Jargon Principle, one of three principles I learned in 1994 that have not only helped make me a better modeler, but have also proven very valuable in many other situations. Today, I will cover the second of those principles.

    The Concreteness Principle

    Again not in the exact wording in which I learned it, but paraphrased, here is “The Concreteness Principle”:

    “When communicating with the domain expert, the analyst shall avoid all abstract statements and questions and instead use concrete examples as the base of all communications; these examples shall use a notation that is familiar to the domain expert.”

    Note that the last part of the Concreteness Principle is basically a repetition of the Jargon Principle. One could argue that this part of my wording is redundant, but in this case I really don’t mind driving the point home a bit.

    What’s wrong with abstract questions?

    I have tried to create a completely new and original example to illustrate this, but I was unable to find anything better than the one I already used in one of my chapters in the first SQL Server MVP Deep Dives book, so I’ll just reuse that here.

    Since you are reading my blog on SQLBlog.com, I’ll assume that you are fairly familiar with SQL Server, and that you could play the role of domain expert if some data modeler wants to model SQL Server but doesn’t know it himself. So let me assume the role of the data modeler. I’ll forget (temporarily) everything I know about SQL Server, and ask you this fairly simple question: “Can a table with a composite primary key be referenced by a foreign key constraint from several tables?”

    Before reading on, please take a minute to ponder the question. Try to interpret what I’m asking and then decide what your answer would be.

    Give me a concrete example, then!

    The question I asked you is an abstract question. It talks about tables, primary keys, and foreign keys in general, on an abstract level. Replacing it with a concrete example is fairly simple – just substitute all these abstract terms with concrete examples. Instead of “a table” and “several tables”, I use tables TableA, TableB, and TableC; instead of “a composite primary key”, I use “Key PK_TableA on columns Foo and Bar in TableA”, etc.

    Here’s the new version of the question: “Suppose I have a table TableA with a composite primary key made up of the columns Foo and Bar, a table TableB with, amongst others, a column Foo, and a table TableC with, amongst others, a column Bar. Is it allowed to define a foreign key constraint from the columns Foo in TableB and Bar in TableC that references the combination of columns Foo and Bar in TableA?”

    Again, please take the time to read the question carefully and consider your answer. Was it still the same? Or did you actually misinterpret my original question?

    If you did misinterpret the question, now is a good time to go back and reread the original question. I did phrase it exactly as intended. I did not ask “… referenced by foreign keys from …” (plural), but “… referenced by a foreign key from …” (singular). And yet, I am sure that a majority of readers will interpret the question as if I meant it in plural: multiple keys from multiple tables. Why? Simple. You are in this business; you work with tables and foreign keys; you are bound to interpret abstract questions in a way that fits your image of the world.

    I’ll admit that I could have written the original, abstract question as “… referenced by a single foreign key from …”. That would have prevented the misinterpretation. But that would only be a logical way to phrase the question if the one asking the question assumes that without the addition of “single”, misinterpretations are likely. I would have done that in a real interview, because I already know the answer (so I actually would not have asked the question at all). But a modeler who actually does not know the answer has no way of suspecting that this misinterpretation is possible.

    Familiar notation

    In my wording of the Concreteness Principle, I also explicitly say that the concrete example should be in a familiar notation – familiar to the domain expert, that is. Even though I already rephrased the original abstract question using a concrete example, and even though I did not violate the jargon principle (both the abstract question and the one using a concrete example use the jargon of SQL Server developers, which should be familiar to any SQL Server domain expert), I am still not happy with the way I asked it.

    Why? Fair question. After all, most (if not all) readers will have understood the second version of my question the way I intended it. But I also assume that many readers will not have known the answer immediately. Most will probably first have drawn a mental image of the model I was describing. Some might even have taken out a scratchpad and pencil and drawn a real image. That’s a waste of time. And it introduces a new risk of misunderstanding. With the mental image, there is a risk of the domain expert misplacing one of the columns, and since the image exists in her head only, the modeler would not even see it. With the actual scratchpad, that risk is somewhat reduced, provided the modeler takes the time to check whether the model on the scratchpad matches his intention – and provided the modeler understands the image, and the jargon it is based upon, well enough to grasp the picture.

    But if the modeler is indeed able to read this concrete example in the domain expert’s familiar notation, would he not also be able to draw it himself? Try to recall how long it took you to understand my second question well enough to answer it. Now look at the picture below and see how long it takes you to spot what (if anything) is wrong with it:

    [Diagram: TableA with primary key PK_TableA on (Foo, Bar), and a single foreign key drawn from Foo in TableB and Bar in TableC to that key]

    I’m pretty sure that most readers will instantly see that this is not allowed. It is, again, the same question I already asked twice before. But now, you (still my domain expert) are able to see immediately what’s wrong with it, even if I don’t draw your attention to the part of the example I need you to focus on (which, in a real situation, I would do).
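
    For readers who prefer code over pictures: here is a rough T-SQL rendering of the same situation. This is only a sketch to stand in for the diagram (which, not code, is the familiar notation in this example); TableD is an extra table I introduce purely to show the legal alternative.

    CREATE TABLE dbo.TableA
    ( Foo int NOT NULL,
      Bar int NOT NULL,
      CONSTRAINT PK_TableA PRIMARY KEY (Foo, Bar)
    );
    CREATE TABLE dbo.TableB ( Foo int NOT NULL /* , other columns */ );
    CREATE TABLE dbo.TableC ( Bar int NOT NULL /* , other columns */ );

    -- Not allowed: there is no syntax for a single foreign key constraint whose columns come
    -- from both TableB and TableC. A composite foreign key is only possible when all of its
    -- columns live in one table, for example:
    CREATE TABLE dbo.TableD
    ( Foo int NOT NULL,
      Bar int NOT NULL,
      CONSTRAINT FK_TableD_TableA FOREIGN KEY (Foo, Bar) REFERENCES dbo.TableA (Foo, Bar)
    );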

    More examples

    I hope the above example is sufficient to convince you that concrete examples are a much better way to phrase a question than abstract language. This does not end at asking questions, though; it’s the same for explaining something. Who has ever helped a kid with his or her first division exercises? I’m willing to bet that almost everyone introduced food at some point in the explanation – be it a pizza that has to be shared by three persons, a bag with 10 cookies for 5 children, or whatever. Why? Simple – because those concrete examples are much easier to understand than vague abstract explanations about division. And if you need to explain the remainder as well, you simply add another cookie to the bag (or, if you’re cheap, another kid).

    And look at this blog post. I include an example about table design, and then another example about elementary school homework assistance. This whole blog post is one giant example of how to use concrete examples, rather than a long series of abstract words, if you want to be understood.

    (And for a counterexample, analyze what politicians say during interviews. Many of them like to give long series of vague abstract quotes when they have to explain unpopular decisions. That’s because they hope that most voters won’t understand what they are actually saying. They do tend to use concrete examples when claiming a success, though!)

    But he said…

    As much as I believe in my principles, I also believe in not imposing them on others. Okay, you might argue that I am now trying to convince you of their value, and that’s true. But when talking to a domain expert, I would never force him or her to use my principles. Yet I am still very aware that abstractions can easily be misinterpreted. That poses a dilemma: how can I counter this risk if the domain expert chooses to use abstractions when talking to me?

    The answer is surprisingly simple. I try to parse what the domain expert tells me, then verify my interpretation by presenting him or her with a concrete example (in familiar notation, of course) that illustrates my interpretation of his or her words.

    The above might sound a bit vague (it’s an abstract formulation, after all), so I’ll give an example. Let’s say I am interviewing a domain expert from an insurance company. At one point, he might tell me “customers who smoke or have an occupation with hazard level F or higher and who are over 40 years of age cannot buy a life insurance unless they agree to pay a 20% surcharge on their premiums”. There are several sources of misunderstanding in that statement. The mix of “or” and “and” is one of them (does the 40-year age limit apply to dangerous professions only, or to smokers as well?). The “or” might be inclusive or exclusive (though the latter is quite improbable in this context). The “over 40 years” technically implies that this does not apply to people at the age of exactly 40, but many people use these words when they mean “40 or over”. And does the age refer to the age when buying the insurance, or also to the yearly renewal? In other words, will a 38-year-old smoker be confronted with a 20% premium surcharge when he turns 41 (or 40), or does this only apply to people who are already that old when buying the policy?
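
    Just to make the first of those ambiguities tangible, here is a sketch of the two readings of the rule. The column names and the letter ordering of hazard levels are hypothetical; this is for my own illustration only, not something I would show the insurance expert.

    SELECT CustomerID,
           -- Reading 1: the age limit applies to smokers as well
           CASE WHEN (Smoker = 1 OR HazardLevel >= 'F') AND Age > 40
                THEN 1 ELSE 0 END AS Surcharge_Reading1,
           -- Reading 2: the age limit applies only to hazardous occupations
           CASE WHEN Smoker = 1 OR (HazardLevel >= 'F' AND Age > 40)
                THEN 1 ELSE 0 END AS Surcharge_Reading2
    FROM   dbo.Customers;

    -- A 35-year-old smoker with a safe occupation gets the surcharge under reading 2 but not
    -- under reading 1, which is exactly the kind of case the imaginary policies have to settle.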

    At this point, I would whip out a set of blank policy forms, fill them with imaginary customers, making sure to touch all the relevant combinations of smokers vs non-smokers, dangerous vs safe occupations, and various ages. I would then present these to the domain expert and ask him or her to verify if my interpretation of the above statement is correct or not.

    Bottom line

    I have learned that applying the Concreteness Principle is a great way to prevent miscommunication. Not only in the data modeling job, but also in many other situations – yes, even in discussions with my teenage daughter (okay, there it helps a bit; it’s not the silver bullet that all parents are so desperately looking for). Almost everyone already applies this principle unconsciously in many cases. Being aware of it has helped me to increase the number of times I use concrete examples instead of (or sometimes in addition to) vague abstractions.

    Presenting the concrete examples in familiar notation can sometimes be a bit challenging. How do you even know what notation is familiar to the domain expert? And if that notation is not familiar to you, then how could you ever present your question in that notation? Those questions are all justified, but answering them is beyond the scope of this blog post.

    If this blog post has helped you to become more aware of the way you talk with people, the way you ask questions, or the way you make an argument; if you now try to avoid abstractions and use concrete examples instead, I have achieved what I aim for. Even if you are not (or not always) able to present those examples in a notation that’s familiar to the person you are talking to.
