Some problems can only be solved by brute-forcing every possible combination. The problem with such an approach, is that execution time grows exponentially as the amount of input data grows – so that even on the best possible hardware, you will get inacceptable performance once the input data goes beyond the size of a small test set. These problems are called “NP-complete” or “NP-hard”, and the most viable way to deal with them is to use an algorithm that finds a solution that, while not perfect, is at least good enough – with a performance that, while not perfect, is at least fast enough.
The bin packing problem is one of those problems. It is basically a very simple problem – for example, you are given a number of packages, each with its own weight, and an unlimited number of bins with a given maximum weight capacity. Your task is to use as little bins as possible for packing all packages. There are various situations in real life where some sort of bin packing is required – such as loading trucks to transport all freight in as little trucks as possible, assigning groups of people to rooms, cutting forms from raw material (this is a two-dimensional variation of bin packing), etc.
Back in 2004 and 2005, when I had just started answering questions in the SQL Server newsgroups, I replied to some questions that were essentially a variation on the bin packing problem – one involving packages (with a weight) and trucks (with a maximum load capacity); the other involving inviting families to dinner. I came up with an algorithm that combined set-based and iterative characteristics, and managed to find a very efficient distribution of packages, at a very good performance. I wanted to share this algorithm ever since I started blogging; the reason I haven’t done so yet is that there is much more to say about this category of problems, and that I never before found the time to investigate and describe it all.
This is the first part of what will become a series of posts investigating all possible (and some impossible) solutions to this problem, including some new possibilities (such as SQLCLR) that have only become available since SQL Server 2005 was released.
The sample scenario
As a sample scenario for testing various algorithms, I decided to stray from the packages and trucks, and switch to examinations. The scenario outlined here is imaginary, though it is (very loosely) based on a Dutch examination institute that I have done some work for, several years ago.
ImEx (Imaginary Examinations Inc.) is responsible for various certifications. It does not teach students, but it does define what candidates need to know, publish model exams, and (of course) take exams. The latter activity is four times per year. Candidates register for one of the many subjects available. ImEx only has a small office for its staff, so it has to rent a room in a conference centre where the exams are takes. This room has a maximum capacity of 100 seats; there are more candidates, so the room is rented for a period of time; during that time, ImEx will hold two examination sessions per day. All candidates that take an exam in the same subject have to be seated in the same session, since an expert on that subject has to be available to answer questions and settle disputes. However, candidates for different subjects can be combined in the same session, as long as the maximum capacity is not exceeded. Since the rent for this room is high, ImEx wants to seat all registered candidates in as little sessions as possible.
The script file “Create DB + tables.sql” (see attached ZIP file) contains the SQL to create a database for ImEx, and to create the two tables that we will focus on in this series. The table dbo.Registrations holds the registrations – not the individual registrations (imagine that they are in a different table, that is not relevant for the bin packing problem), but the aggregated number of registrations per subject for each quarter. The table dbo.Sessions holds the sessions that will be held in each quarter. One of the columns in dbo.Registrations is a foreign key to dbo.Sessions; this column is NULL at the start and has to be filled with a link to the session in which each subject will be examined. The table dbo.Sessions has a column SpaceLeft that is equal to (100 – SUM(NumCandidates) of registrations appointed to this session); this is of course just a helper column, included solely for performance.
Another script file, “Generate test data.sql” (also in the attached ZIP file), fills the table dbo.Registrations with a set of randomly generated data. I use different data distributions for each of the four quarters, so that I can test the various bin packing algorithms for evenly distributed data (Q1), for data with more small groups and less large groups (Q2), for data with more large groups and less small groups (Q3), and for data with a non-linear distribution of group size – very few small (4 – 10) and large (70 – 80) groups; many average (35 – 45) groups (Q4). I seed the random number generator with a known value at the beginning of the script to get reproducible results, so that I can make meaningful comparisons when regenerate the data to test a new algorithm. For “truly” random data, you’ll have to remove this statement – or you can use a different seed value to see if different data makes much difference for the results.
Setting a baseline: the first attempt
I’ll conclude this first post of the series with a first attempt at solving the problem. One that is probably neither the fastest, nor the most effective. This first version will than act as a baseline to compare future attempts to. I will measure both speed (how fast does it complete on a given set of test data) and effectiveness (how many sessions does it create for a given set of test data). I hope to find an algorithm that yields the maximum effectiveness while still exceeding the speed of other algorithms, but it’s more likely that I’ll end up having to choose for a trade-off between speed and effectiveness.
This first attempt is based on mimicking how a human would approach this problem – inspect each registration in turn, and assign it to a session that still has sufficient capacity if one exists, or to a new session otherwise. Implementing this algorithm in T-SQL results in the code that you find in the enclosed ZIP file in Baseline.sql. The code is pretty straightforward. Since the algorithm will inspect the registrations one by one, it is all centred around a cursor over the registrations table – of course, using the fastest cursor options available (see this blog entry for details). I also experimented with the “poor man’s cursor” (see this post), but in this case the real cursor turned out to be (slightly) faster.
I want the session numbers to be sequential and starting from 1 within each quarter. There are two ways to do that – either query the Sessions table for the MAX(SessionNo) within the current quarter each time a new session is added, or use a variable that I increment for each new session, and that I reset when a new quarter starts. I chose the latter, since variable manipulation is lots cheaper than accessing the Sessions table.
At the heart of the procedure is the SELECT TOP 1 query that I use to find a single session that has enough space left to accommodate the current registration. Since there is no ORDER BY in this query, the results are not deterministic – that is, I know it will return a session with enough space left if there is one, but if there is more than one session with enough space left, no one can predict which one will be returned, nor whether consecutive runs will yield the exact same results. Many details, such as the number of processors available, workload, and other factors, can influence the query execution plan that is used, so don’t be surprised if you get different results when testing this code on your machine. I could make this query return deterministic, reproducible results – but that would affect both performance and efficiency of the procedure, so I left that for the next part of this series.
The test setup
To test this and all future algorithms, I created a generic stored procedure for testing, called dbo.PerfTest (see PerfTest.sql in the attached ZIP file). In this stored procedure, I first wipe clean any results that may have been left behind by the previous run. Then I make sure that both the data cache and the procedure cache are empty. And then, I call the procedure I need to test (which is passed to dbo.PerfTest as a parameter and assumed to be in the dbo schema), making sure to note the time the call is made and the time the procedure returns control. The difference in milliseconds is then returned to the client, as the duration of the procedure to be tested.
The script file RunTest.sql is the file I actually execute to do the tests. Whenever I need to test a new algorithm, I only have to change the name of the stored procedure to test in the line that calls dbo.PerfTest and then I can hit the execute button and sit back. When the procedure finishes, it displays two result sets – one from dbo.PerfTest displaying the duration of the test procedure in milliseconds; the second generated by the code in RunTest.sql to assess the efficiency of the algorithm by comparing the number of sessions, the average session size, and the average number of free seats per quarter for each of the quarters and overall.
As was to be expected, the speed of my first attempt is abysmal. For the 40,000 registrations in my randomly generated test set, the average elapsed time for 5 test runs was 73,759 milliseconds. Faster than when I had to do it by hand, but that’s about all I can say in favour of this “speed”.
The efficiency of this algorithm turned out to be pretty good. The tightest packing ratio is achieved with the data for quarter 2, that consists mainly of small groups. Quarter 3, with an overdose of big groups, turns out to be much more challenging for this algorithm. Even though the average group size for quarter 4 is slightly smaller than that of quarter 1, it is harder to pack because the bell curve results in a much lower number of small groups that can be used to fill those last few seats in an almost packed session. Here are the full results:
Quarter NumSessions AvgSessionSize AvgEmptySeats
------- ----------- ---------------- ----------------
1 5271 95.367482 2441.800000
2 2825 98.928849 302.600000
3 6871 88.074225 8194.200000
4 4490 93.207349 3049.900000
ALL 19457 92.810556 3497.125000
As I said before, this is just a baseline. I have already investigated several other algorithms and intend to investigate even more – such as improving the cursor code (for both speed and efficiency), using a set based solution, combining set based and cursor based techniques, and employing the CLR. Watch this space for the next episode!