This is my first blog post here at sqlblog.com after some convincing arguments made by Adam Machanic.
Some of you know me as Peso, and I have mostly blogged at SQLTeam.com before.
Yesterday Joe Celko posted a query on microsoft.public.sqlserver.programming newsgroup of how to write an elegant query for Weighted Median calculation.
He managed to get the correct results but also ended up with some pretty ugly code. Joe also wrote he had a feeling of an elegant query should exist but was not seeing it.
Anyway, since some of Celko's books have helped me in the past, I thought I should return the favor.
Consider this sample data
DECLARE @Foo TABLE
x SMALLMONEY NOT NULL
The most common approach to calculate the median value I have seen is
ROW_NUMBER() OVER (ORDER BY x DESC) AS a,
ROW_NUMBER() OVER (ORDER BY x) AS b
) AS d
WHERE b - a BETWEEN -1 AND 1
It's a great method and it works in almost all cases. Yes, in almost all cases. There are circumstances where SQL Server will not give the correct result.
Adam Machanic has described the problem in this Connect
issue. It seems the table need some key to work properly, but as seen with Celko's sample data, there is no key in this scenario.x a b a-b
1.0000 6 1 5
2.0000 4 2 2
2.0000 5 3 2
3.0000 1 4 -3
3.0000 2 5 -3
3.0000 3 6 -3
As you can see, the difference calculated by a-b suddenly doesn't match!
How can we overcome this behaviour? How can we write a piece of code that works all times?
It's not that hard. See this query.-- Median by Peso SELECT AVG(x)
2 * ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () AS y
) AS d
WHERE y BETWEEN 0 AND 2
Yes it works! And now how to do the weighted median? Well, we follow the same approach and write this piece of code.
-- Weighted Median by Peso
SELECT SUM(y) / SUM(t)
SELECT SUM(x) OVER (PARTITION BY x) AS y,
2 * ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () AS z,
COUNT(*) OVER (PARTITION BY x) AS t
) AS d
WHERE z BETWEEN 0 AND 2
I think the query qualifies as elegant. Let's see if Celko thinks the same.
As you can see, the query is only slightly more complicated than the normal Median. I leave the math behind it as an exercise to the reader.