THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Peter Larsson

Thinking outside the box

  • Median and Weighted Median calculations

    Hi all!

    This is my first blog post here at sqlblog.com after some convincing arguments made by Adam Machanic.
    Some of you know me as Peso, and I have mostly blogged at SQLTeam.com before.

    Yesterday Joe Celko posted a query on microsoft.public.sqlserver.programming newsgroup of how to write an elegant query for Weighted Median calculation.
    He managed to get the correct results but also ended up with some pretty ugly code. Joe also wrote he had a feeling of an elegant query should exist but was not seeing it.

    Anyway, since some of Celko's books have helped me in the past, I thought I should return the favor.

    Consider this sample data


    DECLARE @Foo TABLE
            (
                x SMALLMONEY NOT NULL
            )

    INSERT  @Foo
    VALUES  (1),
            (2),
            (2),
            (3),
            (3),
            (3)

    The most common approach to calculate the median value I have seen is

    SELECT  AVG(x)
    FROM    (
                SELECT  x,
                        ROW_NUMBER() OVER (ORDER BY x DESC) AS a,
                        ROW_NUMBER() OVER (ORDER BY x) AS b
                FROM    @Foo
            ) AS d
    WHERE   b - a BETWEEN -1 AND 1

    It's a great method and it works in almost all cases. Yes, in almost all cases. There are circumstances where SQL Server will not give the correct result.
    Adam Machanic has described the problem in this Connect issue. It seems the table need some key to work properly, but as seen with Celko's sample data, there is no key in this scenario.

    x       a  b  a-b
    1.0000  6  1    5
    2.0000  4  2    2
    2.0000  5  3    2
    3.0000  1  4   -3
    3.0000  2  5   -3
    3.0000  3  6   -3

    As you can see, the difference calculated by a-b suddenly doesn't match!
    How can we overcome this behaviour? How can we write a piece of code that works all times?
    It's not that hard. See this query.

    -- Median by Peso
    SELECT  AVG(x)
    FROM    (
                SELECT  x,
                        2 * ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () AS y
                FROM    @Foo
            ) AS d
    WHERE   y BETWEEN 0 AND 2

    Yes it works! And now how to do the weighted median? Well, we follow the same approach and write this piece of code.

    -- Weighted Median by Peso
    SELECT  SUM(y) / SUM(t)
    FROM    (
                SELECT  SUM(x) OVER (PARTITION BY x) AS y,
                        2 * ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () AS z,
                        COUNT(*) OVER (PARTITION BY x) AS t
                FROM    @Foo
            ) AS d
    WHERE   z BETWEEN 0 AND 2

    I think the query qualifies as elegant. Let's see if Celko thinks the same.
    As you can see, the query is only slightly more complicated than the normal Median. I leave the math behind it as an exercise to the reader.

    //Peso

More Posts « Previous page
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement