
Maria Zakourdaev

Grouping events into intervals using hedgehog rolling style

- I have a challenging SQL problem for you! - yelled one of my coworkers one morning, entering my office and knocking down, on his way, a few pictures, a SQL Server system views map and a visitor's chair. - Much more interesting than all your alerts and twitter conversations! ("Well", I thought to myself, "it might beat the alerts but not the twitter conversations.")

The problem he presented is as follows:

There is a table containing the history of a device's (phone's or tablet's) connections to hotspots, as shown below:

[Image: sample of the device connections table]

The idea was to bind the above-mentioned events together into groups. Each group, or interval, should represent the time that a device was connected to a specific access point. Each time the device moved to a new access point, a new interval should start. If there were no events for a specific device for longer than 30 minutes, the next event should start a new interval, even if it was reported on the same access point as before.

Queries will eventually slice the data to answer questions like how many users were on the same access point at a specific time, the average time devices stayed connected to a specific access_point, and other similar business questions.
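
For example, once the IntervalId and IntervalDateEnd columns introduced later in this post are populated, the question "how many devices were connected to each access point at 10:30" could be answered along these lines (a rough sketch of such a query):

-- Count the intervals that were open at @t, per access point.
-- Interval starter rows carry their own id as IntervalId (see below).
DECLARE @t datetime = '20130101 10:30';

SELECT access_point, COUNT(DISTINCT hardware_id) AS connected_devices
FROM test_events
WHERE id = IntervalId
      AND datein <= @t
      AND IntervalDateEnd >= @t
GROUP BY access_point;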

The matter was indeed an attention-grabbing one, since the desired result cannot be achieved using regular GROUP BY logic.

First of all, here is the test data:

SELECT *
INTO test_events
FROM ( VALUES ( 1,  1, 10, '20130101 10:01' ),
              ( 3,  2, 10, '20130101 10:03' ),
              ( 4,  3, 10, '20130101 10:04' ),
              ( 5,  1, 10, '20130101 10:05' ),
              ( 6,  1, 11, '20130101 10:09' ),
              ( 7,  2, 12, '20130101 10:06' ),
              ( 8,  3, 10, '20130101 10:10' ),
              ( 9,  1, 11, '20130101 10:40' ),
              ( 10, 3, 10, '20130101 10:47' ),
              ( 11, 3, 10, '20130101 10:52' ),
              ( 12, 3, 10, '20130101 11:05' ),
              ( 13, 2, 10, '20130101 10:53' )
     ) vals (id, hardware_id, access_point, datein);

[Image: test_events sample rows]
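
Applying the rules by hand to this sample data gives the following grouping, where each interval is identified by the id of its first event:

id  hardware_id  access_point  datein  IntervalId
1   1            10            10:01   1
5   1            10            10:05   1
6   1            11            10:09   6   (access point changed)
9   1            11            10:40   9   (31-minute gap after event 6)
3   2            10            10:03   3
7   2            12            10:06   7   (access point changed)
13  2            10            10:53   13  (access point changed)
4   3            10            10:04   4
8   3            10            10:10   4
10  3            10            10:47   10  (37-minute gap after event 8)
11  3            10            10:52   10
12  3            10            11:05   10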

The most trivial and easy algorithm would be to travel through the table, ordered by device and time. For each new event in the result set, the access point and hardware_id are analyzed. If the access point differs from the device's previous event, or more than 30 minutes have passed since the previous event, we consider it an interval start and use its own id as the interval id. Keeping that interval id, we go to the next event. If the access point is the same, we use the previous event's interval id.

This logic is easily implemented using a cursor. I haven't used cursors for the last 10 years, but I wondered what the execution times would be.

I have added two columns to the table: IntervalId, to group the events into intervals, and IntervalDateEnd, to record when each interval ends, in order to satisfy the above-mentioned queries.

ALTER TABLE test_events ADD IntervalId int;

ALTER TABLE test_events ADD IntervalDateEnd datetime;

The first cursor used the aforesaid logic as-is and updated the table on every iteration. On a table with 250,000 rows I stopped it after 4 hours, when less than half of the rows had been updated.

My next cursor implementation was a bit different. First of all, I created a temporary table. Instead of updating the table on every iteration, I added a new row to the temporary table, capturing the event id and interval id. When the cursor finished, I updated the main table by joining it to the temporary results table. This solution took 1 minute (~35 sec for the cursor and ~30 sec for the table update) on a table with 250,000 rows, and 10 minutes on a table with 500,000 rows. The test was performed on a fairly slow VM, so the execution times reflect that.

The execution is quite heavy, so if you are using cursors, please make sure you are using them wisely. The difference between the two cursors is in how the results are written back: the better cursor collects the mapping in a temporary table and applies it in one final update, while the bad cursor updates test_events on every iteration.

-----------Better cursor----------

DECLARE @id int, @accessPoint int, @hardware_id varchar(50), @timestamp datetime
DECLARE @parentId int, @parentAccessPoint int, @ParentHardwareID varchar(50), @ParentTimeStamp datetime

CREATE TABLE #Mapping ( Id int, IntervalId int, IntervalDateEnd datetime );

DECLARE EventLoop CURSOR READ_ONLY STATIC FOR
        SELECT id, access_point, hardware_id, datein
        FROM test_events
        ORDER BY hardware_id, datein;

OPEN EventLoop

FETCH NEXT FROM EventLoop
INTO @id, @accessPoint, @hardware_id, @timestamp

WHILE @@FETCH_STATUS = 0
BEGIN
       -- Same device, same access point, gap of 30 minutes or less:
       -- the event continues the current interval
       IF ( @hardware_id = @ParentHardwareID
            AND @accessPoint = @parentAccessPoint
            AND DATEDIFF(mi, @ParentTimeStamp, @timestamp) <= 30 ) BEGIN
              INSERT INTO #Mapping(Id, IntervalId) SELECT @id, @parentId
       END
       ELSE BEGIN
              -- The event starts a new interval; close the previous one
              INSERT INTO #Mapping(Id, IntervalId) SELECT @id, @id

              UPDATE #Mapping
                SET IntervalDateEnd = @ParentTimeStamp
                WHERE id = @parentId

              SELECT @parentId = @id
       END

       SELECT @parentAccessPoint = @accessPoint,
              @ParentHardwareID = @hardware_id,
              @ParentTimeStamp = @timestamp;

       FETCH NEXT FROM EventLoop
       INTO @id, @accessPoint, @hardware_id, @timestamp
END

-- Close the last interval
UPDATE #Mapping
SET IntervalDateEnd = @ParentTimeStamp
WHERE id = @parentId

CLOSE EventLoop;
DEALLOCATE EventLoop;

-- Apply the collected mapping to the main table in a single update
UPDATE e
SET e.IntervalId = m.IntervalId,
    e.IntervalDateEnd = m.IntervalDateEnd
FROM test_events e
  JOIN #Mapping m
    ON e.id = m.id

-----------BAD cursor----------

DECLARE @id int, @accessPoint int, @hardware_id varchar(50), @timestamp datetime
DECLARE @parentId int, @parentAccessPoint int, @ParentHardwareID varchar(50), @ParentTimeStamp datetime

DECLARE EventLoop CURSOR READ_ONLY STATIC FOR
        SELECT id, access_point, hardware_id, datein
        FROM test_events
        ORDER BY hardware_id, datein;

OPEN EventLoop

FETCH NEXT FROM EventLoop
INTO @id, @accessPoint, @hardware_id, @timestamp

WHILE @@FETCH_STATUS = 0
BEGIN
       IF ( @hardware_id = @ParentHardwareID
            AND @accessPoint = @parentAccessPoint
            AND DATEDIFF(mi, @ParentTimeStamp, @timestamp) <= 30 ) BEGIN
              UPDATE test_events
                SET IntervalId = @parentId
              WHERE id = @id
       END
       ELSE BEGIN
              UPDATE test_events
                SET IntervalId = @id
              WHERE id = @id

              UPDATE test_events
                SET IntervalDateEnd = @ParentTimeStamp
                WHERE id = @parentId

              SELECT @parentId = @id
       END

       SELECT @parentAccessPoint = @accessPoint,
              @ParentHardwareID = @hardware_id,
              @ParentTimeStamp = @timestamp;

       FETCH NEXT FROM EventLoop
       INTO @id, @accessPoint, @hardware_id, @timestamp
END

UPDATE test_events
SET IntervalDateEnd = @ParentTimeStamp
WHERE id = @parentId

CLOSE EventLoop;
DEALLOCATE EventLoop;

I habitually try to avoid cursors, because in most cases they require more memory and produce many more locks. Therefore I needed to find a true set-based solution which, I believed, would also have much better execution times.

That's where I came up with a 4-step logic.

1st step. Find all interval starters.

By thinking about the data, you can see that each event can have only one of two possible statuses: it starts an interval or it continues an interval. We can determine this by comparing each event to the previous event reported for the same device. I can achieve this by using the LAG() window function to bring in the previous event. Here you can find more information about the analytic functions.
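
To see what LAG() brings in for every event, here is a quick preview of the building block on its own (a sketch, separate from the update that follows):

-- Each event side by side with its predecessor for the same device
SELECT id, hardware_id, access_point, datein,
       LAG(access_point) OVER (PARTITION BY hardware_id ORDER BY datein) AS prev_access_point,
       LAG(datein)       OVER (PARTITION BY hardware_id ORDER BY datein) AS prev_datein
FROM test_events
ORDER BY hardware_id, datein;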

Here is a query:

UPDATE r1
SET r1.IntervalId = r2.IntervalId
FROM test_events r1
JOIN (SELECT id, IntervalId =
             CASE WHEN
                  LAG(access_point,1) OVER (PARTITION BY hardware_id ORDER BY datein) = access_point
                  AND
                  DATEDIFF(mi,
                           LAG(datein,1) OVER (PARTITION BY hardware_id ORDER BY datein),
                           datein) <= 30
             THEN NULL  -- the event continues the previous interval
             ELSE id    -- the event starts a new interval: its own id becomes the interval id
             END
      FROM test_events
      WHERE IntervalId IS NULL) r2
  ON r1.id = r2.id
WHERE r1.IntervalId IS NULL

Unfortunately, windowed functions can be used only in the SELECT and ORDER BY clauses; otherwise I could have used LAG() directly in the SET clause of the update and the query would have been even lighter.
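
A common workaround for that restriction is to compute the windowed value inside a CTE and update through the CTE, which avoids the self-join. A minimal sketch of the idea:

-- Window functions are allowed inside the CTE, and the CTE itself is updatable,
-- so the starter id can be assigned without joining the table to itself.
WITH starters AS
(
  SELECT IntervalId,
         NewIntervalId =
           CASE WHEN LAG(access_point) OVER (PARTITION BY hardware_id ORDER BY datein) = access_point
                     AND DATEDIFF(mi, LAG(datein) OVER (PARTITION BY hardware_id ORDER BY datein), datein) <= 30
                THEN NULL
                ELSE id
           END
  FROM test_events
  WHERE IntervalId IS NULL
)
UPDATE starters SET IntervalId = NewIntervalId;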

 

[Image: events after step 1, interval starters marked]

 

This step took 7 sec on 250,000 rows and 10 sec on 500,000 rows.

 

Now I need to link the rest of the events to their intervals. In order to do that, I can use this simple query: for each event that continues an interval, it finds the closest preceding event that is an interval starter, using the sequential id column, which is the PK of the table.

 

UPDATE e1
SET IntervalId = (SELECT TOP 1 IntervalId
                  FROM test_events e2
                  WHERE e2.hardware_id = e1.hardware_id
                        AND e2.access_point = e1.access_point
                        AND e2.id < e1.id
                        AND IntervalId IS NOT NULL
                  ORDER BY e2.datein DESC)
FROM test_events e1
WHERE IntervalId IS NULL

Although at first look this query looks better than the cursor, in fact it is not entirely set-based. For each event in the table, the subquery searches for the interval starter by going back to the table. This step took 2 minutes on a table with 250,000 rows and 6 minutes on a table with 500,000 rows.

Which means I need another solution.

If you take a look at the example below, you can clearly see that each interval belongs to one of two categories:

    • Closed intervals: there are further appearances of the same hardware_id on a different access_point (marked in blue)
    • Open intervals: there are no further appearances of the same hardware_id on a different access_point (marked in green)

[Image: closed intervals (blue) and open intervals (green)]

2nd step. Link the events within the closed intervals.

For each interval starter, find the next interval starter's time. We can easily find the next value by using the LEAD() window function.

Together with the current timestamp, this defines the interval boundaries: from the CurrentTime up to, but not including, the NextIntervalStartTime.

This step took 15 sec on the 250,000-row table and 25 sec on the 500,000-row table.

UPDATE e1
SET e1.IntervalId = e2.IntervalId
FROM test_events e1
  JOIN (SELECT *, NextEventStart = LEAD(datein) OVER (PARTITION BY hardware_id ORDER BY datein)
        FROM test_events
        WHERE IntervalId IS NOT NULL) e2
    ON e1.hardware_id = e2.hardware_id AND e1.access_point = e2.access_point
       AND e1.datein >= e2.datein AND e1.datein < e2.NextEventStart
WHERE e1.IntervalId IS NULL

[Image: events after step 2, closed intervals linked]

3rd step. Link the events within the open intervals.

 

For each interval starter we need to find the last timestamp for the same device on the same access point. The analytic function LAST_VALUE() helps us achieve that. Take into consideration that the default window frame of this function includes only preceding rows, so you need to explicitly specify the RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING option.
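
To illustrate the pitfall, here are the two frames side by side (a quick demonstration sketch):

-- With ORDER BY and no explicit frame, the window ends at the current row,
-- so LAST_VALUE() simply returns the current row's datein (or its last peer's).
SELECT id, datein,
       LAST_VALUE(datein) OVER (PARTITION BY hardware_id, access_point
                                ORDER BY datein) AS default_frame,
       LAST_VALUE(datein) OVER (PARTITION BY hardware_id, access_point
                                ORDER BY datein
                                RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS last_in_group
FROM test_events;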

 

Together with the current timestamp, this defines the interval boundaries, which lie between the CurrentTime and the LastTimeForThisDevice.

UPDATE e1
SET e1.IntervalId = e2.IntervalId
FROM test_events e1
  JOIN (SELECT *, LastTimeForThisDevice = LAST_VALUE(datein) OVER (PARTITION BY hardware_id, access_point
                                                                   ORDER BY datein
                                                                   RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
        FROM test_events) e2
    ON e1.hardware_id = e2.hardware_id AND e1.access_point = e2.access_point
       AND e1.datein >= e2.datein AND e1.datein <= e2.LastTimeForThisDevice
WHERE e1.IntervalId IS NULL

 

This query took 25 sec on 250,000 rows and 45 sec on 500,000 rows.

 

[Image: events after step 3, open intervals linked]

4th step. Grab all the Interval End Dates.

To find the interval end dates, we need one more query; it took 30 sec on 500,000 rows.

UPDATE e1
SET e1.IntervalDateEnd = e2.IntervalDateEnd
FROM test_events e1
  JOIN (SELECT id, IntervalDateEnd = LAST_VALUE(datein) OVER (PARTITION BY hardware_id, IntervalId
                                                              ORDER BY datein
                                                              RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
        FROM test_events) e2
    ON e1.id = e2.id

The total set-based solution time is about 2 minutes. The next step would be to identify good indexes that will support the queries, and crawler updates that will take care of fresh data without slowing down the inserts or causing page fragmentation.
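
As a starting point for that index tuning, one plausible candidate (an assumption, not a tested recommendation) would cover the columns the queries partition, order and filter by:

-- Hypothetical candidate index: the queries partition by hardware_id,
-- order by datein and filter on IntervalId.
CREATE INDEX IX_test_events_hardware_datein
    ON test_events (hardware_id, datein)
    INCLUDE (access_point, IntervalId);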

You may think that since the (simpler to write) second cursor took only(!) 10 minutes to finish, there would not be much reason for all the hassle. However, a set-based solution is far more scalable, which is extremely important these days, when every small table tends to turn into BIG data.

Forcing a set-based technology into iterative functionality is not graceful at all. Think about hedgehogs, who would never use their tiny paws to descend from a climb. They typically roll into a ball and roll down.

May all your TSQL solutions roll.

[Image: baby hedgehog]

Published Monday, November 18, 2013 3:50 PM by Maria Zakourdaev


Comments

 

Uri Dimant said:

I cannot see the IntervalId column. Is it the ID column?

November 19, 2013 1:27 AM
 

Uri Dimant said:

Does not matter, I see you added them.

November 19, 2013 1:29 AM
 

Maria Zakourdaev said:

Hi Uri, you are correct, the ALTER TABLE script was part of the cursor script. I will move it out of there.

November 19, 2013 3:47 AM
 

Uri Dimant said:

Hi Maria

Do you really need to update those columns?

WITH cte
AS
(
SELECT *,
       ISNULL(
         (SELECT MIN(datein)
          FROM test_events AS S3
          WHERE S3.datein >= S1.datein
                AND S3.hardware_id = S1.hardware_id
                AND S3.access_point = S1.access_point
                AND ISNULL(
                      DATEDIFF(
                        minute,
                        S3.datein,
                        (SELECT MIN(datein)
                         FROM test_events AS S4
                         WHERE S4.datein > S3.datein
                               AND S4.hardware_id = S3.hardware_id
                               AND S4.access_point = S3.access_point)), 30) >= 30),
         datein) AS datein_last
FROM test_events AS S1
)
SELECT id, hardware_id, access_point, datein,
       MIN(id) OVER (PARTITION BY hardware_id, access_point, datein_last) AS intervalID
FROM cte

November 19, 2013 8:33 AM
 

Maria Zakourdaev said:

Nice try, I was there. This table will potentially get a few million rows daily, if not more. Of course I could use the same queries that I used in the updates, but pre-calculating those columns greatly speeds up the analytic queries, because they no longer need to go back and forth over the result set.

Your query, if proposed for the crawler that updates the table, takes more than 20 minutes to finish on the 500,000-row table (I stopped it after 20 minutes).

November 19, 2013 10:05 AM
 

Geri Reshef said:

Unfortunately you didn't supply the desired output.

Is every connection considered closed when the device gets connected to a new hardware, or after 30 minutes with no other activity (= a new connection)?

I'm not sure I understood the rules.

November 20, 2013 7:44 AM
 

Maria Zakourdaev said:

Geri, the initial question was to group the data as fast as possible, and this task was accomplished. More information directly related to those groups (from different data sources) exists and will be linked to them later on.

In addition, more information about the intervals, like IntervalEndTime, and maybe even more pre-calculated values, will allow them to query the data, for example to calculate the average time users were connected to a hotspot, or the number of users in specific time frames.

Every connection is considered closed when the device moves to a new hotspot or after 30 minutes with no activity.

November 20, 2013 8:08 AM
 

Uri Dimant said:

Another one,

WITH cte
AS
(
SELECT *, CASE WHEN ABS(DATEDIFF(minute,
                        COALESCE(COALESCE(LAG(datein,1)  OVER (PARTITION BY hardware_id, access_point ORDER BY id),
                                          LEAD(datein,1) OVER (PARTITION BY hardware_id, access_point ORDER BY id)),
                                 datein),
                        datein)) >= 30
               THEN id END dt
FROM test_events
), cte1
AS
(
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY hardware_id, access_point, dt ORDER BY (SELECT 0)) rn
FROM cte
)
SELECT id, hardware_id, access_point, datein,
       CASE WHEN dt IS NOT NULL
            THEN dt
            ELSE COALESCE((SELECT MIN(dt)
                           FROM cte1
                           WHERE id <= t.id AND rn <= t.rn
                                 AND hardware_id = t.hardware_id
                                 AND access_point = t.access_point),
                          MIN(id) OVER (PARTITION BY hardware_id, access_point))
       END AS intervalid
FROM cte1 t

November 20, 2013 9:31 AM
 

Maria Zakourdaev said:

Thank you Uri, I'm adding your cool queries to the other 20 I wrote myself while working on this task. All of them produce the required result, and surely pre-calculation is only one of many possible solutions, but the developers explicitly wanted those groups to be in the data, each with a unique id, because they want to manage them as an additional entity. Furthermore, having this data pre-calculated makes the queries much easier for the developers to write and understand.

The reason I ended up with 4 queries instead of one is that I tried to avoid correlated subqueries; the larger the table, the bigger the performance gain.

November 20, 2013 10:16 AM
 

Maria Zakourdaev said:

Your second query has been running for 30 minutes now and is still working. The 4 queries above take 2 minutes in total.

November 20, 2013 10:19 AM
 

Uri Dimant said:

Yes, sometimes we need to write more code in order to get things better...

November 20, 2013 11:05 AM
 

Geri Reshef said:

The second one of Uri's seems to be efficient, as it involves only one table scan.

With correct indexing, I believe its performance would be better than that of the non-set-based one.

November 20, 2013 11:44 AM
 

Michael Zilberstein said:

Maria, interesting and very useful - I have almost the same task in one of my databases. Several questions:

1) For the 2nd step, did you try a self join instead of LAG? I was once very surprised to find out that an old solution can be more efficient.

http://sqlblog.com/blogs/michael_zilberstein/archive/2012/03/14/42332.aspx

2) The entire task is very similar to the "islands" problem - your island's definition is "a particular device connected to a particular access point, with less than a 30-minute gap between events". Did you try one of the "islands" algorithms? (Personally, I've used this solution for my client.)

December 3, 2013 2:59 AM
 

Maria Zakourdaev said:

Michael, when I performed my tests, self joins behaved very slowly. Maybe because my joins involved data sorting and filtering, which is more difficult than getting the previous row from the clustered index.

Regarding the "islands" algorithm - I really was using similar logic; I just needed to persist the islands in the database, which was the major requirement from the developer. I will test the queries; it will be interesting to see the execution times. Thanks!

December 3, 2013 8:02 AM


About Maria Zakourdaev

The shortest word in the English language that contains the letters: abcdef is… feedback! Let me know if I have touched something that is alive in the cosmos.