Sql-server – Performance comparison between using Join and Window function to get lead and lag values

Tags: join, performance, query-performance, sql-server, sql-server-2016, window-functions

I have a table with 20M rows, and each row has 3 columns: time, id, and value. For each id and time, there is a value for the status. I want to know the lead and lag values of a certain time for a specific id.

I have used two methods to achieve this. One method uses a join, and the other uses the window functions lead/lag with a clustered index on id and time.

I compared the performance of these two methods by execution time. The join method takes 16.3 seconds and the window function method takes 20 seconds, not including the time to create the index. This surprised me because the window function seems to be the more advanced approach, while the join method is brute force.

Here is the code for the two methods:

Create Index

create clustered index id_time
 on tab1 (id,time)

Join method

select a1.id, a1.time,
   a1.value as value,
   b1.value as value_lag,
   c1.value as value_lead
into tab2
from tab1 a1
left join tab1 b1
on a1.id = b1.id
and a1.time-1= b1.time
left join tab1 c1
on a1.id = c1.id
and a1.time+1 = c1.time

IO statistics generated using SET STATISTICS TIME, IO ON:

Here is the execution plan for the join method

Window Function method

select id, time, value, 
   lag(value,1) over(partition by id order by id,time) as value_lag,
   lead(value,1) over(partition by id order by id,time) as value_lead
into tab2
from tab1

(Ordering only by time saves 0.5 seconds.)
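
A minimal sketch comparing the two queries on a tiny data set. It runs against an in-memory SQLite database through Python's sqlite3 module, which is an assumption made only to keep the example self-contained, so it says nothing about SQL Server timings. The sample rows are hypothetical. The point it illustrates is semantic: the join looks up time-1 and time+1, so it agrees with lag/lead only when each id has consecutive integer times.

```python
import sqlite3

# Hypothetical sample data: id 1 has consecutive times, id 2 has a gap (no time = 2).
ROWS = [(1, 1, 10), (1, 2, 11), (1, 3, 12),
        (2, 1, 20), (2, 3, 22)]

# Self-join version: neighbours are found by arithmetic on time.
JOIN_SQL = """
SELECT a1.id, a1.time, a1.value,
       b1.value AS value_lag,
       c1.value AS value_lead
FROM tab1 AS a1
LEFT JOIN tab1 AS b1 ON a1.id = b1.id AND a1.time - 1 = b1.time
LEFT JOIN tab1 AS c1 ON a1.id = c1.id AND a1.time + 1 = c1.time
ORDER BY a1.id, a1.time
"""

# Window version: neighbours are the previous/next row in time order.
WINDOW_SQL = """
SELECT id, time, value,
       LAG(value)  OVER (PARTITION BY id ORDER BY time) AS value_lag,
       LEAD(value) OVER (PARTITION BY id ORDER BY time) AS value_lead
FROM tab1
ORDER BY id, time
"""

def run_both(rows):
    """Load rows into a throwaway table and run both queries."""
    con = sqlite3.connect(":memory:")  # window functions need SQLite 3.25+
    con.execute(
        "CREATE TABLE tab1 (id INTEGER, time INTEGER, value INTEGER, "
        "PRIMARY KEY (id, time))"
    )
    con.executemany("INSERT INTO tab1 VALUES (?, ?, ?)", rows)
    joined = con.execute(JOIN_SQL).fetchall()
    windowed = con.execute(WINDOW_SQL).fetchall()
    con.close()
    return joined, windowed

joined, windowed = run_both(ROWS)
for j, w in zip(joined, windowed):
    print(j, w, "same" if j == w else "DIFFERENT")
```

For id 1 both queries return the same rows. For id 2 the join finds no neighbours across the gap, while lag/lead still return the adjacent rows, so the two methods are interchangeable only for gap-free data.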

Here is the execution plan for Window function method

IO statistics for the window function method:

I checked the data in sample_orig_month_1999 and it seems that the raw data is well ordered by id and time. Is this the reason for the performance difference?

It seems that the join method has more logical reads than the window function method, while the execution time for the former is actually less. Is it because the former has better parallelism?

I like the window function method because of its concise code. Is there any way to speed it up for this specific problem?

I'm using SQL Server 2016 on Windows 10 64 bit.

Best Answer

The relatively low row-mode performance of LEAD and LAG window functions compared with self joins is nothing new. For example, Michael Zilberstein wrote about it on SQLblog.com back in 2012. There is quite a bit of overhead in the (repeated) Segment, Sequence Project, Window Spool, and Stream Aggregate plan operators.

In SQL Server 2016, you have a new option, which is to enable batch mode processing for the window aggregates. This requires some sort of columnstore index on the table, even if it is empty. The presence of a columnstore index is currently required for the optimizer to consider batch mode plans. In particular, it enables the much more efficient Window Aggregate batch-mode operator.

To test this in your case, create an empty nonclustered columnstore index:

 -- Empty CS index
CREATE NONCLUSTERED COLUMNSTORE INDEX dummy 
ON dbo.tab1 (id, [time], [value]) 
WHERE id < 0 AND id > 0;

The query:

SELECT
    T1.id,
    T1.[time],
    T1.[value],
    value_lag = 
        LAG(T1.[value]) OVER (
            PARTITION BY T1.id
            ORDER BY T1.[time]),
    value_lead =
        LEAD(T1.[value]) OVER (
            PARTITION BY T1.id
            ORDER BY T1.[time])
FROM dbo.tab1 AS T1;

This should now give a batch-mode execution plan, which may well execute much faster.

You may need to use an OPTION (MAXDOP 1) or other hint to get the same plan shape when storing the results in a new table. The parallel version of the plan requires a batch mode sort (or possibly two), which may well be a little slower. It rather depends on your hardware.

For more on the Batch Mode Window Aggregate operator, see the following articles by Itzik Ben-Gan:

Related Solutions

Postgresql – How to get the aggregate of a window function in Postgres

You can have window functions on the result of aggregate functions in a single query level.

This would all work nicely after a few modifications, except that it fails for the standard deviation on mathematical principle. The calculations involved are not linear, so you cannot simply combine standard deviations of sub-populations.

SELECT perm
      ,combo
      ,avg(value)                 AS perm_average_value
      ,sum(avg(value) * count(*)) OVER w_combo /
       sum(count(*)) OVER w_combo AS combo_average_value
      ,stddev_pop(value)          AS perm_stddev
      ,0                          AS combo_stddev  -- doesn't work!
      ,count(*)                   AS perm_count
      ,sum(count(*)) OVER w_combo AS combo_count
FROM   foo
GROUP  BY perm, combo
WINDOW w_combo  AS (PARTITION BY combo);

For combo_average_value you would need this expression:

sum(avg(value) * count(*)) OVER w_combo / sum(count(*)) OVER w_combo

because you need a weighted average. (The average of a group with 10 members weighs more than the average of a group with just 2 members!)
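
To make the arithmetic concrete, here is a small sketch in plain Python with two hypothetical groups of values (the numbers are made up for illustration). It checks that weighting each group average by its count reproduces the overall average, and that the same trick does not reproduce the overall standard deviation.

```python
from statistics import fmean, pstdev

# Hypothetical values for two perm groups that share one combo.
group_a = [1.0, 2.0, 3.0, 4.0]   # 4 members
group_b = [10.0, 20.0]           # 2 members
n_a, n_b = len(group_a), len(group_b)

# Ground truth computed over all rows of the combo.
overall_avg = fmean(group_a + group_b)
overall_sd = pstdev(group_a + group_b)

# Unweighted mean of the group means: the small group counts too much.
naive_avg = (fmean(group_a) + fmean(group_b)) / 2

# Weighted by group size, mirroring sum(avg * count) / sum(count).
weighted_avg = (fmean(group_a) * n_a + fmean(group_b) * n_b) / (n_a + n_b)

# The same weighting applied to standard deviations does NOT work,
# because it ignores how far apart the group means are.
weighted_sd = (pstdev(group_a) * n_a + pstdev(group_b) * n_b) / (n_a + n_b)

print("average:", overall_avg, "naive:", naive_avg, "weighted:", weighted_avg)
print("stddev :", overall_sd, "weighted:", weighted_sd)
```

The weighted average matches the overall average, while the weighted standard deviation comes out well below the true value, which is why the combo_stddev column cannot be derived from the per-group results this way.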

This works:

SELECT DISTINCT ON (perm, combo)
       perm
      ,combo
      ,avg(value)        OVER wpc AS perm_average_value
      ,avg(value)        OVER wc  AS combo_average_value
      ,stddev_pop(value) OVER wpc AS perm_stddev
      ,stddev_pop(value) OVER wc  AS combo_stddev
      ,count(*)          OVER wpc AS perm_count
      ,count(*)          OVER wc  AS combo_count
FROM   foo
WINDOW wc  AS (PARTITION BY combo)
      ,wpc AS (PARTITION BY perm, combo);

I am using two different windows here, and reducing the rows with DISTINCT, which is applied after the window functions are evaluated.
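
The evaluation order can be checked with a small sketch. It uses an in-memory SQLite database via Python's sqlite3 module as a stand-in for PostgreSQL, with hypothetical rows and text values in place of arrays. SQLite has no DISTINCT ON, so plain DISTINCT is swapped in; that is equivalent here only because every selected column is constant within a (perm, combo) group.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # window functions need SQLite 3.25+
con.execute("CREATE TABLE foo (perm TEXT, combo TEXT, value REAL)")
con.executemany(
    "INSERT INTO foo VALUES (?, ?, ?)",
    # Hypothetical data: p1 and p2 belong to combo c1, p3 to combo c2.
    [("p1", "c1", 1.0), ("p1", "c1", 3.0), ("p2", "c1", 8.0), ("p3", "c2", 5.0)],
)

rows = con.execute("""
    SELECT DISTINCT
           perm, combo,
           avg(value) OVER wpc AS perm_average_value,
           avg(value) OVER wc  AS combo_average_value,
           count(*)   OVER wpc AS perm_count,
           count(*)   OVER wc  AS combo_count
    FROM   foo
    WINDOW wc  AS (PARTITION BY combo),
           wpc AS (PARTITION BY perm, combo)
    ORDER  BY perm, combo
""").fetchall()
con.close()

# The window aggregates see all four input rows; DISTINCT then folds the
# two identical output rows for (p1, c1) into one.
for row in rows:
    print(row)
```

Four input rows produce three output rows, and combo_count for c1 is still 3, which shows the windows were computed before the duplicates were removed.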

But I seriously doubt it will be faster than your original query. I am pretty sure it isn't.

Better performance with altered table layout

Arrays have an overhead of 24 bytes (slight variations depending on type). Also, you seem to have quite a few items per array and many repetitions. For a huge table like yours it would pay to normalize the schema. Example layout:

CREATE TABLE combo ( 
  combo_id serial PRIMARY KEY
 ,combo    int[] NOT NULL
);

CREATE TABLE perm ( 
  perm_id  serial PRIMARY KEY
 ,perm     int[] NOT NULL
);

CREATE TABLE value (
  perm_id  int REFERENCES perm(perm_id)
 ,combo_id int REFERENCES combo(combo_id)
 ,value numeric NOT NULL DEFAULT 0
);

If you don't need referential integrity you can omit the foreign key constraints.

The connection to combo_id could also be placed in the table perm, but in this scenario I would store it (slightly de-normalized) in value for better performance.

This would result in a row size of 32 bytes (tuple header + padding: 24 bytes, 2 x int: 8 bytes, no padding), plus the unknown size of your numeric column. (If you don't need extreme precision, a double precision or even a real column might do, too.)

More on physical storage in this related answer on SO or here:
Configuring PostgreSQL for read performance

Anyway, that's only a fraction of what you have now and would make your query a lot faster by size alone. Grouping and sorting on simple integers is also a lot faster.

You would first aggregate in a subquery and then join to perm and combo for best performance.
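
A rough sketch of that "aggregate first, then join" shape against the normalized layout. It again uses an in-memory SQLite database via Python's sqlite3 module as a stand-in for PostgreSQL. The rows are hypothetical, and the arrays are stored as plain text because SQLite has no array type.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE combo (combo_id INTEGER PRIMARY KEY, combo TEXT NOT NULL);
    CREATE TABLE perm  (perm_id  INTEGER PRIMARY KEY, perm  TEXT NOT NULL);
    CREATE TABLE value (
        perm_id  INTEGER REFERENCES perm(perm_id),
        combo_id INTEGER REFERENCES combo(combo_id),
        value    REAL NOT NULL DEFAULT 0
    );
    -- Hypothetical sample rows; arrays are written as text literals.
    INSERT INTO combo VALUES (1, '{1,2}');
    INSERT INTO perm  VALUES (1, '{1,2}'), (2, '{2,1}');
    INSERT INTO value VALUES (1, 1, 2.0), (1, 1, 4.0), (2, 1, 9.0);
""")

rows = con.execute("""
    SELECT p.perm, c.combo, agg.perm_average_value, agg.perm_count
    FROM  (
        -- Aggregate on the narrow integer keys first ...
        SELECT perm_id, combo_id,
               avg(value) AS perm_average_value,
               count(*)   AS perm_count
        FROM   value
        GROUP  BY perm_id, combo_id
    ) AS agg
    -- ... then look up the wide array columns once per group.
    JOIN   perm  AS p ON p.perm_id  = agg.perm_id
    JOIN   combo AS c ON c.combo_id = agg.combo_id
    ORDER  BY agg.perm_id
""").fetchall()
con.close()

for row in rows:
    print(row)
```

Grouping happens on two integers per row, and the joins to perm and combo touch only one row per group rather than one per input row.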

Sql-server – What can be done to further enhance performance of Multiple Join and Aggregate Queries

There is rarely any need, point or benefit in trying to micro-optimise star schema queries with non-clustered indexes laden with included columns. Fact tables are built to be scanned.

The indexes you've created in your examples are subset copies of the parent table, which are being scanned (no seeks). The minor performance improvements come from scanning marginally fewer pages than the parent table. Given that star schemas are built to support ad-hoc query patterns it is not viable to create the indexes to support every possible enquiry.

Create your fact table clustered index on the date key. The majority of (typical) fact table queries include a time element and clustering on the date key enables range scanning of fact table rows.
Add non-clustered indexes on the foreign keys of your fact tables to assist with highly selective queries. Foreign keys to dimension tables can be created with NOCHECK to prevent any impact on ETL.
Cluster your dimension tables on their surrogate keys.
Create a non-clustered index on the natural key of each dimension table.
Stop.

The optimiser detects star schema query patterns and has strategies to deal with them efficiently, utilising scans and hash joins in Standard Edition or bitmap filtering in Enterprise. Follow the indexing strategy outlined above and let the optimiser deal with the rest.