SQL Server – Aggregate Data from Multiple Rows into Single Row

sql serversql-server-2008-r2t-sql

In my table I have rows, and each row has some data and its version. When column is not-null, it overrides an old value otherwise nothing happens.

Pseudocode (in C#)

var resultTable = new List<Row>();
foreach (var group in table.GroupBy(x => x.ID).Select(g => g.OrderBy(x => x.Priority)))
{
    var row = Row.GetEmpty();
    foreach (var anotherRow in group)
    {
        foreach (var column in group.Columns)
        {
            if (column.GetValue(anotherRow) != null)
                column.SetValue(row, anotherRow);
        }
    }
    resultTable.Add(row);
}

Table is large enough (several tens of columns, several hundreds of thousands rows), so it's desired to do it in single table scan (if possible).

I appreciate any help and any possibility to achieve it without any bias, SQLCLR, DSQL and so on, everything is applicable if it works fast and makes its job.

Current script for a test table:

CREATE TABLE [dbo].[T](
    [id] [int] NOT NULL,
    [A] [int] NULL,
    [B] [int] NULL,
    [C] [int] NULL,
    [Priority] [int] NOT NULL
) ON [PRIMARY]

GO
CREATE UNIQUE CLUSTERED INDEX [ClusteredIndex-20160204-122857] ON [dbo].[T]
(
    [id] ASC,
    [Priority] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
INSERT [dbo].[T] ([id], [A], [B], [C], [Priority]) VALUES (1, NULL, 3, 4, 1)
GO
INSERT [dbo].[T] ([id], [A], [B], [C], [Priority]) VALUES (1, 5, 6, NULL, 2)
GO
INSERT [dbo].[T] ([id], [A], [B], [C], [Priority]) VALUES (1, 8, NULL, NULL, 3)
GO
INSERT [dbo].[T] ([id], [A], [B], [C], [Priority]) VALUES (2, 634, 346, 359, 1)
GO
INSERT [dbo].[T] ([id], [A], [B], [C], [Priority]) VALUES (2, 34, NULL, 734, 2)
GO

input:

id  A       B       C       Priority
1   NULL    3       4       1
1   5       6       NULL    2
1   8       NULL    NULL    3
2   634     346     359     1
2   34      NULL    734     2

desired result:

id  A   B   C   
1   8   6   4   
2   34  346 734

In this example table is small and has only 5 columns, but in real table it will be much larger. I really want this script to work fast. I tried do it myself, but my script works for SQLSERVER2012+ so I deleted it as not applicable.

Numbers: table could have 150k of rows, 20 columns, 20-80k of unique ids and average SELECT COUNT(id) FROM T GROUP BY ID is 2..5

Best Answer

Itzik Ben-Gan's blog post (link provided by Daniel Hutmacher) has some nice solutions to the problem for SQL Server 2012 and later:

The Last non NULL Puzzle.

Here is one more that will work in older versions (even 2005) and can be easily adapted for more columns. I haven't tested for efficiency or compared to the other solutions but I'd expect it to work better if the table had a clustered index on (id, priority). But performance may depend on many other things, like the distribution, the number of distinct id values, the number of versions (rows per id values), the number and types of columns, etc.:

SELECT 
    d.id,
    d1.col1,
    d2.col2,
    d3.col3
FROM 
    ( SELECT id, @point_in_time AS priority 
      FROM tablename
      GROUP BY id
    ) AS d
  OUTER APPLY
    ( SELECT TOP (1) col1
      FROM tablename 
      WHERE id = d.id
        AND priority <= d.priority          -- optional
        AND col1 IS NOT NULL
      ORDER BY priority DESC
    ) AS d1 
  OUTER APPLY
    ( SELECT TOP (1) col2
      FROM tablename 
      WHERE id = d.id
        AND priority <= d.priority          -- optional
        AND col2 IS NOT NULL
      ORDER BY priority DESC
    ) AS d2 
  OUTER APPLY
    ( SELECT TOP (1) col3
      FROM tablename 
      WHERE id = d.id
        AND priority <= d.priority          -- optional
        AND col3 IS NOT NULL
      ORDER BY priority DESC
    ) AS d3 ;

Test in SQLfiddle.

The @point_in_time above is a parameter, that can be removed if we want the latest non-null values across all the table.

If however, we want the latest non-null values for a specific point in time (say for when priority=55), we can replace that @point_in_time with 55 and have the latest non-null values up to that point in time (55).

Related Solutions

Combine Columns from Multiple Rows into Single Row in SQL Server

This is relatively trivial to do with a correlated subquery. You can't use the COALESCE method highlighted in the blog post you mention unless you extract that to a user-defined function (or unless you only want to return one row at a time). Here is how I typically do this:

DECLARE @x TABLE 
(
  id INT, 
  row_num INT, 
  customer_code VARCHAR(32), 
  comments VARCHAR(32)
);

INSERT @x SELECT 1,1,'Dilbert','Hard'
UNION ALL SELECT 1,2,'Dilbert','Worker'
UNION ALL SELECT 2,1,'Wally','Lazy';

SELECT id, customer_code, comments = STUFF((SELECT ' ' + comments 
    FROM @x AS x2 WHERE id = x.id
     ORDER BY row_num
     FOR XML PATH('')), 1, 1, '')
FROM @x AS x
GROUP BY id, customer_code
ORDER BY id;

If you have a case where the data in comments could contain unsafe-for-XML characters (>, <, &), you should change this:

     FOR XML PATH('')), 1, 1, '')

To this more elaborate approach:

     FOR XML PATH(''), TYPE).value(N'(./text())[1]', N'varchar(max)'), 1, 1, '')

(Be sure to use the right destination data type, varchar or nvarchar, and the right length, and prefix all string literals with N if using nvarchar.)

SQL Server Hierarchical Order – Parent-Child Tree Hierarchical ORDER BY

OK, enough brain cells are dead.

SQL Fiddle

WITH cte AS
(
  SELECT 
    [ICFilterID], 
    [ParentID],
    [FilterDesc],
    [Active],
    CAST(0 AS varbinary(max)) AS Level
  FROM [dbo].[ICFilters]
  WHERE [ParentID] = 0
  UNION ALL
  SELECT 
    i.[ICFilterID], 
    i.[ParentID],
    i.[FilterDesc],
    i.[Active],  
    Level + CAST(i.[ICFilterID] AS varbinary(max)) AS Level
  FROM [dbo].[ICFilters] i
  INNER JOIN cte c
    ON c.[ICFilterID] = i.[ParentID]
)

SELECT 
  [ICFilterID], 
  [ParentID],
  [FilterDesc],
  [Active]
FROM cte
ORDER BY [Level];

Best Answer

Related Solutions

Combine Columns from Multiple Rows into Single Row in SQL Server

SQL Server Hierarchical Order – Parent-Child Tree Hierarchical ORDER BY

Related Question