Sql-server – Query detailing differences between rows for a large amount of data

sql serversql-server-2008-r2

I have a number of large tables, each with >300 columns. The application I am using creates "archives" of changed rows by making a copy of the current row in a secondary table.

Consider a trivial example:

CREATE TABLE dbo.bigtable
(
  UpdateDate datetime,
  PK varchar(12) PRIMARY KEY,
  col1 varchar(100),
  col2 int,
  col3 varchar(20),
  .
  .
  .
  colN datetime
);

Archive table:

CREATE TABLE dbo.bigtable_archive
(
  UpdateDate datetime,
  PK varchar(12) NOT NULL,
  col1 varchar(100),
  col2 int,
  col3 varchar(20),
  .
  .
  .
  colN datetime
);

Before any updates are executed on dbo.bigtable, a copy of the row is created in dbo.bigtable_archive, then dbo.bigtable.UpdateDate is updated with the current date.

Therefore UNIONing the two tables together & grouping by PK creates a timeline of changes, when ordered by UpdateDate.

I wish to create a report detailing the differences between rows, ordered by UpdateDate, grouped by PK, in the following format:

PK,   UpdateDate,  ColumnName,  Old Value,   New Value

Old Value and New Value can be the relevant columns cast to a VARCHAR(MAX) (there are no TEXT or BYTE columns involved), as I do not need to do any post-processing of the values themselves.

At the moment I can't think of a sane way of doing this for a large amount of columns, without resorting to generating the queries programmatically – I may have to do this.

Open to lots of ideas, so I'll add a bounty to the question after 2 days.

Best Answer

This is not going to look pretty, especially given the more than 300 columns and unavailability of LAG, nor is it likely to perform exceedingly well, but just as something to start with, I would try the following approach:

  • UNION the two tables.
  • For each PK in the combined set, get its previous "incarnation" from the archive table (the implementation below uses OUTER APPLY + TOP (1) as a poor man's LAG).
  • Cast each data column to varchar(max) and unpivot them in pairs, i.e. the current and the previous value (CROSS APPLY (VALUES ...) works well for this operation).
  • Finally, filter the results based on whether the values in each pair differ from each other.

The Transact-SQL of the above as I see it:

WITH
  Combined AS
  (
    SELECT * FROM dbo.bigtable
    UNION ALL
    SELECT * FROM dbo.bigtable_archive
  ) AS derived,
  OldAndNew AS
  (
    SELECT
      this.*,
      OldCol1 = last.Col1,
      OldCol2 = last.Col2,
      ...
    FROM
      Combined AS this
      OUTER APPLY
      (
        SELECT TOP (1)
          *
        FROM
          dbo.bigtable_archive
        WHERE
          PK = this.PK
          AND UpdateDate < this.UpdateDate
        ORDER BY
          UpdateDate DESC
      ) AS last
  )
SELECT
  t.PK,
  t.UpdateDate,
  x.ColumnName,
  x.OldValue,
  x.NewValue
FROM
  OldAndNew AS t
  CROSS APPLY
  (
    VALUES
    ('Col1', CAST(t.OldCol1 AS varchar(max), CAST(t.Col1 AS varchar(max))),
    ('Col2', CAST(t.OldCol2 AS varchar(max), CAST(t.Col2 AS varchar(max))),
    ...
  ) AS x (ColumnName, OldValue, NewValue)
WHERE
  NOT EXISTS (SELECT x.OldValue INTERSECT x.NewValue)
ORDER BY
  t.PK,
  t.UpdateDate,
  x.ColumnName
;