Edit: this is second attempt
Based on @Hannah Vernon's answer, here is a way to bypass the use of CTE inside an inline subquery, which is like self-joining the CTE and I presume is the reason for the poor efficiency. It uses analytic functions available only in the 2012 version of SQL-Server. Tested at SQL-Fiddle
This part can be skipped from reading, it's a copy-paste from Hannah's answer:
;With AccountHierarchy AS
(
SELECT
Root.AcctId AccountId
, Root.Name AccountName
, Root.ParentId ParentId
, 1 HierarchyLevel
, cast(Root.Acctid as varchar(4000)) IdHierarchyMatch
, cast(Root.Acctid as varchar(4000)) IdHierarchy
, cast(replace(Root.Name,'.','') as varchar(4000)) NameHierarchy
, cast(Root.Acctid as varchar(4000)) HierarchySort
, cast(Root.Name as varchar(4000)) HierarchyLabel ,
Root.CustomerCount CustomerCount
FROM
account Root
WHERE
Root.ParentID is null
UNION ALL
SELECT
Recurse.Acctid AccountId
, Recurse.Name AccountName
, Recurse.ParentId ParentId
, Root.HierarchyLevel + 1 HierarchyLevel
, CAST(CAST(Root.IdHierarchyMatch as varchar(40)) + '.'
+ cast(recurse.Acctid as varchar(40)) as varchar(4000)) IdHierarchyMatch
, cast(cast(recurse.Acctid as varchar(40)) + '.'
+ Root.IdHierarchy as varchar(4000)) IdHierarchy
, cast(replace(recurse.Name,'.','') + '.'
+ Root.NameHierarchy as varchar(4000)) NameHierarchy
, cast(Root.AccountName + '.'
+ Recurse.Name as varchar(4000)) HierarchySort
, cast(space(root.HierarchyLevel * 4)
+ Recurse.Name as varchar(4000)) HierarchyLabel
, Recurse.CustomerCount CustomerCount
FROM
account Recurse INNER JOIN
AccountHierarchy Root on Root.AccountId = Recurse.ParentId
)
Here we order the rows of the CTE using the IdHierarchyMatch
and we calculate row numbers and a running total (from the next row up to the end.)
, cte1 AS
(
SELECT
h.AccountId
, h.AccountName
, h.ParentId
, h.HierarchyLevel
, h.IdHierarchy
, h.NameHierarchy
, h.HierarchyLabel
, parsename(h.IdHierarchy,1) Acct1Id
, parsename(h.NameHierarchy,1) Acct1Name
, parsename(h.IdHierarchy,2) Acct2Id
, parsename(h.NameHierarchy,2) Acct2Name
, parsename(h.IdHierarchy,3) Acct3Id
, parsename(h.NameHierarchy,3) Acct3Name
, parsename(h.IdHierarchy,4) Acct4Id
, parsename(h.NameHierarchy,4) Acct4Name
, h.CustomerCount
, h.HierarchySort
, h.IdHierarchyMatch
, Rn = ROW_NUMBER() OVER
(ORDER BY h.IdHierarchyMatch)
, RunningCustomerCount = COALESCE(
SUM(h.CustomerCount)
OVER
(ORDER BY h.IdHierarchyMatch
ROWS BETWEEN 1 FOLLOWING
AND UNBOUNDED FOLLOWING)
, 0)
FROM
AccountHierarchy AS h
)
Then we have one more intermediate CTE where we use the previous running totals and row numbers - basically to find where the end points for the branches of the tree structure:
, cte2 AS
(
SELECT
cte1.*
, rn3 = LAST_VALUE(Rn) OVER
(PARTITION BY Acct1Id, Acct2Id, Acct3Id
ORDER BY Acct4Id
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
, rn2 = LAST_VALUE(Rn) OVER
(PARTITION BY Acct1Id, Acct2Id
ORDER BY Acct3Id, Acct4Id
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
, rn1 = LAST_VALUE(Rn) OVER
(PARTITION BY Acct1Id
ORDER BY Acct2Id, Acct3Id, Acct4Id
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
, rcc3 = LAST_VALUE(RunningCustomerCount) OVER
(PARTITION BY Acct1Id, Acct2Id, Acct3Id
ORDER BY Acct4Id
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
, rcc2 = LAST_VALUE(RunningCustomerCount) OVER
(PARTITION BY Acct1Id, Acct2Id
ORDER BY Acct3Id, Acct4Id
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
, rcc1 = LAST_VALUE(RunningCustomerCount) OVER
(PARTITION BY Acct1Id
ORDER BY Acct2Id, Acct3Id, Acct4Id
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM
cte1
)
and finally we build the last part:
SELECT
hier.AccountId
, hier.AccountName
--- -- columns skipped
, hier.CustomerCount
, TotalCustomerCount = hier.CustomerCount
+ hier.RunningCustomerCount
- ca.LastRunningCustomerCount
, hier.HierarchySort
, hier.IdHierarchyMatch
FROM
cte2 hier
OUTER APPLY
( SELECT LastRunningCustomerCount, Rn
FROM
( SELECT LastRunningCustomerCount
= RunningCustomerCount, Rn
FROM (SELECT NULL a) x WHERE 4 <= HierarchyLevel
UNION ALL
SELECT rcc3, Rn3
FROM (SELECT NULL a) x WHERE 3 <= HierarchyLevel
UNION ALL
SELECT rcc2, Rn2
FROM (SELECT NULL a) x WHERE 2 <= HierarchyLevel
UNION ALL
SELECT rcc1, Rn1
FROM (SELECT NULL a) x WHERE 1 <= HierarchyLevel
) x
ORDER BY Rn
OFFSET 0 ROWS
FETCH NEXT 1 ROWS ONLY
) ca
ORDER BY
hier.HierarchySort ;
And a simplification, using the same cte1
as the code above. Test at SQL-Fiddle-2. Please note that both solutions work under the assumption that you have a maximum of four levels in your tree:
SELECT
hier.AccountId
--- -- skipping rows
, hier.CustomerCount
, TotalCustomerCount = CustomerCount
+ RunningCustomerCount
- CASE HierarchyLevel
WHEN 4 THEN RunningCustomerCount
WHEN 3 THEN LAST_VALUE(RunningCustomerCount) OVER
(PARTITION BY Acct1Id, Acct2Id, Acct3Id
ORDER BY Acct4Id
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
WHEN 2 THEN LAST_VALUE(RunningCustomerCount) OVER
(PARTITION BY Acct1Id, Acct2Id
ORDER BY Acct3Id, Acct4Id
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
WHEN 1 THEN LAST_VALUE(RunningCustomerCount) OVER
(PARTITION BY Acct1Id
ORDER BY Acct2Id, Acct3Id, Acct4Id
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
END
, hier.HierarchySort
, hier.IdHierarchyMatch
FROM cte1 AS hier
ORDER BY
hier.HierarchySort ;
A third approach, with only one CTE, for the recursive part and then only window aggregate functions (SUM() OVER (...)
), so it should work in any version from 2005 upwards. Test at SQL-Fiddle-3 This solution assumes, like the previous ones, that there are 4 levels maximum in the hierarchy tree:
;WITH AccountHierarchy AS
(
SELECT
AccountId = Root.AcctId
, AccountName = Root.Name
, ParentId = Root.ParentId
, HierarchyLevel = 1
, HierarchySort = CAST(Root.Acctid AS VARCHAR(4000))
, HierarchyLabel = CAST(Root.Name AS VARCHAR(4000))
, Acct1Id = CAST(Root.Acctid AS VARCHAR(4000))
, Acct2Id = CAST(NULL AS VARCHAR(4000))
, Acct3Id = CAST(NULL AS VARCHAR(4000))
, Acct4Id = CAST(NULL AS VARCHAR(4000))
, Acct1Name = CAST(Root.Name AS VARCHAR(4000))
, Acct2Name = CAST(NULL AS VARCHAR(4000))
, Acct3Name = CAST(NULL AS VARCHAR(4000))
, Acct4Name = CAST(NULL AS VARCHAR(4000))
, CustomerCount = Root.CustomerCount
FROM
account AS Root
WHERE
Root.ParentID IS NULL
UNION ALL
SELECT
Recurse.Acctid
, Recurse.Name
, Recurse.ParentId
, Root.HierarchyLevel + 1
, CAST(Root.AccountName + '.'
+ Recurse.Name AS VARCHAR(4000))
, CAST(SPACE(Root.HierarchyLevel * 4)
+ Recurse.Name AS VARCHAR(4000))
, Root.Acct1Id
, CASE WHEN Root.HierarchyLevel = 1
THEN cast(Recurse.Acctid AS VARCHAR(4000))
ELSE Root.Acct2Id
END
, CASE WHEN Root.HierarchyLevel = 2
THEN CAST(Recurse.Acctid AS VARCHAR(4000))
ELSE Root.Acct3Id
END
, CASE WHEN Root.HierarchyLevel = 3
THEN CAST(Recurse.Acctid AS VARCHAR(4000))
ELSE Root.Acct4Id
END
, cast(Root.AccountName as varchar(4000))
, CASE WHEN Root.HierarchyLevel = 1
THEN CAST(Recurse.Name AS VARCHAR(4000))
ELSE Root.Acct2Name
END
, CASE WHEN Root.HierarchyLevel = 2
THEN CAST(Recurse.Name AS VARCHAR(4000))
ELSE Root.Acct3Name
END
, CASE WHEN Root.HierarchyLevel = 3
THEN CAST(Recurse.Name AS VARCHAR(4000))
ELSE Root.Acct4Name
END
, Recurse.CustomerCount
FROM
account AS Recurse INNER JOIN
AccountHierarchy AS Root ON Root.AccountId = Recurse.ParentId
)
SELECT
h.AccountId
, h.AccountName
, h.ParentId
, h.HierarchyLevel
, IdHierarchy =
CAST(COALESCE(h.Acct4Id+'.','')
+ COALESCE(h.Acct3Id+'.','')
+ COALESCE(h.Acct2Id+'.','')
+ h.Acct1Id AS VARCHAR(4000))
, NameHierarchy =
CAST(COALESCE(h.Acct4Name+'.','')
+ COALESCE(h.Acct3Name+'.','')
+ COALESCE(h.Acct2Name+'.','')
+ h.Acct1Name AS VARCHAR(4000))
, h.HierarchyLabel
, h.Acct1Id
, h.Acct1Name
, h.Acct2Id
, h.Acct2Name
, h.Acct3Id
, h.Acct3Name
, h.Acct4Id
, h.Acct4Name
, h.CustomerCount
, TotalCustomerCount =
CASE h.HierarchyLevel
WHEN 4 THEN h.CustomerCount
WHEN 3 THEN SUM(h.CustomerCount) OVER
(PARTITION BY h.Acct1Id, h.Acct2Id, h.Acct3Id)
WHEN 2 THEN SUM(h.CustomerCount) OVER
(PARTITION BY Acct1Id, h.Acct2Id)
WHEN 1 THEN SUM(h.CustomerCount) OVER
(PARTITION BY h.Acct1Id)
END
, h.HierarchySort
, IdHierarchyMatch =
CAST(h.Acct1Id
+ COALESCE('.'+h.Acct2Id,'')
+ COALESCE('.'+h.Acct3Id,'')
+ COALESCE('.'+h.Acct4Id,'') AS VARCHAR(4000))
FROM
AccountHierarchy AS h
ORDER BY
h.HierarchySort ;
A 4th approach, that calculates as an intermediate CTE, the closure table of the hierarchy. Test at SQL-Fiddle-4. The benefit is that for the sums calculations, there is no resctriction on the number of levels.
;WITH AccountHierarchy AS
(
-- skipping several line, identical to the 3rd approach above
)
, ClosureTable AS
(
SELECT
AccountId = Root.AcctId
, AncestorId = Root.AcctId
, CustomerCount = Root.CustomerCount
FROM
account AS Root
UNION ALL
SELECT
Recurse.Acctid
, Root.AncestorId
, Recurse.CustomerCount
FROM
account AS Recurse INNER JOIN
ClosureTable AS Root ON Root.AccountId = Recurse.ParentId
)
, ClosureGroup AS
(
SELECT
AccountId = AncestorId
, TotalCustomerCount = SUM(CustomerCount)
FROM
ClosureTable AS a
GROUP BY
AncestorId
)
SELECT
h.AccountId
, h.AccountName
, h.ParentId
, h.HierarchyLevel
, h.HierarchyLabel
, h.CustomerCount
, cg.TotalCustomerCount
, h.HierarchySort
FROM
AccountHierarchy AS h
JOIN
ClosureGroup AS cg
ON cg.AccountId = h.AccountId
ORDER BY
h.HierarchySort ;
1st case
You seem to forget the valid_during
range. As your third case suggests, there can be multiple entries per (rec_id, val)
, so you must select the right one:
UPDATE master m
SET valid_on = f_array_sort(m.valid_on || u.valid_on) -- sorted array, see below
FROM updates u
WHERE m.rec_id = u.rec_id
AND m.valid_during @> u.valid_on -- additional check
AND m.val = u.val
AND NOT m.valid_on @> ARRAY[u.valid_on];
I assume the whole possible date range is always covered for each existing rec_id
and valid_during
shall not overlap per rec_id
, or you'd have to do more.
After installing the additional module btree_gist
, add an exclusion constraint to rule out overlapping date ranges if you don't have one, yet:
ALTER TABLE master ADD CONSTRAINT EXCLUDE
USING gist (rec_id WITH =, valid_during WITH &&) -- disallow overlap
The GiST index this is implemented with is also a perfect match for the query. Details:
2nd / 3rd case
Assuming that every date range starts with the smallest date in the (now sorted!) array: lower(m.valid_during) = m.valid_on[1]
. I would enforce that with a CHECK
constraint.
Here we need to create one or two new rows
In the 2nd case it is enough to shrink the range of the old row and insert one new row
In the 3rd case we update the old row with the left half of array and range, insert the new row and finally insert the with the right half of array and range.
Helper functions
To keep it simple I introduce a new constraint: every array is sorted. Use this helper function
-- sort array
CREATE OR REPLACE FUNCTION f_array_sort(anyarray)
RETURNS anyarray LANGUAGE sql IMMUTABLE AS
$$SELECT ARRAY (SELECT unnest($1) ORDER BY 1)$$;
I don't need your helper function arraymin()
any more, but it could be simplified to:
CREATE OR REPLACE FUNCTION f_array_min(anyarray)
RETURNS anyelement LANGUAGE sql IMMUTABLE AS
$$SELECT min(a) FROM unnest($1) a$$;
Two more to get the left and right half of an array split at a given element:
-- split left array at given element
CREATE OR REPLACE FUNCTION f_array_left(anyarray, anyelement)
RETURNS anyarray LANGUAGE sql IMMUTABLE AS
$$SELECT ARRAY (SELECT * FROM unnest($1) a WHERE a < $2 ORDER BY 1)$$;
-- split right array at given element
CREATE OR REPLACE FUNCTION f_array_right(anyarray, anyelement)
RETURNS anyarray LANGUAGE sql IMMUTABLE AS
$$SELECT ARRAY (SELECT * FROM unnest($1) a WHERE a >= $2 ORDER BY 1)$$;
Query
This does all the rest:
WITH u AS ( -- identify candidates
SELECT m.id, rec_id, m.val, m.valid_on, m.valid_during
, u.val AS u_val, u.valid_on AS u_valid_on
FROM master m
JOIN updates u USING (rec_id)
WHERE m.val <> u.val
AND m.valid_during @> u.valid_on
FOR UPDATE -- lock for update
)
, upd1 AS ( -- case 2: no overlap, no split
UPDATE master m -- shrink old row
SET valid_during = daterange(lower(u.valid_during), u.u_valid_on)
FROM u
WHERE u.id = m.id
AND u.u_valid_on > m.valid_on[array_upper(m.valid_on, 1)]
RETURNING m.id
)
, ins1 AS ( -- insert new row
INSERT INTO master (rec_id, val, valid_on, valid_during)
SELECT u.rec_id, u.u_val, ARRAY[u.u_valid_on]
, daterange(u.u_valid_on, upper(u.valid_during))
FROM upd1
JOIN u USING (id)
)
, upd2 AS ( -- case 3: overlap, need to split row
UPDATE master m -- shrink to first half
SET valid_during = daterange(lower(u.valid_during), u.u_valid_on)
, valid_on = f_array_left(u.valid_on, u.u_valid_on)
FROM u
LEFT JOIN upd1 USING (id)
WHERE upd1.id IS NULL -- all others
AND u.id = m.id
RETURNING m.id, f_array_right(u.valid_on, u.u_valid_on) AS arr_right
)
INSERT INTO master (rec_id, val, valid_on, valid_during)
-- new row
SELECT u.rec_id, u.u_val, ARRAY[u.u_valid_on]
, daterange(u.u_valid_on, upd2.arr_right[1])
FROM upd2
JOIN u USING (id)
UNION ALL -- second half of old row
SELECT u.rec_id, u.val, upd2.arr_right
, daterange(upd2.arr_right[1], upper(u.valid_during))
FROM upd2
JOIN u USING (id);
SQL Fiddle.
Notes
You need to understand the concept of data-modifying CTEs (writeable CTEs), before you touch this. Judging from the code you provided, you know your way around Postgres.
FOR UPDATE
is to avoid race conditions with concurrent write access. If you are the only user writing to the tables, you don't need it.
I took a piece of paper and drew a timeline so not to get lost in all of this.
Each row is only updated / inserted once, and operations are simple and roughly optimized. No expensive window functions. This should perform well. Much faster than your previous approach in any case.
It would be a bit less confusing if you'd use distinct column names for u.valid_on
and m.valid_on
, which are related but different things.
I compute the right half of the split array in the RETURNING
clause of CTE upd2
: f_array_right(u.valid_on, u.u_valid_on) AS arr_right
, because I need it several times in the next step. This is a (legal) trick to save one more CTE.
As for solutions that don't involve unnesting the master table
: You have to unnest the array valid_on
either way, to split it, at least as long as it's not sorted. Also, your helper function arraymin()
is already unnesting it anyway.
Best Answer
For the purpose of this question, I'll assume
employee_details.name
to be definedUNIQUE
. Else, the whole operation wouldn't make sense.You cannot nest a data-modifying CTE like you tried (as you already found out the hard way) - and you don't need to. This query would achieve your objective:
The core feature is the
INSERT
with no target columns and an emptySELECT
. Postgres fills all columns not listed in theSELECT
with default values. This way we can replace the unconditionalVALUES (default)
with a conditionalINSERT
. The CTEi1
only inserts a row if the given name was not found.The manual:
This is a Postgres specific extension of the standard:
The final CTE
i2
only inserts a row ifi1
returned a row. Voilá.This is subject to race conditions under concurrent write load to the same tables. If you need to rule that out, you need to do more. Related:
Without the complications from the conditional INSERT in the 2nd table, this would boil down to a common case of SELECT or INSERT:
Aside
I'd strongly advise to use the data type
uuid
to store UUIDs.