Sql-server – Duplicate rows with a few changes in a table with a composite primary key, but only if target does not exist

Tags: azure-sql-database, if-not-exists, insert, sql-server

Consider the following table:

CREATE TABLE [xxx].[MyTable](
    [period] [date] NOT NULL,
    [genusId] [int] NOT NULL,
    [subjectId] [int] NOT NULL,
    [waitingStageId] [tinyint] NOT NULL,
    [value] [tinyint] NULL,
    [savedOn] [datetime2](2) NULL,
    [savedBy] [nvarchar](50) NULL,
 CONSTRAINT [PK_ExpectedActivity] PRIMARY KEY CLUSTERED 
(
    [period] ASC,
    [genusId] ASC,
    [subjectId] ASC,
    [waitingStageId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)
INSERT [xxx].[MyTable] ([period], [genusId], [subjectId], [waitingStageId], [value], [savedOn], [savedBy]) VALUES (CAST(N'2017-05-01' AS Date), 71, 92, 1, 1, CAST(N'2017-05-29T13:45:20.6300000' AS DateTime2), NULL)
GO
INSERT [xxx].[MyTable] ([period], [genusId], [subjectId], [waitingStageId], [value], [savedOn], [savedBy]) VALUES (CAST(N'2017-05-01' AS Date), 71, 96, 1, 1, CAST(N'2017-05-29T13:45:21.2400000' AS DateTime2), NULL)
GO
INSERT [xxx].[MyTable] ([period], [genusId], [subjectId], [waitingStageId], [value], [savedOn], [savedBy]) VALUES (CAST(N'2017-05-01' AS Date), 71, 128, 0, 2, CAST(N'2017-05-29T13:45:21.6500000' AS DateTime2), NULL)
GO
INSERT [xxx].[MyTable] ([period], [genusId], [subjectId], [waitingStageId], [value], [savedOn], [savedBy]) VALUES (CAST(N'2017-05-01' AS Date), 71, 128, 1, 3, CAST(N'2017-05-29T13:45:21.4400000' AS DateTime2), NULL)
GO
INSERT [xxx].[MyTable] ([period], [genusId], [subjectId], [waitingStageId], [value], [savedOn], [savedBy]) VALUES (CAST(N'2017-06-01' AS Date), 73, 35, 0, 1, CAST(N'2017-05-26T20:05:40.2800000' AS DateTime2), NULL)
GO
INSERT [xxx].[MyTable] ([period], [genusId], [subjectId], [waitingStageId], [value], [savedOn], [savedBy]) VALUES (CAST(N'2017-06-01' AS Date), 73, 94, 0, 3, CAST(N'2017-05-26T20:05:40.6700000' AS DateTime2), NULL)
GO
INSERT [xxx].[MyTable] ([period], [genusId], [subjectId], [waitingStageId], [value], [savedOn], [savedBy]) VALUES (CAST(N'2017-06-01' AS Date), 73, 94, 1, 1, CAST(N'2017-05-26T20:05:40.4800000' AS DateTime2), NULL)
GO
INSERT [xxx].[MyTable] ([period], [genusId], [subjectId], [waitingStageId], [value], [savedOn], [savedBy]) VALUES (CAST(N'2017-06-01' AS Date), 74, 23, 0, 6, CAST(N'2017-05-26T20:03:26.5400000' AS DateTime2), NULL)
GO
INSERT [xxx].[MyTable] ([period], [genusId], [subjectId], [waitingStageId], [value], [savedOn], [savedBy]) VALUES (CAST(N'2017-06-01' AS Date), 74, 47, 0, 6, CAST(N'2017-05-26T20:03:26.7700000' AS DateTime2), NULL)
GO
INSERT [xxx].[MyTable] ([period], [genusId], [subjectId], [waitingStageId], [value], [savedOn], [savedBy]) VALUES (CAST(N'2017-06-01' AS Date), 74, 67, 0, 3, CAST(N'2017-05-26T20:03:27.1800000' AS DateTime2), NULL)
GO

I am trying to create a stored procedure that will allow us to duplicate values from the previous month into the current month — BUT ONLY if there isn't a value for the current month already.

DROP TABLE IF EXISTS #PreviousMonth
DECLARE @schemaName AS varchar(5) = 'xxx' -- To be used later as a parameter in stored procedure
DECLARE @lastMonth AS DATE = CAST(DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE())-1, 0) AS DATE) --First day of previous month
DECLARE @currentMonth AS DATE = CAST(DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0) AS DATE) --First day of current month

-- Add rows for existing month into a temporary table
SELECT @currentMonth AS period, genusId, subjectId, waitingStageId, value, GETDATE() AS savedOn, '<Automated by Stored Procedure>' AS savedBy INTO #PreviousMonth FROM [xxx].[MyTable]
  WHERE period in(@lastMonth)

SELECT * INTO xxx.MyTable FROM #PreviousMonth
-- Here is where I got stuck. I tried WHERE clauses and NOT EXISTS but my skills are limited.

Unfortunately, this is as far as I managed to get. I don't know how to write an INSERT that is conditional on a NOT EXISTS check.

How can I make sure that I only insert rows into xxx.MyTable where a row with the same primary key (period AND genusId AND subjectId AND waitingStageId) does not already exist?

Best Answer

Taking what your query says literally, you don't have to worry about what exists in xxx.myTable - because your statement will create it, therefore it won't have anything in it (or, if it already exists, your statement will fail). Presumably, that's not really what you were going for.

If we assume that xxx.myTable does exist, and may already have some rows whose key values match the rows you want to insert, then this should work.

INSERT INTO xxx.myTable (period, genusID, subjectID, waitingStageID, value, savedOn, savedBy)
SELECT period, genusID, subjectID, waitingStageID, value, savedOn, savedBy
  FROM #PreviousMonth pm
 WHERE NOT EXISTS (SELECT 1
                     FROM xxx.myTable
                    WHERE period = pm.period
                      AND genusID = pm.genusID
                      AND subjectID = pm.subjectID
                      AND waitingStageID = pm.waitingStageID
                  )
;

In the NOT EXISTS clause, you're checking if the current row from the temp table matches the primary key values for any rows already in xxx.myTable. If you find a match for the current row, then you won't SELECT it (and thus, won't INSERT it).
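If the temp table is only a staging step, the same check can be folded into a single statement. The following is a minimal sketch, not a tested solution. It assumes the previous month's rows live in xxx.MyTable itself, and it recomputes the two date variables the same way the question does:

```sql
-- Sketch only: assumes last month's rows are read from xxx.MyTable itself.
DECLARE @lastMonth    date = DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) - 1, 0); -- first day of previous month
DECLARE @currentMonth date = DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0);     -- first day of current month

INSERT INTO xxx.MyTable (period, genusId, subjectId, waitingStageId, value, savedOn, savedBy)
SELECT @currentMonth, prev.genusId, prev.subjectId, prev.waitingStageId, prev.value,
       SYSDATETIME(), N'<Automated by Stored Procedure>'
  FROM xxx.MyTable AS prev
 WHERE prev.period = @lastMonth
   -- Skip any key that already has a row for the current month.
   AND NOT EXISTS (SELECT 1
                     FROM xxx.MyTable AS cur
                    WHERE cur.period         = @currentMonth
                      AND cur.genusId        = prev.genusId
                      AND cur.subjectId      = prev.subjectId
                      AND cur.waitingStageId = prev.waitingStageId);
```

Running it twice in the same month should insert nothing the second time, because every key it would copy then already exists for the current period.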

You can also write this as what's called an "anti-join":

INSERT INTO xxx.myTable (period, genusID, subjectID, waitingStageID, value, savedOn, savedBy)
SELECT period, genusID, subjectID, waitingStageID, value, savedOn, savedBy
  FROM #PreviousMonth pm
         LEFT OUTER JOIN xxx.myTable mt ON (    pm.period = mt.period
                                            AND pm.genusID = mt.genusID
                                            AND pm.subjectID = mt.subjectID
                                            AND pm.waitingStageID = mt.waitingStageID
                                           )
 WHERE mt.period IS NULL
;

Here, we're matching all the rows in the temp table with any matching records in xxx.myTable via a LEFT JOIN. Because of the left join, if there's no matching row in xxx.myTable, we'll still return columns from that table, but all the values will be NULL. Since period is part of the primary key and can never be NULL in a real row, a NULL mt.period means that there isn't a matching xxx.myTable row for our temp table row. So, only those rows (temp table rows with no matching xxx.myTable row) will be SELECTed, and thus will be INSERTed.

In most cases, I wouldn't expect dramatically different performance between the two versions; test in your specific case, to see if one seems better than the other.

UPDATE: If you want to use the previous month's data wherever the key fields don't exist, or where they do exist but the value = 0, you've got a couple of options available.

You could delete all rows in xxx.myTable where the value is 0, then do the insert as above.
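A sketch of that delete, under one added assumption: rather than removing every zero-valued row in the table, it only removes zero-valued rows that #PreviousMonth can actually replace, so no key disappears without a substitute. Drop the EXISTS filter if you really do want all zero rows gone.

```sql
-- Sketch only: deletes zero-valued rows that have a replacement in #PreviousMonth.
DELETE mt
  FROM xxx.MyTable AS mt
 WHERE mt.value = 0
   AND EXISTS (SELECT 1
                 FROM #PreviousMonth AS pm
                WHERE pm.period         = mt.period
                  AND pm.genusId        = mt.genusId
                  AND pm.subjectId      = mt.subjectId
                  AND pm.waitingStageId = mt.waitingStageId);
```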

I would be inclined to UPDATE the existing rows, and then perform the INSERT as above. The UPDATE statement would be:

UPDATE mt
   SET value = pm.value
  FROM xxx.myTable mt
         INNER JOIN #PreviousMonth pm ON (    mt.period = pm.period
                                          AND mt.genusID = pm.genusID
                                          AND mt.subjectID = pm.subjectID
                                          AND mt.waitingStageID = pm.waitingStageID
                                         )
 WHERE mt.value = 0
   AND pm.value <> 0
;
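Because the UPDATE and the INSERT belong together, it may be worth running them as one unit. This is a sketch under the assumption that you want both statements to succeed or fail together; the transaction wrapper and SET XACT_ABORT are additions, not part of the statements above.

```sql
-- Sketch only: run the UPDATE and the conditional INSERT atomically.
SET XACT_ABORT ON;  -- any run-time error rolls back the whole transaction

BEGIN TRANSACTION;

    -- 1. Overwrite existing zero values with last month's non-zero values.
    UPDATE mt
       SET value = pm.value
      FROM xxx.MyTable AS mt
           INNER JOIN #PreviousMonth AS pm
                   ON mt.period         = pm.period
                  AND mt.genusId        = pm.genusId
                  AND mt.subjectId      = pm.subjectId
                  AND mt.waitingStageId = pm.waitingStageId
     WHERE mt.value = 0
       AND pm.value <> 0;

    -- 2. Insert the keys that do not exist yet.
    INSERT INTO xxx.MyTable (period, genusId, subjectId, waitingStageId, value, savedOn, savedBy)
    SELECT pm.period, pm.genusId, pm.subjectId, pm.waitingStageId, pm.value, pm.savedOn, pm.savedBy
      FROM #PreviousMonth AS pm
     WHERE NOT EXISTS (SELECT 1
                         FROM xxx.MyTable AS mt
                        WHERE mt.period         = pm.period
                          AND mt.genusId        = pm.genusId
                          AND mt.subjectId      = pm.subjectId
                          AND mt.waitingStageId = pm.waitingStageId);

COMMIT TRANSACTION;
```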

Related Solutions

Sql-server – How to avoid using variables in WHERE clause

Parameter sniffing is your friend almost all of the time, and you should write your queries so that it can be used. Parameter sniffing helps build the plan using the parameter values available when the query is compiled. The dark side of parameter sniffing is when the values used when compiling the query are not optimal for the queries to come.

The query in a stored procedure is compiled when the stored procedure is executed, not when the query is executed, so the values that SQL Server has to deal with here...

CREATE PROCEDURE WeeklyProc(@endDate DATE)
AS
BEGIN
  DECLARE @startDate DATE = DATEADD(DAY, -6, @endDate)
  SELECT
    -- Stuff
  FROM Sale
  WHERE SaleDate BETWEEN @startDate AND @endDate
END

are a known value for @endDate and an unknown value for @startDate. That leaves SQL Server guessing that the filter on @startDate returns 30% of the rows, combined with whatever the statistics tell it for @endDate. On a big table with a lot of rows, that could give you a scan operation where you would benefit most from a seek.

Your wrapper procedure solution makes sure that SQL Server sees the values when DateRangeProc is compiled so it can use known values for both @endDate and @startDate.
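The question's own wrapper code is not reproduced here, so the following is only a sketch of the pattern. DateRangeProc is the name used above; the SaleCount column is a hypothetical stand-in for the real select list.

```sql
-- Sketch only: the inner procedure receives both dates as parameters,
-- so both values can be sniffed when its plan is compiled.
CREATE PROCEDURE DateRangeProc(@startDate DATE, @endDate DATE)
AS
BEGIN
  SELECT COUNT(*) AS SaleCount   -- hypothetical stand-in for the real column list
  FROM Sale
  WHERE SaleDate BETWEEN @startDate AND @endDate;
END
GO

-- The wrapper only computes the start date and passes both values on.
CREATE PROCEDURE WeeklyProc(@endDate DATE)
AS
BEGIN
  DECLARE @startDate DATE = DATEADD(DAY, -6, @endDate);
  EXEC DateRangeProc @startDate, @endDate;
END
GO
```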

Both of your dynamic queries lead to the same thing: the values are known at compile time.

The one with a default null value is a bit special. The values known to SQL Server at compile time are a known value for @endDate and null for @startDate. A null in a BETWEEN returns 0 rows, but SQL Server always estimates 1 row in those cases. That might be a good thing in this case, but if you call the stored procedure with a large date interval where a scan would have been the best choice, it may end up doing a bunch of seeks.

I left "Use the DATEADD() function directly" to the end of this answer because it is the one I would use and there is something strange with it as well.

First off, SQL Server does not call the function multiple times when it is used in the where clause. DATEADD is considered a runtime constant.

I would have expected DATEADD to be evaluated when the query is compiled, so that you would get a good estimate of the number of rows returned. But that is not so in this case.
SQL Server estimates based on the value in the parameter regardless of what you do with DATEADD (tested on SQL Server 2012), so in your case the estimate will be the number of rows recorded for @endDate. Why it does that I don't know, but it has to do with the use of the datatype DATE. Switch to DATETIME in the stored procedure and the table and the estimate will be accurate, meaning that DATEADD is considered at compile time for DATETIME but not for DATE.

So, to summarize this rather lengthy answer, I would recommend the wrapper procedure solution. It will always allow SQL Server to use the values provided when compiling the query, without the hassle of using dynamic SQL.

PS:

In comments you got two suggestions.

OPTION (OPTIMIZE FOR UNKNOWN) will give you an estimate of 9% of rows returned and OPTION (RECOMPILE) will make SQL Server see the parameter values since the query is recompiled every time.
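For reference, a sketch of where the recompile hint would go. The procedure name WeeklyProc_Recompile and the SaleCount column are hypothetical, used only to keep the example self-contained.

```sql
-- Sketch only: the hint goes at the end of the statement it applies to.
CREATE PROCEDURE WeeklyProc_Recompile(@endDate DATE)
AS
BEGIN
  DECLARE @startDate DATE = DATEADD(DAY, -6, @endDate);

  SELECT COUNT(*) AS SaleCount   -- hypothetical stand-in for the real column list
  FROM Sale
  WHERE SaleDate BETWEEN @startDate AND @endDate
  OPTION (RECOMPILE);            -- statement is compiled on every execution
END
```

The trade-off is a compilation on every call, which is usually acceptable for a procedure that runs once a week but may not be for one that runs many times per second.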

Sql-server – updating the entire column with datetime values based on other values in the row

It looks like you have at least two different regional date formats there, possibly USA and Europe. You could just split the updates by language.

Something like this, which uses the language setting to control how ISDATE and CONVERT interpret the strings:

set language english;
update test_dates 
set ToDateTime = convert(datetime,ContainedData)
where AID='123' and PID='2' and isdate(ContainedData)=1;

set language british;
update test_dates 
set ToDateTime = convert(datetime,ContainedData)
where AID='123' and PID='2' and isdate(ContainedData)=1;

If the table is too large to update in big chunks, then you should look at wrapping that in a loop with a row limit and a transaction around each update and repeat until zero rows have been updated.
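A minimal sketch of such a loop for the first of the two updates. It rests on two assumptions that are not in the original: a batch size of 5000 rows, and that rows not yet converted have ToDateTime IS NULL. Without a filter like that, the same rows would qualify on every pass and the loop would never end.

```sql
-- Sketch only: batch size and the "ToDateTime IS NULL" filter are assumptions.
SET LANGUAGE english;

DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    BEGIN TRANSACTION;

    UPDATE TOP (5000) test_dates
       SET ToDateTime = CONVERT(datetime, ContainedData)
     WHERE AID = '123'
       AND PID = '2'
       AND ISDATE(ContainedData) = 1
       AND ToDateTime IS NULL;      -- only rows not converted yet

    SET @rows = @@ROWCOUNT;         -- must be read immediately after the UPDATE

    COMMIT TRANSACTION;
END
```

The same loop would then be repeated after SET LANGUAGE british to pick up the rows in the other format.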
