Sql-server – How is IF NOT EXISTS SELECT THEN INSERT faster than UNIQUE index

indexsql server

In SQL Server how is…

Sp:

CREATE PROCEDURE insertToTable
    @field1 VARCHAR(256), @field2 varchar(256), @field3 varchar(256)
AS
BEGIN
    SET NOCOUNT ON

    IF NOT EXISTS (SELECT * FROM my_table WHERE field1 = @field1)
      INSERT INTO my_table
        (field1, field2, field3)
      VALUES (@field1, @field2, @field3);
    ELSE
      THROW 50000, 'xxxxxx', 1;
    END
GO

Table:

CREATE TABLE my_table (
    field1 VARCHAR(256) NOT NULL,
    field2 VARCHAR(256) NOT NULL,
    field3 VARCHAR(256) NOT NULL
);
CREATE INDEX idx_field1 ON my_table(field1);

the above faster than the below?

Sp:

CREATE PROCEDURE insertToTable
    @field1 VARCHAR(256), @field2 varchar(256), @field3 varchar(256)
AS
BEGIN
    SET NOCOUNT ON

    INSERT INTO my_table
        (field1, field2, field3)
    VALUES (@field1, @field2, @field3);
GO

Table:

CREATE TABLE my_table (
    field1 VARCHAR(256) NOT NULL,
    field2 VARCHAR(256) NOT NULL,
    field3 VARCHAR(256) NOT NULL
);
CREATE UNIQUE INDEX idx_field1 ON my_table(field1);

Sample input:

field1: F56yCgZ9AEm9aFpTyjwhERtqNeglYEow

field2: BD84CE2A514316164B7448C804B178AD8F6F597E8EC6F25F4D6E36287259C65F67E7206E82A4F8EFD2389C0821C0C70E8278DC5F166D220356B5A15A091A6C17$0

field3: A18E9049117A77E6A4D41C6CA3FFDEA65D842BF1F57705405B4E66969531D93D

The input is generated on the fly by the web application and using prepared statements. I use Jmeter to generate requests to my web app.

With UNIQUE index, insert performance degrades after 100K inserts and gets worse.

With NON UNIQUE index and a manual check with IF NOT EXISTS SELECT, performance is constant even with millions of records inserted.

The values are unique enough that there never is a duplicate generated. Even after a few million values inserted.

Best Answer

FINAL UPDATE:

It's the INSERT that's really slowing things down.

When a Unique index is in place, with every new record you add SQL has to check if the value already exists. As the table grows, the amount of cross-referencing increases. A non-unique index will require no cross-referencing so performance is constant.

Unique indexes are usually faster for SELECT statements but that comes at a cost when updating a table.

Below is why the SELECT can sometimes be slower on a Unique Index

I've partially recreated your situation to the point where I think it is down to combination of Parameter sniffing and SQL prefers using the NON-UNIQUE index on a HEAP.

Set up 2 test tables, one of them is a heap (just like your table).

CREATE TABLE dbo.TEST1(ID VARCHAR(255) NOT NULL,TXT1 VARCHAR(255) NOT NULL,TXT2 VARCHAR(255) NOT NULL)
CREATE TABLE dbo.TEST2(ID VARCHAR(255) NOT NULL,TXT1 VARCHAR(255) NOT NULL,TXT2 VARCHAR(255) NOT NULL)
GO
INSERT INTO dbo.TEST1 VALUES(NEWID(),NEWID(),NEWID())
GO 30000

INSERT INTO dbo.TEST2
SELECT * FROM dbo.TEST1
GO

CREATE CLUSTERED INDEX cidx ON dbo.TEST1 (ID)
CREATE INDEX idx_nu ON dbo.TEST1 (ID)
CREATE UNIQUE INDEX idx_u ON dbo.TEST1 (ID)

CREATE INDEX idx_nu ON dbo.TEST2 (ID)
CREATE UNIQUE INDEX idx_u ON dbo.TEST2 (ID)

Review the footprint of the Indexes, on the HEAP the UNIQUE index has a smaller footprint than the NON-UNIQUE index. (perhaps the pages of the NON-UNIQUE index contain extra - possibly useful - information)(note: after running the above code multiple times the page count doesn't differ, probably due to caching, alter the "GO 30000" to rectify the issue.)

SELECT
    s.name AS SchemaName,
    t.name AS TableName,
    i.name AS IndexName,
    p.row_count,
    SUM (p.used_page_count) as used_pages_count,
    SUM (CASE
            WHEN (i.index_id < 2) THEN (in_row_data_page_count + lob_used_page_count + row_overflow_used_page_count)
            ELSE lob_used_page_count + row_overflow_used_page_count
        END) as pages
FROM 
    sys.dm_db_partition_stats  AS p 
        JOIN sys.tables AS t 
            ON 
            p.object_id = t.object_id
        JOIN sys.indexes AS i 
            ON 
            i.[object_id] = t.[object_id] 
            AND 
            p.index_id = i.index_id
        JOIN sys.schemas AS s 
            ON
            t.schema_id = s.schema_id
WHERE
    t.name IN ('TEST1','TEST2')
GROUP BY 
    s.name
    ,t.name
    ,i.name
    ,p.row_count

Now query the tables with literals and variables.

--SCAN of the UNIQUE index
DECLARE @account_id VARCHAR(255) = (SELECT TOP 1 ID FROM dbo.TEST2 WHERE ID like '%A%') 

--Parameter Sniffing kicks in --The optimiser doesn't know the value of @account_id

--SEEK of the CLUSTERED index 
DECLARE @ID1 VARCHAR(255)  = (SELECT TOP 1 ID FROM dbo.TEST1 WHERE ID = @account_id)

--SEEK of the NON UNIQUE index
DECLARE @ID2 VARCHAR(255)  = (SELECT TOP 1 ID FROM dbo.TEST2 WHERE ID = @account_id)

For some reason SQL prefers the NON UNIQUE index on a HEAP when performing SEEK operations.

Here's what I think is going on. When the Non-Unique index has more Pages then the corresponding Histogram in the STATS has more STEPS, run below code.

DBCC SHOW_STATISTICS ( 'TEST2' , 'idx_nu' )
DBCC SHOW_STATISTICS ( 'TEST2' , 'idx_u' )

The additional STEPS create a more granular view of the underlying Index, so the Optimiser (knowing that the EQ_ROWS is always 1) is getting a better Cardinality Estimate from the Non-Unique index.

Related Solutions

SQL Server 2012 – Why Denali Sequences Perform Better Than Identity Columns

I'll answer here as well. It has to do with the internals of how IDENTITY and SEQUENCE work.

With IDENTITY, SQL Server pre-caches values into memory so that they are readily available. See Martin Smith's answer for the details. As values are used, a background process generates more values. As you can imagine this pool can run out pretty quickly, leaving the application at the mercy of the background process that is generating the values.

With SEQUENCE, SQL Server allows you to define how large the cache should be. While SQL Server doesn't actually keep the values in the cache, it only keeps the current value and the top end value, this will greatly reduce the amount of IO that is needed to create values.

Don't set the cache too high, as this will reduce the number of numbers which can be used: if SQL Server were to crash, any values specified in the current cache range which weren't used would be lost.

As for row insertion, just specify a default value for the column, like so:

DEFAULT (NEXT VALUE FOR Audit.EventCounter),

Mysql – Unique insert performance: Primary Key vs Unique Index

From your question, I'm not completely sure if you're inserting multiple values per query, but you definitely should be. With MyISAM as the engine, there should not be a performance difference between the index being UNIQUE or PRIMARY; MyISAM doesn't treat them differently in this case. If you were using InnoDB, however, there would be a difference since it stores the data in primary key order. If you don't need the id column, removing it and making domain the primary key would help performance.

Changing the collation should help since ascii is much simpler than utf8, but you might want to use ascii_general_ci instead of ascii_bin since domain names are case-insensitive.

One other way to do the queries would be to get the number of rows, INSERT DELAYED, flush the delayed writes, and then get the new row count. The difference in the counts would be the same as the affected rows. However, I don't think this would be significantly faster, but it would make the process more complex.

Best Answer

Related Solutions

SQL Server 2012 – Why Denali Sequences Perform Better Than Identity Columns

Mysql – Unique insert performance: Primary Key vs Unique Index

Related Question