Sql-server – How to store immutable, unique, ordered, lists

database-designschemasql server

I have an entity with millions of instances. Each instance has to reference an ordered list of items. The lists have to be unique so no list will be stored more than once. But once created, both lists and entity instances are immutable. There will be far more entity instances than lists, and the database has to support fast insertions of entities.

So what's an insert-efficient, robust, way of storing immutable, unique, ordered lists?


Edit: The list items are simple integers, and the typical length is about 5 items. Long lists of, say 10 or 20 items are very unlikely but possible.


Edit: So far, I've considered these approaches:

1)

lists table has these columns: <list_id> <order> <item> so if list #5 contains the elements [10,20,30] the table will contain:

5   1   10
5   2   20
5   3   30
    

The entity table will have a item_list_id column that references the lists table (it's not a foreign key since list_id is not a unique column in the lists table – The this can be solved by adding another table with a single column that which contains all valid list_ids
).

  • This solution makes inserts a bit tricky
  • It also places the responsibility for enforcing the uniqueness of lists on the application, which isn't great.

2)

lists table has these columns: <list_id> <item1> <item2> <item3> ... <itemN> so if list #5 contains the elements [10,20,30] the table will contain:

5   10   20   30

The entity table will have a item_list_id column that references the lists table.

  • This solution is less robust since list lengths are limited (although this isn't a huge problem for me since my lists are VERY unlikely to contain more than 10 or 20 elements)
  • This approach is quite horrible to query ("find all lists in which a particular item appears" has to specify each and every column), and a nightmare to map using an ORM.
  • Insertions of new entities is not too bad since my lists are typically 4-5 items long, so I can probably index the first few columns.
  • Enforcement of uniqueness is still in the hands of the application.

3)

Use solution #1, but replace the meaningless list_id with a hash (say SHA-1) on the list in serialized form.

  • This way uniqueness is more strictly enforced since lists will have unique hashes
  • Insertions are simpler and quicker(?)
  • The data integrity enforcement is still in the hands of the application.

Best Answer

I would go with a traditional set-based approach for this. Create entity, list, entityList, and listItem tables to store your data. Individual list and entity tables allow you to create lists without entities and vice versa. Constraints on these tables such as unique constraints and foreign keys will help preserve the integrity of the data.

For the list, create a single stored procedure which is the single point for inserting lists and listItems. This proc has logic to detect if a list already exists and serves as a gateway. This is straightforward to do using set operators like EXCEPT.

Here's a simple demo which shows how something like this might work. I'm using a table-valued parameter (TVP) to handle passing around lists and a scalar function to determine if a list already exists. Scalar functions are evil when used in resultsets (eg SELECT, WHERE clauses, JOIN ) but are ok when used for what they are meant for, returning single values. Spend some time working through it and see if it could work for you:

USE tempdb
GO

SET NOCOUNT ON
GO

------------------------------------------------------------------------------------------------
-- Setup START
------------------------------------------------------------------------------------------------

-- Reset
IF OBJECT_ID('dbo.entityLists') IS NOT NULL DROP TABLE dbo.entityLists
IF OBJECT_ID('dbo.listItems') IS NOT NULL DROP TABLE dbo.listItems
IF OBJECT_ID('dbo.entities') IS NOT NULL DROP TABLE dbo.entities
IF OBJECT_ID('dbo.lists') IS NOT NULL DROP TABLE dbo.lists
IF OBJECT_ID('dbo.usp_addList') IS NOT NULL DROP PROC dbo.usp_addList
IF OBJECT_ID('dbo.usf_listExists') IS NOT NULL DROP FUNCTION dbo.usf_listExists
IF EXISTS ( SELECT * FROM sys.types st INNER JOIN sys.schemas ss ON st.schema_id = ss.schema_id WHERE st.name = N'tvp_myList' AND ss.name = N'dbo')
DROP TYPE dbo.tvp_myList
GO


CREATE TABLE dbo.entities
(
    entityId        INT IDENTITY,
    entityName      VARCHAR(50) NOT NULL,

    dateAdded       DATETIME    NOT NULL DEFAULT GETDATE(),
    addedBy         VARCHAR(30) NOT NULL DEFAULT SUSER_NAME(),
    dateModified    DATETIME    NULL,
    modifiedBy      VARCHAR(30) NULL,

    CONSTRAINT PK_entities PRIMARY KEY ( entityId ),
    CONSTRAINT UK_entities__entityName UNIQUE ( entityName )

)
GO

CREATE TABLE dbo.lists
(
    listId          INT IDENTITY(1000,1),
    listName        VARCHAR(50) NOT NULL,

    dateAdded       DATETIME    NOT NULL DEFAULT GETDATE(),
    addedBy         VARCHAR(30) NOT NULL DEFAULT SUSER_NAME(),
    dateModified    DATETIME    NULL,
    modifiedBy      VARCHAR(30) NULL,

    CONSTRAINT PK_lists PRIMARY KEY ( listId ),
    CONSTRAINT UK_lists__listName UNIQUE ( listName )

)
GO


-- Although constraints might appear a little excessive, they prevent 'bad data' getting into the table,
-- whereas lesser constraints would not, eg prevent list with duplicate order
CREATE TABLE dbo.listItems
(
    listItemId      INT IDENTITY(100000,1),
    listId          INT NOT NULL,
    itemOrderId     INT NOT NULL,
    itemValue       INT NOT NULL,

    CONSTRAINT PK_listItems PRIMARY KEY ( listItemId ),         -- prevent list with duplicate order, value
    CONSTRAINT UK_listItems__listId__itemValue UNIQUE ( listId, itemValue ),        -- prevent list with duplicate value
    CONSTRAINT UK_listItems__listId__itemOrderId UNIQUE ( listId, itemOrderId ),    -- prevent list with duplicate order
    CONSTRAINT UK_listItems UNIQUE ( listId, itemOrderId, itemValue ),

    CONSTRAINT FK_listItems__listId FOREIGN KEY ( listId ) REFERENCES  dbo.lists ( listId )

)
GO

CREATE INDEX _idx_listItems__itemOrderId__itemValue ON dbo.listItems ( itemOrderId, itemValue ) -- to be used in list existence check 
GO


CREATE TABLE dbo.entityLists
(
    entityId        INT NOT NULL,
    listId          INT NOT NULL,

    CONSTRAINT PK_entityLists PRIMARY KEY ( entityId, listId ),
    CONSTRAINT UK_entityLists__entityId UNIQUE ( entityId ),
    CONSTRAINT UK_entityLists__listId UNIQUE ( listId ),
    CONSTRAINT FK_entityLists__entityId FOREIGN KEY ( entityId ) REFERENCES  dbo.entities ( entityId ),
    CONSTRAINT FK_entityLists__listId FOREIGN KEY ( listId ) REFERENCES  dbo.lists ( listId )

)
GO


-- Create a type for passing lists around; TVP types and constraints mirror target table to act as a quality control
CREATE TYPE dbo.tvp_myList AS TABLE
(
    itemOrderId INT NOT NULL UNIQUE,    -- you could make this an identity column if you want to make sure order is sequential
    itemValue   INT NOT NULL UNIQUE,

    PRIMARY KEY ( itemOrderId, itemValue )

)
GO

-- Scalar function to determine if list already exists
CREATE FUNCTION dbo.usf_listExists
(
    @tvp        AS dbo.tvp_myList   READONLY
)

RETURNS BIT
AS
BEGIN
    DECLARE @listExists_yn  BIT

    IF EXISTS (
        SELECT COUNT(*) OVER() totalItems, itemOrderId, itemValue
        FROM @tvp
        EXCEPT
        SELECT COUNT(*) OVER( PARTITION BY listId ) totalItems, itemOrderId, itemValue
        FROM dbo.listItems
        )
        SET @listExists_yn = 0
    ELSE
        SET @listExists_yn = 1

    RETURN @listExists_yn

END
GO

-- Create a proc to add lists, report if list already exists
CREATE PROC dbo.usp_addList

    @listName   VARCHAR(50) = NULL,
    @tvp        AS dbo.tvp_myList   READONLY,
    @listId     INT OUTPUT

AS

SET NOCOUNT ON

    BEGIN TRY

        -- Check if list name already exists
        IF @listName IS NULL
            SET @listName = NEWID()
        ELSE
        IF EXISTS ( SELECT * FROM dbo.lists WHERE listName = @listName ) 
            RAISERROR ( 'List name ''%s'' already exists id %s', 16, 1, @listName );


        -- Check if list content already exists
        IF dbo.usf_listExists(@tvp) = 1
        BEGIN
            -- Get list
            --!!TODO add 'get list id' proc
            RAISERROR ( 'List already exists id %i', 16, 1, @listId );
        END
        ELSE
        BEGIN

            BEGIN TRAN

                -- Add the list
                INSERT INTO dbo.lists ( listName ) VALUES ( @listName )
                SET @listId = SCOPE_IDENTITY()

                -- Add the list items
                INSERT INTO dbo.listItems ( listId, itemOrderId, itemValue )
                SELECT @listId, itemOrderId, itemValue
                FROM @tvp

            COMMIT TRAN

        END

    END TRY
    BEGIN CATCH

        DECLARE @errorMessage NVARCHAR(2048) = ERROR_MESSAGE()

        IF @@TRANCOUNT > 0 ROLLBACK
        RAISERROR( @errorMessage, 16, 1 )

    END CATCH

    RETURN

GO

-- Setup END
------------------------------------------------------------------------------------------------



------------------------------------------------------------------------------------------------
-- Initial Data START
-- Add some data; actually entities are irrelevant to this demo
------------------------------------------------------------------------------------------------

;WITH cte AS (
SELECT TOP 10 ROW_NUMBER() OVER ( ORDER BY ( SELECT 1 ) ) rn
FROM master.sys.columns c1
    CROSS JOIN master.sys.columns c2
    CROSS JOIN master.sys.columns c3
)
INSERT INTO dbo.entities ( entityName )
SELECT 'Entity ' + CAST ( rn AS VARCHAR(10) )
FROM cte
GO

---- Add some simple lists
DECLARE @listId INT, @tvp AS dbo.tvp_myList

INSERT INTO @tvp VALUES ( 1, 10 )   -- Single Item List ( 10 )

EXEC dbo.usp_addList 'my New List 1', @tvp, @listId OUT

INSERT INTO @tvp VALUES ( 2, 20 )   -- Two Item List ( 10, 20 )
EXEC dbo.usp_addList 'my New List 2', @tvp, @listId OUT

INSERT INTO @tvp VALUES ( 3, 30 )   -- Three Item List ( 10, 20, 30 )
EXEC dbo.usp_addList 'my New List 3', @tvp, @listId OUT
GO

-- Initial lists
select * from dbo.lists
select * from dbo.listItems


-- Initial Data END
------------------------------------------------------------------------------------------------



------------------------------------------------------------------------------------------------
-- Test Scenarios START
------------------------------------------------------------------------------------------------

-- Try and add any of those lists again
BEGIN TRY

    -- Single item list already exists
    DECLARE @listId INT, @tvp AS dbo.tvp_myList

    INSERT INTO @tvp VALUES ( 1, 10 )
    EXEC dbo.usp_addList 'my New List 4', @tvp, @listId OUT

END TRY
BEGIN CATCH
    PRINT ERROR_MESSAGE()
END CATCH
GO

BEGIN TRY

    DECLARE @listId INT, @tvp AS dbo.tvp_myList

    -- Two item list already exists
    INSERT INTO @tvp VALUES ( 1, 10 ), ( 2, 20 )
    EXEC dbo.usp_addList 'my New List 5', @tvp, @listId OUT
    PRINT @listId

END TRY
BEGIN CATCH
    PRINT ERROR_MESSAGE()
END CATCH
GO

BEGIN TRY

    DECLARE @listId INT, @tvp AS dbo.tvp_myList

    INSERT INTO @tvp VALUES ( 1, 10 ), ( 2, 20 ), ( 3, 30 )
    EXEC dbo.usp_addList 'my New List 6', @tvp, @listId OUT
    PRINT @listId

END TRY
BEGIN CATCH
    PRINT ERROR_MESSAGE()
END CATCH
GO


BEGIN TRY

    -- New four-item list
    DECLARE @listId INT, @tvp AS dbo.tvp_myList

    INSERT INTO @tvp VALUES ( 1, 10 ), ( 2, 20 ), ( 3, 30 ), ( 4, 40 )
    EXEC dbo.usp_addList 'my New List 7', @tvp, @listId OUT
    PRINT @listId

END TRY
BEGIN CATCH
    PRINT ERROR_MESSAGE()
END CATCH
GO


BEGIN TRY

    -- New four-item list
    DECLARE @listId INT, @tvp AS dbo.tvp_myList

    INSERT INTO @tvp VALUES ( 1, 10 ), ( 2, 20 ), ( 3, 30 ), ( 4, 41 )
    EXEC dbo.usp_addList 'my New List 8', @tvp, @listId OUT
    PRINT @listId

END TRY
BEGIN CATCH
    PRINT ERROR_MESSAGE()
END CATCH
GO


BEGIN TRY

    DECLARE @listId INT, @tvp AS dbo.tvp_myList

    -- Two item list does not exist, but three-item list with same two components does exist; both get added
    INSERT INTO @tvp VALUES ( 1, 10 ), ( 2, 21 ), ( 3, 30 )
    EXEC dbo.usp_addList 'my New List 9', @tvp, @listId OUT
    PRINT @listId

    DELETE @tvp WHERE itemValue = 30
    EXEC dbo.usp_addList 'my New List 10', @tvp, @listId OUT
    PRINT @listId

END TRY
BEGIN CATCH
    PRINT ERROR_MESSAGE()
END CATCH
GO



-- Final lists
SELECT * FROM dbo.lists
SELECT * FROM dbo.listItems
GO

-- Test Scenarios END
------------------------------------------------------------------------------------------------



------------------------------------------------------------------------------------------------
-- Load Tests START
-- For 1,000 random entities, create a randomn-size list and add; report errors
------------------------------------------------------------------------------------------------

-- Load test the addList proc
DECLARE @i INT = 0

WHILE @i < 1000
BEGIN

    DECLARE @item INT, @items INT
    DECLARE @listId INT, @tvp AS dbo.tvp_myList

    -- Initialise
    SELECT @items = ( RAND() * 9 ) + 1, @item = 1   -- random number between 1 and 10
    DELETE @tvp

    PRINT @items

    WHILE @item <= @items
    BEGIN

        PRINT CHAR(9) + CAST( @item AS VARCHAR(5) )

        INSERT INTO @tvp VALUES ( @item, RAND() * 10000 )

        SET @item += 1

    END

    EXEC dbo.usp_addList NULL, @tvp, NULL

    SET @i += 1

END
GO



-- Load Tests END
------------------------------------------------------------------------------------------------


-- Visualise your lists
SELECT 
    l.listId,
    l.listName,
    l.dateAdded,
    (
    SELECT
        li1.listId  "@listId",
        COUNT(*) AS "@count",
        (
        SELECT
            itemOrderId AS "@itemOrderId",
            itemValue AS "@itemValue"
        FROM dbo.listItems li2
        WHERE li1.listId = li2.listId
        ORDER BY itemOrderId
        FOR XML PATH('listItem'), TYPE
        )
    FROM dbo.listItems li1
    WHERE l.listId = li1.listId
    GROUP BY li1.listId
    FOR XML PATH('listItems'), TYPE
    ) listItems
FROM dbo.lists l
GO




-- END
------------------------------------------------------------------------------------------------

I'm a big proponent of XML but it is the wrong choice for this design. XML can be appropriate for semi-structured, but your design is actually really structured. The overhead of constantly shredding XML for comparison would be too much, plus you can't directly compare once piece of XML with another easily in SQL Server. Where XML would be useful here, is visualising your lists, as I've done at the end of my demo.

HTH