Best way to model the relationship of unique pairs

database-design, index, unique-constraint

I have two tables: one for storing thing objects and one for storing the relationship between two thing objects.

dbfiddle example

Assume:

  • AB == BA. Storing both would be redundant
  • A != B. The relationship of one thing to itself is not useful
  • Calculating the relationship between A and B is expensive but idempotent

CREATE TABLE thing (
    id INT PRIMARY KEY
);
CREATE TABLE relationships (
    thing_one INT REFERENCES thing(id),
    thing_two INT REFERENCES thing(id),
    relationship INT NOT NULL,
    PRIMARY KEY (thing_one, thing_two),
    CHECK (thing_one != thing_two)
);

In order to ensure we don't INSERT AB and BA:

CREATE UNIQUE INDEX unique_pair_ix
    ON relationships (
        least(thing_one, thing_two),
        greatest(thing_one, thing_two)
    );
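The effect of this functional unique index can be sketched outside the database: canonicalizing each pair as (least, greatest) makes AB and BA map to the same key, so a set (standing in for the unique index) rejects the reversed duplicate. The helper names below are illustrative only, not part of any schema.

```python
def canonical(a: int, b: int) -> tuple[int, int]:
    """Mimic (least(a, b), greatest(a, b)) from the unique index."""
    return (min(a, b), max(a, b))

seen = set()  # stands in for the unique index

def try_insert(a: int, b: int) -> bool:
    """Reject self-pairs (the CHECK) and reversed duplicates (the index)."""
    if a == b:
        return False
    key = canonical(a, b)
    if key in seen:
        return False
    seen.add(key)
    return True

assert try_insert(1, 2) is True   # AB inserts
assert try_insert(2, 1) is False  # BA is a duplicate of AB
assert try_insert(3, 3) is False  # A != B check
```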

Is there a better or more efficient way to store/model this data than the example?

EDIT: There are multiple DBMS being considered for the greater application. They include PostgreSQL, MariaDB, and MySQL. PostgreSQL is the current preference.

Best Answer

This may not be needed and I am not totally sure I have the best answer. I was not able to come up with anything that utilized some kind of hash or other comparison mechanism. I am also running SQL Server, so I don't have the least() and greatest() functions available. See this DBA Stack Exchange Question.

I have, however, done some performance testing with a few different methods. The data I captured is below:

+---------------------------------------------------------+-----------+--------------------+--------------------+--------------+----------------+-----------------------------+
|                                                         |           | Instead of Trigger | Instead of Trigger |              | After Trigger  | Sum and Absolute Difference |
|                          Event                          | Baseline  |  Case Statements   |     Not Exists     | Indexed View | Only Completed | Computed Persisted Columns  |
+---------------------------------------------------------+-----------+--------------------+--------------------+--------------+----------------+-----------------------------+
| Execution Time                                          | 07:06.510 | 13:46.490          | 08:47.594          | 22:18.911    | 30:38.267      | 11:24:38                    |
| Query Profile Statistics                                |           |                    |                    |              |                |                             |
|   Number of INSERT, DELETE and UPDATE statements        | 125250    | 249500             | 499000             | 250000       |                | 249999                      |
|   Rows affected by INSERT, DELETE, or UPDATE statements | 124750    | 249500             | 374250             | 124750       |                | 124750                      |
|   Number of SELECT statements                           | 0         | 0                  | 0                  | 0            |                | 0                           |
|   Rows returned by SELECT statements                    | 0         | 0                  | 0                  | 0            |                | 0                           |
|   Number of transactions                                | 125250    | 249500             | 499000             | 250000       |                | 249999                      |
| Network Statistics                                      |           |                    |                    |              |                |                             |
|   Number of server roundtrips                           | 1         | 250000             | 250000             | 250000       |                | 250000                      |
|   TDS packets sent from client                          | 6075      | 250000             | 250000             | 250000       |                | 250000                      |
|   TDS packets received from server                      | 462       | 250000             | 250000             | 250000       |                | 250000                      |
|   Bytes sent from client                                | 24882190  | 62068000           | 62568000           | 59568000     |                | 61567990                    |
|   Bytes received from server                            | 1888946   | 76910970           | 8782500            | 67527710     |                | 69783720                    |
| Time Statistics                                         |           |                    |                    |              |                |                             |
|   Client processing time                                | 420901    | 269564             | 18202              | 240341       |                | 238190                      |
|   Total execution time                                  | 424682    | 811028             | 512726             | 1325281      |                | 665491                      |
|   Wait time on server replies                           | 3781      | 541464             | 494524             | 1084940      |                | 427301                      |
+---------------------------------------------------------+-----------+--------------------+--------------------+--------------+----------------+-----------------------------+

This data would suggest one of 3 options:

  1. If you are OK with using an IDENTITY column as the Primary Key, the best-performing method appears to be the INSTEAD OF Trigger - NOT EXISTS.
  2. If you are not OK with an IDENTITY column as the Primary Key, but are OK with 2 additional persisted computed columns, the best-performing method appears to be the Persisted Computed Columns.
  3. If you are not OK with an IDENTITY column as the Primary Key, and not OK with 2 additional persisted computed columns, the best-performing method appears to be the Indexed View.

Background on the overall testing method

I created a thing table with the same schema you provided and populated it with each INT from 1-500. Then I scripted out a batch of single INSERT statements, one for each combination of thing. (A GO after each statement ensured that when a given entry failed one of the checks, the rest of the script would still run, so we could collect the statistics for the entire process.)

INSERT INTO relationship (thing_one, thing_two, relationship) VALUES(1, 1, 1 * 1)
GO

INSERT INTO relationship (thing_one, thing_two, relationship) VALUES(1, 2, 1 * 2)
GO

INSERT INTO relationship (thing_one, thing_two, relationship) VALUES(1, 3, 1 * 3)
GO
...
INSERT INTO relationship (thing_one, thing_two, relationship) VALUES(500, 498, 500 * 498)
GO

INSERT INTO relationship (thing_one, thing_two, relationship) VALUES(500, 499, 500 * 499)
GO

INSERT INTO relationship (thing_one, thing_two, relationship) VALUES(500, 500, 500 * 500)
GO

Done correctly this results in 124,750 total records added from 250,000 total attempts.
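As a sanity check on those figures: 500 things give 500 × 500 = 250,000 ordered attempts, of which only the unordered pairs with distinct members survive, i.e. 500 · 499 / 2 = 124,750. A brute-force count agrees:

```python
n = 500

attempts = n * n  # every ordered (thing_one, thing_two) combination
unique_pairs = n * (n - 1) // 2  # unordered pairs with distinct members

# Brute force: canonicalize every ordered pair and count distinct keys.
brute = len({(min(a, b), max(a, b))
             for a in range(1, n + 1)
             for b in range(1, n + 1) if a != b})

assert attempts == 250_000
assert unique_pairs == 124_750
assert brute == unique_pairs
```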

This process was repeated for each of the methods to see what performance was like. I also ran a "baseline" script containing only the INSERT statements for the unique combinations, to get some idea of how fast it could possibly be within the framework in which the test was set up.

Details on Each Method

  1. INSTEAD OF Trigger - CASE expressions to sort input values for thing_one and thing_two

With this implementation we take the provided data and ensure that the smaller of thing_one and thing_two is ultimately placed into the thing_one column and the larger into the thing_two column. From there the Primary Key ensures only unique values remain.

NOTE: Because this implementation enforces uniqueness through the Primary Key (which I think is the correct table structure), it is difficult to handle UPDATE statements via an INSTEAD OF trigger. This was asked in a previous Stack Overflow Question, and the basic outcome is that you either need a different method or an IDENTITY column. I don't personally think it is a great idea to add an IDENTITY column when there is a natural primary key, as it sounds like you have here.

The Trigger Code:

CREATE TRIGGER InsteadOfInsertTrigger on [dbo].[relationship]
INSTEAD OF INSERT
AS
INSERT INTO [dbo].[relationship]
(
    thing_one,
    thing_two,
    relationship
)
SELECT
CASE
    WHEN I.thing_one <= I.thing_two
        THEN I.thing_one
    ELSE
        I.thing_two
    END
,CASE
    WHEN I.thing_one <= I.thing_two
        THEN I.thing_two
    ELSE
        I.thing_one
    END
,I.relationship
FROM inserted I
GO
  2. INSTEAD OF Trigger - NOT EXISTS check

This trigger stops the INSERT if the relationship already exists. The values going into thing_one and thing_two are not sorted, but hopefully that isn't a problem. Just like the previous trigger, this has the same pitfall when it comes to UPDATEs.

The Trigger Code:

CREATE TRIGGER InsteadOfInsertTrigger on [dbo].[relationship]
INSTEAD OF INSERT
AS
INSERT INTO [dbo].[relationship]
(
    thing_one,
    thing_two,
    relationship
)
SELECT
I.thing_one
,I.thing_two
,I.relationship
FROM inserted I
WHERE NOT EXISTS
(
    SELECT 1
    FROM [dbo].[relationship] t
    WHERE (t.thing_one = i.thing_two AND t.thing_two = i.thing_one)
    --This one shouldn't be needed because of the Primary Key
    --AND (t.thing_one = i.thing_one AND t.thing_two = i.thing_two)
)
GO
  3. Unique Indexed View

With this method we create a View and put a Unique Clustered Index on it. If a duplicate record is added, it fails this check and the change is rolled back. I saw 2 ways of doing this: either with a CASE expression, as below, or with some kind of UNION. In my testing the CASE performed much better.

The View and associated Index Code:

CREATE VIEW dbo.relationship_indexedview_view
WITH SCHEMABINDING
AS
    SELECT 
    CASE
        WHEN thing_one <= thing_two
            THEN thing_one
        ELSE
            thing_two
        END as thing_one_sorted,
    CASE
        WHEN thing_one <= thing_two
            THEN thing_two
        ELSE
            thing_one
        END as thing_two_sorted
    FROM [dbo].[relationship_indexedview]
GO

CREATE UNIQUE CLUSTERED INDEX relationship_indexedview_view_unique
    ON dbo.relationship_indexedview_view (thing_one_sorted, thing_two_sorted)
GO
  4. AFTER INSERT and UPDATE Trigger

Here we have another trigger implementation, one that handles both INSERTs and UPDATEs. After the INSERT or UPDATE has finished, it checks whether a duplicate value has been added and performs a ROLLBACK if it finds one.

NOTE: This method performed super poorly. I stopped it after ~30 minutes of running and it only added 51,378 of the expected 124,750 rows (~24% of the INSERT commands executed).

Trigger Code:

CREATE TRIGGER AfterTrigger ON [dbo].[relationship]
AFTER INSERT, UPDATE
AS
BEGIN
    IF EXISTS
    (
        SELECT 1
        FROM [dbo].[relationship] T1
            INNER JOIN [dbo].[relationship] T2
                ON T1.thing_one = T2.thing_two
                AND T1.thing_two = T2.thing_one
    )
    BEGIN
        RAISERROR ('Duplicate Relationship Value Added', 16, 1);
        ROLLBACK TRANSACTION; --stops the Insert/Update
    END
END
GO
  5. Sum and Absolute Difference Comparison using Persisted Computed Columns

After getting confirmation from this Math Stack Exchange Question, we know that a given relationship (thing_one, thing_two) or (thing_two, thing_one) can be identified uniquely by the pair's sum and the absolute value of its difference. We can create 2 persisted computed columns and put a Unique Constraint on them.

With a small modification to the table schema we can ensure the uniqueness without having to modify the INSERT script.

The only downside is having to maintain 2 more columns on the table. As long as that is OK, this appears to carry one of the smallest amounts of overhead, and it does not share the Primary Key pitfalls of the trigger-based methods.

This presumably could be pushed to a separate table, or some other indexed view, but I have not done any testing with that.

CREATE TABLE relationships (
    thing_one INT REFERENCES thing(id),
    thing_two INT REFERENCES thing(id),
    thing_one_thing_two_sum AS thing_one + thing_two PERSISTED,
    thing_one_thing_two_absolute_difference AS ABS(thing_one - thing_two) PERSISTED,
    relationship INT NOT NULL,
    PRIMARY KEY (thing_one, thing_two),
    CHECK (thing_one != thing_two),
    UNIQUE(thing_one_thing_two_sum, thing_one_thing_two_absolute_difference)
);
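The claim behind this method, that (a + b, |a - b|) identifies an unordered pair uniquely, can be checked by brute force: given s = a + b and d = |a - b|, the original pair is recoverable as {(s + d) / 2, (s - d) / 2}, so no two distinct unordered pairs can share a key. A small illustrative script confirming this over a modest range:

```python
from itertools import combinations

pairs = list(combinations(range(1, 101), 2))  # all unordered pairs a < b
keys = {(a + b, abs(a - b)) for a, b in pairs}

# No two distinct unordered pairs share a (sum, |difference|) key.
assert len(keys) == len(pairs)

# The pair is recoverable from the key, so the mapping is a bijection.
for a, b in pairs:
    s, d = a + b, abs(a - b)
    assert {(s + d) // 2, (s - d) // 2} == {a, b}
```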

Hopefully this helps with the design decision or is at least an interesting read.