I have two tables: one for storing `thing` and one for storing the relationship between two `thing` objects.

Assume:

- `AB == BA`. Storing both would be redundant.
- `A != B`. The relationship of one `thing` to itself is not useful.
- Calculating the relationship between `AB` is expensive but idempotent.
```sql
CREATE TABLE thing (
    id INT PRIMARY KEY
);

CREATE TABLE relationships (
    thing_one INT REFERENCES thing(id),
    thing_two INT REFERENCES thing(id),
    relationship INT NOT NULL,
    PRIMARY KEY (thing_one, thing_two),
    CHECK (thing_one != thing_two)
);
```
In order to ensure we don't `INSERT` both `AB` and `BA`:

```sql
CREATE UNIQUE INDEX unique_pair_ix
    ON relationships (
        least(thing_one, thing_two),
        greatest(thing_one, thing_two)
    );
```
Is there a better or more efficient way to store/model this data than the example?
EDIT: There are multiple DBMS being considered for the greater application. They include PostgreSQL, MariaDB, and MySQL. PostgreSQL is the current preference.
Best Answer
This may not be needed, and I am not totally sure I have the best answer. I was not able to come up with anything that utilized some kind of hash or other comparison mechanism. I am also running SQL Server, so I don't have the ability to use the `least()` and `greatest()` functions (see this DBA Stack Exchange Question). I have, however, done some performance testing with a few different methods. The data I captured is below.
This data would suggest one of 3 options:

1. If you cannot use an `IDENTITY` column as the `Primary Key`, then the best performing method appears to be the `INSTEAD OF Trigger - NOT EXISTS`.
2. If you can use an `IDENTITY` column as the `Primary Key`, and you are ok with the existence of 2 other `Persisted Computed Columns`, then the best performing method appears to be the `Persisted Computed Columns`.
3. If you can use an `IDENTITY` column as the `Primary Key`, and you are not ok with the existence of 2 other `Persisted Computed Columns`, then the best performing method appears to be the `INDEXED VIEW`.
Background on the overall testing method
I created a `thing` table with the same schema you provided and populated it with each `INT` from 1-500. Then I scripted out a bunch of single `INSERT` statements, one for each combination of `thing` (a `GO` was placed between statements so that when a given entry failed either of the checks, the rest of the script would still run and we could collect the statistics for the entire process). Done correctly, this results in 124,750 total records added from 250,000 total attempts.
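For illustration, the generated script is one single-row `INSERT` per batch, roughly like the following (the specific values and the `relationship` of 0 are illustrative, not the actual script used in the tests):

```sql
INSERT INTO dbo.relationships (thing_one, thing_two, relationship) VALUES (1, 2, 0);
GO
INSERT INTO dbo.relationships (thing_one, thing_two, relationship) VALUES (2, 1, 0);
GO
-- ... one batch per (thing_one, thing_two) combination, 250,000 in total
```

Because each statement is its own batch, a constraint or trigger failure aborts only that one `INSERT` and the script keeps running.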
This process was repeated for 4 different methods to see what the performance was like. I also ran a "baseline" query which had only the `INSERT` statements for the unique combinations, so that we had some idea of how fast it could possibly get within the framework of the test setup.

Details on Each Method
INSTEAD OF Trigger - CASE expressions to sort input values for `thing_one` and `thing_two`

With this implementation we take the provided data, ensure that the smaller of `thing_one` and `thing_two` is ultimately placed into the `thing_one` column, and place the other (the larger of the two) into the `thing_two` column. From there the `Primary Key` ensures that only the unique values remain.

NOTE: Because this implementation enforces uniqueness on the `Primary Key` (which I think is the correct table structure), it is difficult to handle `UPDATE` statements via an `INSTEAD OF Trigger`. This was asked in a previous Stack Overflow Question, and the basic outcome is that you either need a different method or an `IDENTITY` column. I don't personally think it is a great idea to add an `IDENTITY` column if there is a natural primary key, as it sounds like you have here.

The `Trigger` code:
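A minimal sketch of this sorting trigger, assuming SQL Server T-SQL and the question's schema in `dbo`; the trigger name is illustrative, not the one from the original tests:

```sql
-- Sketch only: sorts each incoming pair so the smaller value always
-- lands in thing_one; the PRIMARY KEY then rejects any duplicate pair.
CREATE TRIGGER tr_relationships_sort
ON dbo.relationships
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.relationships (thing_one, thing_two, relationship)
    SELECT
        CASE WHEN i.thing_one < i.thing_two THEN i.thing_one ELSE i.thing_two END,
        CASE WHEN i.thing_one < i.thing_two THEN i.thing_two ELSE i.thing_one END,
        i.relationship
    FROM inserted AS i;
END;
```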
INSTEAD OF Trigger - `NOT EXISTS` check

This is a trigger that stops the `INSERT` if the relationship already exists. The values going into `thing_one` and `thing_two` are not sorted, but hopefully that isn't a problem. Just like the previous `Trigger`, this still has the same pitfall when it comes to `UPDATE` statements.

The `Trigger` code:
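A minimal sketch of this variant, again assuming SQL Server T-SQL with an illustrative trigger name:

```sql
-- Sketch only: silently skips any incoming row whose pair already
-- exists in either orientation; values are inserted unsorted.
CREATE TRIGGER tr_relationships_not_exists
ON dbo.relationships
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.relationships (thing_one, thing_two, relationship)
    SELECT i.thing_one, i.thing_two, i.relationship
    FROM inserted AS i
    WHERE NOT EXISTS (
        SELECT 1
        FROM dbo.relationships AS r
        WHERE (r.thing_one = i.thing_one AND r.thing_two = i.thing_two)
           OR (r.thing_one = i.thing_two AND r.thing_two = i.thing_one)
    );
END;
```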
Unique Indexed View

With this method, we create a `View` and put a `Unique Clustered Index` on it. If a duplicate record is added, it fails this check and the change is rolled back. I saw 2 ways of doing this: either with a `CASE` expression like I have below, or with some kind of `UNION`. In my testing the `CASE` performed much better.

The `View` and associated `Index` code:
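A sketch of what the `CASE`-based indexed view could look like; the view and index names are illustrative, and the usual indexed-view requirements (SCHEMABINDING, two-part table names, specific SET options at creation time) apply:

```sql
-- Sketch only: the view normalizes each row to (low, high) order, and
-- the unique clustered index enforces uniqueness of the unordered pair.
CREATE VIEW dbo.v_relationships_pairs
WITH SCHEMABINDING
AS
SELECT
    CASE WHEN thing_one < thing_two THEN thing_one ELSE thing_two END AS pair_low,
    CASE WHEN thing_one < thing_two THEN thing_two ELSE thing_one END AS pair_high
FROM dbo.relationships;
GO

CREATE UNIQUE CLUSTERED INDEX ux_relationships_pairs
    ON dbo.v_relationships_pairs (pair_low, pair_high);
```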
AFTER INSERT and UPDATE Trigger

Here we have another `TRIGGER` implementation that handles both `INSERT` and `UPDATE`. After the `INSERT` or `UPDATE` has finished, it checks whether a duplicate value has been added and performs a `ROLLBACK` if it finds one.

NOTE: This method performed super poorly. I stopped it after ~30 minutes of running, and it had only added 51,378 of the expected 124,750 rows (~24% of the `INSERT` commands executed).

The `Trigger` code:
Sum and Absolute Difference Comparison using Persisted Computed Columns

After getting confirmation from this Math Stack Exchange Question, we know that a given relationship (thing_one, thing_two) or (thing_two, thing_one) can be tested for uniqueness by looking at the sum of the pair and the absolute value of its difference. We can create 2 `Persisted Computed Columns` and add a `Unique Constraint` over them.

With a small modification to the table schema, we can ensure uniqueness without having to modify the `INSERT` script.

The only downside is having to maintain 2 more columns on the table. As long as that is ok, this appears to carry one of the smallest amounts of overhead, and it does not have the pitfall of having to deal with Primary Key changes found with the `TRIGGER`-based methods.

This could presumably be pushed to a separate table, or some other indexed view, but I have not done any testing with that.
Hopefully this helps with the design decision or is at least an interesting read.