I have these tables:
create table data (
id int identity not null,
ref int not null
);
create table reference (
refid int not null
);
with ~2 million rows in data and ~200k rows in reference. I add ~5k rows to data each day, resulting in 500-5k additional rows in reference per day.
In most of the cases I need to save only one reference for one data column (multiple data entries can have the same reference) but in some cases (currently about 0.1%) I need to save more than one reference.
I probably could use a third table like:
create table data_reference (
dataid int not null,
refid int not null
);
to create a n:m mapping but for most of the values it will not have any gain here, it will just create another join…
I could also create a second column in each data row, because in most of the cases I have only one additional reference (but in theory there could be more) and save null if no other reference is there – which seems to me quite bad because there could be tuples with 3 or more references.
Is there an efficient and easy to maintain way to save such data?
Best Answer
Using an intermediate table is the proper way to do this. With proper indexes and good maintenance it won't hurt performance much at all.
Avoid the multi-column solution (
refid1
,refid2
etc.) - that's repeating groups and not properly normalised. Plus your queries will degenerate into a dog's breakfast of OR clauses and CASE statements, which will hurt performance.You may get some benefit from subtyping your
data
rows so most use your current arrangement but those with multiplereference
use a different one. Your queries willUNION ALL
the two sets to get the total. I really doubt the overhead from this will compensate for the mostly-redundant join todata_reference
.The subtyped schema would look like this:
The query would be something like:
This is fairly ugly. You'd be lucky to avoid two scans of
data
. Ideally you'd want filtered indexes but IIRC these aren't availble until SQL Server 2008. The applicaiton would have to know how manyreference
rows there were perdata
, set flags accordingly and maintain those if the number changed. Just use the n:m approach.