SQL Server – Store n:m Data with Most n:1 Tuples

database-designsql serversql-server-2005

I have these tables:

create table data (
    id int identity not null,
    ref int not null
);

create table reference (
    refid int not null
);

with ~2 million rows in data and ~200k rows in reference. I add ~5k rows to data each day, resulting in 500-5k additional rows in reference per day.

In most of the cases I need to save only one reference for one data column (multiple data entries can have the same reference) but in some cases (currently about 0.1%) I need to save more than one reference.

I probably could use a third table like:

create table data_reference (
     dataid int not null,
     refid int not null
);

to create a n:m mapping but for most of the values it will not have any gain here, it will just create another join…

I could also create a second column in each data row, because in most of the cases I have only one additional reference (but in theory there could be more) and save null if no other reference is there – which seems to me quite bad because there could be tuples with 3 or more references.

Is there an efficient and easy to maintain way to save such data?

Best Answer

Using an intermediate table is the proper way to do this. With proper indexes and good maintenance it won't hurt performance much at all.

Avoid the multi-column solution (refid1, refid2 etc.) - that's repeating groups and not properly normalised. Plus your queries will degenerate into a dog's breakfast of OR clauses and CASE statements, which will hurt performance.

You may get some benefit from subtyping your data rows so most use your current arrangement but those with multiple reference use a different one. Your queries will UNION ALL the two sets to get the total. I really doubt the overhead from this will compensate for the mostly-redundant join to data_reference.

The subtyped schema would look like this:

create table data (
    id int identity not null,
    ref int not null,       -- set to a standard value for rows with >1 reference.
    is_single_reference bit not null
);

create table reference (
    refid int not null
);

create table data_reference (
     dataid int not null,
     refid int not null
);

The query would be something like:

select
    <whatever>
from data as d
inner join reference as r
    on d.ref = r.refid
where d.is_single_reference = 1

union all

select
    <whatever>
from data as d
inner join data_reference as dr
    on dr.dataid= d.id
inner join reference as r
    on r.refid = dr.refid
where d.is_single_reference = 0;

This is fairly ugly. You'd be lucky to avoid two scans of data. Ideally you'd want filtered indexes but IIRC these aren't availble until SQL Server 2008. The applicaiton would have to know how many reference rows there were per data, set flags accordingly and maintain those if the number changed. Just use the n:m approach.