SQL Server – Store n:m Data with Most n:1 Tuples

database-designsql serversql-server-2005

I have these tables:

create table data (
    id int identity not null,
    ref int not null
);

create table reference (
    refid int not null
);

with ~2 million rows in data and ~200k rows in reference. I add ~5k rows to data each day, resulting in 500-5k additional rows in reference per day.

In most of the cases I need to save only one reference for one data column (multiple data entries can have the same reference) but in some cases (currently about 0.1%) I need to save more than one reference.

I probably could use a third table like:

create table data_reference (
     dataid int not null,
     refid int not null
);

to create a n:m mapping but for most of the values it will not have any gain here, it will just create another join…

I could also create a second column in each data row, because in most of the cases I have only one additional reference (but in theory there could be more) and save null if no other reference is there – which seems to me quite bad because there could be tuples with 3 or more references.

Is there an efficient and easy to maintain way to save such data?

Best Answer

Using an intermediate table is the proper way to do this. With proper indexes and good maintenance it won't hurt performance much at all.

Avoid the multi-column solution (refid1, refid2 etc.) - that's repeating groups and not properly normalised. Plus your queries will degenerate into a dog's breakfast of OR clauses and CASE statements, which will hurt performance.

You may get some benefit from subtyping your data rows so most use your current arrangement but those with multiple reference use a different one. Your queries will UNION ALL the two sets to get the total. I really doubt the overhead from this will compensate for the mostly-redundant join to data_reference.

The subtyped schema would look like this:

create table data (
    id int identity not null,
    ref int not null,       -- set to a standard value for rows with >1 reference.
    is_single_reference bit not null
);

create table reference (
    refid int not null
);

create table data_reference (
     dataid int not null,
     refid int not null
);

The query would be something like:

select
    <whatever>
from data as d
inner join reference as r
    on d.ref = r.refid
where d.is_single_reference = 1

union all

select
    <whatever>
from data as d
inner join data_reference as dr
    on dr.dataid= d.id
inner join reference as r
    on r.refid = dr.refid
where d.is_single_reference = 0;

This is fairly ugly. You'd be lucky to avoid two scans of data. Ideally you'd want filtered indexes but IIRC these aren't availble until SQL Server 2008. The applicaiton would have to know how many reference rows there were per data, set flags accordingly and maintain those if the number changed. Just use the n:m approach.

Related Solutions

Mysql – Database design that handles data growth with time

Your first tactic should be good indexing. Somewhere down the road, you'll want to consider partitioning.

You always want to think about sargable expressions.

Sql-server – SSIS Data Flow Task Excel to SQL table NULL value will not work for small INT datatype

My original answer was working (changing int to nvarchar) but I ran into another column which contained dates in the excel source file where some cells contained the string "NULL" (I did not want to have dates in a NVARCHAR column.) When SSIS got to this date column it was generating an error because it could not convert the string "NULL" to a date. It was not properly reading the "NULL" as a NULL value, instead it was reading it as a string. The way I was able to resolve this issue is by adding a derived column component to the package that replaced the date field by using the following expression

(DT_WSTR,255)Auth_from_date == "NULL" ? NULL(DT_WSTR,50) : (DT_WSTR,255)Auth_from_date

this is just an IF statement that will change the string "NULL" to an actual NULL value. The date is then passed thru a data conversion component and converted into DATETIME format.

Best Answer

Related Solutions

Mysql – Database design that handles data growth with time

Sql-server – SSIS Data Flow Task Excel to SQL table NULL value will not work for small INT datatype

Related Question