SQL Server Flag vs Table Split – Database Design Considerations

database-designsql server

I am designing a table of items which will (potentially) contain tens of millions of records. Some items will not be available for use until they are "approved" by administrator. By "use" I mean that such items will not be referenced in any other table until they are "approved". Up to 50% of items may be "unapproved" at any given time. Records may become "approved", but not vice versa.

I consider two design options:

a bit flag
a separate table of "unapproved" items – when item is approved it is moved to "regular" table (renewal of item's ID is not an issue)

I think the second option is much better. Bit flag takes only a byte per row, so it is not an issue. But if we have a million of approved and a million of unapproved records in same table – scan time increases for operations with approved records.

Question is: should I consider first (bit flag) option instead? Does it have any benefits in described situation?

Best Answer

You can have it both ways with partitioned views.

You create an underlying table for each status, enforced by constraints, with mutually exclusive values. Then a view which UNIONs together the underlying tables. The view or each base table can be referenced explicitly. If a row's status is UPDATEd through the view the DBMS will DELETE it from one base table and insert it into the one corresponding to the new status. Each base table can be indexed independently according to its usage pattern. The optimiser will resolve index references to a single corresponding base table if it can.

The benefits are
a) shallower indexes. Do the math on the index fan-out, however. At that scale and split between your status values it is possible the indexes will be the same depth on the split tables as they would be on the combined table.
b) no application code has to change. The data continues to appear as a continuous whole.
c) future new status values can be included by adding a new base table, with constraint, and re-creating the view.

The cost is all that data movement; two pages and associated indexes are written for each status update. Lots of IO to deal with. That much movement will cause fragmentation, too.

Related Solutions

Hibernate modelling a N:N Relationship

Yes, a bridge table (also known as junction table, mapping table etc.) is a perfectly sensible solution in this case, and it is common for a bridge table to have other attributes beside the references to the tables being connected.

As a stationary object, a stop is just a place, which has coordinates and may also have a fixed name as well as some other properties. As a part of a route, though, it can be assigned the additional attributes you have mentioned that would depend on the route, i.e. on the specific bus number. Those would need to be stored in a different table, the one that would link the stop and the route.

So, if by "bus" you mean "bus number" (or "bus route", "bus line"), then by all means create a dedicated table with references to Bus and Stop and with whatever additional attributes you deem necessary. Note, though, that if Bus is meant to store information about specific buses (vehicles), then I would consider creating another entity (called Route perhaps) that I would link to in the bridge table. Bus would then probably reference that new entity, either directly or, again, through another bridge table (in case vehicles were allowed to work alternately on different lines).

SQL Server 2012 – Inner Join Estimated Number of Rows Issue

I was able to reproduce your issue in SQL Server 2014 with the legacy CE and TF 4199 enabled. I used a rowstore fact table because I have little experience with columnstore.

For your query with the filter, the query optimizer rewrites the query to something like this:

select * 
from fact_sales f
join dim_company d on f.company_SK = d.company_SK 
where f.company_SK = 1 and d.company_SK = 1

For that query the optimizer can directly uses the histogram step containing 1 for both tables and you get 467,583,000 X 1 as the estimated number of rows.

The query without a filter will use linear interpolation:

For joins with a single equality or inequality predicate, the legacy CE joins the histograms on the join columns by aligning the two histograms step-by-step using linear interpolation.

Note that the statistics for your dimension table do not have a histogram step with a RANGE_HI_KEY value of 1, which is where almost all of the data is in your fact table. Something is going wrong with the linear interpolation step for that value. If I create your dimension table without a 0 value then the histogram gets a step with a RANGE_HI_KEY value of 1. That fixes the estimate. In short, you got unlucky with your histogram. I don't think there is a supported way to force the histogram on the dimension table to contain a step for 1, unless perhaps you are willing to add additional filters to your query.

I don't know the exact rules of your data, but I can give you two workarounds. The first workaround is to enable trace flag 2301 for the query. That fixes the estimate for this part of the query but may have other negative effects for the rest of your query. This trace flag is documented by MS, but talk to your DBA before using it.

The second workaround is to encourage the query optimizer to use a different method of cardinality estimation for the join. The following query gave a good estimate for me and only does a single scan on your fact table:

select * 
from FACT_SALES f
join (SELECT COMPANY_SK FROM DIM_COMPANY UNION SELECT 1) d
on f.company_SK = d.company_SK;

Best Answer

Related Solutions

Hibernate modelling a N:N Relationship

SQL Server 2012 – Inner Join Estimated Number of Rows Issue

Related Question