Sql-server – Text Storage and Database Design Optimization in a SQL Server Database

database-designoptimizationsql server

Given a database with a number of columns (c), where some of these columns contain text that is one of multiple values due to a set CONSTRAINT on the columns, what is the most efficient way to structure this database? Also considering the number of rows (r) the database may grow to in size.

Example: given a table with three columns, one being an auto incrementing integer ID column, another being a Fruit column where acceptable fruits are 'Apple','Banana','Orange', and 'Strawberry', along with the final column being Color with 'Red', 'Orange', 'Green', and 'Yellow' as acceptable values. Let us assume the Fruit and Color column have a data type of varchar(50). Which of the following would be a good practice and suitable database design?

  1. Should each value be directly represented in the table as follows given the previous example. Would it be fair to say that after a large number of rows querying such a table would be very inefficient due to the large storage and computation requirements required on the text?

    +-----+--------+--------+
    | ID | Fruit | Color |
    +-----+--------+--------+
    | 1 | Banana | Yellow |
    | 2 | Apple | Green |
    | ... | ... | ... |
    | 500 | Orange | Orange |
    +-----+--------+--------+

    OR

  2. For each column that has one of multiple values applicable to it, should that be represented in another table with integer values, and then joined together with an INNER JOIN? Followed by an integer representation in the main table? Does this depend on the length of the text values being stored in that field? For the given example, we would then have three tables.

    +----+------------+
    | ID | FruitName |
    +----+------------+
    | 1 | Apple |
    | 2 | Banana |
    | 3 | Orange |
    | 4 | Strawberry |
    +----+------------+
    +----+------------+
    | ID | FruitColor |
    +----+------------+
    | 1 | Red |
    | 2 | Green |
    | 3 | Orange |
    | 4 | Yellow |
    +----+------------+

    Then followed by:

    +-----+-------+-------+
    | ID | Fruit | Color |
    +-----+-------+-------+
    | 1 | 2 | 4 |
    | 2 | 1 | 2 |
    | ... | ... | ... |
    | 500 | 3 | 3 |
    +-----+-------+-------+

Is there any mathematical 'rule-of-thumb' when designing a database considering the number of rows and the type of data to be stored in each column? Or should the second schema be followed in general.

Best Answer

If all apples are green and all bananas are yellow a Fruit table (ID int, Name varchar(50), colour varchar(50)) would be appropriate, with your data table having a foreign key to it.

If you have yellow apples and orange bananas (yea for genetic engineering!), but only certain combinations are permitted you will need FruitRainbow(ID int, FruitID, ColourID) with the latter two as FKs to your option 2 tables above and your data table having a FK to FruitRainbow.

If any fruit can occur in any colour, and you don't want to limit those combinations in advance, your option 2 is fine.

If your query is really about resource optimisation rather than relational integrity then you'll have to decide what you want to gain and what you're willing to trade to get it. By using integer FKs instead of natural name you get a smaller disk footprint at the cost of runtime load. There are no free lunches. Pick your problem and solve it in the knowledge of the compromises it will entail.