How to handle table design with variable columns

Architecturedatabase-designnormalizationtable

I have a table design scenario and as a non-DBA type, would like opinions on which is more scaleable.

Say you are asked to record information on houses for a metro area, starting with a small neighborhood (200 houses) but eventually growing to 5000000+ houses.

You are required to store base information: ID# (A unique lot # we can use as a unique index), Addr, City, State, Zip. Fine, simple table will handle it.

But each year, you are going to be asked to record extra information about all of the houses – and WHAT information will change each year. So, for example, first year, you are asked to record the owners last name and square footage. The second year, you are asked to keep the last name, but dump the square footage and instead begin collecting the owners first names.

Lastly – each year the # of extra columns will change. Might start with 2 extra columns, then go to 6 next year, then back down to 2.

So one table approach is to try to add the custom information as columns in the house tables so there is only one table.

But I have a situation where someone laid out the tables for this as:

"House Table" columns: ID, Addr, City, State, Zip – with one row per house

ID   Addr              City     State  Zip 
-------------------------------------------
1    10 Maple Street   Boston      MA  11203

2    144 South Street  Chelmsford  MA  11304

3    1 Main Avenue     Lowell      MA  11280

"Custom Info Table" columns: ID, Name, Value – with table looking like:

ID   Name             Value

1    Last Name        Smith

2    Last Name        Harrison

3    Last Name        Markey

1    Square Footage   1200

2    Square Footage   1930

3    Square Footage 

So there are multiple of rows for each individual house record. Each year when the optional information required changes, this table is literally rebuilt, so next year it might look like:

1    Last Name    Smith

2    Last Name    Harrison

3    Last Name    Markey

1    First Name   John

2    First Name   Harry

3    First Name   Jim

Eventually you amass 100,000 house rows AND one year there are 10 extra pieces of information; the second table now is 1,000,000 rows of information, many of which have redundant (description) information. The database requirements overall are that people will need to get the house row information + associated custom field values thousands of times per day.

So my question: Would it be bad (or horrible) practice to instead either:

A) Lay out house table with guess at max # of custom columns (called perhaps "1" through "10") and insert those custom values right in the house rows

OR

B) Store the custom information in the house table, but each year when the requirements change, rebuild the house table with only the # of columns needed for custom information, with the idea that the requirements could go nuts and you never know how many maximum optional fields might be asked for?

Thanks, hope this makes sense!

Best Answer

You have pretty much 4 choices:

NoSQL - definition Every record is stored as a set of Key/Value pairs. It is very flexible and fast. Not all of the report writers out there support this style of storage. There are many example database implementations of NoSQL. The one that seems to be most popular right now, is MongoDB.

EAV - definition This is where you turn either the whole table or a portion (in another table) on its side. This is a good choice if you already have a relational database in-house that you can't move away from easily. The custom info table example you gave is a good example of an EAV table.

Standard tables with XML columns - Think of this as NoSQL meets relational tables. The data stored in an XML column can be any format that XML supports, including multiple correlated sub-data. For the columns that you know are going to be "regular" columns, they can be built as the appropriate type of column to store the data (LastName, Address, City, State, etc.).

Standard tables with lots of extra columns - You have a relational database, you can't use either XML or EAV, and NoSQL is not an option. Add lots of extra columns of each type. I would guess 30 or more varchar, 30 or more integer, 15 or more numerics. And once you use a column for a value, don't re-use it. And don't delete the column either.

Out of all of these solutions, my own opinion is that you will find either the NoSQL or the EAV approach to be the most successful with the least amount of refactoring your code and your schema.

You will have a situation where you collect data one year, not the next, and then collect it again afterward. Trying to get the older data updated with the correct information is problematic and expensive. Storage is neither.