Database Design – Redesigning a Table with Many Unrelated Columns

database-design table

I'm new to database stuff, so I apologize if I say anything ridiculous.

I work for a company that runs a website for a certain industry. We're not very large at the moment, but our parent company wants to push our product out to more places. I'm worried about this because our database was never designed with ANY plan in mind (e.g., normalization doesn't exist, maybe 70 of our roughly 400 tables are unused, plenty more columns are unused on the tables we do use, etc.) and we're running on a really shitty database program (Pervasive SQL).

The company has at least recognized the need to move away from Pervasive. We're planning to go to MariaDB. Along with this comes a rare opportunity to redesign some (or all) of the database, but no one really cared to look into it, and that's how I ended up here.

SO, all that being said I just want to know if I'm on the right track here with my redesign so far.

Let's say we have a table that holds basic settings for companies on our system. Some of these settings are relevant to notifications (i.e., which events will trigger e-mails for them), some are related to billing, some to how they create orders or what info they see on their orders, etc. This table has 334 columns, and a LARGE number of them only store a CHAR(1) which can be Y/N. (Side note: am I correct in assuming BIT will take up less space?)

What I've been doing is separating the parts (i.e., settings for orders go in their own company_order_settings table, which has the company's id in it). The only thing is that at this point I already have ~8 tables just for these different settings. Is this normal? In general these things will (usually) be accessed separately, so I'm not concerned about large joins. This large number of purely settings-related tables will also happen when I refactor the user tables (which hold info on the users at said companies, in a similar number of columns).

If anyone wants more info let me know.

Best Answer

Not really an answer - but too big for a comment. Just a few ideas/discussion.

"our database was never designed with ANY plan in mind" - I hear yeh, sister!

Been there, done that. I once worked with a Btrieve (precursor to Pervasive) system that had 35,000 (not a typo - that's thirty-five thousand) fields in one table. I agree with @timpone: I'm hesitant about tables with > 30 fields. Tables should be tall and slim (many rows, few columns), not short and fat :-).

It's a lot of work to redesign. Start with core functionality - what's the minimum number of tables required? I would consider PostgreSQL if I were you, but like @Austin, that's just my opinion.

Read up on the different RDBMSs and check out their functionality. Is there anything Btrieve offers that they lack (or the reverse)? As far as I can see, the main thing that Btrieve brought to the party was arrays - they are a breach of the relational model and should be avoided at all costs in an RDBMS.

You say "a LARGE amount of them only store a CHAR(1) which can be Y/N. (Side note: am I correct in assuming BIT will take up less space?)". With space being so cheap these days, this is not really an issue.
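For what it's worth, the split described in the question might look roughly like the sketch below. It uses Python's built-in sqlite3 module only so that it runs anywhere; the MariaDB DDL would differ in details. Apart from company_order_settings, every table and column name here (show_prices, require_po_number, email_on_order_placed and so on) is made up for illustration. The point is that using the company id as both primary key and foreign key keeps each settings table strictly one row per company.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
CREATE TABLE company (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- The PK doubles as the FK, so each company gets at most one settings row.
-- Column names below are hypothetical examples, not the real schema.
CREATE TABLE company_order_settings (
    company_id        INTEGER PRIMARY KEY REFERENCES company(id),
    show_prices       INTEGER NOT NULL DEFAULT 0 CHECK (show_prices IN (0, 1)),
    require_po_number INTEGER NOT NULL DEFAULT 0 CHECK (require_po_number IN (0, 1))
);
CREATE TABLE company_notification_settings (
    company_id            INTEGER PRIMARY KEY REFERENCES company(id),
    email_on_order_placed INTEGER NOT NULL DEFAULT 1 CHECK (email_on_order_placed IN (0, 1))
);
""")

conn.execute("INSERT INTO company (id, name) VALUES (1, 'Acme')")
conn.execute("INSERT INTO company_order_settings (company_id, show_prices) VALUES (1, 1)")
conn.execute("INSERT INTO company_notification_settings (company_id) VALUES (1)")

# Each settings group is read on its own, so no wide join is needed.
order_settings = conn.execute(
    "SELECT show_prices, require_po_number "
    "FROM company_order_settings WHERE company_id = ?",
    (1,),
).fetchone()
```

Whether eight such tables is too many depends on how the settings are actually read; if they really are accessed separately, as the question says, the one-to-one split costs very little.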

Related Solutions

Importing order information

I think the best way to approach this would be to have tables like so:

table Customer
    ID
    Name
    ...

table CustomerFields
    CustomerID
    FieldID
    FieldName
    ... (could have things like type, required/optional, size, etc.)

table CustomerOrders
    CustomerID
    OrderID
    OrderNumber
    DeliveryDate
    ... (all your generics and mappable fields)

table CustomerOrderFields
    CustomerID
    OrderID
    FieldID
    FieldValue

By having the table CustomerOrders, you satisfy the ability to map common customer order data to common fields (simplifying reporting at the expense of making the import a little more painful since the fields must be mapped), and the CustomerOrderFields gives you the ability to have the custom fields per customer necessary for the un-mappable data.

The custom fields are still reportable, but not as easily as your generics as they'll come to you in multiple rows (instead of multiple columns). There are some ways around all that depending on your report creator (e.g., pivoting the results).
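As a rough sketch of the pivoting mentioned above, the following uses Python's built-in sqlite3 module and conditional aggregation to turn the per-field rows back into columns. The table and column names follow the layout above, but the key choices, the sample field names (GiftWrap, DockNumber) and the data are assumptions for illustration only; a real report tool may have its own pivot feature.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CustomerFields (
    CustomerID INTEGER NOT NULL,
    FieldID    INTEGER NOT NULL,
    FieldName  TEXT    NOT NULL,
    PRIMARY KEY (CustomerID, FieldID)
);
CREATE TABLE CustomerOrderFields (
    CustomerID INTEGER NOT NULL,
    OrderID    INTEGER NOT NULL,
    FieldID    INTEGER NOT NULL,
    FieldValue TEXT,
    PRIMARY KEY (CustomerID, OrderID, FieldID),
    FOREIGN KEY (CustomerID, FieldID)
        REFERENCES CustomerFields (CustomerID, FieldID)
);
-- Hypothetical sample data: customer 1 defines two custom fields.
INSERT INTO CustomerFields VALUES (1, 1, 'GiftWrap'), (1, 2, 'DockNumber');
INSERT INTO CustomerOrderFields VALUES
    (1, 100, 1, 'Y'), (1, 100, 2, '7'),
    (1, 101, 1, 'N');
""")

# One output row per order; each custom field becomes its own column.
# A field the order never filled in comes back as NULL (None in Python).
pivot_sql = """
SELECT o.OrderID,
       MAX(CASE WHEN f.FieldName = 'GiftWrap'   THEN o.FieldValue END) AS GiftWrap,
       MAX(CASE WHEN f.FieldName = 'DockNumber' THEN o.FieldValue END) AS DockNumber
FROM CustomerOrderFields AS o
JOIN CustomerFields AS f
  ON f.CustomerID = o.CustomerID AND f.FieldID = o.FieldID
WHERE o.CustomerID = ?
GROUP BY o.OrderID
ORDER BY o.OrderID
"""
rows = conn.execute(pivot_sql, (1,)).fetchall()
```

The catch is that the column list in the query has to be built per customer, since each customer defines its own fields.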

The only other option would be to do something like this (which, personally, I would avoid):

table CustomerFields
    CustomerID
    Field1Name
    Field2Name
    Field3Name
    ...
    Field99Name

table CustomerOrders
    (all your generics)
    Field1Value
    Field2Value
    ...
    Field99Value

This has the advantage that all your data is in one row, but also has the disadvantage that it isn't immediately apparent from the row what each value means. The first method can always be joined to the field list to give a good definition of the data in each custom field. In addition, what happens if some company requires 100 custom fields? In the above example, you'd be making changes to your data structures and code, whereas in the first example you'd never have the issue -- customers could have an unlimited number of custom fields.

I've seen it done both ways, and both ways work. Both ways have their downsides and upsides. The first is far more scalable, but harder to get into columns (instead of rows). Everything's a trade-off.

Hope that helps some!

Database Design – Tracking History Without Losing Referential Integrity

This is a long-standing problem with lots of awkward solutions like the ones you referenced. The best ones still leave the problem of referential integrity. How do you reference an entity that changes over time and you want to maintain all the versions?

The solution actually turns out to be very simple: normalization. When we normalize an entity, there is a master entity table. This contains the PK of the entity and all other attributes. All FKs from other tables refer to this table as there is one and only one entry for any given entity.

After normalization, this table still exists but there is (at least) one other table that contains some of the attributes. There could be zero, one or several entries for any unique entity but that is fine. In fact, the PK of this table is the PK of the entity (which is also a FK back to the master entity table) and some other attribute.

So now you implement Version Normal Form. Take all the attributes of the main entity and separate out those attributes that change over time and whose changes you want to track. What you have is the master entity table with the one and only PK of the entity (referential integrity works as normal) and all attributes that either don't change over time or whose changes aren't tracked. Then the "version" table contains all the attributes that change and are tracked. The PK of this table is the entity PK and the change date (the date and time the version took effect).

A partial list of the advantages of this scheme:

It uses normalization, a process we all know and work with daily.
Referential integrity is not affected.
Current and past versions are in the same table. In fact, the query to access the current state of the data and the query to look back into the past is the same query.
The change date can be set to the future for those instances when a state change is known ahead of time. Such planned changes do not appear in queries until the change time is reached.
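A minimal sketch of the master/version split and the single "as of" query, using Python's built-in sqlite3 module. The employee example, the column names and the sample dates are all hypothetical and are not taken from the slides or the document; timestamps are stored as ISO-8601 text so that string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Master table: the one and only PK, plus attributes that are not tracked.
CREATE TABLE employee (
    id        INTEGER PRIMARY KEY,
    hire_date TEXT NOT NULL
);
-- Version table: tracked attributes, keyed by entity PK + effective time.
CREATE TABLE employee_version (
    employee_id INTEGER NOT NULL REFERENCES employee(id),
    effective   TEXT    NOT NULL,
    job_title   TEXT    NOT NULL,
    PRIMARY KEY (employee_id, effective)
);
INSERT INTO employee VALUES (1, '2020-01-15');
INSERT INTO employee_version VALUES
    (1, '2020-01-15 00:00:00', 'Analyst'),
    (1, '2022-06-01 00:00:00', 'Senior Analyst'),
    (1, '2999-01-01 00:00:00', 'Director');  -- a planned, future change
""")

# The same query serves "now" and "back then"; only :as_of differs.
AS_OF_SQL = """
SELECT e.id, v.job_title
FROM employee AS e
JOIN employee_version AS v ON v.employee_id = e.id
WHERE e.id = :id
  AND v.effective = (SELECT MAX(v2.effective)
                     FROM employee_version AS v2
                     WHERE v2.employee_id = e.id
                       AND v2.effective <= :as_of)
"""

def title_as_of(employee_id, as_of):
    """Return the job title in effect at as_of, or None if no version applies."""
    row = conn.execute(AS_OF_SQL, {"id": employee_id, "as_of": as_of}).fetchone()
    return row[1] if row else None
```

Passing the current timestamp as `as_of` gives the current state; the future-dated 'Director' row stays invisible until its effective time is reached. Other tables keep pointing at `employee.id`, so their foreign keys never need to know about versions.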

There are two methods of "soft delete" that can be used. A "deleted" datetime attribute can be added to the master record with the default being NULL or (as I prefer) the magic date 9999-12-31. This is what I call a "firm" delete because it should not be undo-able. The other method is to add an "isDeleted" attribute to the version record. A delete operation would insert a version with "isDeleted" set to TRUE (or 'Y' or whatever). The deletion would be just another version. So to undelete, a new version with "isDeleted" turned off would bring the entity back into existence -- with the times it was deleted and then undeleted maintained like any other state change. Of course, both methods may be used if needed.
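The second method (deletion as just another version) could be sketched like this, again with Python's built-in sqlite3 module. The item tables, the is_deleted column name and the dates are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (id INTEGER PRIMARY KEY);
CREATE TABLE item_version (
    item_id    INTEGER NOT NULL REFERENCES item(id),
    effective  TEXT    NOT NULL,
    name       TEXT    NOT NULL,
    is_deleted INTEGER NOT NULL DEFAULT 0 CHECK (is_deleted IN (0, 1)),
    PRIMARY KEY (item_id, effective)
);
INSERT INTO item VALUES (1);
INSERT INTO item_version VALUES
    (1, '2023-01-01 00:00:00', 'Widget', 0),  -- created
    (1, '2023-03-01 00:00:00', 'Widget', 1),  -- deleted: just another version
    (1, '2023-05-01 00:00:00', 'Widget', 0);  -- undeleted: yet another version
""")

def exists_as_of(item_id, as_of):
    """True if the latest version at or before as_of is not flagged deleted."""
    row = conn.execute(
        """
        SELECT is_deleted
        FROM item_version
        WHERE item_id = ? AND effective <= ?
        ORDER BY effective DESC
        LIMIT 1
        """,
        (item_id, as_of),
    ).fetchone()
    return row is not None and row[0] == 0
```

Nothing is ever updated or removed here: deleting and undeleting are both inserts, which is why the full history of when the entity was "gone" survives.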

Here are the slides of a presentation I have made a few times at tech fairs. It covers how all the above can be done including the queries. And here is a document that goes into a lot more detail.
