Preventing duplicate data when usual normalization only makes sense for some of the records

database-designduplicationnormalization

Background

I have two tables, one called issues and the other called status_checks, which both contain (among others) the field comment (VARCHAR).

An issue is raised by a user when something needs to be looked at by the support staff. A status_check on the other hand is something that is done every so often by support staff and can (but doesn't have to) result in an issue being raised.

So, for instance, a status_check might look like this:

+----+---------+---------+--------------+---------+------------+
| id | user_id | date    | satisfactory | comment |  issue_id  |
+----+---------+---------+--------------+---------+------------+
| 1  |    23   | 2014... |      1       | blah... |    NULL    |
+----+---------+---------+--------------+---------+------------+

or like this:

+----+---------+---------+--------------+-------------------+------------+
| id | user_id | date    | satisfactory |      comment      |  issue_id  |
+----+---------+---------+--------------+-------------------+------------+
| 2  |    24   | 2525... |      0       | It all started... |     25     |
+----+---------+---------+--------------+-------------------+------------+

(Typically, an issue will only be raised if it's unsatisfactory (0), but that's not required.)

Then, the issue for that second one would look something like this:

+----+---------+---------+-------------------+---------+--------------+
| id | user_id | date    |      comment      |  status | close_reason |   
+----+---------+---------+-------------------+---------+--------------+
| 2  |    24   | 2525... | It all started... |  open   |     NULL     |
+----+---------+---------+-------------------+---------+--------------+

The Problem

So the question is… how can I avoid duplicating the comment field when an issue is raised by a status_check? I could leave that field NULL when an issue is raised, and simply put the comment contents in the issue table (or vice versa)… but then that makes the SQL retrieval logic more complex. Having to optionally join to another table based on a field being possibly null seems like overkill, but I can't think of a better solution.

Normally, one would normalize by putting the comment in an issue and always having an issue_id in the status_checks table, but obviously that doesn't make sense here, since most status_checks are not issues.

Possible solution?

The other way I thought of doing things seems like normalization gone wrong to me, but maybe it's actually better. That solution is to have a comments table:

+----+---------+
| id | comment |
+----+---------+
| .. | ....... |
+----+---------+

Then I would just include a comment_id in both tables, rather than the comment itself. It's still duplicating the data, but it's only duplicating an integer.

Can some other way of normalizing help me here? Or is one of the above the best I'm going to get?

EDIT: Looking at it now, I see that the user_id and date are also both duplicated, so maybe it's natural that the comment be duplicated too? I guess it's only the size of the comment that makes me queasy about duplicating it. The comments-only table looks ugly, since a comment is such a general concept and the comments by themselves would just be floating in there with no context.

Best Answer

If I understand your dilemma correctly, you have:

  • Two tables, each of which can have a comment
  • One of the tables is optional, i.e. it may not have an entry to correspond to the other
  • For the second table, the comment is also optional, such that even if the second table has a record to match the first, the comment in the second may just "default" to a copy of the first.
  • On the other hand, the second table may have a distinct comment after all.

If this is the case, your concern is that recording the "default" (duplicate) comment in the second table is wasteful or even dangerous since the data could get out of whack.

In this situation, you can use the SQL COALESCE function along with a LEFT OUTER JOIN to solve your problem.

Using the outer join lets you use a single SQL statement that pulls together the two tables (when there are records in each) or just pull data from the first (mandatory) table if the second (optional) record is missing. No complicated branching, just a single SQL select. The coalesce allows you to pick the first non-NULL value in a list of values. This is useful because it lets you take the first comment as a default if the second comment is NULL. The second comment can be NULL because either (a) there is no matching record in the second table or (b) the comment within the second table is NULL.

It seems like in your case there will always be an ISSUE entry. Sometimes this entry is created by a user and sometimes it is the result of an unsatisfactory STATUS_CHECK.

In this case you want to select:

...
COALESCE(I.comment, C.comment) as Remark
FROM ISSUE I LEFT OUTER JOIN STATUS_CHECK C
ON I.ID = C.ISSUE_ID
...

The net effect of this is that if you have a satisfactory status check, it will show the status check comment. If you have an unsatisfactory status check or an independently raised issue, it will show the (non-NULL) comment from the issue table.