Preventing duplicate data when usual normalization only makes sense for some of the records

database-designduplicationnormalization

Background

I have two tables, one called issues and the other called status_checks, which both contain (among others) the field comment (VARCHAR).

An issue is raised by a user when something needs to be looked at by the support staff. A status_check on the other hand is something that is done every so often by support staff and can (but doesn't have to) result in an issue being raised.

So, for instance, a status_check might look like this:

+----+---------+---------+--------------+---------+------------+
| id | user_id | date    | satisfactory | comment |  issue_id  |
+----+---------+---------+--------------+---------+------------+
| 1  |    23   | 2014... |      1       | blah... |    NULL    |
+----+---------+---------+--------------+---------+------------+

or like this:

+----+---------+---------+--------------+-------------------+------------+
| id | user_id | date    | satisfactory |      comment      |  issue_id  |
+----+---------+---------+--------------+-------------------+------------+
| 2  |    24   | 2525... |      0       | It all started... |     25     |
+----+---------+---------+--------------+-------------------+------------+

(Typically, an issue will only be raised if it's unsatisfactory (0), but that's not required.)

Then, the issue for that second one would look something like this:

+----+---------+---------+-------------------+---------+--------------+
| id | user_id | date    |      comment      |  status | close_reason |   
+----+---------+---------+-------------------+---------+--------------+
| 2  |    24   | 2525... | It all started... |  open   |     NULL     |
+----+---------+---------+-------------------+---------+--------------+

The Problem

So the question is… how can I avoid duplicating the comment field when an issue is raised by a status_check? I could leave that field NULL when an issue is raised, and simply put the comment contents in the issue table (or vice versa)… but then that makes the SQL retrieval logic more complex. Having to optionally join to another table based on a field being possibly null seems like overkill, but I can't think of a better solution.

Normally, one would normalize by putting the comment in an issue and always having an issue_id in the status_checks table, but obviously that doesn't make sense here, since most status_checks are not issues.

Possible solution?

The other way I thought of doing things seems like normalization gone wrong to me, but maybe it's actually better. That solution is to have a comments table:

+----+---------+
| id | comment |
+----+---------+
| .. | ....... |
+----+---------+

Then I would just include a comment_id in both tables, rather than the comment itself. It's still duplicating the data, but it's only duplicating an integer.

Can some other way of normalizing help me here? Or is one of the above the best I'm going to get?

EDIT: Looking at it now, I see that the user_id and date are also both duplicated, so maybe it's natural that the comment be duplicated too? I guess it's only the size of the comment that makes me queasy about duplicating it. The comments-only table looks ugly, since a comment is such a general concept and the comments by themselves would just be floating in there with no context.

Best Answer

If I understand your dilemma correctly, you have:

Two tables, each of which can have a comment
One of the tables is optional, i.e. it may not have an entry to correspond to the other
For the second table, the comment is also optional, such that even if the second table has a record to match the first, the comment in the second may just "default" to a copy of the first.
On the other hand, the second table may have a distinct comment after all.

If this is the case, your concern is that recording the "default" (duplicate) comment in the second table is wasteful or even dangerous since the data could get out of whack.

In this situation, you can use the SQL COALESCE function along with a LEFT OUTER JOIN to solve your problem.

Using the outer join lets you use a single SQL statement that pulls together the two tables (when there are records in each) or just pull data from the first (mandatory) table if the second (optional) record is missing. No complicated branching, just a single SQL select. The coalesce allows you to pick the first non-NULL value in a list of values. This is useful because it lets you take the first comment as a default if the second comment is NULL. The second comment can be NULL because either (a) there is no matching record in the second table or (b) the comment within the second table is NULL.

It seems like in your case there will always be an ISSUE entry. Sometimes this entry is created by a user and sometimes it is the result of an unsatisfactory STATUS_CHECK.

In this case you want to select:

...
COALESCE(I.comment, C.comment) as Remark
FROM ISSUE I LEFT OUTER JOIN STATUS_CHECK C
ON I.ID = C.ISSUE_ID
...

The net effect of this is that if you have a satisfactory status check, it will show the status check comment. If you have an unsatisfactory status check or an independently raised issue, it will show the (non-NULL) comment from the issue table.

Related Solutions

Postgresql – Joining based on date (or date range)

That is going to be horribly messy, since you'll need to find the single (staffid, effectivefrom) pair that's the newest prior to the dateParameter.

What I'd do is add another field effectiveTo to go with effectiveFrom, and then when you add a new target for a staff member, before you add it check and see if there are any current targets (effectiveTo IS NULL) and if there are then set their end date to the new target start date (optionally: throw an error if there's more than one current target).

So, the table turns out looking like:

 staffID | target | effectiveFrom | effectiveTo
 --------|--------|---------------|-------------
  1      |  6.0   | 2012-01-01    | 2012-03-01
  2      |  6.0   | 2012-01-01    | NULL
  3      |  6.0   | 2012-01-01    | NULL
  1      |  7.0   | 2012-03-01    | NULL

That makes your search query look like this:

SELECT  othertable.*, targets.target
FROM    othertable
            INNER JOIN 
        targets ON othertable.staffid = targets.staffid
WHERE dateParameter > effectiveFrom AND (dateParameter <= effectiveTo OR effectiveTo IS NULL)

The potential downside of that is data inconsistency (i.e. staff members having overlapping targets) - you can fairly easily write a script to check that though. I'd recommend making an addTarget(staffID, target, fromDate) stored procedure and wrapping all the checking/updating logic in that, also.

(A couple of style notes - I rewrote your query slightly to seperate the join condition from the where clause. Also, I know it's only an example, but SELECT table.* is a bad habit to get into - you land up transferring far more data than you need.)

Mysql – Two nullable columns one required to have value

A version of JD Schmidt's answer, but without the awkwardness of an extra column:

CREATE TABLE foo (
  FieldA INT,
  FieldB INT
);

DELIMITER //
CREATE TRIGGER InsertFieldABNotNull BEFORE INSERT ON foo
FOR EACH ROW BEGIN
  IF (NEW.FieldA IS NULL AND NEW.FieldB IS NULL) THEN
    SIGNAL SQLSTATE '45000'
    SET MESSAGE_TEXT = '\'FieldA\' and \'FieldB\' cannot both be null';
  END IF;
END//
CREATE TRIGGER UpdateFieldABNotNull BEFORE UPDATE ON foo
FOR EACH ROW BEGIN
  IF (NEW.FieldA IS NULL AND NEW.FieldB IS NULL) THEN
    SIGNAL SQLSTATE '45000'
    SET MESSAGE_TEXT = '\'FieldA\' and \'FieldB\' cannot both be null';
  END IF;
END//
DELIMITER ;

INSERT INTO foo (FieldA, FieldB) VALUES (NULL, 10); -- OK
INSERT INTO foo (FieldA, FieldB) VALUES (10, NULL); -- OK
INSERT INTO foo (FieldA, FieldB) VALUES (NULL, NULL); -- gives error
UPDATE foo SET FieldA = NULL; -- gives error

Best Answer

Related Solutions

Postgresql – Joining based on date (or date range)

Mysql – Two nullable columns one required to have value

Related Question