Design a db table where rows may have NULL in multiple columns

database-designnormalization

I am trying to design a database for tour packages. Tour packages contains the detailed itinerary.

Tour package duration may be between 3 and 30 days. I need to add the description of each day.
If I make 30 columns like day_1, day_2, … day_30 .
While creating the package for 5 days the rest of 25 Columns will be empty/NULL.

There will be constraints that all packages should have less than 30 days.

How can I improve this design ?

Best Answer

You normally would have your data normalized. This means you will have one table for the tour packages, and another one for each days itinerary:

CREATE TABLE tour_packages
(
    tour_package_id INTEGER PRIMARY KEY,
    tour_package_name varchar(150),
    other_data varchar(1000)
) ;

CREATE TABLE package_schedules
(
    tour_package_id INTEGER NOT NULL REFERENCES tour_packages(tour_package_id) /* ON DELETE CASCADE */,
    day INTEGER NOT NULL,  -- Day 1 of itinerary, day 2, ...
    day_description varchar(1000) NOT NULL,

    -- The rows in this table are identified by a combination of tour_package_id and day
    PRIMARY KEY (tour_package_id, day),
    -- We can have the database check that we don't have day 0, or day negative
    CHECK (day >= 1),
    -- If you want to limit the number of days:
    CHECK (day <= 30)
) ;

With the proper CONSTRAINTS defined for your tables (which include in this case the PRIMARY KEYS, the REFERENCES and CHECKS) you can be quite sure that your data follows the rules.

The only thing that an SQL database (as of now) cannot do for you is to make sure that you don't have "missing days" just by declaring constraints.

That is, the situtation produced by the following INSERTs cannot be avoided, and should be taken care of at the application level:

We have one tour package:

INSERT INTO tour_packages
    (tour_package_id, tour_package_name)
VALUES
    (1, 'My first tour') ;

We add now the schedules for days 1 and 3, but we miss day 2:

INSERT INTO package_schedules
    (tour_package_id, day, day_description)
VALUES
    (1, 1, 'First day of first tour: Welcome') ;

-- We are missing day 2

INSERT INTO package_schedules
    (tour_package_id, day, day_description)
VALUES
    (1, 3, 'Third day of first tour: Going to visit Honolulu') ;

This could actually be accomplished on the unnormalized scenario that you provided, with a large collection of CHECK constraints; where each day_n can only be null if day_n+1 is also null. That is, all the NULL values must be on the right side and be consecutive.

CHECK(day_1  IS NOT NULL)
CHECK(day_2  IS NOT NULL OR (day_2 IS NULL AND day_3 IS NULL))
...
CHECK(day_29 IS NOT NULL OR (day_29 IS NULL AND day_30 IS NULL))

I'd say that the advantage that the unnormalized design has because of this is far outweighted by all the inconvenients you may have. The day that you will move from 30 day-max packages to 90 day-max will become absolutely obvious. The day that you want to query: Is there any package where any scheduled day moves through Honolulu? you will find out as well.

The "missing day" is usually dealt with at the application layer, not the database. The database could in theory also handle this (probably) by using triggers, or with a database that allows the definition of functions that can perform complicated checks, and put them into CHECK( no_empty_days(tour_package_id)). In practice, it's something I've seldomly found.

The previous definitions are standard SQL and run on all versions available in DBFiddle.uk (as of today). Take into account, however, that MariaDB (or MySQL) allow the declaration of CHECK constraints, but then do not enforce them.

You can check everything at dbfiddle here

Related Solutions

Mysql – Two nullable columns one required to have value

A version of JD Schmidt's answer, but without the awkwardness of an extra column:

CREATE TABLE foo (
  FieldA INT,
  FieldB INT
);

DELIMITER //
CREATE TRIGGER InsertFieldABNotNull BEFORE INSERT ON foo
FOR EACH ROW BEGIN
  IF (NEW.FieldA IS NULL AND NEW.FieldB IS NULL) THEN
    SIGNAL SQLSTATE '45000'
    SET MESSAGE_TEXT = '\'FieldA\' and \'FieldB\' cannot both be null';
  END IF;
END//
CREATE TRIGGER UpdateFieldABNotNull BEFORE UPDATE ON foo
FOR EACH ROW BEGIN
  IF (NEW.FieldA IS NULL AND NEW.FieldB IS NULL) THEN
    SIGNAL SQLSTATE '45000'
    SET MESSAGE_TEXT = '\'FieldA\' and \'FieldB\' cannot both be null';
  END IF;
END//
DELIMITER ;

INSERT INTO foo (FieldA, FieldB) VALUES (NULL, 10); -- OK
INSERT INTO foo (FieldA, FieldB) VALUES (10, NULL); -- OK
INSERT INTO foo (FieldA, FieldB) VALUES (NULL, NULL); -- gives error
UPDATE foo SET FieldA = NULL; -- gives error

How to handle table design with variable columns

You have pretty much 4 choices:

NoSQL - definition Every record is stored as a set of Key/Value pairs. It is very flexible and fast. Not all of the report writers out there support this style of storage. There are many example database implementations of NoSQL. The one that seems to be most popular right now, is MongoDB.

EAV - definition This is where you turn either the whole table or a portion (in another table) on its side. This is a good choice if you already have a relational database in-house that you can't move away from easily. The custom info table example you gave is a good example of an EAV table.

Standard tables with XML columns - Think of this as NoSQL meets relational tables. The data stored in an XML column can be any format that XML supports, including multiple correlated sub-data. For the columns that you know are going to be "regular" columns, they can be built as the appropriate type of column to store the data (LastName, Address, City, State, etc.).

Standard tables with lots of extra columns - You have a relational database, you can't use either XML or EAV, and NoSQL is not an option. Add lots of extra columns of each type. I would guess 30 or more varchar, 30 or more integer, 15 or more numerics. And once you use a column for a value, don't re-use it. And don't delete the column either.

Out of all of these solutions, my own opinion is that you will find either the NoSQL or the EAV approach to be the most successful with the least amount of refactoring your code and your schema.

You will have a situation where you collect data one year, not the next, and then collect it again afterward. Trying to get the older data updated with the correct information is problematic and expensive. Storage is neither.

Best Answer

Related Solutions

Mysql – Two nullable columns one required to have value

How to handle table design with variable columns

Related Question