Table design for a dictionary that can have words with many different spellings

database-design

I'm working on a small, personal dictionary database in Microsoft Access (the 2013 version). There are a lot of words in English that have two or even more spellings. Realistically speaking though, there are not that many words with three, let alone, four spellings. Nevertheless, they do exist. Examples include aerie/aery/eyrie/eyry (a word with four spellings) and ketchup/catsup/catchup (a word with three spellings). Not to mention that English is literally rife with words that have two spellings. Everybody knows that (the differences between the English and British spelling systems come immediately to mind). So, I need to design my tables in such a way that there are no significant flaws with the design. I'm going to explain step by step what the database should look like and introduce the problems I have found with my current design along the way. So, here we go.

All words, obviously, should be stored in the same table. And I'm not going to include irrelevant aspects of the design such as other columns that might be part of the table (in reality, the database is much more complex). Let's focus on the most important parts. Here's what the Words table with some pre-filled sample data will look like:

+---------+-----------+
| word_id | word      |
+---------+-----------+
|       1 | ketchup   |
|       2 | catsup    |
|       3 | catchup   |
|       4 | moneyed   |
|       5 | monied    |
|       6 | delicious |
+---------+-----------+

To keep track of a group of words that are the same, but just have different spellings, it is probably wise to choose one of them as the main word and the other ones as its child words. Here's the diagram to show you how I envision that (here, ketchup and moneyed are main words, all the others child words):

All this information will be placed in a new table which we shall call the Alternative Spellings table (The columns word_id and alt_spell_word_id are going to be part of the table's compound primary key):

+---------+-------------------+
| word_id | alt_spell_word_id |
+---------+-------------------+
|       1 |                 2 |
|       1 |                 3 |
|       4 |                 5 |
+---------+-------------------+

Here's how all this looks in Access's Relationships panel (notice that I have enforced referential integrity between the word_id column of the Words table and the word_id column of the Alternative Spellings table and checked off the Cascade Delete Related Records option):

Although straight-forwardly simple, that's the only design I've been able to come up with so far. And I think that will basically do it. This is as simple as it gets. The problem with this design, however, is threefold:

1: This is not a serious problem, but I'd still like to hear your thoughts anyway. Every time I'm making a lookup of a word to see it in the Word Details form, I have to go through the entire Alternative Spellings table to see if it has other spellings associated with it or if it is a child word. So, I'd have to search both the word_id and alt_spell_word_id columns. And this process will be talking place for each and every word in the database every time I want to check the details of it. One possible solution is in the Words table to create an additional Boolean column that will keep track of whether a word has alternative spellings. This will indicate if we should scan the Alternative Spellings table at all when opening it up in the Word Details form. Here's what this would look like:

+---------+-----------+------------------+
| word_id | word      | has_alt_spelling |
+---------+-----------+------------------+
|     101 | ketchup   | yes              |
|     102 | catsup    | no               |
|     103 | catchup   | no               |
|     104 | moneyed   | yes              |
|     105 | monied    | no               |
|     106 | delicious | no               |
+---------+-----------+------------------+

I think that's a good design, but, as I said, I'd very much like to hear what you've got to say about this: a problem/not a problem? Your solution?

2: The other problem, which is of more serious nature, has to do with primary keys. word_id and alt_spell_word_id should be part of a compound primary key, of course. We don't want duplicate rows in the table. We all understand that. Not a problem. But here's what happens when we try to enforce referential integrity between the Words table and Alternative Spellings table (see the screenshot above). Everything is fine except that now we can associate a word with the id of a nonexistent word and the database is not going to complain because, for example, the last record in word_id has 4 in it, which is true, we do have a record with the id of 4 in the Words table, but there is no way to impose any kind of constraint on the alt_spell_word_id column. We can put any kind of nonsense in there:

+---------+-------------------+
| word_id | alt_spell_word_id |
+---------+-------------------+
|       1 |                 2 |
|       1 |                 3 |
|       4 |                 5 |
|       4 |             34564 |
+---------+-------------------+

I think that breaks the referential integrity of the database schema and thus is a serious problem. What kind of solution would you like to offer?

3: Another problem with this design is that if we want to delete a certain word from the Words table, the deletion will cascade through the Alternative Spellings table and delete all related records there, which is perfectly fine, but here's the catch: since we agreed that different words in the database can actually be just one word with different spellings, they all should be deleted along with the main word. But that's not going to happen as things stand at the moment. For instance, if I were to delete ketchup in the Words table, all related records in the Alternative Spellings table would be deleted. Fine. But we'd really get two dangling records, catchup and catsup—they can't exist on their own because they are part of the group where ketchup is the main word, but now it has been deleted:

+---------+-----------+
| word_id | word      |
+---------+-----------+
|       2 | catsup    |
|       3 | catchup   |
|       4 | moneyed   |
|       5 | monied    |
|       6 | delicious |
+---------+-----------+

+---------+-------------------+
| word_id | alt_spell_word_id |
+---------+-------------------+
|       4 |                 5 |
+---------+-------------------+

Here's the actual database (simplified version) if you want to play with it.

Thank you all in advance.

Best Answer

The way I would tackle this is by asking, how the word and description are related to each other. In this case multiple words have a single description (definition, pronunciation, part of a sentence, etc.).

I would set up a single description table with the neccesary information on it, which then uses the "Description_Num" as a primary key. Then for the Word table I would set up three columns, an ID as a Primary Key, the word its self, and the description_num to reference as a Foreign Key.

You can then use the description_num to combine these, and use your DBMS functions to find information like the number of different spellings of a word.

I think this would result in maintaining referential integrity in tables as long as you are using the description_num.

Related Solutions

Well-known name for this “poor man’s ref. integrity” schema design pattern

Your design looks a bit like the "supertype/subtype" pattern. Search for that and for "table inheritance". It needs quite a lot of work to be able to enforce integrity constraints though.

You are missing a generic Fruit table (that's the "supertype") and a FruitType table to store the alllowed fruit types:

FruitType 
    fruit_type PK

Fruit
    fruit_type PK, FK -> FruitType (fruit_type)
    fruit_id   PK

Then the 3 (or 4 or more) tables would be (the "subtype" tables):

Apple
    fruit_type 
    fruit_id PK
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)
    CHECK (fruit_type = 'Apple')

Banana 
    fruit_type PK
    fruit_id PK
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)
    CHECK (fruit_type = 'Banana')

Orange
    fruit_type PK
    fruit_id PK
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)
    CHECK (fruit_type = 'Orange')

And any other table can reference the Fruit table:

FruitPack 
    fruitpack_id PK 
    destination

FruitPackFruits 
    fruitpack_id FK -> FruitPack (fruitpack_id)
    fruit_id     
    fruit_type
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)

It doesn't look very nice and one column in every "fruit" table seems redundant as it has one and only one allowed value. And every time you need to add a new fruit (say Cherry), you have to add a row in the table FruitType and a new table (Cherry), similar to the other ones. So, it works better if your design is more or less stable. If you find that you may need to add a new "fruit" every few days or if you have a thousand (or more!) different fruits, it's not the best way.

On the other hand, it enforces integrity and you can't insert cherries into the Bananas or oranges into the Apples.

Optional Parent-Child relationship

Just make the Tour optional? The below code is for SQL Server, but you should be able to port this to other flavors of SQL. Essentially, the TourID column can be nullable, indicating the Gig is not part of a Tour.

CREATE TABLE dbo.Bands
(
    BandID INT NOT NULL
        PRIMARY KEY CLUSTERED
    , BandName VARCHAR(100) NOT NULL
);

CREATE TABLE dbo.Tours
(
    TourID INT NOT NULL
        PRIMARY KEY CLUSTERED
    , TourName VARCHAR(100) NOT NULL
)

CREATE TABLE dbo.Gigs
(
    GigID INT NOT NULL
    , GigName VARCHAR(100) NOT NULL
    , GiGDate DATETIME NOT NULL
    , BandID INT NOT NULL
        FOREIGN KEY
        REFERENCES dbo.Bands(BandID)
    , TourID INT NULL
        FOREIGN KEY 
        REFERENCES dbo.Tours(TourID)
);

This design will also allow multiple bands to take part in any given tour, thereby providing a mechanism for headliners and opening acts, or multi-stage events, etc.

Filling out the example from above with some data:

INSERT INTO dbo.Bands (BandID, BandName)
VALUES (1, 'Past the Perimeter');

INSERT INTO dbo.Tours (TourID, TourName)
VALUES (1, 'Taking it to the City');

INSERT INTO dbo.Gigs (GigID, GigName, GiGDate, BandID, TourID)
VALUES (1, 'Tour Date 1', '2016-05-27T21:30:00', 1, 1);

INSERT INTO dbo.Gigs (GigID, GigName, GiGDate, BandID, TourID)
VALUES (2, 'Standalone Gig', '2016-05-28T21:30:00', 1, NULL);

SELECT b.BandName
    , g.GigName
    , g.GiGDate
    , t.TourName
FROM dbo.Bands b
    INNER JOIN dbo.Gigs g ON b.BandID = g.BandID
    LEFT JOIN dbo.Tours t ON g.TourID = t.TourID
ORDER BY g.GiGDate;

Best Answer

Related Solutions

Well-known name for this “poor man’s ref. integrity” schema design pattern

Optional Parent-Child relationship

Related Question