PostgreSQL – How to Model Database Structure for Storing Words Information

database-designpostgresql

I have an application where the user can click on any word in a text, which will bring back related information about the base word and the specific inflection (word form) of the word clicked. I've tried representing this in a multitude of ways, but now that I'm close to launch, I'm worried as hell about the best way to represent it (this is my first ever project with programming, so I have no idea what I'm in for).

Considerations

The way I had originally set this up was that there was a base word table that contained the base word, part of speech, and then an array of word forms for each base word.

I then learned that this was a bad idea in terms of normalization, and I also realized that I needed to store specific inflection information for the words (mood, tense, person, etc.), so I rearranged it.

In the second iteration, I had three tables, a base_word table that had the base_word and the part of speech, a word table that had the word and its inflection information, and then a “join” table between them that contained the base_word_id and the word_id.

However, I've been re-thinking this setup, because semantically speaking, a base word is also a word, just a specific type of word. So in this case, I feel like a “self-join” table might be most appropriate, although I could be completely overthinking this. In the “self-join” table, there would be an is_baseword boolean attribute, which would theoretically allow all the inflection columns to store NULL marks if it was set to true. Is this a better idea?


Note: I found these two similar questions which didn't seem to completely answer my questions, so I'm posting my twist on these here:

Best Answer

I believe your design for a one to many join with base_word in one table and word forms in another is fine. The reasons I make this judgement are:

  1. The base_word, as an entity, is sufficiently different to word_forms as to represent a different entity. For instance, if you were to include the words origin language, this would also be only applicable to the base word entity, not its word forms.
  2. Storing the data in the same table will have implications on the size of the tables due to having sparse columns with attributes that are associated with base_word OR word_form only, meaning they are NULL or empty for the others. You point this out with the is_baseword and inflection columns. Having a bunch of NULL values on a row which will never be changed is usually a reason to use the EAV model or split into different entities.
  3. Performance considerations for queries should be taken into account. With the additional size consideration a self join to a sparsely populated table will likely not be as performant. You will probably find yourself indexing a single table for base_words OR word_forms separately. The indexes will be larger and more difficult to maintain for inserts/deletes etc. Better to have indexes on smaller tables and to have as few as possible on a table.

Don't get too concerned with perfect normalization. While a perfectly normalized system may adhere to strict information science principles, it is often better in the real world to have a slightly denormalized database for other reasons.