I have an application where the user can click on any word in a text, which will bring back related information about the base word and the specific inflection (word form) of the word clicked. I've tried representing this in a multitude of ways, but now that I'm close to launch, I'm worried as hell about the best way to represent it (this is my first ever project with programming, so I have no idea what I'm in for).
Considerations
The way I had originally set this up was that there was a base word
table that contained the base word
, part of speech
, and then an array of word forms
for each base word.
I then learned that this was a bad idea in terms of normalization, and I also realized that I needed to store specific inflection
information for the words (mood
, tense
, person
, etc.), so I rearranged it.
In the second iteration, I had three tables, a base_word
table that had the base_word
and the part of speech
, a word
table that had the word
and its inflection
information, and then a “join” table between them that contained the base_word_id
and the word_id
.
However, I've been re-thinking this setup, because semantically speaking, a base word is also a word, just a specific type of word. So in this case, I feel like a “self-join” table might be most appropriate, although I could be completely overthinking this. In the “self-join” table, there would be an is_baseword
boolean attribute, which would theoretically allow all the inflection
columns to store NULL marks if it was set to true. Is this a better idea?
Note: I found these two similar questions which didn't seem to completely answer my questions, so I'm posting my twist on these here:
Best Answer
I believe your design for a one to many join with
base_word
in one table and word forms in another is fine. The reasons I make this judgement are:base_word
, as an entity, is sufficiently different toword_forms
as to represent a different entity. For instance, if you were to include the words origin language, this would also be only applicable to the base word entity, not its word forms.base_word
ORword_form
only, meaning they are NULL or empty for the others. You point this out with theis_baseword
andinflection
columns. Having a bunch of NULL values on a row which will never be changed is usually a reason to use the EAV model or split into different entities.base_words
ORword_forms
separately. The indexes will be larger and more difficult to maintain for inserts/deletes etc. Better to have indexes on smaller tables and to have as few as possible on a table.Don't get too concerned with perfect normalization. While a perfectly normalized system may adhere to strict information science principles, it is often better in the real world to have a slightly denormalized database for other reasons.