How to design a database table in a context where the columns can potentially change with each new row

database-design

Let's consider the following task: for an NLP task I am collecting news titles in a database. I need to use this data for word frequency analysis, so I tokenize each title and store the words in the following way:

titles table: 

titleID    word1    word2    word3    ....   
1          0        0        1   
2          1        1        0   
3          0        0        0   
...

Every column is a different word, and I mark it 0 or 1 according to whether or not it is present in the title.

The issue is, of course, that each time I add a new title, it might contain a word that is new, and I would then need to add a new column for that word.

In the SQL world that means a potential migration on every insert of a new row, which is not tenable.
I could use NoSQL, but I would still like that table to have relations with other tables in my database.

What would be the alternatives for my case? What is the common way to store NLP data in a database?

Best Answer

What you probably want to do is normalize your data. Please read up on it in this post: What is Database Normalization in SQL?

To summarize, you should restructure your table design by removing all the word columns and storing the words themselves in their own table, linked to your Titles table. A new word then becomes a new row rather than a new column, so no schema change is needed on insert.

Your Words table would have a unique identifying column such as WordId, and your Titles table would likewise have a TitleId column.

You can then add a third table to represent the many-to-many relationship between the two (named something like TitleWords). It would have TitleId and WordId as columns, with each row recording that a given word appears in a given title, so you can link the Titles and Words tables.
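The three-table layout described above can be sketched as follows. This is only a minimal illustration using Python's built-in sqlite3 module with an in-memory database; the Title and Word column names, the add_title helper, the naive whitespace tokenizer, and the sample titles are all assumptions made for the example, not part of the original design.

```python
import sqlite3

# In-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE Titles (
    TitleId INTEGER PRIMARY KEY,
    Title   TEXT NOT NULL              -- assumed column holding the raw title
);
CREATE TABLE Words (
    WordId INTEGER PRIMARY KEY,
    Word   TEXT NOT NULL UNIQUE        -- each distinct word is stored once
);
CREATE TABLE TitleWords (
    TitleId INTEGER NOT NULL REFERENCES Titles(TitleId),
    WordId  INTEGER NOT NULL REFERENCES Words(WordId),
    PRIMARY KEY (TitleId, WordId)      -- one row per (title, word) presence
);
""")


def add_title(conn, title):
    """Hypothetical helper: store a title and link it to its words."""
    title_id = conn.execute(
        "INSERT INTO Titles (Title) VALUES (?)", (title,)
    ).lastrowid
    # Naive tokenizer (lowercase + whitespace split); swap in a real one.
    for word in set(title.lower().split()):
        # A previously unseen word becomes a new row, not a new column.
        conn.execute("INSERT OR IGNORE INTO Words (Word) VALUES (?)", (word,))
        word_id = conn.execute(
            "SELECT WordId FROM Words WHERE Word = ?", (word,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO TitleWords (TitleId, WordId) VALUES (?, ?)",
            (title_id, word_id),
        )
    return title_id


add_title(conn, "Markets rally as rates fall")
add_title(conn, "Rates rise as markets wobble")

# Word frequency: in how many titles does each word appear?
word_counts = conn.execute("""
    SELECT w.Word, COUNT(*) AS TitleCount
    FROM TitleWords AS tw
    JOIN Words AS w ON w.WordId = tw.WordId
    GROUP BY w.Word
    ORDER BY TitleCount DESC, w.Word
""").fetchall()
print(word_counts)
```

A row in TitleWords plays the role of a 1 in the original wide table, and the absence of a row plays the role of a 0. If you need how many times a word occurs within a title rather than just whether it is present, one option is to add a count column to TitleWords.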

Designing your database this way not only makes the relational structure easier to maintain, it also reduces redundancy and saves space: each word is stored once, and only the words that actually occur in a title get a row, instead of a 0 in every column for every absent word.
