How to design database for monitoring the time trends of article tags

database-designschematime

My purpose is to monitor which tags increase/decrease rapidly last week on different bulletin boards. Or which tag do new appear. The current table schema of articles is as follows:

article_id e.g. 123
tags e.g. apple,orange
article_created_at e.g. 2021/5/7 01:23
borad_name e.g. database
author e.g. some_person

My current idea is to calculate the amount of each tag by board weekly. But I want to survey other solutions to reduce the development time. Is it ok to choose a time-series database engine or other possible solution? Thanks for your time.

Best Answer

Having a comma-separated list of tags would save on the number of rows written to a database, but would make querying far more computationally expensive. Typically this sort of table would have one record per tag, as this would allow for simpler queries to be written. A structure similar to something you'd find in a blogging database would work. For example:

TagHistory
  article_id
  tag
  flat
  posted_at
  board_name
  author

Then you would have records like this

article_id tag flat posted_at board_name author
123 pokémon pokemon 2021-05-01 00:00:00 games BillyBob
129 PokEmon pokemon 2021-05-01 00:07:41 games Jackie
129 cards cards 2021-05-01 00:07:41 games Jackie
129 Deck deck 2021-05-01 00:07:41 games Jackie
147 MewTwo mewtwo 2021-05-01 00:11:12 games Archibald

The flat field would contain a lower-case tag value with all accents and other non-alphanumeric components removed so that values like pokémon, PokEmon, and Pokemon do not appear as three separate values.

With this sort of design, you could then begin writing queries that could rank tag usage over a period of time. If you were interested in the five most popular tags from last week, you could write a query like this:

SELECT WEEK(posted_at) as week_no, flat, COUNT(article_id) as usages
  FROM TagHistory
 WHERE `posted_at` BETWEEN DATE_FORMAT(DATE_SUB(NOW(), INTERVAL DAYOFWEEK(NOW()) + 6 DAY), '%Y-%m-%d 00:00:00')
                   AND DATE_FORMAT(DATE_SUB(NOW(), INTERVAL DAYOFWEEK(NOW()) -1 DAY), '%Y-%m-%d 23:59:59')
 GROUP BY week_no, flat
 ORDER BY usages DESC
 LIMIT 5;

Alternatively you could do it by board, or by author, or across a period of weeks or months. The sky is the limit and the queries are simpler to write. If there is a regular means of presentation, this data could be collated every week and written to a summary table that would save you the hassle of looking up and calculating tag usage over time.

For example:

TagHistorySummary
  year
  week_no
  flat
  usages

Then if you wanted to present a chart showing the popularity of the pokemon tag for all of 2020 (similar to Google's Ngram Viewer), you could write a query like this:

SELECT week_no, usages
  FROM TagHistorySummary
 WHERE flat = 'pokemon' and year = 2020
 ORDER BY week_no;

Note: Using year as a field name would be a poor choice given it's a reserved word by most database engines. Do not take these examples as verbatim.

This data could then be used to populate a simple line chart:

Ngram for Pokemon

Hmm ... who knew that Pokemon was a thing before WWI? I sure as heck didn't.

This isn't the only way to accomplish your goal, of course. Experience has shown that it is one of the simpler ways, though, particularly when working with data that is supplied by people.