Mysql – How to design a table for storing news items

database-designMySQL

I have a script that can get result from news websites and it will run every hour.

The php script will get the title and content. The things I want to store in MySQL are:

title
content
createTime (which is the time when I inserted the record)

As I will run the php script every hour I know there will be many cases where the same news will be found by the script.

How do I design a table that allows me to INSERT the new records, while previously existing news items are skipped?

Best Answer

A few things you will have to think about before starting:

  • How do you decide if a title is the same as a previous one? Say there is an article 'A god table for storing news items'. Your script discovers and stores it in your DB. Later the editors of that news site notice the typo and correct the title as 'A good table for storing news items'. What do you do in such a case? (The same applies to the content.)

  • If you want to do something with your data after storing, you will need a unique identifier in the table. Using this, you will be able to point to an item from other tables.

  • You may find useful to store the URL of the articles. However, from time to time these can change just like titles do.

  • Short news are typically found in many places on the internet. Both their title and contents are the same (like in the case of short news from news agencies) - are these the same or not? If you store the URL, then these are different. If not, these are the same. What to do with these?

  • If you decide that you will store these from each site you process, than you will find useful to identify your source sites. (Be careful, the URL will look like 'http://dba.stackexchange.com/questions/23568/how-to-design-a-good-table-for-real-time-news-extraction' and the site will be 'dba.stackexchange.com'. Here the latter forms a part of the former, but this is not necessarily true for other sites.) In this case, you will have to create another table where you list all the source sites. This table will be referenced from your other table.

So far we have the following database structure:

Table "news_item"

+--------------+------------+--------------+---------------------+-----+-----------+
| news_item_id |   title    |   content    |     time_created    | URL | source_id |
+--------------+------------+--------------+---------------------+-----+-----------+
| 1            | First item | some content | 2012-09-03 22:53:14 | ... | 1         |
.                                                                                  .
.                                                                                  .
.                                                                                  .
+--------------+------------+--------------+---------------------+-----+-----------+

Table "source"

+-----------+-----------------------+
| source_id | name                  |
+-----------+-----------------------+
| 1         | dba.stackexchange.com |
.                                   .
.                                   .
.                                   .
+-----------+-----------------------+

You can write your SQL script only after answering at least the above questions.