Mysql – To normalize or not

database-designMySQL

I'm scraping data from a website and they have pages for Users and Groups. The pages are pretty similar; id, name, description, created date, number of elements and custom background.
They also have a Friendlist for Users and a Memberlist for Groups, which besides names, are pretty much the same. I have already normalized that bit.

But the User page also contains other snippets.

I started sketching out the tables Users, Groups and Page. But now I'm starting to think if I'm over-normalizing this. Because truly, it's not the same data that is being stored.

Should I go for 2 tables, each containing fields like page_id, page_name or should I add pivot table that contains all page data (with a side-effect of having some fields NULLable for the Groups pages)

Best Answer

Answering strictly in terms of MySQL, you do not want to overnormalize.

Here is why ;

REASON #1 : JOIN BUFFER

There is an in-memory buffer called the join buffer. Its size is regulated in RAM by the join_buffer_size option per DB Connection. The more normalized the data are, the more JOIN clauses. In turn, the more JOIN clauses, the more join buffers have to be allocated. If a join buffer is too small, it gets migrated to disk. This slows down the query's execution by

pausing the query to write the join buffer to disk
generating more disk I/O while executing the join

REASON #2 : QUERY EVALUATION

Because MySQL executes SQL through external storage engines, query evaluation for WHERE clauses and JOIN clauses are virtually identical. I wrote about this back on March 11, 2013 (Is there an execution difference between a JOIN condition and a WHERE condition?). The algorithm for JOINs are also in the MySQL Documentation. The JOIN algorithm can get rather wonky with multiple JOIN clauses.

REASON #3 : STORAGE ENGINE PERFORMANCE

Any storage engine in MySQL is basically comprised of 14 operations. Those operations, including JOINs, are layered in such a way that performance issues can occur in two layers

Query Evaluation (See Reason #2)
Storage Engine
- If you JOIN using large columns
- If you use the wrong Storage Engine
- If you mix Storage Engines in the JOIN

EPILOGUE

It is strictly up to you how deep you want to normalize your data. If you must normalize and do multiple JOINs, please join only integer values only. Joining on character values will have negative effects on the three reasons I just mentioned.

Related Solutions

One to at most one relation or multiple null columns

There is nothing wrong with nullable columns, even a lot of them, if that is what your data domain calls for.

On the other hand, the fact that you say these columns can be grouped logically implies to me that something else might be going on.

If they can be grouped logically because different sets of columns apply to different sets of rows, then you might have an entity-subtype situation.

Conversely, if columns can be grouped because they apply at different times, then you may have a normalization issue. For example, if your columns are something like "January Sales", "February Sales", etc. these should be rows in a child table.

While there is nothing innately wrong with nullable columns, neither is there anything wrong with joining. It's what RDBMS does for a living.

UPDATE:

Given additional information about the logical groups of columns:

There are two kinds of sub-typing that can be represented in a database using 1:1 relationships. If the logical groups were mutually exclusive, then the parent entity could have what is known as a partitioning attribute that tells you which one of the subtypes is applicable. However, without a partitioning attribute, it is possible to have zero, one or even multiple subtypes being applicable at the same time.

The same fundamental question applies then to what you do with this situation.

A good way to resolve it would be to look at the logical groups of columns. Are the columns in logical group A the same as in logical group B - or are the totally different? If they are different they might be best modeled in the single table with nullable fields. If they are the same, then this might be a clue that they should be multiple child rows instead.

Another clue to look at is whether it makes sense that a logical group of columns could take on a life of its own and start attracting relationships from other tables. If logical group B might sometime soon find itself with multiple child records from another table, then it might be a sign that it makes sense to promote that group to its own subtype style table.

One last thing to consider is physical implementation. If a logical subgroup is very sparsely populated, you might be able to make a case for segregating these columns into another table to optimize physical storage. This step shouldn't be done proactively. This kind of optimization should be done when performance testing proves it is necessary.

If none of these things are true, then you are probably best off leaving the nullable columns in the original table.

Mysql – Optimizing intersect query in MySQL

Please try: (edited 2014-12-18: replaced Left with INNER join):

SELECT 
    `pages_users`.`page_id` , 
    COUNT(`pu2`.`page_id`) AS `count`
FROM 
    `pages_users`
    INNER JOIN `pages_users` AS `pu2` 
        ON `pu2`.`user_id` = `pages_users`.`user_id`
           AND `pu2`.`page_id` = *PAGE_ID* 
WHERE 
    `pages_users`.`page_id` != *PAGE_ID*
GROUP BY 
    `pages_users`.`page_id`
ORDER BY 
    `count` DESC
LIMIT 10

or as an alternative

SELECT 
    `pages_users`.`page_id` , 
    COUNT(`pu2`.`pu2_page_id`) AS `count`
FROM 
    `pages_users`
    INNER JOIN 
    (
        SELECT  
            `pu2_intern`.`page_id` AS `pu2_page_id`
        FROM
            `pages_users` AS `pu2_intern`
        WHERE
            `pu2_intern`.`page_id` = *PAGE_ID*
    )  AS `pu2`
        ON `pu2`.`user_id` = `pages_users`.`user_id`
WHERE 
    `pages_users`.`page_id` != *PAGE_ID*
GROUP BY 
    `pages_users`.`page_id`
ORDER BY 
    `count` DESC
LIMIT 10

You could also add another index just on the field page_id:

KEY `page` (`page_id`)

I hope it could help you.