Optimisation Newbie: How much of a ‘sin’ is redundancy?

database-design normalization optimization

I've been developing for a while, but never really had to deal with DB/scaling issues before. That's suddenly changed and I've found myself in the deep end.

I have 2 SQL tables, as such:

VOTES
vote_id (PK)
question_id (FK)
user_id (FK)
option_id (FK) <The option the user voted for>

The 'questions' table looks like this:

QUESTIONS
question_id (PK)
option_1 (FK --> Options)
option_2 (FK --> Options)

The problem I have is that I very often have to retrieve the total number of option_1 votes (or option_2 votes) for a question. This is currently done by selecting count where question_id = [@question.id] and option_id = [@question.option_1.id].

I'm guessing it'd be a heckuva lot quicker to just add option_1_votes and option_2_votes columns to the "questions" table and increment them each time a vote is added. But ultimately, that's redundant data.

So, as someone who's pretty clueless re: database design principles, what's the rule of thumb here? Would a top DBA just add in the columns, or try some other solution?

Cheers…

Best Answer

Generally, I don't add redundant columns unless I really need to.

Running a COUNT over a set of data is quite efficient in any RDBMS.

Consider that a read over indexed (hopefully) and cached data to get the count will beat the second write needed to maintain the denormalised column. That write requires more resources, more locking and a longer transaction, all of which impact reads more.
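As a rough sketch of the "just COUNT it" approach: the snippet below uses Python's built-in sqlite3 module as a stand-in for whatever RDBMS you are on. The votes columns come from the question; the composite index name, the sample data and the count_votes helper are my own assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE votes (
        vote_id     INTEGER PRIMARY KEY,
        question_id INTEGER NOT NULL,
        user_id     INTEGER NOT NULL,
        option_id   INTEGER NOT NULL
    );
    -- Assumed index: covers both columns used in the WHERE clause,
    -- so the count can be answered from the index alone.
    CREATE INDEX idx_votes_question_option ON votes (question_id, option_id);
""")

# Made-up sample data: question 1 has options 10 and 11, question 2 has 20 and 21.
conn.executemany(
    "INSERT INTO votes (question_id, user_id, option_id) VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 2, 10), (1, 3, 11), (2, 1, 20)],
)


def count_votes(conn, question_id, option_id):
    """Return the number of votes cast for one option of one question."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM votes WHERE question_id = ? AND option_id = ?",
        (question_id, option_id),
    ).fetchone()
    return n


option_1_votes = count_votes(conn, 1, 10)
option_2_votes = count_votes(conn, 1, 11)
print(option_1_votes, option_2_votes)  # prints: 2 1
```

No second write, nothing to keep in sync. Whether this stays fast enough at your volume is something only measuring on your own data will tell you.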

If performance becomes an issue over time, then you can pre-calculate the COUNT more efficiently using an indexed (aka materialised) view
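SQLite has no indexed or materialised views, so the sketch below swaps in a different technique with the same goal: a summary table kept up to date by triggers. It is not an indexed view, and on SQL Server or Oracle you would use the native feature instead. The vote_counts table and the trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE votes (
        vote_id     INTEGER PRIMARY KEY,
        question_id INTEGER NOT NULL,
        user_id     INTEGER NOT NULL,
        option_id   INTEGER NOT NULL
    );

    -- Hypothetical summary table: one row per (question, option).
    CREATE TABLE vote_counts (
        question_id INTEGER NOT NULL,
        option_id   INTEGER NOT NULL,
        votes       INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (question_id, option_id)
    );

    -- Keep the summary in step with every inserted vote.
    CREATE TRIGGER votes_after_insert AFTER INSERT ON votes
    BEGIN
        INSERT OR IGNORE INTO vote_counts (question_id, option_id, votes)
        VALUES (NEW.question_id, NEW.option_id, 0);
        UPDATE vote_counts
           SET votes = votes + 1
         WHERE question_id = NEW.question_id AND option_id = NEW.option_id;
    END;

    -- And with every deleted vote.
    CREATE TRIGGER votes_after_delete AFTER DELETE ON votes
    BEGIN
        UPDATE vote_counts
           SET votes = votes - 1
         WHERE question_id = OLD.question_id AND option_id = OLD.option_id;
    END;
""")

conn.executemany(
    "INSERT INTO votes (question_id, user_id, option_id) VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 2, 10), (1, 3, 11)],
)
conn.execute("DELETE FROM votes WHERE question_id = 1 AND user_id = 2")

# The pre-calculated counts should agree with a fresh COUNT over the base table.
precomputed = conn.execute(
    "SELECT option_id, votes FROM vote_counts "
    "WHERE question_id = 1 ORDER BY option_id"
).fetchall()
recounted = conn.execute(
    "SELECT option_id, COUNT(*) FROM votes "
    "WHERE question_id = 1 GROUP BY option_id ORDER BY option_id"
).fetchall()
print(precomputed, recounted)
```

Note that this is exactly the extra write discussed above, just moved into the database so the application cannot forget it. Reach for it only once the plain COUNT has been shown to be too slow.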

Related Solutions

Mysql – Understanding optimisation of complex SELECT queries

Why there are massive differences in performance between the queries?
(assuming that you have at least defined indexes for the columns used in the subquery criteria and in the joining conditions)

Basically because the MySQL optimizer is not smart enough to figure out that all these queries are equivalent, so it probably produces different execution plans for the different queries. If you do not have properly defined FOREIGN KEY constraints, the optimizer may actually be right: there is no guarantee that the queries return identical results.

So, what to do to increase performance and have a standard way of writing this type of query variants?

Several things that affect MySQL query performance:

Do not use id IN (SELECT subquery) if you can avoid it. It's not very well optimized in most MySQL versions (see the notes on MariaDB and MySQL 5.6 below). Use joins if you can.
Replace UNION with UNION ALL if that doesn't change the result set (you could do that in EG3 query).

Don't use implicit joins (with commas in the FROM clause and the joining condition in the WHERE clause). Use explicit JOIN syntax. As an example, your EG1 is actually the same as the following (this is not for performance but for consistency):

SELECT people.*
  FROM people
       LEFT JOIN (criterion1 SELECT) c1 ON people.id=c1.pid
       LEFT JOIN (criterion2 SELECT) c2 ON people.id=c2.pid
       JOIN (criterion3 SELECT) c3 ON people.id=c3.pid
 WHERE ( c1.pid IS NOT NULL OR c2.pid IS NOT NULL );

The above query has an OR condition that spans 2 tables (plus their joins to people, so 3 tables actually). This is usually not very good performance-wise.

You can try rewriting using EXISTS. This will make your queries easier to write and it may help with performance, too:

SELECT people.*
  FROM people
 WHERE EXISTS 
         (criterion3 SELECT modified with `people.id = some_table.pid`)
   AND ( EXISTS 
           (criterion1 SELECT modified with `people.id = some_table.pid`)
      OR EXISTS 
           (criterion2 SELECT modified with `people.id = some_table.pid`) 
       ) ;
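To make the rewrite concrete, here is a small runnable sketch using Python's sqlite3 module (SQLite, not MySQL, so it only demonstrates that the forms return the same people and says nothing about MySQL's plans). The people table is from the question; orders, reviews and subscriptions are hypothetical stand-ins for the criterion1, criterion2 and criterion3 SELECTs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people        (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders        (pid INTEGER NOT NULL);  -- stand-in for criterion1
    CREATE TABLE reviews       (pid INTEGER NOT NULL);  -- stand-in for criterion2
    CREATE TABLE subscriptions (pid INTEGER NOT NULL);  -- stand-in for criterion3

    INSERT INTO people VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy'), (4, 'Dee');
    INSERT INTO subscriptions VALUES (1), (2), (3);
    INSERT INTO orders        VALUES (1), (1), (4);     -- person 1 has two orders
    INSERT INTO reviews       VALUES (2);
""")

# IN (SELECT ...) form: the one to avoid on older MySQL versions.
in_ids = [r[0] for r in conn.execute("""
    SELECT id FROM people
     WHERE id IN (SELECT pid FROM subscriptions)
       AND (id IN (SELECT pid FROM orders) OR id IN (SELECT pid FROM reviews))
     ORDER BY id
""")]

# EXISTS form: each subquery is correlated on people.id.
exists_ids = [r[0] for r in conn.execute("""
    SELECT people.id FROM people
     WHERE EXISTS (SELECT 1 FROM subscriptions s WHERE s.pid = people.id)
       AND (   EXISTS (SELECT 1 FROM orders  o WHERE o.pid = people.id)
            OR EXISTS (SELECT 1 FROM reviews r WHERE r.pid = people.id))
     ORDER BY people.id
""")]

# Explicit JOIN form: same people, but one row per matching combination.
join_ids = [r[0] for r in conn.execute("""
    SELECT people.id FROM people
           LEFT JOIN orders        c1 ON people.id = c1.pid
           LEFT JOIN reviews       c2 ON people.id = c2.pid
           JOIN      subscriptions c3 ON people.id = c3.pid
     WHERE (c1.pid IS NOT NULL OR c2.pid IS NOT NULL)
     ORDER BY people.id
""")]

print(in_ids, exists_ids, join_ids)  # prints: [1, 2] [1, 2] [1, 1, 2]
```

The duplicate in the JOIN result is worth noticing: if a criterion can match the same person more than once, the join form needs DISTINCT (or criterion subqueries that already return distinct pids), while EXISTS never multiplies rows.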

If your host allows it, try/test MariaDB (a MySQL fork and drop-in replacement), which has introduced several improvements in query execution in its latest versions. Its optimizer is a bit smarter at identifying equivalent queries, and it implements some new algorithms that affect queries with joins and subqueries, among other things.

MySQL 5.6 also has a few improvements in the optimizer, but it's not available yet as a stable release.

Planning simple CMS: How much relation

Once you start stepping into areas like hierarchies, you probably want to know your RDBMS to some extent. If your RDBMS supports recursive queries natively, your hierarchies will look very different than if each iteration is a round trip to the db.
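For a feel of what "natively recursive" buys you, here is a minimal sketch using Python's sqlite3 module (SQLite supports WITH RECURSIVE; check whether your own RDBMS and version do). The nodes table, its columns and the sample rows are all hypothetical and not taken from your schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical adjacency-list hierarchy: each row points at its parent.
    CREATE TABLE nodes (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES nodes (id),
        title     TEXT NOT NULL
    );
    INSERT INTO nodes VALUES
        (1, NULL, 'Home'),
        (2, 1,    'Blog'),
        (3, 2,    'First post'),
        (4, 2,    'Second post'),
        (5, 1,    'About');
""")

# One statement walks the whole subtree under node 2,
# instead of one round trip per level of the hierarchy.
subtree = conn.execute("""
    WITH RECURSIVE subtree (id, title, depth) AS (
        SELECT id, title, 0 FROM nodes WHERE id = ?
        UNION ALL
        SELECT n.id, n.title, s.depth + 1
          FROM nodes n
          JOIN subtree s ON n.parent_id = s.id
    )
    SELECT id, title, depth FROM subtree ORDER BY depth, id
""", (2,)).fetchall()

for row in subtree:
    print(row)
# (2, 'Blog', 0)
# (3, 'First post', 1)
# (4, 'Second post', 1)
```

Without recursive query support, the same walk means issuing one query per level from application code, which is why the schema you pick tends to depend on the database underneath it.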

This being said there are a few warnings that come to my mind looking at your schema. I believe these are likely to impact both performance and data integrity.

The separate parents table strikes me as somewhat dangerous. What are the rules for parents and children? Can you have blog -> menu_item -> blog -> comment -> menu_item? This creates the possibility of problems that may be difficult to track down. Also consider that if the rules are based on smaller tables, actually enforcing the rules becomes possible. If your menu items are broken off, a caching layer becomes a possibility. The same goes for other things that change less frequently, and the amount of effort the db spends generating the menu will be reduced.
I actually think your second model, if things like menu items etc. are broken off, is likely to be both more scalable and faster. I would still suggest breaking apart comments and blog posts, and tracking the relationships there more closely. If you are worried about scalability and your site has 10000 blog posts and 1000000 comments, being able to filter the comments by the attached blog post as well is likely to mean you get to use better indexes, at least on some databases.
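A minimal sketch of what "broken apart" could look like, again using Python's sqlite3 module as a stand-in for your database. The table names, column names and index name are assumptions of mine, not your actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
    CREATE TABLE posts (
        post_id INTEGER PRIMARY KEY,
        title   TEXT NOT NULL
    );
    -- A comment can only ever hang off a post: the rule lives in the schema.
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        post_id    INTEGER NOT NULL REFERENCES posts (post_id),
        body       TEXT NOT NULL
    );
    -- Assumed index: lets "comments for this post" skip the other posts' comments.
    CREATE INDEX idx_comments_post ON comments (post_id);

    INSERT INTO posts VALUES (1, 'Hello'), (2, 'World');
    INSERT INTO comments (post_id, body) VALUES
        (1, 'first'), (1, 'second'), (2, 'third');
""")

comments_for_post_1 = [
    body for (body,) in conn.execute(
        "SELECT body FROM comments WHERE post_id = ? ORDER BY comment_id", (1,)
    )
]
print(comments_for_post_1)  # prints: ['first', 'second']

# A comment attached to a post that does not exist is rejected by the database.
try:
    conn.execute("INSERT INTO comments (post_id, body) VALUES (99, 'orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)  # prints: True
```

With a single generic parents table, neither the targeted index nor the foreign key rule is this easy to express.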

I am reminded of the phrase "premature optimization is the root of all evil." Design a good db schema first and then worry about speed and scalability.
