MySQL – Querying data from a pivoted MySQL table takes a long time to execute

Tags: MySQL, pivot

TL;DR
Querying data from a two-table database I set up a long time ago takes a really long time to execute.


Back when I was a student, quite some time ago, I developed a simple survey tool for a college project. I later launched it as a free service for fellow students to use. Now, many years later, my survey tool is still used by students. I kind of neglected the tool for a long time, until I recently tried it again and noticed it has become painfully slow when viewing survey results. Oops. There are much better survey tools/services available now, but the collected survey data is still valuable to many people, I guess. I wrote it in PHP with a MySQL database. The database contains well over 6,000 different surveys.

I am familiar with MySQL, but I am by no means a database expert. I certainly wasn't back then and I suspect I made some poor choices structuring the database.

Depending on the number of participants in a survey, running a query to filter results from a single survey can take up to 12 seconds(!). I upgraded to a better server, but this didn't help as much as I hoped it would; it only brought the query time down from 18 to 12 seconds.

Because of the required flexibility (every survey has different questions) I set up two database tables. All the fixed data, such as the unique record ID, timestamp, user's IP address, etc., is inserted into the main table. The second table holds all the answers: one record for each answer to each question. To get all the answers, the second table gets 'pivoted' using CASE statements in the SQL query (shown below). I think this might not have been the most efficient setup, and maybe this is the reason why filtering results is so slow.

So, this is how I structured the database.

The main table with fixed data (form_submits)

+--------------------+--------------+------+-----+---------+----------------+
| Field              | Type         | Null | Key | Default | Extra          |
+--------------------+--------------+------+-----+---------+----------------+
| id                 | bigint(20)   | NO   | PRI | NULL    | auto_increment |
| formid             | mediumint(9) | NO   | MUL | NULL    |                |
| email              | varchar(255) | NO   | MUL | NULL    |                |
| status             | tinyint(1)   | NO   |     | NULL    |                |
| submit_date        | int(11)      | NO   |     | NULL    |                |
| submit_ip          | varchar(46)  | NO   |     | NULL    |                |
| submit_ua          | text         | NO   |     | NULL    |                |
| confirm_date       | int(11)      | NO   |     | NULL    |                |
| confirm_ip         | varchar(46)  | NO   |     | NULL    |                |
| notifications_send | tinyint(4)   | NO   |     | NULL    |                |
| notification_last  | int(11)      | NO   |     | NULL    |                |
+--------------------+--------------+------+-----+---------+----------------+

The data table with the questions and answers (form_submits_data)

+-----------+--------------+------+-----+---------+-------+
| Field     | Type         | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| submit_id | mediumint(9) | NO   | PRI | NULL    |       |
| form_id   | int(11)      | NO   | MUL | NULL    |       |
| field_id  | mediumint(9) | NO   | PRI | NULL    |       |
| question  | text         | NO   |     | NULL    |       |
| answer    | text         | NO   |     | NULL    |       |
+-----------+--------------+------+-----+---------+-------+
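
For completeness, the two tables correspond roughly to the following DDL (the storage engine and the exact index definitions behind the MUL keys shown above are approximate):

-- Approximate DDL for the two tables; engine and index names are illustrative.
CREATE TABLE form_submits (
  id                 BIGINT(20)    NOT NULL AUTO_INCREMENT,
  formid             MEDIUMINT(9)  NOT NULL,
  email              VARCHAR(255)  NOT NULL,
  status             TINYINT(1)    NOT NULL,
  submit_date        INT(11)       NOT NULL,
  submit_ip          VARCHAR(46)   NOT NULL,
  submit_ua          TEXT          NOT NULL,
  confirm_date       INT(11)       NOT NULL,
  confirm_ip         VARCHAR(46)   NOT NULL,
  notifications_send TINYINT(4)    NOT NULL,
  notification_last  INT(11)       NOT NULL,
  PRIMARY KEY (id),
  KEY idx_formid (formid),
  KEY idx_email (email)
) ENGINE=InnoDB;

CREATE TABLE form_submits_data (
  submit_id MEDIUMINT(9) NOT NULL,
  form_id   INT(11)      NOT NULL,
  field_id  MEDIUMINT(9) NOT NULL,
  question  TEXT         NOT NULL,
  answer    TEXT         NOT NULL,
  PRIMARY KEY (submit_id, field_id),
  KEY idx_form_id (form_id)
) ENGINE=InnoDB;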

If, for example, I want to get all the answers from participants who live in a certain postcode area, this query is used:

SELECT 
fs.id, 
fs.email,
coalesce(MAX(case when field_id = 10 then answer end), 0) as 'C10',
coalesce(MAX(case when field_id = 11 then answer end), 0) as 'C11',
coalesce(MAX(case when field_id = 12 then answer end), 0) as 'C12',
coalesce(MAX(case when field_id = 15 then answer end), 0) as 'C15',
coalesce(MAX(case when field_id = 16 then answer end), 0) as 'C16',
coalesce(MAX(case when field_id = 17 then answer end), 0) as 'C17',
coalesce(MAX(case when field_id = 21 then answer end), 0) as 'C21',
coalesce(MAX(case when field_id = 32 then answer end), 0) as 'C32',
coalesce(MAX(case when field_id = 36 then answer end), 0) as 'C36',
coalesce(MAX(case when field_id = 38 then answer end), 0) as 'C38',
coalesce(MAX(case when field_id = 340 then answer end), 0) as 'C340',
coalesce(MAX(case when field_id = 341 then answer end), 0) as 'C341',
coalesce(MAX(case when field_id = 342 then answer end), 0) as 'C342',
coalesce(MAX(case when field_id = 343 then answer end), 0) as 'C343',
coalesce(MAX(case when field_id = 344 then answer end), 0) as 'C344'
FROM `form_submits` fs
LEFT JOIN `form_submits_data` fsd ON fsd.submit_id = fs.id
WHERE fs.formid = 45023 
AND fs.status = 1
GROUP BY fs.id
HAVING C32 LIKE 'HA0 1AA%'
ORDER BY fs.submit_date DESC

On a survey with 21,000 responses this takes 12 seconds to execute. I feel this can't be right, and that it is probably because I structured the database poorly. This makes me wonder: what should I have done differently? What setup/structure would be more efficient while retaining the flexibility?

Best Answer

I wouldn't say you did anything wrong. Funnily enough, I just suggested that someone do the same thing you did for a similar problem:

How should I model a database structure to retain calculation results?

You might want to explore a semi-structured approach where, instead of your "data" table having so many rows, you add a column to the "main" table that stores all the answers in a JSON object that gets parsed at query time. This is relatively new in MySQL, but PostgreSQL and SQL Server are a little farther along.
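
In MySQL (5.7 and later, where the native JSON type was introduced) that could be as simple as adding a JSON column to the main table; the column name answers below is just an example, not part of your existing schema:

-- Example only: `answers` is an illustrative column name.
-- Requires MySQL 5.7+ for the native JSON data type.
ALTER TABLE form_submits
  ADD COLUMN answers JSON NULL;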

In other words, after a survey on your favorite fruit and favorite color, your main table would have an answer column that looked like this:

{"f1":"apples","f2":"blue"}

and you would parse it at query time into two columns called f1 and f2. There is some obvious overhead associated with parsing the JSON, but on a VM running on my laptop it was about 0.5 seconds for 100K rows.
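
For example, with MySQL's JSON functions the filter query could look roughly like this (a sketch, assuming the illustrative answers column from above; ->> is shorthand for JSON_UNQUOTE(JSON_EXTRACT(...))):

-- Sketch: parse the JSON document at query time instead of pivoting a
-- separate answers table. Assumes the illustrative `answers` JSON column above.
SELECT
    fs.id,
    fs.email,
    fs.answers ->> '$.f1' AS f1,  -- favorite fruit
    fs.answers ->> '$.f2' AS f2   -- favorite color
FROM form_submits fs
WHERE fs.formid = 45023
  AND fs.status = 1
  AND fs.answers ->> '$.f2' = 'blue'
ORDER BY fs.submit_date DESC;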