MySQL – Optimize very slow SQL joins on multiple tables or use different engine

join; MySQL; nosql

We have something similar to Google Analytics, but the tools that were already available didn't fit our needs exactly, so we decided not to use them and instead created our own "mini-analytics".

Now, this was all easy and fun, but as the data grows it has become clear that either the architecture wasn't designed properly, or the wrong tools were used to solve the problem.

The problem lies with queries that look like the following: "Get all user sessions that have the following events: login, Chrome browser version 58, and profile_view".

Currently this hits the following tables:

CREATE TABLE `logins` (
    `session_id` bigint(20) NOT NULL,
    `request_id` INT(11) NOT NULL,
    `timestamp` INT(11) NOT NULL,
    `login_data` mediumblob,
     KEY `session_req_idx` (`session_id`, `request_id`),
     KEY `timestamp_idx` (`timestamp`)
);

CREATE TABLE `browsers` (
    `session_id` bigint(20) NOT NULL,
    `request_id` INT(11) NOT NULL,
    `timestamp` INT(11) NOT NULL,
    `browser_data` mediumblob,
     KEY `session_req_idx` (`session_id`, `request_id`),
     KEY `timestamp_idx` (`timestamp`)
);

CREATE TABLE `profile_views` (
    `session_id` bigint(20) NOT NULL,
    `request_id` INT(11) NOT NULL,
    `timestamp` INT(11) NOT NULL,
    `profile_data` mediumblob,
     KEY `session_req_idx` (`session_id`, `request_id`),
     KEY `timestamp_idx` (`timestamp`)
);

Some notes:

  • All the mediumblob columns contain JSON objects, but we haven't upgraded to MySQL 5.7.8 (which introduced the native JSON type) yet; see the filtering sketch after these notes.
  • All tables have the same columns and indexes.
  • Each table contains between several million and several billion rows.
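
Since the native JSON functions only arrived in 5.7.8, matching something like "Chrome version 58" currently means string-matching against the blob. A sketch of how such a filter might look (the key names here are made up; the real JSON layout differs):

SELECT
    `session_id`
FROM
    `browsers`
WHERE
    `timestamp` >= x
AND
    `browser_data` LIKE '%"browser": "Chrome"%'  -- hypothetical key name
AND
    `browser_data` LIKE '%"version": "58"%';     -- hypothetical key name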

One of the problems I seem to have is that I can't use LIMIT inside the inner queries (when using subqueries), and joins also don't appear to work as expected.
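
(For the LIMIT issue: MySQL rejects a LIMIT placed directly inside an IN (...) subquery; the derived-table wrapper below is the usual workaround I've seen suggested. This is a sketch, not something we've tested against our data.)

SELECT
    `session_id`
FROM
    `logins`
WHERE
    `session_id` IN (
    SELECT
        `session_id`
    FROM (
        SELECT
            `session_id`
        FROM
            `profile_views`
        WHERE
            `timestamp` >= x
        ORDER BY
            `session_id` DESC
        LIMIT
            100
    ) `limited`
);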

What I mostly wonder is: can this be solved efficiently with a SQL solution, or does this lean more towards a NoSQL solution (for example, a graph database)?


EDIT:

Queries are built up using a loop that concatenates subqueries in the following manner:

For a single table (e.g. "sessions that have a profile view after timestamp x"):

SELECT DISTINCT
    `grouped`.`session_id`
FROM (
    SELECT
        `session_id`
    FROM
        `profile_views`
    WHERE
        `timestamp` > x
) `grouped`
ORDER BY
    `session_id` DESC
LIMIT
    100;
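
(As an aside, the derived table seems redundant in the single-table case; as far as I can tell the following should return the same result:)

SELECT DISTINCT
    `session_id`
FROM
    `profile_views`
WHERE
    `timestamp` > x
ORDER BY
    `session_id` DESC
LIMIT
    100;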

Two tables (e.g. "Sessions that have a profile view and login"):

SELECT DISTINCT
    `grouped`.`session_id`
FROM (
    SELECT
        `session_id`
    FROM
        `logins`
    WHERE
        `timestamp` >= x
    AND
        `session_id` IN (
        SELECT
            `session_id`
        FROM
            `profile_views`
        WHERE
            `timestamp` >= x
    )
) `grouped`
ORDER BY
    `session_id` DESC
LIMIT
    100;
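
An equivalent form uses EXISTS instead of IN; the optimizer may plan it differently, so it's worth comparing both with EXPLAIN (again a sketch, not something we've benchmarked):

SELECT DISTINCT
    `A`.`session_id`
FROM
    `logins` `A`
WHERE
    `A`.`timestamp` >= x
AND EXISTS (
    SELECT
        1
    FROM
        `profile_views` `B`
    WHERE
        `B`.`session_id` = `A`.`session_id`
    AND
        `B`.`timestamp` >= x
)
ORDER BY
    `A`.`session_id` DESC
LIMIT
    100;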

I'm looking into joins, but at the moment they appear to return different results. For example, something like the following:

SELECT DISTINCT
    `A`.`session_id`
FROM
    `logins` `A`
INNER JOIN
    `profile_views` `B`
ON
    `B`.`session_id` = `A`.`session_id`
WHERE
    `A`.`timestamp` > x
ORDER BY
    `session_id` DESC
LIMIT 100;

Best Answer

Your join query looks OK, but it will help speed up the queries if you extend the timestamp_idx key to include session_id. This way the database engine won't have to do a second lookup/sort, and on tables with millions of rows that makes a significant difference in response times. (As for the different results: your join version omits the timestamp filter on profile_views and uses > rather than >=, which is likely why it doesn't match the IN version.)

ALTER TABLE <table name> ADD KEY `ix_covering` (`timestamp`, `session_id`, `request_id`);

Drop both old indexes on all tables only AFTER you have tested that the new index above actually improves performance.
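
You can confirm the new index is actually being used by prefixing one of the slow queries with EXPLAIN; the key column of the output should show ix_covering for the logins table (x is a placeholder timestamp, as in the question):

EXPLAIN
SELECT `A`.`session_id`
FROM `logins` `A`
INNER JOIN `profile_views` `B` ON `B`.`session_id` = `A`.`session_id`
WHERE `A`.`timestamp` > x
GROUP BY `A`.`session_id`
ORDER BY `session_id` DESC
LIMIT 100;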

Also, instead of

SELECT DISTINCT
    `A`.`session_id`

try

SELECT `A`.`session_id`
FROM `logins` `A`
INNER JOIN `profile_views` `B` ON `B`.`session_id` = `A`.`session_id`
WHERE `A`.`timestamp` > x
GROUP BY `A`.`session_id`
ORDER BY `session_id` DESC
LIMIT 100;
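
The same pattern extends to the full three-event example from the question (login + Chrome 58 + profile view). A sketch, assuming the browser filter is done with LIKE against the blob since you are pre-5.7.8 (the JSON key names are guesses):

SELECT `A`.`session_id`
FROM `logins` `A`
INNER JOIN `browsers` `B` ON `B`.`session_id` = `A`.`session_id`
INNER JOIN `profile_views` `C` ON `C`.`session_id` = `A`.`session_id`
WHERE `A`.`timestamp` > x
AND `B`.`browser_data` LIKE '%"browser": "Chrome"%'  -- hypothetical key name
AND `B`.`browser_data` LIKE '%"version": "58"%'      -- hypothetical key name
GROUP BY `A`.`session_id`
ORDER BY `session_id` DESC
LIMIT 100;

Note that leading-wildcard LIKE filters can't use a B-tree index, so the blob comparison still has to run against every joined browsers row; if this kind of filter is common, it may be worth extracting the browser name and version into their own indexed columns.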