MySQL – need help with query optimization

mysql-5.5

I need help optimizing a query. I have a table with the following structure:

Create Table: CREATE TABLE `ip_country_mapping` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `cron` bigint(20) NOT NULL,
  `start_ip_number` bigint(20) NOT NULL,
  `end_ip_number` bigint(20) NOT NULL,
  `country` varchar(2) COLLATE utf8_bin NOT NULL,
  `state` varchar(30) COLLATE utf8_bin DEFAULT NULL,
  `city` varchar(30) COLLATE utf8_bin DEFAULT NULL,
  `zip` varchar(30) COLLATE utf8_bin DEFAULT NULL,
  `creation_date` datetime NOT NULL,
  `last_updation_date` datetime NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `UK_IP_COUNTRY_MAPPING_TEMP_START_IP_NUMBER` (`start_ip_number`),
  UNIQUE KEY `UK_IP_COUNTRY_MAPPING_TEMP_END_IP_NUMBER` (`end_ip_number`),
  KEY `FK_IP_COUNTRY_MAPPING_TEMP_CRON` (`cron`),
  KEY `ind_ipscan` (`end_ip_number`,`start_ip_number`),
  CONSTRAINT `FK_IP_COUNTRY_MAPPING_TEMP_CRON` FOREIGN KEY (`cron`) REFERENCES `cron` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2020168 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
1 row in set (0.00 sec)

The query to be optimized is:

SELECT ipcountrym0_.country as col_0_0_, ipcountrym0_.state as col_1_0_,ipcountrym0_.city as col_2_0_ 
FROM ip_country_mapping ipcountrym0_ 
WHERE ipcountrym0_.start_ip_number<=1376791568 
AND ipcountrym0_.end_ip_number>=1376791568;

EXPLAIN gives the following output:

EXPLAIN SELECT ipcountrym0_.country as col_0_0_, ipcountrym0_.state as col_1_0_, ipcountrym0_.city as col_2_0_ 
FROM ip_country_mapping ipcountrym0_ 
WHERE ipcountrym0_.start_ip_number<=1376791568 
AND ipcountrym0_.end_ip_number>=1376791568;
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: ipcountrym0_
             type: ALL
    possible_keys: UK_IP_COUNTRY_MAPPING_TEMP_START_IP_NUMBER,UK_IP_COUNTRY_MAPPING_TEMP_END_IP_NUMBER,ind_ipscan
              key: NULL
          key_len: NULL
              ref: NULL
             rows: 2081584
            Extra: Using where

If I add LIMIT 1 to the above query, it scans 199999 rows. The query returns only one row. If I add an ORDER BY clause together with LIMIT 1, it scans only 2 rows:

EXPLAIN SELECT ipcountrym0_.country as col_0_0_, ipcountrym0_.state as col_1_0_, ipcountrym0_.city as col_2_0_ 
FROM ip_country_mapping ipcountrym0_ 
WHERE ipcountrym0_.start_ip_number<=1376791568 
AND ipcountrym0_.end_ip_number>=1376791568 
ORDER BY id LIMIT 1;
        *************************** 1. row ***************************
                   id: 1
          select_type: SIMPLE
                table: ipcountrym0_
                 type: index
        possible_keys: UK_IP_COUNTRY_MAPPING_TEMP_START_IP_NUMBER,UK_IP_COUNTRY_MAPPING_TEMP_END_IP_NUMBER,ind_ipscan
                  key: PRIMARY
              key_len: 8
                  ref: NULL
                 rows: 2
                Extra: Using where
        1 row in set (0.00 sec)

Is this a correct way to optimize my query, or is there something wrong with what I tried?
My manager does not agree with this optimization. He considers that the optimizer first scans the index on the id column and only then checks the start IP number and end IP number, which is not relevant to the lookup.

Can somebody please explain how the optimizer works here, and whether this is a correct way to optimize the query? It is argued that the optimizer is showing the wrong plan.

Best Answer

Your manager seems to be right. Your second query (with the ORDER BY) does scan the primary key index (key: PRIMARY and type: index in the EXPLAIN output), and then checks whether the start_ip_number and end_ip_number columns satisfy the WHERE condition (Extra: Using where). The rows: 2 figure is only an estimate; the number of rows actually scanned depends heavily on the IP address, because the scan stops at the first matching row. In the worst case (no row matches the value) you will scan the whole table anyway.
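One way to see how many rows a statement really reads, rather than relying on the EXPLAIN estimate, is to reset the session status counters, run the query, and look at the Handler_read counters. This is only a sketch using the table and value from the question; the numbers you get will depend on your data.

```sql
-- Reset the session-level status counters.
FLUSH STATUS;

-- Run the query under test (the ORDER BY ... LIMIT 1 variant from the question).
SELECT country, state, city
FROM ip_country_mapping
WHERE start_ip_number <= 1376791568
  AND end_ip_number >= 1376791568
ORDER BY id
LIMIT 1;

-- Handler_read_first / Handler_read_next count rows read while walking an index;
-- Handler_read_rnd_next counts rows read by a table scan.
SHOW SESSION STATUS LIKE 'Handler_read%';
```

Repeating this with an address that falls in no range should show the worst case described above.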

Your query uses two range conditions on different columns. A B-tree index can apply only one of them as a range bound, so the optimizer has to choose one. To decide which index is best, it estimates how many rows would satisfy the relevant WHERE condition through each index. The general rule of thumb is that if more than about 20% of the rows would be selected through an index, it is more efficient to skip the index and do a full table scan.
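To see what the optimizer is weighing, you could compare the estimate for each index by forcing it in turn, and count how many rows each condition matches on its own. This is a sketch against the table from the question; the counting query is itself a full scan, so treat it as a one-off diagnostic.

```sql
-- Estimated rows when only the start_ip_number index is allowed.
EXPLAIN SELECT country, state, city
FROM ip_country_mapping FORCE INDEX (UK_IP_COUNTRY_MAPPING_TEMP_START_IP_NUMBER)
WHERE start_ip_number <= 1376791568
  AND end_ip_number >= 1376791568;

-- Estimated rows when only the end_ip_number index is allowed.
EXPLAIN SELECT country, state, city
FROM ip_country_mapping FORCE INDEX (UK_IP_COUNTRY_MAPPING_TEMP_END_IP_NUMBER)
WHERE start_ip_number <= 1376791568
  AND end_ip_number >= 1376791568;

-- Actual selectivity of each condition taken alone (comparisons yield 1 or 0).
SELECT SUM(start_ip_number <= 1376791568) AS rows_matching_start,
       SUM(end_ip_number   >= 1376791568) AS rows_matching_end,
       COUNT(*)                           AS total_rows
FROM ip_country_mapping;
```

If each condition alone matches a large share of the table, a full scan is the expected choice for the original query.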

The reason might be that too many rows really do satisfy each of the WHERE conditions, but it is also possible that the index statistics are not accurate. You can check whether the statistics are up to date with SHOW INDEX, and update them with ANALYZE TABLE. Note that ANALYZE TABLE holds a read lock on the table until it finishes.
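A minimal sketch of that check, using the table from the question: compare the Cardinality column before and after refreshing the statistics.

```sql
-- Cardinality is the estimated number of distinct values per index.
-- For the two unique indexes it should be close to the table's row count.
SHOW INDEX FROM ip_country_mapping;

-- Recalculate the index statistics.
ANALYZE TABLE ip_country_mapping;

-- Look again; a large change suggests the old statistics were stale.
SHOW INDEX FROM ip_country_mapping;
```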

To address your specific issue you can design your table differently, using spatial indexes for the IP range and then spatial functions to check whether an address belongs to that range. See this post for more detail. Note that in MySQL 5.5 spatial indexes are available only for MyISAM tables. InnoDB supports spatial columns, but you can't have spatial indexes on them.
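A rough sketch of the spatial idea, assuming MySQL 5.5 and a separate MyISAM lookup table. The table name ip_country_spatial, its columns, and the choice to encode each range as a thin rectangle along the x axis are illustrative assumptions, not part of the original schema.

```sql
-- Hypothetical lookup table; SPATIAL KEY requires MyISAM and a NOT NULL column here.
CREATE TABLE ip_country_spatial (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
    ip_range POLYGON NOT NULL,
    country  VARCHAR(2) NOT NULL,
    PRIMARY KEY (id),
    SPATIAL KEY idx_ip_range (ip_range)
) ENGINE=MyISAM;

-- Encode each [start, end] range as a rectangle from (start, -1) to (end, 1).
INSERT INTO ip_country_spatial (ip_range, country)
SELECT GeomFromText(CONCAT('POLYGON((',
           start_ip_number, ' -1,', end_ip_number,   ' -1,',
           end_ip_number,   ' 1,',  start_ip_number, ' 1,',
           start_ip_number, ' -1))')),
       country
FROM ip_country_mapping;

-- Look up the range whose bounding rectangle contains the point (ip, 0).
SELECT country
FROM ip_country_spatial
WHERE MBRContains(ip_range, GeomFromText('POINT(1376791568 0)'));
```

Ranges that cover a single address produce a zero-width rectangle, so it would be worth testing that case before relying on this layout.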

There are also other approaches, which require that the IP ranges do not overlap:

Rick James' Blocks of Addresses, such as IP Addresses Reference
Maciej Dobrzański's Implementing efficient Geo IP location system in MySQL
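When the ranges do not overlap, the lookup can be reduced to a single index probe: find the range with the largest start_ip_number that is not above the address, then check that the address is not past its end. The following is a sketch of that general idea against the table from the question, not a reproduction of either article.

```sql
SELECT candidate.country, candidate.state, candidate.city
FROM (
    -- Walk the unique index on start_ip_number backwards and stop at the
    -- first entry: the closest range starting at or below the address.
    SELECT country, state, city, end_ip_number
    FROM ip_country_mapping
    WHERE start_ip_number <= 1376791568
    ORDER BY start_ip_number DESC
    LIMIT 1
) AS candidate
-- Reject the candidate if the address falls in a gap between ranges.
WHERE candidate.end_ip_number >= 1376791568;
```

Running EXPLAIN on the inner SELECT should show the start_ip_number index being used with a very small row estimate; if ranges can overlap, this query may return the wrong range.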

Related Solutions

MySQL optimization – year column grouping – using temporary table, filesort

I don't see a lot of opportunity for improvement.

The index you added was probably a big help, because it's being used for the range matching on the WHERE clause (type => range, key => tran_date), and it's being used as a covering index (extra => using index), avoiding the need to seek into the table to fetch the row data.

But since you're using functions to construct the financial_year value for the group by, both the "using filesort" and "using temporary" can't be avoided. But, those aren't the real problem. The real problem is that you're evaluating MONTH(tran_date) 346,485 times and YEAR(tran_date) at least that many times... ~700,000 function calls in one second doesn't seem too bad.

Plan B: I am definitely not a fan of storing redundant data, and I'm dead-set against making the application responsible for maintaining it... but one option I might be tempted to try would be to create a dashboard_stats_by_financial_year table, and use after-insert/update/delete triggers on the transactions1 table to manage keeping those stats current.

That option has a cost, of course -- adding to the amount of time it takes to update/insert/delete a transaction... but, waiting > 1200 milliseconds for stats for your dashboard is a cost, too. So it may come down to whether you want to pay for it now or pay for it later.
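As a sketch of what Plan B might look like, here is a summary table and the after-insert trigger only. The tran_count column, the trigger name, and the assumption that the financial year starts in April and is labelled by its starting year are all hypothetical; substitute the expression your query actually uses. Matching after-update and after-delete triggers would be needed as well.

```sql
-- Hypothetical summary table: one row per financial year.
CREATE TABLE dashboard_stats_by_financial_year (
    financial_year SMALLINT NOT NULL,
    tran_count     INT NOT NULL DEFAULT 0,
    PRIMARY KEY (financial_year)
);

DELIMITER //

CREATE TRIGGER transactions1_after_insert
AFTER INSERT ON transactions1
FOR EACH ROW
BEGIN
    -- Assumed rule: April to December belong to the year that starts then,
    -- January to March belong to the previous year's financial year.
    INSERT INTO dashboard_stats_by_financial_year (financial_year, tran_count)
    VALUES (IF(MONTH(NEW.tran_date) >= 4,
               YEAR(NEW.tran_date),
               YEAR(NEW.tran_date) - 1),
            1)
    ON DUPLICATE KEY UPDATE tran_count = tran_count + 1;
END//

DELIMITER ;
```

The dashboard would then read the handful of rows in the summary table instead of grouping the whole transactions1 table.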

MySQL InnoDB locks primary key on delete even in READ COMMITTED

NEW ANSWER (MySQL-style dynamic SQL): Ok, this one tackles the problem in the way one of the other posters described: reversing the order in which mutually incompatible exclusive locks are acquired so that, regardless of how many occur, they are held only for the least amount of time at the end of transaction execution.

This is accomplished by separating the read part of the statement into its own SELECT statement and dynamically generating a DELETE statement that is forced to run last because of the order in which the statements appear, and which affects only the proc_warnings table.

A demo is available at sql fiddle:

This link shows the schema w/ sample data, and a simple query for rows that match on ivehicle_id=2. 2 rows result, as none of them have been deleted.

This link shows the same schema and sample data, but passes the value 2 to the DeleteEntries stored program, telling the SP to delete proc_warnings entries for ivehicle_id=2. The simple query for rows returns no results, as they have all been successfully deleted. The demo links only demonstrate that the code works as intended to delete. A user with the proper test environment can comment on whether this solves the problem of the blocked thread.

Here is the code as well for convenience:

CREATE PROCEDURE DeleteEntries (input_vid INT)
BEGIN

    SELECT @idstring:= '';
    SELECT @idnum:= 0;
    SELECT @del_stmt:= '';

    SELECT @idnum:= @idnum+1 idnum_col, @idstring:= CONCAT(@idstring, CASE WHEN CHARACTER_LENGTH(@idstring) > 0 THEN ',' ELSE '' END, CAST(id AS CHAR(10))) idstring_col
    FROM proc_warnings
    WHERE EXISTS (
        SELECT 0
        FROM day_position
        WHERE day_position.transaction_id = proc_warnings.transaction_id
        AND day_position.dirty_data = 1
        AND EXISTS (
            SELECT 0
            FROM ivehicle_days
            WHERE ivehicle_days.id = day_position.ivehicle_day_id
            AND ivehicle_days.ivehicle_id = input_vid
        )
    )
    ORDER BY idnum_col DESC
    LIMIT 1;

    IF (@idnum > 0) THEN
        SELECT @del_stmt:= CONCAT('DELETE FROM proc_warnings WHERE id IN (', @idstring, ');');

        PREPARE del_stmt_hndl FROM @del_stmt;
        EXECUTE del_stmt_hndl;
        DEALLOCATE PREPARE del_stmt_hndl;
    END IF;
END;

This is the syntax to call the program from within a transaction:

CALL DeleteEntries(2);

ORIGINAL ANSWER (still think it's not too shabby): Looks like 2 issues: 1) slow query, 2) unexpected locking behavior.

As regards issue #1, slow queries are often resolved by the same two techniques in tandem: simplifying the query statement, and adding or modifying indexes. You yourself already made the connection to indexes: without them the optimizer cannot narrow the search to a limited set of rows, and every extra row scanned in one table multiplies the work that must be done in each joined table.

REVISED AFTER SEEING POST OF SCHEMA AND INDEXES: But I imagine you'll get the most performance benefit for your query by making sure you have a good index configuration. To do so, you can go for better delete performance, and possibly even better delete performance with covering indexes, with the trade-off of larger indexes and perhaps noticeably slower insert performance on the same tables to which the additional index structure is added.

SOMEWHAT BETTER:

CREATE TABLE  `day_position` (
    ...,
    KEY `day_position__id_rvrsd` (`dirty_data`, `ivehicle_day_id`)

) ;


CREATE TABLE  `ivehicle_days` (
    ...,
    KEY `ivehicle_days__vid_no_sort_index` (`ivehicle_id`)
);

REVISED HERE TOO: Since it takes as long as it does to run, I'd leave the dirty_data in the index, and I got it wrong too for sure when I placed it after the ivehicle_day_id in index order - it should be first.

But if I had my hands on it at this point, since there must be a good amount of data to make it take that long, I would just go for all covering indexes, to make sure I was getting the best indexing that my troubleshooting time could buy, if for nothing else than to rule that part of the problem out.

BEST/COVERING INDEXES:

CREATE TABLE  `day_position` (
    ...,
    KEY `day_position__id_rvrsd_trnsid_cvrng` (`dirty_data`, `ivehicle_day_id`, `transaction_id`)
) ;

CREATE TABLE  `ivehicle_days` (
    ...,
    UNIQUE KEY `ivehicle_days__vid_id_cvrng` (ivehicle_id, id)
);

CREATE TABLE  `proc_warnings` (

    .., /*rename primary key*/
    CONSTRAINT pk_proc_warnings PRIMARY KEY (id),
    UNIQUE KEY `proc_warnings__transaction_id_id_cvrng` (`transaction_id`, `id`)
);

There are two performance optimization goals sought by the last two change suggestions:
1) If the search keys for successively accessed tables are not the same as the clustered key results returned for the currently accessed table, we eliminate what would have been a need to make a second set of index-seek-with-scan operations on the clustered index
2) If the latter is not the case, there is still at least the possibility that the optimizer can select a more efficient join algorithm since the indexes will be keeping the required join keys in sorted order.

Your query seems about as simplified as it can be (copied here in case it is edited later):

DELETE pw 
FROM proc_warnings pw 
INNER JOIN day_position dp 
    ON dp.transaction_id = pw.transaction_id 
INNER JOIN ivehicle_days vd 
    ON vd.id = dp.ivehicle_day_id 
WHERE vd.ivehicle_id=2 AND dp.dirty_data=1;

Unless of course there's something about written join order that affects the way the query optimizer proceeds in which case you could try some of the rewrite suggestions others have provided, including perhaps this one w/ index hints (optional):

DELETE FROM proc_warnings
FORCE INDEX (`proc_warnings__transaction_id_id_cvrng`, `pk_proc_warnings`)
WHERE EXISTS (
    SELECT 0
    FROM day_position
    FORCE INDEX (`day_position__id_rvrsd_trnsid_cvrng`)  
    WHERE day_position.transaction_id = proc_warnings.transaction_id
    AND day_position.dirty_data = 1
    AND EXISTS (
        SELECT 0
        FROM ivehicle_days
        FORCE INDEX (`ivehicle_days__vid_id_cvrng`)  
        WHERE ivehicle_days.id = day_position.ivehicle_day_id
        AND ivehicle_days.ivehicle_id = ?
    )
);
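To check whether the suggested indexes are actually picked up, one option is to EXPLAIN a SELECT with the same joins and conditions as the delete, since MySQL versions before 5.6.3 cannot EXPLAIN a DELETE directly. This is a sketch that assumes the covering indexes above have been created.

```sql
-- Same joins and filters as the DELETE, rewritten as a SELECT for EXPLAIN.
EXPLAIN
SELECT pw.id
FROM proc_warnings pw
INNER JOIN day_position dp
    ON dp.transaction_id = pw.transaction_id
INNER JOIN ivehicle_days vd
    ON vd.id = dp.ivehicle_day_id
WHERE vd.ivehicle_id = 2 AND dp.dirty_data = 1;
```

The key column shows the index chosen for each table, and "Using index" in Extra indicates that the index covers the columns needed from that table. The plan used by the real DELETE may still differ, so treat this as an approximation.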

As regards #2, unexpected locking behavior.

You wrote: "As I can see both queries wants an exclusive X lock on a row with primary key = 53. However, neither of them must delete rows from proc_warnings table. I just don't understand why the index is locked."

I guess it would be the index that's locked because the row of data to be locked is in a clustered index, i.e. the single row of data itself resides in the index.

It would be locked, because:
1) according to http://dev.mysql.com/doc/refman/5.1/en/innodb-locks-set.html

...a DELETE generally set record locks on every index record that is scanned in the processing of the SQL statement. It does not matter whether there are WHERE conditions in the statement that would exclude the row. InnoDB does not remember the exact WHERE condition, but only knows which index ranges were scanned.

You also mentioned above:

...as for me the main feature of READ COMMITTED is how it deals with locks. It should release the index locks of non-matching rows, but it doesn't.

and provided the following reference for that:
http://dev.mysql.com/doc/refman/5.1/en/set-transaction.html#isolevel_read-committed

Which states the same as you, except that according to that same reference there is a condition upon which a lock shall be released:

Also, record locks for nonmatching rows are released after MySQL has evaluated the WHERE condition.

Which is reiterated as well at this manual page http://dev.mysql.com/doc/refman/5.1/en/innodb-record-level-locks.html

There are also other effects of using the READ COMMITTED isolation level or enabling innodb_locks_unsafe_for_binlog: Record locks for nonmatching rows are released after MySQL has evaluated the WHERE condition.

So, we're told that the WHERE condition must be evaluated before the lock can be released. Unfortunately we're not told when the WHERE condition is evaluated, and it is probably something that changes from one optimizer plan to another. But it does tell us that lock release depends somehow on the performance of query execution, which, as discussed above, depends on careful writing of the statement and judicious use of indexes. It can also be improved by better table design, but that is probably best left to a separate question.
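A sketch of how the locks could be observed in a test environment, using the delete statement from the question. It needs two sessions; the statement in the second session is shown as a comment because it must run on a separate connection.

```sql
-- Session 1: run the delete under READ COMMITTED and leave the transaction open.
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;

DELETE pw
FROM proc_warnings pw
INNER JOIN day_position dp
    ON dp.transaction_id = pw.transaction_id
INNER JOIN ivehicle_days vd
    ON vd.id = dp.ivehicle_day_id
WHERE vd.ivehicle_id = 2 AND dp.dirty_data = 1;

-- Session 2 (separate connection), while session 1 is still open:
--   SHOW ENGINE INNODB STATUS;
-- The TRANSACTIONS section lists, per transaction, how many lock structs
-- and row locks it currently holds.

-- Session 1: undo the test delete and release the locks.
ROLLBACK;
```

Comparing the row lock count with the number of rows actually deleted gives an indication of whether locks on non-matching rows were released.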

"Moreover, the index is not locked either when proc_warnings table is empty"

The database can't lock records within the index if there are none.

"Moreover, the index is not locked when...the day_position table contains fewer number of rows (i.e. one hundred rows)."

This could mean numerous things such as but probably not limited to: a different execution plan due to a change in statistics, a too-brief-to-be-observed-lock due to a much faster execution due to a much smaller data set/join operation.
