Mysql – Optimizing a query that performs worse on a smaller dataset

join;MySQLoptimization

I am trying to run this query. The * is just for simplification, I'm selecting ~10 specific fields for some calculations.

SELECT * FROM `orders`
    LEFT JOIN `orders_processed` ON `orders_processed`.`order_id` = `orders`.`id`
    LEFT JOIN `returns` ON
        `orders_processed`.`date_id`=`returns`.`date_id`
        AND `orders_processed`.`product_id`=`returns`.`product_id`
        AND `orders_processed`.`variant_id`=`returns`.`variant_id`
WHERE `orders`.`date_id` BETWEEN @from AND @to

Where date_id is a date in the format '%Y%m%d' (e.g. 20150501), is not unique (multiple rows have the same date_ids), orders and returns have ~200k rows each, orders_processed has ~400k.
I'm running this query with @from := 20140505, @to := 20150505 and it stalls. I'm assuming it's because of

`orders_processed`.`date_id`=`returns`.`date_id`

since when I change it to

`orders_processed`.`order_id`=`returns`.`order_id` 

it runs in a split second. However, the calculation result changes and I do not want that. Is there any way to change this part of the join into something more efficient?

Also, it might be worth mentioning that it runs fine on an identical, just slightly bigger database (same schema, just updated with new data).

The EXPLAIN for query on the smaller database:

select_type |   table   | table type | possible_keys | key     |key_len| ref     | rows | extra
------------------------------------------------------------------------------------------------------------------
    SIMPLE  |   orders  |    range   |     date_id   | date_id |   5   | null    |89264 | Using where; Using index
------------------------------------------------------------------------------------------------------------------
    SIMPLE  |o_processed|    ref     | uni,order_id  |   uni   |   4   |orders.id|   1  | null
------------------------------------------------------------------------------------------------------------------
SIMPLE|returns|ref|date_id,returns_ibfk_1,returns_ibfk_5|returns_ibfk_1|3|orders_processed.variant_id|5|Using where

The explain result for the bigger database is identical except for the 'rows' for orders, it's 95020.

The CREATE TABLEs (simplified, removed irrelevant columns)

    CREATE TABLE IF NOT EXISTS `orders` (
    `id` int(10) unsigned NOT NULL DEFAULT '0',
    `customer` int(10) unsigned NOT NULL DEFAULT '0',
    `date_time` datetime DEFAULT NULL,
    `date_id` int(10) unsigned DEFAULT NULL,
    `shipping` decimal(10,2) unsigned DEFAULT NULL,
    `updated_on` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (`id`),
    KEY (`date_id`),
    CONSTRAINT `orders_ibfk_1` FOREIGN KEY (`id`) REFERENCES `ref_order_id`(`id`) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT `orders_ibfk_2` FOREIGN KEY (`date_id`) REFERENCES `cr_calendar` (`date_id`) ON DELETE CASCADE ON UPDATE CASCADE) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

CREATE TABLE IF NOT EXISTS `orders_processed` (
    `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `order_id` int(10) unsigned NOT NULL DEFAULT '0',
    `date_id` int(10) unsigned DEFAULT NULL,
    `product_id` mediumint(8) unsigned NOT NULL DEFAULT '0',
    `variant_id` mediumint(8) unsigned NOT NULL DEFAULT '0',
    `price` decimal(10,2) unsigned NOT NULL DEFAULT '0.00',
    `quantity` smallint(5) unsigned DEFAULT '0',
    `updated_on` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (`id`),
    UNIQUE KEY `uni`(`order_id`,`product_id`,`variant_id`,`price`),
    KEY(`order_id`),
    KEY(`product_id`),
    KEY(`date_id`),
    KEY(`quantity`),
    KEY(`variant_id`),
    KEY(`updated_on`),
    CONSTRAINT `orders_processed_ibfk_1` FOREIGN KEY (`product_id`) REFERENCES `products`(`id`) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT `orders_processed_ibfk_2` FOREIGN KEY (`order_id`) REFERENCES `ref_order_id`(`id`) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT `orders_processed_ibfk_3` FOREIGN KEY (`date_id`) REFERENCES `calendar`(`date_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

CREATE TABLE IF NOT EXISTS `returns` (
    `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `order_id` int(10) unsigned NOT NULL DEFAULT '0',
    `customer_id` int(8) unsigned not null default 0,
    `product_id` mediumint(8) unsigned NOT NULL DEFAULT '0',
    `variant_id` mediumint(8) unsigned NOT NULL DEFAULT '0',
    `date_id` int(10) unsigned DEFAULT NULL,
    `refund` decimal(10,2) unsigned DEFAULT '0.00',
    `reason` smallint(5) unsigned NOT NULL DEFAULT 0,
    `updated_on` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (`id`),
    UNIQUE KEY `uni`(`order_id`,`product_id`,`variant_id`,`reason`),
    KEY(`date_id`),
    KEY(`reason`),
    KEY(`updated_on`),
    CONSTRAINT `returns_ibfk_1` FOREIGN KEY (`product_id`) REFERENCES `products` (`id`) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT `returns_ibfk_2` FOREIGN KEY (`reason`) REFERENCES `ref_reasons` (`id`) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT `returns_ibfk_3` FOREIGN KEY (`order_id`) REFERENCES `ref_order_id` (`id`) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT `returns_ibfk_4` FOREIGN KEY (`variant_id`) REFERENCES `variants`(`variant_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Best Answer

It looks like you're joining returns on the date and the type of item only, creating a semi-cartesian product. E.g. If you have 100 orders for item A, and 100 returns of item A, on the same day, you'll return 100 x 100 = 10,000 rows, with every return joined to every order. Is this what you want?

In terms of performance, the difference between your two databases may be that the data in the other database gives a smaller results set. E.g. It also has 100 orders for item A and 100 returns of item A but they're spread out over two days giving (50 x 50) + (50 x 50) = 5,000 (same number of source rows but query results are half as many rows).