Mysql – Group by with limit

group byMySQL

My question is regards the performance.

What exactly happening when you use group by and then limit?

I have a very big table, when I query with limit and without group by its immediate. When I use group by with limit 1 it takes forever. Now, the group by is running on an index, without the group by I get many potential "groups", items that will match the group by, so I don't need to worry there is only 1 group. So what I don't understand is why won't MySQL stop after the first group has been found when I put limit 1?

The query (even tried with force index):

SELECT fileso.hostId
FROM indicators.fileso fileso force index (hostId)
WHERE 1 = 1 
GROUP BY  fileso.hostId
LIMIT 1;

Without group by it's immediate.

Table Creation (large table):

CREATE TABLE `fileso2` (
    `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
    `hostId` INT(11) UNSIGNED NOT NULL COMMENT 'code',
    `sha256` BINARY(32) NOT NULL COMMENT 'meir',
    `fileName` VARCHAR(150) NULL DEFAULT NULL COMMENT 'meir',
    `fullPath` VARCHAR(350) NULL DEFAULT NULL COMMENT 'meir',
    `datein` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    `lastSeen` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    `lastRA` TIMESTAMP NULL DEFAULT NULL COMMENT '?',
    `riskLevel` SMALLINT(6) NOT NULL DEFAULT '-1' COMMENT '?',
    `policiesViolated` VARCHAR(1000) NULL DEFAULT NULL,
    `NTFSOwner` VARCHAR(255) NULL DEFAULT NULL COMMENT 'meir',
    `NTFSOwnerId` INT(11) NOT NULL DEFAULT '-1' COMMENT 'onIdle -done',
    `p_runningUser` VARCHAR(255) NULL DEFAULT NULL COMMENT 'meir',
    `p_parentUser` VARCHAR(255) NULL DEFAULT NULL,
    `p_commandLineParams` VARCHAR(1000) NULL DEFAULT NULL COMMENT 'meir',
    `p_runningUserId` INT(11) NOT NULL DEFAULT '-1' COMMENT 'onIdle -done',
    `p_ParentPath` VARCHAR(350) NULL DEFAULT NULL COMMENT 'meir',
    `ar_regPath` VARCHAR(350) NULL DEFAULT NULL COMMENT 'meir',
    `ar_regKey` VARCHAR(100) NULL DEFAULT NULL COMMENT 'meir',
    `ar_regValue` VARCHAR(350) NULL DEFAULT NULL COMMENT 'meir',
    `s_serviceName` VARCHAR(100) NULL DEFAULT NULL COMMENT 'meir',
    `d_driverName` VARCHAR(100) NULL DEFAULT NULL COMMENT 'meir',
    `hostName` CHAR(20) NULL DEFAULT NULL,
    `extension` CHAR(5) NOT NULL DEFAULT '',
    `uniqueness` BINARY(32) NOT NULL DEFAULT '00000000000000000000000000000000',
    PRIMARY KEY (`id`),
    UNIQUE INDEX `uniqueness` (`uniqueness`),
    INDEX `p_runningUserId` (`p_runningUserId`),
    INDEX `riskLevel` (`riskLevel`),
    INDEX `lastSeen` (`lastSeen`),
    INDEX `hostId` (`hostId`),
    INDEX `datein` (`datein`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
AUTO_INCREMENT=2384056;

Best Answer

Short answer: You did not give enough info to answer your question.

Long answer: This may cover your situation, plus explain a lot of other things...

In MySQL, GROUP BY implies the same ORDER BY. (This is a non-standard extension.) (But this point may be irrelevant.)

GROUP BY can be performed in two ways. Please provide EXPLAIN SELECT ... for both and, if possible, EXPLAIN FORMAT=JSON SELECT ...; those may show the difference.

Plan A: Gather all the data; sort it; walk through the sorted list doing the "grouping" stuff, such as SUM, COUNT, MAX. (Actually there is a leapfrogging optimization for MAX.) This lengthy process is what happens without the LIMIT. Note that it gives ORDER BY for 'free'.

Plan B: Keep a hash in memory; scan through the table (without any sorting); the result is in the hash. Then it needs sorting. (ORDER BY NULL avoids this sort.)

Both A and B involve at least a full table scan, so they do not explain your "immediate" response unless the table is quite small. Is it?

Neither of those necessarily took advantage of any index. So...

Plan C: INDEX(x) and GROUP BY x. In this case, the optimizer may choose to use the index as a way to avoid the 'sort'. Furthermore if you have LIMIT 1, it only needs to do one value of x. (Note: If there is a 'sort' in the pipeline, LIMIT cannot be done until after the sort; by then most of work has been done, so the limit does not save much time.)

Plan D (already alluded to): If you are only doing MAX/MIN, then only the last/first row need be looked at. If you also have a relevant index, it can reach into the index for the last/first row in each clump, thereby skipping all the other rows. In this case, even without a LIMIT, you may get a significant performance boost.

Caveat

When doing SELECT x, y FROM t GROUP BY x, the engine is under no obligation to pick any particular value of y. Furthermore the Optimizer seems to treat this case as "not worth optimizing", and may not do you what you expect (nor what I say above). More on ONLY_FULL_GROUP_BY