My question is regards the performance.
What exactly happening when you use group by
and then limit
?
I have a very big table, when I query with limit
and without group by
its immediate. When I use group by
with limit 1
it takes forever. Now, the group by
is running on an index, without the group by
I get many potential "groups", items that will match the group by
, so I don't need to worry there is only 1 group. So what I don't understand is why won't MySQL stop after the first group has been found when I put limit 1
?
The query (even tried with force index
):
SELECT fileso.hostId
FROM indicators.fileso fileso force index (hostId)
WHERE 1 = 1
GROUP BY fileso.hostId
LIMIT 1;
Without group by
it's immediate.
Table Creation (large table):
CREATE TABLE `fileso2` (
`id` BIGINT(20) NOT NULL AUTO_INCREMENT,
`hostId` INT(11) UNSIGNED NOT NULL COMMENT 'code',
`sha256` BINARY(32) NOT NULL COMMENT 'meir',
`fileName` VARCHAR(150) NULL DEFAULT NULL COMMENT 'meir',
`fullPath` VARCHAR(350) NULL DEFAULT NULL COMMENT 'meir',
`datein` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
`lastSeen` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
`lastRA` TIMESTAMP NULL DEFAULT NULL COMMENT '?',
`riskLevel` SMALLINT(6) NOT NULL DEFAULT '-1' COMMENT '?',
`policiesViolated` VARCHAR(1000) NULL DEFAULT NULL,
`NTFSOwner` VARCHAR(255) NULL DEFAULT NULL COMMENT 'meir',
`NTFSOwnerId` INT(11) NOT NULL DEFAULT '-1' COMMENT 'onIdle -done',
`p_runningUser` VARCHAR(255) NULL DEFAULT NULL COMMENT 'meir',
`p_parentUser` VARCHAR(255) NULL DEFAULT NULL,
`p_commandLineParams` VARCHAR(1000) NULL DEFAULT NULL COMMENT 'meir',
`p_runningUserId` INT(11) NOT NULL DEFAULT '-1' COMMENT 'onIdle -done',
`p_ParentPath` VARCHAR(350) NULL DEFAULT NULL COMMENT 'meir',
`ar_regPath` VARCHAR(350) NULL DEFAULT NULL COMMENT 'meir',
`ar_regKey` VARCHAR(100) NULL DEFAULT NULL COMMENT 'meir',
`ar_regValue` VARCHAR(350) NULL DEFAULT NULL COMMENT 'meir',
`s_serviceName` VARCHAR(100) NULL DEFAULT NULL COMMENT 'meir',
`d_driverName` VARCHAR(100) NULL DEFAULT NULL COMMENT 'meir',
`hostName` CHAR(20) NULL DEFAULT NULL,
`extension` CHAR(5) NOT NULL DEFAULT '',
`uniqueness` BINARY(32) NOT NULL DEFAULT '00000000000000000000000000000000',
PRIMARY KEY (`id`),
UNIQUE INDEX `uniqueness` (`uniqueness`),
INDEX `p_runningUserId` (`p_runningUserId`),
INDEX `riskLevel` (`riskLevel`),
INDEX `lastSeen` (`lastSeen`),
INDEX `hostId` (`hostId`),
INDEX `datein` (`datein`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
AUTO_INCREMENT=2384056;
Best Answer
Short answer: You did not give enough info to answer your question.
Long answer: This may cover your situation, plus explain a lot of other things...
In MySQL,
GROUP BY
implies the sameORDER BY
. (This is a non-standard extension.) (But this point may be irrelevant.)GROUP BY
can be performed in two ways. Please provideEXPLAIN SELECT ...
for both and, if possible,EXPLAIN FORMAT=JSON SELECT ...
; those may show the difference.Plan A: Gather all the data; sort it; walk through the sorted list doing the "grouping" stuff, such as
SUM
,COUNT
,MAX
. (Actually there is a leapfrogging optimization forMAX
.) This lengthy process is what happens without theLIMIT
. Note that it givesORDER BY
for 'free'.Plan B: Keep a hash in memory; scan through the table (without any sorting); the result is in the hash. Then it needs sorting. (
ORDER BY NULL
avoids this sort.)Both A and B involve at least a full table scan, so they do not explain your "immediate" response unless the table is quite small. Is it?
Neither of those necessarily took advantage of any index. So...
Plan C:
INDEX(x)
andGROUP BY x
. In this case, the optimizer may choose to use the index as a way to avoid the 'sort'. Furthermore if you haveLIMIT 1
, it only needs to do one value ofx
. (Note: If there is a 'sort' in the pipeline,LIMIT
cannot be done until after the sort; by then most of work has been done, so the limit does not save much time.)Plan D (already alluded to): If you are only doing
MAX
/MIN
, then only the last/first row need be looked at. If you also have a relevant index, it can reach into the index for the last/first row in each clump, thereby skipping all the other rows. In this case, even without aLIMIT
, you may get a significant performance boost.Caveat
When doing
SELECT x, y FROM t GROUP BY x
, the engine is under no obligation to pick any particular value ofy
. Furthermore the Optimizer seems to treat this case as "not worth optimizing", and may not do you what you expect (nor what I say above). More onONLY_FULL_GROUP_BY