SQL Server Performance – Optimizing MIN in Subquery

Tags: optimization, performance, query-performance, sql-server, sql-server-2005, subquery

Is there any effective way to tune the query below?

I have a filtered index with the predicate OrganisationID <> 0.
The primary key is on another column, ItemCode varchar(20), which is not relevant to this query.

I have tagged this SQL Server 2005 because it needs to work there, but if a later SQL Server version offers anything better, please bring it up.

--------------------------------------
-- START CLEAR
--------------------------------------

begin try
drop table #T
end try
begin catch
end catch

--------------------------------------
-- THE TABLE
--------------------------------------

CREATE TABLE #T
(
a1 int,
a2 int,
ID INT,
OrganisationID INT,
Distance INT,
constraint PKt1 PRIMARY KEY  CLUSTERED (A1,A2,ID)
)

--------------------------------------
-- ADDING SOME DATA
--------------------------------------

INSERT INTO #T
SELECT 1,1,0,10,100
UNION ALL
SELECT 1,1,1,10,200
UNION ALL
SELECT 1,1,3,10,50
UNION ALL
SELECT 1,1,4,20,80
UNION ALL
SELECT 1,2,5,20,300
UNION ALL
SELECT 1,2,6,0,100
UNION ALL
SELECT 1,3,7,0,100
UNION ALL
SELECT 1,3,4,10,100

GO
--------------------------------------
-- INDEX CREATION
--------------------------------------

create index idx_T_OrganisationID 
on #T (id)
WHERE OrganisationID <> 0
GO

--------------------------------------
-- THE QUERY
--------------------------------------
--SET STATISTICS IO ON
--SET STATISTICS TIME ON

    SELECT
        PID.ID,
        PID.OrganisationID
    FROM #t AS PID 
    WHERE id IN (
        SELECT MIN(id) FROM dbo.#t 
        WHERE OrganisationID <> 0
        GROUP BY OrganisationID
    )

After Comprehensive Testing – The Results

Out of these three queries:

    -- query 1

        SELECT
            PID.ID,
            PID.OrganisationID
        FROM #t AS PID 
        WHERE id IN (
            SELECT MIN(id) FROM dbo.#t 
            WHERE OrganisationID <> 0
            GROUP BY OrganisationID
        )


    -- query 2

        SELECT  rn.ID, --... other columns go here
                rn.OrganisationID
        FROM (
            SELECT *, n = ROW_NUMBER() OVER(PARTITION BY OrganisationID ORDER BY id)
            FROM #t
        ) rn
        WHERE n = 1
          AND OrganisationID <> 0


    -- query 3

        SELECT OrganisationID, MIN(ID)
        FROM #t T
        WHERE OrganisationID <> 0
        GROUP BY OrganisationID;

Query 1 does not even return the correct results.
Query 3 (as suggested in the comments by spaghettidba and ypercubeᵀᴹ) is the one with the best performance in my live environment, with my real table and data, when I compare the query I had originally with the one based on query 3 (comparison picture not reproduced here).

For this exercise in particular, using the table #T, query 2 and query 3 perform more or less equally according to their query plans (picture not reproduced here).
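Why query 1 returns extra rows can be checked on the sample data above. This is a minimal sketch, not the asker's production query: in #T the value ID = 4 appears under two different organisations, so matching on ID alone also picks up the organisation 10 row whose ID is 4. Joining back on both OrganisationID and the minimum ID is one way to keep the extra columns while returning one row per organisation; it assumes the pair (OrganisationID, ID) is unique, which holds for the sample rows but is not stated for the real table.

```sql
-- MIN(ID) per organisation on the sample data: 0 for organisation 10, 4 for organisation 20.
SELECT OrganisationID, MIN(ID) AS MinID
FROM #T
WHERE OrganisationID <> 0
GROUP BY OrganisationID;

-- Query 1 filters on ID only, so the row (ID = 4, OrganisationID = 10) also qualifies.
-- Joining on both columns keeps only the row that holds each organisation's minimum ID.
SELECT PID.ID, PID.OrganisationID, PID.Distance
FROM #T AS PID
JOIN (
    SELECT OrganisationID, MIN(ID) AS MinID
    FROM #T
    WHERE OrganisationID <> 0
    GROUP BY OrganisationID
) AS m
    ON  m.OrganisationID = PID.OrganisationID
    AND m.MinID = PID.ID;
```

The derived table is the same statement as query 3, so this form only makes sense when columns beyond ID and OrganisationID are needed.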

Best Answer

This query usually performs better:

SELECT  rn.ID, --... other columns go here
        rn.OrganisationID
FROM (
    SELECT *, n = ROW_NUMBER() OVER(PARTITION BY OrganisationID ORDER BY id)
    FROM #t
) rn
WHERE n = 1
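A supporting index may help either the ROW_NUMBER form or the MIN/GROUP BY form. The sketch below is an assumption, not something the answer proposes: the index name idx_T_OrganisationID_ID is made up, and whether the optimizer uses it depends on the real table and statistics. It is a plain (unfiltered) index because filtered indexes were introduced in SQL Server 2008 and are not available on SQL Server 2005. The query also adds the OrganisationID <> 0 filter from the question, which the answer's query omits.

```sql
-- Hypothetical index: keyed so that each organisation's rows are contiguous and ordered by ID.
-- A plain nonclustered index like this also works on SQL Server 2005.
CREATE NONCLUSTERED INDEX idx_T_OrganisationID_ID
ON #T (OrganisationID, ID);

-- Same shape as the answer's query, with the question's filter applied inside the derived table
-- and only the needed columns selected, so the index above can cover it.
SELECT rn.ID, rn.OrganisationID
FROM (
    SELECT ID, OrganisationID,
           n = ROW_NUMBER() OVER (PARTITION BY OrganisationID ORDER BY ID)
    FROM #T
    WHERE OrganisationID <> 0
) AS rn
WHERE rn.n = 1;
```

On the sample data this returns one row per non-zero organisation: ID 0 for organisation 10 and ID 4 for organisation 20.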

Related Solutions

MySQL – Optimizing large MySQL SELECT WHERE IN clauses

The following is a long shot, as we do not know anything about your hardware, InnoDB configuration, and query specifics, but I bet you are using the wrong tool for the job (InnoDB Engine).

What you are trying to achieve is creating a very heavy index (up to 127 characters, which may take, as a rough approximation, 127*3 bytes per entry), built using the only method available for InnoDB, a B+Tree. Also, as rows are clustered around the primary key, the whole row is actually in the index, and accessing the primary key means accessing the page with the whole row content.

In short, you have a unique index which contains your whole table, and which should fit more or less in memory (not necessarily all of it, but in this case your working set seems to be most of the rows). How big is your InnoDB buffer pool? What is your buffer pool hit ratio? You can check both with SHOW ENGINE INNODB STATUS. My bet is that your buffer pool is too small, or even that you do not have enough physical memory to hold your working set. In both cases, this may be forcing InnoDB to perform disk I/O for every query. You may think that you should not need to have everything cached for everything to work well, and you would be right. But for your particular workload (large PKs), InnoDB is not the best engine. A hash index, available in other RDBMSs and MySQL engines, would probably be smaller and faster, but it is not supported by InnoDB. Additionally, IN with a list of values covering a huge number of rows may not be the optimal way of querying (at the MySQL level), but it should certainly be faster than running the queries individually.
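Besides SHOW ENGINE INNODB STATUS, the buffer pool size and the counters behind the hit ratio can be read directly. This is a small sketch using standard MySQL statements; what counts as a healthy ratio is workload-specific and is not stated in the answer.

```sql
-- Configured buffer pool size, in bytes.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Innodb_buffer_pool_read_requests: logical read requests.
-- Innodb_buffer_pool_reads: reads that could not be served from the pool and went to disk.
-- A rough hit ratio is 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests).
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
```

Both counters are cumulative since server start, so comparing two samples taken some time apart gives a better picture of the current workload than a single reading.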

Make sure that the query planner (you can check it with EXPLAIN) is using the range join type, and not doing full table scans.
After that, the first thing I would recommend is tuning your InnoDB buffer pool to reduce InnoDB cache misses.
The next thing to try is to emulate a hash index by creating a small secondary index. This method is explained in the book "High Performance MySQL". This way, only the small secondary index would be cached in memory, and it may fit your physical memory better.
If these do not work, before changing technology, I would recommend trying a different engine that works well with key-value datasets. Maybe TokuDB could handle this better? The memcached interface integrated into MySQL/InnoDB 5.6 could be another solution; multi-get seems to fit your use case very well.
Finally, if your load is mostly reads, you could try external technology, like full-text search engines, as you mention, but be careful: that kind of software tends to rely on fuzzy search and may omit some results, and it tends not to be fully ACID compliant (those are things they sacrifice in exchange for query speed).
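The emulated hash index mentioned in the steps above can be sketched as follows. The table lookup_items and the columns long_key and long_key_crc are hypothetical names chosen for illustration, and CRC32 is one possible hash function; the general idea is to index a small integer hash of the long key and keep the original comparison in the WHERE clause to discard hash collisions.

```sql
-- Hypothetical table: lookup_items, with the long string key in long_key.
ALTER TABLE lookup_items
    ADD COLUMN long_key_crc INT UNSIGNED NOT NULL DEFAULT 0,
    ADD INDEX idx_long_key_crc (long_key_crc);

-- Backfill the hash for existing rows. New and updated rows must keep it in sync,
-- for example through triggers or in the application.
UPDATE lookup_items
SET long_key_crc = CRC32(long_key);

-- Look rows up through the small index; the second predicate removes CRC32 collisions.
-- EXPLAIN shows whether the small index is actually chosen.
EXPLAIN
SELECT *
FROM lookup_items
WHERE long_key_crc IN (CRC32('key-one'), CRC32('key-two'))
  AND long_key IN ('key-one', 'key-two');
```

Whether this beats the primary key lookup depends on how much smaller the secondary index is and on how much of it stays cached, so it is worth measuring on the real data.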

Is the WHERE-JOIN-ORDER-(SELECT) Rule for Index Column Order Incorrect?

Is the WHERE-JOIN-ORDER-(SELECT) rule for index column order wrong?

At the least it is incomplete and potentially misleading advice (I didn't bother to read the whole article). If you're going to read stuff on the Internet (including this), you should adjust your amount of trust according to how well you already know and trust the author, but always then verify for yourself.

There are a number of "rules of thumb" for creating indexes, depending on the exact scenario, but none are really a good substitute for understanding the core issues for yourself. Read up on the implementation of indexes and execution plan operators in SQL Server, go through some exercises, and come to a good solid understanding of how indexes can be used to make execution plans more efficient. There is no effective shortcut to attaining this knowledge and experience.

In general, I can say that your indexes should most often have columns used for equality tests first, with any inequalities last, and/or provided by a filter on the index. This is not a complete statement, because indexes can also provide order, which may be more useful than seeking directly to one or more keys in some situations. For example, ordering can be used to avoid a sort, to reduce the cost of a physical join option like merge join, to enable a stream aggregate, find the first few qualifying rows quickly...and so on.
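As an illustration of the equality-first guideline, here is a minimal sketch. The table dbo.Orders, its columns, and the index name are hypothetical and do not come from the question. The equality column leads the key, the inequality column follows, and because the index is ordered by OrderDate within each CustomerID it can also supply the ORDER BY without a sort.

```sql
-- Hypothetical table and columns, for illustration only.
CREATE TABLE dbo.Orders
(
    OrderID    int IDENTITY(1,1) PRIMARY KEY,
    CustomerID int      NOT NULL,
    OrderDate  datetime NOT NULL,
    Total      money    NOT NULL
);

-- Equality column first, inequality column last; Total is included so the index covers the query.
CREATE INDEX IX_Orders_Customer_Date
ON dbo.Orders (CustomerID, OrderDate)
INCLUDE (Total);

DECLARE @CustomerID int, @Since datetime;
SET @CustomerID = 42;
SET @Since = '20240101';

-- Seek on CustomerID = ..., then a range on OrderDate within that customer.
SELECT OrderDate, Total
FROM dbo.Orders
WHERE CustomerID = @CustomerID
  AND OrderDate >= @Since
ORDER BY OrderDate;
```

With the key columns reversed, (OrderDate, CustomerID), the seek could only use the date range, and CustomerID would be checked as a residual predicate on every row in that range.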

I'm being a little vague here, because selecting the ideal index(es) for a query depends on so many factors - this is a very broad topic.

Anyway, it is not unusual to find conflicting signals for the 'best' indexes in a query. For example, your join predicate would like rows ordered one way for a merge join, the group by would like rows sorted another way for a stream aggregate, and finding the qualifying rows using the where clause predicates would suggest other indexes.

The reason indexing is an art as well as science is that an ideal combination is not always logically possible. Choosing the best compromise indexes for the workload (not just a single query) requires analytic skills, experience, and system-specific knowledge. If it were easy, the automated tools would be perfect, and performance-tuning consultants would be much less in demand.

As far as missing index suggestions are concerned: these are opportunistic. The optimizer brings them to your attention when it tries to match predicates and required sort order to an index that does not exist. The suggestions are therefore based on particular matching attempts in the specific context of the particular sub-plan variation it was considering at the time.

In context, the suggestions always make sense, in terms of reducing the estimated cost of data access, according to the optimizer's model. It does not do a wider analysis of the query as a whole (much less the wider workload), so you should think of these suggestions as a gentle hint that a skilled person needs to look at the available indexes, with the suggestions as a starting point (and usually no more than that).
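The same suggestions are also recorded in the missing index DMVs, which can be useful for reviewing them across a workload rather than one plan at a time. This is a sketch only: the ranking expression in the ORDER BY is a common heuristic, not something the answer prescribes, and the output should be treated as the same kind of starting point described above.

```sql
-- Missing index suggestions recorded since the last instance restart.
SELECT TOP (20)
    mid.[statement]        AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_total_user_cost,
    migs.avg_user_impact
FROM sys.dm_db_missing_index_details AS mid
JOIN sys.dm_db_missing_index_groups AS mig
    ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs
    ON migs.group_handle = mig.index_group_handle
-- Heuristic ranking (an assumption): how often, how costly, and how much estimated improvement.
ORDER BY migs.user_seeks * migs.avg_total_user_cost * migs.avg_user_impact DESC;
```

Note that the DMVs list equality and inequality columns separately but say nothing about the best order within each group, which is one more reason a person has to review them.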

In your case, the (Status) INCLUDE (ID) suggestion probably came about when it was looking at the possibility of a hash or merge join (example later). In that narrow context, the suggestion makes sense. For the query as a whole, maybe not. The index (ID, Status) enables a nested loop join with ID as an outer reference: equality seek on ID and inequality on Status per iteration.

One possible selection of indexes is:

CREATE INDEX i1 ON dbo.I (ID, [Status]);
CREATE INDEX i1 ON dbo.IP (Deleted, OPID, IID) INCLUDE (Q);

...which produces a plan like the following (plan image not reproduced here):

I am not saying these indexes are optimal for you; they happen to work to produce a reasonable-looking plan to me, without being able to see statistics for the tables involved, or the full definitions and existing indexing. Also, I know nothing of the wider workload or real query.

Alternatively (just to show one of the myriad additional possibilities):

CREATE INDEX i1 ON dbo.I ([Status]) INCLUDE (ID);
CREATE INDEX i1 ON dbo.IP (Deleted, IID, OPID) INCLUDE (Q);

Gives (plan image not reproduced here):

Execution plans were generated using SQL Sentry Plan Explorer.
