Is it normal for a forced join order to make the query estimates completely inaccurate (and thus query times unpredictable)?
The use of FORCE ORDER isn't making the estimates inaccurate; the deletion of rows did. Forcing an update of statistics on the table may improve the estimation accuracy.
Should I just expect that I'll have to either accept sub-optimal query
performance, or watch it like a hawk and frequently manually edit
query hints? Or maybe hint every join as well? 0.3s to 2s is a big hit
to take.
It would be preferable to ensure the optimiser is given the information it needs to generate the best plan without using the FORCE ORDER hint. That way, it should cope better with changes to the underlying data distribution without requiring manual intervention. That said, if the nature of the data is such that cardinality can vary significantly hour by hour or day by day, consider using a plan guide to ensure the plan is fixed.
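If you go that route, a plan guide sketch might look like this (the tables and statement text are hypothetical; @stmt must match the application's statement exactly):

EXEC sp_create_plan_guide
    @name   = N'pg_force_order_example',
    @stmt   = N'SELECT o.OrderId, i.ItemName
FROM dbo.Orders AS o
JOIN dbo.Items AS i ON i.OrderId = o.OrderId;',
    @type   = N'SQL',
    @module_or_batch = NULL,
    @params = NULL,
    @hints  = N'OPTION (FORCE ORDER)';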
Is it obvious why the optimizer blew up after deleting rows? For
example, "yes, it took a sample scan, and because I archived most of
the rows earlier in the data history the sample yielded sparse
results, so it underestimated the need for a sorted hash operation"?
You didn't mention the row counts in the problem tables, but it's likely that the deletions either:
- didn't remove enough rows to trigger a statistics update. This should occur once 20% of the rows have been modified, though trace flag 2371 can be used to enable a dynamic (lower) threshold for large tables.
- did trigger a statistics update, but the sample gathered wasn't representative. Correct this by running a manual update WITH FULLSCAN (see the sketch after this list).
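A minimal sketch, assuming a hypothetical dbo.Orders table and statistics name:

UPDATE STATISTICS dbo.Orders WITH FULLSCAN;  -- all statistics on the table
UPDATE STATISTICS dbo.Orders IX_Orders_CustomerId WITH FULLSCAN;  -- a single statistics object
DBCC TRACEON (2371, -1);  -- dynamic threshold, instance-wide; default behaviour from SQL Server 2016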
You could also be running into good old-fashioned parameter sniffing problems, for which there are myriad workarounds. WITH RECOMPILE might be an expensive option to specify for a query this large, but it's worth investigating at both the procedure and statement level.
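For illustration, statement-level recompilation in a hypothetical procedure (dbo.GetOrders and its table are made-up names):

CREATE OR ALTER PROCEDURE dbo.GetOrders @CustomerId int
AS
BEGIN
    SELECT OrderId, OrderDate
    FROM   dbo.Orders
    WHERE  CustomerId = @CustomerId
    OPTION (RECOMPILE);  -- compiled for the actual parameter value on each execution
END;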
Query
Your query is forced to scan the whole table (or the whole index). Every row could be another distinct unit. The only way to substantially shorten the process would be a separate table with all available units - which would help as long as there are substantially fewer units than entries in all_units.
Since you have ~ 11k units (added in comment) for 25M entries, this should definitely help.
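If no such table exists yet, a minimal sketch to create and fill it (in practice you'd keep it current from the application or with a trigger):

CREATE TABLE unit AS
SELECT DISTINCT unit_id
FROM   all_units;

ALTER TABLE unit ADD PRIMARY KEY (unit_id);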
Depending on frequencies of values, there are a couple of query techniques to get your result considerably faster:
- recursive CTE
- JOIN LATERAL
- correlated subquery
Details in this related answer on SO.
Needing only the implicit index of the primary key on (unit_id, unit_timestamp), this query should do the trick, using an implicit JOIN LATERAL:
SELECT u.unit_id, a.max_ts
FROM   unit u
     , LATERAL (
        SELECT unit_timestamp AS max_ts
        FROM   all_units
        WHERE  unit_id = u.unit_id
        ORDER  BY unit_timestamp DESC
        LIMIT  1
       ) a;
This excludes units without an entry in all_units, like your original query.
Or a lowly correlated subquery (probably even faster):
SELECT u.unit_id
     , (SELECT unit_timestamp
        FROM   all_units
        WHERE  unit_id = u.unit_id
        ORDER  BY unit_timestamp DESC
        LIMIT  1) AS max_ts
FROM   unit u;
This includes units without an entry in all_units (max_ts is NULL for those).
Efficiency depends on the number of entries per unit: the more entries per unit, the bigger the potential gain from one of these queries.
In a quick local test with similar tables (500 "units", 1M rows in big table), the query with correlated subqueries was ~ 500x faster than your original. Index-only scans on the PK index of the big table vs. sequential scan in your original query.
Since your table keeps growing rapidly, a materialized view is probably not an option.
There is also DISTINCT ON as another possible query technique, but it's hardly going to be faster than your original query, so not the answer you are looking for.
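For reference, a sketch of that variant against the same table:

SELECT DISTINCT ON (unit_id)
       unit_id, unit_timestamp AS max_ts
FROM   all_units
ORDER  BY unit_id, unit_timestamp DESC;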
Index
Your partial_idx:
CREATE INDEX partial_idx ON all_units (unit_id, unit_timestamp DESC);
is not in fact a partial index, and it's also redundant. Postgres can scan indexes backwards at practically the same speed, so the PK index serves just as well. Drop this additional index.
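To verify, drop the index and check the plan; it should show a backward scan of the PK index rather than a sort (42 is an arbitrary sample unit_id):

DROP INDEX partial_idx;

EXPLAIN
SELECT unit_timestamp
FROM   all_units
WHERE  unit_id = 42
ORDER  BY unit_timestamp DESC
LIMIT  1;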
Table layout
A couple of points for your table definition.
CREATE TABLE all_units (
   unit_timestamp timestamp,
   unit_id        int4,
   lon            float4,
   lat            float4,
   speed          float4,
   status         varchar(255),  -- might be improved
   PRIMARY KEY (unit_id, unit_timestamp)
);
timestamp(6) doesn't make much sense; it's effectively the same as plain timestamp, which already stores a maximum of six fractional digits.
I switched the positions of the first two columns to save 4 bytes of padding, which amounts to ~ 100 MB for 25M rows (the exact result depends on status). Smaller tables are typically faster for everything.
If status isn't free text but some kind of standardized note, you could replace it with something a lot cheaper. More about varchar(255) in Postgres.
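One possible cheaper representation, sketched here with made-up status labels (an enum only works if the set of values is small and fixed):

CREATE TYPE unit_status AS ENUM ('active', 'idle', 'offline');  -- hypothetical labels

ALTER TABLE all_units
   ALTER COLUMN status TYPE unit_status
   USING status::text::unit_status;  -- fails if any existing value isn't a valid label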
Server configuration
You need to configure your server. Most of your settings seem to be conservative defaults. 1 MB on shared_buffers
or work_mem
seems way to low for an installation with millions of rows. And random_pare_cost = 4
is to high for any modern system with plenty of RAM. Start with the manual and the Postgres Wiki:
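For orientation only, a sketch using ALTER SYSTEM, assuming a dedicated machine with around 16 GB of RAM; the right values depend entirely on your hardware and workload:

ALTER SYSTEM SET shared_buffers = '4GB';     -- ~25% of RAM is a common starting point; needs a restart
ALTER SYSTEM SET work_mem = '64MB';          -- per sort/hash node, so mind concurrent queries
ALTER SYSTEM SET random_page_cost = 1.1;     -- for SSDs or a mostly cached data set
SELECT pg_reload_conf();                     -- applies the reloadable settings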
Best Answer
It's hard to say much given your need to keep your code and data confidential, but sometimes you can get better performance just by trying equivalent rewrites. The following query should return the same results but it's very likely to have a different query plan:
As you said in chat, this rewrite finishes in 6 seconds, which is a decent improvement over the original 20 seconds you were seeing with <>.