Sql-server – Optimization for branched WHERE

query-performancesql server

I have a query that executes against a table that can grow to millions of rows. The query comes out of a QA tool that we use that is outside the standard functionality of the DB (as far as what is indexed and how and why). The query is:

SELECT id FROM thisTable t
WHERE col = 'val'
AND ((not exists (SELECT 1 FROM thisTable WHERE refid = t.id) and refbool = 0) or refbool = 1)
ORDER BY newid()

Basically, let's say the table has id, refid, refbool, and col columns. So you could have data as follows:

  id  |  refid  |  refbool  |  col
------------------------------------
   1  |   NULL  |    0      |  val
   2  |   NULL  |    0      |  val
   3  |   NULL  |    0      |  val
   4  |    2    |    1      |  val
   5  |   NULL  |    0      |  val
   6  |    1    |    1      |  val

The query should never pick the rows for id in (1, 2) because they are referred to by other rows. It should only grab rows where refbool = 1, OR refbool = 0 AND that row's id is not any other row's refid. This statement is horribly non-performant, but I'm not sure what a better query would look like for this. Assume that no indexes, views, stored procedures, or other underlying machinations can be added – it must be a query.

The overall query is significantly larger, JOINS to two additional tables and gathers quite a bit of data. However, I've narrowed it does to this particular bit as commenting out this line takes the query execution time from 16s to <1s.

I'm also reordering the rows by newid() as I need to randomly select a sample item. Removing the ORDER BY also make the query significantly faster even leaving the third row in. It appears that the two operations combined causes the slowness. I've tried designing a CTE, but have failed to increase performance doing so.

I've looked at the execution plan. There are indices that would be added that would improve this query. However, performance of internal QA tools do not take precedence over performance in client production environments, and making changes to the structure in a QA environment for the utility in relation to indices, etc. invalidates it's usefulness as a QA environment as it will likely perform differently than a production environment.

I could certainly write a query that would perform worse than my current query by changing the logic of the query itself. I'm sure we all could. I'm asking to apply that sort of reasoning to improve the performance of the query instead.

Best Answer

An execution plan was not included, but the typical issue with queries like this (sorting aside) is the optimizer choosing a nested loops anti semi join without a good supporting index. It can also be a rogue top (1), or a poorly-performing transformation to a semi-join with nested start-up filters and an anti-semi join.

Regardless, there are two usual workarounds:

Rewrite the OR manually as a UNION (or, if guaranteed disjoint, as a UNION ALL).
Rewrite the NOT EXISTS as a left join filtering the preserved side for NULL.

The following incorporates both:

DECLARE @thisTable table
(
    id integer PRIMARY KEY,
    refid integer NULL,
    refbool bit NOT NULL,
    col varchar(10) NOT NULL
);

INSERT @thisTable
    (id, refid, refbool, col)
VALUES
    (1, NULL, 0, 'val'),
    (2, NULL, 0, 'val'),
    (3, NULL, 0, 'val'),
    (4,  2  , 1, 'val'),
    (5, NULL, 0, 'val'),
    (6,  1  , 1, 'val');

SELECT
    U.id
FROM 
(
    -- T.refbool = 1
    SELECT T.id 
    FROM @thisTable AS T
    WHERE 
        T.col = 'val'
        AND T.refbool = 1

    -- Or (disjoint)
    UNION ALL

    -- T.refbool = 0 and not exists
    SELECT T.id 
    FROM @thisTable AS T
    LEFT JOIN @thisTable AS T2
        ON T2.refid = T.id
    WHERE 
        T.col = 'val'
        AND T.refbool = 0
        AND T2.id IS NULL
) AS U
ORDER BY 
    CHECKSUM(NEWID());

db<>fiddle online demo

For more alternatives on random ordering see the existing Q & A:

What is the best way to get a random ordering?.

Don't just try the top answer.

Related Solutions

Sql-server – Improving performance of query with `WHERE <> ”`

Do those columns have indexes on them at all? That is rather vital information for questions like this. If there is no index covering either of those columns at all then there is nothing it can do except table scan to match those filters.

If there is at least one index covering one (or both) that it could use, it may be deciding against it anyway due to the SELECT *. If the selectivity of the index suggests to it that many rows will match then a table scan may actually be more efficient than an index seek/scan followed plus row lookups to get the rest of the data that is not contained within the index.

For instance if I run

SELECT parent_id FROM org WHERE parent_id <> ''

on one of our systems it performs an index seek only. Running

SELECT * FROM org WHERE parent_id <> ''

performs a table scan instead. If I force the index choice with

SELECT * FROM org (WITH(INDEX(ix_org_parent_id) WHERE parent_id <> ''

it performs an index seek to locate the start of the records, then the RID lookups to find all the other column data. According to the query plan output it judges this to be ~6 times as costly as just doing the table scan, the estimated costs of the three queries above being 0.22926, 1.81429, and 11.1612 respectively. In this case the number of rows matching the WHERE clause is ~55% of ~50,000.

Postgresql – Have Postgresql query planner use nested loop w/ indices over hash join

This closely related answer on SO should provide answers to your primary question:
Setting enable_seqscan = off in a single SELECT query

You could use in similar fashion, to disable hash joins for the current transaction:

SET LOCAL enable_hashjoin=off;

But that's not my advice. Read the answer over there.
And this one about statistics and cost settings, too.

More importantly, untangle your query first:

SELECT creation_epoch, user_screen_name, chunk
FROM  (
   SELECT id AS owner_user_id
   FROM   users
   WHERE  reputation > 100000
   ORDER  BY reputation 
   LIMIT  500
   ) u
JOIN   posts p USING (owner_user_id)
JOIN   post_tokenized t USING (id)
WHERE  type = 'tag'
AND    user_screen_name IS NOT NULL;

Should be considerably faster and also make it easier for the query planner to choose the best plan (given sane cost settings and table statistics).

Best Answer

Related Solutions

Sql-server – Improving performance of query with `WHERE <> ”`

Postgresql – Have Postgresql query planner use nested loop w/ indices over hash join

Related Question