With a million records to query, you have to avoid expensive operations like OUTER JOINs. I suggest you use UNION ALL instead of LEFT JOIN where possible.
Since I think CROSS APPLY is more efficient than a sub-query in the SELECT clause, I was going to modify the query written by Conard Frix, which I think is correct. But when I started to modify your query, I noticed you have a WHERE clause saying JoinedTable.WhereColumn IN (1, 3). If that column is NULL, the condition evaluates to false, so the NULL-extended rows produced by the LEFT JOIN are filtered out anyway. Why use a LEFT JOIN while you are filtering out the NULL-valued rows? Just replace the LEFT JOIN with an INNER JOIN; I guarantee it will become faster.
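To illustrate the point with a hypothetical schema (MainTable, JoinedTable and their columns are assumptions, not your actual names): when the WHERE clause rejects NULL, the two forms return the same rows, but the INNER JOIN states that intent directly:

```sql
-- The NULL-extended rows from the LEFT JOIN are filtered out again,
-- because NULL IN (1, 3) is never true:
SELECT m.Id, j.WhereColumn
FROM   MainTable m
LEFT   JOIN JoinedTable j ON j.MainId = m.Id
WHERE  j.WhereColumn IN (1, 3);

-- Equivalent result, and a better starting point for the optimizer:
SELECT m.Id, j.WhereColumn
FROM   MainTable m
INNER  JOIN JoinedTable j ON j.MainId = m.Id
WHERE  j.WhereColumn IN (1, 3);
```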
About indexes:
Please note that when you have a table, say
table1(a int, b nvarchar)
with this index:
nonclustered index ix1 on table1(a)
and you run a query like:
select a, b from table1 where a < 10
the column b is not included in your index. So what happens? If SQL Server uses your index, it has to search the index (an "Index Seek") and then refer back to the main table to get column b (a "Key Lookup"). This procedure might take much longer than scanning the table itself (a "Table Scan"), so based on the statistics it keeps, SQL Server might not use your index at all in such situations.
So first of all, check the execution plan to see whether the index is used. Either way, alter your index to include all the columns that you are selecting, like:
nonclustered index ix1 on table1(a) include(b)
In this case the lookup is no longer needed, and your query will execute much faster.
This is a tricky one. Your main condition is on next_execution_date, but the output is sorted by priority first. The conditions on status_id and is_active only play a minor part.
Better index
Your index idx_pd_order is not a big help, since filtering on non-leading columns of a multicolumn index is not very efficient. Postgres is using it anyway, which is still a lot better than a sequential scan. Details here:
Is a composite index also good for queries on the first field?
idx_pd_where might be a better choice, but not a good one, either. The leading column status_id is not selective at all and just bloats the index. The same goes for the trailing column is_active. And priority is not in the index at all, so it has to be fetched from the table, making an index-only scan impossible.
I suggest this partial, multicolumn index to start with. (But keep reading!)
CREATE INDEX idx_pd_covering ON tb_pd (next_execution_date, priority, st_id)
WHERE status_id = 1 AND is_active = 1;
Since we are only ever interested in rows with status_id = 1 and is_active = 1, exclude all other rows from the index right away. Size does matter.
The remaining (crucial) condition is on next_execution_date, which must come first in the index. priority and st_id are only appended to allow a possible index-only scan (Postgres 9.2+). If that doesn't take, remove those columns from the index to make it smaller.
Special difficulty
We can now use idx_pd_covering to find qualifying rows quickly. Unfortunately, we still have to look at all qualifying rows to collect the ones with the highest priority. As the query plan reveals, Postgres estimates it will process 34,627,017 rows, and sorting 35M rows is going to cost big. That's the tricky part I mentioned at the start. To demonstrate what I am talking about, run EXPLAIN on your query with and without priority in the ORDER BY:
SELECT s.st
FROM tb_pd p
JOIN tb_st s USING (st_id)
WHERE p.status_id = 1
AND p.is_active = 1
AND p.next_execution_date < 1401402110830
ORDER BY priority, next_execution_date
LIMIT 20000;
That's your query, only slightly simplified and reformatted. You should see a huge difference.
Solution
The solution depends on the number of distinct values for priority. For lack of information, and for demo purposes, I am going to assume only three: priority 1, 2 and 3.
With a trivial number of distinct priority values, there is a simple solution: create three partial indexes. All of them together are still smaller than your current indexes idx_pd_order or idx_pd_where (which you might not need any more).
CREATE INDEX idx_pd_covering_p1 ON tb_pd (next_execution_date, st_id)
WHERE priority = 1 AND status_id = 1 AND is_active = 1;
CREATE INDEX idx_pd_covering_p2 ON tb_pd (next_execution_date, st_id)
WHERE priority = 2 AND status_id = 1 AND is_active = 1;
CREATE INDEX idx_pd_covering_p3 ON tb_pd (next_execution_date, st_id)
WHERE priority = 3 AND status_id = 1 AND is_active = 1;
Use this query:
SELECT s.st
FROM (
(
SELECT st_id
FROM tb_pd
WHERE status_id = 1
AND is_active = 1
AND priority = 1
AND next_execution_date < 1401402110830
ORDER BY next_execution_date
)
UNION ALL
(
SELECT st_id
FROM tb_pd
WHERE status_id = 1
AND is_active = 1
AND priority = 2
AND next_execution_date < 1401402110830
ORDER BY next_execution_date
)
UNION ALL
(
...
AND priority = 3
...
)
LIMIT 20000
) p
JOIN tb_st s USING (st_id);
This should be dynamite.
- Strictly speaking, the final order is not guaranteed without an additional ORDER BY clause in the outer query. In the current implementation, the order of the inner query is preserved as long as the outer query is as simple as this one. To be sure, you could join right away (which may be a bit slower):
(
SELECT s.st
FROM tb_pd p
JOIN tb_st s USING (st_id)
WHERE p.status_id = 1
AND p.is_active = 1
AND p.priority = 1
AND p.next_execution_date < 1401402110830
ORDER BY p.next_execution_date
)
UNION ALL
(
...
)
LIMIT 20000;
... or carry along priority and next_execution_date and order once more in the outer query (to be absolutely sure), which is probably slower still.
All parentheses are needed! Related answer.
This query just reads tuples off the top of the partial indexes above; no sort step is needed at all, since all rows come pre-sorted. Efficient to boot.
UNION ALL queries without a final ORDER BY can stop as soon as the number of rows requested in the top-level LIMIT has been fetched. So if there are enough rows in the top priorities, the later legs of the UNION ALL query are never executed, and only the smaller partial indexes have to be touched.
The JOIN to tb_st comes later, which should be more efficient.
Again, the column st_id is only appended to the index in the hope of an index-only scan. If that works out, the whole query does not touch the table tb_pd at all.
General solution for any number of distinct priority values
We have solved this before. There is a complete recipe to automate the creation of partial indexes and a function ... the works:
Can spatial index help a "range - order by - limit" query
Optimize table
Since you are trying to optimize performance and your table is big, I suggest a slightly altered layout for your table tb_pd:
Column | Type
---------------------+--------
pd_id | integer
st_id | integer
next_execution_date | bigint
priority | integer -- or smallint? -- or "char"?
status_id | smallint -- or "char"
is_active | boolean
This occupies 52 bytes per row on disk, while your current design needs 60 bytes: smaller data types mean smaller rows and less alignment padding. Indexes profit as well. Details:
Configuring PostgreSQL for read performance
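As DDL, the layout above might look like this (a sketch; the PRIMARY KEY and NOT NULL constraints are assumptions, adjust to your schema):

```sql
CREATE TABLE tb_pd (
  pd_id               integer PRIMARY KEY
, st_id               integer  NOT NULL
, next_execution_date bigint   NOT NULL
, priority            integer  NOT NULL  -- or smallint, or "char"
, status_id           smallint NOT NULL  -- or "char"
, is_active           boolean  NOT NULL
);
```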
Of course, all the basic advice for performance-optimization applies as well.
About "char":
The type "char" (note the quotes) is different from char(1) in that it only uses one byte of storage. It is internally used in the system catalogs as a simplistic enumeration type.
Best Answer
The biggest mistake is that you are using * in your SELECT. Select only the columns you actually need instead.
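For illustration (table and column names here are hypothetical, since the original query is not shown), naming the columns explicitly instead of SELECT *:

```sql
-- Hypothetical example: list the columns you need instead of SELECT *
SELECT o.order_id, o.customer_id, o.order_date
FROM   orders o
WHERE  o.order_date >= '2014-01-01';
```

Listing the columns lets the server read and transfer less data, and makes covering indexes possible.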