With a million records to query, you have to avoid expensive operations like OUTER JOINs. I suggest you use UNION ALL instead of LEFT JOIN where possible.
Since I think CROSS APPLY is more efficient than a sub-query in the SELECT clause, I was going to modify the query written by Conard Frix, which I think is correct. But when I started to modify your query, I noticed you have a WHERE clause saying JoinedTable.WhereColumn IN (1, 3). If that column is NULL, the condition evaluates to false, so the NULL-extended rows produced by the LEFT JOIN are filtered out anyway. Why use a LEFT JOIN while you are filtering out the NULL-valued rows? Just replace the LEFT JOIN with an INNER JOIN; I guarantee it will become faster.
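To illustrate the point with a hypothetical schema (MainTable, JoinedTable and their columns are assumptions, not your actual names): when the WHERE clause rejects NULL, the two forms return the same rows, but the INNER JOIN states that intent directly:

```sql
-- The NULL-extended rows from the LEFT JOIN are filtered out again,
-- because NULL IN (1, 3) is never true:
SELECT m.Id, j.WhereColumn
FROM   MainTable m
LEFT   JOIN JoinedTable j ON j.MainId = m.Id
WHERE  j.WhereColumn IN (1, 3);

-- Equivalent result, and a better starting point for the optimizer:
SELECT m.Id, j.WhereColumn
FROM   MainTable m
INNER  JOIN JoinedTable j ON j.MainId = m.Id
WHERE  j.WhereColumn IN (1, 3);
```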
About indexes:
Please note that when you have a table, say
table1(a int, b nvarchar)
with this index:
nonclustered index ix1 on table1(a)
and you run a query like:
select a, b from table1 where a < 10
the column b is not included in your index. So what happens? If SQL Server uses your index, it has to search the index (an "Index Seek") and then refer back to the main table to get column b (a "Key Lookup"). This procedure might take much longer than scanning the table itself (a "Table Scan"), so based on the statistics it keeps, SQL Server might not use your index at all in such situations.
So first of all, check the execution plan to see whether the index is used. Either way, alter your index to include all the columns that you are selecting, like:
nonclustered index ix1 on table1(a) include(b)
In this case the lookup is no longer needed, and your query will execute much faster.
This is a tricky one. Your main condition is on next_execution_date, but the output is sorted by priority first. The conditions on status_id and is_active only play a minor part.
Better index
Your index idx_pd_order is not a big help, since filtering on non-leading columns of a multicolumn index is not very efficient. Postgres is using it anyway, which is still a lot better than a sequential scan. Details here:
Is a composite index also good for queries on the first field?
idx_pd_where might be a better choice, but not a good one, either. The leading column status_id is not selective at all and just bloats the index. The same goes for the trailing column is_active. And priority is not in the index at all, so it has to be fetched from the table, making an index-only scan impossible.
I suggest this partial, multicolumn index to start with. (But keep reading!)
CREATE INDEX idx_pd_covering ON tb_pd (next_execution_date, priority, st_id)
WHERE status_id = 1 AND is_active = 1;
Since we are only ever interested in rows with status_id = 1 and is_active = 1, exclude all other rows from the index right away. Size does matter.
The remaining (crucial) condition is on next_execution_date, which must come first in the index. priority and st_id are only appended to allow a possible index-only scan (Postgres 9.2+). If that doesn't take, remove those columns from the index to make it smaller.
Special difficulty
We can now use idx_pd_covering to find qualifying rows quickly. Unfortunately, we still have to look at all qualifying rows to collect the ones with the highest priority. As the query plan reveals, Postgres estimates it will process 34,627,017 rows, and sorting 35M rows is going to cost big. That's the tricky part I mentioned at the start. To demonstrate what I am talking about, run EXPLAIN on your query with and without priority in the ORDER BY:
SELECT s.st
FROM tb_pd p
JOIN tb_st s USING (st_id)
WHERE p.status_id = 1
AND p.is_active = 1
AND p.next_execution_date < 1401402110830
ORDER BY priority, next_execution_date
LIMIT 20000;
That's your query, only slightly simplified and reformatted. You should see a huge difference.
Solution
The solution depends on the number of distinct values for priority. For lack of information, and for demo purposes, I am going to assume only three: priority 1, 2 and 3.
With a trivial number of distinct priority values, there is a simple solution: create three partial indexes. All of them together are still smaller than your current indexes idx_pd_order or idx_pd_where (which you might not need any more).
CREATE INDEX idx_pd_covering_p1 ON tb_pd (next_execution_date, st_id)
WHERE priority = 1 AND status_id = 1 AND is_active = 1;
CREATE INDEX idx_pd_covering_p2 ON tb_pd (next_execution_date, st_id)
WHERE priority = 2 AND status_id = 1 AND is_active = 1;
CREATE INDEX idx_pd_covering_p3 ON tb_pd (next_execution_date, st_id)
WHERE priority = 3 AND status_id = 1 AND is_active = 1;
Use this query:
SELECT s.st
FROM (
(
SELECT st_id
FROM tb_pd
WHERE status_id = 1
AND is_active = 1
AND priority = 1
AND next_execution_date < 1401402110830
ORDER BY next_execution_date
)
UNION ALL
(
SELECT st_id
FROM tb_pd
WHERE status_id = 1
AND is_active = 1
AND priority = 2
AND next_execution_date < 1401402110830
ORDER BY next_execution_date
)
UNION ALL
(
...
AND priority = 3
...
)
LIMIT 20000
) p
JOIN tb_st s USING (st_id);
This should be dynamite.
- Strictly speaking, the final order is not guaranteed without an additional ORDER BY clause in the outer query. In the current implementation, the order of the inner query is preserved as long as the outer query is as simple as this one. To be sure, you could join right away (which may be a bit slower):
(
SELECT s.st
FROM tb_pd p
JOIN tb_st s USING (st_id)
WHERE p.status_id = 1
AND p.is_active = 1
AND p.priority = 1
AND p.next_execution_date < 1401402110830
ORDER BY p.next_execution_date
)
UNION ALL
(
...
)
LIMIT 20000;
... or carry along priority and next_execution_date and order once more in the outer query (to be absolutely sure), which is probably slower still.
All parentheses are needed! Related answer.
This query just reads tuples off the top of the partial indexes above; no sort step is needed at all, since all rows come pre-sorted. Efficient to boot.
UNION ALL queries without a final ORDER BY can stop as soon as the number of rows requested in the top-level LIMIT has been fetched. So if there are enough rows in the top priorities, the later legs of the UNION ALL query are never executed, and only the smaller partial indexes have to be touched.
The JOIN to tb_st comes later, which should be more efficient.
Again, the column st_id is only appended to the index in the hope of an index-only scan. If that works out, the whole query does not touch the table tb_pd at all.
General solution for any number of distinct priority values
We have solved this before. There is a complete recipe to automate the creation of partial indexes and a function ... the works:
Can spatial index help a "range - order by - limit" query
Optimize table
Since you are trying to optimize performance and your table is big, I suggest a slightly altered layout for your table tb_pd:
Column | Type
---------------------+--------
pd_id | integer
st_id | integer
next_execution_date | bigint
priority | integer -- or smallint? -- or "char"?
status_id | smallint -- or "char"
is_active | boolean
This occupies 52 bytes per row on disk, while your current design needs 60 bytes: smaller data types mean smaller rows and less alignment padding. Indexes profit as well. Details:
Configuring PostgreSQL for read performance
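As DDL, the layout above might look like this (a sketch; the PRIMARY KEY and NOT NULL constraints are assumptions, adjust to your schema):

```sql
CREATE TABLE tb_pd (
  pd_id               integer PRIMARY KEY
, st_id               integer  NOT NULL
, next_execution_date bigint   NOT NULL
, priority            integer  NOT NULL  -- or smallint, or "char"
, status_id           smallint NOT NULL  -- or "char"
, is_active           boolean  NOT NULL
);
```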
Of course, all the basic advice for performance-optimization applies as well.
About "char":
The type "char" (note the quotes) is different from char(1) in that it only uses one byte of storage. It is internally used in the system catalogs as a simplistic enumeration type.
Best Answer
The biggest mistake is that you are using * in your SELECT. Select only the columns you actually need instead.
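For illustration (table and column names here are hypothetical, since the original query is not shown), naming the columns explicitly instead of SELECT *:

```sql
-- Hypothetical example: list the columns you need instead of SELECT *
SELECT o.order_id, o.customer_id, o.order_date
FROM   orders o
WHERE  o.order_date >= '2014-01-01';
```

Listing the columns lets the server read and transfer less data, and makes covering indexes possible.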