Use a subquery (as displayed) or CTE for that purpose:
SELECT *
FROM  (
   SELECT qid, gid
   FROM   table1
   ORDER  BY date DESC
   LIMIT  10
   OFFSET ?
   ) q
JOIN   table2 a USING (qid, gid);
USING (qid, gid) is just a shortcut for ON q.qid = a.qid AND q.gid = a.gid, with the side effect that the two columns are only included once in the result.
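The first line mentions a CTE as an alternative; a sketch of that variant (untested, using the same tables, columns and parameter placeholder as the subquery form above) could look like this:
WITH q AS (
   SELECT qid, gid
   FROM   table1
   ORDER  BY date DESC
   LIMIT  10
   OFFSET ?
   )
SELECT *
FROM   q
JOIN   table2 a USING (qid, gid);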
From your query plans, it looks like you're comparing integers to integers in the first query, and an integer column to numeric values in the second.
Your first comparison:
Index Cond: (("timestamp" >= 1431100800) AND ("timestamp" <= 1431108000))
and
timestamp >= 1431100800 and timestamp <= 1431108000
In the second query, it's numeric values:
Filter: ((numvalues[1] IS NOT NULL) AND (("timestamp")::numeric >= 1431100800.00) AND (("timestamp")::numeric <= 1431108000.00))
and
timestamp >= 1431093600.00 and timestamp <= 1431100800.00
Casting to numeric causes the index to be ignored in favor of a sequential scan.
You can see this with a very simple example, set up below:
CREATE TABLE t2 (a int);
CREATE INDEX t2_a_idx ON t2(a);
INSERT INTO t2 (a) SELECT i FROM generate_series(1,1000000) AS i;
VACUUM ANALYZE VERBOSE t2;
My first query plan looks like this:
EXPLAIN ANALYZE SELECT * FROM t2 WHERE a > 750000;
Index Only Scan using t2_a_idx on t2 (cost=0.42..7134.65 rows=250413 width=4)
(actual time=0.019..29.926 rows=250000 loops=1)
Index Cond: (a > 750000)
Heap Fetches: 0
Planning time: 0.137 ms
Execution time: 39.114 ms
(5 rows)
Time: 39.540 ms
While a second query using numerics looks like this:
EXPLAIN ANALYZE SELECT * FROM t2 WHERE a > 750000.00;
Seq Scan on t2 (cost=0.00..19425.00 rows=333333 width=4) (actual time=122.803..175.326 rows=250000 loops=1)
Filter: ((a)::numeric > 750000.00)
Rows Removed by Filter: 750000
Planning time: 0.058 ms
Execution time: 184.194 ms
(5 rows)
Time: 184.487 ms
In the second instance here, the index is ignored in favor of a sequential scan because of the cast to a numeric value, which is exactly what's happening in your two examples.
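If you can't avoid numeric literals or parameters in the application, casting the bound back to the column's type (instead of letting the column be cast) should keep the index usable. For instance, on the toy table above:
EXPLAIN ANALYZE SELECT * FROM t2 WHERE a > 750000.00::int;
-- the cast now applies to the constant, so the comparison is int-to-int
-- and the planner is free to use t2_a_idx again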
One last aside: you might also be able to speed up your query by issuing a SET before executing it:
SET work_mem = '2GB';
That is, if your server can handle it; your sorts are spilling to disk, as noted in this line from your query plan:
Sort Method: external merge Disk: 1387704kB
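If you'd rather not raise work_mem for the whole session, one option is SET LOCAL, which scopes the change to a single transaction:
BEGIN;
SET LOCAL work_mem = '2GB';   -- reverts automatically at COMMIT / ROLLBACK
-- run the expensive query here
COMMIT;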
Hope this helps. =)
Best Answer
Use the window function ntile() in a subquery (requires Postgres 8.4 or later). Then select the segments you are interested in (corresponding to percentiles) and pick the row with the lowest value from each:
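The original query is not reproduced here; as a rough sketch, assuming a hypothetical table tbl with columns the_date (matching the aside below) and val, it could look like this:
SELECT DISTINCT ON (segment)
       segment, the_date, val
FROM  (
   SELECT the_date, val
        , ntile(1000) OVER (ORDER BY val) AS segment
   FROM   tbl
   ) sub
WHERE  segment IN (601, 901, 991, 1000)  -- rows at / just above the 60th, 90th, 99th and 99.9th percentile cut-offs
ORDER  BY segment, val;
DISTINCT ON (segment) combined with ORDER BY segment, val returns the row with the lowest val per selected segment.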
The Postgres-specific DISTINCT ON comes in handy for the last step. Detailed explanation in this related answer on SO: Select first row in each GROUP BY group?
To get the 90, 99 and 99.9 percentiles I picked the matching granularity with ntile(1000), and added a 60 percentile as per comment. This algorithm picks the row at or above the exact value. You can add a line to the subquery with percent_rank() to get the exact relative rank of the selected row in addition:
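Again the concrete query isn't shown here; using the same hypothetical tbl, the_date and val as in the sketch above, the subquery would simply grow one more window function:
SELECT the_date, val
     , ntile(1000)    OVER (ORDER BY val) AS segment
     , percent_rank() OVER (ORDER BY val) AS pct_rank   -- exact relative rank of each row (0 to 1)
FROM   tbl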
Aside: I replaced the column name date with the_date since I am in the habit of avoiding reserved SQL keywords as identifiers, even if Postgres would permit them.