The LEFT JOIN
in @dezso's answer should be good. An index, however, will hardly be useful (per se), because the query has to read the whole table anyway - the exception being index-only scans in Postgres 9.2+ and favorable conditions, see below.
SELECT m.hash, m.string, count(m.method) AS method_ct
FROM methods m
LEFT JOIN nostring n USING (hash)
WHERE n.hash IS NULL
GROUP BY m.hash, m.string
ORDER BY count(m.method) DESC;
Run EXPLAIN ANALYZE
on the query. Several times to exclude cashing effects and noise. Compare the best results.
Create a multi-column index that matches your query:
CREATE INDEX methods_cluster_idx ON methods (hash, string, method);
Wait? After I said an index wouldn't help? Well, we need it to CLUSTER
the table:
CLUSTER methods USING methods_cluster_idx;
ANALYZE methods;
Rerun EXPLAIN ANALYZE
. Any faster? It should be.
CLUSTER
is a one-time operation to rewrite the whole table in the order of the used index. It is also effectively a VACUUM FULL
. If you want to be sure, you'd run a pre-test with VACUUM FULL
alone to see what can be attributed to that.
If your table sees a lot of write operations, the effect will degrade over time. Schedule CLUSTER
at off-hours to restore the effect. Fine tuning depends of your exact use-case. The manual about CLUSTER
.
CLUSTER
is a rather crude tool, needs an exclusive lock on the table. If you can't afford that, consider pg_repack
which can do the same without exclusive lock. More in this later answer:
If the percentage of NULL
values in the column method
is high (more than ~ 20 percent, depending on actual row sizes), a partial index should help:
CREATE INDEX methods_foo_idx ON methods (hash, string)
WHERE method IS NOT NULL;
(Your later update shows your columns to be NOT NULL
, so not applicable.)
If you are running PostgreSQL 9.2 or later (as @deszo commented) the presented indexes may be useful without CLUSTER
if the planner can utilize index-only scans. Only applicable under favorable conditions: No write operations that would effect the visibility map since the last VACUUM
and all columns in the query have to be covered by the index. Basically read-only tables can use this any time, while heavily written tables are limited. More details in the Postgres Wiki.
The above mentioned partial index could be even more useful in that case.
If, on the other hand, there are no NULL
values in column method
, you should
1.) define it NOT NULL
and
2.) use count(*)
instead of count(method)
, that's slightly faster and does the same in the absence of NULL
values.
If you have to call this query often and the table is read-only, create a MATERIALIZED VIEW
.
Exotic fine point:
Your table is named nostring
, yet seems to contain hashes. By excluding hashes instead of strings, there is a chance that you exclude more strings than intended. Extremely unlikely, but possible.
Index
A plain multicolumn B-tree index should work after all:
CREATE INDEX foo_idx
ON geoposition_records (equipment_id, created_at DESC NULLS LAST);
Why DESC NULLS LAST
?
It's safe to assume you have an equipment
table? Then performance won't be a problem:
Correlated subquery
Based on this equipment
table, run a lowly correlated subquery to great effect:
SELECT equipment_id
,(SELECT created_at
FROM geoposition_records
WHERE equipment_id = eq.equipment_id
ORDER BY created_at DESC NULLS LAST
LIMIT 1) AS latest
FROM equipment eq;
For a small number of rows in the equipment
table (57 judging from your EXPLAIN ANALYZE
output), that's very fast.
LATERAL
join in Postgres 9.3+
SELECT eq.equipment_id, r.latest
FROM equipment eq
LEFT JOIN LATERAL (
SELECT created_at
FROM geoposition_records
WHERE equipment_id = eq.equipment_id
ORDER BY created_at DESC NULLS LAST
LIMIT 1
) r(latest) ON true;
Detailed explanation:
Performance similar to the correlated subquery.
Function
If you can't talk sense into the query planner (which shouldn't occur), a function looping through the equipment table is certain to do the trick. Looking up one equipment_id
at a time uses the index.
CREATE OR REPLACE FUNCTION f_latest_equip()
RETURNS TABLE (equipment_id int, latest timestamp)
LANGUAGE plpgsql STABLE AS
$func$
BEGIN
FOR equipment_id IN
SELECT e.equipment_id FROM equipment e ORDER BY 1
LOOP
SELECT g.created_at
FROM geoposition_records g
WHERE g.equipment_id = f_latest_equip.equipment_id
-- prepend function name to disambiguate
ORDER BY g.created_at DESC NULLS LAST
LIMIT 1
INTO latest;
RETURN NEXT;
END LOOP;
END
$func$;
Makes for a nice call, too:
SELECT * FROM f_latest_equip();
Performance comparison:
db<>fiddle here
OLD sqlfiddle
Best Answer
So why does Postgres 9.2 still show a sequential scan? I quote the Postgres Wiki:
Emphasis mine.
There is hardly anything to gain from an index scan here, as long as your talbe isn't bloated with more (big) columns.