Postgresql – Postgres odd behavior with indices

indexpostgresql

I've got a datavalue table with ~200M rows or so, with indices on both site_id and paramter_id. I need to execute queries like "return all sites with data" and "return all parameters with data". The site table has only 200 rows or so, and the parameter table has only 100 or so rows.

The site query is fast and uses the index:

EXPLAIN ANALYZE
select *
from site
where exists (
      select 1 from datavalue
      where datavalue.site_id = site.id limit 1
);

Seq Scan on site  (cost=0.00..64.47 rows=64 width=113) (actual time=0.046..1.106 rows=89 loops=1)
  Filter: (SubPlan 1)
  Rows Removed by Filter: 39
  SubPlan 1
    ->  Limit  (cost=0.44..0.47 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=128)
          ->  Index Only Scan using ix_datavalue_site_id on datavalue  (cost=0.44..8142.71 rows=248930 width=0) (actual time=0.008..0.008 rows=1 loops=128)
                Index Cond: (site_id = site.id)
                Heap Fetches: 0
Planning time: 0.361 ms
Execution time: 1.149 ms

The same query for parameters is rather slow and does NOT use the index:

EXPLAIN ANALYZE
select *
from parameter
where exists (
      select 1 from datavalue
      where datavalue.parameter_id = parameter.id limit 1
);

Seq Scan on parameter  (cost=0.00..20.50 rows=15 width=2648) (actual time=2895.972..21331.701 rows=15 loops=1)
  Filter: (SubPlan 1)
  Rows Removed by Filter: 6
  SubPlan 1
    ->  Limit  (cost=0.00..0.34 rows=1 width=0) (actual time=1015.790..1015.790 rows=1 loops=21)
          ->  Seq Scan on datavalue  (cost=0.00..502127.10 rows=1476987 width=0) (actual time=1015.786..1015.786 rows=1 loops=21)
                Filter: (parameter_id = parameter.id)
                Rows Removed by Filter: 7739355
Planning time: 0.123 ms
Execution time: 21331.736 ms

What the deuce is going on here? Alternatively, whats a good way to do this?

Some of the table description:

id BIGINT DEFAULT nextval('datavalue_id_seq'::regclass) NOT NULL,
value DOUBLE PRECISION NOT NULL,
site_id INTEGER NOT NULL,
parameter_id INTEGER NOT NULL,
deployment_id INTEGER,
instrument_id INTEGER,
invalid BOOLEAN,
Indexes:
    "datavalue_pkey" PRIMARY KEY, btree (id)
    "datavalue_datetime_utc_site_id_parameter_id_instrument_id_key" UNIQUE CONSTRAINT, btree (datetime_utc, site_id, parameter_id, instrument_id)
    "ix_datavalue_instrument_id" btree (instrument_id)
    "ix_datavalue_parameter_id" btree (parameter_id)
    "ix_datavalue_site_id" btree (site_id)
    "tmp_idx" btree (site_id, datetime_utc)
Foreign-key constraints:
    "datavalue_instrument_id_fkey" FOREIGN KEY (instrument_id) REFERENCES instrument(id) ON UPDATE CASCADE ON DELETE CASCADE
    "datavalue_parameter_id_fkey" FOREIGN KEY (parameter_id) REFERENCES parameter(id) ON UPDATE CASCADE ON DELETE CASCADE
    "datavalue_site_id_fkey" FOREIGN KEY (site_id) REFERENCES coastal.site(id) ON UPDATE CASCADE ON DELETE CASCADE
    "datavalue_statistic_type_id_fkey"

Edit: Here's the count distribution

select count(parameter_id), parameter_id from datavalue group by parameter_id

88169   14
2889171 8
15805   17
8570    12
4257262 21
3947049 15
1225902 2
4091090 3
103877  10
633764  11
994442  18
49232   20
14935   4
563638  13
2955919 7

Best Answer

Removing the limit 1 in the sub query fixed the problem. More discussion on S.O.