Postgresql – Simple PostgreSQL lookup table is inexplicably slow

I'm trying PostgreSQL 8.4.14 for storing triples, pieces of data of the form (String, String, String).
For speed, I'm not repeatedly storing strings but rather using two tables:

main table triples (subject BigInt, predicate BigInt, object BigInt)
lookup table entities (entityId BigInt, name Varying(40000))

I've added indexes and foreign keys on triples:

"nsubject_idx" hash (subject)
"npredicate_idx" btree (predicate)
"nobject_idx" hash (object)
"triples_subject_fkey" FOREIGN KEY (subject) REFERENCES entities(entityid)
"triples_predicate_fkey" FOREIGN KEY (predicate) REFERENCES entities(entityid)
"triples_object_fkey" FOREIGN KEY (object) REFERENCES entities(entityid)

and also indexes on entities:

"entities_pkey" PRIMARY KEY, btree (entityid) CLUSTER
"name_idx" hash (name)

Now it would be reasonable to assume that lookups are fast. They aren't.

With 151M elements in triples and 44M in entities, the following query is immensely slow:

SELECT entityId FROM entities WHERE entityId in (SELECT object FROM triples LIMIT 10000);

It's only 10.000 lookups, so I'd expect this to complete at high speed. Query plan:

 Nested Loop  (cost=288.69..3856.26 rows=43806140 width=8) (actual time=25.226..40110.699 rows=6959 loops=1)
   ->  HashAggregate  (cost=288.69..290.69 rows=200 width=8) (actual time=19.445..24.087 rows=6959 loops=1)
         ->  Limit  (cost=0.00..163.69 rows=10000 width=8) (actual time=0.013..15.792 rows=10000 loops=1)
               ->  Seq Scan on triples  (cost=0.00..2474009.68 rows=151135968 width=8) (actual time=0.012..14.101 rows=10000 loops=1)
   ->  Index Scan using entities_pkey on entities  (cost=0.00..17.82 rows=1 width=8) (actual time=5.756..5.759 rows=1 loops=6959)
         Index Cond: (entities.entityid = triples.object)
 Total runtime: 40112.383 ms

What would be happening here?

Note that this is even a trick query: because of the foreign key constraint, it is actually equivalent to SELECT object FROM triples LIMIT 10000.
For my use case, I'd need the actual lookup.

Best Answer

Rather than "VARCHAR(40000)" why not use "TEXT"?
HASH index use is discouraged (see docs at http://www.postgresql.org/docs/current/static/indexes-types.html).
Have you run "ANALYZE" on your tables before running the query?
Giant IN lists can be performance killers.

How do the following queries perform?

SELECT e.entityId
FROM entities e
INNER JOIN triples t ON (t.object = e.entityId)
LIMIT 10000;

SELECT e.entityId
FROM entities e
WHERE EXISTS (SELECT 1 FROM triples t WHERE t.object = e.entityId LIMIT 10000);

Best Answer

Related Solutions

Postgresql – How to speed up a Postgres query containing lots of Joins with an ILIKE condition

Postgresql – postgres explain plan with giant gaps between operations

Related Question