PostgreSQL: simple equality query returning no results for indexed columns

csvindexpostgresqlpostgresql-9.5

I have a very large table (around 140M rows) with five columns:

CREATE TABLE IF NOT EXISTS citations 
(
   citing_id text,
   citing_issn text,
   cited_id text,
   cited_issn text,
   year int
);

There are hash indexes on citing_id and on cited_id, neither of which has any NULL values (confirmed), but queries selecting on equality in these columns are returning zero results no matter what. Even this query, which I would think would work in any populated table, returns zero rows:

SELECT * 
FROM citations 
WHERE citing_id IN (SELECT citing_id 
                    FROM citations LIMIT 1) 
LIMIT 100;

However, the same query on a non-indexed column works exactly as expected:

SELECT * 
FROM citations 
WHERE citing_issn IN (SELECT citing_issn 
                      FROM citations LIMIT 1) 
LIMIT 100;

The data was loaded from a CSV file using the COPY command, and I suspect there may be extra hidden characters on the end of the indexed values — querying with WHERE citing_id LIKE '<id>%' is very slow (hash indexes don't help with partial strings) but returns the right results. Still, the first query above should work no matter what, unless I'm mistaken.

I'm running Postgres 9.5.3.0 on MacOS 10.13.1

edit:

The query plan for the first query (on the indexed column, returning no rows) is:

Limit  (cost=0.03..1.38 rows=100 width=58)
   ->  Nested Loop  (cost=0.03..265.83 rows=19751 width=58)
         ->  HashAggregate  (cost=0.03..0.04 rows=1 width=17)
               Group Key: citations_1.citing_id
               ->  Limit  (cost=0.00..0.02 rows=1 width=17)
                     ->  Seq Scan on citations citations_1  (cost=0.00..3164425.68 rows=148131768 width=17)
         ->  Index Scan using citations_citing_id on citations  (cost=0.00..265.14 rows=65 width=58)
               Index Cond: (citing_id = citations_1.citing_id)

and for the second query (on the non-indexed column, returning expected results):

Limit  (cost=0.04..139.54 rows=100 width=58)
   ->  Hash Semi Join  (cost=0.04..3581836.35 rows=2567617 width=58)
         Hash Cond: (citations.citing_issn = citations_1.citing_issn)
         ->  Seq Scan on citations  (cost=0.00..3164425.68 rows=148131768 width=58)
         ->  Hash  (cost=0.03..0.03 rows=1 width=10)
               ->  Limit  (cost=0.00..0.02 rows=1 width=10)
                     ->  Seq Scan on citations citations_1  (cost=0.00..3164425.68 rows=148131768 width=10)

Best Answer

There are hash indexes on citing_id and on cited_id, neither of which has any NULL values (confirmed), but queries selecting on equality in these columns are returning zero results no matter what.

The index is probably corrupted. Hash indexes are only WAL-logged since PostgreSQL 10. In previous versions, they don't have that mechanism that makes them persist correctly across unclean shutdowns.

The doc on CREATE INDEX for 9.5 has this warning:

Hash index operations are not presently WAL-logged, so hash indexes might need to be rebuilt with REINDEX after a database crash if there were unwritten changes. Also, changes to hash indexes are not replicated over streaming or file-based replication after the initial base backup, so they give wrong answers to queries that subsequently use them. Hash indexes are also not properly restored during point-in-time recovery. For these reasons, hash index use is presently discouraged.

Consider upgrading to PostgreSQL 10, and in the meantime, REINDEX the hash indexes.