Postgresql – Postgres: query on huge (11gb ) index does not return

amazon-rdsperformancepostgresqlpostgresql-9.6postgresql-performance

Using Postgres 9.6, I have created a table with 435M rows which is 120GB in size and have added an index which is 11GB in size.

I now want to iterate on the distinct values of the index, but the query fails with no error, it just does not complete. I can see no cpu usage or ram being used. Nb the server is on aws rds with 15GB of ram.

How to best troubleshoot this? Trying to iterate through it with LIMIT and OFFSET fails after about 3 cycles. I haven't managed to get a number of the unique values in the index.

I will try reindexing to see if there is indeed anything corrupted, but any suggestions on using a 11GB index would be great.

Editing to add more info:

This is the format of my table

id bigint,
hit1 character varying(50),
hit2 character varying(50),
offset integer,
year character varying(50)

Row estimate is at 431M rows, approx size at 115GB according to postgres.

Index is on hit1 column with a size of 11GB.

I am trying to calculate the counts of hits:

select hit1,hit2, count(distinct id) as total_count_of_ids,
count(case when offset=-1 then 1 else null end) as prev_position, 
count(case when offset=0 then 1 else null end) as same_position,
count(case when offset=1 then 1 else null end) as next_position, 
min(cast(substring(year,1,4) as int)) as min_year, 
max(cast(substring(year,1,4) as int)) as max_year,
array_agg( id) as id_list 
from table where hit1='' 
group by hit1, hit2;

As the table is big, I have opted to do this one by one entity like so:

select distinct hit1 from table limit 250 offset x 
-- use above query to to store results into table. 

This function is wrapped in another function which provides x from a loop. The function is run in a dblink context so that the results can be stored and I can audit as I go along.

The cycle needs to run 1,000,000, I am yet to get a number for the count of distinct hit1 values as the query takes too long to come back. The first 750 values (3 iterations) come back on the iteration in under a minute each, but at the next iteration is where it keels over. I will provide log data when this iteration dies.

While the query is running (at the time when the distinct list is calculated) there seems to be minimal cpu usage. There also seems to be some memory being unavailable.

freeable memory graph

CPU utilisation chart

Edit 2 to show results:
In the functions I use dblink for auditing. After the first 3 cycles (0-2) it takes one hour to return results from
select distinct hit1 from table limit 250 offset 750;
compared to the previous hit1 order this one is no alphabetical and it seems to return or die only 77 results in.

Best Answer

I don't know if my answer is accurate as we don't have much informations in your question, but perhaps you should consider one of this lead: