PostgreSQL – Postgres: query on huge (11 GB) index does not return

amazon-rds, performance, postgresql, postgresql-9.6, postgresql-performance

Using Postgres 9.6, I have created a table with 435M rows which is 120GB in size and have added an index which is 11GB in size.

I now want to iterate over the distinct values of the indexed column, but the query fails with no error; it just does not complete. I can see no CPU or RAM usage. NB: the server is on AWS RDS with 15GB of RAM.

How can I best troubleshoot this? Trying to iterate through it with LIMIT and OFFSET fails after about 3 cycles. I haven't managed to get a count of the unique values in the index.

I will try reindexing to see whether anything is corrupted, but any suggestions on using an 11GB index would be great.
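One way to see what a seemingly stuck query is doing is to inspect it from a second session. The sketch below assumes PostgreSQL 9.6, where `pg_stat_activity` exposes the `wait_event_type` and `wait_event` columns, and uses `hits` as a hypothetical stand-in for the real table name.

```sql
-- Run from a second session while the slow query is executing.
SELECT pid,
       state,
       wait_event_type,                 -- non-NULL when the backend is waiting, e.g. on a lock
       wait_event,
       now() - query_start AS running_for,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND pid <> pg_backend_pid();          -- exclude this monitoring session

-- Plain EXPLAIN (without ANALYZE) shows the chosen plan without running the query.
EXPLAIN
SELECT DISTINCT hit1 FROM hits LIMIT 250 OFFSET 750;
```

If the plan shows a sequential scan or a sort over the whole table rather than a scan of the `hit1` index, low CPU usage combined with a long runtime would be consistent with the query being bound by disk I/O.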

Editing to add more info:

This is the format of my table:

id bigint,
hit1 character varying(50),
hit2 character varying(50),
offset integer,
year character varying(50)

Row estimate is at 431M rows, approx size at 115GB according to postgres.

Index is on hit1 column with a size of 11GB.

I am trying to calculate the counts of hits:

-- "offset" is quoted because OFFSET is a reserved word in PostgreSQL;
-- table is a placeholder for the real table name.
select hit1, hit2,
  count(distinct id) as total_count_of_ids,
  count(case when "offset" = -1 then 1 else null end) as prev_position,
  count(case when "offset" = 0 then 1 else null end) as same_position,
  count(case when "offset" = 1 then 1 else null end) as next_position,
  min(cast(substring(year, 1, 4) as int)) as min_year,
  max(cast(substring(year, 1, 4) as int)) as max_year,
  array_agg(id) as id_list
from table
where hit1 = ''
group by hit1, hit2;

As the table is big, I have opted to do this one entity at a time, like so:

select distinct hit1 from table limit 250 offset x;
-- use the above query to store results into a table.

This function is wrapped in another function which provides x from a loop. The function is run in a dblink context so that the results can be stored and I can audit as I go along.
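Two properties of the batch query above are worth noting: without an ORDER BY, the set of rows returned for a given OFFSET is not guaranteed to be stable between runs, and every iteration has to produce and discard all the rows before the offset. A common alternative is keyset pagination, sketched below under the assumption that `hit1` has a B-tree index; `hits` and `'last_value_seen'` are hypothetical placeholders.

```sql
-- First batch: the 250 smallest distinct values, read in index order.
SELECT DISTINCT hit1
FROM hits
ORDER BY hit1
LIMIT 250;

-- Later batches: pass in the largest value returned by the previous batch
-- instead of using OFFSET, so the scan starts where the last one stopped.
SELECT DISTINCT hit1
FROM hits
WHERE hit1 > 'last_value_seen'
ORDER BY hit1
LIMIT 250;
```

With this shape the outer loop would carry the last `hit1` value forward rather than a counter `x`. Rows where `hit1` is NULL are never returned by the second query and would need separate handling.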

The cycle needs to run 1,000,000 times, but I have yet to get a count of the distinct hit1 values, as the query takes too long to come back. The first 750 values (3 iterations) come back in under a minute per iteration, but the next iteration is where it keels over. I will provide log data when this iteration dies.
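For counting distinct values of an indexed column, one known workaround is the recursive CTE that the PostgreSQL wiki describes as emulating a "loose index scan". The sketch below assumes a B-tree index on `hit1` and again uses `hits` as a hypothetical table name; it is not taken from the original thread.

```sql
WITH RECURSIVE t AS (
    -- Anchor: the smallest hit1 value (the parentheses are required).
    (SELECT hit1 FROM hits ORDER BY hit1 LIMIT 1)
    UNION ALL
    -- Step: jump to the next larger value with a single index probe.
    SELECT (SELECT hit1 FROM hits WHERE hit1 > t.hit1 ORDER BY hit1 LIMIT 1)
    FROM t
    WHERE t.hit1 IS NOT NULL
)
SELECT count(*) AS distinct_hit1
FROM t
WHERE hit1 IS NOT NULL;   -- drop the terminating NULL row
```

Each recursion step performs one index lookup, so the work grows with the number of distinct values rather than with the 435M rows. NULL values of `hit1`, if any exist, are not counted.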

While the query is running (at the time when the distinct list is calculated), there seems to be minimal CPU usage. Some memory also appears to be unavailable.

Edit 2 to show results:
In the functions I use dblink for auditing. After the first 3 cycles (0-2) it takes one hour to return results from
select distinct hit1 from table limit 250 offset 750;
Compared to the previous batches, the hit1 values in this one are not in alphabetical order, and it seems to return or die only 77 results in.

Best Answer

I don't know whether my answer is accurate, as there isn't much information in your question, but perhaps you should consider one of these leads:

Related Solutions

MySQL – Improve `Update` performance (rows locking issue)

First, each time you UPDATE the status column, you are having to update the index as well (source). Evaluate your indexing to see if you really need the index on the status column. My guess is no, since it has an extremely low cardinality and MySQL probably won't use it anyway.

If you ignore me and think you do need it, follow the advice in the article to drop the index before your loop and re-add it after you're done.
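The drop-and-re-add pattern mentioned above might look like the following in MySQL. The table `data` and column `status` come from the statements in this answer; the index name `idx_status` is hypothetical.

```sql
-- Hypothetical index name; SHOW INDEX FROM data lists the real one.
ALTER TABLE data DROP INDEX idx_status;

-- ... run the UPDATE loop here ...

-- Rebuild the index once, after all the updates are done.
ALTER TABLE data ADD INDEX idx_status (status);
```

Rebuilding once at the end avoids maintaining the index on every single UPDATE, at the cost of queries on `status` being unindexed while the loop runs.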

Here are some other things you might do if that doesn't help:

You are taking all the columns from the data but only using number. Don't do a SELECT *, but instead a SELECT number. That won't help your writes, but it is a good performance practice. Only select the columns you're using.
Your number index isn't getting used at all. This means it is not unique enough to be useful for updating. (Slight tangent: how many rows does a single UPDATE affect?) I would drop it, or at least add it to the process index.
It looks like process is unique enough for MySQL to whittle the number of rows down to 16k, instead of 1 million. In light of this, I would add AND process=x to your update statement (I'm assuming you know process from the original SELECT statement):
```
-- FAILED--
UPDATE data SET status = 2, error='$error' WHERE process=X AND number = $data['number']

-- SUCCESS --
UPDATE data SET status = 1 WHERE process=X AND number = $data['number']
```

A hint about unnecessary indexes in InnoDB: InnoDB is using a hidden 'primary key' (since you don't have one defined) and is using that when it writes the indexes. So for each index you're using, you add the size of the index + the size of the hidden primary key to the data file. If you're not using the index (or MySQL can't use it), you are wasting space and adding overhead each time you insert a new number (same for status, as discussed earlier).

MySQL optimization – year column grouping – using temporary table, filesort

I don't see a lot of opportunity for improvement.

The index you added was probably a big help, because it's being used for the range matching on the WHERE clause (type => range, key => tran_date), and it's being used as a covering index (extra => using index), avoiding the need to seek into the table to fetch the row data.

But since you're using functions to construct the financial_year value for the group by, both the "using filesort" and "using temporary" can't be avoided. But, those aren't the real problem. The real problem is that you're evaluating MONTH(tran_date) 346,485 times and YEAR(tran_date) at least that many times... ~700,000 function calls in one second doesn't seem too bad.

Plan B: I am definitely not a fan of storing redundant data, and I'm dead-set against making the application responsible for maintaining it... but one option I might be tempted to try would be to create a dashboard_stats_by_financial_year table, and use after-insert/update/delete triggers on the transactions1 table to manage keeping those stats current.

That option has a cost, of course -- adding to the amount of time it takes to update/insert/delete a transaction... but, waiting > 1200 milliseconds for stats for your dashboard is a cost, too. So it may come down to whether you want to pay for it now or pay for it later.
