PostgreSQL: Running SELECT DISTINCT ON ~700m rows – how to optimize

amazon-rds, clustered-index, performance, postgresql, postgresql-performance

I have a table that, if we look at just the relevant parts, has two columns: id and raw_data. id is an integer, and raw_data is a text blob. At this point, the table has no constraints or indexes except for an index on id.

My goal is to deduplicate (by id) this data and dump it all to plaintext files (on Amazon S3).

Note that any row with the same id can be assumed to be an exact duplicate (so I only need one, random row's data per id).

The table is in an Amazon RDS PostgreSQL instance with 2TB of storage and 15GB of RAM. I can expand settings if needed, but I want this to run in a reasonable time (i.e. 24-48 hours max, preferably faster).

The queries I'm trying to run (which are too slow) are:

SELECT DISTINCT ON (id) id, raw_data
FROM table
OFFSET <0 through end of table>
LIMIT 250000

The first few offsets run in a reasonable time, but the query quickly becomes unmanageable (minutes or more to return) once the offset hits 10m+.

Since starting, I've created that id index, removed all other constraints and indexes (there are other columns besides the ones I described, but they're not relevant), set maintenance_work_mem to 4GB (for creating the id index), and most recently tried clustering the table on the id index. But this happened:

CLUSTER table USING idx_0;
ERROR:  could not extend file "base/16390/46741.294": wrote only 4096 of 
8192 bytes at block 38558630
HINT:  Check free disk space.
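
For reference, the tuning steps above correspond roughly to these commands (table and idx_0 are placeholders, and the CREATE INDEX line is a reconstruction rather than a verbatim copy of what was run):

SET maintenance_work_mem = '4GB';   -- larger memory budget for the index build
CREATE INDEX idx_0 ON table (id);   -- the id index
CLUSTER table USING idx_0;          -- rewrites the whole table in index order,
                                    -- so it needs roughly the table's size in free disk space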


Key questions:

1) Is SELECT DISTINCT ON with an OFFSET the right way to do this? Is there a more efficient query for pulling the data?

2) Is there anything else I can do to the DB/table to optimize? Would clustering the table solve my problem? And why does CLUSTER need over 1.1TB of extra space to deal with ~800GB of data?

Thanks for any advice!

Best Answer

Thanks to ypercube's tip in the comments above, I was able to keep the time per chunk constant, which is good enough for my one-off purposes. The trick is keyset pagination: the WHERE id > ... predicate lets each chunk start directly from the id index, instead of scanning and discarding everything before an OFFSET. I'm now running:

SELECT DISTINCT ON (id) id, raw_data
FROM table
WHERE id > @last_max_id
ORDER BY id
LIMIT 250000

Each query takes about a minute. My whole process will take about 48 hours, but that's good enough.
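
Here, @last_max_id is just a client-side placeholder for the largest id seen in the previous chunk. For completeness, a minimal sketch of one export step using psql's client-side \copy (the file name, the starting bound of 0, and driving the loop from a shell script are assumptions; substitute your own plumbing for shipping the files to S3):

-- first chunk; for the next one, replace 0 with the largest id just written
\copy (SELECT DISTINCT ON (id) id, raw_data FROM table WHERE id > 0 ORDER BY id LIMIT 250000) TO 'chunk_0000.txt'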

This could definitely be optimized further with multithreading if better performance were needed.
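
A sketch of what that could look like, assuming the id space is split into disjoint ranges (the bound below is made up) with one connection per range, each paginating within its own slice:

SELECT DISTINCT ON (id) id, raw_data
FROM table
WHERE id > @last_max_id
  AND id <= 100000000   -- hypothetical upper bound for this worker's range
ORDER BY id
LIMIT 250000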

I also noticed that using a cursor would probably have worked similarly well:

BEGIN TRANSACTION;

DECLARE my_cursor CURSOR FOR
SELECT DISTINCT ON (id) id, raw_data
FROM table
ORDER BY id;

FETCH 100000 FROM my_cursor;
FETCH 100000 FROM my_cursor;
FETCH 100000 FROM my_cursor;
-- ...and so on, until a FETCH returns no rows, then:

CLOSE my_cursor;
COMMIT;

The FETCHes ran in about 100s, 110s, and 90s respectively, so I imagine this approach scales linearly as well.