PostgreSQL: Running SELECT DISTINCT ON ~700m rows – how to optimize

amazon-rds, clustered-index, performance, postgresql, postgresql-performance

I have a table that, if we look at just the relevant parts, has two columns: id and raw_data. id is an integer, and raw_data is a text blob. At this point, the table has no constraints or indexes except for an index on id.

My goal is to deduplicate (by id) this data and dump it all to plaintext files (on Amazon S3).

Note that any row with the same id can be assumed to be an exact duplicate (so I only need one, random row's data per id).

The table is in an Amazon RDS PostgreSQL instance with 2TB of storage and 15GB of RAM. I can expand settings if needed, but I want this to run in a reasonable time (i.e. 24-48 hours max, preferably faster).

The queries I'm trying to run (which are too slow) are:

SELECT DISTINCT ON (id) id, raw_data
FROM table
OFFSET <0 through end of table>
LIMIT 250000

The first few offsets run in a reasonable time, but the query quickly becomes unmanageable (minutes or more to return) once the offset hits 10m+.

Since starting, I've created that id index, removed all other constraints and indexes (there are other columns besides the ones I described, but they're not relevant), set maintenance_work_mem to 4GB (for creating the id index), and most recently tried clustering the table on the id index. But this happened:

CLUSTER table USING idx_0;
ERROR:  could not extend file "base/16390/46741.294": wrote only 4096 of 
8192 bytes at block 38558630
HINT:  Check free disk space.
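
For reference, the tuning steps above correspond roughly to these commands (table and idx_0 are placeholders, and the CREATE INDEX line is a reconstruction rather than a verbatim copy of what was run):

SET maintenance_work_mem = '4GB';   -- larger memory budget for the index build
CREATE INDEX idx_0 ON table (id);   -- the id index
CLUSTER table USING idx_0;          -- rewrites the whole table in index order,
                                    -- so it needs roughly the table's size in free disk space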


Key questions:

1) Is SELECT DISTINCT ON with an OFFSET the right way to do this? Is there a more efficient query for pulling the data?

2) Is there anything else I can do to the DB/table to optimize? Would clustering the table solve my problem? And why does CLUSTER need over 1.1TB of extra space to deal with ~800GB of data?

Thanks for any advice!

Best Answer

Thanks to ypercube's tip in the comments above, I was able to keep the time per chunk constant, which is good enough for my one-off purposes. The trick is keyset pagination: the WHERE id > ... predicate lets each chunk start directly from the id index, instead of scanning and discarding everything before an OFFSET. I'm now running:

SELECT DISTINCT ON (id) id, raw_data
FROM table
WHERE id > @last_max_id
ORDER BY id
LIMIT 250000

Each query takes about a minute. My whole process will take about 48 hours, but that's good enough.
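
Here, @last_max_id is just a client-side placeholder for the largest id seen in the previous chunk. For completeness, a minimal sketch of one export step using psql's client-side \copy (the file name, the starting bound of 0, and driving the loop from a shell script are assumptions; substitute your own plumbing for shipping the files to S3):

-- first chunk; for the next one, replace 0 with the largest id just written
\copy (SELECT DISTINCT ON (id) id, raw_data FROM table WHERE id > 0 ORDER BY id LIMIT 250000) TO 'chunk_0000.txt'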

This could definitely be optimized further with multithreading if better performance were needed.
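
A sketch of what that could look like, assuming the id space is split into disjoint ranges (the bound below is made up) with one connection per range, each paginating within its own slice:

SELECT DISTINCT ON (id) id, raw_data
FROM table
WHERE id > @last_max_id
  AND id <= 100000000   -- hypothetical upper bound for this worker's range
ORDER BY id
LIMIT 250000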

I also noticed that using a cursor would probably have worked similarly well:

BEGIN TRANSACTION;

DECLARE my_cursor CURSOR FOR
SELECT DISTINCT ON (id) id, raw_data
FROM table
ORDER BY id;

FETCH 100000 FROM my_cursor;
FETCH 100000 FROM my_cursor;
FETCH 100000 FROM my_cursor;
-- ...and so on, until a FETCH returns no rows, then:

CLOSE my_cursor;
COMMIT;

The FETCHes ran in about 100s, 110s, and 90s respectively, so I imagine this approach scales linearly as well.