PostgreSQL – Left Join Multiple Tables for Millions of Rows on Multiple Columns

Tags: join, postgresql

I am using PostgreSQL via pgAdmin III on Ubuntu 16.04 with 7.7 GiB of memory and a 4x 3 GHz processor. My only aim with this database is to merge multiple datasets (tables) and export the final table. I am wondering what slows my join operation down the most: having many rows, merging on multiple columns, or merging multiple tables.

I have the following tables:

combinations (121M rows)
----
source | target | year
bigint | bigint | smallint

coauthorships
----
source | target | year
bigint | bigint | smallint

adjacencies
----
source | target | year     | auth_first_order | auth_second_order | com_first_order | com_second_order
bigint | bigint | smallint | double           | double            | double          | double

Eventually there will be more tables. Each row in each table is uniquely identified by (source, target, year). All tables have non-clustered indices on (source, target, year).

My initial idea was to construct a view and export the view to disk. This is the definition of view master:

SELECT comb.source,
 comb.target,
 comb.year,
 coauth.publications,
 adj.auth_first_order,
 adj.auth_second_order,
 adj.com_first_order,
 adj.com_second_order
FROM combinations comb
 LEFT JOIN coauthorships coauth ON comb.source = coauth.source AND comb.target = coauth.target AND comb.year = coauth.year
 LEFT JOIN adjacencies adj ON comb.source = adj.source AND comb.target = adj.target AND comb.year = adj.year;
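
To make the intended semantics concrete, here is a toy reproduction of the same three-way LEFT JOIN in SQLite via Python (a stand-in for PostgreSQL; the table and column names follow the question, the data is made up). It shows that every combinations row survives, with NULLs where coauthorships or adjacencies have no match:

```python
import sqlite3

# Miniature versions of the question's tables (data is invented).
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE combinations  (source INTEGER, target INTEGER, year INTEGER);
CREATE TABLE coauthorships (source INTEGER, target INTEGER, year INTEGER,
                            publications REAL);
CREATE TABLE adjacencies   (source INTEGER, target INTEGER, year INTEGER,
                            auth_first_order REAL);
INSERT INTO combinations  VALUES (1, 2, 2000), (1, 3, 2000);
INSERT INTO coauthorships VALUES (1, 2, 2000, 5.0);
INSERT INTO adjacencies   VALUES (1, 2, 2000, 0.5);
""")

# Same join structure as the master view (columns trimmed for brevity).
rows = cur.execute("""
SELECT comb.source, comb.target, comb.year,
       coauth.publications, adj.auth_first_order
FROM combinations comb
LEFT JOIN coauthorships coauth
  ON comb.source = coauth.source AND comb.target = coauth.target
 AND comb.year  = coauth.year
LEFT JOIN adjacencies adj
  ON comb.source = adj.source AND comb.target = adj.target
 AND comb.year  = adj.year
ORDER BY comb.source, comb.target
""").fetchall()
# Unmatched combinations rows survive with NULLs:
# [(1, 2, 2000, 5.0, 0.5), (1, 3, 2000, None, None)]
```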

Then I proceeded to export the view like so:

psql -U postgres -d department-networks -c "COPY master TO stdout DELIMITER ',' CSV HEADER;" > ~/master.csv

How can I improve the merge design? I have some freedom over what the tables look like. Given that the tables are not indexed, which index method should I use? From pgAdmin's help page both hash and GiST seem appropriate. Should I merge the keys into a single column so that I join on one column instead of three? Should I join some tables separately so that in the end I only merge two tables? Should I not use Postgres at all? I'm grateful for any help in this regard.
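
For the "merge the keys into one column" idea, a single surrogate bigint could be built by bit-packing the three key parts. Whether this beats a plain composite B-tree index is doubtful (a multicolumn equality join is exactly what B-trees handle well), but here is a hypothetical sketch of what such a key could look like. The bit widths are assumptions: it only works if source and target fit in 24 bits (~16.7M distinct IDs) and year in 16 bits, which you would need to verify against your data:

```python
# Hypothetical single-key packing for (source, target, year).
# ASSUMPTION: source/target < 2^24 and year < 2^16, so the result
# fits in a 64-bit signed bigint. Check your real ID ranges first.
def pack_key(source: int, target: int, year: int) -> int:
    assert 0 <= source < (1 << 24) and 0 <= target < (1 << 24)
    assert 0 <= year < (1 << 16)
    return (source << 40) | (target << 16) | year

def unpack_key(key: int):
    # Inverse of pack_key: recover the three original components.
    return (key >> 40, (key >> 16) & 0xFFFFFF, key & 0xFFFF)
```

The round trip is lossless under those assumptions, so the packed column could serve as a single join key in every table.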

EDIT:

Here is the query plan:

"Merge Left Join  (cost=2615359.20..10115676.76 rows=146636992 width=52)"
"  Merge Cond: ((comb.source = adj.source) AND (comb.target = adj.target) AND (comb.year = adj.year))"
"  ->  Merge Left Join  (cost=2048532.06..8406982.78 rows=146636992 width=20)"
"        Merge Cond: ((comb.source = coauth.source) AND (comb.target = coauth.target) AND (comb.year = coauth.year))"
"        ->  Index Only Scan using combinations_source_target_year_idx on combinations comb  (cost=0.57..5107371.45 rows=146636992 width=18)"
"        ->  Materialize  (cost=2047466.59..2106034.64 rows=11713611 width=20)"
"              ->  Sort  (cost=2047466.59..2076750.61 rows=11713611 width=20)"
"                    Sort Key: coauth.source, coauth.target, coauth.year"
"                    ->  Seq Scan on coauthorships coauth  (cost=0.00..191745.11 rows=11713611 width=20)"
"  ->  Materialize  (cost=566827.14..581496.20 rows=2933811 width=50)"
"        ->  Sort  (cost=566827.14..574161.67 rows=2933811 width=50)"
"              Sort Key: adj.source, adj.target, adj.year"
"              ->  Seq Scan on adjacencies adj  (cost=0.00..51115.11 rows=2933811 width=50)"

Best Answer

@EvanCarroll: Good point, I wanted to add this as a question. The tables are not indexed as I don't know which method to use. Question edited. – MERose 16 secs ago

Red flag. Stop your query. Add your indexes. Rerun it. Assuming you have no indexes, add these, and then paste the result of \d coauthorships and \d adjacencies.

CREATE INDEX ON combinations ( source, target, year );
CREATE INDEX ON adjacencies ( source, target, year );
CREATE INDEX ON coauthorships ( source, target, year );
VACUUM ANALYZE combinations;
VACUUM ANALYZE adjacencies;
VACUUM ANALYZE coauthorships;
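
As a sanity check that a composite index on (source, target, year) is what the planner reaches for on this join shape, here is a small demonstration in SQLite via Python (a stand-in for PostgreSQL's EXPLAIN; the principle is the same — the inner table is probed through the index instead of being scanned and sorted):

```python
import sqlite3

# Empty miniatures of two of the question's tables, plus the
# composite index the answer recommends.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE combinations  (source INTEGER, target INTEGER, year INTEGER);
CREATE TABLE coauthorships (source INTEGER, target INTEGER, year INTEGER);
CREATE INDEX coauth_idx ON coauthorships (source, target, year);
""")

# Ask the planner how it would execute the left join.
plan = cur.execute("""
EXPLAIN QUERY PLAN
SELECT *
FROM combinations comb
LEFT JOIN coauthorships coauth
  ON comb.source = coauth.source AND comb.target = coauth.target
 AND comb.year  = coauth.year
""").fetchall()
plan_text = " ".join(step[-1] for step in plan)
# The inner table is searched via coauth_idx, not sequentially scanned.
assert "coauth_idx" in plan_text
```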

Also, why are you using bigint if you only have millions of rows? A plain integer covers roughly ±2.1 billion and halves the key width.