PostgreSQL – How to Rewrite a Query Joining Three Tables for Efficiency

distinctjoin;postgresqlpostgresql-9.5

I have three tables (A, B and C) with various data regarding mining facilities in them. But only one table (C) has two columns with coordinates. My final goal is to produce a table which pulls similar data about mine name, owner, product etc. from A and B and places them next to similar columns in C. I am using mine name.

The DDL design is as follows:

-- Table "A"
CREATE TABLE "globalminfac_db".minfac (
    ROW_ID INT,
    MINERAL_COMMODITY_GENERAL TEXT,
    FACILITY_PRODUCES TEXT,
    COUNTRY TEXT,
    CITY TEXT,
    LOC_DESC TEXT,
    LOCATION_NAME TEXT,
    OPERATOR_NAME TEXT,
    OWNER_NAME TEXT,
    PRIMARY_OWNER TEXT,
    SECONDARY_OWNER TEXT,
    FINAL_DDLAT REAL,
    FINAL_DDLONG REAL,
);

-- Table "B"
CREATE TABLE "drc_db".OUTLOOK_TABLE (
    COUNTRY TEXT,
    COMMODITY TEXT,
    MINE_NAME TEXT,
    OPERATOR_NAME TEXT,
);

-- Table "C"
CREATE TABLE "drc_db".table2_drc (
    COMMODITY TEXT,
    MAJOR_OPERATOR_OWNER TEXT,
    LOCATION_MAIN_FACILITIES TEXT,
);

However, I'm only able to perform INNER JOIN on these tables and this produces redundant rows where it takes all rows from each table with name "x" and gives each of them a row in my new table. See my SELECT operation below and a screenshot of the sample data. I've already done the initial work of joining the tables using matching columns, which is location/mine name.

SELECT minfac.mineral_commodity_general a_commodity, 
       outlook.commodity b_commodity, 
       table2.commodity c_commodity, 
       minfac.location_name a_locationName, 
       outlook.mine_name b_locationName, 
       table2.location_main_facilities c_locationName,
       minfac.operator_name a_operator, 
       outlook.operator_name b_operator, 
       table2.major_operator_owner c_operator, 
       minfac.final_ddlat Latitude, 
       minfac.final_ddlong Longitude
    FROM "drc_db".minfac_drc minfac
    INNER JOIN "drc_db".outlook_drc outlook 
        ON minfac.location_name LIKE concat(outlook.mine_name, '%')
    INNER JOIN "drc_db".table2_drc table2 
        ON SPLIT_PART(table2.location_main_facilities, ' ', 1) = SPLIT_PART(minfac.location_name, ' ', 1)

This produces this table below:

enter image description here

As you can see, I have 6 rows listing "Dikulushi mine" with the same coordinates (3 from table A for Silver, Copper and Cobalt; 1 from table B and 2 from table C).

How can I rewrite my query to return the above table, but include only unique location names?

The final goal, to reiterate my earlier point is to have a final table which contains rows and coordinates from all three tables.

I am using pgAdmin4 with Postgres 9.5.14.

Any help with this question or towards making my query more efficient/better would be appreciated.

Best Answer

My first impression is that you should redesign your tables a bit. I'm not sure what SPLIT_PART does, but it sounds as if it grabs part of a string, and having such functions in an ON clause is really bad for performance. Assume table A and B has 1000 rows each:

SELECT ...
FROM A
JOIN B
    ON f(A.x) = f(B.x)

In a nested loop join there will be 2*1000*1000=2 miljon calls to f. I would investigate the possibilities to add a generated column for the join conditions, and add an index.

To remove redundancy from the answer you can use distinct, but you will still get redundant info in the following situation:

GOLD, SILVER, ...
SILVER, GOLD, ...

Does it matter if a_commodity and b_commodity switches place for a certain row? If not you can arrange the in lexigraphical order like:

SELECT distinct
       LEAST(minfac.mineral_commodity_general, outlook.commodity, table2.commodity)
     , ...
       GREATEST(minfac.mineral_commodity_general, outlook.commodity, table2.commodity)
     , ...

For the middlest one you will have to exclude the least and the greatest one.