Postgresql – Left join not working with sub-query

join; postgresql; subquery

I've got a table of scores where each entry corresponds to a particular (student, subject) pair.

CREATE TABLE score (
  id serial PRIMARY KEY,
  value integer NOT NULL,
  subject_id integer NOT NULL,
  student_id integer NOT NULL,
  CONSTRAINT s1_id FOREIGN KEY (subject_id) REFERENCES subject (id),
  CONSTRAINT s2_id FOREIGN KEY (student_id) REFERENCES student (id)
);

I want to pick the top 5 subjects with the highest overall scores, and then compute the average score of each student across those 5 subjects. Some students may not have entries for some subjects. Those missing entries should count as a default score.

Here's what I have:

SELECT student_id, AVG(COALESCE(score.value, default_value)) FROM 
(
    SELECT score.subject_id, subject.name, SUM(score.value) AS score_sum
    FROM score
    JOIN subject on subject.id = score.subject_id
    WHERE subject.name != 'skip me'
    GROUP BY score.subject_id
    ORDER BY score_sum DESC
    LIMIT 5
) AS score_sort
LEFT JOIN score ON score_sort.subject_id = score.subject_id 
GROUP BY student_id

The inner query works correctly to select the top 5. But the LEFT JOIN in the outer query does not select the rows where a student does not have a score. What am I doing wrong here?

Best Answer

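The LEFT JOIN preserves every row of score_sort (the subjects), not every (student, subject) pair, so a student with no score row for a subject never produces a NULL for COALESCE to fill. You could use a CROSS JOIN from score_sort to all the students, which generates every pair, and then a LEFT JOIN to score: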

SELECT st.id AS student_id,
       AVG(COALESCE(sc.value, default_value)) AS average_score
FROM 
(
    SELECT score.subject_id, subject.name, SUM(score.value) AS score_sum
    FROM score
    JOIN subject ON subject.id = score.subject_id
    WHERE subject.name != 'skip me'
    GROUP BY score.subject_id, subject.name  -- every non-aggregated column must be listed
    ORDER BY score_sum DESC
    LIMIT 5
) AS score_sort
  CROSS JOIN student AS st      -- one row per (top subject, student) pair
  LEFT JOIN score AS sc
    ON  score_sort.subject_id = sc.subject_id 
    AND st.id = sc.student_id   -- student.id, per the foreign key in the schema
GROUP BY st.id ;
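
To see the difference between the two joins on concrete rows, here is a minimal sketch that runs both shapes of the query against an in-memory SQLite database from Python. The sample data, the top-2 limit (instead of 5), the default of 0 and the names top_subject, TOP_SUBJECTS, NAIVE and FIXED are assumptions made for the demo; the subject table and the 'skip me' filter are left out to keep it short.

```python
import sqlite3

DEFAULT_VALUE = 0  # hypothetical default score for a missing (student, subject) pair

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY);
    CREATE TABLE score (
        id INTEGER PRIMARY KEY,
        value INTEGER NOT NULL,
        subject_id INTEGER NOT NULL,
        student_id INTEGER NOT NULL REFERENCES student (id)
    );
    INSERT INTO student (id) VALUES (1), (2), (3);
    -- student 1 has both subjects, student 2 only subject 1, student 3 nothing
    INSERT INTO score (value, subject_id, student_id) VALUES
        (80, 1, 1), (60, 2, 1), (90, 1, 2);
""")

# Top subjects by total score (top 2 here, top 5 in the question).
TOP_SUBJECTS = """
    SELECT subject_id FROM score
    GROUP BY subject_id
    ORDER BY SUM(value) DESC
    LIMIT 2
"""

# Shape of the question's query: LEFT JOIN straight from subjects to scores.
NAIVE = f"""
    SELECT sc.student_id, AVG(COALESCE(sc.value, :default_value))
    FROM ({TOP_SUBJECTS}) AS top_subject
    LEFT JOIN score AS sc ON sc.subject_id = top_subject.subject_id
    GROUP BY sc.student_id
    ORDER BY sc.student_id
"""

# Shape of the answer's query: build every (subject, student) pair first.
FIXED = f"""
    SELECT st.id, AVG(COALESCE(sc.value, :default_value))
    FROM ({TOP_SUBJECTS}) AS top_subject
    CROSS JOIN student AS st
    LEFT JOIN score AS sc
           ON sc.subject_id = top_subject.subject_id
          AND sc.student_id = st.id
    GROUP BY st.id
    ORDER BY st.id
"""

params = {"default_value": DEFAULT_VALUE}
naive = dict(conn.execute(NAIVE, params).fetchall())
fixed = dict(conn.execute(FIXED, params).fetchall())

print("naive:", naive)  # student 2 ignores the missing subject, student 3 is absent
print("fixed:", fixed)  # missing pairs count as DEFAULT_VALUE
```

With this data the naive version averages student 2 over one subject only and never returns student 3, while the CROSS JOIN version averages every student over both top subjects.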

A different approach would be to alter your query to count the available scores each student has (in the 5 top subjects):

SELECT score.student_id, 
       (SUM(score.value) + (5 - COUNT(score.subject_id)) * default_value) / 5.0  -- 5.0 avoids integer division
           AS average_score
FROM 
(
    SELECT score.subject_id, subject.name, SUM(score.value) AS score_sum
    FROM score
    JOIN subject ON subject.id = score.subject_id
    WHERE subject.name != 'skip me'
    GROUP BY score.subject_id, subject.name  -- every non-aggregated column must be listed
    ORDER BY score_sum DESC
    LIMIT 5
) AS score_sort
JOIN score ON score_sort.subject_id = score.subject_id 
GROUP BY score.student_id ;

But even with this, you would still not get any result for students who have no score in any of the top 5 subjects, because the inner JOIN produces no rows for them.
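
The counting approach and its blind spot can be checked the same way. This is a rough sketch on an in-memory SQLite database, assuming a top-2 limit instead of 5, a default score of 40 and made-up sample rows; TOP_N, DEFAULT_VALUE and top_subject are names invented for the demo.

```python
import sqlite3

TOP_N = 2           # the answer uses 5; 2 keeps the sample data small
DEFAULT_VALUE = 40  # hypothetical default score

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE score (
        id INTEGER PRIMARY KEY,
        value INTEGER NOT NULL,
        subject_id INTEGER NOT NULL,
        student_id INTEGER NOT NULL
    );
    -- subject 3 has the lowest total, so it falls outside the top 2;
    -- student 3 only has a score in that subject
    INSERT INTO score (value, subject_id, student_id) VALUES
        (80, 1, 1), (60, 2, 1), (90, 1, 2), (10, 3, 3);
""")

QUERY = """
    SELECT sc.student_id,
           -- present scores plus the default for each missing subject;
           -- multiplying by 1.0 forces floating-point division
           (SUM(sc.value) + (:n - COUNT(sc.subject_id)) * :default_value) / (:n * 1.0)
               AS average_score
    FROM (
        SELECT subject_id FROM score
        GROUP BY subject_id
        ORDER BY SUM(value) DESC
        LIMIT :n
    ) AS top_subject
    JOIN score AS sc ON sc.subject_id = top_subject.subject_id
    GROUP BY sc.student_id
    ORDER BY sc.student_id
"""

averages = dict(
    conn.execute(QUERY, {"n": TOP_N, "default_value": DEFAULT_VALUE}).fetchall()
)
print(averages)  # student 3 does not appear at all
```

Student 2 gets the default for the one missing top subject, but student 3, who has no score in either top subject, is dropped by the inner join.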

Related Solutions

MySQL – get column from too many tables in MySQL

If all the tables use the MyISAM Storage Engine and have the same table structure, I have some good news for you.

You can create a single table that consumes no additional space except a .frm file and some mapping info. The key is to take advantage of the MERGE (MRG_MyISAM) Storage Engine.

Here is how you can do this:

CREATE TABLE XMerge LIKE X1;
ALTER TABLE XMerge ENGINE=MRG_MYISAM
UNION=(X1,X2,X3,X4) INSERT_METHOD=LAST;

Using this method, you can query the 4 tables at the same time like this:

SELECT ReqF FROM XMerge WHERE EmpName='John';

Was that simple, or what ???

In your case, you have 75 tables. You would do this:

CREATE TABLE XMerge LIKE X1;
ALTER TABLE XMerge ENGINE=MRG_MYISAM
UNION=(X1,X2,X3,X4,X5,X6,X7,X8,X9,
X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,
X20,X21,X22,X23,X24,X25,X26,X27,X28,X29,
X30,X31,X32,X33,X34,X35,X36,X37,X38,X39,
X40,X41,X42,X43,X44,X45,X46,X47,X48,X49,
X50,X51,X52,X53,X54,X55,X56,X57,X58,X59,
X60,X61,X62,X63,X64,X65,X66,X67,X68,X69,
X70,X71,X72,X73,X74,X75) INSERT_METHOD=LAST;
SELECT ReqF FROM XMerge WHERE EmpName='John';

The beauty of this is that creating a MERGE table takes milliseconds. Just make sure every table has an index on EmpName. Better to do 75 indexed lookups than 75 full table scans. If there is no index on EmpName, you need to do this:

ALTER TABLE X1 ADD UNIQUE KEY (EmpName);
ALTER TABLE X2 ADD UNIQUE KEY (EmpName);
.
.
.
ALTER TABLE X75 ADD UNIQUE KEY (EmpName);
CREATE TABLE XMerge LIKE X1;
ALTER TABLE XMerge ENGINE=MRG_MYISAM
UNION=(X1,X2,X3,X4,X5,X6,X7,X8,X9,
X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,
X20,X21,X22,X23,X24,X25,X26,X27,X28,X29,
X30,X31,X32,X33,X34,X35,X36,X37,X38,X39,
X40,X41,X42,X43,X44,X45,X46,X47,X48,X49,
X50,X51,X52,X53,X54,X55,X56,X57,X58,X59,
X60,X61,X62,X63,X64,X65,X66,X67,X68,X69,
X70,X71,X72,X73,X74,X75) INSERT_METHOD=LAST;
SELECT ReqF FROM XMerge WHERE EmpName='John';

Give it a Try !!!

Postgresql – Postgres multiple joins slow query, how to store default child record

You write:

Each customer can have multiple sites, but only one should be displayed in this list.

Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.

Either way, it does not matter much here. Your EXPLAIN shows only 5026 rows for the site scan (5018 for the customer scan). So hardly any customer actually has more than one site. Did you ANALYZE your tables before running EXPLAIN?

From the numbers I see in your EXPLAIN, indexes will give you nothing for this query. Sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?

Maybe the query itself is faster, but "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.

If this query is your bottleneck, I would suggest you implement a materialized view.

After you provided more information I find that my diagnosis pretty much holds.

The query itself needs 27 ms. Not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected. The slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should only retrieve 100 rows at a time. That would solve most of it, even if it means executing the whole query every time.

If you go the route with a materialized view like I proposed, you could add a serial number without gaps to the table plus an index on it, by adding a column row_number() OVER (<your sort criteria here>) AS mv_id.

Then you can query:

SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id <  2800;

This will perform very fast. LIMIT / OFFSET cannot compete: it has to compute and sort the whole result before it can skip to the offset and pick 100 rows.
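
As a rough sketch of the mv_id idea, the following Python script uses an in-memory SQLite database, where a plain table filled by CREATE TABLE ... AS stands in for the materialized view (SQLite has none). The customer table, its contents and the names customer_mv and customer_mv_mv_id are assumptions for the demo, and it needs an SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
# 5000 hypothetical customers, inserted in reverse name order on purpose
conn.executemany(
    "INSERT INTO customer (name) VALUES (?)",
    [(f"customer-{i:04d}",) for i in range(5000, 0, -1)],
)

# Stand-in for the materialized view: precompute a gapless serial number
# in the desired sort order, then index it.
conn.execute("""
    CREATE TABLE customer_mv AS
    SELECT row_number() OVER (ORDER BY name) AS mv_id, id, name
    FROM customer
""")
conn.execute("CREATE UNIQUE INDEX customer_mv_mv_id ON customer_mv (mv_id)")

# Page lookup by range on the indexed serial number.
page = conn.execute(
    "SELECT mv_id, name FROM customer_mv "
    "WHERE mv_id >= ? AND mv_id < ? ORDER BY mv_id",
    (2700, 2800),
).fetchall()

# The same page via LIMIT / OFFSET on the base table, for comparison.
offset_page = conn.execute(
    "SELECT name FROM customer ORDER BY name LIMIT 100 OFFSET 2699"
).fetchall()

print(len(page), page[0], page[-1])
```

Both queries return the same 100 rows. The range query can go straight to the index on mv_id, while the LIMIT / OFFSET variant has to sort and skip the preceding rows on every call. The trade-off is that the precomputed table must be rebuilt whenever the underlying data or sort order changes.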

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server you need to use SQL EXPLAIN ANALYZE, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output you get something like this:

Total runtime: 0.269 ms
