Database Performance – Are Individual Queries Faster Than Joins?

application-design; join; performance-tuning

Conceptual question: Are individual queries faster than joins? In other words: should I try to squeeze all the information I want on the client side into one SELECT statement, or just use as many queries as seems convenient?

TL;DR: If my joined query takes longer than running individual queries, is this my fault or is this to be expected?

First off, I am not very database savvy, so it may just be me, but I have noticed that when I have to get information from multiple tables, it is "often" faster to get this information via multiple queries on individual tables (maybe containing a simple inner join) and patch the data together on the client side than to try to write a (complex) joined query where I can get all the data in one query.

I have tried to put one extremely simple example together:

SQL Fiddle

Schema Setup:

CREATE TABLE MASTER 
( ID INT NOT NULL
, NAME VARCHAR2(42 CHAR) NOT NULL
, CONSTRAINT PK_MASTER PRIMARY KEY (ID)
);

CREATE TABLE DATA
( ID INT NOT NULL
, MASTER_ID INT NOT NULL
, VALUE NUMBER
, CONSTRAINT PK_DATA PRIMARY KEY (ID)
, CONSTRAINT FK_DATA_MASTER FOREIGN KEY (MASTER_ID) REFERENCES MASTER (ID)
);

INSERT INTO MASTER values (1, 'One');
INSERT INTO MASTER values (2, 'Two');
INSERT INTO MASTER values (3, 'Three');

CREATE SEQUENCE SEQ_DATA_ID;

INSERT INTO DATA values (SEQ_DATA_ID.NEXTVAL, 1, 1.3);
INSERT INTO DATA values (SEQ_DATA_ID.NEXTVAL, 1, 1.5);
INSERT INTO DATA values (SEQ_DATA_ID.NEXTVAL, 1, 1.7);
INSERT INTO DATA values (SEQ_DATA_ID.NEXTVAL, 2, 2.3);
INSERT INTO DATA values (SEQ_DATA_ID.NEXTVAL, 3, 3.14);
INSERT INTO DATA values (SEQ_DATA_ID.NEXTVAL, 3, 3.7);

Query A:

select NAME from MASTER
where ID = 1

Results:

| NAME |
--------
|  One |

Query B:

select ID, VALUE from DATA
where MASTER_ID = 1

Results:

| ID | VALUE |
--------------
|  1 |   1.3 |
|  2 |   1.5 |
|  3 |   1.7 |

Query C:

select M.NAME, D.ID, D.VALUE 
from MASTER M INNER JOIN DATA D ON M.ID=D.MASTER_ID
where M.ID = 1

Results:

| NAME | ID | VALUE |
---------------------
|  One |  1 |   1.3 |
|  One |  2 |   1.5 |
|  One |  3 |   1.7 |

Of course, I didn't measure any performance with these, but one may observe:

  • Query A+B returns the same amount of usable information as Query C.
  • A+B has to return 1 + 2×3 == 7 "Data Cells" to the client (one NAME cell plus three rows of two columns).
  • C has to return 3×3 == 9 "Data Cells" to the client (three rows of three columns), because with the join I naturally include some redundancy in the result set.

Generalizing from this (as far-fetched as it is):

A joined query always has to return more data than the individual queries that retrieve the same amount of information. Since the database has to cobble the data together anyway, one can assume that for large datasets it has to do more work on a single joined query than on the individual ones, if only because it has to return more data to the client.

Would it follow from this that, when I observe that splitting a client-side query into multiple queries yields better performance, this is just the way to go, or would it rather mean that I messed up the joined query?

Best Answer

Are individual queries faster than joins? In other words: should I try to squeeze all the information I want on the client side into one SELECT statement, or just use as many queries as seems convenient?

In any performance scenario, you have to test and measure the solutions to see which is faster.

That said, it's almost always the case that a joined result set from a properly tuned database will be faster and scale better than returning the source rows to the client and joining them there. This is particularly true when the input sets are large and the result set is small: think about joining two tables that are 5 GB each with a result set of 100 rows, in the context of both strategies. That's an extreme case, but you see my point.
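As a sketch of that scenario (the ORDERS/ORDER_ITEMS tables and the filter columns here are made up for illustration, not part of the example schema), a highly selective join lets the database resolve everything internally and ship only the handful of matching rows over the wire:

-- Hypothetical: ORDERS and ORDER_ITEMS are assumed to be several GB each.
-- With suitable indexes the engine joins them internally and returns only
-- the ~100 rows that match the filter, instead of both source tables.
SELECT O.ORDER_ID, O.CUSTOMER_ID, I.PRODUCT_ID, I.QUANTITY
FROM ORDERS O
INNER JOIN ORDER_ITEMS I ON I.ORDER_ID = O.ORDER_ID
WHERE O.ORDER_DATE = DATE '2013-01-15'
AND O.CUSTOMER_ID = 42;

Doing the same thing client-side would mean downloading both large tables first.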

I have noticed that when I have to get information from multiple tables, it is "often" faster to get this information via multiple queries on individual tables (maybe containing a simple inner join) and patch the data together on the client side than to try to write a (complex) joined query where I can get all the data in one query.

It's highly likely that the database schema or indexes could be improved to better serve the queries you're throwing at it.
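One concrete thing to check in the example schema above: Oracle does not automatically create an index on a foreign key column, so both the join and the MASTER_ID filter may force a full scan of DATA. An index on that column is the usual first step (a sketch; whether it actually helps depends on your real data volumes):

-- Index the foreign key column so the engine can seek to the DATA rows
-- for a given MASTER_ID instead of scanning the whole table.
CREATE INDEX IX_DATA_MASTER_ID ON DATA (MASTER_ID);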

A joined query always has to return more data than the individual queries that retrieve the same amount of information.

Usually this is not the case. Most of the time even if the input sets are large, the result set will be much smaller than the sum of the inputs.

Depending on the application, very large query result sets being returned to the client are an immediate red flag: what is the client doing with such a large set of data that can't be done closer to the database? Displaying 1,000,000 rows to a user is highly suspect to say the least. Network bandwidth is also a finite resource.
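For instance, if the client in the example above ultimately only needs a summary per master row (an assumption about the application, not something stated in the question), asking the database to aggregate returns one row per NAME instead of every detail row:

-- Sketch: aggregate in the database rather than shipping all DATA rows
-- to the client and summing them there.
SELECT M.NAME, COUNT(*) AS CNT, SUM(D.VALUE) AS TOTAL_VALUE
FROM MASTER M
INNER JOIN DATA D ON D.MASTER_ID = M.ID
GROUP BY M.NAME;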

Since the database has to cobble the data together anyway, one can assume that for large datasets it has to do more work on a single joined query than on the individual ones, if only because it has to return more data to the client.

Not necessarily. If the data is indexed correctly, the join can likely be done more efficiently at the database without needing to scan a large quantity of data. Moreover, relational database engines are specially optimized at a low level for joining; client stacks are not.
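On Oracle you can see how the engine actually plans the joined query; the exact plan depends on your version, statistics and indexes, but this is the standard way to inspect it:

-- Ask the optimizer for the plan of Query C, then display it.
EXPLAIN PLAN FOR
SELECT M.NAME, D.ID, D.VALUE
FROM MASTER M INNER JOIN DATA D ON M.ID = D.MASTER_ID
WHERE M.ID = 1;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);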

Would it follow from this that, when I observe that splitting a client-side query into multiple queries yields better performance, this is just the way to go, or would it rather mean that I messed up the joined query?

Since you said you're inexperienced when it comes to databases, I would suggest learning more about database design and performance tuning. I'm pretty sure that's where the problem lies here. Inefficiently written SQL queries are possible too, but with a simple schema that's less likely to be a problem.

Now, that's not to say there aren't other ways to improve performance. There are scenarios where you might choose to scan a medium-to-large set of data and return it to the client if the intention is to use some sort of caching mechanism. Caching can be great, but it introduces complexity in your design. Caching may not even be appropriate for your application.

One thing that hasn't been mentioned anywhere is maintaining consistency in the data that's returned from the database. If separate queries are used, they are more likely (due to many factors) to return inconsistent data, unless a form of snapshot isolation is used for every set of queries.
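In Oracle, for example, one way to get that consistency across the separate statements is to wrap them in a read-only transaction (a sketch; SERIALIZABLE isolation would achieve the same thing), so both queries see the database as of the same point in time:

-- Queries A and B now read from the same snapshot, so the MASTER row
-- and its DATA rows cannot drift apart between the two statements.
SET TRANSACTION READ ONLY;

SELECT NAME FROM MASTER WHERE ID = 1;
SELECT ID, VALUE FROM DATA WHERE MASTER_ID = 1;

COMMIT;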