Sql-server – How to visualize joins between inner queries to their respective outer queries

existsjoin;sql serversql-server-2016subquery

I seem to always have trouble visualizing what the meaning is when an inner query is joined to it's outer query.

Here's an example:

SELECT NM.MovieId, NM.MovieTitle
FROM NinjaMovies AS NM
WHERE EXISTS
(
    SELECT TOP 1
    FROM Ninjas AS N
    INNER JOIN NinjaWeapons AS NW
        ON N.WeaponId = NW.WeaponId
    WHERE N.NinjaId = NM.NinjaId
)

Is there a way to represent this (or a similar kind of inner query joined to outer query) in a table or diagram of sorts?

Best Answer

Just break it down into stages and consider what the output of each stage represents

SELECT *
FROM Ninjas AS N
INNER JOIN NinjaWeapons AS NW
    ON N.WeaponId = NW.WeaponId

This can be thought of as "all ninjas with at least one weapon along with their weapons". The data model does not appear to tie weapons to particular movies.

So you are now left with the simpler query to analyse (there is no value in including TOP 1 in an EXISTS so that has also been removed.)

SELECT NM.MovieId, NM.MovieTitle
FROM NinjaMovies AS NM
WHERE EXISTS
(
    SELECT *
    FROM WeaponOwningNinjasWithTheirWeapons AS N
    WHERE N.NinjaId = NM.NinjaId
)

There seems to be a normalisation issue here as apparently NinjaMovies will contain multiple rows for the same movie starring more than one Ninja and repeats the title in each one - but perhaps NinjaMovies is itself a view resulting from a join of normalised tables.

In any event SELECT NM.MovieId, NM.MovieTitle clearly selects the Movie Ids and Title from it so the only thing left to analyse is the EXISTS.

This simply says to only return rows where the corresponding Ninja exists in the WeaponOwningNinjasWithTheirWeapons result set.

So to sum up. The query returns all rows in NinjaMovies where the corresponding Ninja has at least one weapon.

The semantics of the query seem a bit odd as NM.NinjaId is not projected so in the event that the movie stars multiple ninjas - some with weapons and some without - there is no indication as to which led to the inclusion of the row. If you don't care about this then you should probably add a DISTINCT to remove duplicates in the event that the movie stars multiple weapon bearing ninjas.

Related Solutions

Database Performance – Are Individual Queries Faster Than Joins?

Are individual queries faster than joins, or: Should I try to squeeze every info I want on the client side into one SELECT statement or just use as many as seems convenient?

In any performance scenario, you have to test and measure the solutions to see which is faster.

That said, it's almost always the case that a joined result set from a properly tuned database will be faster and scale better than returning the source rows to the client and then joining them there. In particular, if the input sets are large and the result set is small -- think about the following query in the context of both strategies: join together two tables that are 5 GB each, with a result set of 100 rows. That's an extreme, but you see my point.

I have noticed that when I have to get information from multiple tables, it is "often" faster to get this information via multiple queries on individual tables (maybe containing a simple inner join) and patch the data together on the client side that to try to write a (complex) joined query where I can get all the data in one query.

It's highly likely that the database schema or indexes could be improved to better serve the queries you're throwing at it.

A joined query always has to return more data than the individual queries that receive the same amount of information.

Usually this is not the case. Most of the time even if the input sets are large, the result set will be much smaller than the sum of the inputs.

Depending on the application, very large query result sets being returned to the client are an immediate red flag: what is the client doing with such a large set of data that can't be done closer to the database? Displaying 1,000,000 rows to a user is highly suspect to say the least. Network bandwidth is also a finite resource.

Since the database has to cobble together the data, for large datasets one can assume that the database has to do more work on a single joined query than on the individual ones, since (at least) it has to return more data to the client.

Not necessarily. If the data is indexed correctly, the join operation is more likely to be done more efficiently at the database without needing to scan a large quantity of data. Moreover, relational database engines are specially optimized at a low level for joining; client stacks are not.

Would it follow from this, that when I observe that splitting a client side query into multiple queries yield better performance, this is just the way to go, or would it rather mean that I messed up the joined query?

Since you said you're inexperienced when it comes to databases, I would suggest learning more about database design and performance tuning. I'm pretty sure that's where the problem lies here. Inefficiently-written SQL queries are possible, too, but with a simple schema that's less likely to be a problem.

Now, that's not to say there aren't other ways to improve performance. There are scenarios where you might choose to scan a medium-to-large set of data and return it to the client if the intention is to use some sort of caching mechanism. Caching can be great, but it introduces complexity in your design. Caching may not even be appropriate for your application.

One thing that hasn't been mentioned anywhere is maintaining consistency in the data that's returned from the database. If separate queries are used, it's more likely (due to many factors) to have inconsistent data returned, unless a form of snapshot isolation is used for every set of queries.

SQL INNER JOIN and OUTER JOIN in Same Table – How to Implement

The query is joining 2 tables (receipts, sales) that both have a many-to-one relationship with product. This creates a kind of cartesian (cross) product and will give wrong results in the SUM() calculations. To avoid that, you need to do the summations for the two tables in two different subqueries to avoid errors.

Something like this will work:

SELECT 
    m.month, 
    p.name, 
    COALESCE(r.receipts_quantity, 0) AS receipts_quantity, 
    COALESCE(s.sales_quantity, 0)    AS sales_quantity
FROM 
    product AS p
  CROSS JOIN
    ( VALUES (1),(2),(3),(4),(5),(6) ) AS m(month)     -- months wanted in the report
  LEFT OUTER JOIN 
    ( SELECT month, product_id, SUM(quantity) AS receipts_quantity
      FROM receipts
      GROUP BY month, product_id
    ) AS r
    ON  p.id_product = r.product_id
    AND m.month = r.month
  LEFT OUTER JOIN  
    ( SELECT month, product_id, SUM(quantity) AS sales_quantity
      FROM sales
      GROUP BY month, product_id
    ) AS s
    ON  p.id_product = s.product_id
    AND m.month = s.month ;

If you want to avoid having results for products that did not have any activity in a month, you can add the condition:

WHERE r.month IS NOT NULL 
   OR s.month IS NOT NULL

Best Answer

Related Solutions

Database Performance – Are Individual Queries Faster Than Joins?

SQL INNER JOIN and OUTER JOIN in Same Table – How to Implement

Related Question