Join implementation question

database-agnostic join

I have a question regarding how joins are implemented. I understand that the specifics will depend on the exact DBMS and the indexes in the system, but I think that the question is general enough to be answered.

My question is the following: using the query below as an example, when the inner join is performed, does the DBMS search Table2 for each row in Table1?

And if the DBMS does in fact perform a search for every row, why doesn't it store a reference to the physical address of the "linked" data instead of just storing the foreign key?

SELECT Column_list
FROM TABLE1
INNER JOIN TABLE2
ON Table1.ColName = Table2.ColName

Best Answer

Most (all?) RDBMSs give you a way of finding out what they decide to do with your query under the hood.

For your query, the optimizer of the RDBMS can see that you have no filters other than the join condition, so it would expect to visit every row of both tables and join them together. This would usually be executed using something like a hash join: the first table is read and its rows are placed into hash buckets according to their join column value. The second table is then read, computing the hash bucket for its join column value as each row comes in; the join condition is evaluated against the rows in that bucket and the matching rows are output. Reading tables completely via full table scans is (with few exceptions) considerably faster than reading every row via index lookups.
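To make the build and probe phases concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the idea, not how any particular RDBMS implements it; the function name `hash_join` and the sample rows are made up for this example.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, key):
    """Join two lists of dict rows on a shared column using a hash table."""
    # Build phase: place every row of the first table into a bucket
    # chosen by its join column value.
    buckets = defaultdict(list)
    for row in build_rows:
        buckets[row[key]].append(row)

    # Probe phase: scan the second table once; for each row, look up the
    # bucket for its join value and emit one joined row per match.
    joined = []
    for row in probe_rows:
        for match in buckets.get(row[key], []):
            joined.append({**match, **row})
    return joined

# Hypothetical sample data; "ColName" is the join column from the question.
table1 = [{"ColName": 1, "a": "x"}, {"ColName": 2, "a": "y"}]
table2 = [
    {"ColName": 2, "b": "p"},
    {"ColName": 3, "b": "q"},
    {"ColName": 2, "b": "r"},
]
result = hash_join(table1, table2, "ColName")
```

Each table is read exactly once, which is why this approach suits a join that has to touch every row anyway.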

What you describe is a nested loop join, and your suggested execution method is pretty much what happens when the second table is stored in a clustered index structure (an Index-Organized Table in Oracle; other RDBMSs will probably have their own term). The primary key value (the foreign key from the other table) is used to traverse a single index, and at the end of that traversal are the rest of the table's columns. The difference this makes is not huge: it amounts to one I/O per row (and that could easily be cached). It would make a difference if you were joining to every row in the table, but at that point you would just use the hash join approach.
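A rough sketch of that nested loop in Python, for comparison with the hash join. A plain dict keyed by primary key stands in for the clustered index here, which is a simplification (a real clustered index is a B-tree, not a hash map); the table and column names are hypothetical.

```python
def nested_loop_join(outer_rows, inner_by_pk, fk):
    """For each outer row, look up the inner row whose primary key
    equals the outer row's foreign key value."""
    joined = []
    lookups = 0
    for row in outer_rows:
        lookups += 1  # one index traversal per outer row
        inner = inner_by_pk.get(row[fk])
        if inner is not None:
            # The lookup lands directly on the full inner row, as it
            # would with a clustered index.
            joined.append({**row, **inner})
    return joined, lookups

# Hypothetical sample data.
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 9},  # no matching customer
]
customers_by_pk = {
    1: {"customer_id": 1, "name": "Ann"},
    2: {"customer_id": 2, "name": "Bob"},
}
rows, lookups = nested_loop_join(orders, customers_by_pk, "customer_id")
```

The number of lookups grows with the number of outer rows, which is fine for a handful of rows and the reason a full-table join tends to be planned as a hash join instead.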

There are also other optimizations to be had with a nested loop, mainly to reduce latching (the brief contention incurred when reading a page/block from memory). Here the RDBMS batches up the reads it needs to make against an object and then does as many of them as possible in one go.

I would recommend checking the docs for your RDBMS of choice to see how to obtain the query plan, and then looking at the methods it can use to execute joins.
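As one example of what that looks like, here is a small sketch using SQLite through Python's built-in `sqlite3` module, chosen only because it needs no server. The table definitions are assumptions made up for the demo, and the plan text differs between SQLite versions, so no exact output is shown. Other systems have their own syntax, for instance `EXPLAIN` in PostgreSQL and MySQL or `EXPLAIN PLAN FOR` in Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: Table2.ColName is the primary key, so the
# optimizer has an index available for the join.
conn.executescript("""
    CREATE TABLE Table1 (ColName INTEGER, a TEXT);
    CREATE TABLE Table2 (ColName INTEGER PRIMARY KEY, b TEXT);
""")

# EXPLAIN QUERY PLAN asks SQLite how it intends to run the query
# without actually running it.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT Table1.a, Table2.b
    FROM Table1
    INNER JOIN Table2
    ON Table1.ColName = Table2.ColName
""").fetchall()

for step in plan:
    print(step[-1])  # the last column is the human-readable plan step
```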

Related Solutions

Mysql – Subqueries run very fast individually, but when joined are very slow

You don't need all the derived tables. You are joining the base (product) table too many times; you can write the query so that it is joined only once.

Compound indices are a must for EAV designs. Try adding an index on (attribute_id, product_id, value) and then the query:

SELECT t0.id, 
       t1.`value` AS length, 
       t2.`value` AS height, 
       t3.`value` AS family
FROM
  products t0

INNER JOIN 
  product_eav_decimal t1
    ON  t1.product_id = t0.id  
    AND t1.attribute_id = 91
    AND t1.`value` BETWEEN 15 AND 35

LEFT JOIN
  product_eav_decimal t2
    ON  t2.product_id = t0.id  
    AND t2.attribute_id = 80  
-- 
-- 
--

LEFT JOIN                              -- LEFT or INNER join
  product_eav_decimal t6
    ON  t6.product_id = t0.id  
 -- AND t6.attribute_id = 

ORDER BY t0.id ASC ;
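For anyone who wants to try the shape of this query quickly, below is a cut-down sketch run against SQLite from Python rather than MySQL. It keeps only the two joins whose attribute ids are given above (91 and 80); the index name, column types, and sample rows are assumptions made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY);
    CREATE TABLE product_eav_decimal (
        product_id INTEGER,
        attribute_id INTEGER,
        value REAL
    );
    -- The compound index suggested above (index name is hypothetical).
    CREATE INDEX idx_attr_product_value
        ON product_eav_decimal (attribute_id, product_id, value);

    -- Hypothetical sample data.
    INSERT INTO products (id) VALUES (1), (2), (3);
    INSERT INTO product_eav_decimal VALUES
        (1, 91, 20.0), (1, 80, 5.0),
        (2, 91, 50.0), (2, 80, 7.0),
        (3, 91, 30.0);
""")

# products is joined once; each attribute is a separate join on the EAV table.
rows = conn.execute("""
    SELECT t0.id, t1.value AS length, t2.value AS height
    FROM products t0
    INNER JOIN product_eav_decimal t1
        ON  t1.product_id = t0.id
        AND t1.attribute_id = 91
        AND t1.value BETWEEN 15 AND 35
    LEFT JOIN product_eav_decimal t2
        ON  t2.product_id = t0.id
        AND t2.attribute_id = 80
    ORDER BY t0.id ASC
""").fetchall()
```

Product 2 drops out because its length is outside the range, and product 3 comes back with a NULL height because of the LEFT JOIN.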

Extensible Asset Database Schema

Option #3

NULLs are your friend. After you select the platform, study how NULLs work for the platform and embrace them. In this case they accurately reflect the absence of a check-in. In Oracle, entirely NULL keys are not stored in B-tree indexes, so you could create a function-based index that swaps the NULL state, giving you a very small index that contains only the entries not checked in.

Option #1 would need a join any time you want the check-in date or status. Joins are your friend too, but in this case there isn't a benefit to separating this data unless there are times when a single checkout can produce multiple check-ins. Option #2 requires repeating data or even more NULLs than Option #3. Here are the three concepts fleshed out into tables, columns, and data.

#1
CheckOut
   Asset CheckInOut User DateTime
   1     1          1    3/10/2013
   2     2          1    3/10/2013
   3     3          2    3/11/2013

CheckIn
   CheckInOut DateTime
   1          3/11/2013

#2
CheckInOut
   Asset CheckInOut User DateTime   InOrOut
   1     1          1    3/10/2013  O
   2     2          1    3/10/2013  O
   1     1          1    3/11/2013  I   
   3     3          2    3/11/2013  O

#3
CheckOut
   Asset CheckInOut User OutDateTime InDateTime
   1     1          1    3/10/2013   3/11/2013
   2     2          1    3/10/2013
   3     3          2    3/11/2013

SQL Fiddles showing Option #2 and Option #3, with the additional requirements Joel Brown added, and how various questions would have to be answered in SQL:

#2 - http://www.sqlfiddle.com/#!4/55e72/25

#3 - http://www.sqlfiddle.com/#!4/e2f27/6
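As a quick local illustration of Option #3 (separate from the fiddles above, and using SQLite from Python rather than Oracle), the sketch below loads the three sample rows and asks which assets are still checked out. The column types and the ISO date format are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CheckOut (
        Asset INTEGER,
        CheckInOut INTEGER,
        "User" INTEGER,
        OutDateTime TEXT,
        InDateTime TEXT      -- NULL means the asset has not come back yet
    );
    INSERT INTO CheckOut VALUES
        (1, 1, 1, '2013-03-10', '2013-03-11'),
        (2, 2, 1, '2013-03-10', NULL),
        (3, 3, 2, '2013-03-11', NULL);
""")

# No join needed: the NULL check-in date is the "still out" status.
still_out = conn.execute(
    "SELECT Asset FROM CheckOut WHERE InDateTime IS NULL ORDER BY Asset"
).fetchall()
```

With Option #1 the same question would need an outer join from CheckOut to CheckIn.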
