Join implementation question

database-agnosticjoin;

I have a question regarding how joins are implemented. I understand that the specifics will depend on the exact DBMS and the indexes in the system, but I think that the question is general enough to be answered.

My question is the following: Using the next query as an example, when the inner join is performed, for each row in Table1 does the DBMS perform a search on the Table2?

And if in fact the DBMS does perform a search for every row, why doesn't the DBMS instead of just storing the foreign key, store a reference to the physical address where the "linked" data is?

SELECT Column_list
FROM TABLE1
INNER JOIN TABLE2
ON Table1.ColName = Table2.ColName 

Best Answer

Most (all?) RDBMSs give you a way of finding out what they decide to do with your query under the hood.

For your query, the optimizer of the RDBMS could see that you have no filters other than the join filter. This could mean that it would expect to visit every row of both tables and join them together. This would usually be executing using something like a hash join - the first table is read from and the rows are placed into hash buckets depending on their join column value. The second table is then read from computing the hash bucket for it's join column value at the same time, the join condition is then evaluated and the row is pumped out. Reading tables completely via full table scans is (with few exceptions) considerably faster than reading every row via index lookups.

What you describe is a nested loop and your suggested execution method is pretty much what happens when the second table is stored in a clustered index structure (AKA Index Organized Table in Oracle, other RDBMS's will probably have their own term). The primary key value (the foreign key from the other table) is used to traverse one index and on the other end is the rest of the table's columns. The difference this makes is not huge, it's going to be one IO per row (and that could easily be cached) - it would make a difference if you are joining to every row in the table but at that point you would just use the hash join approach.

There are also other optimizations to be had with a nested loop, mainly to reduce latching (tiny contention required when reading a page/block from memory). This is where the RDBMS is able to batch up the reads it is required to do against an object and then do as many in one go as possible.

I would recommend you visit the docs for your RDBMS of choice and seeing how to obtain the query plan. Then look at the methods it can use to execute joins.