SQL Server – Efficient join with data spread into different tables

join;sql server

I have to work with an ancient database schema on Microsoft SQL Server where records are split into different tables but referenced in a central table.

A record in my central table will look like this:

| ID | Firstname | Lastname | Table    |
|----|-----------|----------|----------|
| 1  | John      | Smith    | users001 |
| 2  | North     | West     | users005 |

The database will then contain multiple tables users001, users002 through usersN where individual action records are stored. I can join a users-table with the central table using the ID column (ID is a foreign key in every users-table).
These tables can be quite large (500k rows or more).
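
For example, a single users-table can be joined to the central table like this (I'm calling the central table Central here; the real name differs):

    SELECT c.ID, c.Firstname, c.Lastname, u.*
      FROM Central c
           INNER JOIN users001 u ON (u.ID = c.ID)  -- ID is the foreign key in every users-table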

Now, in order to do calculations over all users, I created a view that basically does a UNION ALL of all users-tables, which I can then query.
Obviously that is quite expensive, hence my question:
Any idea how one can efficiently join these tables?

Unfortunately, a stored procedure is not an option, as the application performing the calculations can only work with tables or views (a BI suite I cannot influence).

Best Answer

You've got a couple of options - but none of them are likely to produce lightning-swift results.

Note I'll be making a few assumptions below. The main one is that all the usersXXX tables have the same structure (same column names, same datatypes).

  1. Massive LEFT JOIN - if there are fewer than 256 usersXXX tables (or possibly more - see the note below the code), and you're only joining to the central table (we'll call that User, for convenience), then you can (in theory) create a single statement like the following:

    SELECT U.ID, U.firstname, U.lastname, COALESCE(U001.field1, U002.field1, ... , U255.field1) AS field1, ...
      FROM [User] U  -- the central table
             LEFT JOIN users001 U001 ON (U.ID = U001.ID)
             LEFT JOIN users002 U002 ON (U.ID = U002.ID)
            ...
             LEFT JOIN users255 U255 ON (U.ID = U255.ID)
    

    Note that every reference to a given field from the usersXXX tables will need to COALESCE that field across all 255 possible tables.

    SQL Server's documentation claims an upper limit of 256 tables in a query, which is why that's the limit used here. However, the documentation is not consistent: another page states that the maximum number of tables per SELECT statement is "Limited only by available resources". In addition, experiments have shown that queries can contain more than 256 tables; one test involved 599 tables.

    The awkward nature of referring to the COALESCEd fields makes this hard to maintain and hard to read.

  2. Your current view. I'm assuming (if you're using the view to do calculations over all users) that you may not necessarily be joining to the User table.

    So, your view probably looks something like this (you can add an INNER JOIN to User if you want):

    SELECT field1, field2, ... FROM users001
    UNION ALL
    SELECT field1, field2, ... FROM users002
    UNION ALL
    ...
    UNION ALL
    SELECT field1, field2, ... FROM usersZZZ
    

    You've already stated that this leaves something to be desired performance-wise - but we'll get back to that.
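
    The optional join back to User mentioned above then just sits on top of the view; a minimal sketch, assuming the view is named AllUsers and that each branch also selects ID (field names are placeholders):

    SELECT U.firstname, U.lastname, V.field1, V.field2
      FROM AllUsers V
           INNER JOIN [User] U ON (U.ID = V.ID)
     WHERE V.field2 > 0;  -- example filter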

  3. Next, there's direct use of UNION ALL. Instead of going through a view (which simplifies constructing the query), use the actual UNION ALL of the tables directly, as sketched below. This may perform better than option 2 if you're using a WHERE clause. It depends on how the query optimizer handles things: if there are relatively few records to bring back, but the query optimizer insists on materializing the view before applying the WHERE clause, then this will help. If (as I suspect) the query optimizer would apply the WHERE clause at the individual table level rather than trying to materialize the view, then this would probably have little performance benefit.
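
    A minimal sketch of that inline form, with the WHERE clause repeated in every branch so each usersXXX table can be filtered before the rows are concatenated (field names and the predicate are placeholders):

    SELECT field1, field2 FROM users001 WHERE field2 > 0
    UNION ALL
    SELECT field1, field2 FROM users002 WHERE field2 > 0
    UNION ALL
    -- ... one branch per usersXXX table
    SELECT field1, field2 FROM usersZZZ WHERE field2 > 0;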

  4. Last, you can query each table individually and put the results in a temp table, as sketched below. The possible performance advantage here is that each individual query would be much simpler, would take far less time to generate a plan for, and would run relatively quickly. The disadvantages are that a separate plan would be generated for each table (meaning that even the relatively short time to generate these query plans must be multiplied by the number of tables for the first run), and that the "simple" query must be run NNN times (where NNN is the total number of tables).
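
    A rough sketch of that approach (column names and datatypes are assumptions; in practice you would generate one INSERT per usersXXX table, e.g. from the Table column of the central table):

    CREATE TABLE #AllUsers (ID int, field1 int, field2 int);

    INSERT INTO #AllUsers (ID, field1, field2)
    SELECT ID, field1, field2 FROM users001 WHERE field2 > 0;

    INSERT INTO #AllUsers (ID, field1, field2)
    SELECT ID, field1, field2 FROM users002 WHERE field2 > 0;

    -- ... one INSERT per usersXXX table, then query (or join) #AllUsers
    SELECT U.firstname, U.lastname, A.field1
      FROM #AllUsers A
           INNER JOIN [User] U ON (U.ID = A.ID);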

So, the real question: what's making the current query so costly? If it's simply the amount of data that must be gone through, then playing with alternate queries is unlikely to result in something dramatically faster. If, on the other hand, the sheer volume of different tables involved is causing a really convoluted plan to be generated, then an alternative that might create simpler plans (maybe option 3, definitely option 4) could be worth a try.

You'll probably want to look at the actual query plans being generated, in addition to looking at the time each alternative takes.
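
If it helps, the standard statistics switches give per-table I/O and timing numbers for each run (and SSMS's "Include Actual Execution Plan" option shows the plan itself):

    SET STATISTICS IO ON;    -- logical reads, broken down per table
    SET STATISTICS TIME ON;  -- parse/compile time plus execution CPU and elapsed time

    -- run the candidate query here

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;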

Also, keep in mind that perfection is not always desirable. If the current query is being used for reporting, for example, and the reports are taking longer than you'd like but still an acceptable amount of time, then spending a lot of time to shave off 25% of the current run time may simply not be worth it. If, on the other hand, a daily report is taking 25 hours to run, that's a different scenario.