SQL Server – Efficient join with data spread into different tables

join;sql server

I have to work with an ancient database schema on Microsoft SQL Server where records are split into different tables but referenced in a central table.

A record in my central table will look like this:

| ID | Firstname | Lastname | Table    |
|----|-----------|----------|----------|
| 1  | John      | Smith    | users001 |
| 2  | North     | West     | users005 |

The database will then contain multiple tables users001, users002 through usersN where individual action records are stored. I can join a users-table with the central table using the ID column (ID is a foreign key in every users-table).
These tables can be quite large (500k rows or more).
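
For example, a single users-table can be joined to the central table like this (I'm calling the central table Central here; the real name differs):

    SELECT c.ID, c.Firstname, c.Lastname, u.*
      FROM Central c
           INNER JOIN users001 u ON (u.ID = c.ID)  -- ID is the foreign key in every users-table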

Now, in order to do calculations over all users, I created a view that basically does a UNION ALL of all users-tables, which I can then query.
Obviously that is quite expensive, hence my question:
Any idea how one can efficiently join these tables?

Unfortunately, a stored procedure is not an option, as the application performing the calculations can only work with tables or views (a BI suite I cannot influence).

Best Answer

You've got a couple of options - but none of them are likely to produce lightning-swift results.

Note I'll be making a few assumptions below. The main one is that all the usersXXX tables have the same structure (same column names, same datatypes).

  1. Massive LEFT JOIN - if there are fewer than 256 usersXXX tables (or possibly more - see the note below the code), and you're only joining to the central table (we'll call that User, for convenience), then you can (in theory) create a single statement like the following:

    SELECT U.ID, U.firstname, U.lastname, COALESCE(U001.field1, U002.field1, ... , U255.field1) AS field1, ...
      FROM [User] U  -- the central table
             LEFT JOIN users001 U001 ON (U.ID = U001.ID)
             LEFT JOIN users002 U002 ON (U.ID = U002.ID)
            ...
             LEFT JOIN users255 U255 ON (U.ID = U255.ID)
    

    Note that every reference to a given field from the usersXXX tables will need to COALESCE that field across all 255 possible tables.

    SQL Server's documentation claims an upper limit of 256 tables in a query, which is why that's the limit used here. However, the documentation is not consistent: another page states that the maximum number of tables per SELECT statement is "Limited only by available resources". In addition, experiments have shown that queries can contain more than 256 tables; one test involved 599 tables.

    The awkward nature of referring to the COALESCEd fields makes this hard to maintain and hard to read.

  2. Your current view. I'm assuming (if you're using the view to do calculations over all users) that you may not necessarily be joining to the User table.

    So, your view probably looks something like this (you can add an INNER JOIN to User if you want):

    SELECT field1, field2, ... FROM users001
    UNION ALL
    SELECT field1, field2, ... FROM users002
    UNION ALL
    ...
    UNION ALL
    SELECT field1, field2, ... FROM usersZZZ
    

    You've already stated that this leaves something to be desired performance-wise - but we'll get back to that.
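
    The optional join back to User mentioned above then just sits on top of the view; a minimal sketch, assuming the view is named AllUsers and that each branch also selects ID (field names are placeholders):

    SELECT U.firstname, U.lastname, V.field1, V.field2
      FROM AllUsers V
           INNER JOIN [User] U ON (U.ID = V.ID)
     WHERE V.field2 > 0;  -- example filter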

  3. Next, there's direct use of UNION ALL. Instead of going through a view (which simplifies constructing the query), use the actual UNION ALL of the tables directly, as sketched below. This may perform better than option 2 if you're using a WHERE clause. It depends on how the query optimizer handles things: if there are relatively few records to bring back, but the query optimizer insists on materializing the view before applying the WHERE clause, then this will help. If (as I suspect) the query optimizer would apply the WHERE clause at the individual table level rather than trying to materialize the view, then this would probably have little performance benefit.
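
    A minimal sketch of that inline form, with the WHERE clause repeated in every branch so each usersXXX table can be filtered before the rows are concatenated (field names and the predicate are placeholders):

    SELECT field1, field2 FROM users001 WHERE field2 > 0
    UNION ALL
    SELECT field1, field2 FROM users002 WHERE field2 > 0
    UNION ALL
    -- ... one branch per usersXXX table
    SELECT field1, field2 FROM usersZZZ WHERE field2 > 0;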

  4. Last, you can query each table individually and put the results in a temp table, as sketched below. The possible performance advantage here is that each individual query would be much simpler, would take far less time to generate a plan for, and would run relatively quickly. The disadvantages are that a separate plan would be generated for each table (meaning that even the relatively short time to generate these query plans must be multiplied by the number of tables for the first run), and that the "simple" query must be run NNN times (where NNN is the total number of tables).
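
    A rough sketch of that approach (column names and datatypes are assumptions; in practice you would generate one INSERT per usersXXX table, e.g. from the Table column of the central table):

    CREATE TABLE #AllUsers (ID int, field1 int, field2 int);

    INSERT INTO #AllUsers (ID, field1, field2)
    SELECT ID, field1, field2 FROM users001 WHERE field2 > 0;

    INSERT INTO #AllUsers (ID, field1, field2)
    SELECT ID, field1, field2 FROM users002 WHERE field2 > 0;

    -- ... one INSERT per usersXXX table, then query (or join) #AllUsers
    SELECT U.firstname, U.lastname, A.field1
      FROM #AllUsers A
           INNER JOIN [User] U ON (U.ID = A.ID);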

So, the real question: what's making the current query so costly? If it's simply the amount of data that must be gone through, then playing with alternate queries is unlikely to result in something dramatically faster. If, on the other hand, the sheer volume of different tables involved is causing a really convoluted plan to be generated, then an alternative that might create simpler plans (maybe option 3, definitely option 4) could be worth a try.

You'll probably want to look at the actual query plans being generated, in addition to looking at the time each alternative takes.
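
If it helps, the standard statistics switches give per-table I/O and timing numbers for each run (and SSMS's "Include Actual Execution Plan" option shows the plan itself):

    SET STATISTICS IO ON;    -- logical reads, broken down per table
    SET STATISTICS TIME ON;  -- parse/compile time plus execution CPU and elapsed time

    -- run the candidate query here

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;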

Also, keep in mind that perfection is not always desirable. If the current query is being used for reporting, for example, and the reports are taking longer than you'd like but still an acceptable amount of time, then spending a lot of time to shave off 25% of the current run time may simply not be worth it. If, on the other hand, a daily report is taking 25 hours to run, that's a different scenario.