I have to work with an ancient database schema on Microsoft SQL Server where records are split into different tables but referenced in a central table.
A record in my central table will look like this:
| ID | Firstname | Lastname | Table    |
|----|-----------|----------|----------|
| 1  | John      | Smith    | users001 |
| 2  | North     | West     | users005 |
The database then contains multiple tables users001, users002, through usersN, where the individual action records are stored. I can join a users table with the central table on the ID column (ID is a foreign key in every users table).
These tables can be quite long (500k rows or more).
Now, in order to do calculations over all users, I created a view that basically does a UNION ALL of all users-tables which I can then query.
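Roughly, the view looks like this (the column names here are simplified placeholders for the real ones):

```sql
-- Simplified sketch of the current view: one UNION ALL over every users table.
CREATE VIEW dbo.AllUserActions
AS
    SELECT ID, ActionDate, ActionType FROM dbo.users001
    UNION ALL
    SELECT ID, ActionDate, ActionType FROM dbo.users002
    UNION ALL
    -- ... one SELECT per usersXXX table ...
    SELECT ID, ActionDate, ActionType FROM dbo.users099;
```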
Obviously that is quite expensive, hence my question: any idea how one can efficiently join these tables?
A stored procedure is unfortunately not an option, as the application performing the calculations can only work with tables or views (a BI suite I cannot influence).
Best Answer
You've got a couple of options - but none of them are likely to produce lightning-swift results.
Note: I'll be making a few assumptions below. The main one is that all the `usersXXX` tables have the same structure (same column names, same datatypes).

**Option 1 - massive `LEFT JOIN`**: if there are fewer than 256 `usersXXX` tables (or more, maybe - see the note on limits below), and you're only joining to the central table (we'll call that `User`, for convenience), then you can (in theory) write a single `SELECT` that `LEFT JOIN`s every `usersXXX` table to `User` at once.
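A sketch of such a statement, assuming just three `usersXXX` tables and an illustrative `ActionDate` column:

```sql
-- Sketch only: every usersXXX table LEFT JOINed to the central table at once;
-- each field must then be COALESCEd across all the joined tables.
SELECT u.ID,
       u.Firstname,
       u.Lastname,
       COALESCE(t1.ActionDate, t2.ActionDate, t3.ActionDate) AS ActionDate
FROM dbo.[User] AS u
LEFT JOIN dbo.users001 AS t1 ON t1.ID = u.ID
LEFT JOIN dbo.users002 AS t2 ON t2.ID = u.ID
LEFT JOIN dbo.users003 AS t3 ON t3.ID = u.ID;
```

Since each user's rows live in exactly one table (per the `Table` column in the central table), at most one of the joins matches for any given row.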
Note that every reference to a given field from the `usersXXX` tables will need to `COALESCE` the field across the possible 255 tables. SQL Server claims to have an upper limit of 256 tables in a query, which is why that's our limit here. However, the documentation is not consistent: another page says the maximum number of tables per `SELECT` statement is "Limited only by available resources", and experiments have shown that queries can contain more than 256 tables - one test involved 599 tables. The awkward nature of referring to the `COALESCE`d fields makes this hard to maintain and hard to read.

**Option 2 - your current view**: I'm assuming (since you're using the view to do calculations over all users) that you may not necessarily be joining to the `User` table. So your view is probably a straight `UNION ALL` of all the `usersXXX` tables (you can add an `INNER JOIN` to `User` if you want). You've already stated that this leaves something to be desired performance-wise - but we'll get back to that.
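With the optional join to `User` included, such a view might look like this (a sketch; `ActionDate` stands in for the real columns):

```sql
-- Sketch: the UNION ALL view, with each branch joined back to the central table.
CREATE VIEW dbo.AllUserActionsWithNames
AS
    SELECT u.ID, u.Firstname, u.Lastname, t.ActionDate
    FROM dbo.users001 AS t
    INNER JOIN dbo.[User] AS u ON u.ID = t.ID
    UNION ALL
    -- ... one branch per usersXXX table ...
    SELECT u.ID, u.Firstname, u.Lastname, t.ActionDate
    FROM dbo.users099 AS t
    INNER JOIN dbo.[User] AS u ON u.ID = t.ID;
```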
**Option 3 - direct `UNION ALL`**: instead of going through a view (which simplifies constructing the query), use the actual `UNION ALL` of the tables directly. This may perform better than option 2 if you're using a `WHERE` clause. It depends on how the query optimizer handles things - if there are relatively few records to bring back, but the optimizer insists on materializing the view before applying the `WHERE` clause, then this will help. If (as I suspect) the optimizer would apply the `WHERE` clause at the individual table level rather than trying to materialize the view, then this would probably have little performance benefit.

**Option 4 - temp table**: query each table individually, and put the results in a temp table. The possible performance advantage here is that each individual query would be much simpler, would take far less time to generate a plan for, and would run relatively quickly. The disadvantages are that a separate plan would be generated for each table (meaning that even the relatively short time to generate these query plans must be multiplied by the number of tables for the first run), and that the "simple" query must be run NNN times (where NNN is the total number of tables).
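Option 3, sketched with the `WHERE` clause written directly against the inlined `UNION ALL` (the filter condition is illustrative):

```sql
-- Sketch: the UNION ALL inlined into the query instead of hidden behind a view,
-- so the WHERE clause is applied explicitly in each branch.
SELECT ID, ActionDate
FROM dbo.users001
WHERE ActionDate >= '20230101'
UNION ALL
SELECT ID, ActionDate
FROM dbo.users002
WHERE ActionDate >= '20230101';
-- ... one branch per usersXXX table ...
```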
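And option 4 as a sketch (again with illustrative names); each `INSERT` is a simple single-table query that gets its own small plan:

```sql
-- Sketch: collect each usersXXX table into a temp table, one simple query at a time.
CREATE TABLE #AllUserActions
(
    ID         int      NOT NULL,
    ActionDate datetime NULL
);

INSERT INTO #AllUserActions (ID, ActionDate)
SELECT ID, ActionDate FROM dbo.users001;

INSERT INTO #AllUserActions (ID, ActionDate)
SELECT ID, ActionDate FROM dbo.users002;

-- ... one INSERT per usersXXX table, then run the calculation against #AllUserActions.
```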
So, the real question - what's making the current query so costly? If it's simply the amount of data that must be gone through, then playing with alternate queries is unlikely to produce something dramatically faster. If, on the other hand, the sheer number of different tables involved is causing a really convoluted plan to be generated, then an alternative that might create simpler plans (maybe option 3, definitely option 4) could be worth a try.
You'll probably want to look at the actual query plans being generated, in addition to looking at the time each alternative takes.
Also, keep in mind that perfection is not always desirable. If the current query is being used for reporting, for example, and the reports are taking longer than you'd like but still an acceptable amount of time, then spending a lot of effort to shave 25% off the current run time may simply not be worth it. If, on the other hand, a daily report is taking 25 hours to run, that's a different scenario.