SQL Server – A dynamic WHERE clause in MS SQL Server

c#, dynamic-sql, group-by, sql-server, stored-procedures

The Problem

My SQL Server database contains a view that looks like the following:

+-------------+-----------+------+-------+------+------+
| CAR_ID      | MAKE      |COLOR | Miles | Seats| MODEL|
+-------------+-----------+------+-------+------+------+
| 1           | FORD      | RED  | 100   | 4    | CAR  |
| 2           | TOYOTA    | RED  | 500   | 2    | TRK  |
| 3           | FORD      | BLACK| 10    | 4    | CAR  |
| 4           | HYUNDAI   | BLUE | 150   | 4    | VAN  |
| ...         | ...       | ...  | ...   | ...  | ...  |
| 1000        | TOYOTA    | GRN  | 200   | 8    | CAN  |
+-------------+-----------+------+-------+------+------+

I would like to create custom groupings from this table. My desired output might look like the following:

+-------------+-----------+
| Group       | sum(Miles)|
+-------------+-----------+
| Red Cars    | 500       |
| Red Toyota  | 325       |
| Trucks      | 563       | 
| Black Car   | 952       |
| ...         | ...       |
| Black Ford  | 100       |
+-------------+-----------+

Indeed, it looks like a simple "GROUP BY" query. However, in my case the groups can become rather complex, with many logical conditions, and the number of groups is very large, which makes a "GROUP BY" query intractable for me to use. My solution has been to use the WHERE clause to create custom groups. Some may call this dynamic SQL.
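To illustrate why, a plain GROUP BY can only partition the rows along fixed column values, for example:

-- Illustrative only: groups rows by fixed columns, one bucket per row.
SELECT Color, Model, SUM(Miles) AS sum_miles
  FROM CAR_TABLE
 GROUP BY Color, Model;

Because one row can belong to many of my groups (a red Toyota car falls into both "Red Cars" and "Red Toyota") and each group has its own arbitrary predicate, no single set of grouping columns produces the output above.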

My Solution

I have created a C# script which I use to store the group definitions and the query itself. The group definitions are stored in a dictionary. I iterate through the dictionary, passing each grouping to the WHERE clause of a SELECT statement, and then I send the SQL statement to the SQL Server.

The server returns a DataTable with my custom grouping to my C# application, and I then bulk insert the DataTable into another table on the SQL Server. If I have 50 groupings, I query the SQL Server 50 times and bulk insert 50 times. Here is an example:

var x = new Dictionary<string, string>
{
    { "GRP1", "(COLOR = 'BLACK' OR Miles < 50)" },
    { "GRP2", "(Seats < 5 AND MODEL = 'CAR')" }
};

foreach (var key in x.Keys)
{
    // Build one SELECT per group, fetch the rows, and bulk insert them into another table.
    string y = "SELECT * FROM CAR_TABLE WHERE " + x[key];
    DataTable dt = QueryServer(y);
    BulkInsertToDB(dt);
}

Question

There must be a better and more efficient way of doing this. I was thinking about throwing my query into a stored procedure on the server, but I would still have to make calls to the stored procedure for each grouping. I then thought perhaps there is a way to store these groupings server-side and run all of the queries in a single stored procedure. Any guidance would be helpful.
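For reference, the kind of stored procedure I had in mind would look roughly like the following (the procedure and destination table names are just placeholders); I would still have to call it once per grouping:

CREATE PROCEDURE dbo.usp_SumGroupMiles   -- placeholder name
    @GroupName   nvarchar(128),
    @WhereClause nvarchar(2000)
AS
BEGIN
    -- Builds the same per-group query my C# loop builds today, but writes the
    -- result straight into a results table on the server instead of returning
    -- a DataTable for bulk insert. dbo.GROUP_TOTALS is a placeholder table.
    DECLARE @sql nvarchar(MAX) =
        N'INSERT INTO dbo.GROUP_TOTALS (Group_Name, sum_miles)
          SELECT @grp, SUM(Miles) FROM CAR_TABLE WHERE ' + @WhereClause + N';';
    EXECUTE sp_executesql @sql, N'@grp nvarchar(128)', @grp = @GroupName;
END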

Clarifications:

One row could belong to many groups. I need both the sum of miles and the sum of vehicle age. I would use the key of the dictionary as the group name, pass the key to SQL Server, and update the DataTable in C# with the group name.

Best Answer

There are a number of ways to do this purely in SQL. They will, logically enough, also require dynamic SQL.

First, copy your dictionary into a table in SQL Server - I'll refer to it as Group_Def, with columns Group_Name and Where_Clause.

Option 1: replicate your logic in SQL. For each group, generate a series of SQL statements (in a variable) that will add the group name and the CAR_ID for all cars matching your criteria to a temporary table. Execute the SQL statements in the variable to populate the temp table. Then, join that temporary table to CAR_TABLE on CAR_ID, group by Group_Name, and apply your aggregates.

Option 2: Grab the aggregate values as you go. This time, instead of writing all the matching CAR_ID values into the temp table, write the actual aggregate values. Then, simply select from the temp table.

It may be easier to follow if you see the code:

/*
| 1           | FORD      | RED  | 100   | 4    | CAR  |
| 2           | TOYOTA    | RED  | 500   | 2    | TRK  |
| 3           | FORD      | BLACK| 10    | 4    | CAR  |
| 4           | HYUNDAI   | BLUE | 150   | 4    | VAN  |
| ...         | ...       | ...  | ...   | ...  | ...  |
| 1000        | TOYOTA    | GRN  | 200   | 8    | CAN  |
*/

IF (OBJECT_ID('tempdb..#CAR_TABLE') IS NOT NULL) DROP TABLE #CAR_TABLE;
CREATE TABLE #CAR_TABLE
 ( CAR_ID int IDENTITY(1,1)
  ,Make   varchar(50)
  ,Color  varchar(50)
  ,Miles  int
  ,Seats  int
  ,Model  varchar(50)
);

INSERT INTO #CAR_TABLE (Make, Color, Miles, Seats, Model)
VALUES ('FORD', 'RED', 100, 4, 'CAR')
      ,('TOYOTA', 'RED', 500, 2, 'TRK')
      ,('FORD', 'BLACK', 10, 4, 'CAR')
      ,('HYUNDAI', 'BLUE', 150, 4, 'VAN')
      ,('TOYOTA', 'GRN', 200, 8, 'CAN')
;


/*
Dictionary x = {GRP1, (COLOR = BLACK or Miles <50),
               GRP2, (Seats <5 AND MODEL = CAR)};

*/

IF (OBJECT_ID('tempdb..#Group_Def') IS NOT NULL) DROP TABLE #Group_Def;
CREATE TABLE #Group_Def (Group_Name nvarchar(128), Where_Clause nvarchar(2000));

INSERT INTO #Group_Def
VALUES (N'GRP1', N'(Color = ''BLACK'' OR Miles < 50)')
      ,(N'GRP2', N'(Seats < 5 AND Model = ''CAR'')')
      ,(N'Red Cars', N'(Color = ''RED'' AND Model = ''CAR'')')
      ,(N'Red Toyotas', N'(color = ''RED'' AND Make = ''TOYOTA'')')
      ,(N'All', N'(1 = 1)')
;


--OPTION 1

IF (OBJECT_ID('tempdb..#cars_by_group') IS NOT NULL) DROP TABLE #cars_by_group;
CREATE TABLE #cars_by_group (Group_Name nvarchar(128), CAR_ID int);

DECLARE @sqlcmd nvarchar(MAX); 
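-- Seed @sqlcmd with an INSERT ... SELECT that returns no rows (WHERE 1 = 0),
-- so that each group's query can then be appended uniformly with UNION ALL.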

SELECT @sqlcmd = N'INSERT INTO #cars_by_group
SELECT NULL, NULL FROM #Group_Def WHERE 1 = 0
';

SELECT @sqlcmd = @sqlcmd + N'UNION ALL
SELECT N''' + Group_Name + N''', CAR_ID FROM #CAR_TABLE WHERE ' + Where_Clause + N'
'
  FROM #Group_Def
;

EXECUTE sp_executesql @sqlcmd;

SELECT t.Group_Name
      ,SUM(c.Miles) as sum_miles
  FROM #cars_by_group t INNER JOIN #CAR_TABLE c ON (t.CAR_ID = c.CAR_ID)
 GROUP BY t.Group_Name
;




--OPTION 2:

IF (OBJECT_ID('tempdb..#group_totals') IS NOT NULL) DROP TABLE #group_totals;
CREATE TABLE #group_totals (Group_Name nvarchar(128), sum_miles int);

DECLARE @sqlcmd2 nvarchar(MAX); 

SELECT @sqlcmd2 = N'INSERT INTO #group_totals
SELECT NULL, NULL FROM #Group_Def WHERE 1 = 0
';

SELECT @sqlcmd2 = @sqlcmd2 + N'UNION ALL
SELECT N''' + Group_Name + N''', SUM(Miles) FROM #CAR_TABLE WHERE ' + Where_Clause + N'
'
  FROM #Group_Def
;

EXECUTE sp_executesql @sqlcmd2;

SELECT *
  FROM #group_totals
;

Option 2 may be slightly faster, since you don't have to calculate the aggregates in the final SELECT. Note, however, that even though I've built @sqlcmd as a single SQL statement, the UNION ALL is joining together functionally separate queries. Performance-wise, you could rewrite this so that each row in #Group_Def generated a distinct INSERT statement, and performance would not change significantly; you're basically running one statement per group regardless.
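For illustration, that one-INSERT-per-group variant might look something like the sketch below (it reuses #Group_Def and #group_totals from Option 2; the cursor is just one way to loop, and only the WHERE clause has to be concatenated in, since the group name can be passed as a parameter):

DECLARE @grp nvarchar(128), @where nvarchar(2000), @stmt nvarchar(MAX);

DECLARE grp_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT Group_Name, Where_Clause FROM #Group_Def;

OPEN grp_cur;
FETCH NEXT FROM grp_cur INTO @grp, @where;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- One dynamic INSERT per group.
    SET @stmt = N'INSERT INTO #group_totals (Group_Name, sum_miles)
SELECT @grp_name, SUM(Miles) FROM #CAR_TABLE WHERE ' + @where + N';';
    EXECUTE sp_executesql @stmt, N'@grp_name nvarchar(128)', @grp_name = @grp;

    FETCH NEXT FROM grp_cur INTO @grp, @where;
END

CLOSE grp_cur;
DEALLOCATE grp_cur;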

I can envision a third option that would collect all the aggregates in a single step; however, it's not as scalable, and requires that all your aggregates can be expressed in terms of sums or counts, and still ultimately requires a SELECT for each group.

You would build a single SELECT statement, with columns for each value you want to record for each group. You'd want a version of each group name without spaces to use in the column names. To collect the aggregates, you would use SUM with CASE expressions. For example:

SELECT SUM(CASE WHEN (Color = 'RED' AND Make = 'TOYOTA') THEN 1 ELSE 0 END) as red_toyotas_count
      ,SUM(CASE WHEN (Color = 'RED' AND Make = 'TOYOTA') THEN Miles ELSE 0 END) as red_toyotas_sum_miles
...
  INTO #single_row_total
  FROM CAR_TABLE

(No GROUP BY would be necessary - all output columns would be aggregates).

Then, you build the SELECT statements to pull all the values you need out of the #single_row_total table (again, with dynamic SQL, going through the Group_Def table to build the statement), using UNION ALL to tie them together:

SELECT 'Red Toyotas' as [Group], red_toyotas_sum_miles as sum_miles, red_toyotas_sum_miles * 1.00 / red_toyotas_count as avg_miles FROM #single_row_total
UNION ALL
SELECT 'Red Cars' as [Group], red_cars_sum_miles as sum_miles, red_cars_sum_miles * 1.00 / red_cars_count as avg_miles FROM #single_row_total
UNION ALL
...

This would allow all the aggregates to be collected in a single set-based operation. As noted, breaking the single result into separate rows still requires one SELECT per group; however, these selects, being on a single row, should be quite quick.

This isn't scalable because there's an upper limit on the number of columns a single table can have (1,024 for an ordinary SQL Server table), so enough groups, multiplied by enough aggregates per group, could require more columns than are allowed.

Also, while the query pulling the aggregates is a single SQL query, evaluating every CASE expression against every row could make the execution plan so hideous that the one-SELECT-per-group options would perform as well or better.

So, I leave the creation of Option 3 (as well as testing with much larger data sets, to determine the fastest option) up to the OP, or another interested party.

As to the dynamic SQL aspect of this: unless the groups can be defined on an ad hoc basis, you're building your SQL statements with no user-supplied values, so there's no risk of SQL injection. And (I think), if #Group_Def doesn't change, the plan generated might wind up being reusable (not sure, not planning on checking; wasn't part of the original question).