MySQL: using a variable in an IN condition

MySQL, stored-procedures

[EDIT] The query below is not the actual query I am having performance issues with; it is only a simplified version. I would just like to know whether it is possible to store multiple results in a variable and use that variable in an IN condition.

I have the query below inside a procedure, and I would like to change it for optimization.

SELECT * FROM request WHERE facility_id IN (SELECT facility_id FROM facility);

What I would like to do is store the subquery result in a variable and use that variable in the main query.

SET @facilities = (SELECT facility_id FROM facility);

SELECT * FROM request WHERE facility_id IN (@facilities);

But I get the error "Subquery returns more than 1 row".
How should I do it?
Any help is appreciated.

I found a link that suggests this approach: http://www.mysqlperformancetuning.com/how-to-speed-up-mysql-by-optimization?nocache=1#comment-961. But I cannot figure it out.

Best Answer

You can use the GROUP_CONCAT aggregate function to produce a comma-delimited list of values:

SET @facilities = (SELECT GROUP_CONCAT(facility_id) FROM facility);

The result will be just a string value, however. If you use it like this:

SELECT * FROM request WHERE facility_id IN (@facilities);

you cannot expect to get a match on every ID in the @facilities list. This is because @facilities is not expanded into a list and instead is treated as a single item of the IN list. The commas inside it will be seen as just characters inside a string value, not as syntactic delimiters.

In order to use @facilities the way you want, you will have to build and execute a dynamic query around the value of @facilities. It could be something like this:

SET @stmt = CONCAT('SELECT * FROM request WHERE facility_id IN (', @facilities, ')');
PREPARE stmt FROM @stmt;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

This way the query will not contain a reference to @facilities – instead, it will have the contents of @facilities as its integral part, and the commas will thus be treated as part of the syntax.

Each query that you want to use the @facilities list in would have to be executed in the above manner. I think you will agree it is not very convenient. There are other considerations to keep in mind as well. When you use a dynamic query and concatenate the contents of a variable into the dynamic query as described, you may be open to SQL injection attacks. Also, with a large number of items in an IN list the performance may degrade.
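The difference between a list passed as one string and a list expanded into the query text can be sketched from client code. This is only an illustration under assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the `request_id` column and the sample rows are made up, and MySQL's type coercion differs in detail, so the exact rows returned for the single-string case may not be the same there. Generating one placeholder per ID keeps the values out of the SQL text, which addresses the injection concern.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE facility (facility_id INTEGER PRIMARY KEY);
    CREATE TABLE request  (request_id INTEGER PRIMARY KEY, facility_id INTEGER);
    INSERT INTO facility VALUES (1), (2), (3);
    INSERT INTO request  VALUES (10, 1), (11, 2), (12, 3), (13, 4);
""")

ids = [row[0] for row in conn.execute(
    "SELECT facility_id FROM facility ORDER BY facility_id")]

# Case 1: the whole comma-separated string is bound as ONE item of the IN list.
as_one_string = ",".join(str(i) for i in ids)  # "1,2,3"
single_item_rows = conn.execute(
    "SELECT request_id FROM request WHERE facility_id IN (?)",
    (as_one_string,),
).fetchall()

# Case 2: one placeholder per ID, so the commas are part of the SQL syntax
# while the values themselves are still bound, not concatenated.
placeholders = ", ".join("?" for _ in ids)
sql = (
    "SELECT request_id FROM request "
    f"WHERE facility_id IN ({placeholders}) ORDER BY request_id"
)
expanded_rows = conn.execute(sql, ids).fetchall()

print(single_item_rows)
print(expanded_rows)
```

In this sketch the single-string query finds nothing, while the expanded query returns the three requests whose facility exists.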

I would suggest that instead of a variable you use a temporary table to store the IDs:

CREATE TEMPORARY TABLE tmp_facilities AS SELECT facility_id FROM facility;

A table is much more flexible in that you are not tied to just one syntactic structure. Although you could still use the temporary table in an IN predicate, as in your example:

SELECT
  *
FROM
  request
WHERE
  facility_id IN (
    SELECT facility_id FROM tmp_facilities
  )
;

you could also use it in an EXISTS subquery:

SELECT
  *
FROM
  request AS r
WHERE
  EXISTS (
    SELECT * FROM tmp_facilities AS f WHERE f.facility_id = r.facility_id
  )
;

or in a filtering join:

SELECT
  r.*
FROM
  request AS r
  INNER JOIN tmp_facilities AS f ON f.facility_id = r.facility_id
;

Having several ways to write your queries can be helpful when you need to tune them for performance.
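As a quick check that the three forms are interchangeable for this purpose, here is a minimal sketch that runs them against the same temporary table. It assumes Python's built-in sqlite3 module in place of MySQL, and the `request_id` column and sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE facility (facility_id INTEGER PRIMARY KEY);
    CREATE TABLE request  (request_id INTEGER PRIMARY KEY, facility_id INTEGER);
    INSERT INTO facility VALUES (1), (2);
    INSERT INTO request  VALUES (10, 1), (11, 2), (12, 3);
    CREATE TEMPORARY TABLE tmp_facilities AS SELECT facility_id FROM facility;
""")

queries = {
    # IN predicate with a subquery on the temporary table
    "in": """
        SELECT r.request_id FROM request AS r
        WHERE r.facility_id IN (SELECT facility_id FROM tmp_facilities)
        ORDER BY r.request_id
    """,
    # Correlated EXISTS subquery
    "exists": """
        SELECT r.request_id FROM request AS r
        WHERE EXISTS (
            SELECT * FROM tmp_facilities AS f
            WHERE f.facility_id = r.facility_id
        )
        ORDER BY r.request_id
    """,
    # Filtering join
    "join": """
        SELECT r.request_id FROM request AS r
        INNER JOIN tmp_facilities AS f ON f.facility_id = r.facility_id
        ORDER BY r.request_id
    """,
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
print(results)
```

The three results agree here because each `facility_id` appears once in `tmp_facilities`. If the temporary table could contain duplicate IDs, the join form would repeat request rows while IN and EXISTS would not.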

Related Solutions

MySQL subqueries that use range based on values of main queries don’t use indices properly

This small data sample is not enough to show that the behavior you are describing exists. Indeed, I tested the query on a larger data set and it did use the index (using MySQL 5.5.30).

The problem is that when the optimizer determines that using an index would result in an inordinately large number of matches -- compared to the total number of rows in the table -- it won't use an index, because that could actually perform worse than simply scanning the whole table, and it will exhibit exactly the behavior this example illustrates... it knows the index is a candidate, but it chooses not to use it.

But I would suggest the problem lies in the fact that you're using a subquery in a place where a subquery isn't really necessary or called for.

I rewrote this as a join, because, from what I can tell, this is what you're asking the database to do: join each row in Person to the matching row(s) in CofeeBreaks where that person took their break during that window, and average the ages of the attendees.

I also built this on SQL Fiddle. I removed TEMPORARY from the table definitions because the Fiddle doesn't seem to support temporary tables properly (probably because it uses a connection pool).

SELECT cb.id,
       cb.cofeeBreakStart,
       cb.cofeeBreakEnd,
       avg(p.age)
  FROM CofeeBreaks cb
  JOIN Person p ON p.lastCofee BETWEEN cb.cofeeBreakStart AND cb.cofeeBreakEnd
 GROUP BY cb.id;

EXPLAIN SELECT on this query shows that the index is being used, even on this small data set. However, if you change the JOIN to a LEFT JOIN (which would show all coffee breaks even if nobody took that particular break, while the JOIN only includes breaks where at least one person did), the index shows up as a candidate but doesn't get used. Again, this is likely because of the cost, and the behavior would likely be different with a larger data set.

The LEFT JOIN version of the query would produce identical results to your subquery regardless of the table data, while the JOIN version only produces identical results if every CofeeBreak had at least one person taking that break, which in your sample data, it does.
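The JOIN versus LEFT JOIN difference can be seen on a tiny data set. The sketch below is an assumption-laden illustration: it uses Python's built-in sqlite3 module rather than MySQL, plain integers instead of real timestamps, and made-up rows in which the second break has no attendees.

```python
import sqlite3

# In-memory SQLite database; table and column names follow the answer,
# the integer "times" and the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CofeeBreaks (
        id INTEGER PRIMARY KEY, cofeeBreakStart INTEGER, cofeeBreakEnd INTEGER
    );
    CREATE TABLE Person (
        id INTEGER PRIMARY KEY, age INTEGER, lastCofee INTEGER
    );
    INSERT INTO CofeeBreaks VALUES (1, 10, 20), (2, 30, 40);
    INSERT INTO Person VALUES (1, 30, 12), (2, 40, 15);
""")

template = """
    SELECT cb.id, AVG(p.age)
      FROM CofeeBreaks cb
      {join} Person p
        ON p.lastCofee BETWEEN cb.cofeeBreakStart AND cb.cofeeBreakEnd
     GROUP BY cb.id
     ORDER BY cb.id
"""

# Inner join: only breaks that at least one person attended.
inner_rows = conn.execute(template.format(join="JOIN")).fetchall()
# Left join: every break, with a NULL average where nobody attended.
left_rows = conn.execute(template.format(join="LEFT JOIN")).fetchall()

print(inner_rows)
print(left_rows)
```

Break 2 is missing from the inner join result and appears with a NULL average (`None` in Python) in the left join result.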

But using the indexes or not, a correlated subquery will not usually scale as well as a join.

http://dev.mysql.com/doc/refman/5.5/en/rewriting-subqueries.html

SQL Server – Table Valued Parameter Not Pulling Correct Data for Stored Procedure

I'm going to suggest working to the solution in two phases. The first part is to make sure all the data you need is being returned. The second part is to combine the rows the way that you want.

I'm having trouble getting SqlFiddle to work with table types as well, so I'm going to work with the query directly. Here is the base query from above:

;WITH f AS
(
SELECT FacilityID FROM @facilities
UNION ALL SELECT @facilitydivisionID
)
SELECT
  di.DocumentNumber, di.DocumentVersion, di.DocumentTitle,tsl.TransmittalSiteLeadFacility, 
  ts.TransmittalNumber, ts.TransmittalVersion,
  FacilityImpDate = COALESCE(tsl.TransmittalSiteLeadImpDate, ts.TransmittalImpDate)
FROM
  TransmittalSiteLead tsl
  INNER JOIN TransmittalSheet as ts
    ON ts.TransmittalID =   tsl.TransmittalSiteLeadTSID
  INNER JOIN documentInfo as di 
    ON ts.transmittalID = di.documentTransmittalImp
  INNER JOIN f
    ON tsl.TransmittalSiteLeadFacility = f.FacilityId
WHERE
  transmittalSiteLeadImpDate 
    BETWEEN  @startdate AND @enddate

Even if there are duplicates or extra information, all of the data you want to return should be in there. If it isn't, then look at the JOINs or WHERE clause to see how to include that info.

The second part of the problem is to condense or limit the rows that are returned. There are a lot of ways to do this and the best way depends on a lot of factors. Here's what I'm sure is a partial list:

SELECT DISTINCT to limit results
GROUP BY with MIN to include information about a related table (See below)
EXISTS to limit the rows returned
ROW_NUMBER() to pick a row based on an ordering (See below)
APPLY with TOP to get some information from a joined table

All this is to say that there are many ways to do this, so do what makes sense for your situation. I've given two ways of doing it below. They aren't the best choice for performance-critical situations, but they are easier to understand.

This query selects the document info and the staff id of someone who is implementing it.

;WITH f AS
(
SELECT FacilityID FROM @facilities
UNION ALL SELECT @facilitydivisionID
)
SELECT
  di.DocumentNumber
  ,di.DocumentVersion
  ,di.DocumentTitle
  ,MIN(tsl.TransmittalSiteLeadStaffId) AS TransmittalSiteLeadStaffId
FROM
  TransmittalSiteLead tsl
  INNER JOIN TransmittalSheet as ts
    ON ts.TransmittalID =   tsl.TransmittalSiteLeadTSID
  INNER JOIN documentInfo as di 
    ON ts.transmittalID = di.documentTransmittalImp
  INNER JOIN f
    ON tsl.TransmittalSiteLeadFacility = f.FacilityId
WHERE
  transmittalSiteLeadImpDate 
    BETWEEN  @startdate AND @enddate
GROUP BY
    di.DocumentNumber
    ,di.DocumentVersion
    ,di.DocumentTitle
ORDER BY
    di.DocumentNumber

Another way of doing it is using ROW_NUMBER. This way is very flexible, but can be more difficult to construct and understand.

This query selects matching documents and the facility and contact of the first to implement each one. PARTITION BY sets how you want to divide the rows into groups, and ORDER BY sets which matching row in each group comes first.

;WITH f AS
(
SELECT FacilityID FROM @facilities
UNION ALL SELECT @facilitydivisionID
),
FullData AS (
    SELECT
      di.DocumentNumber
      ,di.DocumentVersion
      ,di.DocumentTitle
      ,tsl.TransmittalSiteLeadFacility
      ,tsl.TransmittalSiteLeadStaffId
      ,ROW_NUMBER() OVER (PARTITION BY di.DocumentNumber ORDER BY tsl.TransmittalSiteLeadImpDate) AS RowNum
    FROM
      TransmittalSiteLead tsl
      INNER JOIN TransmittalSheet as ts
        ON ts.TransmittalID =   tsl.TransmittalSiteLeadTSID
      INNER JOIN documentInfo as di 
        ON ts.transmittalID = di.documentTransmittalImp
      INNER JOIN f
        ON tsl.TransmittalSiteLeadFacility = f.FacilityId
    WHERE
      transmittalSiteLeadImpDate 
        BETWEEN  @startdate AND @enddate
)
SELECT
    DocumentNumber
    ,DocumentVersion
    ,DocumentTitle
    ,TransmittalSiteLeadFacility
    ,TransmittalSiteLeadStaffId
FROM
    FullData
WHERE
    RowNum = 1
ORDER BY
    DocumentNumber
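To compare the two approaches on something small, here is a minimal sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or later). The single `implementation` table, its columns, and its rows are hypothetical simplifications of the joined tables above, not the real schema.

```python
import sqlite3

# In-memory SQLite database with one flattened, made-up table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE implementation (
        document_number TEXT, facility INTEGER, staff_id INTEGER, imp_date TEXT
    );
    INSERT INTO implementation VALUES
        ('D1', 1, 100, '2020-02-01'),
        ('D1', 2, 200, '2020-01-01'),
        ('D2', 3, 300, '2020-03-01');
""")

# GROUP BY with MIN: one row per document, smallest staff id in the group.
min_rows = conn.execute("""
    SELECT document_number, MIN(staff_id)
      FROM implementation
     GROUP BY document_number
     ORDER BY document_number
""").fetchall()

# ROW_NUMBER: number the rows of each document by date, keep the earliest.
first_rows = conn.execute("""
    WITH FullData AS (
        SELECT document_number, facility, staff_id,
               ROW_NUMBER() OVER (
                   PARTITION BY document_number ORDER BY imp_date
               ) AS RowNum
          FROM implementation
    )
    SELECT document_number, facility, staff_id
      FROM FullData
     WHERE RowNum = 1
     ORDER BY document_number
""").fetchall()

print(min_rows)
print(first_rows)
```

For document D1 the MIN version reports staff 100 (the smallest id), whereas the ROW_NUMBER version returns the whole earliest row, facility 2 with staff 200. That is the practical difference between aggregating a single column and picking a row by ordering.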
