Sql-server – Find the second highest salary of employees

execution-plansql serversql-server-2012

Imagine there is a table with these fields:

Employee: E-Number
          E-Name
          Department
          Salary

We need to write a query in order to find the second highest salary in this table.

I found this query :

select distinct salary
from Employee e1
where 2 = ( select count(distinct salary)
            from Employee e2
            where e1.salary < e2.salary)

I'm a bit confused by this part where 2 = because I have not seen something like that before.

Can anyone here explain more about the query and the part where 2 =? What is the result of the subquery and in what way are table rows (e1, e2) being compared with each other?

Actually this is the execution plan can anyone explain the steps more?

Best Answer

I'm going to address this part of the question:

...this is the execution plan can anyone explain the steps more?

Assuming the table definition looks something like this:

CREATE TABLE dbo.Employee
(
    [E-Number]  integer NOT NULL,
    [E-Name]    nvarchar(50) NOT NULL,
    Department  nvarchar(30) NOT NULL,
    Salary      smallmoney NOT NULL
);

Empty table

The following query (not my recommendation, just an example):

SELECT DISTINCT
    E1.Salary
FROM dbo.Employee AS E1
WHERE
    1 =
    (
        SELECT 
            COUNT(DISTINCT E2.Salary)
        FROM dbo.Employee AS E2
        WHERE
            E2.Salary > E1.Salary
    );

Gives this execution plan (with no data in the table):

The numbers correspond to the Node ID property of each operator.

The E1 Table Scan (4) returns all rows (in no particular order). For each row:
- The Nested Loops join (3) executes the inner side, passing the current E1.Salary value as a parameter.
- The E2 Table Scan (8) finds rows matching E2.Salary > E1.Salary
- The Distinct Sort (7) groups rows by E2.Salary.
- The Stream Aggregate (6) counts the number of rows (1 per distinct E2.Salary). The new count column is given the label Expr1008.
- The Compute Scalar (5) converts the native count data type (bigint) to integer because the query specifies COUNT rather than COUNT_BIG. The new column is given the label Expr1004.
As rows emerge from the Nested Loops join, the Filter (2) applies the predicate Expr1004 = 1
The final Distinct Sort returns the distinct E1.Salary values.

With duplicate Salary values

If we add a few rows to the table, with some duplicate Salary values:

INSERT dbo.Employee
    ([E-Number], [E-Name], Department, Salary)
VALUES
    (1, 'A', 'D1', $1),
    (2, 'B', 'D1', $1),
    (3, 'C', 'D1', $2),
    (4, 'D', 'D1', $2),
    (5, 'E', 'D1', $3),
    (6, 'F', 'D1', $3);

The query above produces a plan with a couple of new operators (highlighted):

The new Sort orders rows by E1.Salary. The ensures that any duplicate Salary values are presented to the join sequentially.

The Lazy Index Spool saves the output from the operators below it as they arrive. These are the Stream Aggregate, Distinct Sort, and E2 Table Scan (with predicate), which as described before filter E2 (using the current value of E1.Salary), remove duplicate E2.Salary values, and count the resulting rows. The Lazy Index Spool saves recomputing the result in case the same Salary value from table E1 is presented on the next iteration of the Nested Loops join.

Because the Sort ordered rows from E1 by E1.Salary, any duplicates are presented to the join one after the other, so the saved result in the spool can be reused instead of re-running the Stream Aggregate, Distinct Sort, and E2 Table Scan each time.

The index on the spool is keyed by Salary, making it quick to find the saved COUNT(DISTINCT E2.Salary) result for the current E1.Salary value.

If you look at the Actual Executions value for each operator in the plan, you will see that the Lazy Index Spool executes 6 times (once per outer row), whereas the Stream Aggregate, Distinct Sort, and E2 Table Scan execute only 3 times (once per distinct salary value).

The optimizer introduces the extra Sort and Spool as a performance optimization, based on automatically collected statistical information about the data in the table.

Related Solutions

Sql-server – How Can the Same Query in Two Nearly Identical Instances Generate Two Different Execution Plans

Execution plans (actual, not estimated) need to be added to the Q for a definitive answer but...

How Can the Same Query in Two Nearly Identical Instances Generate Two Different Execution Plans?

Because, by your admission, they are not identical. Most likely explanation for the different execution plans is a variance in statistics.

Table rows counts are slightly larger on Server A... Statistics are updated nightly at the same time on both servers.

Row counts are different and the statistics update was probably a default sample, rather than FULLSCAN. I've witnessed some comically unfortunate stats histograms as the result of sampling, each of which has been corrected by a FULLSCAN update.

Sql-server – Execution Plan not matching Stored Proc

Your code says you are doing a LEFT OUTER JOIN, but are you really? Your very first WHERE clause (as well as each that follow) filters rows for inclusion from the outer table:

WHERE     (A.[Operator_Name] LIKE '%'+ @OName +'%')

This turns your LEFT OUTER JOIN into an INNER JOIN, whether you meant it or not. Perhaps you meant to move those filters to the ON clause.

Of course I would probably remove the non-sargeable MONTH() function from the JOIN, since this is going to force a complete scan on the remote table, and write it this way instead - also eliminating the join to your local months table:

DECLARE 
  @OName      VARCHAR(50) = 'John', 
  @Start_Date DATE = NULL,--'20120101', -- use safe date formats 
  @End_Date   DATE = '20121231'; -- use semi-colons

-- deal with NULLs here so the query is simpler:

SELECT @Start_Date = COALESCE(@Start_Date, '20010101'),
       @End_Date   = COALESCE(@End_Date, CURRENT_TIMESTAMP);

;WITH n(n) AS 
(
  -- get the # of months you need instead of relying on your Connector table:
  SELECT TOP (DATEDIFF(MONTH, @Start_Date, @End_Date)+1) Number
    FROM master..spt_values
    WHERE [type] = N'P' AND Number >= 0
    ORDER BY Number
), m(m) AS
(
  -- convert those to months:
  SELECT DATEADD(MONTH, n, DATEADD(MONTH, DATEDIFF(MONTH, 0, @Start_Date), 0))
  FROM n
)
SELECT 
  m.m, 
  MonthName = DATENAME(MONTH, m.m),
  [Widget Count] = COUNT(w.Widget_ID)
FROM m
LEFT OUTER JOIN [Server].[Database].dbo.Widget AS w -- meaningful alias
ON 
  w.Widget_Date >= m.m
  AND w.Widget_Date < DATEADD(MONTH, 1, m.m) -- open-ended date range query
  AND w.Operator_Name LIKE '%' + @OName + '%'
GROUP BY m.m
ORDER BY m.m;

Just ignore the m column in the output (you can't order by it without including it in the output, and it doesn't make sense to order by the name).

Another suggestion: don't use 'single quotes' as alias delimiters. Forms of this are deprecated and it also makes column alias look like string literals. Use [square brackets] instead.