Sql-server – Find the second highest salary of employees

execution-plansql serversql-server-2012

Imagine there is a table with these fields:

Employee: E-Number
          E-Name
          Department
          Salary

We need to write a query in order to find the second highest salary in this table.

I found this query :

select distinct salary
from Employee e1
where 2 = ( select count(distinct salary)
            from Employee e2
            where e1.salary < e2.salary)

I'm a bit confused by this part where 2 = because I have not seen something like that before.

Can anyone here explain more about the query and the part where 2 =? What is the result of the subquery and in what way are table rows (e1, e2) being compared with each other?

Actually this is the execution plan can anyone explain the steps more?

enter image description here

Best Answer

I'm going to address this part of the question:

...this is the execution plan can anyone explain the steps more?

Assuming the table definition looks something like this:

CREATE TABLE dbo.Employee
(
    [E-Number]  integer NOT NULL,
    [E-Name]    nvarchar(50) NOT NULL,
    Department  nvarchar(30) NOT NULL,
    Salary      smallmoney NOT NULL
);

Empty table

The following query (not my recommendation, just an example):

SELECT DISTINCT
    E1.Salary
FROM dbo.Employee AS E1
WHERE
    1 =
    (
        SELECT 
            COUNT(DISTINCT E2.Salary)
        FROM dbo.Employee AS E2
        WHERE
            E2.Salary > E1.Salary
    );

Gives this execution plan (with no data in the table):

Basic plan

The numbers correspond to the Node ID property of each operator.

  • The E1 Table Scan (4) returns all rows (in no particular order). For each row:
    • The Nested Loops join (3) executes the inner side, passing the current E1.Salary value as a parameter.
    • The E2 Table Scan (8) finds rows matching E2.Salary > E1.Salary
    • The Distinct Sort (7) groups rows by E2.Salary.
    • The Stream Aggregate (6) counts the number of rows (1 per distinct E2.Salary). The new count column is given the label Expr1008.
    • The Compute Scalar (5) converts the native count data type (bigint) to integer because the query specifies COUNT rather than COUNT_BIG. The new column is given the label Expr1004.
  • As rows emerge from the Nested Loops join, the Filter (2) applies the predicate Expr1004 = 1
  • The final Distinct Sort returns the distinct E1.Salary values.

With duplicate Salary values

If we add a few rows to the table, with some duplicate Salary values:

INSERT dbo.Employee
    ([E-Number], [E-Name], Department, Salary)
VALUES
    (1, 'A', 'D1', $1),
    (2, 'B', 'D1', $1),
    (3, 'C', 'D1', $2),
    (4, 'D', 'D1', $2),
    (5, 'E', 'D1', $3),
    (6, 'F', 'D1', $3);

The query above produces a plan with a couple of new operators (highlighted):

Plan with performance spool

The new Sort orders rows by E1.Salary. The ensures that any duplicate Salary values are presented to the join sequentially.

The Lazy Index Spool saves the output from the operators below it as they arrive. These are the Stream Aggregate, Distinct Sort, and E2 Table Scan (with predicate), which as described before filter E2 (using the current value of E1.Salary), remove duplicate E2.Salary values, and count the resulting rows. The Lazy Index Spool saves recomputing the result in case the same Salary value from table E1 is presented on the next iteration of the Nested Loops join.

Because the Sort ordered rows from E1 by E1.Salary, any duplicates are presented to the join one after the other, so the saved result in the spool can be reused instead of re-running the Stream Aggregate, Distinct Sort, and E2 Table Scan each time.

The index on the spool is keyed by Salary, making it quick to find the saved COUNT(DISTINCT E2.Salary) result for the current E1.Salary value.

If you look at the Actual Executions value for each operator in the plan, you will see that the Lazy Index Spool executes 6 times (once per outer row), whereas the Stream Aggregate, Distinct Sort, and E2 Table Scan execute only 3 times (once per distinct salary value).

The optimizer introduces the extra Sort and Spool as a performance optimization, based on automatically collected statistical information about the data in the table.