Sql-server – Replace cursor with set-based approach

sql serversql-server-2008-r2t-sql

I am looking to replace my cursor-based solution if possible in a particular stored procedure. If it makes any difference, this is running on SQL Server 2008 R2. I am looking more for an algorithm than precise code.

Background:

The SP is part of a system for a company that sends mailings via direct mail or email. The mailings contain a personalized code that the recipient can enter when visiting the merchant to get special discounts or offers. Code usage is tracked and aggregate reports on the response to various mailings is provided to the merchants. A "customer" is defined as the target of one of the mailings with a unique first name, last name, and address; if there is no address, then email replaces address.

The tables at issue are as follows (simplified versions):

CREATE TABLE job (
    id INT PRIMARY KEY IDENTITY (1,1),
    job_num VARCHAR(32) NOT NULL,
    mailing_id INT NOT NULL,
    personal_code NVARCHAR(50) NOT NULL,
    fname NVARCHAR(50) NOT NULL,
    lname NVARCHAR(50) NOT NULL,
    email NVARCHAR(50),
    address NVARCHAR(50),
    city NVARCHAR(50),
    state CHAR(2),
    zip NVARCHAR(10),
    extra NVARCHAR(150)
);

CREATE TABLE customer (
    id INT PRIMARY KEY IDENTITY (1,1),
    fname NVARCHAR(50) NOT NULL,
    lname NVARCHAR(50) NOT NULL,
    email NVARCHAR(50),
    address NVARCHAR(50),
    city NVARCHAR(50),
    state CHAR(2),
    zip NVARCHAR(10)
);

CREATE TABLE personal_code (
    id INT PRIMARY KEY IDENTITY (1,1),
    customer_id INT NOT NULL,
    mailing_id INT NOT NULL,
    personal_code NVARCHAR(50) NOT NULL,
    email NVARCHAR(50),
    FOREIGN KEY (customer_id) REFERENCES customer(id)
);

CREATE TABLE personal_code_extra (
    personal_code_id INT PRIMARY KEY,
    extra NVARCHAR(150),
    FOREIGN KEY (personal_code_id) REFERENCES personal_code(id)
);

The job table is populated by an external process which is involved in the creation of the address lists to which the mailing will be sent. It then calls the SP I wish to optimize (hopefully), passing the job_num. The SP then reads all the records from the job table with that job_num and inserts the data stored there into the other three tables.

The customer table is, as expected, where the data on the various mailing recipients are stored.

The personal_code table is a store for data on the personal codes associated with mailings. Each code is tied to a specific mailing as well as to a particular customer.
Currently, the SP opens a cursor for the job table. It then iterates through each row and for each does the following:

  1. If address is not null, it sets @customer_id equal to customer.id where fname, lname, and address match; otherwise, it sets it equal to customer.id where fname, lname, and email match and address is null or empty.
  2. At this point, if @customer_id is null, there was no matching record and a new record is added to customer. @customer_id is set to scope_identity().
  3. A record is inserted into personal_code.
  4. If job.extra is not null, a record is inserted into personal_code_extra with (of course) the id generated by the insert in Step 3.

customer and personal_code are the largest tables in the database, with 40 million and 64 million records respectively. Even though the queries on customer have covering indexes, doing a separate search for each row in job has to be slowing things down. I would very much like to ditch the cursor and replace this RBAR approach with a set-based approach. What is preventing me from doing so is having to potentially use a newly-created customer_id for the personal_code insert, as well as a newly-created personal_code_id for the "extras" table. If not for that, I could do something like

INSERT INTO personal_code (fields)
SELECT (fields)
FROM job j
INNER JOIN customer c ON j.fname = c.fname AND j.lname = c.lname AND j.address = c.address
WHERE j.address IS NOT NULL;

INSERT INTO personal_code (fields)
SELECT (fields)
FROM job j
INNER JOIN customer c ON j.fname = c.fname AND j.lname = c.lname AND j.email = c.email
WHERE j.address IS NULL;

INSERT INTO personal_code (fields)
SELECT (fields) -- but won't have a value for customer_id !
FROM job j
LEFT JOIN customer c ON either_address_or_email
WHERE c.id IS NULL;

And how to handle the insert of the "extra" stuff in a set-based approach, I have no idea at present. Thanks in advance for any ideas.

EDIT: Cursor code added per request. This is simplified but has all the essentials–hopefully my edits are accurate.

DECLARE job_cur CURSOR FOR
SELECT mailing_id, personal_code, email, fname, lname, address, city, state, zip, extra
FROM job
WHERE job_num = @job_no;

OPEN job_cur;

FETCH NEXT FROM job_cur INTO @mailing_id, @personal_code, @email, @fname, @lname, @address, @city, @state, @zip, @extra;

WHILE @@FETCH_STATUS = 0
BEGIN

    IF ISNULL(@address, '') != ''
    SET @customer_id = (
        SELECT id 
        FROM customer 
        WHERE fname = @fname
        AND lname = @lname
        AND address = @address
    );
    ELSE
    SET @customer_id = (
        SELECT id 
        FROM customer 
        WHERE fname = @fname
        AND lname = @lname
        AND email = @email
        AND (address IS NULL OR address = '')
    ); 

    IF @customer_id IS NULL
    BEGIN

        INSERT INTO customer (
            fname, lname, address, city, state, zip, email
        )
        VALUES (
            @fname, @lname, @address, @city, @state, @zip, @email
        );

        SET @customer_id = SCOPE_IDENTITY();

    END

    INSERT INTO personal_code (
        customer_id, mailing_id, personal_code, email
    )
    VALUES (
        @customer_id, @mailing_id, @personal_code, @email
    );
    SET @personal_code_id = SCOPE_IDENTITY();

    IF @extra IS NOT NULL           
        INSERT INTO personal_code_extra (
            personal_code_id, extra
        )
        VALUES (
            @personal_code_id, @extra
        );

    FETCH NEXT FROM job_cur INTO @mailing_id, @personal_code, @email, @fname, @lname, @address, @city, @state, @zip, @extra;

END

Best Answer

I would probably just do this the brute force way, and add indexes to support these joins where they don't exist. Not much gain to treating new customers and old customers any different once you've inserted all the customers that don't already exist:

INSERT dbo.Customer(fname, lname, address, city, state, zip, email)
    SELECT fname, lname, address, city, state, zip, email
     FROM dbo.job AS j
     WHERE job_no = @job_no
     AND NOT EXISTS
     (
        SELECT 1 FROM dbo.Customer
        WHERE fname = j.fname
        AND lname = j.lname
        AND (address = j.address OR email = j.email)
     );

INSERT INTO dbo.personal_code (customer_id, mailing_id, personal_code, email)
SELECT c.customer_id, j.mailing_id, j.personal_code, c.email
  FROM dbo.Customer AS c
  INNER JOIN dbo.job AS j
  ON c.fname = j.fname AND c.lname = j.lname 
  AND (c.address = j.address OR c.email = j.email)
  WHERE j.job_no = @job_no;

INSERT dbo.personal_code_extra(personal_code_id, extra)
SELECT pc.personal_code_id, j.extra
  FROM dbo.personal_code AS pc
  INNER JOIN dbo.Customer AS c
  ON pc.customer_id = c.customer_id
  INNER JOIN dbo.job AS j
  ON c.fname = j.fname AND c.lname = j.lname 
  AND (c.address = j.address OR c.email = j.email)
  WHERE j.job_no = @job_no;