SQL Server Data Extraction – Handling Multiple Delimiters

sql servert-sql

I have a legacy data source column that is delimitted by semicolons and commas. The first semicolon indicates the last name, the second indicates the first and middle name (or initials), and the last semicolon indicates the type of individual. The comma indicates that a new name has began. Here is a sample of this data.

+-------+---------------------------------------------------------------------------------------------------------------------+
|  ID   | SOURCE                                                                                                              |
+-------+---------------------------------------------------------------------------------------------------------------------+
| 62963 | RENZ;MICHAEL;DECEASED,WANDER;MARIA;MINOR,WANDER;HENRY RUDOLPH;MINOR,WANDER;ROSA;MINOR,WANDER;PAUL EMIL;MINOR        |
| 62964 | HERNDON;A C;ESTATE,BERRING;A F;DECEASED,BEIRING;A F;DECEASED,BEIRING;ANDREAS FREDERICK;DECEASED                     |
| 62965 | ZINCH;;ESTATE,ZINTZ;;ESTATE,HAYNES;HENRY;DECEASED                                                                   |
| 62965 | ZINCH;;ESTATE,ZINTZ;;ESTATE,HAYNES;HENRY;DECEASED                                                                   |
| 62966 | KRAUS;JOSEPHINE;MINOR,KENNEDY;GEORGE;DECEASED                                                                       |
| 62967 | CAREY;JAMES;ESTATE,DE LA GARZA;REFUGIO;DECEASED                                                                     |
| 62968 | LEWIS;FLORENCE;ESTATE,LOCKWOOD;ALBERT A;DECEASED                                                                    |
| 62969 | GLAESER;EMMA;MINOR,GLAESER;HERMAN JR;MINOR,GLAESER;HERMAN;MINOR,RODRIGUEZ;HILARIO;DECEASED,RODRIGUEZ;MARIE;DECEASED |
| 62970 | STORY;BETTIE;ESTATE,EIGENDORFF;FRANZ;DECEASED                                                                       |
| 62971 | HOWELL;MAMIE;MINOR,HOWELL;ETHEL;MINOR                                                                               |
+-------+---------------------------------------------------------------------------------------------------------------------+

I am attempting to pull the data in a manner such as this so that it can be adapted to a different schema:

+-----------+------------+-------------+-------------------+----------+
|      ID   |   SEQUENCE |    LAST     |    FIRSTMIDDLE    |   TYPE   |
+-----------+------------+-------------+-------------------+----------+
|     62963 |          1 | RENZ        | MICHAEL           | DECEASED |
|     62963 |          2 | WANDER      | MARIA             | MINOR    |
|     62963 |          3 | WANDER      | HENRY RUDOLPH     | MINOR    |
|     62963 |          4 | WANDER      | ROSA              | MINOR    |
|     62963 |          5 | WANDER      | PAUL EMIL         | MINOR    |
|     62964 |          1 | HERNDON     | A C               | ESTATE   |
|     62964 |          2 | BERRING     | A F               | DECEASED |
|     62964 |          3 | BEIRING     | A F               | DECEASED |
|     62964 |          4 | BEIRING     | ANDREAS FREDERICK | DECEASED |
|     62965 |          1 | ZINCH       |                   | ESTATE   |
|     62965 |          2 | ZINTZ       |                   | ESTATE   |
|     62965 |          3 | HAYNES      | HENRY             | DECEASED |
|     62966 |          1 | KRAUS       | JOSEPHINE         | MINOR    |
|     62966 |          2 | KENNEDY     | GEORGE            | DECEASED |
|     62967 |          1 | CAREY       | JAMES             | ESTATE   |
|     62967 |          2 | DE LA GARZA | REFUGIO           | DECEASED |
|     62968 |          1 | LEWIS       | FLORENCE          | ESTATE   |
|     62968 |          2 | LOCKWOOD    | ALBERT A          | DECEASED |
|     62969 |          1 | GLAESER     | EMMA              | MINOR    |
|     62969 |          2 | GLAESER     | HERMAN JR         | MINOR    |
|     62969 |          3 | GLAESER     | HERMAN            | MINOR    |
|     62969 |          4 | RODRIGUEZ   | HILARIO           | DECEASED |
|     62969 |          5 | RODRIGUEZ   | MARIE             | DECEASED |
|     62970 |          1 | STORY       | BETTIE            | ESTATE   |
|     62970 |          2 | EIGENDORFF  | FRANZ             | DECEASED |
|     62971 |          1 | HOWELL      | MAMIE             | MINOR    |
|     62971 |          2 | HOWELL      | ETHEL             | MINOR    |
+-----------+------------+-------------+-------------------+----------+

This type of data extraction is something I am not all too familiar with. I am thinking I need to use a complex combination of SUBSTRING and CHARINDEX, but given that the number of entries that the source column can contain varies, I am not sure how best to approach this. Any guidance on where I should begin would be incredibly helpful.

Best Answer

As long as the legacy data is as described and does not have any odd variations, the following should work as it produces the desired output.

Notes:

The following two code sections should be run together in SSMS as they are using a table variable and hence need to be in the same query batch. I just separated it into two pieces to make it easier to focus on just the main query.
The splitter I used, String_Split, is SQLCLR-based and is available in the SQL# library (which I am the author of, but String_Split, and others, are available in the Free version). However, any splitter that returns the row/item # should work just the same (just don't use one that is based on a WHILE loop).

Test setup:

DECLARE @SampleData TABLE (ID INT NOT NULL, SourceStuff VARCHAR(MAX) NOT NULL);

INSERT INTO @SampleData (ID, SourceStuff) VALUES (62963, 'RENZ;MICHAEL;DECEASED,WANDER;MARIA;MINOR,WANDER;HENRY RUDOLPH;MINOR,WANDER;ROSA;MINOR,WANDER;PAUL EMIL;MINOR');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62964, 'HERNDON;A C;ESTATE,BERRING;A F;DECEASED,BEIRING;A F;DECEASED,BEIRING;ANDREAS FREDERICK;DECEASED');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62965, 'ZINCH;;ESTATE,ZINTZ;;ESTATE,HAYNES;HENRY;DECEASED');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62966, 'KRAUS;JOSEPHINE;MINOR,KENNEDY;GEORGE;DECEASED');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62967, 'CAREY;JAMES;ESTATE,DE LA GARZA;REFUGIO;DECEASED');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62968, 'LEWIS;FLORENCE;ESTATE,LOCKWOOD;ALBERT A;DECEASED');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62969, 'GLAESER;EMMA;MINOR,GLAESER;HERMAN JR;MINOR,GLAESER;HERMAN;MINOR,RODRIGUEZ;HILARIO;DECEASED,RODRIGUEZ;MARIE;DECEASED');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62970, 'STORY;BETTIE;ESTATE,EIGENDORFF;FRANZ;DECEASED');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62971, 'HOWELL;MAMIE;MINOR,HOWELL;ETHEL;MINOR');

Main query:

;WITH cte AS
(
  SELECT sd.ID,
         split.SplitNum,
         split.SplitVal,
         CHARINDEX(N';', split.SplitVal) AS [FirstDelimeter],
         CHARINDEX(N';', split.SplitVal,
                         (CHARINDEX(N';', split.SplitVal) + 1)) AS [SecondDelimeter]
  FROM   #SampleData sd
  CROSS APPLY SQL#.String_Split(sd.SourceStuff, N',', 1) split
)
SELECT tmp.ID,
       tmp.SplitNum AS [Sequence],
       SUBSTRING(tmp.SplitVal,
                 1,
                 (tmp.FirstDelimeter - 1)) AS [Last],
       SUBSTRING(tmp.SplitVal,
                 (tmp.FirstDelimeter + 1),
                 (tmp.SecondDelimeter - (tmp.FirstDelimeter + 1))) AS [FirstMiddle],
       SUBSTRING(tmp.SplitVal,
                 (tmp.SecondDelimeter + 1),
                 LEN(tmp.SplitVal)) AS [Type]
FROM   cte tmp;

Related Solutions

T-sql – Merging data from multiple tables

You don't mention what engine this is going against so I'm assuming it's Microsoft SQL Server. One word of advice is that you put the join predicate and specify the type of join in the FROM clause, it will make errors like this much more visible as well as making reading the query easier. You forgot to specify the join type for the db1.dbo.Products table, which I think makes it a cross join, so vastly multiplying the number of rows you would get back. See below.

merge into factOrders fo
using (
    select CustomerKey
        ,ProductKey
        ,CategoryKey
        ,dt1.TimeKey as [OrderDateKey]
        ,dt2.TimeKey as [RequiredDateKey]
        ,dt3.TimeKey as [ShippedDateKey]
        ,o.OrderID
        ,Freight
        ,od.UnitPrice as [UnitPrice]
        ,od.Quantity as [Quantity]
        ,od.Discount as [Discount]
        ,od.UnitPrice * od.Quantity * (1 - Discount) as [TotalPrice]
        ,sh.CompanyName as [ShipperName]
        ,sh.Phone as [ShipperPhone]
    from db1.dbo.Orders o
        INNER JOIN db1.dbo.[Order Details] od ON o.OrderID = od.OrderID
        INNER JOIN db1.dbo.Shippers sh ON sh.ShipperID = o.ShipVia
        INNER JOIN dimCustomers dc ON dc.CustomerID = o.CustomerID
        INNER JOIN dimProducts dp ON dp.ProductID = od.ProductID
        INNER JOIN dimCategories dct ON dct.CategoryID = p.CategoryID
        INNER JOIN db1.dbo.Products p ON 
        INNER JOIN dimTime dt1 ON dt1.date = o.OrderDate
        INNER JOIN dimTime dt2 ON dt2.date = o.RequiredDate
        INNER JOIN dimTime dt3 ON dt3.date = o.ShippedDate
    ) o
    on (
            fo.ProductKey = o.ProductKey
            and fo.CustomerKey = o.CustomerKey
            and fo.CategoryKey = o.CategoryKey
            and fo.OrderDateKey = o.OrderDateKey
            )

Send Multiple Attachments with Send DBMail – SQL Server Guide

The semi-colon is a statement terminator, and you haven't correctly escaped your attachments. In other words, you've done this:

EXEC msdb.dbo.sp_send_dbmail
  @profile_name = 'Mod',
  @from_address = 'modis@modisglobal.com',
  @recipients= 'rsmith@gmail.com',
  @subject= 'Test Email', 
  @body = @body1,
  @file_attachments = @filename1;

@filename2; -- Syntax Error

According to the MSDN article on sp_send_dbmail:

[ @file_attachments= ] 'file_attachments'

Is a semicolon-delimited list of file names to attach to the e-mail message. Files in the list must be specified as absolute paths. The attachments list is of type nvarchar(max). By default, Database Mail limits file attachments to 1 MB per file.

You need to do this:

Declare @filenames varchar(max)
Set @filenames = 'C:\Testfiles\Test1.csv;C:\Testfiles\Test2.csv'

EXEC msdb.dbo.sp_send_dbmail
  @profile_name = 'Mod',
  @from_address = 'modis@modisglobal.com',
  @recipients= 'rsmith@gmail.com',
  @subject= 'Test Email', 
  @body = @body1,
  @file_attachments = @filenames;

If you want to programmatically add attachments, you would do something like this (notice how the second and third files begin with a semi-colon):

DECLARE @filenames varchar(max)
DECLARE @file1 VARCHAR(MAX) = 'C:\Testfiles\Test1.csv'
SELECT @filenames = @file1

-- Optional new attachments
DECLARE @file2 VARCHAR(MAX) = ';C:\Testfiles\Test2.csv'
DECLARE @file3 VARCHAR(MAX) = ';C:\Testfiles\Test3.csv'

-- Create list from optional files
SELECT @filenames = @file1 + @file2 + @file3

-- Send the email
EXEC msdb.dbo.sp_send_dbmail
  @profile_name = 'Mod',
  @from_address = 'modis@modisglobal.com',
  @recipients= 'rsmith@gmail.com',
  @subject= 'Test Email', 
  @body = @body1,
  @file_attachments = @filenames;

Best Answer

Related Solutions

T-sql – Merging data from multiple tables

Send Multiple Attachments with Send DBMail – SQL Server Guide

Related Question