SQL Server Data Extraction – Handling Multiple Delimiters

sql servert-sql

I have a legacy data source column that is delimitted by semicolons and commas. The first semicolon indicates the last name, the second indicates the first and middle name (or initials), and the last semicolon indicates the type of individual. The comma indicates that a new name has began. Here is a sample of this data.

+-------+---------------------------------------------------------------------------------------------------------------------+
|  ID   | SOURCE                                                                                                              |
+-------+---------------------------------------------------------------------------------------------------------------------+
| 62963 | RENZ;MICHAEL;DECEASED,WANDER;MARIA;MINOR,WANDER;HENRY RUDOLPH;MINOR,WANDER;ROSA;MINOR,WANDER;PAUL EMIL;MINOR        |
| 62964 | HERNDON;A C;ESTATE,BERRING;A F;DECEASED,BEIRING;A F;DECEASED,BEIRING;ANDREAS FREDERICK;DECEASED                     |
| 62965 | ZINCH;;ESTATE,ZINTZ;;ESTATE,HAYNES;HENRY;DECEASED                                                                   |
| 62965 | ZINCH;;ESTATE,ZINTZ;;ESTATE,HAYNES;HENRY;DECEASED                                                                   |
| 62966 | KRAUS;JOSEPHINE;MINOR,KENNEDY;GEORGE;DECEASED                                                                       |
| 62967 | CAREY;JAMES;ESTATE,DE LA GARZA;REFUGIO;DECEASED                                                                     |
| 62968 | LEWIS;FLORENCE;ESTATE,LOCKWOOD;ALBERT A;DECEASED                                                                    |
| 62969 | GLAESER;EMMA;MINOR,GLAESER;HERMAN JR;MINOR,GLAESER;HERMAN;MINOR,RODRIGUEZ;HILARIO;DECEASED,RODRIGUEZ;MARIE;DECEASED |
| 62970 | STORY;BETTIE;ESTATE,EIGENDORFF;FRANZ;DECEASED                                                                       |
| 62971 | HOWELL;MAMIE;MINOR,HOWELL;ETHEL;MINOR                                                                               |
+-------+---------------------------------------------------------------------------------------------------------------------+

I am attempting to pull the data in a manner such as this so that it can be adapted to a different schema:

+-----------+------------+-------------+-------------------+----------+
|      ID   |   SEQUENCE |    LAST     |    FIRSTMIDDLE    |   TYPE   |
+-----------+------------+-------------+-------------------+----------+
|     62963 |          1 | RENZ        | MICHAEL           | DECEASED |
|     62963 |          2 | WANDER      | MARIA             | MINOR    |
|     62963 |          3 | WANDER      | HENRY RUDOLPH     | MINOR    |
|     62963 |          4 | WANDER      | ROSA              | MINOR    |
|     62963 |          5 | WANDER      | PAUL EMIL         | MINOR    |
|     62964 |          1 | HERNDON     | A C               | ESTATE   |
|     62964 |          2 | BERRING     | A F               | DECEASED |
|     62964 |          3 | BEIRING     | A F               | DECEASED |
|     62964 |          4 | BEIRING     | ANDREAS FREDERICK | DECEASED |
|     62965 |          1 | ZINCH       |                   | ESTATE   |
|     62965 |          2 | ZINTZ       |                   | ESTATE   |
|     62965 |          3 | HAYNES      | HENRY             | DECEASED |
|     62966 |          1 | KRAUS       | JOSEPHINE         | MINOR    |
|     62966 |          2 | KENNEDY     | GEORGE            | DECEASED |
|     62967 |          1 | CAREY       | JAMES             | ESTATE   |
|     62967 |          2 | DE LA GARZA | REFUGIO           | DECEASED |
|     62968 |          1 | LEWIS       | FLORENCE          | ESTATE   |
|     62968 |          2 | LOCKWOOD    | ALBERT A          | DECEASED |
|     62969 |          1 | GLAESER     | EMMA              | MINOR    |
|     62969 |          2 | GLAESER     | HERMAN JR         | MINOR    |
|     62969 |          3 | GLAESER     | HERMAN            | MINOR    |
|     62969 |          4 | RODRIGUEZ   | HILARIO           | DECEASED |
|     62969 |          5 | RODRIGUEZ   | MARIE             | DECEASED |
|     62970 |          1 | STORY       | BETTIE            | ESTATE   |
|     62970 |          2 | EIGENDORFF  | FRANZ             | DECEASED |
|     62971 |          1 | HOWELL      | MAMIE             | MINOR    |
|     62971 |          2 | HOWELL      | ETHEL             | MINOR    |
+-----------+------------+-------------+-------------------+----------+

This type of data extraction is something I am not all too familiar with. I am thinking I need to use a complex combination of SUBSTRING and CHARINDEX, but given that the number of entries that the source column can contain varies, I am not sure how best to approach this. Any guidance on where I should begin would be incredibly helpful.

Best Answer

As long as the legacy data is as described and does not have any odd variations, the following should work as it produces the desired output.

Notes:

  • The following two code sections should be run together in SSMS as they are using a table variable and hence need to be in the same query batch. I just separated it into two pieces to make it easier to focus on just the main query.
  • The splitter I used, String_Split, is SQLCLR-based and is available in the SQL# library (which I am the author of, but String_Split, and others, are available in the Free version). However, any splitter that returns the row/item # should work just the same (just don't use one that is based on a WHILE loop).

Test setup:

DECLARE @SampleData TABLE (ID INT NOT NULL, SourceStuff VARCHAR(MAX) NOT NULL);

INSERT INTO @SampleData (ID, SourceStuff) VALUES (62963, 'RENZ;MICHAEL;DECEASED,WANDER;MARIA;MINOR,WANDER;HENRY RUDOLPH;MINOR,WANDER;ROSA;MINOR,WANDER;PAUL EMIL;MINOR');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62964, 'HERNDON;A C;ESTATE,BERRING;A F;DECEASED,BEIRING;A F;DECEASED,BEIRING;ANDREAS FREDERICK;DECEASED');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62965, 'ZINCH;;ESTATE,ZINTZ;;ESTATE,HAYNES;HENRY;DECEASED');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62966, 'KRAUS;JOSEPHINE;MINOR,KENNEDY;GEORGE;DECEASED');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62967, 'CAREY;JAMES;ESTATE,DE LA GARZA;REFUGIO;DECEASED');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62968, 'LEWIS;FLORENCE;ESTATE,LOCKWOOD;ALBERT A;DECEASED');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62969, 'GLAESER;EMMA;MINOR,GLAESER;HERMAN JR;MINOR,GLAESER;HERMAN;MINOR,RODRIGUEZ;HILARIO;DECEASED,RODRIGUEZ;MARIE;DECEASED');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62970, 'STORY;BETTIE;ESTATE,EIGENDORFF;FRANZ;DECEASED');
INSERT INTO @SampleData (ID, SourceStuff) VALUES (62971, 'HOWELL;MAMIE;MINOR,HOWELL;ETHEL;MINOR');

Main query:

;WITH cte AS
(
  SELECT sd.ID,
         split.SplitNum,
         split.SplitVal,
         CHARINDEX(N';', split.SplitVal) AS [FirstDelimeter],
         CHARINDEX(N';', split.SplitVal,
                         (CHARINDEX(N';', split.SplitVal) + 1)) AS [SecondDelimeter]
  FROM   #SampleData sd
  CROSS APPLY SQL#.String_Split(sd.SourceStuff, N',', 1) split
)
SELECT tmp.ID,
       tmp.SplitNum AS [Sequence],
       SUBSTRING(tmp.SplitVal,
                 1,
                 (tmp.FirstDelimeter - 1)) AS [Last],
       SUBSTRING(tmp.SplitVal,
                 (tmp.FirstDelimeter + 1),
                 (tmp.SecondDelimeter - (tmp.FirstDelimeter + 1))) AS [FirstMiddle],
       SUBSTRING(tmp.SplitVal,
                 (tmp.SecondDelimeter + 1),
                 LEN(tmp.SplitVal)) AS [Type]
FROM   cte tmp;