SQL Server 2012 – Using LEFT or SUBSTRING with Wildcard

sql-server-2012t-sql

I am looking to retrieve information from a field, but am having troubles getting exactly what I want. The information I want has no set order and I am also looking for two different things. Here is an example of the information I am working with.

|PEvent|   PDate     |         PDescription                          |
|  3  | 2014-11-21 | Padding, 90 Series, Stage, SF550                |
|  4  | 2014-11-21 | Padding, 45 Series, 144 Stage SF90, 15 spor, AR |
|  5  | 2013-03-06 | 400 Series, 96 STAGE RE FLEX 30                 | 
|  6  | 2015-01-09 | Padding, 20 Series, 13 Stage SF 550, Housing    |

What I want to get is SF and the number and FLEX and the number that follows it, in these cases SF550, SF90, SF 550 FLEX 30. I would also like to get FLEX and the number that follows it. The problem I am having is that there are so many differences in the data. I cannot say 10 spaces from the left since it is not always 10 spaces. I have the same problem from the right. The other issue I have is that sometimes SF and FLEX will be right against the number where other times there will be a space in between them.

I have created a SQL Fiddle sample database with more records for testing.

SQL Fiddle

Best Answer

The following assumptions have been made (some of them possibly repeating parts of your description):

  1. SF/FLEX is always followed either by a number immediately or by a space character and then a number.

  2. There is always one space character before SF/FLEX, unless the item is at the beginning of the string.

  3. There is always a comma or a space after the number that follows SF/FLEX, unless the item is at the end of the string.

With those in mind, here is my approach:

SELECT
  t.*,
  Item = LEFT(item.S, CHARINDEX(' ', item.S + ' ', LEN(prefix.S) + spaces.N + 1))
FROM
  dbo.testdesc AS t
  CROSS JOIN (VALUES ('SF'), ('FLEX')) AS prefix (S)
  CROSS JOIN (VALUES (0), (1)) AS spaces (N)
  CROSS APPLY (SELECT NULLIF(PATINDEX('% ' + prefix.S + SPACE(spaces.N) + '[0-9]%', ' ' + t.PDescription), 0)) AS startpos (N)
  CROSS APPLY (SELECT CHARINDEX(',', t.PDescription + ',', startpos.N + LEN(prefix.S) + spaces.N)) AS endpos (N)
  CROSS APPLY (SELECT SUBSTRING(t.PDescription, startpos.N, endpos.N - startpos.N)) AS item (S)
WHERE
  startpos.N IS NOT NULL
;

This is the output of the above query:

PEvent  PDate                    PDescription                                               Item
------  -----------------------  ---------------------------------------------------------  -------
3       2014-11-21 00:00:00.000  Padding, 90 Series, Stage, SF550                           SF550
4       2014-11-21 00:00:00.000  Padding, 45 Series, 144 Stage SF90, 15 spor, AR            SF90
8       2013-06-14 00:00:00.000  Padding, 900 Series, 140 Stage SF1100, 1/16, Money, TS XB  SF1100
10      2014-12-30 00:00:00.000  Padding, 444 Series, 47 Stage SF330, 5/9, house            SF330
6       2015-01-09 00:00:00.000  Padding, 20 Series, 13 Stage SF 550, Housing               SF 550
5       2013-03-06 00:00:00.000  400 Series, 96 STAGE RE FLEX 30                            FLEX 30
7       2013-06-14 00:00:00.000  Padding, FLEX 10 400 Series, 11 Stage, BYHMP               FLEX 10
9       2015-02-27 00:00:00.000  Flex 10 40 87 stage                                        Flex 10

And to explain some of the trickery employed by the query, this is how it works:

  1. The original row set is cross joined with two inline views, one that represents the required prefixes (prefix) and one that specifies how many spaces there may be between the prefix and the subsequent number (spaces). This way the query will probe for all combinations of prefixes and numbers of delimiting spaces (SF, SF , FLEX, FLEX ).

  2. The PDescription string is searched for each combination, one after one, to calculate the starting position of the prefix. It uses PATINDEX, which, for each prefix and space number, builds its own search pattern. As per the assumptions above, PATINDEX expects a space before the prefix and at least one digit after the prefix and, possibly, a space.

    So, if, for instance, the current prefix is SF and the current expected number of spaces after it is 1, the pattern would be % SF [0-9]%. To make sure there is a space before the prefix, a space is added at the beginning of PDescription.

    That trick accomplishes two things: it helps to find the prefix if it is at the beginning of the string and it makes the resulting starting position accurate. The latter is important because technically the returned position would match not the prefix itself but the preceding space. So, to get the actual position of the prefix, we would need to increment the result by 1. However, because we had added a space at the beginning of the source, all the positions were shifted forward already. So, in the end, the result points exactly at the beginning of the prefix.

    One final touch here is that if the returned position is 0, it is transformed into NULL by NULLIF. That helps to avoid errors about negative length values passed to SUBSTRING later.

  3. The found position is then used in CHARINDEX to find the first comma in PDescription found after that position. Again, to make sure there is a comma, the character is appended to PDescription. The result constitutes the ending position of the sought item.

  4. Using the obtained starting and ending positions, the item is extracted from PDescription using the SUBSTRING function.

  5. Finally, in the SELECT clause, before actually returning the item, the query additionally checks if there is a space after the number, just in case the item was not properly delimited by a comma. By appending a space to the item, it uses the same trick as when searching for the comma position, also taking into account the length of the prefix and the expected number of spaces after the prefix to make sure the found space character will be the one after the number.

  6. In cases where the prefix is not found in the string, Item returns NULL. Such rows, however, are excluded from the output by the WHERE clause, which specifies that the starting position be not NULL.

I am not sure if the above description is clear enough but hopefully it will help together with the code itself.