How to Extract a Repeating String Pattern from Multiple Rows in SQL Server 2012

sql-server-2012t-sql

I have the following data:

|  ID  |                 BadCol                        | 
+------+-----------------------------------------------+         
|  1   | All of R489804 R489805 R489806 R489807 R489808|    
|  2   | SP fr R308999                                 |           
|  3   | R308777, R104621, R708931                     | 

I'm only interested in the R###### pattern (which will always be an R followed by 6 numerical characters). I can retrieve the location of first occurrence using the following function:

PATINDEX('R[0-9][0-9][0-9][0-9][0-9][0-9]', BadCol)   

Ideally I'd like to pivot the pattern into it's own row like so:

|  ID  | GoodCol | 
+------+---------+         
|  1   | R489804 |    
|  1   | R489805 |    
|  1   | R489806 |    
|  1   | R489807 |    
|  1   | R489808 |    
|  2   | R308999 |           
|  3   | R308777 |  
|  3   | R104621 |  
|  3   | R708931 |  

I've seen some great resources on how to do this when the data is delimited the same way, such as here. Unfortunately the data in this column is delimited differently from row to row. I've also explored a recursive CTE like what is detailed here, but unfortunately both of these methodologies use CHARINDEX() which allows an optional start_location parameter. PATINDEX() unfortunately does not take a start_location parameter.

Best Answer

Here is a sample of how I would do this:

select 
      a.id
    , substring(a.value,T.N,7)
from 
    (
    values
        (1,'All of R489804 R489805 R489806 R489807 R489808'),   
        (2,'SP fr R308999                                 '),          
        (3,'R308777, R104621, R708931                     ')
    ) a(id,value)
cross apply (select N from Tally where N<len(a.value)-5) T(N)
where substring(a.value,T.N,7) like 'R[0-9][0-9][0-9][0-9][0-9][0-9]';

Here, I am using a Tally (or numbers) table to walk the length of each value, extracting substrings of length 7 that match the required format.