SQL Server Pattern Matching – Using PATINDEX with Closing Square Bracket

Tags: pattern-matching, sql-server, sql-server-2014, string-searching, t-sql

I am writing a custom JSON parser in T-SQL.

For the purpose of my parser, I am using the PATINDEX function that calculates the position of a token from a list of tokens. The tokens in my case are all single characters and they include these:

{ } [ ] : ,

Usually, when I need to find the (first) position of any of several given characters, I use the PATINDEX function like this:

PATINDEX('%[abc]%', SourceString)

The function will then give me the first position of a or b or c – whichever happens to be found first – in SourceString.
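For example (a quick illustration with a made-up string, not from my actual data):

SELECT PATINDEX('%[abc]%', 'xyzbca');  -- returns 4: b is the first of a, b, c to occur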

Now the problem in my case seems to be connected with the ] character. As soon as I specify it in the character list, e.g. like this:

PATINDEX('%[[]{}:,]%', SourceString)

my intended pattern apparently breaks, because the function never finds a match. (SQL Server reads the leading [[] as a wildcard set containing just the [ character, so the rest of the pattern, {}:,], is treated as a literal sequence.) It looks like I need a way to escape the first ] so that PATINDEX treats it as one of the lookup characters rather than as a special symbol.
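A quick repro (again with a made-up string) shows the effect; the pattern only matches the literal sequence [{}:,], so ordinary JSON input yields 0:

SELECT PATINDEX('%[[]{}:,]%', '{"f1":"v1"}');  -- returns 0: no match, although {, : and } are all present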

I have found a question asking about a similar problem.

However, in that case the ] simply does not need to be specified in brackets, because it is just one character and can be specified without brackets around it. The alternative solution, which does use escaping, works only for LIKE and not for PATINDEX, because it relies on an ESCAPE clause, which the former supports and the latter does not.

So, my question is, is there any way to look for a ] with PATINDEX using the [ ] wildcard? Or is there a way to emulate that functionality using other Transact-SQL tools?

Additional Information

Here is an example of a query where I need to use PATINDEX with the […] pattern as above. The pattern here works, up to a point, because it does not include the ] character. I need it to work with ] as well:

WITH
  data AS (SELECT CAST('{"f1":["v1","v2"],"f2":"v3"}' AS varchar(max)) AS ResponseJSON),
  parser AS
  (
    -- Anchor member: locate the first token and split the input around it
    SELECT
      Level         = 1,
      OpenClose     = 1,
      P             = p.P,                                                     -- position of the token
      S             = SUBSTRING(d.ResponseJSON, 1, NULLIF(p.P, 0) - 1),       -- text before the token
      C             = SUBSTRING(d.ResponseJSON, NULLIF(p.P, 0), 1),           -- the token itself
      ResponseJSON  = SUBSTRING(d.ResponseJSON, NULLIF(p.P, 0) + 1, 999999)   -- remainder after the token
    FROM
      data AS d
      CROSS APPLY (SELECT PATINDEX('%[[{]%', d.ResponseJSON)) AS p (P)
    UNION ALL
    -- Recursive member: find the next token in the remainder of the string
    SELECT
      Level         = ISNULL(d.OpenClose - 1, 0) + d.Level + ISNULL(oc.OpenClose, 0),
      OpenClose     = oc.OpenClose,
      P             = d.P + p.P,
      S             = SUBSTRING(d.ResponseJSON, 1, NULLIF(p.P, 0) - 1),
      C             = c.C,
      ResponseJSON  = SUBSTRING(d.ResponseJSON, NULLIF(p.P, 0) + 1, 999999)
    FROM
      parser AS d
      CROSS APPLY (SELECT PATINDEX('%[[{}:,]%' COLLATE Latin1_General_BIN2, d.ResponseJSON)) AS p (P)
      CROSS APPLY (SELECT SUBSTRING(d.ResponseJSON, NULLIF(p.P, 0), 1)) AS c (C)
      CROSS APPLY (SELECT CASE WHEN c.C IN ('[', '{') THEN 1 WHEN c.C IN (']', '}') THEN 0 END) AS oc (OpenClose)
    WHERE p.P <> 0
  )
SELECT
  *
FROM
  parser
OPTION
  (MAXRECURSION 0)
;

The output I get is:

Level  OpenClose  P   S      C   ResponseJSON
-----  ---------  --  -----  --  ---------------------------
1      1          1          {   "f1":["v1","v2"],"f2":"v3"}
1      NULL       6   "f1"   :   ["v1","v2"],"f2":"v3"}
2      1          7          [   "v1","v2"],"f2":"v3"}
2      NULL       12  "v1"   ,   "v2"],"f2":"v3"}
2      NULL       18  "v2"]  ,   "f2":"v3"}
2      NULL       23  "f2"   :   "v3"}
2      0          28  "v3"   }   

You can see that the ] is included as part of S in one of the rows. The Level column indicates the depth of nesting of brackets and braces. As you can see, once the level reaches 2, it never returns to 1. It would return to 1 if I could make PATINDEX recognise ] as a token.

The expected output for the above example is:

Level  OpenClose  P   S     C   ResponseJSON
-----  ---------  --  ----  --  ---------------------------
1      1          1         {   "f1":["v1","v2"],"f2":"v3"}
1      NULL       6   "f1"  :   ["v1","v2"],"f2":"v3"}
2      1          7         [   "v1","v2"],"f2":"v3"}
2      NULL       12  "v1"  ,   "v2"],"f2":"v3"}
2      0          17  "v2"  ]   ,"f2":"v3"}
1      NULL       18        ,   "f2":"v3"}
1      NULL       23  "f2"  :   "v3"}
1      0          28  "v3"  }

You can play with this query at db<>fiddle.


We are using SQL Server 2014 and are unlikely to upgrade any time soon to a version that supports JSON parsing natively. I could write an application to do the job, but the results of the parsing need to be processed further, which implies more work in the application than just the parsing. That kind of work would be much easier, and probably more efficient, to do in a T-SQL script, if only I could apply it directly to the results.

It's very unlikely that I can use SQLCLR as a solution for this problem. However, I don't mind if someone decides to post a SQLCLR solution, since that could be useful for others.

Best Answer

My own solution, which is more of a workaround, was to specify a character range that includes the ] and to use that range along with the other characters in the [ ] wildcard. I based the range on the ASCII table, where the ] character is located in the following neighbourhood:

Hex  Dec  Char
---  ---  ----
…
5A   90   Z
5B   91   [
5C   92   \
5D   93   ]
5E   94   ^
5F   95   _
…

My range, therefore, took the form of [-^, i.e. it included four characters: [, \, ], ^. I also specified that the pattern use a Binary collation, to match the ASCII range exactly. The resulting PATINDEX expression ended up looking like this:

PATINDEX('%[[-^{}:,]%' COLLATE Latin1_General_BIN2, MyJSONString)
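A quick check against a fragment of the example string confirms that the ] is now found:

SELECT PATINDEX('%[[-^{}:,]%' COLLATE Latin1_General_BIN2, '"v2"],"f2":"v3"}');  -- returns 5, the position of the ]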

The obvious problem with this approach is that the range at the beginning of the pattern includes two unwanted characters, \ and ^. The solution worked for me simply because those extra characters could never occur in the specific JSON strings I needed to parse. Naturally, this cannot be assumed in general, so I am still interested in other methods, hopefully more universal than mine.
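As an aside, one way to emulate the [ ] wildcard with other T-SQL tools, avoiding pattern escaping entirely, is to take the smallest non-zero CHARINDEX over the individual tokens. This is only a sketch (the test string is made up, and it costs one CHARINDEX call per token), but it handles ] with no ranges and no unwanted extras:

SELECT p.P  -- returns 5, the position of the ]
FROM (SELECT '"v2"],"f2":"v3"}' AS s) AS d
CROSS APPLY
(
  SELECT ISNULL(MIN(t.P), 0)  -- 0 when no token is found, as with PATINDEX
  FROM (VALUES
          (CHARINDEX('{', d.s)), (CHARINDEX('}', d.s)),
          (CHARINDEX('[', d.s)), (CHARINDEX(']', d.s)),
          (CHARINDEX(':', d.s)), (CHARINDEX(',', d.s))
       ) AS t (P)
  WHERE t.P > 0  -- ignore tokens that do not occur
) AS p (P);

The CROSS APPLY could replace the PATINDEX call in the recursive part of the parser query above, at the cost of a somewhat longer expression.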