T-SQL – split string for calculations

t-sql

I'm doing some basic statistical calculations on the results of a survey. A survey consists of several sections, and each section has some questions in it. The answers to each question in each section are stored.

The ResultsTable table has a Result column which contains the computed result for the entire survey as well as the overall result for each section, separated by double pipes. For example,

Moderate||Low||High||Low||Low||Low||NA||NA

Above, Moderate is the overall survey result, Low is the result for section 1, High is the result for section 2, and so on.

(I would then compute some statistics, such as the % of people who have High as their result for a specific section; or, of those who have High for a specific section, the % who answered Yes or No to a specific question in that section; etc.)

The problem I'm facing is that I need to split the string, which I think could cause performance problems. What I have right now is as follows.

SELECT COUNT(*) AS [Personal_Health_High_Risk] FROM 
    (
        SELECT rt.* 
        FROM ResultsTable rt
        WHERE 
            rt.CampaignID IN (SELECT Value FROM dbo.ufn_ParseDelimitedString(@campaignIDs, ',')) AND 
            rt.SectionID = @sectionID
    ) sq
WHERE 
    -- where the 3rd section's result is High (RowNumber 1 is the overall result, so section 3 is RowNumber 4)
    (SELECT Value FROM dbo.ufn_ParseDelimitedString(sq.Result, '||') WHERE RowNumber = 4) LIKE '%High%'

The ufn_ParseDelimitedString function splits a string using a given delimiter and returns the resulting tokens in a table. For example,

-- ufn_ParseDelimitedString output:
--  [RowNumber] [Value]
--  1           Moderate
--  2           Low
--  3           High
--  ...
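
For reference, a splitter with this shape can be written as an inline table-valued function using a recursive CTE. This is a simplified sketch, not necessarily the exact body of ufn_ParseDelimitedString:

CREATE FUNCTION dbo.ufn_ParseDelimitedString
(
    @input     NVARCHAR(MAX),
    @delimiter NVARCHAR(10)   -- supports multi-character delimiters like '||'
)
RETURNS TABLE
AS
RETURN
WITH Tokens AS
(
    -- Anchor member: the first token and whatever follows its delimiter
    SELECT
        RowNumber = 1,
        Value     = CASE WHEN CHARINDEX(@delimiter, @input) > 0
                         THEN LEFT(@input, CHARINDEX(@delimiter, @input) - 1)
                         ELSE @input END,
        Remainder = CASE WHEN CHARINDEX(@delimiter, @input) > 0
                         THEN SUBSTRING(@input, CHARINDEX(@delimiter, @input) + LEN(@delimiter), LEN(@input))
                         ELSE CAST(NULL AS NVARCHAR(MAX)) END
    UNION ALL
    -- Recursive member: peel the next token off the remainder
    SELECT
        RowNumber + 1,
        CASE WHEN CHARINDEX(@delimiter, Remainder) > 0
             THEN LEFT(Remainder, CHARINDEX(@delimiter, Remainder) - 1)
             ELSE Remainder END,
        CASE WHEN CHARINDEX(@delimiter, Remainder) > 0
             THEN SUBSTRING(Remainder, CHARINDEX(@delimiter, Remainder) + LEN(@delimiter), LEN(Remainder))
             ELSE NULL END
    FROM Tokens
    WHERE Remainder IS NOT NULL
)
-- Note: recursive CTEs default to a 100-level recursion limit, which is
-- plenty for 8 tokens but worth knowing for longer strings.
SELECT RowNumber, Value
FROM Tokens;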

So, is it better to continue with what I'm doing, basically splitting the string on each SELECT? Or I was thinking of taking the inner SELECT in the code above, saving its output into a temporary table, and actually expanding the concatenated result string into separate columns. That way I could reuse the temporary table for various other calculations. What are your thoughts on this?

Best Answer

Doing string manipulation in SQL is non-optimal; the engine isn't meant for that kind of row-by-row processing.

Your option of using temporary tables to hold intermediate results is going to be your best bet for solving the performance problem. At least that way the string is only split once.
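
As a minimal sketch of that split-once approach (assuming a hypothetical ResponseID key on ResultsTable; #Expanded and the section column names are illustrative):

-- Split each result string once, pivoting RowNumber into columns
SELECT
    rt.ResponseID,                                                   -- hypothetical key column
    OverallResult = MAX(CASE WHEN s.RowNumber = 1 THEN s.Value END),
    Section1      = MAX(CASE WHEN s.RowNumber = 2 THEN s.Value END),
    Section2      = MAX(CASE WHEN s.RowNumber = 3 THEN s.Value END),
    Section3      = MAX(CASE WHEN s.RowNumber = 4 THEN s.Value END)
INTO #Expanded
FROM ResultsTable rt
CROSS APPLY dbo.ufn_ParseDelimitedString(rt.Result, '||') s
WHERE rt.CampaignID IN (SELECT Value FROM dbo.ufn_ParseDelimitedString(@campaignIDs, ','))
  AND rt.SectionID = @sectionID
GROUP BY rt.ResponseID;

-- Every later calculation is then a plain column filter:
SELECT COUNT(*) AS [Personal_Health_High_Risk]
FROM #Expanded
WHERE Section3 LIKE '%High%';

But there are solutions that may work better for you: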

  • Can you modify the import process so the data is split up before it's saved to the database?
  • What about importing to staging tables and using either a SQL Server Agent job or an external process to split the data up into permanent tables? (See the sketch after this list.)
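
As a sketch of that staging idea, assuming a hypothetical dbo.StagingResults table holding the raw string and a permanent dbo.SectionResults table (all names here are illustrative):

-- Scheduled (e.g., Agent job) or one-time split from staging into a
-- permanent, already-normalized table; reporting queries never split again.
INSERT INTO dbo.SectionResults (ResponseID, SectionNumber, Result)
SELECT
    st.ResponseID,
    s.RowNumber - 1,        -- RowNumber 1 holds the overall survey result
    s.Value
FROM dbo.StagingResults st
CROSS APPLY dbo.ufn_ParseDelimitedString(st.Result, '||') s
WHERE s.RowNumber > 1;      -- store the overall result separately if needed

-- e.g. count of respondents with High in section 3, with no string splitting:
-- SELECT COUNT(*) FROM dbo.SectionResults WHERE SectionNumber = 3 AND Result = 'High';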

I'm not sure how much reporting you are doing on older data sets; if you only need to report on the data once or twice, then temp tables are probably your best bet. But if you want or need to do longer-term analysis, then continually splitting the same data is unnecessarily expensive. Storage is usually cheaper than CPU.