Help with Replacing characters with text string chops of last character of string

sql-server-2012t-sql

I need assistance with fixing a replace statement that truncates the last character of the string. The original string (Question_Text) contains HTML characters and other erroneous characters that need to be cleaned up. I included a multiple Replace statement within the Function. it is the Function I need assistance with.

As you can see in the sample data, Question_Text2 has been cleaned up, except it cuts of the last character. Along with the removal of erroneous characters in the middle of some of the questions (see ID = 12165). What am I doing wrong?

I have include code to create the sample data results after the function has been used to illustrate the problem.

/****** Object:  Table [dbo].[tblQuestionsSample]    Script Date: 12/10/2021 5:12:16 PM ******/
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[tblQuestionsSample]') AND type in (N'U'))
DROP TABLE [dbo].[tblQuestionsSample]
GO

/****** Object:  Table [dbo].[tblQuestionsSample]    Script Date: 12/10/2021 5:12:16 PM ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[tblQuestionsSample](
    [QuestionsRecID] [int] IDENTITY(1,1) NOT NULL,
    [Question_ID] [int] NULL,
    [Question_Text] [varchar](1000) NULL,
    [Question_Text2] [varchar](1000) NULL
 CONSTRAINT [PK_dbo.[QuestionsRecID] PRIMARY KEY CLUSTERED 
(
    [QuestionsRecID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
INSERT INTO [dbo].[tblQuestionsSample]
           ([Question_ID]
           ,[Question_Text]
           ,[Question_Text2])
     VALUES
        (11603,'<p>Date/Time</p>','Date/Tim')
        ,(11632,'<p>Attachments</p>','Attachment')
        ,(12166,'<p>Employee ID</p>','Employee I')
        ,(12166,'<p>Work Related?</p>','Work Related')
        ,(12165,'<p>Date &amp; Time of injury/onset illness?</p>','Date &amp; Time of injury/onset illness')
        ,(12165,'<p>Full Injury/Illness Description</p>','Full Injury/Illness Descriptio')
        ,(12165,'<p>Job Title</p>','Job Titl')
go

Results of query:

| id   |  question_text|                                Question_Text2| 
| ---- | -------- | -------------------------------------------------- |
| 11603| <p>Date/Time</p>|                               Date/Tim|
| 11632| <p>Attachments</p> |                            Attachment|
| 12166| <p>Employee ID</p> |                            Employee I|
| 12166| <p>Work Related?</p>|                            Work Related|
| 12165| <p>Date &amp; Time of injury/onset illness?</p>| Date &amp; Time of injury/onset illnes|
| 12165| <p>Full Injury/Illness Description</p>|       Full Injury/Illness Descriptio|
| 12165| <p>Job Title</p>|                                Job Titl|

As you can see I am using a function [dbo].[udf_StripHTML] to clean up the data:

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT

SET @Start = CHARINDEX('<',@HTMLText) SET @End = 
CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText)) 
SET @Length = (@End - @Start) + 1 WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
--RETURN replace(REPLACE((@HTMLText),'&nbsp;',' '), ''', '')
--RETURN replace(replace(LTRIM((@HTMLText)),'&nbsp;',' '), ''','')
RETURN replace(substring(REPLACE(REPLACE(replace(ltrim(rtrim(@htmltext)),'&nbsp;',' '), '&#39',' '), '&amp;',''), charindex('.)',REPLACE(REPLACE(replace(ltrim(rtrim(@htmltext)),'&nbsp;',' '), ''', ''), '&amp;','')), len(REPLACE(REPLACE(replace(ltrim(rtrim(@htmltext)),'&nbsp;',' '), ''', ''), '&amp;',''))

END


Select [Question_ID]
,[Question_Text]
,[dbo].[udf_StripHTML]([question_text2])
From [dbo].[tblQuestionsSample]
WHERE  (question_text NOT LIKE '%Comments%') AND (question_text NOT LIKE '%Archived%')

I appreciate any assistance you may offer,

Karen

Best Answer

If your HTML is valid XHTML (and it appears to be so from the small sample provided), you can cast it to xml and then use .value with XQuery to parse it. This will do a far better job than mucking around with string manipulation and HTML escaping:

SELECT
  Question_Text_Parsed = TRY_CAST(Question_Text AS xml).value('(/p/text())[1]','nvarchar(1000)')
FROM tblQuestionsSample
Question_ID Question_Text_Parsed
11603 Date/Time
11632 Attachments
12166 Employee ID
12166 Work Related?
12165 Date & Time of injury/onset illness?
12165 Full Injury/Illness Description
12165 Job Title

db<>fiddle