I need assistance with fixing a replace statement that truncates the last character of the string. The original string (Question_Text) contains HTML characters and other erroneous characters that need to be cleaned up. I included a multiple Replace statement within the Function. it is the Function I need assistance with.
As you can see in the sample data, Question_Text2 has been cleaned up, except it cuts of the last character. Along with the removal of erroneous characters in the middle of some of the questions (see ID = 12165). What am I doing wrong?
I have include code to create the sample data results after the function has been used to illustrate the problem.
/****** Object: Table [dbo].[tblQuestionsSample] Script Date: 12/10/2021 5:12:16 PM ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[tblQuestionsSample]') AND type in (N'U'))
DROP TABLE [dbo].[tblQuestionsSample]
GO
/****** Object: Table [dbo].[tblQuestionsSample] Script Date: 12/10/2021 5:12:16 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[tblQuestionsSample](
[QuestionsRecID] [int] IDENTITY(1,1) NOT NULL,
[Question_ID] [int] NULL,
[Question_Text] [varchar](1000) NULL,
[Question_Text2] [varchar](1000) NULL
CONSTRAINT [PK_dbo.[QuestionsRecID] PRIMARY KEY CLUSTERED
(
[QuestionsRecID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
INSERT INTO [dbo].[tblQuestionsSample]
([Question_ID]
,[Question_Text]
,[Question_Text2])
VALUES
(11603,'<p>Date/Time</p>','Date/Tim')
,(11632,'<p>Attachments</p>','Attachment')
,(12166,'<p>Employee ID</p>','Employee I')
,(12166,'<p>Work Related?</p>','Work Related')
,(12165,'<p>Date & Time of injury/onset illness?</p>','Date & Time of injury/onset illness')
,(12165,'<p>Full Injury/Illness Description</p>','Full Injury/Illness Descriptio')
,(12165,'<p>Job Title</p>','Job Titl')
go
Results of query:
| id | question_text| Question_Text2|
| ---- | -------- | -------------------------------------------------- |
| 11603| <p>Date/Time</p>| Date/Tim|
| 11632| <p>Attachments</p> | Attachment|
| 12166| <p>Employee ID</p> | Employee I|
| 12166| <p>Work Related?</p>| Work Related|
| 12165| <p>Date & Time of injury/onset illness?</p>| Date & Time of injury/onset illnes|
| 12165| <p>Full Injury/Illness Description</p>| Full Injury/Illness Descriptio|
| 12165| <p>Job Title</p>| Job Titl|
As you can see I am using a function [dbo].[udf_StripHTML] to clean up the data:
CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('<',@HTMLText) SET @End =
CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1 WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
--RETURN replace(REPLACE((@HTMLText),' ',' '), ''', '')
--RETURN replace(replace(LTRIM((@HTMLText)),' ',' '), ''','')
RETURN replace(substring(REPLACE(REPLACE(replace(ltrim(rtrim(@htmltext)),' ',' '), ''',' '), '&',''), charindex('.)',REPLACE(REPLACE(replace(ltrim(rtrim(@htmltext)),' ',' '), ''', ''), '&','')), len(REPLACE(REPLACE(replace(ltrim(rtrim(@htmltext)),' ',' '), ''', ''), '&',''))
END
Select [Question_ID]
,[Question_Text]
,[dbo].[udf_StripHTML]([question_text2])
From [dbo].[tblQuestionsSample]
WHERE (question_text NOT LIKE '%Comments%') AND (question_text NOT LIKE '%Archived%')
I appreciate any assistance you may offer,
Karen
Best Answer
If your HTML is valid XHTML (and it appears to be so from the small sample provided), you can cast it to
xml
and then use.value
with XQuery to parse it. This will do a far better job than mucking around with string manipulation and HTML escaping:db<>fiddle