Unicode files should have a BOM; it is the recommended and accepted practice, especially for little-endian (LE) data:
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a
file or character stream to indicate the endianness (byte order) of
all the 16-bit code units of the file or stream. If the 16-bit units
are represented in big-endian byte order, this BOM character will
appear in the sequence of bytes as 0xFE followed by 0xFF. This
sequence appears as the ISO-8859-1 characters þÿ in a text display
that expects the text to be ISO-8859-1.
...
"The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian."
'Linux' is a very generic term. You must be using some application for processing, and if that application does not recognize the BOM, it is a bad application. Ditch it for something better.
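To make the quoted rule concrete, here is a minimal sketch in Python of how an application might pick the byte order. It assumes the input is already known to be UTF-16 (the little-endian UTF-16 BOM is also a prefix of the UTF-32 LE BOM, so this is not a general encoding detector), and the function names `sniff_utf16` and `decode_utf16` are made up for illustration:

```python
import codecs


def sniff_utf16(data: bytes) -> tuple:
    """Return (codec_name, bom_length) for a byte string assumed to be UTF-16."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes 0xFE 0xFF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes 0xFF 0xFE
        return "utf-16-le", 2
    # No BOM and no higher-level protocol: fall back to big-endian,
    # as the quoted passage describes.
    return "utf-16-be", 0


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, dropping the BOM if one is present."""
    codec, skip = sniff_utf16(data)
    return data[skip:].decode(codec)
```

An application that skips this check and treats the bytes as ISO-8859-1 would show a big-endian BOM as þÿ, which is the symptom described in the quote above.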
Well, this is really awful, but okay, if you are going to refuse to even consider better alternatives... first, create a set-based split function that will track the order of the string (your looping function is really not optimal):
CREATE FUNCTION [dbo].[SplitStrings_Ordered]
(
    @List      VARCHAR(8000),
    @Delimiter VARCHAR(255)
)
RETURNS TABLE
AS
    -- [Index] preserves the left-to-right position of each item
    RETURN (SELECT [Index] = ROW_NUMBER() OVER (ORDER BY Number), Item
        FROM (SELECT Number, Item = SUBSTRING(@List, Number,
                CHARINDEX(@Delimiter, @List + @Delimiter, Number) - Number)
            -- candidate start positions, one per row of sys.all_columns
            FROM (SELECT ROW_NUMBER() OVER (ORDER BY [object_id])
                FROM sys.all_columns) AS n(Number)
            WHERE Number <= CONVERT(INT, LEN(@List))
              -- keep only positions at the start of the string or right after a delimiter
              AND SUBSTRING(@Delimiter + @List, Number, LEN(@Delimiter)) = @Delimiter
        ) AS y);
(If you have a numbers table in this database, use that instead of sys.all_columns, and add WITH SCHEMABINDING to the function definition.)
Now, let's look at a few examples of strings with commas embedded inside double quotes, and remove those commas (along with the quotes themselves) by splitting on the double quote and re-concatenating:
DECLARE @x TABLE(n VARCHAR(8000));
INSERT @x VALUES
    ('0150566115,"HEALTH 401K","IC,ON","ICON HEALTH 401K",,,1,08/21/2014'),
    ('0150566115,HEALTH 401K,"IC,ON","ICON HEALTH 401K",,,1,"08/21/2014"'),
    ('"01505,66115,","HEALTH 401K","IC,ON","ICON HEALTH 401K",,,1,08/21/2014');
;WITH x AS
(
    -- even-numbered pieces sit inside a double-quote pair, so strip their commas
    SELECT x.n, s.[Index], s = REPLACE(s.Item, ',',
        CASE s.[Index]%2 WHEN 0 THEN '' ELSE ',' END)
    FROM @x AS x
    CROSS APPLY dbo.SplitStrings_Ordered(x.n, '"') AS s
)
SELECT x.n, fixed = (SELECT x2.s
        FROM x AS x2
        WHERE x2.n = x.n
        ORDER BY [Index]
        FOR XML PATH, TYPE).value(N'.[1]',N'varchar(max)')
FROM x
GROUP BY x.n;
Results in the fixed column for all three strings:
0150566115,HEALTH 401K,ICON,ICON HEALTH 401K,,,1,08/21/2014
0150566115,HEALTH 401K,ICON,ICON HEALTH 401K,,,1,08/21/2014
0150566115,HEALTH 401K,ICON,ICON HEALTH 401K,,,1,08/21/2014
Now, you can feed those results back into the split function, using the comma as the delimiter this time, depending on your ultimate goal. The question seemed to revolve only around being able to ignore the double quotes and any commas contained inside double-quote pairs.
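If it helps to see the odd/even idea outside of T-SQL, here is a rough Python sketch of the same two steps: split on the double quote, drop commas from the pieces that fall inside a quote pair, re-concatenate, then split on the comma. The function name `strip_quoted_commas` is made up for illustration, and the sketch assumes the quotes are balanced and never escaped, as in the sample rows above. Python's `split` is 0-based, so the quoted pieces are the odd-numbered ones here, whereas the 1-based [Index] in the query makes them even.

```python
def strip_quoted_commas(line: str, quote: str = '"') -> str:
    """Remove the quote characters and any commas found between quote pairs."""
    pieces = line.split(quote)
    cleaned = [
        piece.replace(",", "") if i % 2 == 1 else piece  # odd index = inside quotes
        for i, piece in enumerate(pieces)
    ]
    return "".join(cleaned)


rows = [
    '0150566115,"HEALTH 401K","IC,ON","ICON HEALTH 401K",,,1,08/21/2014',
    '0150566115,HEALTH 401K,"IC,ON","ICON HEALTH 401K",,,1,"08/21/2014"',
    '"01505,66115,","HEALTH 401K","IC,ON","ICON HEALTH 401K",,,1,08/21/2014',
]

fixed_rows = [strip_quoted_commas(row) for row in rows]

# Second pass: with the embedded commas gone, a plain comma split is safe.
fields = [fixed.split(",") for fixed in fixed_rows]
```

All three rows should come out as the same eight fields, with the two empty fields between ICON HEALTH 401K and 1 preserved.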
For more on splitting and concatenating strings:
For more on numbers tables and generating sets without loops:
Best Answer
You could use the LEFT() and RIGHT() functions to do this.
Result