Sql-server – Bulk insert not retaining special chars of UTF-8 Encoded txt file Sql Server 2008

encodingsql serversql-server-2008

I have a stored procedure that bulk imports a text file and inserts it into my database.

CREATE TABLE DBO.TEMP_STORE
(
            ID nvarchar(max),
            [MONTH] nvarchar(max),
            [YEAR] nvarchar(max),
            STORE nvarchar(MAX),
            SUBMITTAL nvarchar(MAX),
            ENTITY nvarchar(MAX),
            SUBMITTAL_TYPE nvarchar(MAX),
            iCOUNT nvarchar(MAX),
            STATE nvarchar(MAX),
            COUNTRY nvarchar(MAX),
            REGION nvarchar(MAX),
            GLOBAL_REGION nvarchar(MAX),
            LOCAL_CURRENCY nvarchar(MAX)

 )
--SELECT * FROM DBO.TEMP_STORE
--prepare bulk insert query to load data from file to temp table
SET @SQL='BULK INSERT DBO.TEMP_STORE FROM '''+ @FilePath+''''
SET @SQL=@SQL+'WITH ('
SET @SQL=@SQL+'DATAFILETYPE = ''char'',' 
SET @SQL=@SQL+'FIELDTERMINATOR = ''|'','
--SET @SQL=@SQL+'ROWTERMINATOR = ''\n'','
SET @SQL=@SQL+'ROWTERMINATOR = ''' + nchar(10) + ''','
SET @SQL=@SQL+'FIRSTROW =2)'

--print @SQL
EXEC (@SQL)

For instance, importing Update ROLLING_CONE_SHOP_DETAIL set SHOP_STATE = 'São Paulo' results in storing the data as'S+úo Paulo' when I select the same row that was imported. I can update the row and explicitly fix the fix the issue:

Update TEMP_STORE set STATE = 'São Paulo' where STATE= 'S+úo Paulo'

But when I do my Bulk insert, it does not retain the special character. I have confirmed the text file I am importing is saved at utf-8 and contains the correct character.

How can I make sure my bulk inset retains the special character properly?

Best Answer

SQL Server 2008 does not support UTF-8 natively prior to SQL Server 2019.

So how can you import your data correctly?

  1. Convert your data to UTF-16 prior to your insert
  2. Upgrade to SQL Server 2014 (SP2) or above and take advantage of code page 65001 in the bulk insert statement as below

    BULK INSERT #table
    
    FROM  'C:\MyFile.csv'+ WITH 
    
    ( 
        CODEPAGE = '65001',
        FIELDTERMINATOR = ',',
        ROWTERMINATOR ='\n'
    );
    
  3. Since UTF-8 is an encoding style, not a collation, store the data as VARBINARY and decode it in the front-end rather than the database.