SQL Server 2008 R2 – Having SharePoint on a Different Collation than System Databases

collationsharepointsql-server-2008-r2

After installing Sharepoint I realized that the collation of all Sharepoint databases is Latin1_General_CI_AS_KS_WS while my server collation is Latin1_General_CI_AS. From official documentation I see that all CI collations are supported. Anyway, would it be helpful for me to rebuild the system databases into the same collation to avoid conversion problems in temp db?

Quote

We support any CI collation for the SQL instance (for master, tempdb
databases). However we recommend using Latin1_General_CI_AS_KS_WS as
the instance default collation (master, tempdb databases).

I don't find the best practice here, please help me.

Best Answer

Having a different default collation for the instance than for the SharePoint DB is not necessarily a problem. The main thing to ensure is that the SharePoint DB is set to the correct collation for SharePoint, which is Latin1_General_CI_AS_KS_WS, and which you have done.

You quoted a section from Supportability regarding SQL collation for SharePoint Databases and TempDB (in the future, when you quote something, you really need to attribute the source) which clearly states that SharePoint does indeed "support any CI collation for the SQL instance (for master, tempdb databases)". Hence, Latin1_General_CI_AS being case-insensitive, is supported.

The only other guidance I have found, and which makes sense, even outside of SharePoint, is this:

SharePoint SQL Collation - Best practice

The main point expressed there is that if you have more running on this instance of SQL Server than SharePoint, then sure, use whatever collation makes sense. Else, it is easier to just have it match between SharePoint and the system DBs.

Related Solutions

Sql-server – SQL Server: Collation of the resource database

The database Collation for the first SELECT statement in my query is Latin1_General_CI_AS.

Well, not exactly. There are a few problems with this statement:

The Database's default Collation only matters in a query when using string literals, variables, and return values from UDFs. AND, that default Collation only matters if there is no column or COLLATE keyword being used.
SELECT statements / queries, as a whole, do not use Collations. Collation is assigned per each string field, and it can be different for each field in a query.
The column in the first / top query, stopword, is not using the Latin1_General_CI_AS Collation (more on this in a moment).

I see that the system-stopwords are stored in the resource database, which can explain the difference in collation.

A column (or expression) coming from a different Database does not necessarily explain a difference in Collation. As stated above, Collation is set per each field of a query, whether that field comes from a column in a table or is an expression. Collation is usually derived naturally from Collation Precedence: Column Collation overrides literals and variables, and the COLLATE keyword overrides both. When there is a conflict, then you need the COLLATE keyword.

The main point here, however, is that if the stopword column in sys.fulltext_system_stopwords comes from a column in a Table in the Resource Database (i.e. mssqlsystemresource), OR if it comes from an expression in a View in the Resource Database that has its Collation set via the COLLATE keyword, then the default Collation of the Resource Database is irrelevant.

I can 'solve' my error by using COLLATE in my query

Yes, the COLLATE keyword is the way to go. However, to fix this conflict, you only need to specify the COLLATE keyword in one query, though it does not matter which query.

For example, if I run the query with no COLLATE keyword, I get:

SELECT ftsw.stopword
FROM sys.fulltext_stopwords ftsw

UNION

SELECT ftssw.stopword
FROM sys.fulltext_system_stopwords ftssw;

Msg 468, Level 16, State 9, Line 4
Cannot resolve the collation conflict between "SQL_Latin1_General_CP1_CI_AS" and "Latin1_General_BIN" in the UNION operation.

So, assuming that the conflict is in the second (i.e. bottom) SELECT statement, I could fix it by applying the current DB's default Collation:

SELECT ftsw.stopword
FROM sys.fulltext_stopwords ftsw
UNION
SELECT ftssw.stopword COLLATE SQL_Latin1_General_CP1_CI_AS
FROM sys.fulltext_system_stopwords ftssw;

And that works. BUT, what if we try putting COLLATE on the first / top SELECT:

SELECT ftsw.stopword COLLATE SQL_Latin1_General_CP1_CI_AS
FROM sys.fulltext_stopwords ftsw
UNION
SELECT ftssw.stopword
FROM sys.fulltext_system_stopwords ftssw;

That also works. And in fact, both of the following also work:

SELECT ftsw.stopword COLLATE Hebrew_100_CI_AS
FROM sys.fulltext_stopwords ftsw
UNION
SELECT ftssw.stopword
FROM sys.fulltext_system_stopwords ftssw;

-- and:

SELECT ftsw.stopword
FROM sys.fulltext_stopwords ftsw
UNION
SELECT ftssw.stopword COLLATE Hebrew_100_CI_AS
FROM sys.fulltext_system_stopwords ftssw;

Also, rather than use the DATABASE_DEFAULT option, which here equates to Latin1_General_CI_AS, I would use Latin1_General_BIN (or better yet: Latin1_General_100_BIN2, which is newer and better) in this particular case because it will ensure that different strings that can be normalized into the same string via the "distinct" behavior of the UNION (without the ALL) and the case-insensitivity of the Latin1_General_CI_AS Collation show up as different rows.

Where does the Latin1_General_BIN collation come from? It looks like the sys.fulltext_system_stopwords table has a different collation, but why?

That Collation comes from an unlikely source. Let's look at the system catalog views in the query:

EXEC sys.sp_help N'sys.fulltext_stopwords';

The results indicate that this is a View (as expected) and that the stopword column has a Collation of Latin1_General_BIN (not expected). But wait, if the Latin1_General_BIN Collation is coming from sys.fulltext_stopwords, then what about sys.fulltext_system_stopwords and where is the other Collation coming from? Let's look:

EXEC sys.sp_help N'sys.fulltext_system_stopwords';

The results indicate that this is a View (as expected) and that the stopword column has a Collation of SQL_Latin1_General_CP1_CI_AS (not expected).

We can now dig a little deeper into the definition of each of those system catalog views:

EXEC sys.sp_helptext N'sys.fulltext_stopwords';

Returns (simplified):

SELECT 
    fts.stopword,
FROM sys.sysftstops fts

And then:

EXEC sys.sp_help N'sys.sysftstops';

The results indicate that this is a system table, and that the stopword column indeed has a Collation of Latin1_General_BIN.

Next we can move on to the other system catalog view:

EXEC sys.sp_helptext N'sys.fulltext_system_stopwords';

Returns (simplified):

SELECT convert(nvarchar(64), stopword) as stopword, language_id
FROM OpenRowset(TABLE FTSYSSTPWD)

And FTSYSSTPWD comes from the Resource Database, so there isn't much more we can do at this point.

Still, there is one last thing we can do to be clear about the Collation of the data coming from the Resource Database -- sys.fulltext_system_stopwords.stopword:

CREATE DATABASE [FullTextCollationTest] COLLATE SQL_EBCDIC277_CP1_CS_AS;
GO
USE [FullTextCollationTest];

EXEC sys.sp_help N'sys.fulltext_stopwords';
-- Collation for [stopword] column: Latin1_General_BIN      (same as before)

EXEC sys.sp_help N'sys.fulltext_system_stopwords';
-- Collation for [stopword] column: SQL_EBCDIC277_CP1_CS_AS (same as DB's default Collation)

GO
USE [master];
DROP DATABASE [FullTextCollationTest];

SQL Server – Backwards Compatibility from SQL 2005 to 2008 Features

First of all, apologies for such a long answer, as I feel that still there is a lot of confusion when people talk about terms like collation, sort order, code page, etc.

From BOL :

Collations in SQL Server provide sorting rules, case, and accent sensitivity properties for your data. Collations that are used with character data types such as char and varchar dictate the code page and corresponding characters that can be represented for that data type. Whether you are installing a new instance of SQL Server, restoring a database backup, or connecting server to client databases, it is important that you understand the locale requirements, sorting order, and case and accent sensitivity of the data that you will be working with.

This means that Collation is very important as it specifies rules on how character strings of the data are sorted and compared.

Note: More info on COLLATIONPROPERTY

Now Lets first understand the differences ......

Running below T-SQL :

SELECT *
FROM::fn_helpcollations()
WHERE NAME IN (
        'SQL_Latin1_General_CP1_CI_AS'
        ,'Latin1_General_CI_AS'
        )
GO

SELECT 'SQL_Latin1_General_CP1_CI_AS' AS 'Collation'
    ,COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage') AS 'CodePage'
    ,COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'LCID') AS 'LCID'
    ,COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'ComparisonStyle') AS 'ComparisonStyle'
    ,COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'Version') AS 'Version'

UNION ALL

SELECT 'Latin1_General_CI_AS' AS 'Collation'
    ,COLLATIONPROPERTY('Latin1_General_CI_AS', 'CodePage') AS 'CodePage'
    ,COLLATIONPROPERTY('Latin1_General_CI_AS', 'LCID') AS 'LCID'
    ,COLLATIONPROPERTY('Latin1_General_CI_AS', 'ComparisonStyle') AS 'ComparisonStyle'
    ,COLLATIONPROPERTY('Latin1_General_CI_AS', 'Version') AS 'Version'
GO

The results would be :

enter image description here

Looking at above results, the only difference is the Sort Order between the 2 collations.But that is not true, which you can see why as below :

Test 1 :

--Clean up previous query
IF OBJECT_ID('Table_Latin1_General_CI_AS') IS NOT NULL
    DROP TABLE Table_Latin1_General_CI_AS;

IF OBJECT_ID('Table_SQL_Latin1_General_CP1_CI_AS') IS NOT NULL
    DROP TABLE Table_SQL_Latin1_General_CP1_CI_AS;

-- Create a table using collation Latin1_General_CI_AS 
CREATE TABLE Table_Latin1_General_CI_AS (
    ID INT IDENTITY(1, 1)
    ,Comments VARCHAR(50) COLLATE Latin1_General_CI_AS
    )

-- add some data to it 
INSERT INTO Table_Latin1_General_CI_AS (Comments)
VALUES ('kin_test1')

INSERT INTO Table_Latin1_General_CI_AS (Comments)
VALUES ('Kin_Tester1')

-- Create second table using collation SQL_Latin1_General_CP1_CI_AS 
CREATE TABLE Table_SQL_Latin1_General_CP1_CI_AS (
    ID INT IDENTITY(1, 1)
    ,Comments VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS
    )

-- add some data to it 
INSERT INTO Table_SQL_Latin1_General_CP1_CI_AS (Comments)
VALUES ('kin_test1')

INSERT INTO Table_SQL_Latin1_General_CP1_CI_AS (Comments)
VALUES ('Kin_Tester1')

--Now try to join both tables
SELECT *
FROM Table_Latin1_General_CI_AS LG
INNER JOIN Table_SQL_Latin1_General_CP1_CI_AS SLG ON LG.Comments = SLG.Comments
GO

Results of Test 1:

Msg 468, Level 16, State 9, Line 35
Cannot resolve the collation conflict between "SQL_Latin1_General_CP1_CI_AS" and "Latin1_General_CI_AS" in the equal to operation.

From above results we can see that the we cannot directly compare values on columns with different collations, you have to use COLLATE to compare the column values.

TEST 2 :

The major difference is performance, as Erland Sommarskog points out at this discussion on msdn.

--Clean up previous query
IF OBJECT_ID('Table_Latin1_General_CI_AS') IS NOT NULL
    DROP TABLE Table_Latin1_General_CI_AS;

IF OBJECT_ID('Table_SQL_Latin1_General_CP1_CI_AS') IS NOT NULL
    DROP TABLE Table_SQL_Latin1_General_CP1_CI_AS;

-- Create a table using collation Latin1_General_CI_AS 
CREATE TABLE Table_Latin1_General_CI_AS (
    ID INT IDENTITY(1, 1)
    ,Comments VARCHAR(50) COLLATE Latin1_General_CI_AS
    )

-- add some data to it 
INSERT INTO Table_Latin1_General_CI_AS (Comments)
VALUES ('kin_test1')

INSERT INTO Table_Latin1_General_CI_AS (Comments)
VALUES ('kin_tester1')

-- Create second table using collation SQL_Latin1_General_CP1_CI_AS 
CREATE TABLE Table_SQL_Latin1_General_CP1_CI_AS (
    ID INT IDENTITY(1, 1)
    ,Comments VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS
    )

-- add some data to it 
INSERT INTO Table_SQL_Latin1_General_CP1_CI_AS (Comments)
VALUES ('kin_test1')

INSERT INTO Table_SQL_Latin1_General_CP1_CI_AS (Comments)
VALUES ('kin_tester1')

--- Create Indexes on both tables

CREATE INDEX IX_LG_Comments ON  Table_Latin1_General_CI_AS(Comments)
go
CREATE INDEX IX_SLG_Comments ON  Table_SQL_Latin1_General_CP1_CI_AS(Comments)

--- Run the queries

DBCC FREEPROCCACHE
GO
SELECT Comments FROM Table_Latin1_General_CI_AS WHERE Comments = 'kin_test1'
GO

--- This will have IMPLICIT Conversion

enter image description here

--- Run the queries

DBCC FREEPROCCACHE
GO
SELECT Comments FROM Table_SQL_Latin1_General_CP1_CI_AS WHERE Comments = 'kin_test1'
GO

--- This will NOT have IMPLICIT Conversion

enter image description here

The reason for Implicit conversion is because, I have my database & Server collation both as SQL_Latin1_General_CP1_CI_AS and the table Table_Latin1_General_CI_AS has column Comments defined as VARCHAR(50) with COLLATE Latin1_General_CI_AS, so during the lookup SQL Server has to do an IMPLICIT conversion.

Test 3:

With the same set up, now we will compare the varchar columns with nvarchar values to see the changes in the execution plans.

-- run the query

DBCC FREEPROCCACHE
GO
SELECT Comments FROM Table_Latin1_General_CI_AS WHERE Comments =  (SELECT N'kin_test1' COLLATE Latin1_General_CI_AS)
GO

enter image description here

-- run the query

DBCC FREEPROCCACHE
GO
SELECT Comments FROM Table_SQL_Latin1_General_CP1_CI_AS WHERE Comments = N'kin_test1'
GO

enter image description here

Note that the first query is able to do Index seek but has to do Implicit conversion while the second one does an Index scan which prove to be inefficient in terms of performance when it will scan large tables.

Conclusion :

All of the above tests shows that having right collation is very important for your database server instance.
SQL_Latin1_General_CP1_CI_AS is a SQL collation with the rules that allow you to sort data for unicode and non-unicode are different.
SQL collation wont be able to use Index when comparing unicode and non-unicode data as seen in above tests that when comparing nvarchar data to varchar data, it does Index scan and not seek.
Latin1_General_CI_AS is a Windows collation with the rules that allow you to sort data for unicode and non-unicode are same.
Windows collation can still use Index (Index seek in above example) when comparing unicode and non-unicode data but you see a slight performance penalty.
Highly recommend to read Erland Sommarskog answer + the connect items that he has pointed to.

This will allow me to not have problems with #temp tables, but are there pitfalls?

See my answer above.

Would I lose any functionality or features of any kind by not using a "current" collation of SQL 2008?

It all depends on what functionality/features you are referring to. Collation is storing and sorting of data.

What about when we move (e.g. in 2 years ) from 2008 to SQL 2012? Will I have problems then? Would I at some point be forced to go to Latin1_General_CI_AS?

Cant vouch ! As things might change in and its always good to be inline with Microsoft's suggestion + you need to understand your data and the pitfalls that I mentioned above. Also refer to this and this connect items.

I read that some DBA's script complete the rows of complete databases, and then run the insert script into the database with the new collation - I'm very scared and wary of this - would you recommend doing this?

When you want to change collation, then such scripts are useful. I have found myself changing collation of databases to match server collation many times and I have some scripts that does it pretty neat. Let me know if you need it.

References :

Best Answer

Related Solutions

Sql-server – SQL Server: Collation of the resource database

SQL Server – Backwards Compatibility from SQL 2005 to 2008 Features

Related Question