SQL Server – Treating certain Arabic characters as identical

collation, sql-server

In Arabic we have characters like ا (alef) and أ (alef with hamza).

Users write them interchangeably and we want to search them interchangeably. SQL Server treats them as separate characters. How can I make SQL treat them as the same character?

I thought about replacing every أ (alef with hamza) with ا (alef) at insertion, but Arabic has many such alternatives, not just ا (alef) and أ (alef with hamza).

I tried the Arabic_CI_AS and Arabic_CI_AI collations, but neither solves the problem.

Here is a script to reproduce the issue:

CREATE TABLE [dbo].[TestTable] (
    [ArabicChars] [nvarchar](50) NOT NULL,

    CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED 
    (
       [ArabicChars] ASC
    )
) ON [PRIMARY];


INSERT INTO TestTable values (N'احمد');
INSERT INTO TestTable values (N'أحمد');

SELECT * 
FROM TestTable 
WHERE ArabicChars like N'ا%';

The result is:

ArabicChars 

احمد

(1 row(s) affected)

The desired result would be both of the rows we inserted.

Best Answer

I ran a few tests. This is more of a workaround, but it can get the job done, since SQL Server itself isn't helping much here.

Notice that the Unicode code points of these characters are close to each other:

SELECT UNICODE(N'أ');  -- 1571

SELECT UNICODE(N'ا');  -- 1575

SELECT UNICODE(N'إ');  -- 1573

So أ to ا spans 1571 to 1575. If you want to be sure you get everything in between, include 1569 through 1575, which are:

SELECT NCHAR(1569);  -- ء
SELECT NCHAR(1570);  -- آ
SELECT NCHAR(1571);  -- أ
SELECT NCHAR(1572);  -- ؤ
SELECT NCHAR(1573);  -- إ
SELECT NCHAR(1574);  -- ئ
SELECT NCHAR(1575);  -- ا
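The same list can be produced in one query, which is handy if you want to widen or narrow the range later. This is only a sketch; the alias names (v, CodePoint, Symbol, RoundTrip) are mine, not part of the original tests.

```sql
-- List code points 1569 through 1575 with the character each one maps to.
-- RoundTrip converts the character back to its code point as a sanity check.
SELECT v.CodePoint,
       NCHAR(v.CodePoint)          AS Symbol,
       UNICODE(NCHAR(v.CodePoint)) AS RoundTrip
FROM (VALUES (1569), (1570), (1571), (1572), (1573), (1574), (1575)) AS v(CodePoint)
ORDER BY v.CodePoint;
```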

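So, to include all the similar characters in your search, you can use a character range in a LIKE pattern (this is LIKE's own wildcard syntax, not a regular expression). Keep the N prefix on the literal so the Arabic characters are not converted to the database code page:

SELECT * 
FROM TestTable 
WHERE ArabicChars like N'%[ء-ا]%'

In this case you get all characters between ء and ا, which includes those from 1569 to 1575. Note that LIKE evaluates a range by the collation's sort order rather than by raw code point, so check the result under your own collation.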

So if your table is defined as

CREATE TABLE [dbo].[TestTable]  (
    [ArabicChars] [nvarchar](50) COLLATE Arabic_CI_AI NOT NULL
);
INSERT INTO TestTable values (N'احمد');
INSERT INTO TestTable values (N'أحمد');
INSERT INTO TestTable values (N'إحمد');

the query above returns all of them.
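One caveat: the %[ء-ا]% pattern matches any row that contains one of those characters anywhere in the string. To search for a specific word, you would probably put the range (or an explicit set) only at the position where the alef variant can occur. A minimal sketch against the same TestTable; which characters the range and the set end up matching still depends on the column's collation, so test it on your data.

```sql
-- Range form: any character that sorts between ء and ا, followed by the rest of the name.
SELECT ArabicChars
FROM TestTable
WHERE ArabicChars LIKE N'[ء-ا]حمد';

-- Set form: list only the alef variants you want to accept at that position.
SELECT ArabicChars
FROM TestTable
WHERE ArabicChars LIKE N'[اأإآ]حمد';
```

The set form is more explicit about intent, although under an accent-insensitive collation it may still match the other hamza carriers described further down.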

But you will notice something odd if the column is the primary key:

CREATE TABLE [dbo].[TestTable]  (
    [ArabicChars] [nvarchar](50) COLLATE Arabic_CI_AI NOT NULL,

    CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED 
    (
       [ArabicChars] ASC
    )
) ON [PRIMARY];

you won't be able to insert all of these records; after the first one, the others are rejected as duplicate keys:

INSERT INTO TestTable values (N'أحمد');
INSERT INTO TestTable values (N'إحمد');
INSERT INTO TestTable values (N'ءحمد');

because SQL Server treats ء, أ and إ all as forms of hamza (ء).

So if you run this query against the earlier table (the one without the primary key)

SELECT * 
FROM TestTable 
WHERE ArabicChars like N'ء%'

it returns

أحمد
إحمد
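You can check these comparisons directly, without creating a table, by comparing the literals under an explicit collation. This is just a quick probe; the column aliases are made up for the example. Based on the tests above I would expect the first column to say the hamza forms are equal and the second to say hamza and alef are different, but run it against your own server and collation to confirm.

```sql
-- Compare individual characters under Arabic_CI_AI.
SELECT CASE WHEN N'أ' = N'إ' COLLATE Arabic_CI_AI
            THEN 'equal' ELSE 'different' END AS HamzaForms,
       CASE WHEN N'أ' = N'ا' COLLATE Arabic_CI_AI
            THEN 'equal' ELSE 'different' END AS HamzaVsAlef;
```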

To cut a long story short:

To SQL Server, أ is not equal to ا, because they are two different letters (hamza and alef),

but ء = آ = أ = ؤ = إ = ئ

They are all treated as hamza (ء).
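Since the collation will not equate alef with the hamza forms for you, the other route is the one mentioned in the question: fold the variants to a bare alef and search the folded value. Below is a rough sketch using a persisted computed column, so the original text stays untouched. The column name ArabicCharsFolded and the list of three replacements are my assumptions; extend the list to whatever variants your users actually type. The binary collation (Arabic_BIN2) is there so that each REPLACE touches only the exact character named, rather than everything the accent-insensitive collation considers equal to it.

```sql
-- Add a folded copy of the column: alef-with-hamza and alef-with-madda become bare alef.
ALTER TABLE dbo.TestTable
ADD ArabicCharsFolded AS
    REPLACE(REPLACE(REPLACE(ArabicChars COLLATE Arabic_BIN2,
        N'أ', N'ا'),
        N'إ', N'ا'),
        N'آ', N'ا')
    PERSISTED;
GO

-- Fold the search term the same way, then search the folded column.
DECLARE @term nvarchar(50) = N'أح';
SET @term = REPLACE(REPLACE(REPLACE(@term COLLATE Arabic_BIN2,
        N'أ', N'ا'),
        N'إ', N'ا'),
        N'آ', N'ا');

SELECT ArabicChars
FROM dbo.TestTable
WHERE ArabicCharsFolded LIKE @term + N'%';
```

With the two rows from the question, both should come back, because both fold to the same string. The persisted column can also be indexed if the table is large.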

Related Solutions

SQL Server – Why Non-Digits Match LIKE [0-9]

[0-9] is not some type of regular expression defined to just match digits.

Any range in a LIKE pattern matches characters between the start and end character according to collation sort order.
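The query that follows reads from a temp table called #CodePage, whose definition is not included here. One plausible way to build it is shown below; this is a guess at its shape (one row per single-byte code point, 0 to 255), not the original script, and the characters that CHAR() returns depend on the code page of the current database's default collation.

```sql
-- Hypothetical reconstruction of #CodePage: code points 0..255 and their characters.
WITH Numbers AS (
    SELECT TOP (256)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS N
    FROM sys.all_objects
)
SELECT N                    AS CodePoint,
       CHAR(CAST(N AS int)) AS Symbol
INTO #CodePage
FROM Numbers;
```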

SELECT CodePoint,
       Symbol,
       RANK() OVER (ORDER BY Symbol COLLATE Latin1_General_CI_AS) AS Rnk
FROM   #CodePage
WHERE  Symbol LIKE '[0-9]' COLLATE Latin1_General_CI_AS
ORDER  BY Symbol COLLATE Latin1_General_CI_AS

Returns

CodePoint            Symbol Rnk
-------------------- ------ --------------------
48                   0      1
188                  ¼      2
189                  ½      3
190                  ¾      4
185                  ¹      5
49                   1      5
50                   2      7
178                  ²      7
179                  ³      9
51                   3      9
52                   4      11
53                   5      12
54                   6      13
55                   7      14
56                   8      15
57                   9      16

So you get these results because under your default collation these characters sort after 0 but before 9.

It looks as though the collation is defined to actually sort them in mathematical order with the fractions in the correct order between 0 and 1.

You could also use a set rather than a range. To avoid 2 matching ² you would also need a CS (case-sensitive) collation:

SELECT CodePoint, Symbol
FROM #CodePage
WHERE Symbol LIKE '[0123456789]' COLLATE Latin1_General_CS_AS

How to Transfer Right-to-Left Text from Excel to SQL Server

There are two ways to do this:

1. Right-click the database -> Tasks -> Import Data -> select Excel as your data source -> continue until the wizard is finished.
2. The longer way:
- In Excel, add the following columns before your data: [image]
- Copy and paste your data set into a new query in SQL Server.
- Use Ctrl + H to replace the blank spots as follows: [image]
- Format it as you like and you're good to go!
