Postgresql – Unicode storage of \u202b RLE and \u202c PDE in a Unicode-aware database

encodingpostgresqlunicode

I'm building a new product for toponyms and in it the Arabic shows kinda like this:

^IArabic^I<202b>ﺰﻤﺑﺎﺑﻮﻳ<202c>^I<202b>ﺞﻫﻭﺮﻳﺓ ﺰﻤﺑﺎﺑﻮﻳ<202c>$

Actually not quite. This is a real problem for my ASCII-spewing terminal, so I'll make an exception and screenshot text.

Arabic

My question is about those U202B "Right-To-Left Embedding" (RLE), and U202C "Pop Directional Formatting" (PDF). Do those get stored as data? My first assumption was that the characters were rendered and not in the file, but alas they are there..

360    5E03 97E6 5171 548C 56FD 000A 0009 0041 0072 0061 0062 0069 0063 0009 202B 0632    布韦共和国..Arabic..ز
.............................................................................^HERE
389    0645 0628 0627 0628 0648 064A 202C 0009 202B 064F 062C 0647 0648 0631 064A 0629    مبابوي...ُجهورية
.....................................^HERE.....^HERE
422    0020 0632 0645 0628 0627 0628 0648 064A 202C 000A 0009 004E 006F 0074 0065 0073    .زمبابوي...Notes
...............................................^HERE

When storing Arabic in a database, do you typically store \u202b, and \u202c? They seem like they're rendering characters and not technically data? I'm simply wanting to process this text to throw in a database, and wondering if these characters should be present in the database, or stripped before insert.

Background

Best Answer

Arabic (as well as Hebrew and Syriac) are right-to-left languages. Hence they display in the opposite direction that the bytes are physical stored in. Having the proper display is controlled through non-printable characters that are interpreted only by the font / rendering system. These two characters in particular are used to control this (see original Unicode spec for starters: https://www.unicode.org/charts/PDF/U2000.pdf ), especially in the context of embedding right-to-left text in the same paragraph as left-to-right text (and the other way around).

So, you must keep them stored or else attempted to display this data later will render it backwards from how the language is supposed to appear and will hence be considered data loss. These are among many formatting control characters that are non-printable / zero-width.

The "official" description of how to work with these characters, from the Unicode Consortium, is (taken from "Chapter 23: Special Areas and Format Characters" top of page 868):

As with other format control characters, bidirectional ordering controls affect the layout of the text in which they are contained but should be ignored for other text processes, such as sorting or searching. However, text processes that modify text content must maintain these characters correctly, because matching pairs of bidirectional ordering controls must be coordinated, so as not to disrupt the layout and interpretation of bidirectional text. Each instance of a lre, rle, lro, or rlo is normally paired with a corresponding pdf. Likewise, each instance of an lri, rli, or fsi is normally paired with a corresponding pdi.

Regarding the importance of keeping (not discarding) these hidden formatting code points, the "Unicode® Standard Annex #9: UNICODE BIDIRECTIONAL ALGORITHM", in section "2.7 Markup and Formatting Characters" states (emphasis mine):

The explicit formatting characters introduce state into the plain text, which must be maintained when editing or displaying the text. Processes that are modifying the text without being aware of this state may inadvertently affect the rendering of large portions of the text, for example by removing a PDF.

and:

Whenever plain text is produced from a document containing markup (ed: HTML and/or CSS), the equivalent formatting characters should be introduced, so that the correct ordering is not lost.

Further explanation is provided in the (excellent) "Understanding Bidirectional (BIDI) Text in Unicode" document by Cal Henderson (taken from the O.P.'s answer) states:

... we could disallow these explicit characters (U+202A - U+202E) which is pretty easy. This does mean that anybody who wants to use them to include Neutrals at the edges of their Arabic usernames will be out of luck - and that sucks more when it's a comment they're posting, where the period jumps to the 'beginning' of the text.

If we want to allow use of these characters, the solution is fairly simple (if hard to implement): we need to make sure that every opening marker has a paired closing marker (PDF) so that the state stack coming out of the string is at the same state as when we went in. We also need to be careful that we don't allow any PDFs to be used without accompanying push markers, else we can't use any ourselves outside of the block.

So, even if the text of a particular cell is supposed to be entirely a right-to-left language, removing these markers could alter the placement of neutral characters (such as punctuation). For example (using SQL Server):

SELECT NCHAR(0x0671) + NCHAR(0x0679) +  N'!';
-- ٱٹ!

SELECT NCHAR(0x202B) + NCHAR(0x0671) + NCHAR(0x0679) + N'!' + NCHAR(0x202C);
-- ‫ٱٹ!‬

Planning to add them back in later, or having a client app add them back in, won't work because there is no inherent means of knowing that they were even being used, and if so, where they were placed.

The safest approach is to keep these characters

For example, you are attempting to include some of this text at the top of the question:

^IArabic^I<202b>ﺰﻤﺑﺎﺑﻮﻳ<202c>^I<202b>ﺞﻫﻭﺮﻳﺓ ﺰﻤﺑﺎﺑﻮﻳ<202c>$

but clearly that's not displaying in the correct order. The bytes, however, are in the correct order:

Looking at just the first <202b>...<202c> section (again, using SQL Server, hence it's Little Endian):

SELECT CONVERT(VARBINARY(MAX), N'<202b>ﺰﻤﺑﺎﺑﻮﻳ<202c>');

the bytes are:

3C00 3200 3000 3200 6200 3E00 B0FE E4FE 91FE 8EFE 91FE EEFE F3FE 3C00 3200 3000 3200 6300 3E00
<    2    0    2    b    >    .    .    .    .    .    .    .    <    2    0    2    c    >

As you can see, there are no additional formatting characters. Because the Arabic characters are strong right-to-left, the characters that follow – <202 – which are neutral (<) and weak (202), continue to display heading to the left (even turning the < into >). And to be clear, the 202 itself displays left-to-right, which would be clearer if the number wasn't a palindrome. If the number was 203, then it would still show as 203 and not 302. But the c is strong left-to-right so it (and the characters that follow) display as expected.

How to fix? Just add the implicit left-to-right markers just after the Arabic to indicate that the right-to-left directionality should end at that point. If we add Code Point U+200E after the last Arabic character (and just before the <) in each of those two segments, we get the following:

^IArabic^I<202b>ﺰﻤﺑﺎﺑﻮﻳ‎<202c>^I<202b>ﺞﻫﻭﺮﻳﺓ ﺰﻤﺑﺎﺑﻮﻳ‎<202c>$

Now, if StackOverflow were to strip out the formatting, then it would revert to the incorrect display, and there is no indication as to the intention of what is desired here that can be discovered programmatically.

If you want to strip out the formatting and add it back in later, are you 100% certain as to the intention of why those characters are there? They aren't always used, so how do you know why they are used when they are present? Don't think there will be any non-Arabic characters? Ok, so how do you classify <202>? I left out the "b" and "c" because there could be punctuation and numbers without any Latin characters and still be "fully Arabic".

This is why I said that keeping them in was the "safest" route to go. Not the only route. But if you don't control the input values, then I don't see how you can guarantee never accidentally altering the meaning of the data.