I am using mariadb and I store json data with unicode encoded strings eg \u0395, in a medium text column name remoteData with collation utf8_general_ci.
My problem is where query this column like that
where remoteData like '%ΕΛΕΥΘΕΡΙΟΣ%'
I get an empty set because the value is stored as \uxxxx sequence like this
"patientFirstName":"\u0395\u039b\u0395\u03a5\u0398\u0395\u03a1\u0399\u039f\u03a3"
I have been suggested to use JSON_EXTRACT function like this
WHERE JSON_EXTRACT(`remoteData`, "$.patientFirstName")
but as far as I know this will ignore any index on the column. The table in two weeks has gone to almast 100000 rows and it is growing fast, so I want to be indexed.
What would be the best approach to handle such data?
Should I change from sql starage to no sql like mongo etc?
Thanks in advance.
Best Answer
\u0395
is just 6 characters to MariaDB. Searching for the 5-character "word"u0395
would succeed.You need to change the JSON strings to have characters, not unicode codepoint encoding.
In PHP, I would consider write a 1-time script to bring in each line, decode it, and write it back out.
Meanwhile, the writing should be changed to
(or the equivalent in your app's language).
IntlChar::chr
may be part of the solution.Or:
$utf8string = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $string), ENT_NOQUOTES, 'UTF-8');
Or: https://stackoverflow.com/questions/6058394/unicode-character-in-php-string
Or: https://stackoverflow.com/questions/2934563/how-to-decode-unicode-escape-sequences-like-u00ed-to-proper-utf-8-encoded-cha