Mariadb json store with unicode \u sequences

jsonmariadbnosql

I am using mariadb and I store json data with unicode encoded strings eg \u0395, in a medium text column name remoteData with collation utf8_general_ci.

My problem is where query this column like that

where remoteData like '%ΕΛΕΥΘΕΡΙΟΣ%'

I get an empty set because the value is stored as \uxxxx sequence like this

"patientFirstName":"\u0395\u039b\u0395\u03a5\u0398\u0395\u03a1\u0399\u039f\u03a3"

I have been suggested to use JSON_EXTRACT function like this

WHERE JSON_EXTRACT(`remoteData`, "$.patientFirstName") 

but as far as I know this will ignore any index on the column. The table in two weeks has gone to almast 100000 rows and it is growing fast, so I want to be indexed.

What would be the best approach to handle such data?

Should I change from sql starage to no sql like mongo etc?

Thanks in advance.

Best Answer

\u0395 is just 6 characters to MariaDB. Searching for the 5-character "word" u0395 would succeed.

You need to change the JSON strings to have characters, not unicode codepoint encoding.

In PHP, I would consider write a 1-time script to bring in each line, decode it, and write it back out.

Meanwhile, the writing should be changed to

json_encode($stuff, JSON_UNESCAPED_UNICODE)

(or the equivalent in your app's language).

IntlChar::chr may be part of the solution.

Or: $utf8string = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $string), ENT_NOQUOTES, 'UTF-8');

Or: https://stackoverflow.com/questions/6058394/unicode-character-in-php-string

Or: https://stackoverflow.com/questions/2934563/how-to-decode-unicode-escape-sequences-like-u00ed-to-proper-utf-8-encoded-cha