MySQL – UTF-8 vs ASCII or a custom binary format: high-performance optimization on very big tables

mariadb, MySQL, nosql, utf-8

In summary, my question is whether there are advantages, even minimal ones, to using ASCII, or even a smaller format created specifically for this situation, instead of UTF-8 for strings.

It would be possible to have the webserver that accesses the data translate an ASCII string into a more compact format.

For this question, only 1-3 tables in the database handle a massive amount of data, so every byte can make a difference if the data can fit in memory instead of being read from disk. The information will be accessed via a RESTful service.

Reads vs Writes:

The project needs many more reads than writes. But writing has a peculiarity: every 10 seconds, between 40 and 300 new rows are inserted into the main table. These could be written in parallel, since they do not depend on each other.

Memory vs Disk usage:

Recently inserted rows, which will be used immediately, will also be placed in a cache for the webservice to use, so there is no need to read them again. But searches on old records will need the database, and they should be fast.

This is why I suppose that using fewer bytes to store some fields will make a difference: even with a larger amount of data, it will be easier to fit in memory.

If I cannot fit the data in memory and the database cannot give me enough speed, I will either have to force slow table/partition scans every 10 seconds for just one user, or do a single select and cache it on the webserver, which breaks the "stateless" principle of REST.

Characters that must be supported

0-9, A-Z, "-", "_". Maybe "a-z" will be needed as well. Just 38 or 64 characters, and never more than that.

For now, most columns are

CHAR(3), CHAR(6), VARCHAR(8), VARCHAR(10). 

Examples:

  • ABC, XYZ
  • AB-EFG, XY-XPT
  • ABC123457
  • E47F6C, 34210A, E48D37 (hex strings; maybe a format specific to this exists?)
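
For illustration only, the columns look roughly like this (table and column names are invented here; the real schema differs), and the open question is whether to declare them with CHARACTER SET ascii instead of utf8:

  -- Sketch only: names are invented for illustration.
  CREATE TABLE sample_codes (
      code_a CHAR(3),                 -- e.g. 'ABC', 'XYZ'
      code_b CHAR(6),                 -- e.g. 'AB-EFG', 'XY-XPT'
      code_c VARCHAR(8),
      code_d VARCHAR(10),             -- e.g. 'ABC123457'
      hex_id CHAR(6)                  -- e.g. 'E47F6C', '34210A', 'E48D37'
  ) DEFAULT CHARACTER SET utf8;       -- or CHARACTER SET ascii: that is the question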

Technologies used

The database will be MariaDB. Part of the raw data may live in some NoSQL database. The language of the webservice does not really make a difference here, but it will be PHP 5.4 with the Phalcon PHP framework.

Different types of cache could be used, from Varnish to APC/Memcached to caches inside the database.

My initial hypothesis:

A VARCHAR(10) field needs 80 bits. If I use a custom alphabet of just 64 or fewer characters instead of 255, I save around 4 times the space, so instead of 80 bits it would use just 20 bits.
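
One way to sanity-check this hypothesis directly in MariaDB (a sketch, reusing the invented sample_codes/code_d names from above): CHAR_LENGTH() counts characters while LENGTH() counts bytes, so comparing the encodings is a single query.

  SELECT code_d                              AS value,
         CHAR_LENGTH(code_d)                 AS num_chars,
         LENGTH(CONVERT(code_d USING utf8))  AS bytes_as_utf8,
         LENGTH(CONVERT(code_d USING ascii)) AS bytes_as_ascii
  FROM   sample_codes;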

Best Answer

Your list of characters that must be supported clearly indicates you need nothing more than plain ascii.

If you want to store this as text, then ascii is your most compact option. But here are a few clarifications:

  • VARCHAR(10) does not "need" 80 bits. It may need 80 bits, if all 10 characters are used, under the ascii character set. If you only store 3 characters (e.g. 'abc'), then it only needs 24 bits (plus a one-byte length prefix).

  • utf8 does not take more space than ascii when only ASCII characters are used. 'abc' is 3 bytes long in both the utf8 and ascii encodings. That's why it's called UTF-8: it attempts to use only 8 bits per character when possible.

  • However, in temporary tables (vanilla MySQL; solved in Percona Server; I'm not sure about MariaDB) a utf8 character takes 3 bytes no matter what; the same goes for MEMORY tables. So it is best to use ascii if it fits your needs (both points are illustrated in the first sketch after this list).

  • You could compress further. You can use the COMPRESS() function, for example, or encode via your own method: if you only need 64 different characters, each one fits in 6 bits, so for every 3 bytes (24 bits) you use today you could squeeze in a fourth character (using the 2 spare bits from each of the 3 bytes). So you can certainly compress by 25%, and possibly more. But this leaves you with BINARY/VARBINARY types, which are not as easy to work with: you will always have to compress/decompress, and you will not be able to index the text (alphabetically, that is; you can certainly put indexes on the column). See the second sketch after this list.

  • The rest of the tools you mentioned are imho irrelevant; by the time your data reaches varnishd, your texts are uncompressed. Possibly so for PHP as well.
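
First sketch, illustrating the utf8/ascii and MEMORY-table points (a minimal example; the table name is invented, not your schema):

  -- 'abc' is 3 bytes in both encodings, because utf8 uses one byte per ASCII character.
  SELECT LENGTH(CONVERT('abc' USING utf8))  AS bytes_utf8,   -- 3
         LENGTH(CONVERT('abc' USING ascii)) AS bytes_ascii;  -- 3

  -- In MEMORY (and implicit temporary) tables, a utf8 VARCHAR is stored at its full
  -- width of 3 bytes per character, so an explicit ascii declaration keeps rows small:
  CREATE TABLE lookup_cache (
      code VARCHAR(10) CHARACTER SET ascii
  ) ENGINE = MEMORY;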
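
Second sketch, illustrating the COMPRESS()/VARBINARY route (names invented again). Bear in mind that COMPRESS() adds a few bytes of zlib overhead, so it only pays off on values longer than these short codes; for CHAR(3)-VARCHAR(10) columns, the hand-rolled 6-bit packing described above is where the 25% comes from:

  CREATE TABLE packed_codes (
      code_zip VARBINARY(64),   -- opaque bytes: no alphabetical ordering or text search
      INDEX (code_zip)          -- but a plain index on the column is still possible
  );

  -- Write: compress on the way in.
  INSERT INTO packed_codes (code_zip) VALUES (COMPRESS('ABC123457'));

  -- Read: every SELECT has to decompress again.
  SELECT UNCOMPRESS(code_zip) AS code FROM packed_codes;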