MySQL – the most efficient way to store a bunch of empty fields, and ones with only a few choices

MySQL, storage

When I say efficient, I mean total storage (KB), without excessively burdening the CPU.

I have 70 columns of data. The last 35 fields have always been NULL.

  • The table has only 4 columns of completely unique data.
  • The rest of the fields have only a few values.
  • One field has only GET and POST; there are 4-5 other possible values that never occur.
  • The columns are 69 varchar/bigint/int and 2 blob. (I am open to change.)
    The blobs are text, but contain too many different character sets, or random junk from hackers, so rather than figuring that out I made them binary and was done.
  • Most of the other columns have 4-5 DISTINCT values each.

    1. At what point does a lookup column make sense?
      That is, another table with ID and Value columns, with only the ID stored in the first table. For example, if a column only ever holds GET or POST, by the time you add the overhead of the lookup, does it make sense to even do this?

      Should I use an ENUM or something similar?

      The table will eventually hold 100,000+ rows.

      How many bytes does a value field need to be before the whole lookup actually saves space?

      What is the best way to handle the 35 blank columns in terms of storage?
      Create a separate table keyed by an ID that holds the 35 columns, with one row that is all NULL. Then the main table just stores ID 1, and if a different combination of values ever comes along, it becomes ID 2?

    One column is always "Apache"; another is only ever "keep-alive", "closed", or NULL.
    Encoding is "gzip, deflate, sdch", "identity, gzip, deflate", "deflate, gzip", "gzip,deflate,br", "gzip,deflate", "gzip, deflate, sdch, identity", "gzip", "identity", NULL, "x-gzip, gzip, deflate", "gzip", or "deflate".

Answering questions:

  1. How is that table used?

    Backend to Apache via PHP. I will be looking for patterns and marking them with a score indicating the probability of bad versus good.

  2. How do you query the database?

    Mainly via PHP (prepared statements); I will probably put a web GUI on it.

  3. When you perform queries that use this table: Do you need all the values for all the columns?

    No, certainly not the 35 blank columns.

    9 of the columns always have the same single value, and I will only care if they are ever different. Maybe I will set up a trigger for when they differ, but otherwise I will probably ignore those columns.

    So 70 - 35 - 9 = 26 or fewer frequently queried columns.

Column    Type    Comment
REDIRECT_UNIQUE_ID    varchar(50) NULL     
REDIRECT_REQUEST_METHOD   varchar(20) NULL     
REDIRECT_STATUS   int(11) NULL     
UNIQUE_ID varchar(50) NULL     
HTTP_HOST varchar(30) NULL     
HTTP_USER_AGENT   varchar(300) NULL    
HTTP_ACCEPT   varchar(300) NULL    
HTTP_ACCEPT_LANGUAGE  varchar(120) NULL    
HTTP_ACCEPT_ENCODING  varchar(60) NULL     
HTTP_COOKIE   blob NULL    
HTTP_CONNECTION   varchar(30) NULL     
HTTP_UPGRADE_INSECURE_REQUESTS    varchar(10) NULL     
HTTP_CACHE_CONTROL    varchar(50) NULL     
PATH  varchar(300) NULL    
SERVER_SIGNATURE  varchar(300) NULL    
SERVER_SOFTWARE   varchar(20) NULL     
SERVER_NAME   varchar(50) NULL     
SERVER_ADDR   varchar(30) NULL     
SERVER_PORT   varchar(5) NULL  
REMOTE_ADDR   bigint(40) NULL  
DOCUMENT_ROOT varchar(20) NULL     
REQUEST_SCHEME    varchar(20) NULL     
CONTEXT_PREFIX    varchar(300) NULL    
CONTEXT_DOCUMENT_ROOT varchar(50) NULL     
SERVER_ADMIN  varchar(40) NULL     
SCRIPT_FILENAME   varchar(110) NULL    
REMOTE_PORT   int(11) NULL     
REDIRECT_URL  blob NULL    
GATEWAY_INTERFACE varchar(15) NULL     
SERVER_PROTOCOL   varchar(170) NULL    
REQUEST_METHOD    varchar(20) NULL     
QUERY_STRING  varchar(123) NULL    
REQUEST_URI   blob NULL    
SCRIPT_NAME   varchar(100) NULL    
PHP_SELF  varchar(75) NULL     
REQUEST_TIME_FLOAT    decimal(15,4) NULL   
REQUEST_TIME  int(15) unsigned NULL    
HTTP_PRAGMA   varchar(300) NULL    
LESSKEY   varchar(300) NULL    
NNTPSERVER    varchar(300) NULL    
MANPATH   varchar(300) NULL    
XDG_SESSION_ID    varchar(300) NULL    
HOSTNAME  varchar(300) NULL    
XKEYSYMDB varchar(300) NULL    
HOST  varchar(300) NULL    
TERM  varchar(300) NULL    
SHELL varchar(300) NULL    
PROFILEREAD   varchar(300) NULL    
HISTSIZE  varchar(300) NULL    
SSH_CLIENT    varchar(300) NULL    
MORE  varchar(300) NULL    
SSH_TTY   varchar(300) NULL    
USER  varchar(300) NULL    
LS_COLORS varchar(300) NULL    
XNLSPATH  varchar(300) NULL    
HOSTTYPE  varchar(300) NULL    
CONFIG_SITE   varchar(300) NULL    
FROM_HEADER   varchar(300) NULL    
PAGER varchar(300) NULL    
CSHEDIT   varchar(300) NULL    
XDG_CONFIG_DIRS   varchar(300) NULL    
LIBGL_DEBUG   varchar(300) NULL    
MINICOM   varchar(300) NULL    
MAIL  varchar(300) NULL    
CPU   varchar(300) NULL    
INPUTRC   varchar(300) NULL    
PWD   varchar(300) NULL    
LANG  varchar(300) NULL    
PYTHONSTARTUP varchar(300) NULL    
GPG_TTY   varchar(300) NULL    
SHLVL varchar(300) NULL    
HTTP_USER_AGENT_I bigint(20) NULL  
SSH_CONNECTION    varchar(300) NULL    
LESSCLOSE varchar(300) NULL    
WINDOWMANAGER varchar(300) NULL    
REDIRECT_spider   varchar(300) NULL    
spider    varchar(300) NULL    
CONTENT_TYPE  varchar(50) NULL     
HTTP_REFERER  varchar(50) NULL     
HTTP_ACCEPT_CHARSET   varchar(75) NULL     
HTTP_DNT  varchar(10) NULL

Best Answer

"5-6 distinct values" -- TINYINT UNSIGNED or ENUM.

I lean toward ENUM if the options are unlikely to change over time. An ENUM NOT NULL takes 1 byte (for up to 255 options). Consider making the first enum option be "unspecified" (or whatever) instead of making the column NULLable. Note: commas inside an ENUM option are easy to misread; I suggest you replace them with some other character.
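
A minimal sketch of what that could look like for the request-method column; the table name and the exact value list are assumptions, so adjust them to the methods you actually see:

    -- Sketch only: table name and value list are assumptions.
    -- Making 'unspecified' the first option avoids needing NULL.
    ALTER TABLE request_log
      MODIFY COLUMN REQUEST_METHOD
        ENUM('unspecified', 'GET', 'POST', 'HEAD', 'OPTIONS')
        NOT NULL DEFAULT 'unspecified';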

Consider dynamic normalization. That is, when a string (such as "x-gzip, gzip, deflate") comes in, look it up in a table and find (or create) an id; then use that id as a TINYINT UNSIGNED or SMALLINT UNSIGNED in the main table. (If you choose this approach, we should discuss IODKU -- INSERT ... ON DUPLICATE KEY UPDATE -- and the "burning" of ids.)
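
A minimal sketch of that find-or-create pattern, assuming a hypothetical lookup table named encoding for the HTTP_ACCEPT_ENCODING strings:

    -- Hypothetical lookup table; one row per distinct encoding string.
    CREATE TABLE encoding (
        id  SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
        val VARCHAR(60) NOT NULL,
        PRIMARY KEY (id),
        UNIQUE KEY (val)
    ) ENGINE=InnoDB;

    -- IODKU: insert the string if it is new, otherwise reuse the existing row.
    -- LAST_INSERT_ID(id) exposes the id either way, so the client can
    -- store it in the main table.
    INSERT INTO encoding (val)
        VALUES ('x-gzip, gzip, deflate')
        ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
    SELECT LAST_INSERT_ID();

The "burning" caveat: with this pattern the AUTO_INCREMENT counter can advance even when the row already exists, so ids get consumed faster than rows are created; SMALLINT UNSIGNED still leaves plenty of headroom for a handful of distinct strings.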

Sparse data -- There are multiple options.

  • NULLable columns: 1 bit per column when NULL.
  • JSON -- this is open-ended, and rather compact if most of the possible columns are missing (and you don't mention them in the JSON). I also like to compress (in the client) the JSON and put it into a BLOB -- this gives even more shrinkage (typically 3x). Do not plan on using anything from within the JSON in a WHERE clause.
  • Separate table(s). Let me fabricate an example: Suppose some small percentage of rows have an image associated with them. And there are 10 'columns' associated with each image. Make a table of images with 11 columns, including an id back to the main table. Then use LEFT JOIN when you need the image-related columns (a sketch follows this list).
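
Here is a sketch of that fabricated image example, with hypothetical table and column names:

    -- Main table: every row lives here.
    CREATE TABLE main (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT,
        -- ... the always-present columns ...
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;

    -- Image table: only the small percentage of rows that have an image
    -- get a row here (an id column plus the ~10 image-related columns).
    CREATE TABLE main_image (
        main_id INT UNSIGNED NOT NULL,   -- id back to the main table
        width   SMALLINT UNSIGNED NULL,
        height  SMALLINT UNSIGNED NULL,
        -- ... the rest of the image columns ...
        PRIMARY KEY (main_id)
    ) ENGINE=InnoDB;

    -- LEFT JOIN only when the image-related columns are needed;
    -- rows without an image simply come back with NULLs.
    SELECT m.*, i.width, i.height
    FROM   main AS m
    LEFT JOIN main_image AS i ON i.main_id = m.id;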

70 columns is a lot. But not necessarily too many.

100K rows is probably less than 1GB, even without my suggestions. Is that "too big"?

If this is a "log", will you be summarizing stuff like COUNT(*), SUM(...), etc over a day/week? That is another topic to discuss.

Specifics

(Now seeing the column names...)

  • Don't normalize "continuous" values, such as TIMEs;
  • Things like AGENT work well in a lookup table. You might want to use SMALLINT UNSIGNED (2 bytes; 64K max) instead of TINYINT just in case there is a lot of variety. HOST might fit here, depending on your traffic.
  • But... If a value is usually missing, it may be more efficient to put it into JSON -- bigger, but less frequent, than the always-present 2-byte SMALLINT.
  • That many VARCHAR(300)s may prevent you from using InnoDB, which you should use. Switching to JSON for many of them avoids this limit (see the sketch after this list).
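
A sketch of the JSON-in-a-BLOB idea, using a hypothetical request_log table and extra_json column; MySQL's COMPRESS()/UNCOMPRESS() stand in here for compressing in the client as suggested above:

    -- The rarely-present environment columns travel as one compressed
    -- JSON blob instead of dozens of mostly-NULL VARCHAR(300) columns.
    ALTER TABLE request_log ADD COLUMN extra_json BLOB NULL;

    -- Illustrative values only.
    INSERT INTO request_log (UNIQUE_ID, extra_json)
    VALUES ('WxYz1234',
            COMPRESS('{"SSH_CLIENT": "203.0.113.5 52314 22", "SHELL": "/bin/bash"}'));

    -- Retrieval; as noted above, do not plan on using anything inside
    -- the JSON in a WHERE clause.
    SELECT UNIQUE_ID, UNCOMPRESS(extra_json) AS extra
    FROM   request_log
    WHERE  UNIQUE_ID = 'WxYz1234';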

But... Do you really need the data? It seems like you only need a little bit of it for statistics; the rest could simply be written to a flat file, not a database. That way, you get the stats easily accessible, and the raw data more compact and still accessible if you decide you missed something. While you are at it, don't save one line per 'request'; summarize the data for each, say, hour. That will save more space and 'report' time.
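
For the summarizing idea, a minimal hourly roll-up sketch, assuming a hypothetical request_summary table and that REQUEST_TIME holds a Unix timestamp:

    -- One row per hour instead of one row per request.
    CREATE TABLE request_summary (
        hr       DATETIME NOT NULL,        -- truncated to the hour
        requests INT UNSIGNED NOT NULL,
        PRIMARY KEY (hr)
    ) ENGINE=InnoDB;

    -- Re-run periodically (e.g. from cron); re-summarizing an hour
    -- simply overwrites its count.
    INSERT INTO request_summary (hr, requests)
    SELECT  FROM_UNIXTIME(REQUEST_TIME - REQUEST_TIME % 3600),
            COUNT(*)
    FROM    request_log
    GROUP BY 1
    ON DUPLICATE KEY UPDATE requests = VALUES(requests);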