Mariadb – How to store a bunch of columns and where you don’t know how many there will be, the columns change, and new columns are introduced

database-designeavmariadb

So the so source of the data is the internet, and random bots that submit values to my apache server.

Periodically new columns appear, but they are not always present.
The other problem is then you have a bunch of NULLs which you don't want to store at all if possible.

Later, I would statistically analyze the data, but we are just trying to store it here.

Currently there are 99 columns, with a minimum of 27 and a max of 59, but the max will probably go up slowly.

I am currently using a single master table, but the downside is there are tons of NULLs stored for each record. Lots of wasted space. Presently the most common value is 64 NULLs per row.

So I had an idea, and was wondering is anyone can improve on it.

You need a master table, and every submitted column has a lookup table with an index and a value.

master
index   num_columns table_index
 1        27        1

There would be table for simplicity named tables27 to table100

table27 would be
index , columns_names_csv, col1_index,col2_index,...col27_index

table28 would be much the same except
index,  columns_names_csv, col1_index,col2_index,...col28_index

All the indexes would be integers used to point to the correct values in the corresponding lookup table.

Since they would all be integers there would be no need to worry about data types.

So in addition to the master table and 99 (and growing) lookup tables you would have tables27 through, currently, 59 but that number could eventually increase.

Is there a better way to do this?

For fun here's the table

+-----------------------------------+---------------------+------+-----+---------+----------------+
| Field                             | Type                | Null | Key | Default | Extra          |
+-----------------------------------+---------------------+------+-----+---------+----------------+
| id                                | int(11)             | NO   | PRI | NULL    | auto_increment |
| lstREMOTE_ADDR                    | int(11)             | NO   |     | NULL    |                |
| lstCONTENT_LENGTH                 | int(11)             | YES  |     | NULL    |                |
| lstREQUEST_TIME_FLOAT             | datetime(4)         | NO   |     | NULL    |                |
| lstUNIQUE_ID                      | varchar(128)        | NO   |     | NULL    |                |
| lstSERVER_PORT                    | int(5)              | NO   |     | NULL    |                |
| lstREMOTE_PORT                    | int(5)              | NO   |     | NULL    |                |
| lstREDIRECT_UNIQUE_ID             | int(10) unsigned    | YES  |     | NULL    |                |
| lstREDIRECT_REQUEST_METHOD        | int(10) unsigned    | YES  |     | NULL    |                |
| lstREDIRECT_STATUS                | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_HOST                      | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_USER_AGENT                | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_ACCEPT                    | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_ACCEPT_LANGUAGE           | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_ACCEPT_ENCODING           | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_CONNECTION                | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_COOKIE                    | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_UPGRADE_INSECURE_REQUESTS | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_CACHE_CONTROL             | int(10) unsigned    | YES  |     | NULL    |                |
| lstPATH                           | tinyint(3) unsigned | YES  |     | NULL    |                |
| lstSERVER_SIGNATURE               | tinyint(3) unsigned | YES  |     | NULL    |                |
| lstSERVER_SOFTWARE                | int(10) unsigned    | YES  |     | NULL    |                |
| lstSERVER_NAME                    | int(10) unsigned    | YES  |     | NULL    |                |
| lstSERVER_ADDR                    | int(10) unsigned    | YES  |     | NULL    |                |
| lstDOCUMENT_ROOT                  | tinyint(3) unsigned | YES  |     | NULL    |                |
| lstREQUEST_SCHEME                 | int(10) unsigned    | YES  |     | NULL    |                |
| lstCONTEXT_PREFIX                 | int(10) unsigned    | YES  |     | NULL    |                |
| lstCONTEXT_DOCUMENT_ROOT          | int(10) unsigned    | YES  |     | NULL    |                |
| lstSERVER_ADMIN                   | int(10) unsigned    | YES  |     | NULL    |                |
| lstSCRIPT_FILENAME                | tinyint(3) unsigned | YES  |     | NULL    |                |
| lstREDIRECT_URL                   | int(10) unsigned    | YES  |     | NULL    |                |
| lstGATEWAY_INTERFACE              | int(10) unsigned    | YES  |     | NULL    |                |
| lstSERVER_PROTOCOL                | int(10) unsigned    | YES  |     | NULL    |                |
| lstREQUEST_METHOD                 | int(10) unsigned    | YES  |     | NULL    |                |
| lstQUERY_STRING                   | int(10) unsigned    | YES  |     | NULL    |                |
| lstREQUEST_URI                    | int(10) unsigned    | YES  |     | NULL    |                |
| lstSCRIPT_NAME                    | tinyint(3) unsigned | YES  |     | NULL    |                |
| lstPHP_SELF                       | int(10) unsigned    | YES  |     | NULL    |                |
| lstREDIRECT_spider                | int(10) unsigned    | YES  |     | NULL    |                |
| lstspider                         | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_PRAGMA                    | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_FROM                      | int(10) unsigned    | YES  |     | NULL    |                |
| lstREDIRECT_QUERY_STRING          | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_REFERER                   | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_DNT                       | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_REQUESTED_WITH          | int(10) unsigned    | YES  |     | NULL    |                |
| lstCONTENT_TYPE                   | int(10) unsigned    | YES  |     | NULL    |                |
| lstREDIRECT_ERROR_NOTES           | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_FORWARDED_FOR           | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_IF                        | int(10) unsigned    | YES  |     | NULL    |                |
| lstREDIRECT_HTTPS                 | int(10) unsigned    | YES  |     | NULL    |                |
| lstREDIRECT_SSL_TLS_SNI           | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTPS                          | int(10) unsigned    | YES  |     | NULL    |                |
| lstSSL_TLS_SNI                    | int(10) unsigned    | YES  |     | NULL    |                |
| lstPHP_AUTH_USER                  | int(10) unsigned    | YES  |     | NULL    |                |
| lstPHP_AUTH_PW                    | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_AMP_CACHE_TRANSFORM       | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_ACCEPT_CHARSET            | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_CLIENT_IP                 | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_REMOTE_IP               | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_ORIGINATING_IP          | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_REMOTE_ADDR             | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_REAL_IP                 | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_REVERSE_VIA               | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_VARNISH                 | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_UA_COMPATIBLE           | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_POWERED_BY              | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_TE                        | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_PIPER_ID                | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_UCBROWSER_UA            | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_WAP_PROFILE             | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_EBO_UA                  | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_OPERAMINI_FEATURES      | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_OPERAMINI_PHONE         | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_OPERAMINI_ROUTE         | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_DEVICE_STOCK_UA           | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_OPERAMINI_PHONE_UA      | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_FORWARDED                 | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_ACCEL_INTERNAL          | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_WAP_CONNECTION            | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_CONTENT_OPT             | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_KEEP_ALIVE                | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_APPID                     | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_CLIENTID                  | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_DID                       | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_PLATFORM                  | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_RISKUDID                  | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_SIGNTYPE                  | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_TRACKERID                 | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_WORKSPACEID               | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_VIA                       | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_PROXY_CONNECTION          | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_DEPTH                     | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_TRANSLATE                 | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_MIME_VERSION              | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_VERMEER_CONTENT_TYPE    | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_X_CNECTION                | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_EXPECT                    | int(10) unsigned    | YES  |     | NULL    |                |
| lstHTTP_IF_MODIFIED_SINCE         | int(10) unsigned    | YES  |     | NULL    |                |
+-----------------------------------+---------------------+------+-----+---------+----------------+

Best Answer

If I understand correctly, the table structure you're proposing appears to introduce a large overhead and will also make effective querying impossible.

First of all, note that InnoDB tables can be modified to add, drop and many other operations without locking the table for DML (i.e. SELECT, INSERT, UPDATE, DELETE etc) - see the MariaDB documentation for an InnoDB Online DDL Overview.

Secondly, the idea that NULL values create so much wasted space is actually untrue. With the default storage engine (InnoDB), NULL values only take up a single bit per row. So with 64 NULL values per row, you're only wasting the equivalent of a single INT.

Therefore, stop worrying, and just create the columns in your table in the straightforward, normal way. There is no need for complicated solutions here.

I know this may sound a bit incredible, so I don't expect you to just take my word for it. Therefore, below I've attempted to provide some hard evidence.

EVIDENCE

First of all, consider the following quote is from the MySQL InnoDB row format documentation pages under the section about the COMPACT row format storage characteristics (MariaDB InnoDB now uses the DYNAMIC row format by default, but it's identical to COMPACT with regards to NULL values):

The variable-length part of the record header contains a bit vector for indicating NULL columns. If the number of columns in the index that can be NULL is N, the bit vector occupies CEILING(N/8) bytes. (For example, if there are anywhere from 9 to 16 columns that can be NULL, the bit vector uses two bytes.) Columns that are NULL do not occupy space other than the bit in this vector.

The proof is in the pudding as they say, so let's perform an experiment. First let's create two identical tables and look at how much space they take up on disk:

MariaDB [db_1]> CREATE TABLE table1(id int unsigned PRIMARY KEY, 
value1 int, 
value2 int, 
value3 int, 
value4 int, 
value5 int, 
value6 int, 
value7 int, 
value8 int, 
value9 int, 
value10 int, 
value11 int, 
value12 int, 
value13 int, 
value14 int, 
value15 int, 
value16 int
);
Query OK, 0 rows affected (0.044 sec)

MariaDB [db_1]> CREATE TABLE table2 LIKE table1;
Query OK, 0 rows affected (0.022 sec)

[root@localhost ~]# cd /var/lib/mysql/db_1
[root@localhost db_1]# ll
total 204
-rw-rw---- 1 mysql mysql    65 Aug 25 00:08 db.opt
-rw-rw---- 1 mysql mysql  1379 Aug 25 10:32 table1.frm
-rw-rw---- 1 mysql mysql 98304 Aug 25 10:32 table1.ibd
-rw-rw---- 1 mysql mysql  1379 Aug 25 10:32 table2.frm
-rw-rw---- 1 mysql mysql 98304 Aug 25 10:32 table2.ibd

And then let's populate table1 with integers and table2 with NULLs, 10,000 rows for each:

MariaDB [db_1]> DELIMITER ;;
MariaDB [db_1]> BEGIN NOT ATOMIC
SET @i:=0; 
REPEAT 
  INSERT INTO table1(id, value1, value2, value3, value4, value5, value6,
    value7, value8, value9, value10, value11, value12, value13, value14,
    value15, value16) 
  VALUES (@i, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16); 
  INSERT INTO table2(id) VALUES (@i);
  SET @i := @i + 1; 
UNTIL @i > 10000 END REPEAT; 
END;;
Query OK, 20002 rows affected (1 min 11.007 sec)
MariaDB [db_1]> DELIMITER ;

[root@localhost db_1]# ll
total 9588
-rw-rw---- 1 mysql mysql      65 Aug 25 00:08 db.opt
-rw-rw---- 1 mysql mysql    1379 Aug 25 10:32 table1.frm
-rw-rw---- 1 mysql mysql 9437184 Aug 25 10:36 table1.ibd
-rw-rw---- 1 mysql mysql    1379 Aug 25 10:32 table2.frm
-rw-rw---- 1 mysql mysql  360448 Aug 25 10:36 table2.ibd

As you can see, table1 which is populated with only non-NULL INTs is 26 times larger than table2 which is populated only with NULLs except for the PK column.

We can also compare this with a table3 that has only a PK colum:

CREATE TABLE table3(id int unsigned PRIMARY KEY);
Query OK, 0 rows affected (0.033 sec)
MariaDB [db_1]> DELIMITER ;;
MariaDB [db_1]> BEGIN NOT ATOMIC
SET @i:=0; 
REPEAT 
  INSERT INTO table3(id) VALUES (@i);
  SET @i := @i + 1; 
UNTIL @i > 10000 END REPEAT; 
END;;
Query OK, 10001 rows affected (37.228 sec)

MariaDB [db_1]> DELIMITER ;

And this is just 16K (the size of an InnoDB page) smaller than table2 which has 16 extra INT columns all populated with NULLs - not bad for 10,000 rows:

[root@localhost db_1]# ll table3.*
-rw-rw---- 1 mysql mysql    922 Aug 25 10:48 table3.frm
-rw-rw---- 1 mysql mysql 344064 Aug 25 10:50 table3.ibd

For info, I'm using MariaDB Server 10.4.7 and (mostly) default settings.