So the so source of the data is the internet, and random bots that submit values to my apache server.
Periodically new columns appear, but they are not always present.
The other problem is then you have a bunch of NULLs which you don't want to store at all if possible.
Later, I would statistically analyze the data, but we are just trying to store it here.
Currently there are 99 columns, with a minimum of 27 and a max of 59, but the max will probably go up slowly.
I am currently using a single master table, but the downside is there are tons of NULLs stored for each record. Lots of wasted space. Presently the most common value is 64 NULLs per row.
So I had an idea, and was wondering is anyone can improve on it.
You need a master table, and every submitted column has a lookup table with an index and a value.
master
index num_columns table_index
1 27 1
There would be table for simplicity named tables27 to table100
table27 would be
index , columns_names_csv, col1_index,col2_index,...col27_index
table28 would be much the same except
index, columns_names_csv, col1_index,col2_index,...col28_index
All the indexes would be integers used to point to the correct values in the corresponding lookup table.
Since they would all be integers there would be no need to worry about data types.
So in addition to the master table and 99 (and growing) lookup tables you would have tables27 through, currently, 59 but that number could eventually increase.
Is there a better way to do this?
For fun here's the table
+-----------------------------------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------------------------+---------------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| lstREMOTE_ADDR | int(11) | NO | | NULL | |
| lstCONTENT_LENGTH | int(11) | YES | | NULL | |
| lstREQUEST_TIME_FLOAT | datetime(4) | NO | | NULL | |
| lstUNIQUE_ID | varchar(128) | NO | | NULL | |
| lstSERVER_PORT | int(5) | NO | | NULL | |
| lstREMOTE_PORT | int(5) | NO | | NULL | |
| lstREDIRECT_UNIQUE_ID | int(10) unsigned | YES | | NULL | |
| lstREDIRECT_REQUEST_METHOD | int(10) unsigned | YES | | NULL | |
| lstREDIRECT_STATUS | int(10) unsigned | YES | | NULL | |
| lstHTTP_HOST | int(10) unsigned | YES | | NULL | |
| lstHTTP_USER_AGENT | int(10) unsigned | YES | | NULL | |
| lstHTTP_ACCEPT | int(10) unsigned | YES | | NULL | |
| lstHTTP_ACCEPT_LANGUAGE | int(10) unsigned | YES | | NULL | |
| lstHTTP_ACCEPT_ENCODING | int(10) unsigned | YES | | NULL | |
| lstHTTP_CONNECTION | int(10) unsigned | YES | | NULL | |
| lstHTTP_COOKIE | int(10) unsigned | YES | | NULL | |
| lstHTTP_UPGRADE_INSECURE_REQUESTS | int(10) unsigned | YES | | NULL | |
| lstHTTP_CACHE_CONTROL | int(10) unsigned | YES | | NULL | |
| lstPATH | tinyint(3) unsigned | YES | | NULL | |
| lstSERVER_SIGNATURE | tinyint(3) unsigned | YES | | NULL | |
| lstSERVER_SOFTWARE | int(10) unsigned | YES | | NULL | |
| lstSERVER_NAME | int(10) unsigned | YES | | NULL | |
| lstSERVER_ADDR | int(10) unsigned | YES | | NULL | |
| lstDOCUMENT_ROOT | tinyint(3) unsigned | YES | | NULL | |
| lstREQUEST_SCHEME | int(10) unsigned | YES | | NULL | |
| lstCONTEXT_PREFIX | int(10) unsigned | YES | | NULL | |
| lstCONTEXT_DOCUMENT_ROOT | int(10) unsigned | YES | | NULL | |
| lstSERVER_ADMIN | int(10) unsigned | YES | | NULL | |
| lstSCRIPT_FILENAME | tinyint(3) unsigned | YES | | NULL | |
| lstREDIRECT_URL | int(10) unsigned | YES | | NULL | |
| lstGATEWAY_INTERFACE | int(10) unsigned | YES | | NULL | |
| lstSERVER_PROTOCOL | int(10) unsigned | YES | | NULL | |
| lstREQUEST_METHOD | int(10) unsigned | YES | | NULL | |
| lstQUERY_STRING | int(10) unsigned | YES | | NULL | |
| lstREQUEST_URI | int(10) unsigned | YES | | NULL | |
| lstSCRIPT_NAME | tinyint(3) unsigned | YES | | NULL | |
| lstPHP_SELF | int(10) unsigned | YES | | NULL | |
| lstREDIRECT_spider | int(10) unsigned | YES | | NULL | |
| lstspider | int(10) unsigned | YES | | NULL | |
| lstHTTP_PRAGMA | int(10) unsigned | YES | | NULL | |
| lstHTTP_FROM | int(10) unsigned | YES | | NULL | |
| lstREDIRECT_QUERY_STRING | int(10) unsigned | YES | | NULL | |
| lstHTTP_REFERER | int(10) unsigned | YES | | NULL | |
| lstHTTP_DNT | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_REQUESTED_WITH | int(10) unsigned | YES | | NULL | |
| lstCONTENT_TYPE | int(10) unsigned | YES | | NULL | |
| lstREDIRECT_ERROR_NOTES | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_FORWARDED_FOR | int(10) unsigned | YES | | NULL | |
| lstHTTP_IF | int(10) unsigned | YES | | NULL | |
| lstREDIRECT_HTTPS | int(10) unsigned | YES | | NULL | |
| lstREDIRECT_SSL_TLS_SNI | int(10) unsigned | YES | | NULL | |
| lstHTTPS | int(10) unsigned | YES | | NULL | |
| lstSSL_TLS_SNI | int(10) unsigned | YES | | NULL | |
| lstPHP_AUTH_USER | int(10) unsigned | YES | | NULL | |
| lstPHP_AUTH_PW | int(10) unsigned | YES | | NULL | |
| lstHTTP_AMP_CACHE_TRANSFORM | int(10) unsigned | YES | | NULL | |
| lstHTTP_ACCEPT_CHARSET | int(10) unsigned | YES | | NULL | |
| lstHTTP_CLIENT_IP | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_REMOTE_IP | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_ORIGINATING_IP | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_REMOTE_ADDR | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_REAL_IP | int(10) unsigned | YES | | NULL | |
| lstHTTP_REVERSE_VIA | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_VARNISH | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_UA_COMPATIBLE | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_POWERED_BY | int(10) unsigned | YES | | NULL | |
| lstHTTP_TE | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_PIPER_ID | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_UCBROWSER_UA | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_WAP_PROFILE | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_EBO_UA | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_OPERAMINI_FEATURES | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_OPERAMINI_PHONE | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_OPERAMINI_ROUTE | int(10) unsigned | YES | | NULL | |
| lstHTTP_DEVICE_STOCK_UA | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_OPERAMINI_PHONE_UA | int(10) unsigned | YES | | NULL | |
| lstHTTP_FORWARDED | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_ACCEL_INTERNAL | int(10) unsigned | YES | | NULL | |
| lstHTTP_WAP_CONNECTION | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_CONTENT_OPT | int(10) unsigned | YES | | NULL | |
| lstHTTP_KEEP_ALIVE | int(10) unsigned | YES | | NULL | |
| lstHTTP_APPID | int(10) unsigned | YES | | NULL | |
| lstHTTP_CLIENTID | int(10) unsigned | YES | | NULL | |
| lstHTTP_DID | int(10) unsigned | YES | | NULL | |
| lstHTTP_PLATFORM | int(10) unsigned | YES | | NULL | |
| lstHTTP_RISKUDID | int(10) unsigned | YES | | NULL | |
| lstHTTP_SIGNTYPE | int(10) unsigned | YES | | NULL | |
| lstHTTP_TRACKERID | int(10) unsigned | YES | | NULL | |
| lstHTTP_WORKSPACEID | int(10) unsigned | YES | | NULL | |
| lstHTTP_VIA | int(10) unsigned | YES | | NULL | |
| lstHTTP_PROXY_CONNECTION | int(10) unsigned | YES | | NULL | |
| lstHTTP_DEPTH | int(10) unsigned | YES | | NULL | |
| lstHTTP_TRANSLATE | int(10) unsigned | YES | | NULL | |
| lstHTTP_MIME_VERSION | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_VERMEER_CONTENT_TYPE | int(10) unsigned | YES | | NULL | |
| lstHTTP_X_CNECTION | int(10) unsigned | YES | | NULL | |
| lstHTTP_EXPECT | int(10) unsigned | YES | | NULL | |
| lstHTTP_IF_MODIFIED_SINCE | int(10) unsigned | YES | | NULL | |
+-----------------------------------+---------------------+------+-----+---------+----------------+
Best Answer
If I understand correctly, the table structure you're proposing appears to introduce a large overhead and will also make effective querying impossible.
First of all, note that InnoDB tables can be modified to add, drop and many other operations without locking the table for DML (i.e.
SELECT
,INSERT
,UPDATE
,DELETE
etc) - see the MariaDB documentation for an InnoDB Online DDL Overview.Secondly, the idea that
NULL
values create so much wasted space is actually untrue. With the default storage engine (InnoDB),NULL
values only take up a single bit per row. So with 64NULL
values per row, you're only wasting the equivalent of a singleINT
.Therefore, stop worrying, and just create the columns in your table in the straightforward, normal way. There is no need for complicated solutions here.
I know this may sound a bit incredible, so I don't expect you to just take my word for it. Therefore, below I've attempted to provide some hard evidence.
EVIDENCE
First of all, consider the following quote is from the MySQL InnoDB row format documentation pages under the section about the
COMPACT
row format storage characteristics (MariaDB InnoDB now uses theDYNAMIC
row format by default, but it's identical toCOMPACT
with regards toNULL
values):The proof is in the pudding as they say, so let's perform an experiment. First let's create two identical tables and look at how much space they take up on disk:
And then let's populate
table1
with integers andtable2
withNULL
s, 10,000 rows for each:As you can see,
table1
which is populated with only non-NULL
INT
s is 26 times larger thantable2
which is populated only withNULL
s except for the PK column.We can also compare this with a
table3
that has only a PK colum:And this is just 16K (the size of an InnoDB page) smaller than
table2
which has 16 extraINT
columns all populated withNULL
s - not bad for 10,000 rows:For info, I'm using MariaDB Server 10.4.7 and (mostly) default settings.