Mysql – Analyzing 126 columns for patterns to create similiar tables to avoid storing lots of NULLS

linuxMySQLPHP

background info a php script is feeding the mysql table.

I could do it manually, but am hoping for at least some kind of automation.

1     12342 64     0
2     12347 65     0
3     11364 66     0
4     12281 67     0
5     12342 68     0
6     12347 69     0
7     11314 70     12347
8     2922  71     0
9     2351  72     0
10     2639 73     0
11     698  74     283
12     10852    75     283
13     362  76     109
14     568  77     107
15     12347    78     19
16     9504 79     139
17     12347    80     135
18     12347    81     28
19     12342    82     22
20     12342    83     66
21     12223    84     60
22     12347    85     123
23     12347    86     316
24     12347    87     139
25     12347    88     22274
26     12347    89     84
27     12342    90     84
28     11395    91     3
29     12347    92     1
30     12347    93     0
31     12347    94     0
32     9504 95     0
33     12347    96     0
34     12347    97     0
35     9846 98     0
36     101  99     0
37     0    100     8
38     0    101     0
39     0    102     1
40     0    103     1
41     0    104     0
42     0    105     0
43     0    106     0
44     0    107     0
45     0    108     0
46     0    109     0
47     0    110     0
48     0    111     0
49     0    112     0
50     0    113     0
51     0    114     10
52     0    115     1
53     0    116     1
54     0    117     1
55     0    118     1
56     0    119     1
57     0    120     8
58     0    121     1
59     0    122     1
60     0    123     1
61     0    124     1
62     0    125     3
63     0    126     1

The first and third columns are column numbers and 2nd and fourth are total number of values in each column that are not null.

I want to split up my table into 1 table with columns that are always (12347 records) present or almost always present. Then a column indicating a template number and an index field.

Then I need to analyze the rest of the fields to see what patterns emerge to create additional tables to hold the data that is sometimes present.

Not real info just an example
For example if column 1 = "something" column 2 ="something" and etc then columns 80-90 are present.

What I want to do then is have a table with an index number and 10 columns. Then main table will have a template number and an index number. Where the template number will probably be the name of the table to look in or a numeric index of a list of names.

Is there an automatic tool to analyze the data to determine what conditions are necessary for the optional/sparsely populated fields to have data.

It took over an hours work to determine that there are 28 columns that are nearly always there, 53 with no values, and the rest somewhere in the middle.

Show create table

www_error | CREATE TABLE `www_error` (
  `REQUEST_TIME_FLOAT` decimal(15,4) DEFAULT NULL,
  `REQUEST_URI` int(11) DEFAULT NULL,
  `REDIRECT_UNIQUE_ID` varchar(50) DEFAULT NULL,
  `REDIRECT_REQUEST_METHOD` varchar(18) DEFAULT NULL,
  `REDIRECT_STATUS` int(11) DEFAULT NULL,
  `UNIQUE_ID` varchar(50) DEFAULT NULL,
  `HTTP_HOST` varchar(33) DEFAULT NULL,
  `HTTP_ACCEPT` varchar(300) DEFAULT NULL,
  `HTTP_ACCEPT_LANGUAGE` varchar(120) DEFAULT NULL,
  `HTTP_ACCEPT_ENCODING` varchar(60) DEFAULT NULL,
  `HTTP_COOKIE` blob,
  `HTTP_CONNECTION` varchar(14) DEFAULT NULL,
  `HTTP_UPGRADE_INSECURE_REQUESTS` varchar(10) DEFAULT NULL,
  `HTTP_CACHE_CONTROL` varchar(50) DEFAULT NULL,
  `PATH` int(11) DEFAULT NULL,
  `SERVER_SIGNATURE` varchar(300) DEFAULT NULL,
  `SERVER_SOFTWARE` varchar(20) DEFAULT NULL,
  `SERVER_NAME` varchar(33) DEFAULT NULL,
  `SERVER_ADDR` varchar(30) DEFAULT NULL,
  `SERVER_PORT` varchar(5) DEFAULT NULL,
  `REMOTE_ADDR` bigint(40) DEFAULT NULL,
  `DOCUMENT_ROOT` int(11) DEFAULT NULL,
  `REQUEST_SCHEME` varchar(20) DEFAULT NULL,
  `CONTEXT_PREFIX` int(11) DEFAULT NULL,
  `CONTEXT_DOCUMENT_ROOT` int(11) DEFAULT NULL,
  `SCRIPT_FILENAME` int(11) DEFAULT NULL,
  `REMOTE_PORT` int(11) DEFAULT NULL,
  `REDIRECT_URL` blob,
  `GATEWAY_INTERFACE` varchar(15) DEFAULT NULL,
  `SERVER_PROTOCOL` varchar(170) DEFAULT NULL,
  `REQUEST_METHOD` varchar(20) DEFAULT NULL,
  `QUERY_STRING` varchar(123) DEFAULT NULL,
  `SCRIPT_NAME` int(11) DEFAULT NULL,
  `PHP_SELF` int(11) DEFAULT NULL,
  `REQUEST_TIME` int(15) unsigned DEFAULT NULL,
  `HTTP_PRAGMA` varchar(10) DEFAULT NULL,
  `LESSKEY` varchar(16) DEFAULT NULL,
  `NNTPSERVER` varchar(4) DEFAULT NULL,
  `MANPATH` varchar(59) DEFAULT NULL,
  `XDG_SESSION_ID` varchar(1) DEFAULT NULL,
  `HOSTNAME` varchar(4) DEFAULT NULL,
  `XKEYSYMDB` varchar(28) DEFAULT NULL,
  `HOST` varchar(4) DEFAULT NULL,
  `TERM` varchar(5) DEFAULT NULL,
  `SHELL` varchar(9) DEFAULT NULL,
  `PROFILEREAD` varchar(4) DEFAULT NULL,
  `HISTSIZE` varchar(4) DEFAULT NULL,
  `SSH_CLIENT` varchar(22) DEFAULT NULL,
  `MORE` varchar(3) DEFAULT NULL,
  `SSH_TTY` varchar(10) DEFAULT NULL,
  `USER` varchar(4) DEFAULT NULL,
  `LS_COLORS` varchar(879) DEFAULT NULL,
  `XNLSPATH` varchar(22) DEFAULT NULL,
  `HOSTTYPE` varchar(6) DEFAULT NULL,
  `CONFIG_SITE` varchar(40) DEFAULT NULL,
  `FROM_HEADER` varchar(20) DEFAULT NULL,
  `PAGER` varchar(4) DEFAULT NULL,
  `CSHEDIT` varchar(5) DEFAULT NULL,
  `XDG_CONFIG_DIRS` varchar(8) DEFAULT NULL,
  `LIBGL_DEBUG` varchar(5) DEFAULT NULL,
  `MINICOM` varchar(5) DEFAULT NULL,
  `MAIL` varchar(14) DEFAULT NULL,
  `CPU` varchar(6) DEFAULT NULL,
  `INPUTRC` varchar(10) DEFAULT NULL,
  `PWD` varchar(24) DEFAULT NULL,
  `LANG` varchar(5) DEFAULT NULL,
  `PYTHONSTARTUP` varchar(16) DEFAULT NULL,
  `GPG_TTY` varchar(10) DEFAULT NULL,
  `SHLVL` varchar(1) DEFAULT NULL,
  `HTTP_USER_AGENT` bigint(20) DEFAULT NULL,
  `SSH_CONNECTION` varchar(34) DEFAULT NULL,
  `LESSCLOSE` varchar(18) DEFAULT NULL,
  `WINDOWMANAGER` varchar(5) DEFAULT NULL,
  `REDIRECT_spider` varchar(300) DEFAULT NULL,
  `spider` varchar(300) DEFAULT NULL,
  `CONTENT_TYPE` varchar(839) DEFAULT NULL,
  `HTTP_REFERER` varchar(109) DEFAULT NULL,
  `HTTP_ACCEPT_CHARSET` varchar(75) DEFAULT NULL,
  `HTTP_DNT` varchar(10) DEFAULT NULL,
  `HTTP_FROM` varchar(300) DEFAULT NULL,
  `REDIRECT_HTTPS` varchar(10) DEFAULT NULL,
  `REDIRECT_SSL_TLS_SNI` varchar(30) DEFAULT NULL,
  `HTTPS` varchar(10) DEFAULT NULL,
  `SSL_TLS_SNI` varchar(33) DEFAULT NULL,
  `CONTENT_LENGTH` varchar(10) DEFAULT NULL,
  `HTTP_KEEP_ALIVE` varchar(10) DEFAULT NULL,
  `REDIRECT_QUERY_STRING` varchar(660) DEFAULT NULL,
  `SERVER_ADMIN` enum('comments@mail','','comments@ecybernard.com') DEFAULT NULL,
  `HTTP_DEPTH` varchar(1) DEFAULT NULL,
  `HTTP_TRANSLATE` varchar(1) DEFAULT NULL,
  `HTTP_IF_MODIFIED_SINCE` varchar(29) DEFAULT NULL,
  `HTTP_REFER` varchar(61) DEFAULT NULL,
  `LESS` varchar(8) DEFAULT NULL,
  `LESS_ADVANCED_PREPROCESSOR` varchar(2) DEFAULT NULL,
  `COLORTERM` varchar(1) DEFAULT NULL,
  `MACHTYPE` varchar(17) DEFAULT NULL,
  `QT_SYSTEM_DIR` varchar(23) DEFAULT NULL,
  `OSTYPE` varchar(5) DEFAULT NULL,
  `HOME` varchar(5) DEFAULT NULL,
  `HTTP_TE` varchar(18) DEFAULT NULL,
  `LC_CTYPE` varchar(11) DEFAULT NULL,
  `HTTP_MIME_VERSION` varchar(3) DEFAULT NULL,
  `HTTP_X_VERMEER_CONTENT_TYPE` varchar(32) DEFAULT NULL,
  `XDG_DATA_DIRS` varchar(10) DEFAULT NULL,
  `LS_OPTIONS` varchar(22) DEFAULT NULL,
  `XCURSOR_THEME` varchar(3) DEFAULT NULL,
  `LOGNAME` varchar(4) DEFAULT NULL,
  `DBUS_SESSION_BUS_ADDRESS` varchar(25) DEFAULT NULL,
  `XDG_RUNTIME_DIR` varchar(11) DEFAULT NULL,
  `G_BROKEN_FILENAMES` varchar(1) DEFAULT NULL,
  `CVS_RSH` varchar(3) DEFAULT NULL,
  `LESSOPEN` varchar(14) DEFAULT NULL,
  `_` varchar(12) DEFAULT NULL,
  `HTTP_PROXY_CONNECTION` varchar(10) DEFAULT NULL,
  `HTTP_X_FB_BACKGROUND_STATE` varchar(1) DEFAULT NULL,
  `HTTP_X_FB_SIM_HNI` varchar(6) DEFAULT NULL,
  `HTTP_X_FB_NET_HNI` varchar(6) DEFAULT NULL,
  `HTTP_X_FB_HTTP_ENGINE` varchar(5) DEFAULT NULL,
  `REDIRECT_ERROR_NOTES` varchar(136) DEFAULT NULL,
  `HTTP_REFERRER` varchar(34) DEFAULT NULL,
  `HTTP_X_REAL_IP` varchar(12) DEFAULT NULL,
  `HTTP_X_FORWARDED_FOR` varchar(12) DEFAULT NULL,
  `HTTP_TRUE_CLIENT_IP` varchar(12) DEFAULT NULL,
  `HTTP_EXPECT` varchar(12) DEFAULT NULL,
  `HTTP_CONTENT_LANGUAGE` varchar(5) DEFAULT NULL,
  `HTTP_ORIGIN` varchar(20) DEFAULT NULL,
  KEY `REQUEST_URI` (`REQUEST_URI`),
  CONSTRAINT `www_error_ibfk_1` FOREIGN KEY (`REQUEST_URI`) REFERENCES `request` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Secondary columns with only 193 unique values across 26 columns and 4000 and counting rows.

 HTTP_ACCEPT                    
 HTTP_ACCEPT_LANGUAGE           
 HTTP_ACCEPT_ENCODING           
 HTTP_COOKIE                    
 HTTP_UPGRADE_INSECURE_REQUESTS 
 HTTP_CACHE_CONTROL             
 SERVER_SIGNATURE               
 QUERY_STRING                   
 HTTP_PRAGMA                    
 REDIRECT_spider                
 spider                         
 CONTENT_TYPE                   
 HTTP_REFERER                   
 HTTP_ACCEPT_CHARSET            
 HTTP_DNT                       
 HTTP_FROM                      
 REDIRECT_HTTPS                 
 REDIRECT_SSL_TLS_SNI           
 HTTPS                          
 SSL_TLS_SNI                    
 CONTENT_LENGTH                 
 HTTP_KEEP_ALIVE                
 REDIRECT_QUERY_STRING          
 HTTP_DEPTH                     
 HTTP_TRANSLATE                 
 HTTP_REFER                     
 HTTP_PROXY_CONNECTION   

Best Answer

Which of those fields do you need to filter (WHERE) on? Which ones do you need to GROUP BY or ORDER BY? Make those real columns. Put the rest into a JSON string. Then, depending on which version of MySQL/MariaDB you are using, put that into a single TEXT or JSON column.

As a first cut, you could make the 28 always-present fields into real columns. But I recommend you cut back from that.

JSON is open-ended; you can add more as needed. You can leave out missing fields.

No analysis needed, no NULLs needed. No ALTER TABLE .. ADD COLUMN .. needed when Apache is upgraded.

See also http://mysql.rjweb.org/doc.php/eav .