When I say efficient, I mean total storage (KB) without excessively burdening the CPU.
I have 70 columns of data, and the last 35 fields have always been NULL.

- The table has only 4 columns of completely unique data.
- The rest of the fields have only a few distinct values each.
- One field has only GET and POST; there are 4-5 other possible values that never occur.
- The columns are 69 varchar, a bigint, an int, and 2 blob. (I am open to change.)
- The blobs are text, but contain too many different character sets, or random junk from hackers. So rather than figuring that out: binary, and done.
- Most of the other columns have 4-5 distinct values per column.
At what point does a lookup column make sense? That is, another table with (id, value), where only the id is stored in the first table. For example, with GET/POST, by the time you add the overhead of the lookup, does it make sense to do this at all? Should I use an ENUM instead?

The table will eventually hold 100,000+ rows.

How many bytes must the value field have before the lookup actually saves space?
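To get a feel for that break-even point, here is a rough sketch. All of the numbers are illustrative assumptions, and it ignores per-row, index, and page overhead, so treat it as an order-of-magnitude estimate only:

```python
# Back-of-envelope break-even for a lookup table: replacing a repeated
# VARCHAR value with a small integer id saves (data bytes + 1 length byte)
# minus the id bytes per row, at the one-time cost of the lookup table.

def lookup_savings(rows, avg_value_bytes, distinct_values, id_bytes=1):
    """Approximate bytes saved by swapping an inline string for an id."""
    inline = rows * (avg_value_bytes + 1)        # VARCHAR: data + length byte
    normalized = rows * id_bytes                 # id column in the main table
    lookup_table = distinct_values * (avg_value_bytes + 1 + id_bytes)
    return inline - (normalized + lookup_table)

# e.g. 100,000 rows of an Accept-Encoding string averaging ~20 bytes,
# with only ~12 distinct values:
print(lookup_savings(100_000, 20, 12))   # 1999736 bytes, roughly 2 MB

# Even GET/POST (avg ~4 bytes) comes out ahead, though an ENUM gets the
# same 1-byte footprint with no join at all:
print(lookup_savings(100_000, 4, 2))     # 399988 bytes
```

The takeaway: with 100K+ rows, even short repeated strings break even almost immediately; the lookup table's own size is negligible next to the per-row savings.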
What is the best way to handle the 35 blank columns in terms of storage? Create a separate table keyed by a single id, with the 35 columns and one row that is all NULL; the main table then just stores id 1, and if a different arrangement of values ever comes along, it becomes id 2?
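That side-table idea might look like the following sketch. It uses Python's sqlite3 module purely for illustration (MySQL DDL differs slightly), and the table and column names are made up for the example:

```python
import sqlite3

# Sketch of "one id per distinct arrangement of the sparse columns".
# Only 3 of the 35 sparse columns are shown; names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sparse_combo (
    id INTEGER PRIMARY KEY,
    lesskey TEXT, manpath TEXT, hostname TEXT  -- ... the other sparse columns
)""")
con.execute("""CREATE TABLE request (
    id INTEGER PRIMARY KEY,
    request_method TEXT,
    sparse_combo_id INTEGER REFERENCES sparse_combo(id)
)""")

# Arrangement #1: everything NULL -- shared by almost every request row.
con.execute("INSERT INTO sparse_combo (id) VALUES (1)")
con.execute("INSERT INTO request (request_method, sparse_combo_id) VALUES ('GET', 1)")
con.execute("INSERT INTO request (request_method, sparse_combo_id) VALUES ('POST', 1)")

# Join to the side table only when the sparse columns are actually needed.
rows = con.execute("""SELECT r.request_method, s.hostname
                      FROM request r
                      LEFT JOIN sparse_combo s ON s.id = r.sparse_combo_id""").fetchall()
print(rows)   # [('GET', None), ('POST', None)]
```

The main table then carries one small integer instead of 35 mostly-NULL columns, at the cost of a join when those fields are wanted.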
One column is always "Apache"; another is only "keep-alive", "closed", or NULL. The encoding column's values are "gzip, deflate, sdch", "identity, gzip, deflate", "deflate, gzip", "gzip,deflate,br", "gzip,deflate", "gzip, deflate, sdch, identity", "gzip", "identity", NULL, "x-gzip, gzip, deflate", and "deflate".
Answering questions:

How is that table used?
Backend to Apache via PHP. I will be looking for patterns, and marking them with a score indicating bad-to-good probability.

How do you query the database?
Mainly via PHP (prepared statements); I will probably put a web GUI on it.

When you perform queries that use this table, do you need all the values for all the columns?
No, certainly not the 35 blank columns. 9 of the columns always hold the same single value, and I will only care if they ever differ; maybe I will set up a trigger for that, but otherwise ignore those columns. So 70 - 35 - 9 = 26 or fewer frequently queried columns.
The schema (Column, Type; every column is NULLable and the Comment is empty):

```
REDIRECT_UNIQUE_ID              varchar(50)
REDIRECT_REQUEST_METHOD         varchar(20)
REDIRECT_STATUS                 int(11)
UNIQUE_ID                       varchar(50)
HTTP_HOST                       varchar(30)
HTTP_USER_AGENT                 varchar(300)
HTTP_ACCEPT                     varchar(300)
HTTP_ACCEPT_LANGUAGE            varchar(120)
HTTP_ACCEPT_ENCODING            varchar(60)
HTTP_COOKIE                     blob
HTTP_CONNECTION                 varchar(30)
HTTP_UPGRADE_INSECURE_REQUESTS  varchar(10)
HTTP_CACHE_CONTROL              varchar(50)
PATH                            varchar(300)
SERVER_SIGNATURE                varchar(300)
SERVER_SOFTWARE                 varchar(20)
SERVER_NAME                     varchar(50)
SERVER_ADDR                     varchar(30)
SERVER_PORT                     varchar(5)
REMOTE_ADDR                     bigint(40)
DOCUMENT_ROOT                   varchar(20)
REQUEST_SCHEME                  varchar(20)
CONTEXT_PREFIX                  varchar(300)
CONTEXT_DOCUMENT_ROOT           varchar(50)
SERVER_ADMIN                    varchar(40)
SCRIPT_FILENAME                 varchar(110)
REMOTE_PORT                     int(11)
REDIRECT_URL                    blob
GATEWAY_INTERFACE               varchar(15)
SERVER_PROTOCOL                 varchar(170)
REQUEST_METHOD                  varchar(20)
QUERY_STRING                    varchar(123)
REQUEST_URI                     blob
SCRIPT_NAME                     varchar(100)
PHP_SELF                        varchar(75)
REQUEST_TIME_FLOAT              decimal(15,4)
REQUEST_TIME                    int(15) unsigned
HTTP_PRAGMA                     varchar(300)
LESSKEY                         varchar(300)
NNTPSERVER                      varchar(300)
MANPATH                         varchar(300)
XDG_SESSION_ID                  varchar(300)
HOSTNAME                        varchar(300)
XKEYSYMDB                       varchar(300)
HOST                            varchar(300)
TERM                            varchar(300)
SHELL                           varchar(300)
PROFILEREAD                     varchar(300)
HISTSIZE                        varchar(300)
SSH_CLIENT                      varchar(300)
MORE                            varchar(300)
SSH_TTY                         varchar(300)
USER                            varchar(300)
LS_COLORS                       varchar(300)
XNLSPATH                        varchar(300)
HOSTTYPE                        varchar(300)
CONFIG_SITE                     varchar(300)
FROM_HEADER                     varchar(300)
PAGER                           varchar(300)
CSHEDIT                         varchar(300)
XDG_CONFIG_DIRS                 varchar(300)
LIBGL_DEBUG                     varchar(300)
MINICOM                         varchar(300)
MAIL                            varchar(300)
CPU                             varchar(300)
INPUTRC                         varchar(300)
PWD                             varchar(300)
LANG                            varchar(300)
PYTHONSTARTUP                   varchar(300)
GPG_TTY                         varchar(300)
SHLVL                           varchar(300)
HTTP_USER_AGENT_I               bigint(20)
SSH_CONNECTION                  varchar(300)
LESSCLOSE                       varchar(300)
WINDOWMANAGER                   varchar(300)
REDIRECT_spider                 varchar(300)
spider                          varchar(300)
CONTENT_TYPE                    varchar(50)
HTTP_REFERER                    varchar(50)
HTTP_ACCEPT_CHARSET             varchar(75)
HTTP_DNT                        varchar(10)
```
Best Answer
"5-6 distinct values" -- `TINYINT UNSIGNED` or `ENUM`. I lean toward `ENUM` if the options are unlikely to change over time. An `ENUM NOT NULL` takes 1 byte (for up to 255 options). Consider making the first enum option be "unspecified" (or whatever) instead of making the column `NULL`able. Note: I don't think you can have commas in an enum option; suggest you replace them with some other char.

Consider dynamic normalization. That is, when a string (such as "x-gzip, gzip, deflate") comes in, look it up in a table and find (or create) an id; then use that id as a `TINYINT UNSIGNED` or `SMALLINT UNSIGNED` in the main table. (If you choose this approach, we should discuss IODKU and the "burning" of ids.)

Sparse data -- there are multiple options.
- `NULL`able columns: 1 bit per column when the value is `NULL`.
- JSON -- this is open-ended, and rather compact if most of the possible columns are missing (and you don't mention them in the JSON). I also like to compress the JSON (in the client) and put it into a `BLOB` -- this gives even more shrinkage (typically 3x). Do not plan on using anything from within the JSON in a `WHERE` clause.

`LEFT JOIN` to the side table when you need the rarely used columns.

70 columns is a lot, but not necessarily too many.
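The compress-in-the-client JSON idea can be sketched as follows. The field names are taken from the schema above; note that on very small records gzip's header overhead can outweigh the savings, so the ~3x shrinkage is for larger, repetitive payloads:

```python
import gzip
import json

# Store only the non-NULL sparse fields as compressed JSON in a BLOB.
record = {
    "HTTP_ACCEPT_ENCODING": "gzip, deflate, sdch",
    "HTTP_CONNECTION": "keep-alive",
    "SERVER_SOFTWARE": "Apache",
    # ...the 35 usually-NULL columns are simply omitted.
}
raw = json.dumps(record, separators=(",", ":")).encode()
blob = gzip.compress(raw)      # this is what would go into the BLOB column

# Round-trip on the way back out (no WHERE clauses against this data).
restored = json.loads(gzip.decompress(blob))
print(restored == record)      # True
```

Omitting the NULL fields is where most of the compactness comes from; gzip is a bonus on top.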
100K rows is probably less than 1GB, even without my suggestions. Is that "too big"?
If this is a "log", will you be summarizing things like `COUNT(*)`, `SUM(...)`, etc. over a day/week? That is another topic to discuss.

Specifics (now seeing the column names...):
The `TIME`s and `AGENT` work well in a lookup table. You might want to use `SMALLINT UNSIGNED` (2 bytes; 64K max) instead of `TINYINT` just in case there is a lot of variety. `HOST` might fit here too, depending on your traffic. The ports fit in a `SMALLINT`.

So many `VARCHAR(300)` columns may prevent you from using InnoDB, which you should use; switching to JSON for many of them will avoid this limit.

But... do you really need the data? It seems like you only need a little bit of it for statistics; the rest could be simply written to a flat file, not a database. That way, you get the stats easily accessible, and the raw data more compact and still accessible if you decide you missed something. While you are at it, don't save one line per request; summarize the data for each, say, hour. That will save more space and report time.
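The find-or-create lookup used by dynamic normalization (for `AGENT`, encodings, etc.) might look like this. The sketch uses Python's sqlite3 module for illustration; in MySQL you would likely collapse the two statements into one `INSERT ... ON DUPLICATE KEY UPDATE` (the IODKU mentioned above), which is also where id "burning" comes into play:

```python
import sqlite3

# Lookup table: one row per distinct string, keyed by a small integer id.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE encoding_lookup ("
            "id INTEGER PRIMARY KEY, value TEXT UNIQUE)")

def encoding_id(value):
    """Find-or-create the id for an incoming header string."""
    con.execute("INSERT OR IGNORE INTO encoding_lookup (value) VALUES (?)",
                (value,))
    (id_,) = con.execute("SELECT id FROM encoding_lookup WHERE value = ?",
                         (value,)).fetchone()
    return id_

a = encoding_id("x-gzip, gzip, deflate")
b = encoding_id("gzip, deflate, sdch")
print(encoding_id("x-gzip, gzip, deflate") == a)   # True: same string, same id
print(a != b)                                      # True: distinct strings
```

The main table then stores only the returned id, and new values create their own rows automatically as traffic arrives.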