Definitely: in a Relational Database Management System, you SHOULD use relationships between your entities.
A simple example of a query that will get you stuck with the first option:

Select all users and their role names, for roles 1 or 11

You would have to perform character-by-character searches and comparisons, which are much less efficient than relational integer searches and comparisons, because the latter are indexed and the algorithms that perform them are far more sophisticated than a plain string compare.
Unless you are really constrained (and even then, I would say...), use table relations, referential integrity constraints, indexes, and so on; they will save you time!
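Here is a minimal sketch of what that looks like, assuming hypothetical `users` and `roles` tables (names are illustrative, not from the question):

```sql
-- Roles live in their own table; users reference them by integer id
CREATE TABLE roles (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(50) NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE users (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name    VARCHAR(100) NOT NULL,
    role_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (id),
    FOREIGN KEY (role_id) REFERENCES roles(id)
) ENGINE=InnoDB;

-- "All users and role names, for roles 1 or 11" becomes an indexed join,
-- instead of error-prone string matching on a comma-separated column
-- (where LIKE '%1%' would wrongly match role 11 as well):
SELECT u.name, r.name AS role_name
    FROM users AS u
    JOIN  roles AS r ON r.id = u.role_id
    WHERE u.role_id IN (1, 11);
```

If a user can have several roles, the same idea extends with a `user_roles` junction table.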
"5-6 distinct values" -- TINYINT UNSIGNED
or ENUM
.
I lean toward ENUM
if the options are unlikely to change over time. An ENUM NOT NULL
takes 1 byte (for up to 255 options). Consider making the first enum option be "unspecified" (or whatever) instead of making NULLable
. Note: I don't think you can have commas in a enum option; suggest you replace with some other char.
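A quick sketch of that column definition (table, column, and option names are made up for illustration):

```sql
-- ENUM NOT NULL: 1 byte per row for up to 255 options;
-- the first option doubles as the 'unspecified' default instead of NULL
CREATE TABLE example (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
    status ENUM('unspecified', 'a', 'b', 'c', 'd', 'e')
               NOT NULL DEFAULT 'unspecified',
    PRIMARY KEY (id)
) ENGINE=InnoDB;
```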
Consider dynamic normalization. That is, when a string (such as "x-gzip, gzip, deflate") comes in, look it up in a table and find (or create) an id; then use that id as a `TINYINT UNSIGNED` or `SMALLINT UNSIGNED` in the main table. (If you choose this approach, we should discuss IODKU and "burning" of ids; see the sketch below.)
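A minimal sketch of that lookup pattern, with a hypothetical `encodings` table (all names are illustrative). The IODKU statement either creates a new id or makes `LAST_INSERT_ID()` return the existing one; note that a duplicate hit still "burns" an auto_increment value:

```sql
-- Lookup table that normalizes strings such as 'x-gzip, gzip, deflate'
CREATE TABLE encodings (
    id  SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    val VARCHAR(300) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY (val)
) ENGINE=InnoDB;

-- IODKU: INSERT ... ON DUPLICATE KEY UPDATE.
-- New string -> a fresh id; known string -> LAST_INSERT_ID() is set to its id.
INSERT INTO encodings (val)
    VALUES ('x-gzip, gzip, deflate')
    ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);

SELECT LAST_INSERT_ID();  -- the SMALLINT id to store in the main table
```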
Sparse data -- There are multiple options.
- NULLable columns: 1 bit per column when `NULL`.
- `JSON` -- this is open-ended, and rather compact if most of the possible columns are missing (and you don't mention them in the JSON). I also like to compress the JSON (in the client) and put it into a `BLOB` -- this gives even more shrinkage (typically 3x). Do not plan on using anything from within the JSON in a `WHERE` clause.
- Separate table(s). Let me fabricate an example: Suppose some small percentage of rows have an image associated with them, and there are 10 'columns' associated with each image. Make a table of images with 11 columns, including an id back to the main table. Then use `LEFT JOIN` when you need the image-related columns (see the sketch after this list).
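A minimal sketch of that fabricated example (all names invented for illustration, and assuming a `main` table with an `id` column):

```sql
-- Side table holding the ~10 image-related 'columns', plus an id back
-- to the main table (11 columns total)
CREATE TABLE images (
    main_id INT UNSIGNED NOT NULL,   -- id of the row in the main table
    width   SMALLINT UNSIGNED,
    height  SMALLINT UNSIGNED,
    -- ... the remaining image-related columns ...
    PRIMARY KEY (main_id)
) ENGINE=InnoDB;

-- LEFT JOIN keeps the rows that have no image; their image columns are NULL
SELECT m.*, i.width, i.height
    FROM main AS m
    LEFT JOIN images AS i ON i.main_id = m.id;
```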
70 columns is a lot. But not necessarily too many.
100K rows is probably less than 1GB, even without my suggestions. Is that "too big"?
If this is a "log", will you be summarizing stuff like COUNT(*)
, SUM(...)
, etc over a day/week? That is another topic to discuss.
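For example, a daily roll-up might look like this (table and column names are assumptions, not from the question):

```sql
-- Hypothetical daily summary of a request log
SELECT  DATE(ts)   AS day,
        COUNT(*)   AS requests,
        SUM(bytes) AS total_bytes
    FROM request_log
    GROUP BY DATE(ts);
```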
Specifics
(Now seeing the column names...)
- Don't normalize "continuous" values, such as `TIME`s.
- Things like `AGENT` work well in a lookup table (the IODKU sketch above shows the pattern). You might want to use `SMALLINT UNSIGNED` (2 bytes; 64K max) instead of `TINYINT` just in case there is a lot of variety. `HOST` might fit here, depending on your traffic.
- But... If a value is usually missing, it may be more efficient to put it into JSON -- bigger, but less frequent, than the always-present 2-byte `SMALLINT`.
- That many `VARCHAR(300)` columns may prevent you from using InnoDB, which you should use. Switching to JSON for many of them will avoid this limit (see the sketch after this list).
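A minimal sketch of that JSON approach (table and column names are my assumptions; if you compress the JSON in the client, make the column a `BLOB` instead):

```sql
-- One open-ended JSON column replaces many mostly-empty VARCHAR(300) columns
ALTER TABLE request_log
    ADD COLUMN extras JSON;

-- Mention only the fields that are actually present for this row
INSERT INTO request_log (ts, agent_id, extras)
    VALUES (NOW(), 12,
            JSON_OBJECT('referer', 'https://example.com/page',
                        'via', '1.1 some-proxy'));
```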
But... Do you really need the data? It seems like you only need a little bit of it for statistics; the rest could simply be written to a flat file, not a database. That way, the stats stay easily accessible, and the raw data is more compact yet still accessible if you decide you missed something. While you are at it, don't save one line per 'request'; summarize the data for each, say, hour. That will save even more space and 'report' time.
Best Answer
I would argue for 1 row per data event (so 1440 rows per day) with one static column per data point. This will make it easiest to query against any of the fields.