My system needs to store an append only log of events. Currently I have a database table that stores all the relevant data in one table:
CREATE TABLE `events` (
`event_id` VARCHAR(255) NOT NULL PRIMARY KEY,
`event_type` VARCHAR(255) NOT NULL,
`event_timestamp` DATETIME,
`group_id` VARCHAR(255),
`person_id` VARCHAR(255),
`client_id` VARCHAR(255),
`name` VARCHAR(768),
`result` VARCHAR(255),
`status` VARCHAR(255),
`logged_at` DATETIME,
`severity` VARCHAR(255),
`message` LONGTEXT,
INDEX `event_type_index` (`event_type`),
INDEX `event_timestamp_index` (`event_timestamp`),
INDEX `group_id_index` (`group_id`),
INDEX `person_id_index` (`person_id`),
INDEX `client_id_index` (`client_id`),
INDEX `name_index` (`name`),
INDEX `result_index` (`result`),
INDEX `status_index` (`status`),
INDEX `logged_at_index` (`logged_at`),
) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci
But I've noticed that queries with multiple attributes in the WHERE clause are still slow. For example:
SELECT
count(e.event_id) as total
FROM events e
WHERE
e.result='Success' AND
e.event_type='some_silly_event' AND
e.event_timestamp > '2019-01-01 00:00:00'
One solution would be to create an index like the following:
CREATE INDEX successful_silly_events
ON events (result,event_type,event_timestamp);
The downsides of this approach seem to be that creating the index would take a long time, and would only speed up this query. If I create a different query on this table with different columns, I'm back to square one.
Would I have been better served by splitting the events table into multiple tables from the start? For example:
CREATE TABLE `events` (
`event_id` VARCHAR(255) NOT NULL,
`logged_at` DATETIME,
`severity` VARCHAR(255),
`message` LONGTEXT,
PRIMARY KEY (event_id),
INDEX `logged_at_index` (`logged_at`),
) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci
CREATE TABLE `event_types` (
`event_id` VARCHAR(255) NOT NULL,
`event_type` VARCHAR(255) NOT NULL,
PRIMARY KEY event_id REFERENCES events(event_id)
INDEX `event_type_index` (`event_type`),
) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci
CREATE TABLE `event_timestamps` (
`event_id` VARCHAR(255) NOT NULL,
`event_timestamp` VARCHAR(255) NOT NULL,
PRIMARY KEY event_id REFERENCES events(event_id)
INDEX `event_timestamp_index` (`event_timestamp`),
) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci
CREATE TABLE `event_groups` (
`event_id` VARCHAR(255) NOT NULL,
`group_id` VARCHAR(255) NOT NULL,
PRIMARY KEY event_id REFERENCES events(event_id)
INDEX `group_id_index` (`group_id`),
) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8_general_ci
And so on for all the other event attributes which I would have normally indexed on the events table. This way, I could construct a similar query:
SELECT
count(e.event_id) as total
FROM events e
LEFT JOIN event_results er ON e.event_id=er.event_id
LEFT JOIN event_types ety ON e.event_id=et.event_id
LEFT JOIN event_timestamps eti ON e.event_id=et.event_id
WHERE
er.result='Success' AND
ety.event_type='some_silly_event' AND
eti.event_timestamp > '2019-01-01 00:00:00'
Would the resulting query be fast and not require full table scan? If so, this seems like a better setup.
Best Answer
INDEX(result, event_type, event_timestamp)
obviates the need forINDEX(result)
andINDEX(result, event_type)
.Blindly using
(255)
hurts indexes and queries. Trim back to realistic limits.Splitting up the table the way you suggested helps nothing and hurts most queries. In particular, then you would not be able to effectively use the multiple indexes, nor would you be able to use a 'composite' index (since it would involve more than one table). On the other hand, if your 'log' gets to be too huge, then such 'normalization' will drastically shrink the disk footprint. This, itself, has some positive impact on performance.
Do not normalize 'continuous' values such as
timestamp
. This severely hurts performance because it makes it impractical to have an index on a "range" value.Don't use
LEFT
unless you need the 'right' table to be optional.LEFT
sometimes implies that the 'left' table must be scanned first. In your example, that would lead to a full scan ofevents
.If you change
LEFT JOIN
toJOIN
(in the last example), then the Optimizer would pick among the tables to decide which one to start with. This would be equivalent (but slower) than the original case of single-column indexes on the relevant columns."Low cardinality" columns (
status
andresult
) are virtually useless to index by themselves. They can be effective when the first column of a 'composite' index.Most tables have a limited number of queries that will realistically be applied. If you are saying that your table begs for lots of different queries, then my advice is as follows:
WHERE
) with=
. If there is also a "range" (as in your example withtimestamp
), then put it last. (Punt on multiple range tests.)ANDing
things inWHERE
does not matter, but the order of columns in anINDEX
does.MySQL does not implement "bitmap" indexes; they are rarely worth the effort. MySQL does implement "index merge intersect" (for
ANDs
) which is a clumsy way to simulate a composite index. "index merge union" (forORs
) is sometimes handy forOR
; butUNION
is likely to be as good.Yours seems like a "Data Warehouse" application. The best speedup for such is to build and maintain Summary tables. For your one example, a summary of daily counts broken down by
result
andevent_type
would be much smaller and much faster to query. (10x speedup is quite possible.) Furthermore, it is practical to have different indexes on the Summary table, thereby somewhat breaking the log jam you currently have. (You wouldSUM
the subtotals to get the totalCOUNT
.)