MySQL – Should I redesign the “events” table?

database-design, MySQL, optimization

My system needs to store an append-only log of events. Currently I have a single database table that stores all the relevant data:

CREATE TABLE `events` (
        `event_id` VARCHAR(255) NOT NULL PRIMARY KEY,
        `event_type` VARCHAR(255) NOT NULL,
        `event_timestamp` DATETIME,
        `group_id` VARCHAR(255),
        `person_id` VARCHAR(255),
        `client_id` VARCHAR(255),
        `name` VARCHAR(768),
        `result` VARCHAR(255),
        `status` VARCHAR(255),
        `logged_at` DATETIME,
        `severity` VARCHAR(255),
        `message` LONGTEXT,
        INDEX `event_type_index` (`event_type`),
        INDEX `event_timestamp_index` (`event_timestamp`),
        INDEX `group_id_index` (`group_id`),
        INDEX `person_id_index` (`person_id`),
        INDEX `client_id_index` (`client_id`),
        INDEX `name_index` (`name`),
        INDEX `result_index` (`result`),
        INDEX `status_index` (`status`),
        INDEX `logged_at_index` (`logged_at`)
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8mb4_general_ci

But I've noticed that queries with multiple attributes in the WHERE clause are still slow. For example:

SELECT
  count(e.event_id) as total
FROM events e
WHERE
  e.result='Success' AND
  e.event_type='some_silly_event' AND
  e.event_timestamp > '2019-01-01 00:00:00'

One solution would be to create an index like the following:

CREATE INDEX successful_silly_events
ON events (result, event_type, event_timestamp);

The downsides of this approach seem to be that creating the index would take a long time, and that it would only speed up this one query. If I write a different query against this table using different columns, I'm back to square one.

Would I have been better served by splitting the events table into multiple tables from the start? For example:

CREATE TABLE `events` (
        `event_id` VARCHAR(255) NOT NULL,
        `logged_at` DATETIME,
        `severity` VARCHAR(255),
        `message` LONGTEXT,
        PRIMARY KEY (event_id),
        INDEX `logged_at_index` (`logged_at`)
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8mb4_general_ci

CREATE TABLE `event_types` (
        `event_id` VARCHAR(255) NOT NULL,
        `event_type` VARCHAR(255) NOT NULL,
        PRIMARY KEY (`event_id`),
        FOREIGN KEY (`event_id`) REFERENCES `events` (`event_id`),
        INDEX `event_type_index` (`event_type`)
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8mb4_general_ci

CREATE TABLE `event_timestamps` (
        `event_id` VARCHAR(255) NOT NULL,
        `event_timestamp` DATETIME NOT NULL,
        PRIMARY KEY (`event_id`),
        FOREIGN KEY (`event_id`) REFERENCES `events` (`event_id`),
        INDEX `event_timestamp_index` (`event_timestamp`)
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8mb4_general_ci

CREATE TABLE `event_groups` (
        `event_id` VARCHAR(255) NOT NULL,
        `group_id` VARCHAR(255) NOT NULL,
        PRIMARY KEY (`event_id`),
        FOREIGN KEY (`event_id`) REFERENCES `events` (`event_id`),
        INDEX `group_id_index` (`group_id`)
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8mb4_general_ci

And so on for all the other event attributes which I would have normally indexed on the events table. This way, I could construct a similar query:

SELECT
  count(e.event_id) as total
FROM events e
  LEFT JOIN event_results er ON e.event_id=er.event_id
  LEFT JOIN event_types ety ON e.event_id=ety.event_id
  LEFT JOIN event_timestamps eti ON e.event_id=eti.event_id
WHERE
  er.result='Success' AND
  ety.event_type='some_silly_event' AND
  eti.event_timestamp > '2019-01-01 00:00:00'

Would the resulting query be fast and avoid a full table scan? If so, this seems like a better setup.

Best Answer

INDEX(result, event_type, event_timestamp) obviates the need for INDEX(result) and INDEX(result, event_type).
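For example, once that composite index exists, the single-column index it subsumes can be dropped (assuming, of course, that no other query relies on result_index by itself):

ALTER TABLE events DROP INDEX result_index;  -- result is now the leftmost column of successful_silly_events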

Blindly using (255) hurts indexes and queries. Trim back to realistic limits.
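A sketch of what that trimming might look like; the specific limits below are assumptions, so check them against your real data first (e.g. SELECT MAX(CHAR_LENGTH(event_type)) FROM events):

ALTER TABLE events
  MODIFY `event_type` VARCHAR(40) NOT NULL,  -- assumed limits; verify before applying
  MODIFY `result` VARCHAR(20),
  MODIFY `status` VARCHAR(20),
  MODIFY `severity` VARCHAR(10);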

Splitting up the table the way you suggested helps nothing and hurts most queries. In particular, you would no longer be able to use the multiple indexes effectively, nor could you use a 'composite' index (since it would have to span more than one table). On the other hand, if your 'log' gets to be too huge, such 'normalization' will drastically shrink the disk footprint, which itself has some positive impact on performance.

Do not normalize 'continuous' values such as timestamp. This severely hurts performance because it makes it impractical to have an index on a "range" value.

Don't use LEFT unless you need the 'right' table to be optional. LEFT sometimes implies that the 'left' table must be scanned first. In your example, that would lead to a full scan of events.

If you change LEFT JOIN to JOIN (in the last example), then the Optimizer can pick which table to start with. This would be equivalent to (but slower than) the original single-table case with single-column indexes on the relevant columns.
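Concretely, the inner-join version of the question's query would be:

SELECT
  count(e.event_id) as total
FROM events e
  JOIN event_results er ON e.event_id=er.event_id
  JOIN event_types ety ON e.event_id=ety.event_id
  JOIN event_timestamps eti ON e.event_id=eti.event_id
WHERE
  er.result='Success' AND
  ety.event_type='some_silly_event' AND
  eti.event_timestamp > '2019-01-01 00:00:00'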

"Low cardinality" columns (status and result) are virtually useless to index by themselves. They can be effective when the first column of a 'composite' index.

Most tables have a limited number of queries that will realistically be applied. If you are saying that your table begs for lots of different queries, then my advice is as follows:

  • Monitor what queries people want. Keep track of the typical combinations of columns.
  • Implement a few 2- and 3-column indexes.
  • Be sure to have the first column(s) in the index be column(s) that are tested (in WHERE) with =. If there is also a "range" (as in your example with timestamp), then put it last. (Punt on multiple range tests; see the sketch after this list.)
  • Remember that the order of ANDing things in WHERE does not matter, but the order of columns in an INDEX does.
  • More on creating optimal indexes: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
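As a minimal sketch of that ordering rule, suppose monitoring turned up a second common query shape (hypothetical here) that tests group_id and status with = and ranges over the timestamp:

-- Hypothetical query shape:
--   WHERE group_id = ? AND status = ? AND event_timestamp > ?
-- '=' columns first, the range column last:
CREATE INDEX group_status_ts ON events (group_id, status, event_timestamp);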

MySQL does not implement "bitmap" indexes; they are rarely worth the effort anyway. MySQL does implement "index merge intersect" (for ANDs), which is a clumsy way to simulate a composite index, and "index merge union" (for ORs), which is sometimes handy; but an explicit UNION is likely to be just as good.
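For instance, an OR across two differently indexed columns can be rewritten as a UNION whose branches each use their own index; since event_id is the primary key, the UNION's deduplication preserves the count. A sketch with hypothetical values, not a guaranteed win; compare EXPLAIN output on both forms:

-- Original shape: WHERE status='Failed' OR severity='ERROR'
SELECT count(*) as total
FROM (
  SELECT event_id FROM events WHERE status='Failed'
  UNION
  SELECT event_id FROM events WHERE severity='ERROR'
) u;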

Yours seems like a "Data Warehouse" application. The best speedup for such is to build and maintain Summary tables. For your one example, a summary of daily counts broken down by result and event_type would be much smaller and much faster to query. (10x speedup is quite possible.) Furthermore, it is practical to have different indexes on the Summary table, thereby somewhat breaking the log jam you currently have. (You would SUM the subtotals to get the total COUNT.)
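A minimal sketch of such a summary table, using daily granularity and the column names from the question; the table name, refresh schedule, and NULL handling are assumptions:

CREATE TABLE `events_daily_summary` (
        `dy` DATE NOT NULL,
        `event_type` VARCHAR(255) NOT NULL,
        `result` VARCHAR(255) NOT NULL,
        `cnt` INT UNSIGNED NOT NULL,
        PRIMARY KEY (`dy`, `event_type`, `result`)
      ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8mb4 COLLATE=utf8mb4_general_ci;

-- Refresh periodically, e.g. once per day for yesterday's rows:
INSERT INTO events_daily_summary (dy, event_type, result, cnt)
SELECT DATE(event_timestamp), event_type, COALESCE(result, ''), COUNT(*)  -- PK columns cannot be NULL
FROM events
WHERE event_timestamp >= CURRENT_DATE - INTERVAL 1 DAY
  AND event_timestamp < CURRENT_DATE
GROUP BY DATE(event_timestamp), event_type, COALESCE(result, '');

-- The original count becomes a SUM over much smaller subtotals:
SELECT SUM(cnt) as total
FROM events_daily_summary
WHERE result='Success'
  AND event_type='some_silly_event'
  AND dy >= '2019-01-01';  -- daily granularity approximates the timestamp cutoff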