Mysql – P2P message system design

database-designMySQL

Database noob here, just curious to see if my design could benefit from any improvements.

My goal was to keep this as simple as possible: a user can create an account, two accounts can message each other.

That's it.

To achieve this, I've use the following schema:

CREATE TABLE account (
    id         INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    email      VARCHAR(255) NOT NULL,
    first_name VARCHAR(255) NOT NULL,
    last_name  VARCHAR(255) NOT NULL,
    --
    created_at TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
    deleted_at TIMESTAMP NULL DEFAULT NULL,
    --
    PRIMARY KEY (id),
    UNIQUE (email)
);

CREATE TABLE message (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    to_id   INT(10) UNSIGNED NOT NULL,
    from_id INT(10) UNSIGNED NOT NULL,
    body    VARCHAR(500) NOT NULL,
    seen    BOOLEAN NOT NULL DEFAULT 0,
    --
    created_at TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
    deleted_at TIMESTAMP NULL DEFAULT NULL,
    --
    PRIMARY KEY (id)
);

Using this schema, I want to be able to run two queries:

  1. A snapshot. I want the message overview, similar to how Facebook does it; your most recent messages in each thread are listed.
  2. A thread. I want to show the thread of messages between two users in a list.

I achieved a paginated list of #1 with the following query:

SELECT message.*, `from`.*, `to`.*
FROM message
LEFT JOIN account AS `to`
    ON `to`.id = message.to_id
LEFT JOIN account AS `from`
    ON `from`.id = message.from_id
WHERE message.id IN(
    SELECT MAX(m.id) 
    FROM message m
    WHERE m.id < 999999999
        AND m.to_id = 1
        AND m.deleted_at IS NULL
    GROUP BY GREATEST(m.to_id, m.from_id), LEAST(m.to_id, m.from_id)
)
ORDER BY message.seen, message.id DESC
LIMIT 100

The query above for user with ID 1 will show me a snapshot of latest messages received from every user.

I achieved a paginated list of #2 with the following query:

SELECT message.*, `from`.*, `to`.*
FROM message
LEFT JOIN account AS `to`
    ON `to`.id = message.to_id
LEFT JOIN account AS `from`
    ON `from`.id = message.from_id
WHERE message.id < 999999999
    AND (message.to_id = 1 AND message.from_id = 2) OR (message.to_id = 2 AND message.from_id = 1)
    AND message.deleted_at IS NULL
LIMIT 100

The query above will give me all messages between user with ID 1 and user with ID 2.

Any thoughts? I'd love to hear of some obvious areas of improvement, and some possible explanation as to why so I can learn from it.

Cheers!

Best Answer

  • deleted and seen get in the way of optimizations
  • GROUP BY GREATEST(m.to_id, m.from_id), LEAST(m.to_id, m.from_id) seems strange. Wouldn't this suffice: GROUP BY m.to_id, m.from_id?
  • Since the user is either from_id or to_id, half the info is already known?
  • What is m.id < 999999999 for?
  • IN ( SELECT ... ) often optimizes poorly.
  • You say "pagination", yet I don't see the inefficient OFFSET not the better "remember where I left off".
  • Use realistic limits, not blindly VARCHAR(255).
  • Each table has 3 timestamps. You will never use some of them.
  • Don't use SELECT *; instead, spell out the columns that you really need.
  • You will need more indexes, but let's wait to see what more of the SELECTs look like.
  • LEFT JOIN account assumes that the desired account might be missing. If so, there a data integrity problem. So switch to JOIN; this may aid in optimizations.
  • LIMIT 100 -- Isn't that too much to put on a UI page?
  • AND (message.to_id = 1 AND message.from_id = 2) OR (message.to_id = 2 AND message.from_id = 1) is hard to optimize. Use UNION of two SELECTs instead of OR, and add (tentatively) INDEX(from_id, to_id). The reason is that the Optimizer has almost no techniques for handling OR; it simply ignores indexes and falls back on a table scan.

This might work:

AND (message.to_id, message.from_id) IN ((1, 2), (2, 1))

Here's the Changelog entry that hints at such:

----- 2013-12-03 5.7.3 Milestone 13 -- Functionality Added or Changed -- -----

The optimizer now is able to apply the range scan access method to queries of this form:

SELECT ... FROM t1 WHERE ( col_1, col_2 ) IN (( 'a', 'b' ), ( 'c', 'd' ));

Previously, for range scans to be used it was necessary for the query to be written as:

SELECT ... FROM t1 WHERE ( col_1 = 'a' AND col_2 = 'b' )
                      OR ( col_1 = 'c' AND col_2 = 'd' );

For the optimizer to use a range scan, queries must satisfy these conditions:

Only IN predicates can be used, not NOT IN.

There may only be column references in the row constructor on the IN predicate's left hand side.

There must be more than one row constructor on the IN predicate's right hand side.

Row constructors on the IN predicate's right hand side must contain only runtime constants, which are either literals or local column references that are bound to constants during execution.

EXPLAIN output for applicable queries will change from full table or index scan to range scan. Changes are also visible by checking the values of the Handler_read_first, Handler_read_key, and Handler_read_next status variables.