PostgreSQL – Schema design for user activity logging

Tags: database-design, postgresql

I'm designing a schema to support logging of user activity, where users must be able to search:

  • Across all events (events of any type), filtered by datetime range and by username;
  • Across events of a single type, with the same filters as above plus the parameters specific to that event type.

I created this schema:

[schema diagram]

And designed queries to search:

  • across all events:

    SELECT extract(epoch from log_time) * 1000,
      u.username,
      CASE WHEN l.event_type IN (0, 2) THEN e.template
        WHEN l.event_type = 1 THEN format(e.template, ecs.field)
        WHEN l.event_type = 3 THEN format(e.template, ess.field)
        WHEN l.event_type = 4 THEN format(e.template, ect.ip, ect.port)
      END
    FROM logs l
    JOIN users u ON u.id = l.user_id
    JOIN event_def e ON e.id = l.event_type
    LEFT JOIN event_client_search ecs ON ecs.log_id = l.id
    LEFT JOIN event_switch_search ess ON ess.log_id = l.id
    LEFT JOIN event_cable_test ect ON ect.log_id = l.id
    
  • across same-type events:

    SELECT extract(epoch from log_time) * 1000,
        u.username,
        format(e.template, ect.ip, ect.port)
    FROM logs l
    JOIN users u ON u.id = l.user_id
    JOIN event_def e ON e.id = l.event_type
    JOIN event_cable_test ect ON ect.log_id = l.id
    

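The requirements above also call for filtering by username and datetime range; a sketch of how those filters might be added to the "all events" query (the `$1`..`$3` placeholders are illustrative, and `log_time` is assumed to be a column of `logs`):

```sql
-- Sketch: the "all events" query restricted to one user and a datetime range.
SELECT extract(epoch from log_time) * 1000,
       u.username,
       CASE WHEN l.event_type IN (0, 2) THEN e.template
            WHEN l.event_type = 1 THEN format(e.template, ecs.field)
            WHEN l.event_type = 3 THEN format(e.template, ess.field)
            WHEN l.event_type = 4 THEN format(e.template, ect.ip, ect.port)
       END AS message
FROM logs l
JOIN users u     ON u.id = l.user_id
JOIN event_def e ON e.id = l.event_type
LEFT JOIN event_client_search ecs ON ecs.log_id = l.id
LEFT JOIN event_switch_search ess ON ess.log_id = l.id
LEFT JOIN event_cable_test    ect ON ect.log_id = l.id
WHERE u.username = $1
  AND l.log_time BETWEEN $2 AND $3;
```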
This is what the event_def table looks like:

 id |                template
----+----------------------------------------
  0 | Logged in
  2 | Logged out
  1 | Searched in clients %s
  3 | Searched in switches %s
  4 | Cable test switch %s port %s

What I don't like is that once I have about 100 event types (I have about 50 now, but haven't implemented them all yet), this is going to become a performance issue.

So I thought about merging events that share the same parameters, like this:

[revised schema diagram]

I don't think that will help much (it will cut the number of event tables by half at most). Maybe I'm thinking in the wrong direction?

Best Answer

One of the assumptions I would ask you to reconsider is whether you really want a separate table for each event type, or whether these event types should instead be rows in a single table.

If you had an EVENT_TYPE table containing your list of event types (about 100 rows), you could take the columns from your various event tables and turn them into rows in an EVENT_PARAMETER table.
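A sketch of what those two lookup tables might look like (table and column names are my assumptions, not prescriptive):

```sql
-- Event types become rows, not tables.
CREATE TABLE event_type (
    id       integer PRIMARY KEY,
    template text NOT NULL           -- e.g. 'Cable test switch %s port %s'
);

-- Each column of a former per-event table becomes a row here.
CREATE TABLE event_parameter (
    id            integer PRIMARY KEY,
    event_type_id integer NOT NULL REFERENCES event_type(id),
    name          text    NOT NULL,  -- e.g. 'ip', 'port', 'field'
    position      integer NOT NULL   -- order of the placeholder in the template
);
```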

This type of design would turn your schema changes into data changes when you add new event types. That would save you code maintenance issues and also avoid making your queries more complex and potentially slower.

The instances of events would similarly be kept in two tables, for example a LOG table and a LOG_DETAIL table.

LOG would be an intersection between user and event type.

LOG_DETAIL would be an intersection between log and event parameter.
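In DDL terms, the two instance tables described here might look like this (a sketch; names, types, and defaults are illustrative assumptions):

```sql
-- One row per event occurrence: which user, which event type, when.
CREATE TABLE log (
    id            bigserial   PRIMARY KEY,
    user_id       integer     NOT NULL REFERENCES users(id),
    event_type_id integer     NOT NULL,  -- FK to the event-type lookup table
    log_time      timestamptz NOT NULL DEFAULT now()
);

-- One row per parameter value of an event occurrence.
CREATE TABLE log_detail (
    log_id             bigint  NOT NULL REFERENCES log(id),
    event_parameter_id integer NOT NULL,  -- FK to the event-parameter table
    value              text    NOT NULL,  -- parameter value stored as a string
    PRIMARY KEY (log_id, event_parameter_id)
);
```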

Your data model might look something like this: [ERD diagram]

Note that one objection some people may have to this approach is that you end up storing the log parameter values as strings rather than in a native format. This is clearly a trade-off; you have to ask yourself whether it is a good one for your situation.
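To illustrate that trade-off: filtering on a parameter means casting its text value back to a native type in the query. A sketch, assuming tables named `log`, `log_detail`, and `event_parameter` as described above, with the parameter value in a text column named `value` (my assumption):

```sql
-- Find cable tests against switches in a given subnet.
-- The value column is text, so it must be cast to inet to use
-- PostgreSQL's native network operators; a plain b-tree index on
-- value would not help this predicate.
SELECT l.id, l.log_time, d.value AS switch_ip
FROM log l
JOIN log_detail d      ON d.log_id = l.id
JOIN event_parameter p ON p.id = d.event_parameter_id
WHERE p.name = 'ip'
  AND d.value::inet << '10.0.0.0/8';
```

An expression index such as `CREATE INDEX ON log_detail ((value::inet))` could recover some of the lost index support, at the cost of failing if any row's text is not a valid address.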