Postgresql – Need help designing a database

database-design, postgresql

We are designing a system which holds statistical parameters about a particular user. As users use the system, the statistics change and the updated statistics are stored in the database. So for a new user a record is inserted, and after that the record is only updated. I therefore expect this database to be update-heavy.

Also, when we see a user online, we want to query their record from the database so that we can use the stored data to do some characterization.

The fields of the database which are important are

  1. User identifier, to uniquely identify the user.
  2. Last accessed time, to know when the record was last updated.

If a user is idle for more than 60 days, we want to prune the entry from the database. We currently use Postgres as our server.
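For concreteness, the table might look something like this (table and column names are my assumptions, not from the original description):

```sql
-- Hypothetical schema for the per-user statistics table.
CREATE TABLE user_stats (
    user_id       BIGINT      PRIMARY KEY,   -- uniquely identifies the user
    last_accessed TIMESTAMPTZ NOT NULL,      -- when the record was last updated
    stats         JSONB       NOT NULL       -- the statistical parameters themselves
);
```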

I have been working on an efficient way to prune the database while keeping update and query performance satisfactory.

Current Design plan:

EDIT: Estimated Number of Records: 100 million.

Set up partitioning in Postgres, with each partition corresponding to one day of usage.
The partitioning criterion would be the date of the last accessed time. The primary key within each partition would be UserID.

If a record is accessed repeatedly within the same day, it maps to the same partition; across days, I would have to delete the entry from the old partition and insert it into the new one using an update trigger.

A cron job would run daily, drop the expired partition, and create the new one. Stale entries would simply remain in their partition until the TRUNCATE on the expired partition drops them all at once.
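A sketch of this plan using declarative partitioning (available from PostgreSQL 10; older versions would need inheritance plus triggers instead). Table and column names are assumptions. Note that with declarative partitioning, the partition key must be part of the primary key, so the key becomes (user_id, last_accessed) rather than user_id alone:

```sql
-- Parent table partitioned by day of last access.
CREATE TABLE user_stats (
    user_id       BIGINT      NOT NULL,
    last_accessed TIMESTAMPTZ NOT NULL,
    stats         JSONB,
    PRIMARY KEY (user_id, last_accessed)  -- must include the partition key
) PARTITION BY RANGE (last_accessed);

-- The daily cron job creates tomorrow's partition ...
CREATE TABLE user_stats_2024_01_02 PARTITION OF user_stats
    FOR VALUES FROM ('2024-01-02') TO ('2024-01-03');

-- ... and drops the partition that has aged past 60 days,
-- which removes all of its rows without any per-row deletes.
DROP TABLE user_stats_2023_11_03;
```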

Queries would be done by UserID. If the partition logic is correct, a given UserID exists in only one partition, so the query would be issued across all partitions but would match only one record.

I need the pruning performance to be good. The table's primary key is UserID and it has no index on time; since the last accessed time keeps changing, it can't usefully be made part of any index. There is also no logical clustering field for this table.

Without partitioning, pruning would require a full table scan, and since last accessed time is not part of the key, the scan would eventually pull every page of the table into memory, evicting the frequently used pages our system relies on. I want to avoid a full scan of the table.

As pointed out, moving a record between partitions requires a delete plus an insert into the other table. Space is not an issue, so I could instead add the partition date to the primary key alongside UserID and modify the query to take the latest record. The query would then be done on the primary keys, so the computation stays in the index files, and I could skip the delete and let stale entries sit until the cron job clears the table.
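With the partition date in the composite key and stale copies allowed to linger, the lookup could be sketched like this (table and column names are assumptions):

```sql
-- Fetch only the most recent row for a user, ignoring stale copies
-- that remain in older partitions until the cron job truncates them.
SELECT *
FROM   user_stats
WHERE  user_id = 12345
ORDER  BY last_accessed DESC
LIMIT  1;
```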

Nevertheless, an insert into another partition would still happen whenever a record moves across partitions. A single-table approach would avoid this, but then the delete performance would suffer.

Any help or suggestions on this would be appreciated.

Best Answer

If you have an index on LastLoggedIn, deleting a few tens of thousands of records should be fast without needing to partition. I ran a small test on MS SQL:

CREATE TABLE Temp.Temp
(
    UserID       INT NOT NULL PRIMARY KEY,
    LastLoggedIn DATETIME2(0) NOT NULL INDEX IX_LastLoggedIn,
    Dummy1       VARCHAR(30),
    Dummy2       INT
);

I inserted ~57 M dummy records, then selected a LastLoggedIn date such that 35,000 records were at least that old. Deleting these records took less than a second. This is on a four-core machine with 128 GB of memory, under light load; your mileage may vary.
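The purge itself is a single indexed delete, along these lines (the 60-day cutoff matches the requirement in the question; exact syntax is a sketch):

```sql
-- T-SQL: delete all users idle for more than 60 days.
-- The index on LastLoggedIn lets this touch only the qualifying rows
-- rather than scanning the whole table.
DELETE FROM Temp.Temp
WHERE LastLoggedIn < DATEADD(DAY, -60, SYSUTCDATETIME());
```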

Of course, if you run this purge only every few weeks, you may have millions of records to delete rather than thousands (300 K records took seven seconds for me; 1.3 M records took 30 seconds). But in that scenario, it's part of your monthly maintenance window rather than daily upkeep, and a brief delay should be acceptable.

If you get poor performance with just an index, then sure, consider partitioning. As others have pointed out, there will be a significant cost to UPDATE operations, a greater overhead than maintaining an index. You will have many more updates than purges, and the latter can happen during relatively idle periods, so I would worry about the updates more.

If you need always-on uptime, and the delete locks the table for an unacceptable length of time, that might justify using partitioning, but even there I'd look for alternatives. You could SELECT a list of old users and then delete them one at a time (confirming their LastLoggedIn dates first of course) with a cursor, for example.
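As an alternative to a row-at-a-time cursor, a batched variant (a T-SQL sketch, not something I benchmarked) keeps each statement short so locks are released frequently:

```sql
-- Purge in chunks of 10,000 rows so no single statement holds
-- locks for long; loop until nothing old remains.
DECLARE @cutoff DATETIME2(0) = DATEADD(DAY, -60, SYSUTCDATETIME());

WHILE 1 = 1
BEGIN
    DELETE TOP (10000) FROM Temp.Temp
    WHERE LastLoggedIn < @cutoff;

    IF @@ROWCOUNT = 0 BREAK;
END;
```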

I'm an MS SQL guy, so my apologies if any of this is incorrect for PostgreSQL.