Postgresql – Designing a High Score/Leaderboard table

database-designperformancepostgresql

I'm looking for input on how to design a highscore table for a game I'm creating.

The database I'm using is postgres. That's not really something I can or want to change (unless there's very good reason for it).

                                                          version
-----------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.3.4 on x86_64-apple-darwin13.1.0, compiled by Apple LLVM version 5.1 (clang-503.0.38) (based on LLVM 3.4svn), 64-bit

My goal is to allow for up to 10 000 000 rows. I'm not sure if this is realistic.

In my initial attempt, the table is on the simplest of forms (note keys on alias and score).

 Column  |          Type          |                        Modifiers
---------+------------------------+---------------------------------------------------------
id       | integer                | not null default nextval('highscores_id_seq'::regclass)
traceid  | integer                |
alias    | character varying(256) |
score    | integer                |
rows     | integer                |
level    | integer                |
duration | integer                |
Indexes:
    "highscores_pkey" PRIMARY KEY, btree (id)
    "highscores_alias_idx" btree (alias)
    "highscores_score_idx" btree (score)

Most interesting columns for my problem are alias and score (I included them all on the off chance that they have some impact I'm unaware of).

Queries

Specifically I need to perform two queries. One which selects highscores, sorted by score with OFFSET and LIMIT (at most 1000 rows per query). I've written this:

SELECT *, RANK() over(ORDER BY score DESC) AS rank FROM highscores OFFSET 0 LIMIT 100

This works well for the topmost results (which are by far most common) and decreases in performance the larger I make OFFSET. Not ideal, but acceptable (I'm guessing caches are at play).

The second query is more complicated; Given an alias (e.g. a name of a user), it should give the highscores achieved by that user, and also each highscore's position in the global table (ordered by score).

I wrote this:

SELECT * FROM
  (SELECT *, RANK() OVER (ORDER BY score DESC) AS rank FROM highscores) AS tbl
WHERE alias = 'somealias'

It gives the correct result (e.g. rank is set to the global position in the table). It is horribly slow however. It takes up to 30 seconds, which is really not acceptable.

Explaining (ANALYZE, BUFFERS) these queries give the following (respectively)

link do depesz:

Limit  (cost=0.42..303.61 rows=100 width=35) (actual time=0.029..0.490 rows=100 loops=1)
  Buffers: shared hit=110
  ->  WindowAgg  (cost=0.42..606388.37 rows=200002 width=35) (actual time=0.028..0.469 rows=100 loops=1)
        Buffers: shared hit=110
        ->  Index Scan Backward using highscores_score_idx on highscores  (cost=0.42..603388.34 rows=200002 width=35) (actual time=0.015..0.264 rows=100 loops=1)
              Buffers: shared hit=110

and

link to depesz

Subquery Scan on tbl  (cost=67440.85..73440.91 rows=1 width=43) (actual time=960.290..960.290 rows=0 loops=1)
    Filter: ((tbl.alias)::text = 'somealias'::text)
    Rows Removed by Filter: 200002
    Buffers: shared hit=3102 read=33788, temp read=1829 written=1829
    ->  WindowAgg  (cost=67440.85..70940.89 rows=200002 width=35) (actual time=560.130..920.911 rows=200002 loops=1)
          Buffers: shared hit=3102 read=33788, temp read=1829 written=1829
          ->  Sort  (cost=67440.85..67940.86 rows=200002 width=35) (actual time=560.117..652.521 rows=200002 loops=1)
                Sort Key: highscores.score
                Sort Method: external merge  Disk: 8208kB
                Buffers: shared hit=3102 read=33788, temp read=1829 written=1829
                ->  Seq Scan on highscores  (cost=0.00..38890.02 rows=200002 width=35) (actual time=111.099..215.351 rows=200002 loops=1)
                    Buffers: shared hit=3102 read=33788width=35)

(from a table with 200 000 randomly generated entries).

Help

I'm not sure if my table design is optimal. I can change it if it would help with performance in any way. I'm guessing queries could be written differently as well to improve performance. I can't see how though. Any pointers would be greatly appreciated!

Best Answer

Assuming that you are prepared to compromise on the absolute correctness of the answer in order to obtain practical performance, you could do the following:

Create an index on score (assuming descending, but either way is workable)
Create a maintenance job to rebuild this index periodically (how often depends on your willingness to have your rank results be out of date)

I take it that you are OK with these assumptions, since you have noted an index on score in your question.

Then your user's global ranking for their scores would be something like this:

select 
  S.*
, count (R.id from highscores R
         where R.score > S.score) as rank
from highscores S
where S.alias = 'somealias'

Using correlated subqueries is not a recipe for blinding speed, but the advantage of this is that you don't have to rank everything, you just have to count entries in an index relative to a user's particular scores. Of course, only your query optimizer will know for sure.

Related Solutions

Postgresql – How to speed up a Postgres query containing lots of Joins with an ILIKE condition

I see a couple of issues.

The biggest one is that PG is using a sequence scan on A when filtering A. I think you need a composite index on A.flag AND A.strvalue. If there is already an index available, PostgreSQL is choosing not to use it for some reason. This seems to be eating up 92% of your cost estimate and is likely what's making it run for so long.

As for the ILIKE, PostgreSQL cannot natively (but see below for a module that can) use an index as long as your wildcard is the first character. That's simply a restriction on the ILIKE operator. For that reason you are getting a sequence scan which means every single row is being loaded and the C.name column is being scanned for characters. But one thing that's weird is that the ILIKE sequence scan doesn't seem to be eating up much of the cost estimate in this query plan. Anyway, if it is the ILIKE operator causing the slowdown, I would consider rewriting your query so that it somehow looks like this: ILIKE 'value%' or else consider using PostgreSQL's full text search.

UPDATED

The ILIKE operator can use a trigram index. Superb!

PostgreSQL Optimization – How to Optimize Multiple ORDER BYs

A few things you can do:

Use enums or lookups keyed by integer values, or a simple "char" field, instead of varchar sort keys where possible. I'd use an enum because you can control the sort order easily.

The only serious downside with an enum is that you can't currently drop values from an enum type. You can add them (including inserting them in the middle of the sort order) but not remove them. If that's a problem, you'll want to use a lookup table, or just a field declared "char" that has single character codes.

Also, if you don't need proper language collation, specify COLLATE "C" for character fields, e.g.

CREATE INDEX itemdesc_c ON supplier_management (item_description ASC COLLATE "C");

and then:

ORDER BY ...
    itemdesc COLLATE "C",
    ...

Important things to note:

Pg can combine indexes for predicates (WHERE clauses etc) but not sorting. You can't use a bitmap index scan for a sort. So it can use at most one of the candidate indexes, then it has to sort the rows within each group.
Low-selectivity indexes are a waste of time. If the values aren't widely distributed, don't index the column.
Pg's doing an on-disk sort. Throw more memory at the problem - try SET work_mem = '20MB' to start with. But see my comments below re thrashing with high max_connections. Use a connection pool.
Use a connection pool.
Indexes have a cost - they slow down insert/update/delete and increase vacuum work. So if the index isn't being used lots, get rid of it.
pg_catalog.pg_stat_user_indexes will help you tell which indexes are used.
pg_stat_statements (in contrib) and pg_stat_plans (the latter is an external module) are very useful for capturing data about query patterns, slow queries, etc.
Learn to love the auto_explain module.

Also, if you always do this sort, creating a composite index to match it will help.

CREATE INDEX bigindex ON supplier_management (
      item_description DESC,
      item_number DESC,
      order_type ASC,
      possession_date DESC,
      shipment_type DESC,
      store_type DESC
);

... but be aware that it's only useful for this particular sort, and it'll be a big index so it's only worth having if you do this a lot. In fact, you might as well add supplier_management.buyer_purchase_order_id too, so it can do an index-only scan:

CREATE INDEX bigindex ON supplier_management (
      item_description DESC,
      item_number DESC,
      order_type ASC,
      possession_date DESC,
      shipment_type DESC,
      store_type DESC,
      buyer_purchase_order_id
);