For PostgreSQL, CPU power can be very relevant, especially if a fairly high percentage of the active working set of your data fits in RAM. Most of the databases I've worked with have had CPU power as the main bottleneck most of the time. (I just checked vmstat on a server hosting web sites with millions of hits per day and over 5TB of database space; I never saw more than 2% disk wait time, but saw a peak of 12% user CPU time.)
Since PostgreSQL is process-based, any single process can only run as fast as one core, but with a mix like the one on the server mentioned above, with a high volume of small requests, total CPU across all cores matters most. For the same total CPU power, PostgreSQL will generally do better with fewer, faster cores than with many slower cores.
Up to the point where a high percentage of your active data set is cached, adding RAM will typically show more bang for the buck than adding cores. After you've got sufficient caching, the benefit of additional RAM goes down and you're better off boosting CPU power.
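One rough way to gauge how well the active set is cached at the PostgreSQL level (shared buffers only; this counter cannot see hits in the OS page cache) is the hit ratio in pg_stat_database, for example:
select datname,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 3) as hit_ratio
from pg_stat_database
where blks_hit + blks_read > 0;
-- values close to 1.0 suggest the working set is largely served from shared_buffers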
For more details on this topic as it pertains to PostgreSQL, I don't think there is a better source than PostgreSQL 9.0 High Performance by Greg Smith. (Full disclosure, I was a technical reviewer for the book, but get no financial benefit based on sales.)
You may be able to achieve better performance by searching first in rows with higher frequencies. One way to do this is to 'granulate' the frequencies and then step through them procedurally, for example as follows:
testbed and lexikon dummy data:
begin;
set role dba;
create role stack;
grant stack to dba;
create schema authorization stack;
set role stack;
--
create table lexikon( _id serial,
word text,
frequency integer,
lset integer,
width_granule integer);
--
insert into lexikon(word, frequency, lset)
select word, (1000000/row_number() over(order by random()))::integer as frequency, lset
from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z;
--
update lexikon set width_granule=ln(frequency)::integer;
--
create index on lexikon(width_granule, lset);
create index on lexikon(lset);
-- the second index is not used with the function but is added to make the timings 'fair'
granule analysis (mostly for information and tuning):
create table granule as
select width_granule, count(*) as freq,
min(frequency) as granule_start, max(frequency) as granule_end
from lexikon group by width_granule;
--
select * from granule order by 1;
/*
width_granule | freq | granule_start | granule_end
---------------+--------+---------------+-------------
0 | 500000 | 1 | 1
1 | 300000 | 2 | 4
2 | 123077 | 5 | 12
3 | 47512 | 13 | 33
4 | 18422 | 34 | 90
5 | 6908 | 91 | 244
6 | 2580 | 245 | 665
7 | 949 | 666 | 1808
8 | 349 | 1811 | 4901
9 | 129 | 4926 | 13333
10 | 47 | 13513 | 35714
11 | 17 | 37037 | 90909
12 | 7 | 100000 | 250000
13 | 2 | 333333 | 500000
14 | 1 | 1000000 | 1000000
*/
alter table granule drop column freq;
--
function for scanning high frequencies first:
create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
  returns setof lexikon language plpgsql set search_path to 'stack' as $$
declare
  m integer;
  n integer := 0;
  r record;
begin
  -- walk the granules from the highest frequencies down to the lowest
  for r in (select width_granule from granule order by width_granule desc) loop
    -- return every row in this granule that falls inside the requested lset range
    return query( select *
                  from lexikon
                  where width_granule=r.width_granule
                    and lset>=p_lset_low and lset<=p_lset_high );
    get diagnostics m = row_count;
    -- stop once at least p_limit rows have been returned
    n = n+m;
    exit when n>=p_limit;
  end loop;
end;$$;
results (timings should be taken with a pinch of salt, but each query is run twice to counter any caching effects):
first using the function we've written:
\timing on
--
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 80.452 ms
*/
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 0.510 ms
*/
and then with a simple index scan:
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 218.897 ms
*/
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 51.250 ms
*/
\timing off
--
rollback;
Depending on your real-world data, you will probably want to vary the number of granules and the function used to assign rows to them. The actual distribution of frequencies is key here, as are the expected values of the limit clause and the size of the lset ranges sought.
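For instance, if you wanted fewer granules (and therefore fewer loop iterations in the function), one option is a wider logarithm base when assigning width_granule; the base of 4 below is purely illustrative:
-- coarser granulation: roughly one granule per factor-of-4 step in frequency
update lexikon set width_granule = (ln(frequency)/ln(4))::integer;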
Best Answer
This is a classic case for partitioning.
When you insert a new record, have a trigger delete the old record from the user_login_session_latest sub-table and insert it into the user_login_session_history sub-table. Have both of them inherit from an empty top-level user_login_session table so you can still query the two transparently. You will need an additional trigger-maintained column like is_latest to partition on, since you can't do constraint exclusion on a condition that refers to other rows.

A partial index will work, but not as well as you might hope; it's hard to write a partial index based on the concept of "latest" (the application has to set a flag), and it might not boost things as much as you expect. The partial index points at a page, but if that page contains ten old records and one new record you might not gain much; you'd need to CLUSTER the table regularly to see much benefit. Partitioning is a much better option.

Alternatively, you can maintain a materialized view of the "latest" users in another table using triggers or application logic, but I'd generally prefer partitioning for this.
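A minimal sketch of the trigger-plus-inheritance layout described above (the column list and the routing logic are illustrative assumptions, not a known schema):
create table user_login_session (
    user_id   integer     not null,
    login_at  timestamptz not null,
    is_latest boolean     not null default true
);
-- the check constraints are what allow constraint exclusion on is_latest
create table user_login_session_latest
    (check (is_latest)) inherits (user_login_session);
create table user_login_session_history
    (check (not is_latest)) inherits (user_login_session);
--
create function user_login_session_route() returns trigger
language plpgsql as $$
begin
    -- demote any existing "latest" row for this user to the history sub-table
    with moved as (
        delete from user_login_session_latest
        where user_id = new.user_id
        returning user_id, login_at
    )
    insert into user_login_session_history (user_id, login_at, is_latest)
    select user_id, login_at, false from moved;
    -- store the incoming row in the "latest" sub-table
    insert into user_login_session_latest (user_id, login_at, is_latest)
    values (new.user_id, new.login_at, true);
    return null;  -- the parent table itself stays empty
end;
$$;
--
create trigger user_login_session_insert
    before insert on user_login_session
    for each row execute procedure user_login_session_route();
Queries against user_login_session then see rows from both sub-tables, and a predicate like "where is_latest" lets the planner skip the history sub-table via constraint exclusion.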
Remember to tune autovacuum to vacuum your tables quite frequently if there's lots of updating/deleting going on.
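If that churn is concentrated in particular tables, per-table storage parameters are one way to make autovacuum more aggressive there; the thresholds below are only illustrative:
-- vacuum after roughly 5% of the table has changed (default 20%),
-- and analyze after roughly 2% (default 10%)
alter table user_login_session_latest set (
    autovacuum_vacuum_scale_factor  = 0.05,
    autovacuum_analyze_scale_factor = 0.02
);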