Postgresql – Group Database Entries by time difference

group bypostgresqlpostgresql-9.1timestamp

I have a large postgreSQL database with log data. All this data has timestamps and I want to group consecutive rows where the difference between the timestamps is less then 1500 miliseconds for example.

For example:

1349427083272
1349427083669
1349427083707
1349427084277
1349427084787
1349427093471
1349427094031
1349427094307
1349427094980
1349427095879
1349427097211
1349437622947
1349437623813
1349437624316
1349437624815
1349437624938

Should result in the following groups:

1349427083272
1349427083669
1349427083707
1349427084277
1349427084787

1349427093471
1349427094031
1349427094307
1349427094980
1349427095879
1349427097211

1349437622947
1349437623813
1349437624316
1349437624815
1349437624938

The group identifier can just be a unique integer.

In MySQL I found a similar example which used lots of variables, but I have no clue how to to this in postgreSQL.
Could someone help me out?

Best Answer

With the comment from FrustratedWithFormsDesigner, I came to the following solution:

SELECT subq2.*, sum(new_group) OVER (ORDER BY t ASC) AS group_id
FROM (
  SELECT subq.*, CASE WHEN delta > 1500 THEN 1 ELSE 0 END AS new_group
  FROM (
    SELECT t, lag(t) over (ORDER BY t ASC),
      t - lag(t) over (ORDER BY t ASC) AS delta
    FROM time_points
  ) AS subq
) AS subq2

I am using the sum of all new_group values to create different groups. Thank you very much!

Query cleanup

First, lets rewrite your query to be readable, by using table aliases, qualifying field names, and using an ANSI join:

select t.userID, t.date, t.time, t.servID, t.timestamp, 
       l.servID_HEX, l.SERV_LOCY, l.SERV_LOCX 
from test t
     inner join locations l on (t.servID=l.servID_HEX)
where t.userID='<someusers>' 
  and extract(dow from t.timestamp)=2;

What the query needs

To start breaking this problem down you'll want to do is look at the fields used and where they're used:

Tables: locations, test

Fields output: test.userID, test.date, test.time, test.servID, test.timestamp, locations.servID_HEX, locations.SERV_LOCY, locations.SERV_LOCX

Terms used in filter predicates: test.servID, locations.servID_HEX, test.userID, extract(dow from test.timestamp)=2

Indexes

Now, the most important thing is to make sure that you have indexes that target high-selectivity columns used in predicates.

"high selectivity" just means that there aren't tons of the same value - there's no point indexing a column if 50% of the time it's 'a' and 50% of the time it's 'b', because you just don't gain much benefit from the index.

Assuming that all these columns have lots of well distributed distinct values, so they're highly selective and don't have any values that're massively common, you'll want to create an ordinary b-tree index. That's enough for locations because there's just one value used in a predicate:

CREATE INDEX locations_servid_idx ON locations(servID_HEX);

For test it's more complex.

You could create indexes separately for each column, but it's more efficient to create a single composite index that has all the columns you're using in this query's predicates for a given table. So you want servId, userID and timestamp. However, you don't use timestamp directly - you use it to get the day of week, and an index on timestamp can't be used to look up extract(dow from timestamp).

This is where expression indexes come in.

Indexes don't just cover columns. They can also cover arbitrary expressions. In this case, we'll create an index that includes extract(dow from timestamp).

So the index for test will look like:

CREATE INDEX test_custom_idx ON test (
    servID,
    extract(dow from timestamp),
    userID
);

As you'll be doing different kinds of timestamp based queries I'd consider creating multiple indexes for different timestamp expressions.

Which order to put the columns in depends on how selective each column is. Even though I'm sure the day-of-week will be more selective I've put the server ID first because it's used in a join condition that lets us completely disregard rows of the other table.

Since you didn't provide sample tables or sample data I can't really test it and I'm not keen on dummying some up. Adjustments may be required.

Partial indexes

Partial indexes can also be a big win, where the index only contains data for a subset of the data, e.g.:

CREATE INDEX test_custom_idx ON test (
    servID, userID
) WHERE (extract(dow from timestamp) = 2);

contains only data for that day, so it's a lot faster to search for queries involving just that day, but cannot be used at all for queries that might involve different days, or for which the day is determined at runtime from the output of another part of the query.

Index-only scans and covering indexes

Another thing you can consider is that all those unused columns still have to be read from disk to fetch your rows. In PostgreSQL 9.2 and newer there's a way to avoid that using an index-only scan if all columns you are fetching are in an index.

So you could create a covering index, where you include columns that aren't really required for the actual search but are there so they can be output without reading from the main table. The first thing to do for that is to get rid of useless values in your SELECT list, so we'll delete the redundant t.date and t.time from your query. Then create two new indexes:

CREATE INDEX locations_covering_idx 
ON locations(
    servID_HEX,
    SERV_LOCY,
    SERV_LOCX
);

CREATE INDEX test_covering_idx ON test (
    servID,
    extract(dow from timestamp),
    userID,
    timestamp
);

Reducing repetition

Since your dataset is static, having tons and tons of indexes isn't a big problem.

Each index has a cost for insert/update/delete. So for non-static data you want to minimize that by getting rid of indexes that don't provide enough improvement to be worth the cost.

One way to do that can be to split up composite indexes. In your case for example I might take the timestamp stuff out of the main index on location and put it in a bunch of new indexes instead. That way all the different timestamp expression indexes can be updated without rewriting a whole lot of unrelated data for userid and serverid as well. PostgreSQL can still often use the indexes together, combining them with a bitmap index scan.

Not a concern for your current dataset, but worth learning for later.

Data cleanup

This is a static data set. So if you don't need a column, drop it. Or better yet, create a new table with just the subset of data you need, pre-joined, using something like:

CREATE TABLE reporting AS
select t.userID, t.servID, t.timestamp, l.SERV_LOCY, l.SERV_LOCX 
from test t
     inner join locations l on (t.servID=l.servID_HEX)
ORDER BY t.userID, t.servID, t.timestamp;

which creates a new table that has the data pre-joined and sorted for the most useful grouping of access.

You can then write much simpler queries that'll be much, much faster, against the new table, and your indexes will be a lot simpler too.

PostgreSQL data files have size more than data itself

Size can differ due to several reasons:

indexes take some disk space,
there can be several copies of the same record on disk,
slack space in pages.

Indexes take up disk space in order to facilitate faster lookup. The more indexes you have, the more disk space your DB will take up. GIN indexes are usually smaller, but aren't useful if you use range queries.

PostgreSQL supports concurrent access, which is implemented so, that updates and deletes on records only change the visibility of the records and not actually delete or overwrite the data, since one record may be used by another transaction. With update, a new (updated) copy gets then added. Both mean that old data is still written on disk. To free it up, PostgreSQL periodically performs vacuuming, which really removes the deleted records (after no transaction is using them).

PostgreSQL has default block size of 8KB. If your records are big, say 5KB, you can only get one record in one block, having fairly large (~3KB) slack.

Some possible solutions are:

Be sure to know how to check for disk usage in PostgreSQL,
reevaluate if you really need all the indexes,
think about what your data access pattern is - if it's append only table, then there should be very little invisible records. If on the other hand your data is changed for a time and gets archived after a period, then it's quote possible, that multiple records are on disk. If this is the case, then partitioning your table could help, since you can vacuum partitions separately.
you can vacuum the table manually. Don't use VACUUM FULL, since it will write a new copy of the table and requires appropriate amounts of disk space.
you can use some of the online (online as in you don't need to shut down the DB) repacking tools, such as pg_repack.

Edit:

pg_dump

pg_dump is okay, if you specify to dump data, as you've said you did. If you use custom format (-Fc flags), pg_restore will be able to do some extra stuff with it, such as load only specified tables, also see pg_dump's man page. Custom format gzips dump by default. This may slow your dumping, so you may want to disable that, and if you still want data gzipped, there's a parallel gzip (pigz).

Backups

When you backup, you backup data in tables. Indexes get recreated from data in tables. If you can afford heavy IO on the database when you are doing backups and restores, then pg_dump and pg_restore may suit your needs. When dumping, pg_dump only dumps tables, and when restoring, indexes get rebuilt automatically as part of restore. So to make it clear: backups don't care about indexes.

If heavy IO and degraded performance is not acceptable, then you might want to have a secondary replication server, which will have a copy of data, but won't serve your ordinary queries, so you can use it to make dumps.

If you need point in time restore capability, you could setup WAL log (write-ahead log) archiving, which then enables you to restore to a specific transaction, but this is quite advanced. There are tools to help, such as Barman.

Big indexes

As you've already discovered, indexes can take up a lot of space. If you add up your data and indexes, you get your DB size: 28GB + 42GB = 70GB.

Having big indexes means there's a lot of extra disk space used. Index data is also cached in memory, so having big indexes means you may have two copies of indexed data in RAM, which means there's less RAM for caching data on disk and you get more cache misses. There are some options to evaluate to make indexes smaller:

partial indexes: If the application only queries where a one field is set to a specific value, you can limit which rows are indexed with a WHERE clause on CREATE INDEX.
make use of index scan: If you can include all the queried data in an index, PostgreSQL can return by reading index data only. Works since PostgreSQL 9.2, see more about index scans on wiki.
use another type of index: If your application only does equality lookup (WHERE a = 4), GIN indexes can be a lot smaller, see btree_gin extension. PostgreSQL 9.5 will also bring BRIN indexes, which are really useful for fields, which have a monotonous increasing or decreasing pattern, such as timestamps and they are really really small, so you can free up both disk space for more storage and RAM for caching more data.

But ... a lot of this depends on how data is accessed, which is often only known by developer of the application.