Postgresql – Indexing a large static dataset

index-tuning, postgresql, postgresql-9.3

My current situation is this: I have 328 million lines of data in a flat csv file. Not optimal. I want to be able to query this data (I'll explain in more detail). Grep is getting tired. This data is static and will not change.

I'm new to PostgreSQL and am working with this data on my 2010 MacBook Pro (2.4 GHz Intel Core 2 Duo, 4 GB 1067 MHz DDR3). Storage size is not really a problem; the data is around 65 GB on a 1 TB disk, but read/write speed is not great (it's not an SSD, unfortunately). From what I remember from my CS studies a few years ago, indexing could help me improve query speeds, and I'm reading up on how best to do that.

Why am I using PostgreSQL? Peer pressure. Not really but let's go with that. I have PostgreSQL 9.3.4 running on OSX 10.9.2.

Now, what I'm trying to do. The data includes date and time info and spans about 10 months. I want, for example, to be able to:

  1. Find and summarize all activity on, say, all Mondays in the data set for a specific user and/or for multiple users
  2. Look at differences between weeks in the summertime vs. weeks in the wintertime
  3. Compare between months; and
  4. Compare hours of the day, etc.

It's mostly timestamp related summaries that I'm interested in.

Each data record includes, besides the timestamp, a userID and a servID. In another file (which is stored in another table), each servID is linked to a location as X and Y values. It's these locations that I'm interested in displaying on a map, so I would be exporting these to plot using matlab, in aggregate for many users or focusing on specific users at a time.

All values are repeated, i.e. there are about 700,000 unique userIDs and 1500 unique servIDs. Timestamps are not unique either; the resolution is only seconds, so there are repeated timestamps.

So far I've created two tables, one with a small subset of test data and the X-Y location table:

create table test (userID varchar, junk1 varchar, 
     date varchar, time varchar, junk2 varchar, 
     junk3 varchar, servID varchar, junk4 varchar, junk5 varchar);
copy test from '/Users/path/someuser.csv' delimiter ';' CSV;
create table locations (col1 int, servID_DEC int, 
     Col2 varchar, servID_HEX varchar, Locname varchar, 
     Locname2 varchar, SERV_LOCY float8, SERV_LOCX float8, junk1 float8, junk2 float8);
copy locations from '/Users/path/locations.csv' delimiter ',' header CSV;             

I had to trim whitespace from the data one column at a time, which was annoying; I repeated this for each column:

update test set userID=trim(userID);
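
In hindsight, a single UPDATE could presumably have handled all the columns in one pass, something like:

update test set userID=trim(userID), junk1=trim(junk1), 
     date=trim(date), time=trim(time), junk2=trim(junk2), 
     junk3=trim(junk3), servID=trim(servID), 
     junk4=trim(junk4), junk5=trim(junk5);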

Then I added a column for the actual timestamps I had to construct:

alter table test add column timestamp timestamptz;
update test set timestamp=to_timestamp(date || '-' || time, 'DD.MM.YYYY-HH24:MI:SS');

Now my question (finally! Sorry about that): How do I create an index for the data in Postgres such that a query similar to this would be fast:

select userID, date, time, servID, timestamp, servID_HEX, SERV_LOCY, SERV_LOCX 
     from test, locations 
     where servID=servID_HEX 
     and userID='<someusers>' 
     and extract(dow from timestamp)=2;

I would also like to be able to quickly summarize the frequency of the servIDs used by a userID over a specified timeframe, i.e. they use servID=1 1000 times, servID=2 600 times, and so on (similar to the tabulate command in matlab).

(I renamed some of the variables and path names for reasons, I hope I didn't mess that up).

Best Answer

Query cleanup

First, let's rewrite your query to be readable by using table aliases, qualified field names, and an ANSI join:

select t.userID, t.date, t.time, t.servID, t.timestamp, 
       l.servID_HEX, l.SERV_LOCY, l.SERV_LOCX 
from test t
     inner join locations l on (t.servID=l.servID_HEX)
where t.userID='<someusers>' 
  and extract(dow from t.timestamp)=2;

What the query needs

To start breaking this problem down, the first thing you'll want to do is look at the fields used and where they're used:

Tables: locations, test

Fields output: test.userID, test.date, test.time, test.servID, test.timestamp, locations.servID_HEX, locations.SERV_LOCY, locations.SERV_LOCX

Terms used in filter predicates: test.servID, locations.servID_HEX, test.userID, extract(dow from test.timestamp)=2

Indexes

Now, the most important thing is to make sure that you have indexes that target high-selectivity columns used in predicates.

"high selectivity" just means that there aren't tons of the same value - there's no point indexing a column if 50% of the time it's 'a' and 50% of the time it's 'b', because you just don't gain much benefit from the index.

Assuming that all these columns have lots of well-distributed distinct values - so they're highly selective and don't have any values that are massively common - you'll want to create an ordinary b-tree index. That's enough for locations because there's just one value used in a predicate:

CREATE INDEX locations_servid_idx ON locations(servID_HEX);

For test it's more complex.

You could create indexes separately for each column, but it's more efficient to create a single composite index that has all the columns you're using in this query's predicates for a given table. So you want servID, userID and timestamp. However, you don't use timestamp directly - you use it to get the day of week, and an index on timestamp can't be used to look up extract(dow from timestamp).

This is where expression indexes come in.

Indexes don't just cover columns. They can also cover arbitrary expressions, as long as everything in the expression is IMMUTABLE. In this case we want to index extract(dow from timestamp). There's one catch: because the column is a timestamptz, extract(dow from timestamp) depends on the session's TimeZone setting, so it's only STABLE and PostgreSQL will refuse to put it in an index. You need to pin the time zone inside the expression, e.g. extract(dow from timestamp AT TIME ZONE 'UTC') - substitute whatever zone your data should be interpreted in - or store the column as a plain timestamp instead. Your queries then have to use exactly the same expression for the planner to match it to the index.

So the index for test will look like:

CREATE INDEX test_custom_idx ON test (
    servID,
    -- pinning the time zone makes the expression IMMUTABLE, so it can be indexed
    extract(dow from timestamp AT TIME ZONE 'UTC'),
    userID
);

As you'll be doing different kinds of timestamp based queries I'd consider creating multiple indexes for different timestamp expressions.
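
For example, sketches along these lines (index names are just illustrative; the AT TIME ZONE cast again keeps each expression immutable):

CREATE INDEX test_month_idx ON test (extract(month from timestamp AT TIME ZONE 'UTC'));
CREATE INDEX test_hour_idx ON test (extract(hour from timestamp AT TIME ZONE 'UTC'));
CREATE INDEX test_week_idx ON test (date_trunc('week', timestamp AT TIME ZONE 'UTC'));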

Which order to put the columns in depends on how selective each column is. The day-of-week expression, with only seven possible values, is the least selective of the three; I've put the server ID first because it's used in a join condition that lets us completely disregard rows of the other table.

Since you didn't provide sample tables or sample data I can't really test it and I'm not keen on dummying some up. Adjustments may be required.

Partial indexes

Partial indexes can also be a big win. A partial index only contains entries for a subset of the rows, e.g.:

CREATE INDEX test_day2_partial_idx ON test (
    servID, userID
) WHERE (extract(dow from timestamp AT TIME ZONE 'UTC') = 2);

contains only data for that day, so it's a lot faster to search for queries involving just that day, but cannot be used at all for queries that might involve different days, or for which the day is determined at runtime from the output of another part of the query.

Index-only scans and covering indexes

Another thing you can consider is that all those unused columns still have to be read from disk to fetch your rows. In PostgreSQL 9.2 and newer there's a way to avoid that using an index-only scan if all columns you are fetching are in an index.

So you could create a covering index, where you include columns that aren't really required for the actual search but are there so they can be output without reading from the main table. The first thing to do for that is to get rid of useless values in your SELECT list, so we'll delete the redundant t.date and t.time from your query. Then create two new indexes:

CREATE INDEX locations_covering_idx 
ON locations(
    servID_HEX,
    SERV_LOCY,
    SERV_LOCX
);

CREATE INDEX test_covering_idx ON test (
    servID,
    extract(dow from timestamp AT TIME ZONE 'UTC'),
    userID,
    timestamp
);
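
You can check whether the planner actually chooses an index-only scan with EXPLAIN. A sketch, using the trimmed-down query and the AT TIME ZONE form of the day-of-week expression so it matches the index:

EXPLAIN (ANALYZE, BUFFERS)
select t.userID, t.servID, t.timestamp, l.SERV_LOCY, l.SERV_LOCX 
from test t
     inner join locations l on (t.servID=l.servID_HEX)
where t.userID='<someusers>' 
  and extract(dow from t.timestamp AT TIME ZONE 'UTC')=2;

Look for "Index Only Scan" nodes in the output. Since the data is static, run VACUUM once after loading so the visibility map is set; otherwise PostgreSQL still has to visit the heap to check row visibility and you lose most of the benefit.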

Reducing repetition

Since your dataset is static, having tons and tons of indexes isn't a big problem.

Each index has a cost for insert/update/delete. So for non-static data you want to minimize that by getting rid of indexes that don't provide enough improvement to be worth the cost.

One way to do that can be to split up composite indexes. In your case, for example, I might take the timestamp expressions out of the main composite index on test and put them in a bunch of new indexes instead. That way all the different timestamp expression indexes can be updated without also rewriting a whole lot of unrelated data for userID and servID. PostgreSQL can still often use the indexes together, combining them with a bitmap index scan.
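
Split out, that might look something like this (again just a sketch):

CREATE INDEX test_userid_idx ON test (userID);
CREATE INDEX test_servid_idx ON test (servID);
CREATE INDEX test_dow_idx ON test (extract(dow from timestamp AT TIME ZONE 'UTC'));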

Not a concern for your current dataset, but worth learning for later.

Data cleanup

This is a static data set. So if you don't need a column, drop it. Or better yet, create a new table with just the subset of data you need, pre-joined, using something like:

CREATE TABLE reporting AS
select t.userID, t.servID, t.timestamp, l.SERV_LOCY, l.SERV_LOCX 
from test t
     inner join locations l on (t.servID=l.servID_HEX)
ORDER BY t.userID, t.servID, t.timestamp;

which creates a new table that has the data pre-joined and sorted for the most common access pattern.

You can then write much simpler queries against the new table that'll be much, much faster, and your indexes will be a lot simpler too.
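
For instance, the servID frequency summary you asked about (how often a given user used each servID over some timeframe) becomes something like this - a sketch, with a hypothetical index to back it and placeholder values for the user and date range:

CREATE INDEX reporting_user_time_idx ON reporting (userID, timestamp);

select servID, count(*) as uses
from reporting
where userID='<someuser>'
  and timestamp >= '2014-01-01'   -- placeholder range; use whatever window you need
  and timestamp <  '2014-02-01'
group by servID
order by uses desc;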