My current situation is this: I have 328 million lines of data in a flat csv
file. Not optimal. I want to be able to query this data (I'll explain in more detail). Grep
is getting tired. This data is static and will not change.
I'm new to PostgreSQL and am working with this data on my 2010 MacBook Pro, 2.4 GHz Intel Core 2 Duo, 4 GB 1067 MHz DDR3. Storage size is not really a problem: the data is around 65 GB on a 1 TB disk, but read/write speed is not great (it's not an SSD, unfortunately). From what I remember from my CS studies a few years ago, indexing could help improve query speeds, and I'm reading up on how best to do that.
Why am I using PostgreSQL? Peer pressure. Not really, but let's go with that. I have PostgreSQL 9.3.4 running on OS X 10.9.2.
Now, what I'm trying to do. The data includes date and time info and spans about 10 months. I want, for example, to be able to:
- Find and summarize all activity on, say, all Mondays in the data set, for a specific user and/or for multiple users
- Look at differences between weeks in the summertime vs. weeks in the wintertime
- Between months; and
- Hours of the day, etc.
It's mostly timestamp related summaries that I'm interested in.
Each data record includes, besides the timestamp, a `userID` and a `servID`. In another file (which is stored in another table), each `servID` is linked to a location as X and Y values. It's these locations that I'm interested in displaying on a map, so I would be exporting these to plot using MATLAB, in aggregate for many users or focusing on specific users at a time.
All values are repeated, i.e. there are about 700,000 unique `userID`s and 1,500 unique `servID`s. Timestamps are not unique either; the resolution is only seconds, so there are repeated timestamps.
So far I've created two tables, one with a small subset of test data and the X-Y location table:
```sql
create table test (userID varchar, junk1 varchar,
    date varchar, time varchar, junk2 varchar,
    junk3 varchar, servID varchar, junk4 varchar, junk5 varchar);
copy test from '/Users/path/someuser.csv' delimiter ';' CSV;

create table locations (col1 int, servID_DEC int,
    Col2 varchar, servID_HEX varchar, Locname varchar,
    Locname2 varchar, SERV_LOCY float8, SERV_LOCX float8, junk1 float8, junk2 float8);
copy locations from '/Users/path/locations.csv' delimiter ',' header CSV;
```
I had to trim the data one column at a time, which was annoying; I repeated this for each column:

```sql
update test set userID=trim(userID);
```
Then I added a column for the actual timestamps, which I had to construct:

```sql
alter table test add column timestamp timestamptz;
update test set timestamp=to_timestamp(date || '-' || time, 'DD.MM.YYYY-HH24:MI:SS');
```
Now my question (finally! Sorry about that): How do I create an index for the data in Postgres such that a query similar to this would be fast:
```sql
select userID, date, time, servID, timestamp, servID_HEX, SERV_LOCY, SERV_LOCX
from test, locations
where servID=servID_HEX
and userID='<someusers>'
and extract(dow from timestamp)=2;
```
I would also like to be able to quickly summarize the frequency of the `servID`s used by a `userID` over a specified timeframe, i.e. he uses servID=1 1000 times, servID=2 600 times, and so on (similar to the tabulate command in MATLAB).
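(For reference, I understand the summary itself is just a `GROUP BY`, something like this sketch with a placeholder user ID and date range; it's making queries like this fast that I'm asking about.)

```sql
select servID, count(*) as uses
from test
where userID = '<someuser>'
  and timestamp >= '2013-06-01'
  and timestamp <  '2013-07-01'
group by servID
order by uses desc;
```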
(I renamed some of the variables and path names for reasons, I hope I didn't mess that up).
Best Answer
Query cleanup
First, let's rewrite your query to be readable, by using table aliases, qualifying field names, and using an ANSI join:
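Based on the query as posted in the question, that cleanup might look like this:

```sql
SELECT t.userID, t.date, t.time, t.servID, t.timestamp,
       l.servID_HEX, l.SERV_LOCY, l.SERV_LOCX
FROM test t
INNER JOIN locations l ON (t.servID = l.servID_HEX)
WHERE t.userID = '<someusers>'
  AND extract(dow FROM t.timestamp) = 2;
```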
What the query needs
To start breaking this problem down, the first thing you'll want to do is look at the fields used and where they're used:
Tables:
locations, test
Fields output:
test.userID, test.date, test.time, test.servID, test.timestamp, locations.servID_HEX, locations.SERV_LOCY, locations.SERV_LOCX
Terms used in filter predicates:
test.servID, locations.servID_HEX, test.userID, extract(dow from test.timestamp)=2
Indexes
Now, the most important thing is to make sure that you have indexes that target high-selectivity columns used in predicates.
"high selectivity" just means that there aren't tons of the same value - there's no point indexing a column if 50% of the time it's 'a' and 50% of the time it's 'b', because you just don't gain much benefit from the index.
Assuming that all these columns have lots of well-distributed distinct values, so they're highly selective and don't have any values that are massively common, you'll want to create an ordinary b-tree index. That's enough for `locations`, because only one of its columns is used in a predicate.
For `test` it's more complex. You could create indexes separately for each column, but it's more efficient to create a single composite index that has all the columns you're using in this query's predicates for a given table. So you want `servID`, `userID` and `timestamp`. However, you don't use `timestamp` directly; you use it to get the day of week, and an index on `timestamp` can't be used to look up `extract(dow from timestamp)`. This is where expression indexes come in.
Indexes don't just cover columns; they can also cover arbitrary expressions. In this case, we'll create an index that includes `extract(dow from timestamp)`. So for `test` we want a single composite expression index. As you'll be doing different kinds of timestamp-based queries, I'd consider creating multiple indexes for different timestamp expressions.
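A sketch of the two index definitions (index names are my own; note that because the column is a `timestamptz`, `extract(dow from timestamp)` is not immutable, since it depends on the session time zone, so the index expression has to pin a time zone with `AT TIME ZONE`, and queries must use that exact same expression to benefit from the index):

```sql
-- Plain b-tree index on the locations join key
CREATE INDEX locations_servid_hex_idx ON locations (servID_HEX);

-- Composite expression index on test; the dow expression pins UTC so it is immutable
CREATE INDEX test_servid_userid_dow_idx
    ON test (servID, userID, (extract(dow FROM "timestamp" AT TIME ZONE 'UTC')));
```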
Which order to put the columns in depends on how selective each column is. Even though `userID` will surely be the more selective column, I've put the server ID first because it's used in a join condition that lets us completely disregard rows of the other table.
Since you didn't provide sample tables or sample data I can't really test it and I'm not keen on dummying some up. Adjustments may be required.
Partial indexes
Partial indexes can also be a big win: the index contains data only for a subset of the table's rows, selected by a `WHERE` clause on the index itself. An index restricted to a single day, for example, contains only data for that day, so it's a lot faster to search for queries involving just that day, but it cannot be used at all for queries that might involve different days, or for which the day is determined at runtime from the output of another part of the query.
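As a sketch (the date is a placeholder), a partial index restricted to one day would look like:

```sql
-- Partial index: only rows from 17 March 2014 are indexed
CREATE INDEX test_20140317_idx
    ON test (userID, servID)
    WHERE "timestamp" >= '2014-03-17 00:00+00'
      AND "timestamp" <  '2014-03-18 00:00+00';
```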
Index-only scans and covering indexes
Another thing you can consider is that all those unused columns still have to be read from disk to fetch your rows. In PostgreSQL 9.2 and newer there's a way to avoid that using an index-only scan if all columns you are fetching are in an index.
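For example, if the `locations` columns the query outputs are stored in the index itself, the join side can be satisfied entirely from the index (a sketch; the index name is mine):

```sql
-- Covering index: servID_HEX is the search key; the two location columns are
-- included only so an index-only scan can return them without heap fetches
CREATE INDEX locations_covering_idx
    ON locations (servID_HEX, SERV_LOCY, SERV_LOCX);
```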
So you could create a covering index, where you include columns that aren't really required for the actual search but are there so they can be output without reading from the main table. The first thing to do for that is to get rid of useless values in your `SELECT` list, so we'll delete the redundant `t.date` and `t.time` from your query. Then create two new indexes: a covering version of the composite expression index on `test`, and a covering version of the index on `locations`.

Reducing repetition
Since your dataset is static, having tons and tons of indexes isn't a big problem.
Each index has a cost for insert/update/delete. So for non-static data you want to minimize that by getting rid of indexes that don't provide enough improvement to be worth the cost.
One way to do that can be to split up composite indexes. In your case, for example, I might take the timestamp stuff out of the main composite index on `test` and put it in a bunch of new expression indexes instead. That way all the different timestamp expression indexes can be updated without rewriting a whole lot of unrelated data for `userID` and `servID` as well. PostgreSQL can still often use the indexes together, combining them with a bitmap index scan.
Not a concern for your current dataset, but worth learning for later.
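A sketch of what that split might look like (again pinning a time zone so the expressions are immutable):

```sql
-- Narrow composite index for the join/user lookups
CREATE INDEX test_servid_userid_idx ON test (servID, userID);

-- Separate expression indexes, one per timestamp expression you query on
CREATE INDEX test_dow_idx   ON test ((extract(dow   FROM "timestamp" AT TIME ZONE 'UTC')));
CREATE INDEX test_month_idx ON test ((extract(month FROM "timestamp" AT TIME ZONE 'UTC')));
CREATE INDEX test_hour_idx  ON test ((extract(hour  FROM "timestamp" AT TIME ZONE 'UTC')));
```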
Data cleanup
This is a static data set. So if you don't need a column, drop it. Or better yet, create a new table with just the subset of data you need, pre-joined, using something like:
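A sketch of that, with an assumed name for the new table:

```sql
CREATE TABLE user_activity AS
SELECT t.userID, t."timestamp", t.servID,
       l.SERV_LOCY, l.SERV_LOCX
FROM test t
JOIN locations l ON (t.servID = l.servID_HEX)
ORDER BY t.userID, t."timestamp";
```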
which creates a new table that has the data pre-joined and sorted for the most useful grouping of access.
You can then write much simpler queries that'll be much, much faster, against the new table, and your indexes will be a lot simpler too.