Postgresql – Inner join using an array column

gin-indexindexperformancepostgresqlquery-performance

Having trouble indexing and executing a query in O (log n) time.

The query includes an INNER JOIN, an ORDER BY, and an equality operation. If I understand the laws of databases correctly, a query can be indexed and executed in O (log n) time (or thereabouts) if a non-equality operator is not used on more than one column. In this case, I believe the INNER JOIN does count as an equality operator and the non-equality operator would be the ORDER BY portion of the query. This table has upwards of 10,000,000 rows and needs to handle several reads and writes per second.

Using PostgreSQL. This is what the table looks like. As you can see, the column 'Names' is a list property and it is the column that the INNER JOIN goes against:

Age Names                       Date
34  ['carla', 'john', 'sam']    3/13/2011
26  ['json', 'cindy', 'joel']   3/13/2011
72  ['beth', 'amber', 'susie']  3/13/2011
14  ['john', 'jim', 'debie']    3/13/2011

This is the query that we are trying to do:

SELECT * FROM the_table WHERE Age==26 AND Names=='john' ORDER BY Date

My background is from using App Engine's Big Table, so I've used equality operators here to indicate that 'john' should be one of the names in the Names column. This would be an acceptable query in GAE's big table, it would execute in O (log N) time as all Big Table queries are reqyured to do. I am assuming there is a way to do this in PostgreSQL as well since PostgreSQL accepts list data types as columns.

Is this possible to do in PostgreSQL?

If so, how should the index be set up (we can't figure out how to set up an index that takes into account the three properties)?

Best Answer

This is possible for PostgreSQL 8.4 or later. In order to create a multi-column GIN index, you need to install the additional module btree_gin.
Simple types like integer or date are usually better off with a (default) B-Tree index, so that's not installed with standard PostgreSQL. But for a case like this a multi-column index is fastest, so you need the additional index methods for the plain types.

In PostgreSQL 9.1 or later you can install it with CREATE EXTENSION:

CREATE EXTENSION btree_gin;

In older versions you would run as privileged system user (like postgres):

psql -d dbname -f SHAREDIR/contrib/btree_gin.sql

For a PostgreSQL 8.4 installation on Debian,this might be:

psql -d mydb -f /usr/share/postgresql/8.4/contrib/btree_gin.sql

Then, given this table:

CREATE TEMP TABLE tbl (age int, names text[], thedate date);

... you can create this multi-column GIN index:

CREATE INDEX tbl_gin_idx ON tbl USING GIN (names, age, thedate);

... which can be used by a query like:

SELECT * FROM tbl
WHERE  age = 26
AND    '{json}'::text[] <@ names
ORDER  BY thedate;

Note that a GIN index caries a non-trivial cost for write operations.
But that's hardly worth mentioning in comparison to what a SELECT on 10 million rows without index would do to you.