Postgresql – How to Model / Quickly test for set membership in SQL

postgresql

Each row in a table needs to be protected a list of application specific roles lets say something that looks like this example below

     Desired Table Structure

pkey | data | roles 
---------------------
 1   |  ... | {1,2,5}
 2   |  ... | {1,4,3}
 3   |  ... | {7,11,67,101}

In this example the row with primary key 1 should only be visible to any user who has roles 1 or 2 or 5, row 2 should be visible to any user who has role 1 or 4 or 3 … etc.

When a user logs into the application the application computes all roles that the user has for example I know that User 1 has has roles {1,11,7} and User 2 has roles {1,3,5}

My current implementation uses two tables like so.

    Actual Table Structure

foo table
pkey | data 
------------
 1   | ... 
 2   | ...        
 3   | ...

 roles table
 foo_fk | role_id
 ------------------
   1     | 1
   1     | 2
   1     | 5
   2     | 1 
   2     | 4
   2     | 3
   3     | 7
   3     | 11 
   3     | 67
   3     | 101

A join is required to make a query on data that only returns the data for a specific user.

Is there a way to represent the roles in some way that I can store the list of roles that protects a row in a single colum such that I can do a select statement against that table that returns only the rows that the user can see. I want this to be as fast as possible.

I have already tried using bit strings but backed off because a bit string operation in the left hand side of the where clause is going to always cause a full table scan because an expression such as
(column & seach_bit_mask) == some_value needs to evaluate the bit operation on every row of the table to return the result, thus requiring a full table scan.

Is there way to do what i want above with out using two tables that need a join or a bit operations that require a full table scan. I have control over the schema and the representation of the roles, I am open to clever encoding schemes if those help. I am using Postgres 9.2 and if writing C extensions to PostgreSQL can help I would be willing to entertain the idea.

I could not come up with a good title for this question, anyone that can improve the title please do so.

UPDATE Read performance is much more important than update performance for the foo_roles

UPDATE Sample code for PostgresSQL to play with.

DROP TABLE IF EXISTS foo_roles;
DROP TABLE IF EXISTS foo;

CREATE TABLE foo (
  pkey INTEGER PRIMARY KEY,
  name TEXT
);

CREATE TABLE foo_roles(
  foo_fk INTEGER REFERENCES foo(pkey),
  role_id INTEGER,
  PRIMARY KEY (foo_fk,role_id)
);

INSERT INTO foo (pkey,name) VALUES 
(1,'A'), (2,'B'), (3,'C');

INSERT INTO foo_roles(foo_fk,role_id) VALUES 
(1,1), (1,2), (1,5), 
(2,1), (2,4), (2,3),
(3,7), (3,11), (3,67), (3,101);

SELECT DISTINCT foo.pkey
FROM   foo,
       foo_roles
WHERE  foo.pkey = foo_roles.foo_fk
       AND foo_roles.role_id IN ( 1, 2, 11, 14 ); 

Explain for the current implementation query, notice the two sequential table scans.

HashAggregate  (cost=74.67..75.09 rows=42 width=4)
  ->  Hash Join  (cost=42.62..74.57 rows=42 width=4)
        Hash Cond: (foo.pkey = foo_roles.foo_fk)
        ->  Seq Scan on foo  (cost=0.00..22.30 rows=1230 width=4)
        ->  Hash  (cost=42.10..42.10 rows=42 width=4)
              ->  Seq Scan on foo_roles  (cost=0.00..42.10 rows=42 width=4)
                    Filter: (role_id = ANY ('{1,2,11,14}'::integer[]))

UPDATE Based on Chris Travers Answer Below.

DROP TABLE IF EXISTS foo_roles;
DROP TABLE IF EXISTS foo;

CREATE TABLE foo (
  pkey INTEGER PRIMARY KEY,
  name TEXT,
  roles INTEGER[]
);

CREATE INDEX ON foo USING gin(roles);

CREATE TABLE foo_roles(
  foo_fk INTEGER REFERENCES foo(pkey),
  role_id INTEGER,
  PRIMARY KEY (foo_fk,role_id)
);
CREATE OR REPLACE FUNCTION rebuild_foo_roles(in_foo_id int) RETURNS int[] 
LANGUAGE SQL AS $$

     UPDATE foo SET roles = (select array_agg(role_id) from foo_roles where foo_fk = $1)
      WHERE pkey = $1
     RETURNING roles;
$$;

INSERT INTO foo (pkey,name) VALUES (1,'A'), (2,'B'), (3,'C');
INSERT INTO foo_roles(foo_fk,role_id) VALUES 
(1,1),(1,2),(1,5),
(2,1),(2,4),(2,3),
(3,7),(3,11),(3,67),(3,101);

SELECT rebuild_foo_roles(1);
SELECT rebuild_foo_roles(2);
SELECT rebuild_foo_roles(3);

SELECT * from foo WHERE foo.roles && ARRAY[1,2,11,14];

Result of the Query

 pkey | name |     roles     
------+------+---------------
    1 | A    | {1,2,5}
    2 | B    | {1,4,3}
    3 | C    | {7,11,67,101}
(3 rows)

Explain for the query EXPLAIN SELECT * from foo WHERE foo.roles && ARRAY[1,2,11,14];

                     QUERY PLAN                      
-----------------------------------------------------
 Seq Scan on foo  (cost=0.00..20.38 rows=4 width=68)
   Filter: (roles && '{1,2,11,14}'::integer[])
(2 rows)

It still does a sequential scan but that is much better iprovement over the query plan using the joins. I am not sure if I am creating the GIN index correctly, or if postgres planner is doing a seq scan because there is not enough data so I tried.

play=# set enable_seqscan=false;
SET
play=# explain SELECT * from foo WHERE foo.roles && ARRAY[1,2,11,14];
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=24.03..32.49 rows=4 width=68)
   Recheck Cond: (roles && '{1,2,11,14}'::integer[])
   ->  Bitmap Index Scan on foo_roles_idx  (cost=0.00..24.03 rows=4 width=0)
         Index Cond: (roles && '{1,2,11,14}'::integer[])
(4 rows)

Is this query plan correct for GIN index?

Best Answer

The first thing I will say here is that you really need to encapsulate this properly to make it work. Secondly this is a good case where breaking 1NF is likely to be a good thing performance-wise.

I would break this into three pieces, totally mangling Codd's rules in the process, but I think those can be broken in certain cases and here breaking one piece demands breaking more pieces in order to correct the damage.

The first piece is the administration portion. In this portion, permissions are stored in the form you are currently using. This is used for updating and managing the permissions in a relational way. These would then be compiled for a specific row using a stored procedure.

The second piece is the storage piece. Here the rows are stored with the roles in an array as you suggest in your first answer. The arrays here are indexed using GIN which makes lookups a lot faster.

Third, you need encapsulation logic, such as views to only retrieve the rows a user has access to, or a stored procedure interface which checks the roles adding WHERE my_role_id = ANY(roles) to the end of the query.

To make this work you will need to compile the permissions when they are changed. However this structure allows you to say "all rows that Bob has access to can now be accessed by Joe."

 CREATE TABLE foo (
    id serial not null unique,
    ....
    roles int[]
 );
 CREATE TABLE foo_roles (
    foo_id int not null references foo(id),
    role_id int not null references roles(id)
 );

 CREATE OR REPLACE FUNCTION rebuild_foo_roles(in_foo_id int) RETURNS int[] 
 LANGUAGE SQL AS $$

     UPDATE foo SET roles = (select array_agg(role_id) from foo_roles 
                               where foo_id = $1)
      WHERE role_id = $1
     RETURNING roles;
  $$;

Then you'd update the permissions by altering the data in foo_roles and then calling rebuild_foo_roles with the ids of the rows you want to modify.

Now one thing about this structure is that it makes lookups fast, but updates to the permissions pretty slow.

Update: A note about GIN indexes (per request).

Traditional database indexes are b-trees, which means you can search by comparing an expected value to branches of the index, and then traversing to leaves. This works well if searching either on the whole value (WHERE foo = 'my_fun_value') but it doesn't work well with more complex searches. GIN indexes (Generalized Index Method) allow for indexes to be used in a wider variety of searches, but they are slightly slower than b-tree indexes. GIN indexes can be used to ask questions like:

  • Is value x in array y? (our use case here)

  • Does circle A overlap with circle B?

  • Does line segment C cross line segment D?

  • Does appointment A overlap time-wise with appointment B?

In newer versions of PostgreSQL you can use these indexes to enforce constraints like "No appointments in any given room may overlap time-wise" but the main use case here is that it makes searching arrays to be fast.

In general a GIN index will give you much better performance than a b-tree or GIST index on a join. In this case your query against foo can be an index scan against the relation with no joins needed.