Postgresql – Solutions to RLS (Row Level Security) in PostgreSQL for Desktop Application

database-designpostgresqlrow-level-security

The Situation:

I have a desktop application that currently uses a database which is stored locally on the computer that is running the application, and I am starting to implement a server DB so that some of the information in this application can be shared between users on different machines. The local DB schema was not designed to support RLS, or anything close to it. I'm now trying to design a schema that holds the same data, but restricts access to it based on a users 'role' or 'group', etc.

My Understanding:

I would ideally like to use PostgreSQL, but others are fine, and saw that they have RLS support, but I'm unsure how to proceed. I'm thinking of using something like a RL_Groups and Users table, with a junction table like RL_Group_Users to determine row level access instead. I'm not sure of performance / security implications of either option, or even the alternative options. I've attached a simplified diagram to give some context to my current thoughts.

Supporting Info:

The local database is SQLite 3.
There are roughly 100 tables at the moment.
These 100 tables will all need some kind of RLS to restrict information to users and user_groups.

If I missed anything or you have any questions please comment so I can get back to you.

NOTE: The attached diagram has been simplified, data has been removed, etc. but it should be plenty to illustrate the problem.

Best Answer

I believe you're problem is similar to one I've faced a few years ago, while trying to design a multi-tenant web application with a single underlying database schema. There's an interesting article by AWS that can point you in the right direction. The core of it is:

CREATE POLICY tenant_isolation_policy ON tenant
USING (tenant_id::TEXT = current_user);

There are several different approaches, but the one I ended following is this:

Use a different database user for each connection coming from your desktop PCs. This should match your RLGroupId, if I understood correctly.
Apply a Policy to every table where data segregation by RLGroupId should be enforced

Hope this helps.

Related Solutions

Database Design – Row Level Security vs Multiple Tables

My bias is to use a single table with appropriate row-level security.

There are potentially huge maintenance advantages to a single set of tables. If you end up with n copies of each table, that means that you have to run n copies of each script every time you want to make a change. Frequently, that means that you end up with at least a few very slightly different versions of the application running at a time because someone forgot to apply script 7 of 23 in a monthly build to one set of tables and someone else created an index on one set of tables to address one customer's issues without adding it to every customer which makes debugging much harder.

A single set of tables has significant scalability advantages. Adding new customers just requires adding a new row to the customer table not deploying a new schema/ database with the new customer's copies of the tables. Adding new customers also doesn't directly add ongoing work for the DBAs. If you have separate copies of the tables, you need someone to deploy tables to create a new customer and every new customer means additional work for the DBA at least to run the scripts one more time every time there is a change.

A single set of tables may also offer performance advantages. If you're using a separate set of tables, each customer would realistically need a separate connection pool in the middle tier. It would defeat the purpose of having separate tables, after all, if your middle tier is connecting as a user that can see every tenant's data because then you'd be implementing row-level security in the middle tier and dealing with all the complexity of multiple sets of tables in the back end. That makes it tough to scale across servers-- do you create a connection pool for every client on every server? Do you send certain clients to certain servers? Do you not preallocate connections and incur the cost of waiting for connections to be established more frequently?

That being said, there are cases where separate tables might be preferred. If your customers are frequently large institutions, for example, separate tables will make it much easier to do things like move a customer to a dedicated box (or at least a dedicated VM) if the customer wants to upgrade to dedicated hardware so that they don't risk performance being affected by other customers. Those large institutions may want greater control over outages and upgrades so it may make sense to have separate tables to allow different customers to be upgraded at different times in order to work with that customer's schedule. Those institutions may find it easier to tell an auditor that their data is physically separate from all other customers rather than explaining that the data is physically intermingled but security controls are in place to guarantee row-level security. If you're going to have relatively few relatively large customers, the amount of maintenance overhead you introduce by having separate tables may not be particularly significant given the general day-to-day maintenance tasks that each client likely requires. In that sort of environment, different clients often have sufficiently different configurations that the problems they encounter are relatively unique even when they're running exactly the same version of the software.

Postgresql – Optimize Row Level Security expression in Postgres

At first glance, it might appear that since current_setting and string_to_array functions are stable and immutable respectively and there is an index on the category column, the following condition could do the trick.

CREATE POLICY table_select_policy ON big_table
FOR SELECT
USING (category = ANY(string_to_array(current_setting('mydb.allowed_categories'),',')::int[])));

However, the real problem has nothing to do with the form of the condition. If you take a look at the PostgreSQL system catalog and find those two functions that were used in the condition, you may notice that both functions have their cost parameter set to 1. This makes the optimizer assume that calling those functions to recheck the condition while doing Bitmap Heap Scan is cheap.

In order to illustrate this, consider the big_table table shown below.

CREATE TABLE big_table AS
SELECT c AS category
FROM generate_series(1, 1000000) g,
     generate_series(1, 10) c;

CREATE INDEX big_table_category_idx ON big_table (category);

ANALYZE big_table;

Querying the table with the the query below results in the following plan (look at the number of rows removed by Index Recheck).

EXPLAIN ANALYZE SELECT * FROM big_table WHERE category = ANY(string_to_array(current_setting('mydb.allowed_categories'), ',')::int[]);

"Bitmap Heap Scan on big_table  (cost=52478.56..195418.17 rows=3036665 width=4) (actual time=166.613..9273.010 rows=3000000 loops=1)"
"  Recheck Cond: (category = ANY ((string_to_array(current_setting('mydb.allowed_categories'::text), ','::text))::integer[]))"
"  Rows Removed by Index Recheck: 5209365"
"  ->  Bitmap Index Scan on big_table_category_idx  (cost=0.00..51719.39 rows=3036665 width=0) (actual time=164.782..164.782 rows=3000000 loops=1)"
"        Index Cond: (category = ANY ((string_to_array(current_setting('mydb.allowed_categories'::text), ','::text))::integer[]))"
"Total runtime: 9341.568 ms"

To avoid that, you have two options. The first is to increase the work_mem parameter in your server config. This will allow the server to store the complete bitmap in memory and save a great deal of time while rechecking the condition. The plan below was obtained when the work_mem parameter was set to 1000M.

EXPLAIN ANALYZE SELECT * FROM big_table WHERE category = ANY(string_to_array(current_setting('mydb.allowed_categories'), ',')::int[]);


"Bitmap Heap Scan on big_table  (cost=52478.56..195418.17 rows=3036665 width=4) (actual time=193.613..449.385 rows=3000000 loops=1)"
"  Recheck Cond: (category = ANY ((string_to_array(current_setting('mydb.allowed_categories'::text), ','::text))::integer[]))"
"  ->  Bitmap Index Scan on big_table_category_idx  (cost=0.00..51719.39 rows=3036665 width=0) (actual time=184.858..184.858 rows=3000000 loops=1)"
"        Index Cond: (category = ANY ((string_to_array(current_setting('mydb.allowed_categories'::text), ','::text))::integer[]))"
"Total runtime: 513.528 ms"

The second option is to wrap the query condition in a "costly" function.

CREATE OR REPLACE FUNCTION my_categories()
RETURNS int[] AS $$
BEGIN
  RETURN string_to_array(current_setting('mydb.allowed_categories'), ',')::int[];
END;
$$ LANGUAGE plpgsql STABLE COST 100000;

EXPLAIN ANALYZE SELECT * FROM big_table WHERE category = ANY(my_categories());

"Index Only Scan using big_table_category_idx on big_table  (cost=250.44..9363945.19 rows=3036665 width=4) (actual time=0.035..577.094 rows=3000000 loops=1)"
"  Index Cond: (category = ANY (my_categories()))"
"  Heap Fetches: 3000000"
"Total runtime: 642.522 ms"

This option will also make the optimizer end up performing Index Scan instead of Bitmap Scan which may turn out to be a bit slower.

Best Answer

Related Solutions

Database Design – Row Level Security vs Multiple Tables

Postgresql – Optimize Row Level Security expression in Postgres

Related Question