Postgresql – How much could sequential scan gain by GPU

postgresql

I am considering hiring someone to implement GPU processing in Postgres to speed up sequential scans on a database located on a ramdisk. How much do you believe I could gain using this approach? I could buy any type of graphic card. I know that this question is really hard, but I know very little about GPU, so a very rough guess would help me a lot.

EDIT: Even though I my interest is specifically for Postgres, I could ask this question in a different way: "How large is the potential performance increase by using GPU for sequential scans on an in-memory database". If the answer is that it has a potential of 10-50 times the performance, then I can start investigating more specifically if this could be realized also in Postgres.

Best Answer

I stumbled on some research for sqllite that may be applicable. They reported increased performance of 20-70X for some applications

It goes without saying that your mileage may vary.

Related Solutions

PostgreSQL Sequential Scan instead of Index Scan Why

Notice this line:

->  Index Scan using data_area_pkey on data_area  (cost=0.00..52.13 rows=1 width=8) 
    (actual time=0.006..0.008 rows=0 loops=335130)

If you compute the total cost, considering loops, it is 52.3 * 335130 = 17527299. This is larger than 14857017.62 for the seq_scan alternative. That is why it does not use the index.

So the optimizer is overestimating the cost of the index scan. I'd guess that your data is sorted on the index (either due to a clustered index or to how it was loaded) and/or you have plenty of cache memory and/or a nice fast disk. Hence there is little random I/O going on.

You should also check the correlation in pg_stats, that is used by the optimizer to assess clustering when computing the index cost, and finally try changing random_page_cost and cpu_index_tuple_cost, to match your system.

Postgresql – How to Model / Quickly test for set membership in SQL

The first thing I will say here is that you really need to encapsulate this properly to make it work. Secondly this is a good case where breaking 1NF is likely to be a good thing performance-wise.

I would break this into three pieces, totally mangling Codd's rules in the process, but I think those can be broken in certain cases and here breaking one piece demands breaking more pieces in order to correct the damage.

The first piece is the administration portion. In this portion, permissions are stored in the form you are currently using. This is used for updating and managing the permissions in a relational way. These would then be compiled for a specific row using a stored procedure.

The second piece is the storage piece. Here the rows are stored with the roles in an array as you suggest in your first answer. The arrays here are indexed using GIN which makes lookups a lot faster.

Third, you need encapsulation logic, such as views to only retrieve the rows a user has access to, or a stored procedure interface which checks the roles adding WHERE my_role_id = ANY(roles) to the end of the query.

To make this work you will need to compile the permissions when they are changed. However this structure allows you to say "all rows that Bob has access to can now be accessed by Joe."

 CREATE TABLE foo (
    id serial not null unique,
    ....
    roles int[]
 );
 CREATE TABLE foo_roles (
    foo_id int not null references foo(id),
    role_id int not null references roles(id)
 );

 CREATE OR REPLACE FUNCTION rebuild_foo_roles(in_foo_id int) RETURNS int[] 
 LANGUAGE SQL AS $$

     UPDATE foo SET roles = (select array_agg(role_id) from foo_roles 
                               where foo_id = $1)
      WHERE role_id = $1
     RETURNING roles;
  $$;

Then you'd update the permissions by altering the data in foo_roles and then calling rebuild_foo_roles with the ids of the rows you want to modify.

Now one thing about this structure is that it makes lookups fast, but updates to the permissions pretty slow.

Update: A note about GIN indexes (per request).

Traditional database indexes are b-trees, which means you can search by comparing an expected value to branches of the index, and then traversing to leaves. This works well if searching either on the whole value (WHERE foo = 'my_fun_value') but it doesn't work well with more complex searches. GIN indexes (Generalized Index Method) allow for indexes to be used in a wider variety of searches, but they are slightly slower than b-tree indexes. GIN indexes can be used to ask questions like:

Is value x in array y? (our use case here)
Does circle A overlap with circle B?
Does line segment C cross line segment D?
Does appointment A overlap time-wise with appointment B?

In newer versions of PostgreSQL you can use these indexes to enforce constraints like "No appointments in any given room may overlap time-wise" but the main use case here is that it makes searching arrays to be fast.

In general a GIN index will give you much better performance than a b-tree or GIST index on a join. In this case your query against foo can be an index scan against the relation with no joins needed.

Best Answer

Related Solutions

PostgreSQL Sequential Scan instead of Index Scan Why

Postgresql – How to Model / Quickly test for set membership in SQL

Related Question