Postgresql – Why colname as input param of function = not good idea

dynamic-sqlplpgsqlpostgresqlpostgresql-9.4

Someone told me that using text input for column names and formatting it, like I do below, is rarely a good idea. When I asked why, however, an answer wasn't given. That was on the postgresql IRC, and those guys seem to know their stuff. So I'd like to know why is it not advised ? I'm mostly wondering if it opens the door for sql injection.

create or replace function getItemsOrderBy(order_by_p text)
RETURNS TABLE (id int) AS $$


BEGIN

    return query EXECUTE format('
    SELECT id
    FROM items 
    ORDER BY %s', order_by_p) ;

END;

He also said to use execute with using instead, so what's the difference between this:

return query EXECUTE format('
SELECT id
FROM items 
ORDER BY %s', order_by_p) ;

and this :

return query EXECUTE '
SELECT id
FROM items 
ORDER BY $1' USING order_by_p ;

My function is more complex than what is above – the format part is only part of it. I have a choice to either create one function that can deal with multiple cases (for ordering) or create a bunch of them to deal with every ordering. I felt like doing only one was more practical. Having no function at all isn't an option.

I am actually using pg-promise but I was under the impression that since I'm doing a lot of back and forth between the back end and the DB (send something, wait response, compute something else, send again..) I should go with function and let everything happen all at once.

Best Answer

There are two questions here,

  1. What is the difference between EXECUTE .. USING and EXECUTE FORMAT()
  2. Why is wrapping or generating simple SQL statements in a procedural function a bad idea?

The difference between EXECUTE .. USING and EXECUTE FORMAT()

From the docs,

The command string can use parameter values, which are referenced in the command as $1, $2, etc. These symbols refer to values supplied in the USING clause. This method is often preferable to inserting data values into the command string as text: it avoids run-time overhead of converting the values to text and back, and it is much less prone to SQL-injection attacks since there is no need for quoting or escaping. An example is:

EXECUTE 'SELECT count(*) FROM mytable WHERE inserted_by = $1 AND inserted <= $2'
  INTO c
  USING checked_user, checked_date;

Note that parameter symbols can only be used for data values — if you want to use dynamically determined table or column names, you must insert them into the command string textually. For example, if the preceding query needed to be done against a dynamically selected table, you could do this:

EXECUTE 'SELECT count(*) FROM '
    || quote_ident(tabname)
    || ' WHERE inserted_by = $1 AND inserted <= $2'
   INTO c
   USING checked_user, checked_date;

A cleaner approach is to use format()'s %I specification for table or column names (strings separated by a newline are concatenated):

EXECUTE format('SELECT count(*) FROM %I '
   'WHERE inserted_by = $1 AND inserted <= $2', tabname)
   INTO c
   USING checked_user, checked_date;

An EXECUTE with a simple constant command string and some USING parameters, as in the first example above, is functionally equivalent to just writing the command directly in PL/pgSQL and allowing replacement of PL/pgSQL variables to happen automatically. The important difference is that EXECUTE will re-plan the command on each execution, generating a plan that is specific to the current parameter values; whereas PL/pgSQL may otherwise create a generic plan and cache it for re-use. In situations where the best plan depends strongly on the parameter values, it can be helpful to use EXECUTE to positively ensure that a generic plan is not selected.

So you have a few arguments here.

  1. You can use both EXECUTE FORMAT() ... USING
  2. USING allows the plan to be cached.
  3. USING allows symbols to stay symbols and stops them from having to be converted to text and re-escaped.
  4. USING can not be used with identifiers, only literals.

Wrapping and generating simple SQL statements in a procedural function is a bad idea.

As for the other part of the question,

So I'd like to know why is it not advised? (Wrapping simple SQL statements in plpgsql.)

There are a lot of reasons for that,

  • functions as such obscure the costs (plural) to the planner, and require the user to explicitly
    1. set the execution cost (or use a rather silly estimate)
    2. 9.6+ establish if it is parallel-safe safe/restricted/unsafe
    3. 9.6+ establish if it has side effects strict/immutable/volatile
  • they obscure the internals to the user.
  • they prevent predicate-pushdown.
  • they complicate permissions (now you need access to the function too).
  • they raise the barrier for maintenance, now you have to define the result set the function returns TABLE (id int)
  • they present all kinds of problems with ORMs.

And, it's not SQL. You're building a new language on top of a DBMS. Why?

As for the dynamic component, there are other ways to engineer around the problem. Take for instance the exact statement provided, the worst case scenario is where ever you see that

SELECT * FROM getItemsOrderBy($col);

You have to explicitly write out the order-by. As bad as it, it's a better solution in my opinion.

SELECT id FROM items ORDER BY col1
SELECT id FROM items ORDER BY col2
SELECT id FROM items ORDER BY col3

A step even further would be to use a library which provides some kind of assistance for generating , like pg-promise

let args = { orderBy: "col1" };
if ( args.orderBy !== 'col1' ) {
  throw new Error "invalid orderBy Column";
}
db.manyOrNone( 'SELECT id FROM items ORDER BY ${orderBy~}', args );

Or, DBIx::Abstract, or an ORM like DBIx::Class.