Postgresql – Why colname as input param of function = not good idea

dynamic-sql plpgsql postgresql postgresql-9.4

Someone told me that taking a column name as text input and formatting it into the query, as I do below, is rarely a good idea. When I asked why, however, I didn't get an answer. That was on the PostgreSQL IRC channel, and those guys seem to know their stuff. So I'd like to know: why is it not advised? I'm mostly wondering whether it opens the door to SQL injection.

create or replace function getItemsOrderBy(order_by_p text)
RETURNS TABLE (id int) AS $$
BEGIN

    return query EXECUTE format('
    SELECT id
    FROM items
    ORDER BY %s', order_by_p) ;

END;
$$ LANGUAGE plpgsql;

He also said to use execute with using instead, so what's the difference between this:

return query EXECUTE format('
SELECT id
FROM items 
ORDER BY %s', order_by_p) ;

and this :

return query EXECUTE '
SELECT id
FROM items 
ORDER BY $1' USING order_by_p ;

My function is more complex than what is above – the format part is only part of it. I have a choice to either create one function that can deal with multiple cases (for ordering) or create a bunch of them to deal with every ordering. I felt like doing only one was more practical. Having no function at all isn't an option.

I am actually using pg-promise, but I was under the impression that, since I'm doing a lot of back and forth between the back end and the DB (send something, wait for the response, compute something else, send again...), I should go with a function and let everything happen all at once.

Best Answer

There are two questions here:

What is the difference between EXECUTE .. USING and EXECUTE FORMAT()?
Why is wrapping or generating simple SQL statements in a procedural function a bad idea?

The difference between `EXECUTE .. USING` and `EXECUTE FORMAT()`

From the docs,

The command string can use parameter values, which are referenced in the command as $1, $2, etc. These symbols refer to values supplied in the USING clause. This method is often preferable to inserting data values into the command string as text: it avoids run-time overhead of converting the values to text and back, and it is much less prone to SQL-injection attacks since there is no need for quoting or escaping. An example is:
EXECUTE 'SELECT count(*) FROM mytable WHERE inserted_by = $1 AND inserted <= $2'
  INTO c
  USING checked_user, checked_date;
Note that parameter symbols can only be used for data values — if you want to use dynamically determined table or column names, you must insert them into the command string textually. For example, if the preceding query needed to be done against a dynamically selected table, you could do this:
EXECUTE 'SELECT count(*) FROM '
    || quote_ident(tabname)
    || ' WHERE inserted_by = $1 AND inserted <= $2'
   INTO c
   USING checked_user, checked_date;
A cleaner approach is to use format()'s %I specification for table or column names (strings separated by a newline are concatenated):
EXECUTE format('SELECT count(*) FROM %I '
   'WHERE inserted_by = $1 AND inserted <= $2', tabname)
   INTO c
   USING checked_user, checked_date;
An EXECUTE with a simple constant command string and some USING parameters, as in the first example above, is functionally equivalent to just writing the command directly in PL/pgSQL and allowing replacement of PL/pgSQL variables to happen automatically. The important difference is that EXECUTE will re-plan the command on each execution, generating a plan that is specific to the current parameter values; whereas PL/pgSQL may otherwise create a generic plan and cache it for re-use. In situations where the best plan depends strongly on the parameter values, it can be helpful to use EXECUTE to positively ensure that a generic plan is not selected.

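So you have a few points here.

You can use both together: EXECUTE FORMAT() ... USING.
USING does not make the plan cacheable: as the quote above says, EXECUTE re-plans the command on each execution.
USING passes values as parameters, so they never have to be converted to text and escaped.
USING cannot be used for identifiers (table or column names), only for data values. That is why ORDER BY $1 does not do what you want: the parameter is treated as a value, not as a column name. For identifiers, use format() with %I.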
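The split between identifiers and data values can be sketched outside PL/pgSQL as well. The JavaScript below is only an illustration: `quoteIdent` and `buildCountQuery` are hypothetical helpers, not part of PostgreSQL or pg-promise. `quoteIdent` imitates what quote_ident() and format()'s %I do, with the simplification that it always adds quotes. The data values never enter the query text; they travel separately, like the USING arguments.

```javascript
// Hypothetical helper imitating quote_ident() / format('%I').
// Simplification: it always quotes, whereas PostgreSQL quotes only when needed.
function quoteIdent(name) {
  // Double any embedded double quote, then wrap the whole name in double quotes.
  return '"' + String(name).replace(/"/g, '""') + '"';
}

// Hypothetical helper: the table name is inserted textually (after quoting),
// while the data values stay out of the text and are referenced as $1, $2.
function buildCountQuery(tableName, checkedUser, checkedDate) {
  return {
    text: 'SELECT count(*) FROM ' + quoteIdent(tableName) +
          ' WHERE inserted_by = $1 AND inserted <= $2',
    values: [checkedUser, checkedDate],
  };
}

// A hostile "table name" ends up as one (nonexistent) quoted identifier.
const q = buildCountQuery('my"table; DROP TABLE items', 'alice', '2016-01-01');
console.log(q.text);
console.log(q.values);
```

With the unquoted %s from the question, the same hostile string would have been spliced into the statement as SQL.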

Wrapping and generating simple SQL statements in a procedural function is a bad idea.

As for the other part of the question,

So I'd like to know why is it not advised? (Wrapping simple SQL statements in plpgsql.)

There are a lot of reasons for that:

functions as such obscure the costs (plural) to the planner, and require the user to explicitly
1. set the execution cost (or accept a rather silly default estimate)
2. 9.6+: declare whether it is parallel safe/restricted/unsafe
3. declare its volatility (immutable/stable/volatile), i.e. whether it has side effects
they obscure the internals from the user.
they prevent predicate pushdown.
they complicate permissions (now you need access to the function too).
they raise the barrier for maintenance: now you have to define the result set the function returns, here TABLE (id int).
they present all kinds of problems with ORMs.

And, it's not SQL. You're building a new language on top of a DBMS. Why?

As for the dynamic component, there are other ways to engineer around the problem. Take, for instance, the exact statement provided. The worst-case scenario is that, wherever you see

SELECT * FROM getItemsOrderBy($col);

You have to explicitly write out the ORDER BY. As bad as it is, it's a better solution in my opinion.

SELECT id FROM items ORDER BY col1;
SELECT id FROM items ORDER BY col2;
SELECT id FROM items ORDER BY col3;

A step even further would be to use a library that provides some kind of assistance for generating SQL, like pg-promise:

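let args = { orderBy: "col1" };
// Reject anything that is not a known column before it reaches the query.
if ( args.orderBy !== 'col1' ) {
  throw new Error("invalid orderBy column");
}
// The ~ suffix makes pg-promise format the value as an SQL name (identifier).
db.manyOrNone( 'SELECT id FROM items ORDER BY ${orderBy~}', args );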

Or, DBIx::Abstract, or an ORM like DBIx::Class.
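If more than one ordering is allowed, the single-column check can be generalized to a whitelist. This is a minimal sketch, assuming the three columns col1, col2 and col3 from the statements above are the only valid choices; `ALLOWED_ORDER_COLUMNS` and `itemsOrderedBy` are hypothetical names introduced for illustration.

```javascript
// Assumption: these are the only columns clients may sort by.
const ALLOWED_ORDER_COLUMNS = new Set(['col1', 'col2', 'col3']);

// Hypothetical helper: returns the SQL text for a whitelisted ordering,
// and throws for any input that is not on the list.
function itemsOrderedBy(orderBy) {
  if (!ALLOWED_ORDER_COLUMNS.has(orderBy)) {
    throw new Error('invalid orderBy column: ' + orderBy);
  }
  // Concatenation is safe here because orderBy is one of the fixed strings above.
  return 'SELECT id FROM items ORDER BY ' + orderBy;
}

console.log(itemsOrderedBy('col2'));
```

The returned string could then be handed to db.manyOrNone() as in the snippet above. Because the column name only ever comes from the fixed list, no user-supplied text reaches the statement.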

Base query

First of all, your base query is needlessly complex and possibly incorrect. Consider instead:

(TABLE a EXCEPT ALL TABLE b)  -- ALL keeps duplicate rows and is faster
UNION ALL
(TABLE b EXCEPT ALL TABLE a);

I doubt you want to fold duplicates. Typically, tables do not hold complete duplicates to begin with, and omitting the ALL key word would instruct Postgres to try and fold duplicates in every step, which is just a waste of time. Details:

Select rows which are not present in other table

Even if completely duplicate rows are possible, it might be best to return all of them to avoid misleading answers. If you want to fold duplicates in the result anyway, a single UNION does the job:

(TABLE a EXCEPT ALL TABLE b)
UNION
(TABLE b EXCEPT ALL TABLE a);

However, a query with NATURAL FULL OUTER JOIN is more efficient, doing almost the same:

SELECT *
FROM   a NATURAL FULL OUTER JOIN b
WHERE  a IS NULL OR b IS NULL;

There are two subtle, exotic corner cases:

This returns rows that are all NULL to begin with. It would be an odd table design that allows such rows, though.
This does not return duplicate rows that are not matched with the same number of duplicate rows in the other table: a single match is enough to eliminate all - which is subtly different from folding duplicates in the result! Have a look at the fiddle below. Again, it would be an odd table design that allows completely duplicate rows.
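The second corner case can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions, not how Postgres evaluates either query: rows are modelled as plain strings, NULL handling is ignored, and `exceptAll`, `symmetricDiffAll` and `unmatchedByJoin` are hypothetical helpers written only for this comparison.

```javascript
// Models "TABLE a EXCEPT ALL TABLE b": each row of b cancels at most one equal row of a.
function exceptAll(a, b) {
  const remaining = new Map();
  for (const row of b) remaining.set(row, (remaining.get(row) || 0) + 1);
  const out = [];
  for (const row of a) {
    const n = remaining.get(row) || 0;
    if (n > 0) remaining.set(row, n - 1); // cancelled by one copy in b
    else out.push(row);                   // no copy left in b to cancel it
  }
  return out;
}

// Models (a EXCEPT ALL b) UNION ALL (b EXCEPT ALL a).
function symmetricDiffAll(a, b) {
  return [...exceptAll(a, b), ...exceptAll(b, a)];
}

// Models the FULL OUTER JOIN variant: a single match on the other side
// is enough to eliminate every copy of a row.
function unmatchedByJoin(a, b) {
  const inA = new Set(a);
  const inB = new Set(b);
  return [...a.filter(r => !inB.has(r)), ...b.filter(r => !inA.has(r))];
}

const a = ['x', 'x', 'y']; // 'x' is duplicated in a
const b = ['x'];           // but appears only once in b
console.log(symmetricDiffAll(a, b)); // [ 'x', 'y' ]
console.log(unmatchedByJoin(a, b));  // [ 'y' ]
```

In this model the surplus copy of 'x' survives the EXCEPT ALL variant but disappears from the join variant, which is the difference described above.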

Function

What you have so far does not work for several reasons.

To return actual table rows dynamically (and not just a count or a text representation), you need to use a polymorphic type.

Since the second table is bound to have a compatible row type (as per my assumption) it's enough to hand in just the table name for that.

CREATE OR REPLACE FUNCTION f_tbl_diff(_tbl1_type ANYELEMENT, _tbl2 text)
  RETURNS SETOF ANYELEMENT AS
$func$
BEGIN
   RETURN QUERY EXECUTE format('
      SELECT *
      FROM   %1$s NATURAL FULL OUTER JOIN %2$I
      WHERE  %1$s IS NULL OR %2$I IS NULL'
    , pg_typeof(_tbl1_type), _tbl2);
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_tbl_diff(NULL::a, 'b');

Note the special syntax for the first argument! We hand in an actual row defining the return type, not just a table name. Related answer on SO with ample details (scroll to the last chapter):

Refactor a PL/pgSQL function to return the output of various SELECT queries

This returns all rows that have no exact, complete match in the other table.

SQL Fiddle also demonstrating exotic corner cases.

About the dynamic SQL:

SQL injection in Postgres functions vs prepared queries
