PostgreSQL ORDER BY – Does Using CASE .. END Make Sense?

dynamic-sql | functions | order-by | postgresql | postgresql-9.6

Queries like SELECT * FROM t ORDER BY case when _parameter='a' then column_a end, case when _parameter='b' then column_b end are possible, but: Is this a good practice?

It is common to use parameters in the WHERE part of queries, and to have some computed columns in the SELECT part, but not that common to parameterize the ORDER BY clause.

Let's say we have an application that lists pre-owned cars (à la Craigslist). The list of cars can be sorted by price or colour. We have one function that, given a certain number of parameters (say price range, colour, and sorting criterion), returns a set of records with the results.

To make it concrete, let's assume cars are all in the following table:

CREATE TABLE cars
(
  car_id serial NOT NULL PRIMARY KEY,  /* arbitrary anonymous key */
  make text NOT NULL,       /* unnormalized, for the sake of simplicity */
  model text NOT NULL,      /* unnormalized, for the sake of simplicity */
  year integer,             /* may be null, meaning unknown */
  euro_price numeric(12,2), /* may be null, meaning seller did not disclose */
  colour text               /* may be null, meaning unknown */
) ;

The table would have indices for most columns…

CREATE INDEX cars_colour_idx
  ON cars (colour);
CREATE INDEX cars_price_idx
  ON cars (euro_price);
/* etc. */

And we have some convenience enumerations:

CREATE TYPE car_sorting_criteria AS ENUM
   ('price',
    'colour');

… and some sample data

INSERT INTO cars (make, model, year, euro_price, colour)
VALUES 
    ('Ford',   'Mondeo',   1990,  2000.00, 'green'),
    ('Audi',   'A3',       2005,  2500.00, 'golden magenta'),
    ('Seat',   'Ibiza',    2012, 12500.00, 'dark blue'),
    ('Fiat',   'Punto',    2014,     NULL, 'yellow'),
    ('Fiat',   '500',      2010,  7500.00, 'blueish'),
    ('Toyota', 'Avensis',  NULL,  9500.00, 'brown'), 
    ('Lexus',  'CT200h',   2012, 12500.00, 'dark whitish'), 
    ('Lexus',  'NX300h',   2013, 22500.00, NULL) ;

The queries we are going to run are of this style:

SELECT
    make, model, year, euro_price, colour
FROM
    cars
WHERE
    euro_price between 7500 and 9500 
ORDER BY
    colour ;

We would like to have queries of this style in a function:

CREATE or REPLACE FUNCTION get_car_list
   (IN _colour    text, 
    IN _min_price numeric, 
    IN _max_price numeric, 
    IN _sorting_criterium car_sorting_criteria) 
RETURNS record AS
$BODY$
      SELECT
          make, model, year, euro_price, colour
      FROM
          cars
      WHERE
           euro_price between _min_price and _max_price
           AND colour = _colour
      ORDER BY
          CASE WHEN _sorting_criterium = 'colour' THEN
            colour
          END,
          CASE WHEN _sorting_criterium = 'price' THEN
            euro_price
          END 
$BODY$
LANGUAGE SQL ;

Instead of this approach, the SQL in this function could be generated dynamically (in PL/pgSQL) as a string, and then EXECUTEd.
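
For illustration, a minimal sketch of what such a dynamic variant could look like (the function name get_car_list_dyn and the explicit RETURNS TABLE column list are assumptions made for this sketch; format() with %I injects the sort column as an identifier, while USING passes the values):

CREATE or REPLACE FUNCTION get_car_list_dyn
   (IN _colour    text,
    IN _min_price numeric,
    IN _max_price numeric,
    IN _sorting_criterium car_sorting_criteria)
RETURNS TABLE (make text, model text, year int, euro_price numeric, colour text) AS
$BODY$
BEGIN
   /* build the statement as a string; only the ORDER BY column is interpolated,
      and only as a quoted identifier via %I */
   RETURN QUERY EXECUTE format(
      'SELECT c.make, c.model, c.year, c.euro_price, c.colour
       FROM   cars c
       WHERE  c.euro_price BETWEEN $1 AND $2
       AND    c.colour = $3
       ORDER  BY c.%I'
    , CASE WHEN _sorting_criterium = 'price' THEN 'euro_price' ELSE 'colour' END)
   USING _min_price, _max_price, _colour;
END
$BODY$
LANGUAGE plpgsql;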

We can sense a few limitations, advantages and disadvantages with either approach:

  1. Within a function it is difficult (if possible at all) to find out the query plan used for a given statement. Yet we tend to use functions precisely for queries we run often; one way to inspect the plan is sketched right after this list.
  2. Errors in static SQL are caught (mostly) when the function is compiled or when it is first called.
  3. Errors in dynamic SQL are only caught (mostly) at run time, after the function has been compiled and every execution path has been exercised (i.e.: the number of tests to perform on the function can be really high).
  4. A parametric query like the one shown will probably be less efficient than a dynamically generated one; on the other hand, a dynamic query has to be parsed / planned / optimized on every single execution (which could impact efficiency in the opposite direction).
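
Regarding point 1, one rough way to inspect the plan outside the function is to prepare an equivalent parameterized statement by hand and EXPLAIN it. This is only a sketch; the statement name car_list and the sample values are made up:

PREPARE car_list (text, numeric, numeric, car_sorting_criteria) AS
SELECT make, model, year, euro_price, colour
FROM   cars
WHERE  euro_price BETWEEN $2 AND $3
AND    colour = $1
ORDER  BY CASE WHEN $4 = 'colour' THEN colour     END
        , CASE WHEN $4 = 'price'  THEN euro_price END;

/* note: the first few executions may get custom plans;
   a generic plan is only considered after repeated executions */
EXPLAIN (ANALYZE) EXECUTE car_list ('green', 1000, 15000, 'price');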

Question:

How to "get the best of both worlds" (if possible)? [Efficiency + Compiler checks + Debugging easily + Optimizing easily]

NOTE: this is to be run on PostgreSQL 9.6.

Best Answer

General answer

First, I want to address ambiguity in the premise:

It is common to use parameters in the WHERE part of queries, and to have some computed columns in the SELECT part, but not that common to parameterize the ORDER BY clause.

Computed columns in the SELECT part are hardly ever relevant for the query plan or performance. But "in the WHERE part" is ambiguous.

It's common to parameterize values in the WHERE clause, which works for prepared statements. (And PL/pgSQL works with prepared statements internally.) A generic query plan often makes sense regardless of provided values. That is, unless tables have a very uneven data distribution, but since Postgres 9.2 PL/pgSQL re-plans queries a couple of times to test whether the generic plan seems good enough.

But it's not as common to parameterize whole predicates (including identifiers) in the WHERE clause, which is impossible with prepared statements to begin with. You need dynamic SQL with EXECUTE, or you assemble the query strings in the client.

Dynamic ORDER BY expressions are somewhere in between both. You can do it with a CASE expression, but that's very hard to optimize in general. Postgres might use indexes with a plain ORDER BY, but not with CASE expressions hiding the eventual sort order. The planner is smart, but not an AI. Depending on the rest of the query (ORDER BY may be relevant for the plan or not - it's relevant in your example), you may end up with a sub-optimal query plan all the time.
Plus, you add the minor cost of the CASE expression(s), and in your particular example also the cost of multiple useless ORDER BY columns.

Typically, dynamic SQL with EXECUTE is faster or much faster for this.
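
For comparison, this is roughly the statement the dynamic variant ends up executing when _sorting_criterium = 'price' (values hard-coded for illustration); the plain ORDER BY leaves the planner free to use a matching index, which the CASE version hides:

SELECT make, model, year, euro_price, colour
FROM   cars
WHERE  euro_price BETWEEN 7500 AND 9500
AND    colour = 'green'
ORDER  BY euro_price;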

Maintainability should not be a problem if you maintain a clear and readable code format in your function body.

Fix demo function

The function in the question is broken. The return type is defined to return an anonymous record:

RETURNS record AS

But the query actually returns a set of records; it would have to be:

RETURNS SETOF record AS

But that's still unhelpful. You'd have to provide a column definition list with every call. Your query returns columns of well-known types, so declare the return type accordingly! I am guessing here; use the actual data types of the returned columns / expressions:

RETURNS TABLE (make text, model text, year int, euro_price numeric, colour text) AS

I use the same column names for convenience. Columns in the RETURNS TABLE clause are effectively OUT parameters, visible in every SQL statement in the body (but not inside EXECUTE). So table-qualify columns in queries in the function body to avoid possible naming conflicts. The demo function would work like this:

CREATE or REPLACE FUNCTION get_car_list (
    _colour            text, 
    _min_price         numeric, 
    _max_price         numeric, 
    _sorting_criterium car_sorting_criteria) 
  RETURNS TABLE (make text, model text, year int, euro_price numeric, colour text) AS
$func$
      SELECT c.make, c.model, c.year, c.euro_price, c.colour
      FROM   cars c
      WHERE  c.euro_price BETWEEN _min_price AND _max_price
      AND    c.colour = _colour
      ORDER  BY CASE WHEN _sorting_criterium = 'colour' THEN c.colour     END
              , CASE WHEN _sorting_criterium = 'price'  THEN c.euro_price END;
$func$  LANGUAGE sql;
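
Call it like any other set-returning function (sample values made up):

SELECT * FROM get_car_list('green', 1000, 15000, 'price');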

Do not confuse the RETURNS key word in the function declaration with the PL/pgSQL RETURN command, like Evan did in his answer.

General difficulty of the example query

Predicates on some columns (even worse: range predicates) combined with ORDER BY on other columns: that's already hard to optimize. But you mentioned in a comment:

Actual result sets can be in the order of several 1.000 rows (and thus, sorted and paginated in smaller chunks server-side)

So you'll add LIMIT and OFFSET to these queries, returning the n "best" matches first. Or some smarter pagination technique:
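
For example, a rough keyset-pagination sketch, not taken from the answer: the hard-coded pair (9500.00, 42) stands for the sort value and car_id of the last row of the previous page, assuming sorting by price:

SELECT make, model, year, euro_price, colour
FROM   cars
WHERE  colour = 'green'
AND    euro_price BETWEEN 1000 AND 15000
AND    (euro_price, car_id) > (9500.00, 42)   -- last row of the previous page
ORDER  BY euro_price, car_id
LIMIT  20;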

You need a matching index to make this fast. I don't see how this could possibly work with CASE expressions in ORDER BY.
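
For the dynamically built, price-sorted variant, such a matching multicolumn index might look like this (only a sketch; equality column first, then the range / sort column, and the right index depends on the final shape of the query):

CREATE INDEX cars_colour_euro_price_idx ON cars (colour, euro_price);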

Consider: