Postgresql – Should I write a scalar function or a setof returning function for an INSERT/UPDATE/DELETE which can return at most one row

functions plpgsql postgresql set-returning-functions

I want to embed certain INSERT, UPDATE or DELETE statements in functions to improve reusability. What these functions have in common is that each manipulates a single row (if it exists).

An example is the following function:

create function my_table_delete(_identifier int) returns setof int as
$BODY$
begin
    return query
    delete from    my_table
    where          identifier = _identifier
    returning      identifier;
end
$BODY$
language plpgsql volatile strict;

This function accepts an identifier, deletes the corresponding row, and returns the same identifier if the delete succeeded.

I am quite unsure if I should implement this as a set-returning function (like I did above) or a scalar function.

My reasoning for preferring set-returning function is the following:

1. In case of a scalar function, we will have a NULL return value in case the row could not be found (while we will have 0 rows in case of the set-returning function). Having to deal with NULLs is a pain.
2. I'm working with sets (SQL is all about sets), so it makes sense to just keep doing that.
3. Also, I prefer returning the identifier (1 row: successful, 0 rows: not successful) instead of a boolean (true: successful, false: not successful), as it would allow constructs such as:

with d(id) as
(
    select * from my_table_delete(1234)
)
insert into log_table(id, action)
select      d.id, 'DELETE'
from        d;

If I were to return a boolean, the statement above could become less elegant.

I'd like to know whether those arguments make sense. I'm quite new to embedding logic in the database, so I'm not sure I am approaching this correctly.

What do you think?

On a slightly related note, I also wonder if I should define the function as follows:

create function my_table_delete(_identifier int)
    returns table (identifier int)
as ...

This gives the output row's column a name, which is even more convenient to use.

Best Answer

First, let's assert that manipulating a single row is positively guaranteed by a UNIQUE or PRIMARY KEY constraint on my_table.identifier. Else it should be a table function to begin with.

Next: Should you write a function for such a simple DELETE at all? Or use the plain DELETE instead? Barely longer and less overhead. (Maybe your actual function is more sophisticated ...)
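For comparison, here is a minimal sketch of the plain statement without any function, feeding the log straight from a data-modifying CTE. The table definitions are assumptions for illustration (the question does not show them); only the names my_table, identifier, log_table, id and action come from the question.

```sql
-- Hypothetical schema; column types are assumed, not taken from the question.
CREATE TABLE IF NOT EXISTS my_table  (identifier int PRIMARY KEY);
CREATE TABLE IF NOT EXISTS log_table (id int, action text);

INSERT INTO my_table VALUES (1234) ON CONFLICT DO NOTHING;

-- Delete the row and log the deletion in a single statement.
-- If no row matches, d is empty and nothing is logged.
WITH d AS (
   DELETE FROM my_table
   WHERE  identifier = 1234
   RETURNING identifier
)
INSERT INTO log_table (id, action)
SELECT d.identifier, 'DELETE'
FROM   d;
```

Whether a wrapper function is still worth it then depends mostly on how often the statement is repeated across the code base.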

Then, while your function is that simple, a plain SQL function might serve:

CREATE FUNCTION my_table_delete(_identifier int)
  RETURNS TABLE(identifier int) AS
$func$
   DELETE FROM my_table t
   WHERE  t.identifier = _identifier
   RETURNING t.identifier;
$func$  LANGUAGE sql VOLATILE STRICT;

This also uses RETURNS TABLE, as you pondered. Note the table-qualified column name to avoid conflicts. All function parameters (including columns of a RETURNS TABLE clause) are visible in SQL commands in the function body (in PL/pgSQL as well as in SQL functions). You need to avoid conflicts between column names and parameters.
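To illustrate the naming issue, here is a sketch of the same function in PL/pgSQL with RETURNS TABLE. The function name my_table_delete_tbl and the table definition are hypothetical, chosen so the example can run next to the other variants. With the default plpgsql.variable_conflict setting, an unqualified reference to identifier inside the DELETE would be rejected as ambiguous when the function runs, because it could mean either the table column or the output column.

```sql
-- Hypothetical schema; the column type is assumed.
CREATE TABLE IF NOT EXISTS my_table (identifier int PRIMARY KEY);

CREATE FUNCTION my_table_delete_tbl(_identifier int)
  RETURNS TABLE (identifier int)
  LANGUAGE plpgsql VOLATILE STRICT AS
$func$
BEGIN
   RETURN QUERY
   DELETE FROM my_table t
   WHERE  t.identifier = _identifier   -- "t." picks the table column, not the output column
   RETURNING t.identifier;
END
$func$;

-- The output column can now be referenced by name:
SELECT d.identifier FROM my_table_delete_tbl(1234) AS d;
```

The leading underscore on the input parameter serves the same purpose: it keeps _identifier from colliding with the column name.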

An SQL function has less overhead. But if you call the function many times in the same session, the PL/pgSQL variant may actually be faster, since it treats SQL statements like prepared statements: the query plan is saved and reused if some preconditions are met. Then again, much the same effect can be achieved with prepared statements.
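A minimal sketch of the prepared-statement route for the same DELETE. The statement name my_table_delete_stmt and the table definition are assumptions for illustration.

```sql
-- Hypothetical schema; the column type is assumed.
CREATE TABLE IF NOT EXISTS my_table (identifier int PRIMARY KEY);
INSERT INTO my_table VALUES (1234) ON CONFLICT DO NOTHING;

-- Parsed and analyzed once per session; later EXECUTE calls can reuse a saved plan.
PREPARE my_table_delete_stmt(int) AS
   DELETE FROM my_table
   WHERE  identifier = $1
   RETURNING identifier;

EXECUTE my_table_delete_stmt(1234);  -- one row: 1234
EXECUTE my_table_delete_stmt(1234);  -- zero rows: the row is already gone

DEALLOCATE my_table_delete_stmt;
```

A prepared statement lives only in the session that created it, so unlike a function it has to be set up again after every reconnect.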

As to your reasoning:

1. If you don't want to deal with NULL, then the case is settled. Make it a table function. But you may have to deal with NULL after all: Various query constructs (have to) convert "no row" to NULL.

2. sounds like an argument, but is just a logical fallacy. Does not carry weight.

3. If it's your personal preference, then the case is settled again. Can't argue with that. But a data-modifying CTE would work with the scalar function just as well:

WITH d(id) AS (
   SELECT my_table_delete(1234)
   )
INSERT INTO log_table(id, action)
SELECT  d.id, 'DELETE'
FROM    d
WHERE   d.id IS NOT NULL;

In summary: go with RETURNS TABLE, but it hardly matters for the given case.
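On point 1, a short sketch of how "no row" can still turn into NULL even with a set-returning function. The function name my_table_delete_set and the table definition are hypothetical, and the example assumes no row with identifier 9999 exists.

```sql
-- Hypothetical schema; the column type is assumed.
CREATE TABLE IF NOT EXISTS my_table (identifier int PRIMARY KEY);
DELETE FROM my_table WHERE identifier = 9999;   -- make sure the row is absent

CREATE FUNCTION my_table_delete_set(_identifier int)
  RETURNS SETOF int
  LANGUAGE sql VOLATILE STRICT AS
$func$
   DELETE FROM my_table t
   WHERE  t.identifier = _identifier
   RETURNING t.identifier;
$func$;

-- In the FROM list the function yields zero rows:
SELECT d.id FROM my_table_delete_set(9999) AS d(id);

-- Wrapped in a scalar subquery, "no row" becomes one row holding NULL:
SELECT (SELECT d.id FROM my_table_delete_set(9999) AS d(id)) AS deleted_id;

-- An outer join pads the missing row with NULL as well:
SELECT v.id, d.id AS deleted_id
FROM   (VALUES (9999)) AS v(id)
LEFT   JOIN LATERAL my_table_delete_set(v.id) AS d(id) ON true;
```

So a table function avoids NULL only as long as it is called in the FROM list of an inner query.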

Related Solutions

Postgresql – In PostgreSQL, is there a type-safe first() aggregate function

`DISTINCT ON()`

Just as a side note, this is precisely what DISTINCT ON() does (not to be confused with DISTINCT):

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.

So if you were to write,

SELECT myFirstAgg(z)
FROM foo
GROUP BY x,y;

It's effectively

SELECT DISTINCT ON(x,y) z
FROM foo;
-- ORDER BY x, y, z;

in that it takes the first z. There are two important differences:

You can also select other columns at no cost of further aggregation:

SELECT DISTINCT ON(x,y) z, k, r, t, v
FROM foo;
-- ORDER BY x, y, z, k, r, t, v;

Because there is no GROUP BY, you cannot use (real) aggregates with it:

CREATE TABLE foo AS
SELECT * FROM ( VALUES
  (1,2,3),
  (1,2,4),
  (1,2,5)
) AS t(x,y,z);

-- fails, as you should expect.
SELECT DISTINCT ON (x,y) z, sum(z)
FROM foo;

-- would not otherwise fail.
SELECT myFirstAgg(z), sum(z)
FROM foo
GROUP BY x,y;

Don't forget `ORDER BY`

Also, while I didn't bold it then, I will now:

Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.

Always use an ORDER BY with DISTINCT ON.
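A small sketch of what that looks like in practice. The table foo_demo is a hypothetical stand-in for foo, with its rows deliberately stored out of order. The leading ORDER BY expressions must match the DISTINCT ON expressions; whatever follows them decides which row counts as "first" within each group.

```sql
CREATE TEMP TABLE foo_demo AS
SELECT * FROM ( VALUES
  (1,2,5),
  (1,2,3),
  (1,2,4)
) AS t(x,y,z);

-- Smallest z per (x, y):
SELECT DISTINCT ON (x, y) x, y, z
FROM   foo_demo
ORDER  BY x, y, z;        -- returns (1, 2, 3)

-- Largest z per (x, y):
SELECT DISTINCT ON (x, y) x, y, z
FROM   foo_demo
ORDER  BY x, y, z DESC;   -- returns (1, 2, 5)
```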

Using an Ordered-Set Aggregate Function

I imagine a lot of people are looking for a first_value ordered-set aggregate function. Just wanted to throw that out there. It would look like this, if the function existed:

SELECT x, y, first_value() WITHIN GROUP (ORDER BY z)
FROM foo
GROUP BY x, y;

But, alas, it does not exist. You can do this instead:

SELECT x, y, percentile_disc(0) WITHIN GROUP (ORDER BY z)
FROM foo
GROUP BY x, y;

Sql-server – Why would call to scalar function inside a Table Value Function be slower than outside the TVF

Scalar functions are called once-per-row, when called as part of a query.

Consider the following example.

Create a new, blank database for our tests:

USE master;
IF EXISTS (SELECT 1 FROM sys.databases d WHERE d.name = 'mv')
BEGIN
    ALTER DATABASE mv SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE mv;
END
GO
CREATE DATABASE mv;
GO

Create a table, a multi-statement scalar function, and an inline table-valued function:

USE mv;
GO
CREATE TABLE dbo.t
(
    t_id int NOT NULL
        CONSTRAINT PK_t
        PRIMARY KEY CLUSTERED
);
GO

CREATE FUNCTION dbo.t_func
(
    @t_id int
)
RETURNS bit
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @r bit;
    IF EXISTS (SELECT 1 FROM dbo.t WHERE t.t_id = @t_id)
        SET @r = 1
    ELSE
        SET @r = 0;
    RETURN @r;
END
GO

CREATE FUNCTION dbo.t_tvf
(
    @min_t_id int
    , @max_t_id int
)
RETURNS TABLE 
WITH SCHEMABINDING
AS
RETURN (
    SELECT t_id = t.t_id
        , e = dbo.t_func(dbo.t.t_id)
    FROM dbo.t
    WHERE t.t_id >= @min_t_id
        AND t.t_id <= @max_t_id
);
GO

Insert some sample data into the table:

INSERT INTO dbo.t (t_id)
SELECT ROW_NUMBER() OVER (ORDER BY c.id, c.colid)
FROM sys.syscolumns c;
GO

Create a table to store function execution stats, and populate it with a start-row showing execution counts for the multi-statement-function, t_func:

CREATE TABLE dbo.function_stats
(
    run_num int NOT NULL
    , object_name sysname NOT NULL
    , execution_count int NULL 
    , CONSTRAINT PK_function_stats
        PRIMARY KEY CLUSTERED (run_num, object_name)
);
GO
INSERT INTO dbo.function_stats (run_num, object_name, execution_count)
SELECT 1
    , o.name
    , COALESCE(fs.execution_count, 0)
FROM sys.objects o 
    LEFT JOIN sys.dm_exec_function_stats fs ON fs.object_id = o.object_id
WHERE o.name = 't_func';
GO

Run a query against the TVF:
```
SELECT t.*
FROM dbo.t_tvf(1, 2) t;
GO
```

Capture the execution stats now:

INSERT INTO dbo.function_stats (run_num, object_name, execution_count)
SELECT 2
    , o.name
    , COALESCE(fs.execution_count, 0)
FROM sys.objects o 
    LEFT JOIN sys.dm_exec_function_stats fs ON fs.object_id = o.object_id
WHERE o.name = 't_func';

The function stats results:

SELECT *
FROM dbo.function_stats fs
ORDER BY fs.run_num
    , fs.object_name;

╔═════════╦═════════════╦═════════════════╗
║ run_num ║ object_name ║ execution_count ║
╠═════════╬═════════════╬═════════════════╣
║       1 ║ t_func      ║               0 ║
║       2 ║ t_func      ║               2 ║
╚═════════╩═════════════╩═════════════════╝

As you can see, the multi-statement function has executed twice, once per row of the source table accessed by the TVF.

I expect the multi-statement function is being called many, many times by the TVF, giving the impression that it is running slowly, whereas in fact it is simply being called many times.
