Postgresql – In PostgreSQL, is there a type-safe first() aggregate function

aggregate, functions, greatest-n-per-group, postgresql

Full question re-write

I'm looking for a First() aggregate function.

Here I found something that almost works:

CREATE OR REPLACE FUNCTION public.first_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $1;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.first (
        sfunc    = public.first_agg,
        basetype = anyelement,
        stype    = anyelement
);

The problem is that when a varchar(n) column passes through the first() function, it is converted into plain varchar (without a length modifier). When I try to return the query from a function declared as RETURNS SETOF anyelement, I get the following error:

ERROR: structure of query does not match function result type
SQL state: 42804
Detail: Returned type character varying does not match expected type character varying(40) in column 2.
Context: PL/pgSQL function vsr_table_at_time(anyelement,timestamp without time zone) line 31 at RETURN QUERY

In the same wiki page there is a link to a C Version of the function that would replace the above. I don't know how to install it, but I wonder if this version could solve my problem.

Meanwhile, is there a way I can change the above function so that it returns exactly the same type as the input column?

Best Answer

`DISTINCT ON()`

Just as a side note, this is precisely what DISTINCT ON() does (not to be confused with DISTINCT). From the documentation:

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.

So if you were to write,

SELECT myFirstAgg(z)
FROM foo
GROUP BY x,y;

It's effectively

SELECT DISTINCT ON (x, y) z
FROM foo;
-- ORDER BY x, y, z;  (the ORDER BY list must start with the DISTINCT ON expressions)

In that it takes the first z. There are two important differences:

You can also select other columns at no cost of further aggregation.

SELECT DISTINCT ON (x, y) z, k, r, t, v
FROM foo;
-- ORDER BY x, y, z, k, r, t, v;  (again starting with the DISTINCT ON expressions)

Because there is no GROUP BY, you cannot use (real) aggregates with it.

CREATE TABLE foo AS
SELECT * FROM ( VALUES
  (1,2,3),
  (1,2,4),
  (1,2,5)
) AS t(x,y,z);

-- fails, as you should expect.
SELECT DISTINCT ON (x,y) z, sum(z)
FROM foo;

-- would not otherwise fail.
SELECT myFirstAgg(z), sum(z)
FROM foo
GROUP BY x,y;

Don't forget `ORDER BY`

Also, while I didn't emphasize it above, I will now:

Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.

Always use an ORDER BY with DISTINCT ON.
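
To make the effect of ORDER BY concrete, here is a minimal sketch. The table name foo_demo and its rows are made up for illustration; it only mirrors the foo(x, y, z) shape used above, with a second group added so the per-group behaviour is visible.

```sql
-- Hypothetical sample data (not from the original question).
CREATE TEMP TABLE foo_demo (x integer, y integer, z integer);
INSERT INTO foo_demo VALUES (1,2,5), (1,2,3), (1,2,4), (7,8,9);

-- The ORDER BY list must begin with the DISTINCT ON expressions; the
-- remaining sort keys decide which row counts as "first" in each group.
SELECT DISTINCT ON (x, y) x, y, z
FROM foo_demo
ORDER BY x, y, z;        -- keeps the row with the smallest z per (x, y)

SELECT DISTINCT ON (x, y) x, y, z
FROM foo_demo
ORDER BY x, y, z DESC;   -- keeps the row with the largest z per (x, y)
```

Without the ORDER BY, either query may return any z from each group.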

Using an Ordered-Set Aggregate Function

I imagine a lot of people are looking for a first_value-style ordered-set aggregate function. Just wanted to throw that out there. It would look like this, if the function existed:

SELECT x, y, first_value() WITHIN GROUP (ORDER BY z)
FROM foo
GROUP BY x, y;

But, alas, it doesn't. What you can do instead is this:

SELECT x, y, percentile_disc(0) WITHIN GROUP (ORDER BY z)
FROM foo
GROUP BY x, y;
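
As for the varchar(n) problem in the question, one possible workaround (my assumption, not something the answer above states) is to cast the aggregate result back to the declared type, since the result of a polymorphic function or aggregate generally does not carry the length modifier. The table people_demo and its columns are hypothetical.

```sql
-- Hypothetical table for illustration only.
CREATE TEMP TABLE people_demo (grp integer, name varchar(40));
INSERT INTO people_demo VALUES (1, 'bob'), (1, 'alice'), (2, 'carol');

-- percentile_disc(0) picks the first value in the WITHIN GROUP ordering.
-- The explicit cast gives the result column the type varchar(40) again,
-- which is what a RETURN QUERY into a varchar(40) column expects.
SELECT grp,
       (percentile_disc(0) WITHIN GROUP (ORDER BY name))::varchar(40) AS first_name
FROM people_demo
GROUP BY grp
ORDER BY grp;
```

The same cast could presumably be applied to a call of the first() aggregate from the question, for example first(name)::varchar(40).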

Related Solutions

Postgresql – Concatenation of setof type or setof record

The approach you're using is unnecessarily complex - and very inefficient. Instead of the first function use:

create or replace function compute_pair_id_value(id bigint, value integer)
    returns setof pair_id_value
as $$
SELECT $1, generate_series(0,$2);
$$                          
language sql;

or better, get rid of it entirely and write the whole operation like this:

-- Sample data creation:
CREATE TABLE my_obj(id bigint, obj_value integer);
insert into my_obj(id,obj_value) VALUES (1712437,2),(17000,5);

-- and the query:
SELECT id, generate_series(0,obj_value) FROM my_obj;

Resulting in:

regress=> SELECT id, generate_series(0,obj_value) FROM my_obj;
   id    | generate_series 
---------+-----------------
 1712437 |               0
 1712437 |               1
 1712437 |               2
   17000 |               0
   17000 |               1
   17000 |               2
   17000 |               3
   17000 |               4
   17000 |               5
(9 rows)

This exploits PostgreSQL's behaviour with set-returning functions called in the SELECT list. Once PostgreSQL 9.3 comes out it can be replaced with a standards-compliant LATERAL query.
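
For reference, a sketch of what the LATERAL form might look like, assuming PostgreSQL 9.3 or later. The table name my_obj_demo is hypothetical and simply repeats the sample data above so the block runs on its own.

```sql
-- Hypothetical copy of the sample data, so the block is self-contained.
CREATE TEMP TABLE my_obj_demo (id bigint, obj_value integer);
INSERT INTO my_obj_demo VALUES (1712437, 2), (17000, 5);

-- The set-returning function moves into FROM; LATERAL lets it refer to
-- columns of the preceding FROM item (o.obj_value).
SELECT o.id, g.value
FROM my_obj_demo AS o
CROSS JOIN LATERAL generate_series(0, o.obj_value) AS g(value);
```

This should yield the same nine id/value pairs as the SELECT-list version, although the row order is not guaranteed without an ORDER BY.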

Since it turns out your question was a simplified version of the real problem, let's tackle that. I'll work with the simplified compute_pair_id_value above to avoid the hassle of plpython3. Here's how to do what you want:

SELECT (compute_pair_id_value(id,obj_value)).* FROM my_obj;

Result:

regress=> SELECT (compute_pair_id_value(id,obj_value)).* FROM my_obj;
   id    | value 
---------+-------
 1712437 |     0
 1712437 |     1
 1712437 |     2
   17000 |     0
   17000 |     1
   17000 |     2
   17000 |     3
   17000 |     4
   17000 |     5
(9 rows)

but again, be warned that compute_pair_id_value will be called more than once. This is a limitation of PostgreSQL's query executor that can be avoided in 9.3 with LATERAL support, but as far as I know you're stuck with it in 9.2 and below. Observe:

create or replace function compute_pair_id_value(id bigint, value integer)
    returns setof pair_id_value
as $$
BEGIN
  RAISE NOTICE 'compute_pair_id_value(%,%)',id,value;
  RETURN QUERY SELECT $1, generate_series(0,$2);
END;
$$             
language plpgsql;

output:

regress=> SELECT (compute_pair_id_value(id,obj_value)).* FROM my_obj;
NOTICE:  compute_pair_id_value(1712437,2)
NOTICE:  compute_pair_id_value(1712437,2)
NOTICE:  compute_pair_id_value(17000,5)
NOTICE:  compute_pair_id_value(17000,5)
   id    | value 
---------+-------
 1712437 |     0
 1712437 |     1
 1712437 |     2
   17000 |     0
   17000 |     1
   17000 |     2
   17000 |     3
   17000 |     4
   17000 |     5
(9 rows)

See how compute_pair_id_value is called once per output column?

There is a workaround: add another layer of subquery to unpack the composite type result. See:

regress=> SELECT (val).* FROM (SELECT compute_pair_id_value(id,obj_value) FROM my_obj) x(val);
NOTICE:  compute_pair_id_value(1712437,2)
NOTICE:  compute_pair_id_value(17000,5)
   id    | value 
---------+-------
 1712437 |     0
 1712437 |     1
 1712437 |     2
   17000 |     0
   17000 |     1
   17000 |     2
   17000 |     3
   17000 |     4
   17000 |     5
(9 rows)

You can use the same technique in your code if you really must LOOP over the results (it's slow to do that, so avoid it if you can).

Postgresql – Error: set-valued function called in context that cannot accept a set. What is it about?

The error message isn't very helpful:

regress=> SELECT * FROM  compute_all_pair_by_craig(100);
ERROR:  a column definition list is required for functions returning "record"
LINE 1: SELECT * FROM  compute_all_pair_by_craig(100);

but if you rephrase the query to call it as a proper set-returning function you'll see the real problem:

regress=> SELECT * FROM compute_all_pair_by_craig(100);
ERROR:  a column definition list is required for functions returning "record"
LINE 1: SELECT * FROM compute_all_pair_by_craig(100);

If you're using SETOF RECORD without an OUT parameter list, you must specify the result structure in the calling statement, e.g.:

regress=> SELECT * FROM compute_all_pair_by_craig(100) theresult(a integer, b integer);

However, it's much better to use RETURNS TABLE or OUT parameters. With the former syntax your function would be:

create or replace function compute_all_pair_by_craig(id_obj bigint)
    returns table(a integer, b integer) as $$
begin
    return query select o.id, generate_series(0,o.value) from m_obj as o;     
end;
$$ language plpgsql;

This is callable in SELECT-list context and can be used without creating a type explicitly or specifying the result structure at the call site.
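
For comparison, here is a minimal sketch of the OUT-parameter form mentioned above. The function name pairs_with_out and its body are hypothetical and unrelated to the asker's tables.

```sql
-- Hypothetical function: the OUT parameters define the result columns,
-- so no column definition list is needed at the call site.
CREATE OR REPLACE FUNCTION pairs_with_out(OUT a integer, OUT b integer)
RETURNS SETOF record AS $$
    SELECT x, x * 10 FROM generate_series(1, 3) AS x;
$$ LANGUAGE sql;

-- Called as a table, the result has the columns a and b.
SELECT * FROM pairs_with_out();
```

RETURNS TABLE(a integer, b integer) is essentially shorthand for the same thing, so the choice between the two is mostly a matter of taste.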

As for the second half of the question, what's happening is that the first case specifies two separate columns in a SELECT-list, whereas the second returns a single composite. It actually has nothing to do with how you're returning the result, but with how you're invoking the function. If we create the sample function:

CREATE OR REPLACE FUNCTION twocols() RETURNS TABLE(a integer, b integer) 
AS $$ SELECT x, x FROM generate_series(1,5) x; $$ LANGUAGE sql;

You'll see the difference between the two ways to call a set-returning function. In the SELECT list, a PostgreSQL-specific non-standard extension with quirky behaviour:

regress=> SELECT twocols();
 twocols 
---------
 (1,1)
 (2,2)
 (3,3)
 (4,4)
 (5,5)
(5 rows)

or as a table in the more standard way:

regress=> SELECT * FROM twocols();
 a | b 
---+---
 1 | 1
 2 | 2
 3 | 3
 4 | 4
 5 | 5
(5 rows)