Postgresql – Use array_position() function to get the most frequent value for column from pg_stats

arraydatatypesperformancepostgresqlpostgresql-performance

I'm trying to execute this simple query in order to check whether some value (1000) belongs to MCV list used by Postgres query optimizer:

SELECT array_position(most_common_vals, 1000) 
FROM pg_stats 
WHERE tablename = 'tenk1' 
AND attname = 'unique1';

But receive the following error message:

ERROR:  function array_position(anyarray, integer) does not exist

How can it be fixed?

array_position() is the standard function as described here and the following statement returns 2 as expected:

SELECT array_position('{1,2,3}', 2);

Best Answer

Solution

Assuming the data type of your column tenk1.unique1 is integer:

SELECT array_position(most_common_vals::text::int[], 1000) 
FROM   pg_stats 
WHERE  tablename = 'tenk1' 
AND    attname = 'unique1';

Use your actual column type and the corresponding array type.
You get the position, or NULL if the value is not in the MCV list.

The solution is short - unlike the ...

Explanation

The function array_position() is defined to take (anyarray, anyelement) (or (anyarray, anyelement, integer) for the second variant).

The column pg_stats.most_common_vals has the polymorphic data type anyarray to be able to hold arrays of any data type - for obvious reasons.

anyarray and anyelement are not allowed as data types for user-created tables. For users, both are polymorphic pseudo-types. (But Postgres can use them in system tables.)

Multiple polymorphic variables in the same function have to resolve to the same (or corresponding) data types. The manual:

Furthermore, if there are positions declared anyarray and others declared anyelement, the actual array type in the anyarray positions must be an array whose elements are the same type appearing in the anyelement positions.

And:

Thus, when more than one argument position is declared with a polymorphic type, the net effect is that only certain combinations of actual argument types are allowed.

Bold emphasis mine.

You have found a corner case where function type resolution fails for the combination of the polymorphic anyarray with an integer - or any non-polymorphic type in second position.

1000 in your expression array_position(most_common_vals, 1000) is a numeric constant resolving to integer. These would fail in similar fashion:

array_position(most_common_vals, '1000')  -- untyped string literal

ERROR:  function array_position(anyarray, unknown) does not exist

array_position(most_common_vals, '1000'::text)

ERROR:  function array_position(anyarray, text) does not exist

Furthermore, there are no casts defined for anyarray, being a pseudo-type in user-land:

SELECT * FROM pg_cast WHERE castsource = 'anyarray'::regtype;  -- nothing found

The workaround is to cast to text as stepping stone, since any type can be cast to text. Then cast to integer[], arriving at the solution above.

In closing, I think this is a shortcoming in function type resolution that could be resolved (easily?). But since the data type anyarray is not supposed to be used like this in user-land to begin with I doubt that any developer will spend time on it ...

Related Solutions

Postgresql – Concatenation of setof type or setof record

The approach you're using is unnecessarily complex - and very inefficient. Instead of the first function use:

create or replace function compute_pair_id_value(id bigint, value integer)
    returns setof pair_id_value
as $$
SELECT $1, generate_series(0,$2);
$$                          
language sql;

or better, get rid of it entirely and write the whole operation like this:

-- Sample data creation:
CREATE TABLE my_obj(id bigint, obj_value integer);
insert into my_obj(id,obj_value) VALUES (1712437,2),(17000,5);

-- and the query:
SELECT id, generate_series(0,obj_value) FROM my_obj;

Resulting in:

regress=> SELECT id, generate_series(0,obj_value) FROM my_obj;
   id    | generate_series 
---------+-----------------
 1712437 |               0
 1712437 |               1
 1712437 |               2
   17000 |               0
   17000 |               1
   17000 |               2
   17000 |               3
   17000 |               4
   17000 |               5
(9 rows)

This exploits PostgreSQL's behaviour with set-returning functions called in the SELECT list. Once PostgreSQL 9.3 comes out it can be replaced with a standards-compliant LATERAL query.

Since it turns out your question was a simplified version of the real problem, let's tackle that. I'll work with the simplified compute_pair_id_value above to avoid the hassle of plpython3. Here's how to do what you want:

SELECT (compute_pair_id_value(id,obj_value)).* FROM my_obj;

Result:

regress=> SELECT (compute_pair_id_value(id,obj_value)).* FROM my_obj;
   id    | value 
---------+-------
 1712437 |     0
 1712437 |     1
 1712437 |     2
   17000 |     0
   17000 |     1
   17000 |     2
   17000 |     3
   17000 |     4
   17000 |     5
(9 rows)

but again, be warned that compute_pair_id_value will be called more than once. This is a limitation of PostgreSQL's query executor that can be avoided in 9.3 with LATERAL support, but as far as I know you're stuck with it in 9.2 and below. Observe:

create or replace function compute_pair_id_value(id bigint, value integer)
    returns setof pair_id_value
as $$
BEGIN
  RAISE NOTICE 'compute_pair_id_value(%,%)',id,value;
  RETURN QUERY SELECT $1, generate_series(0,$2);
END;
$$             
language plpgsql;

output:

regress=> SELECT (compute_pair_id_value(id,obj_value)).* FROM my_obj;
NOTICE:  compute_pair_id_value(1712437,2)
NOTICE:  compute_pair_id_value(1712437,2)
NOTICE:  compute_pair_id_value(17000,5)
NOTICE:  compute_pair_id_value(17000,5)
   id    | value 
---------+-------
 1712437 |     0
 1712437 |     1
 1712437 |     2
   17000 |     0
   17000 |     1
   17000 |     2
   17000 |     3
   17000 |     4
   17000 |     5
(9 rows)

See how compute_pair_id_value is called once per output column?

There is a workaround: Another layer of subquery to unpack the composite type result. See:

regress=> SELECT (val).* FROM (SELECT compute_pair_id_value(id,obj_value) FROM my_obj) x(val);
NOTICE:  compute_pair_id_value(1712437,2)
NOTICE:  compute_pair_id_value(17000,5)
   id    | value 
---------+-------
 1712437 |     0
 1712437 |     1
 1712437 |     2
   17000 |     0
   17000 |     1
   17000 |     2
   17000 |     3
   17000 |     4
   17000 |     5
(9 rows)

You can use the same technique in your code if you really must LOOP over the results (it's slow to do that, so avoid it if you can).

Postgresql – Query the definition of a materialized view in Postgres

Looks like 9.3 and up you can do:

select * from pg_matviews;
select * from pg_matviews where matviewname = 'view_name';

More info found here: https://stackoverflow.com/questions/29297296/postgres-see-query-used-to-create-materialized-view