From what I can see you understand the problem perfectly. It sounds like MIN(notif_dttm)
is causing you grief. You can check this pretty easily by running code like this:
SELECT *
FROM inc_unit AS notif
WHERE notif_dttm IS NULL
  AND EXISTS (SELECT 1
              FROM inc_unit u
              WHERE notif.inci_no = u.inci_no
                AND u.resp_code = 'E'
                AND u.arv_dttm IS NOT NULL)
If you get any rows then that's probably your problem.
Here is some easy sample code that will reproduce the warning you are getting.
CREATE TABLE #test (nullableCol1 int)
INSERT INTO #test VALUES (NULL),(1),(2)
SELECT MIN(nullableCol1) FROM #test
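If the warning itself is what you want rid of, a couple of sketches for avoiding it (adjust to your actual schema):

```sql
-- Option 1: filter the NULLs out explicitly, so the aggregate never sees them
SELECT MIN(nullableCol1)
FROM #test
WHERE nullableCol1 IS NOT NULL;

-- Option 2: suppress ANSI warnings for the session
-- (note this also affects other ANSI warnings, e.g. string truncation)
SET ANSI_WARNINGS OFF;
SELECT MIN(nullableCol1) FROM #test;
SET ANSI_WARNINGS ON;
```

Option 1 is usually the safer choice, since it documents the intent right in the query instead of changing session behavior.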
As far as your problem goes, there is a property in SSIS that tells it how many errors are acceptable. It's called MaximumErrorCount. Increase that for the failing object and each of its containers. That way you can keep the warning (which isn't really a big deal) and still have your package move on correctly.
Edit: Additional ORDER BY information you wanted can be found here.
But basically SQL processes a query in this order:
- FROM
- ON
- OUTER
- WHERE
- GROUP BY
- CUBE | ROLLUP
- HAVING
- SELECT
- DISTINCT
- ORDER BY
- TOP
So if I take your subquery and place it in the order it's processed, it would look like this:
FROM inc_unit AS notif -- Find the table
WHERE notif.inci_no = u.inci_no -- Restrict the rows used
--AND notif.arv_dttm IS NOT NULL --
GROUP BY notif.inci_no -- Group up what's left
SELECT min(notif_dttm) -- Process any aggregates
You can see it's a pretty logical progression. We can't GROUP BY until we remove the appropriate rows using the WHERE clause. And then we can't check the MIN until we have finished grouping everything together.
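Putting that together, here is a sketch of one possible fix, assuming (based on your question's table and column names) that rows with a NULL notif_dttm should simply be ignored by the aggregate:

```sql
SELECT MIN(notif.notif_dttm)
FROM inc_unit AS notif
WHERE notif.inci_no = u.inci_no
  -- keep NULLs away from MIN() so the warning never fires
  AND notif.notif_dttm IS NOT NULL
GROUP BY notif.inci_no
```

Because the WHERE clause is processed before the aggregate, MIN() never sees a NULL and the "Null value is eliminated by an aggregate" warning goes away.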
I don't think indexes will help. I'm not certain why, so rather than guess, let's experiment...
First, we create a table with your structure and fill it with random data (10,000 rows, each with an array of 1,000 values).
CREATE TABLE t
(
id integer /* PRIMARY KEY */,
stuff integer,
large_array_of_values float[]
) ;
CREATE OR REPLACE FUNCTION random_vector() RETURNS float[] AS
$$
select
array_agg(random())
from
generate_series (1, 1000)
$$
LANGUAGE SQL ;
INSERT INTO t
(id, stuff, large_array_of_values)
SELECT
id, random()*10000, random_vector()
FROM
generate_series(1, 10000) AS i(id) ;
At this point we create one sample index for values at index [32] of the vector (plus the id!):
CREATE INDEX
idx_32 ON t(id, (large_array_of_values[32]));
Now, we ask PostgreSQL to analyze the following query and explain it:
EXPLAIN ANALYZE
SELECT
avg(large_array_of_values[32])
FROM
t
WHERE
id BETWEEN 5000 and 7500
AND (large_array_of_values[32]) > 0.32 ;
| QUERY PLAN |
| :---------------------------------------------------------------------------------------------------------------------- |
| Aggregate (cost=46.94..46.95 rows=1 width=8) (actual time=54.871..54.871 rows=1 loops=1) |
| -> Bitmap Heap Scan on t (cost=4.91..46.89 rows=17 width=32) (actual time=0.392..1.204 rows=1732 loops=1) |
| Recheck Cond: ((id >= 5000) AND (id <= 7500) AND (large_array_of_values[32] > '0.32'::double precision)) |
| Heap Blocks: exact=20 |
| -> Bitmap Index Scan on idx_32 (cost=0.00..4.91 rows=17 width=0) (actual time=0.364..0.364 rows=1732 loops=1) |
| Index Cond: ((id >= 5000) AND (id <= 7500) AND (large_array_of_values[32] > '0.32'::double precision)) |
| Planning time: 0.405 ms |
| Execution time: 55.013 ms |
dbfiddle here
The query plan is not using the index (idx_32) to perform an index-only scan, which I guess is what you wanted, even though idx_32 is a covering index for such a query.
VACUUMing, to make sure that the visibility map was up to date, didn't have any effect. I couldn't find any explicit restriction saying that index-only scans must refer to plain columns (and not indexes on expressions), but it appears that they aren't used in this expression-index case.
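One possible workaround (untested here, and assuming PostgreSQL 12 or later) is to materialize the array element as a stored generated column, so the planner sees a plain column rather than an expression:

```sql
-- value_32 is a hypothetical column name for this sketch
ALTER TABLE t
  ADD COLUMN value_32 float
  GENERATED ALWAYS AS (large_array_of_values[32]) STORED;

CREATE INDEX idx_32_gen ON t (id, value_32);

-- The query then references the plain column instead of the expression:
SELECT avg(value_32)
FROM t
WHERE id BETWEEN 5000 AND 7500
  AND value_32 > 0.32 ;
```

This trades extra storage (one float per row per materialized element) for making the index usable the same way as an ordinary column, as the comparison below with a non-array column suggests.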
Comparison with a non-vector column
CREATE TABLE t
(
id integer /* PRIMARY KEY */,
stuff integer,
a_value float
) ;
INSERT INTO t
(id, stuff, a_value)
SELECT
id, random()*10000, random()
FROM
generate_series(1, 10000) AS i(id) ;
CREATE INDEX idx_value ON t(id, a_value);
VACUUM (ANALYZE, VERBOSE) t ;
In this case, the covering index is actually used, and you get an "index-only scan".
EXPLAIN ANALYZE
SELECT
avg(a_value)
FROM
t
WHERE
id BETWEEN 5000 and 7500 AND (a_value > 0.1)
| QUERY PLAN |
| :---------------------------------------------------------------------------------------------------------------------------- |
| Aggregate (cost=103.67..103.69 rows=1 width=8) (actual time=1.139..1.140 rows=1 loops=1) |
| -> Index Only Scan using idx_value on t (cost=0.29..98.05 rows=2251 width=8) (actual time=0.026..0.655 rows=2247 loops=1) |
| Index Cond: ((id >= 5000) AND (id <= 7500) AND (a_value > '0.1'::double precision)) |
| Heap Fetches: 0 |
| Planning time: 0.184 ms |
| Execution time: 1.179 ms |
dbfiddle here
I can't speak from any experience outside SQL Server, but I'll assume that this applies to most/all SQL implementations:
COUNT(*) is counting rows. If there are zero rows to count, zero is the response.
Most other aggregate functions depend upon values contained in columns (MIN, MAX, SUM, etc.). If there are no values to aggregate (either because there are no rows in the dataset or because the column(s) involved contain only NULLs), NULL is the response.
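A quick illustration (T-SQL here, but the behaviour is standard SQL):

```sql
CREATE TABLE #empty (val int);

SELECT COUNT(*) AS row_count,  -- 0: COUNT(*) counts rows, and there are none
       MIN(val) AS min_val     -- NULL: there are no values to aggregate
FROM #empty;
```

If you need a non-NULL fallback from an aggregate, wrap it, e.g. `COALESCE(MIN(val), 0)`.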