From what I can see you understand the problem perfectly. It sounds like MIN(notif_dttm)
is causing you grief. You can check this pretty easily by running code like this:
SELECT *
FROM inc_unit AS notif
WHERE notif_dttm IS NULL
  AND EXISTS (SELECT 1
              FROM inc_unit u
              WHERE notif.inci_no = u.inci_no
                AND u.resp_code = 'E'
                AND u.arv_dttm IS NOT NULL)
If you get any rows then that's probably your problem.
Here is some easy sample code that will reproduce the warning you are getting.
CREATE TABLE #test (nullableCol1 int)
INSERT INTO #test VALUES (NULL),(1),(2)
SELECT MIN(nullableCol1) FROM #test
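If the warning itself is what you want rid of, a couple of sketches for avoiding it (adjust to your actual schema):

```sql
-- Option 1: filter the NULLs out explicitly, so the aggregate never sees them
SELECT MIN(nullableCol1)
FROM #test
WHERE nullableCol1 IS NOT NULL;

-- Option 2: suppress ANSI warnings for the session
-- (note this also affects other ANSI warnings, e.g. string truncation)
SET ANSI_WARNINGS OFF;
SELECT MIN(nullableCol1) FROM #test;
SET ANSI_WARNINGS ON;
```

Option 1 is usually the safer choice, since it documents the intent right in the query instead of changing session behavior.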
As far as your problem goes, there is a property in SSIS that tells it how many errors are acceptable. It's called MaximumErrorCount. Increase that for the failing object and each of its containers. That way you can keep the warning (which isn't really a big deal) and still have your package move on correctly.
Edit: Additional ORDER BY information you wanted can be found here.
But basically SQL processes a query in this order:
- FROM
- ON
- OUTER
- WHERE
- GROUP BY
- CUBE | ROLLUP
- HAVING
- SELECT
- DISTINCT
- ORDER BY
- TOP
So if I take your subquery and place it in the order it's processed, it would look like this:
FROM inc_unit AS notif -- Find the table
WHERE notif.inci_no = u.inci_no -- Restrict the rows used
--AND notif.arv_dttm IS NOT NULL --
GROUP BY notif.inci_no -- Group up what's left
SELECT min(notif_dttm) -- Process any aggregates
You can see it's a pretty logical progression. We can't GROUP BY until we remove the appropriate rows using the WHERE clause. And then we can't check the MIN until we have finished grouping everything together.
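Putting that together, here is a sketch of one possible fix, assuming (based on your question's table and column names) that rows with a NULL notif_dttm should simply be ignored by the aggregate:

```sql
SELECT MIN(notif.notif_dttm)
FROM inc_unit AS notif
WHERE notif.inci_no = u.inci_no
  -- keep NULLs away from MIN() so the warning never fires
  AND notif.notif_dttm IS NOT NULL
GROUP BY notif.inci_no
```

Because the WHERE clause is processed before the aggregate, MIN() never sees a NULL and the "Null value is eliminated by an aggregate" warning goes away.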
I don't think indexes will help. I'm not certain why, so rather than guess, let's experiment...
First, we create a table with your structure and fill it with random data (10,000 rows, each with an array of 1,000 values).
CREATE TABLE t
(
id integer /* PRIMARY KEY */,
stuff integer,
large_array_of_values float[]
) ;
CREATE OR REPLACE FUNCTION random_vector() RETURNS float[] AS
$$
select
array_agg(random())
from
generate_series (1, 1000)
$$
LANGUAGE SQL ;
INSERT INTO t
(id, stuff, large_array_of_values)
SELECT
id, random()*10000, random_vector()
FROM
generate_series(1, 10000) AS i(id) ;
At this point we create one sample index for values at index [32] of the vector (plus the id!):
CREATE INDEX
idx_32 ON t(id, (large_array_of_values[32]));
Now, we ask PostgreSQL to analyze the following query and explain it:
EXPLAIN ANALYZE
SELECT
avg(large_array_of_values[32])
FROM
t
WHERE
id BETWEEN 5000 and 7500
AND (large_array_of_values[32]) > 0.32 ;
| QUERY PLAN |
| :---------------------------------------------------------------------------------------------------------------------- |
| Aggregate (cost=46.94..46.95 rows=1 width=8) (actual time=54.871..54.871 rows=1 loops=1) |
| -> Bitmap Heap Scan on t (cost=4.91..46.89 rows=17 width=32) (actual time=0.392..1.204 rows=1732 loops=1) |
| Recheck Cond: ((id >= 5000) AND (id <= 7500) AND (large_array_of_values[32] > '0.32'::double precision)) |
| Heap Blocks: exact=20 |
| -> Bitmap Index Scan on idx_32 (cost=0.00..4.91 rows=17 width=0) (actual time=0.364..0.364 rows=1732 loops=1) |
| Index Cond: ((id >= 5000) AND (id <= 7500) AND (large_array_of_values[32] > '0.32'::double precision)) |
| Planning time: 0.405 ms |
| Execution time: 55.013 ms |
dbfiddle here
The query plan is not using the index (idx_32) to perform an index-only scan, which I guess is what you wanted, even though idx_32 is a covering index for such a query.
VACUUMing, to make sure that the visibility map was up to date, didn't have any effect. I couldn't find any explicit restriction saying that index-only scans must refer to plain columns (and not indexes on expressions), but it appears that they aren't used in this expression-index case.
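One possible workaround (untested here, and assuming PostgreSQL 12 or later) is to materialize the array element as a stored generated column, so the planner sees a plain column rather than an expression:

```sql
-- value_32 is a hypothetical column name for this sketch
ALTER TABLE t
  ADD COLUMN value_32 float
  GENERATED ALWAYS AS (large_array_of_values[32]) STORED;

CREATE INDEX idx_32_gen ON t (id, value_32);

-- The query then references the plain column instead of the expression:
SELECT avg(value_32)
FROM t
WHERE id BETWEEN 5000 AND 7500
  AND value_32 > 0.32 ;
```

This trades extra storage (one float per row per materialized element) for making the index usable the same way as an ordinary column, as the comparison below with a non-array column suggests.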
Comparison with a non-vector column
CREATE TABLE t
(
id integer /* PRIMARY KEY */,
stuff integer,
a_value float
) ;
INSERT INTO t
(id, stuff, a_value)
SELECT
id, random()*10000, random()
FROM
generate_series(1, 10000) AS i(id) ;
CREATE INDEX idx_value ON t(id, a_value);
VACUUM (ANALYZE, VERBOSE) t ;
In this case, the covering index is actually used, and you get an "index-only scan".
EXPLAIN ANALYZE
SELECT
avg(a_value)
FROM
t
WHERE
id BETWEEN 5000 and 7500 AND (a_value > 0.1)
| QUERY PLAN |
| :---------------------------------------------------------------------------------------------------------------------------- |
| Aggregate (cost=103.67..103.69 rows=1 width=8) (actual time=1.139..1.140 rows=1 loops=1) |
| -> Index Only Scan using idx_value on t (cost=0.29..98.05 rows=2251 width=8) (actual time=0.026..0.655 rows=2247 loops=1) |
| Index Cond: ((id >= 5000) AND (id <= 7500) AND (a_value > '0.1'::double precision)) |
| Heap Fetches: 0 |
| Planning time: 0.184 ms |
| Execution time: 1.179 ms |
dbfiddle here
I can't speak from any experience outside SQL Server, but I'll assume that this applies to most/all SQL implementations:
COUNT(*) is counting rows. If there are zero rows to count, zero is the response.
Most other aggregate functions depend upon values contained in columns (MIN, MAX, SUM, etc.). If there are no values to aggregate (either because there are no rows in the dataset or because the column(s) involved contain only NULLs), NULL is the response.
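A quick illustration (T-SQL here, but the behaviour is standard SQL):

```sql
CREATE TABLE #empty (val int);

SELECT COUNT(*) AS row_count,  -- 0: COUNT(*) counts rows, and there are none
       MIN(val) AS min_val     -- NULL: there are no values to aggregate
FROM #empty;
```

If you need a non-NULL fallback from an aggregate, wrap it, e.g. `COALESCE(MIN(val), 0)`.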