The query you have
You could simplify your query using a WINDOW
clause, but that's just shortening the syntax, not changing the query plan.
SELECT id, trans_ref_no, amount, trans_date, entity_id
, SUM(amount) OVER w AS trans_total
, COUNT(*) OVER w AS trans_count
FROM transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date)
ORDER BY trans_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
- Also using the slightly faster count(*), since id is certainly defined NOT NULL?
- And you don't need to ORDER BY entity_id, since you already PARTITION BY entity_id.
You can simplify further, though: don't add ORDER BY to the window definition at all; it's not relevant to your query. Then you don't need to define a custom window frame, either:
SELECT id, trans_ref_no, amount, trans_date, entity_id
, SUM(amount) OVER w AS trans_total
, COUNT(*) OVER w AS trans_count
FROM transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month', trans_date));
Simpler, faster, but still just a better version of what you have, with static months.
The query you might want
... is not clearly defined, so I'll build on these assumptions:
Count transactions and amount for every 30-day period within the first and last transaction of any entity_id
. Exclude leading and trailing periods without activity, but include all possible 30-day periods within those outer bounds.
SELECT entity_id, trans_date
     , COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
     , COALESCE(sum(daily_count)  OVER w, 0) AS trans_count
FROM  (
   SELECT entity_id
        , generate_series(min(trans_date)::timestamp
                        , GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
                        , interval '1 day')::date AS trans_date
   FROM   transactiondb
   GROUP  BY 1
   ) x
LEFT   JOIN (
   SELECT entity_id, trans_date
        , sum(amount) AS daily_amount, count(*) AS daily_count
   FROM   transactiondb
   GROUP  BY 1, 2
   ) t USING (entity_id, trans_date)
WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
             ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING);
This lists all 30-day periods for each entity_id with your aggregates, with trans_date being the first day (incl.) of the period. To get values for each individual row, join to the base table once more, as sketched below.
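A minimal sketch of that extra join, wrapping the query above in a CTE (the name period is an assumption, not from the original):

WITH period AS (
   SELECT entity_id, trans_date
        , COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
        , COALESCE(sum(daily_count)  OVER w, 0) AS trans_count
   FROM  (
      SELECT entity_id
           , generate_series(min(trans_date)::timestamp
                           , GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
                           , interval '1 day')::date AS trans_date
      FROM   transactiondb
      GROUP  BY 1
      ) x
   LEFT   JOIN (
      SELECT entity_id, trans_date
           , sum(amount) AS daily_amount, count(*) AS daily_count
      FROM   transactiondb
      GROUP  BY 1, 2
      ) t USING (entity_id, trans_date)
   WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
                ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING)
   )
SELECT t.*, p.trans_total, p.trans_count  -- every transaction row with its period totals
FROM   transactiondb t
JOIN   period p USING (entity_id, trans_date);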
The basic difficulty is the same as discussed here: the frame definition of a window cannot depend on values of the current row. Also, rather call generate_series() with timestamp input:
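The reason, as a sketch with illustrative dates: date input resolves to the timestamptz variant of generate_series(), so the series then depends on the session time zone and can stumble over DST switches; plain timestamp input is immune to that:

-- timestamp input keeps the series stable regardless of time zone:
SELECT generate_series(timestamp '2016-03-24'
                     , timestamp '2016-03-31'
                     , interval '1 day')::date AS the_day;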
The query you actually want
After question update and discussion: accumulate rows of the same entity_id in a 30-day window starting at each actual transaction.

Since your data is distributed sparsely, it should be more efficient to run a self-join with a range condition, all the more since Postgres 9.1 does not have LATERAL joins yet:
SELECT t0.id, t0.amount, t0.trans_date, t0.entity_id
     , sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM   transactiondb t0
JOIN   transactiondb t1 USING (entity_id)
WHERE  t1.trans_date >= t0.trans_date
AND    t1.trans_date <  t0.trans_date + 30  -- exclude upper bound
-- AND t0.entity_id = 114284  -- or pick a single entity ...
GROUP  BY t0.id  -- is PK!
ORDER  BY t0.trans_date, t0.id;
SQL Fiddle.
A rolling window could only make sense (with respect to performance) with data for most days.
This does not aggregate duplicates on (trans_date, entity_id)
per day, but all rows of the same day are always included in the 30-day window.
For a big table, a covering index like this could help quite a bit:
CREATE INDEX transactiondb_foo_idx
ON transactiondb (entity_id, trans_date, amount);
The last column amount
is only useful if you get index-only scans out of it. Else drop it.
But it's not going to be used while you select the whole table anyway. It would support queries for a small subset.
The COUNT(*) version is able to simply seek into the index you have on the status column once for each status you are selecting, whereas the SUM(...) version needs to seek the index twelve times (the total number of unique status types). Clearly, seeking an index three times is going to be faster than seeking it twelve times.
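The COUNT(*) version itself is not shown here; hypothetically, it would look something like this, with one index seek per status:

SELECT Approved = (SELECT COUNT(*) FROM dbo.Claims WHERE Status = 'Approved'),
       Valid    = (SELECT COUNT(*) FROM dbo.Claims WHERE Status = 'Valid'),
       Reject   = (SELECT COUNT(*) FROM dbo.Claims WHERE Status = 'Reject');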
The first plan requires a memory grant of 238MB, whereas the second plan requires a memory grant of 650MB. It may be that the larger memory grant could not be immediately filled, making the query that much slower.
Alter the second query to be:
SELECT Approved = SUM(CASE WHEN Status = 'Approved' THEN 1 ELSE 0 END),
Valid = SUM(CASE WHEN Status = 'Valid' THEN 1 ELSE 0 END),
Reject = SUM(CASE WHEN Status = 'Reject' THEN 1 ELSE 0 END)
FROM dbo.Claims c
WHERE c.Status = 'Approved'
OR c.Status = 'Valid'
OR c.Status = 'Reject';
This will allow the query optimizer to eliminate 75% of the index seeks, and should result in a lower required memory grant, lower I/O requirements, and faster time-to-result.
The SUM(CASE WHEN ...)
construct essentially prevents the query optimizer from pushing the Status
predicates down into the index seek portion of the plan.
Best Answer
You mostly answered the question yourself already. I have a few morsels to add:
In PostgreSQL (and other RDBMS that support the boolean type) you can use the boolean result of the test directly:

- Cast it to integer and SUM() it.
- Or use it in a NULLIF() expression and COUNT().
- Or use a simple OR NULL and COUNT().
- Or various other expressions. Performance is almost identical (sketches of the first three follow this list).
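Minimal sketches of those three variants, assuming a table tbl and the test condition kat_id > 50 from the benchmark below:

-- 1) Cast the boolean result to integer and SUM() it:
SELECT sum((kat_id > 50)::int) AS ct FROM tbl;

-- 2) NULLIF() turns FALSE into NULL, which COUNT() ignores:
SELECT count(NULLIF(kat_id > 50, false)) AS ct FROM tbl;

-- 3) OR NULL also turns FALSE into NULL:
SELECT count(kat_id > 50 OR NULL) AS ct FROM tbl;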
COUNT() is typically very slightly faster than SUM(). Unlike SUM(), and as Paul already commented, COUNT() never returns NULL, which may be convenient.

Since Postgres 9.4 there's also the aggregate FILTER clause. It's faster than all of the above by around 5 - 10 %:
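A sketch of the FILTER variant, under the same assumptions as above:

SELECT count(*) FILTER (WHERE kat_id > 50) AS ct FROM tbl;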
If the query is as simple as your test case, with only a single count and nothing else, you can rewrite:
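That rewrite, as a sketch under the same assumptions:

SELECT count(*) FROM tbl WHERE kat_id > 50;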
... which is the true king of performance, even without index.
With an applicable index it can be faster by orders of magnitude, especially with index-only scans.
Benchmarks
Postgres 13
db<>fiddle here
Basically the same results as for Postgres 10 below. (I added a test without the new parallelism.)
Postgres 10
I ran a new series of tests for Postgres 10, including the aggregate FILTER clause and demonstrating the role of an index for small and big counts. Simple setup:
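A comparable test bed, as a sketch (table and column names are assumptions, carried over from the old 9.1 test):

-- 1M rows; kat_id spread over 0..99 so a predicate can match
-- ~1 % (e.g. kat_id = 42) or ~33 % (e.g. kat_id > 66) of all rows
CREATE TABLE tbl AS
SELECT g AS id
     , trunc(random() * 100)::int AS kat_id
FROM   generate_series(1, 1000000) g;

CREATE INDEX tbl_kat_id_idx ON tbl (kat_id);  -- enables index(-only) scans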
Actual times vary quite a bit due to background noise and specifics of the test bed. Showing typical best times from a bigger set of tests. These two cases should capture the essence:
Test 1 counting ~ 1 % of all rows
db<>fiddle here
Test 2 counting ~ 33 % of all rows
db<>fiddle here
The last test in each set used an index-only scan, which is why it helped for counting one third of all rows. Plain index or bitmap index scans cannot compete with a sequential scan when involving roughly 5 % or more of all rows.
Old test for Postgres 9.1
To verify, I ran a quick test with EXPLAIN ANALYZE on a real-life table in PostgreSQL 9.1.6. 74208 of 184568 rows qualified with the condition kat_id > 50. All queries return the same result. I ran each about 10 times in turns to exclude caching effects and appended the best result as a note. Hardly any real difference in performance.
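For reference, one round of such a test, as a sketch (one EXPLAIN ANALYZE per variant from the list above, reusing the hypothetical tbl):

EXPLAIN ANALYZE
SELECT count(kat_id > 50 OR NULL) AS ct FROM tbl;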