How many unique values do you have for crawled_name
? How feasible it is that you'll hit an apartment_id
value greater than 2 billion (in the next 5-10 years)?
Note that COUNT(id)
is the same as COUNT(*)
(because id
is not nullable), and that COUNT(apartment_id) > 10000
is effectively COUNT(*) > 10000
along with WHERE apartment_id IS NOT NULL
. What you're doing not only complicates understanding the query, but may also cause PostgreSQL to select a sub-optimal strategy. By using COUNT(*)
instead of COUNT(id)
, PostgreSQL can compute the whole thing on an index that does not include id
, and without the need for reading the table at all, so it should process much faster.
Let's assume you currently have apartment_id
values in the low millions, and it is unlikely you'll hit the value 2 billion in the near future. You could speed it up by creating an index and adjusting your query as follows:
CREATE INDEX ON facility (crawled_name, (apartment_id)::int)
WHERE NULLIF(facility_characteristic, '') IS NULL
AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL;
SELECT crawled_name, COUNT(*)
FROM facility
WHERE NULLIF(facility_characteristic, '') IS NULL
AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL
GROUP BY crawled_name, apartment_id::int
HAVING COUNT(*) > 10000
LIMIT 10;
You could significantly speed it up further if you have a fixed list of crawled_name
values that range in the dozens (or low hundreds), by turning the type of the column into an enum. You'll have to drop indexes that use this column prior:
DO $do$ DECLARE _l text[]; BEGIN
_l := ARRAY(SELECT DISTINCT crawled_name FROM facility WHERE crawled_name > '' ORDER BY 1);
EXECUTE $$CREATE TYPE crawled_name_enum AS enum ('$$ ||
array_to_string(_l, $$','$$) || $$');$$;
ALTER TABLE facility ALTER COLUMN crawled_name
TYPE crawled_name_enum USING (crawled_name::crawled_name_enum);
END; $do$;
As always, the documentation is an excellent reference. In here, it shows WHERE condition
and condition is defined (further down the page) as
where condition is any expression that evaluates to a result of type boolean. Any row that does not satisfy this condition will be eliminated from the output. A row satisfies the condition if it returns true when the actual row values are substituted for any variable references.
HAVING
, as John pointed out in a comment, is for filtering on aggregates after they have been calculated, while WHERE
is for filtering the rows that are gathered at the start. For example:
SELECT
type, COUNT(type)
FROM table
GROUP BY type
HAVING COUNT(type) > 1
would return a set of rows where the COUNT(type) result ended up being more than 1.. The filtering done by HAVING
takes place after the DB has already grouped the entire table by TYPE and calculated all COUNT()
values. On the other hand
SELECT
type, COUNT(type)
FROM table
WHERE type = 'EXAMPLE'
GROUP BY type
would return one row (where type is EXAMPLE) and the count of how many rows contained that type. The filtering here takes place AS THE DB COLLECTS THE ROWS up front .. Any row that doesn't have type 'EXAMPLE' is thrown out. Only one COUNT() is calculated in this instance.
One more complex example to show that the two are evaluated separately and can be combined in any query.
SELECT
col_1, col_2, col_3, SUM(col_4), COUNT(col_4), AVG(col_4)
FROM table
WHERE col_2 IN ('A', 'B', 'C')
AND col_3 > 3
GROUP BY col_1, col_2, col_3
HAVING COUNT(col_4) > 2
AND AVG(col_4) > 10.5
Best Answer
No, the syntax you have is not valid, it can be corrected by the use of a
CASE
expression.(and I guess you have a
GROUP BY a, b
as you'd get an error otherwise).Note that the
ELSE NULL
is redundant and can be removed as that is the defaultELSE
behaviour anyway:There is a (SQL Standard)
FILTER
syntax that is close to your attempt but SQL Server has not yet implemented it: