distinct is NOT a function. It always operates on all columns of the result. The difference between select distinct (foo), bar and select distinct foo, bar is the same as the difference between select (foo), bar and select foo, bar: the parentheses are just "noise".
When you write select (foo, bar) you are actually creating an anonymous record type in Postgres, which results in a single column that has two attributes - which is not what you actually want.
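To make that concrete, here is a minimal sketch assuming a hypothetical table t with columns foo and bar (names invented for illustration):

```sql
-- Hypothetical table, for illustration only
CREATE TABLE t (foo int, bar text);
INSERT INTO t VALUES (1, 'a'), (1, 'a'), (2, 'b');

SELECT DISTINCT (foo), bar FROM t;  -- identical to the next query
SELECT DISTINCT foo, bar FROM t;    -- duplicates removed over ALL columns (foo, bar)

SELECT (foo, bar) FROM t;           -- a SINGLE column of an anonymous record type
```

The last query returns values like (1,a) in one record-typed column, not two separate columns.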
As you are using Postgres, you can use the (proprietary) operator DISTINCT ON which - in contrast to the standard DISTINCT - operates on a subset of the columns. You have to specify an ORDER BY in that case to define which of the rows to take if there is more than one with the same combination of (field1, field2).
CREATE TABLE new_name
AS
SELECT DISTINCT ON (table.field1, table.field2)
       table.field1,
       table.field2,
       .....
FROM ...
WHERE ...
ORDER BY ...
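As a concrete sketch (field names as in the template above, table name invented), note that the leading ORDER BY expressions must match the DISTINCT ON expressions; any further ORDER BY expressions decide which row survives per group:

```sql
SELECT DISTINCT ON (field1, field2)
       field1, field2, field3
FROM some_table            -- hypothetical table name
ORDER BY field1, field2,
         field3 DESC;      -- keeps the row with the highest field3 per (field1, field2)
```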
If you want to stick to ANSI SQL you will need a window function for this:
create table new_name
as
select column1, column2, column3, column4, column5, column6
from (
  select column1, column2, column3, column4, column5, column6,
         row_number() over (partition by column1, column2 order by ...) as rn
  from the_table
  where ...
) t
where rn = 1;
For a large table, DISTINCT ON is probably faster than the solution with the window function.
How many unique values do you have for crawled_name? How feasible is it that you'll hit an apartment_id value greater than 2 billion (in the next 5-10 years)?
Note that COUNT(id) is the same as COUNT(*) (because id is not nullable), and that COUNT(apartment_id) > 10000 is effectively COUNT(*) > 10000 combined with WHERE apartment_id IS NOT NULL. What you're doing not only makes the query harder to understand, but may also cause PostgreSQL to select a sub-optimal strategy. By using COUNT(*) instead of COUNT(id), PostgreSQL can compute the whole thing on an index that does not include id, without reading the table at all, so it should process much faster.
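A minimal sketch of that rewrite, assuming the query shape described in the question:

```sql
-- Before: forces PostgreSQL to check id and apartment_id for NULLs
SELECT crawled_name, COUNT(id)
FROM facility
GROUP BY crawled_name
HAVING COUNT(apartment_id) > 10000;

-- After: equivalent result, but computable from an index that omits id
SELECT crawled_name, COUNT(*)
FROM facility
WHERE apartment_id IS NOT NULL
GROUP BY crawled_name
HAVING COUNT(*) > 10000;
```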
Let's assume you currently have apartment_id values in the low millions, and it is unlikely you'll hit the value 2 billion in the near future. You could speed it up by creating an index and adjusting your query as follows:
CREATE INDEX ON facility (crawled_name, ((apartment_id)::int))
WHERE NULLIF(facility_characteristic, '') IS NULL
  AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL;
SELECT crawled_name, COUNT(*)
FROM facility
WHERE NULLIF(facility_characteristic, '') IS NULL
AND apartment_id IS NOT NULL AND crawled_name IS NOT NULL
GROUP BY crawled_name, apartment_id::int
HAVING COUNT(*) > 10000
LIMIT 10;
You could speed it up significantly further if you have a fixed list of crawled_name values numbering in the dozens (or low hundreds), by turning the type of the column into an enum. You'll have to drop indexes that use this column first:
DO $do$
DECLARE
  _l text[];
BEGIN
  _l := ARRAY(SELECT DISTINCT crawled_name FROM facility WHERE crawled_name > '' ORDER BY 1);
  EXECUTE $$CREATE TYPE crawled_name_enum AS ENUM ('$$ ||
          array_to_string(_l, $$','$$) || $$')$$;
  ALTER TABLE facility ALTER COLUMN crawled_name
    TYPE crawled_name_enum USING (crawled_name::crawled_name_enum);
END; $do$;
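If the conversion succeeds, you can inspect the generated labels with enum_range (a built-in Postgres function); should a new crawled_name value appear later, the enum would need an ALTER TYPE:

```sql
SELECT enum_range(NULL::crawled_name_enum);   -- lists all labels of the new enum

-- For a value that did not exist at conversion time ('new_label' is hypothetical):
-- ALTER TYPE crawled_name_enum ADD VALUE 'new_label';
```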
Best Answer
You can use something like...
...as a subquery (substitute the name of your id field for "id") and wrap it up like this...
Hard to tell exactly what you're looking for, but that will get you 24 unordered results from the btmt table with an fsid value greater than 1, although it might not be the prettiest way to do so, computationally speaking.
Please clarify if that's not what you're after; hope it helps if so.
EDIT: if you're looking for fsid values 1-25, it would be more like...