PostgreSQL – Fastest Query for Selecting Arrays with Duplicates

array, postgresql

I'm using Postgres 9.5 and I have a column, named phones, of type text[].
I need to find all rows where this column contains duplicates.

I've found this extremely useful set of functions https://github.com/JDBurnZ/postgresql-anyarray and I could use this particular function https://github.com/JDBurnZ/postgresql-anyarray/blob/master/stable/anyarray_uniq.sql

select * from user_info 
where anyarray_uniq(phones) <> phones

I was wondering though if there is a faster way of achieving what I want.
Maybe unnesting the array and using the window functionality would be better? Although I can find my way around SQL, I'm new to Postgres' specific best practices, so any help is welcome.

Is this better suited for CodeReview?

Best Answer

1) The anyarray_uniq function can be simplified in several ways to make it faster (note that inside the function body an input parameter can be referenced not only by name but also by position: $<n>):

create or replace function array_deldup1(anyarray) returns anyarray as $body$
declare
  result $1%type = '{}';
  i int;
begin
  for i in array_lower($1, 1)..array_upper($1, 1) loop
    if array_position(result, $1[i]) is null then -- array_position() was introduced in 9.5
      result := result || $1[i];
    end if;
  end loop;
  return result;
end $body$ language plpgsql immutable;
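As a quick sanity check, the function behaves like this (hypothetical sample input):

```sql
select array_deldup1(array['a', 'b', 'a', 'c', 'b']);
-- {a,b,c}  (the first occurrence of each element is kept, in order)
```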

or yet simpler using pure SQL:

create or replace function array_deldup2(anyarray) returns anyarray as $body$
  select array_agg(x order by n) 
  from (
    select distinct on (x) x, n 
    from unnest($1) with ordinality as t(x,n) order by x, n) as t(x,n);
$body$ language sql immutable;

The second one is slower than the first, but still faster than the original in my tests.

These functions do exactly the same thing as anyarray_uniq (remove duplicates while keeping the order of the elements), but for your purpose the order is irrelevant, so the simplest function-based approach is

create or replace function array_deldup3(anyarray) returns anyarray as $body$
  select array_agg(distinct x) from unnest($1) t(x);
  -- Or yet another syntax doing the same thing:
  -- select array(select distinct unnest($1));
$body$ language sql immutable;

and now, because the element order may change, you should compare the array lengths instead of the contents:

select * from user_info
where array_length(array_deldup3(phones), 1) <> array_length(phones, 1)

2) To achieve your goal you are doing unnecessary work by calling a function (which also slows the query down), building the deduplicated array, and finally comparing two arrays. The actual goal is simply to compare the array's total element count against its count of distinct values:

select * from user_info 
where (select count(x) <> count(distinct x) from unnest(phones) as t(x))
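As for the window-function idea mentioned in the question: it works, but it is unlikely to beat the simple count comparison above, since it still has to scan every element. A sketch, for comparison:

```sql
-- Count each element's occurrences with a window function and
-- keep rows where any element appears more than once.
select u.*
from user_info u
where exists (
  select 1
  from (
    select count(*) over (partition by x) as cnt
    from unnest(u.phones) as t(x)
  ) s
  where s.cnt > 1
);
```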

Upd:
3) Once you have fixed your data using one of the functions above

update user_info set phones = array_deldup<n>(phones);

you can prevent this situation from recurring by creating a constraint on the column:

create or replace function array_havedup(anyarray) returns boolean as $body$
  select count(x) <> count(distinct x) from unnest($1) as t(x);
$body$ language sql immutable;

alter table user_info add constraint chk_user_info_phone check (not array_havedup(phones));

Actually, you can use this function for the query in the question as well:

select * from user_info where array_havedup(phones); -- Simple, isn't it?

4) Try to follow the common database design rules known as "database normalization". The example you provided relates directly to the First and Second Normal Forms.

Let's imagine that you need additional info per phone, like "home/work/mobile", "internal code", "availability time" and so on. With your current design that would be problematic.
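A normalized design could look like this (a sketch; it assumes user_info has an integer primary key id, and the table and column names are hypothetical). Each phone gets its own row with room for per-phone attributes, and the primary key makes duplicates impossible without any check constraint:

```sql
-- One row per phone instead of a text[] column on user_info
create table user_phone (
  user_id  int  not null references user_info (id),
  phone    text not null,
  kind     text,                    -- e.g. 'home', 'work', 'mobile'
  primary key (user_id, phone)      -- a user cannot have the same phone twice
);
```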