PostgreSQL – Fastest Query for Selecting Arrays with Duplicates

array, postgresql

I'm using Postgres 9.5 and I have a column, named phones, of type text[].
I need to find all rows where this column contains duplicates.

I've found this extremely useful set of functions https://github.com/JDBurnZ/postgresql-anyarray and I could use this particular function https://github.com/JDBurnZ/postgresql-anyarray/blob/master/stable/anyarray_uniq.sql

select * from user_info 
where anyarray_uniq(phones) <> phones

I was wondering though if there is a faster way of achieving what I want.
Maybe unnesting the array and using the window functionality would be better? Although I can find my way around SQL, I'm new to Postgres' specific best practices, so any help is welcome.

Is this better suited for CodeReview?

Best Answer

1) The anyarray_uniq function can be simplified in several ways to make it faster (note that inside the function body an input parameter can be referenced not only by name but also by position: $<n>):

create or replace function array_deldup1(anyarray) returns anyarray as $body$
declare
  result $1%type = '{}';
  i int;
begin
  for i in array_lower($1, 1)..array_upper($1, 1) loop
    if array_position(result, $1[i]) is null then -- array_position() was introduced in 9.5
      result := result || $1[i];
    end if;
  end loop;
  return result;
end $body$ language plpgsql immutable;
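As a quick sanity check, the function behaves like this (hypothetical sample input):

```sql
select array_deldup1(array['a', 'b', 'a', 'c', 'b']);
-- {a,b,c}  (the first occurrence of each element is kept, in order)
```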

or yet simpler using pure SQL:

create or replace function array_deldup2(anyarray) returns anyarray as $body$
  select array_agg(x order by n) 
  from (
    select distinct on (x) x, n 
    from unnest($1) with ordinality as t(x,n) order by x, n) as t(x,n);
$body$ language sql immutable;

The second one is slower than the first, but still faster than the original in my tests.

These functions do exactly the same thing as anyarray_uniq (remove duplicates while keeping the order of the elements), but for your purpose the order is irrelevant, so the simplest function-based approach is

create or replace function array_deldup3(anyarray) returns anyarray as $body$
  select array_agg(distinct x) from unnest($1) t(x);
  -- Or yet another syntax doing the same thing:
  -- select array(select distinct unnest($1));
$body$ language sql immutable;

and now, because the element order may change, you should compare the array lengths instead of the contents:

select * from user_info
where array_length(array_deldup3(phones), 1) <> array_length(phones, 1)

2) To achieve your goal you are doing unnecessary work by calling a function (which also slows the query down), building the deduplicated array, and finally comparing two arrays. The actual goal is simply to compare the array's total element count against its count of distinct values:

select * from user_info 
where (select count(x) <> count(distinct x) from unnest(phones) as t(x))
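As for the window-function idea mentioned in the question: it works, but it is unlikely to beat the simple count comparison above, since it still has to scan every element. A sketch, for comparison:

```sql
-- Count each element's occurrences with a window function and
-- keep rows where any element appears more than once.
select u.*
from user_info u
where exists (
  select 1
  from (
    select count(*) over (partition by x) as cnt
    from unnest(u.phones) as t(x)
  ) s
  where s.cnt > 1
);
```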

Upd:
3) Once you have fixed your data using one of the functions above

update user_info set phones = array_deldup<n>(phones);

you can prevent this situation from recurring by creating a constraint on the column:

create or replace function array_havedup(anyarray) returns boolean as $body$
  select count(x) <> count(distinct x) from unnest($1) as t(x);
$body$ language sql immutable;

alter table user_info add constraint chk_user_info_phone check (not array_havedup(phones));

Actually, you can use this function for the query in the question as well:

select * from user_info where array_havedup(phones); -- Simple, isn't it?

4) Try to follow the common database design rules known as "database normalization". The example you provided relates directly to the First and Second Normal Forms.

Let's imagine that you need additional info per phone, like "home/work/mobile", "internal code", "availability time" and so on. With your current design that would be problematic.
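A normalized design could look like this (a sketch; it assumes user_info has an integer primary key id, and the table and column names are hypothetical). Each phone gets its own row with room for per-phone attributes, and the primary key makes duplicates impossible without any check constraint:

```sql
-- One row per phone instead of a text[] column on user_info
create table user_phone (
  user_id  int  not null references user_info (id),
  phone    text not null,
  kind     text,                    -- e.g. 'home', 'work', 'mobile'
  primary key (user_id, phone)      -- a user cannot have the same phone twice
);
```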