PostgreSQL – Grouping by overlapping arrays, transitively, without duplicates

aggregate array postgresql recursive

I've found the related question Group by array overlapping. However, I'm having trouble putting it to use in my case.

I have a table like so (the real myid values are hashes, but simplified here for illustration):

create temp table a (myid text, ip inet);
insert into a (myid, ip)
values
  ('0a', '10.10.1.1'),
  ('0a', '10.10.1.2'),
  ('0a', '10.10.1.3'),
  ('0b', '10.10.1.2'),
  ('0b', '10.10.1.4'),
  ('0c', '10.10.1.5'),
  ('0d', '10.10.1.3'),
  ('0e', '10.10.1.6'),
  ('0e', '10.10.1.7'),
  ('0f', '10.10.1.8'),
  ('0f', '10.10.1.9'),
  ('10', '10.10.1.9'),
  ('11', '10.10.1.10'),
  ('12', '10.10.1.11'),
  ('12', '10.10.1.4'),
  ('1a', '10.10.1.2'),
  ('1a', '10.10.1.4'),
  ('1e', '10.10.1.11'),
  ('1f', '10.10.1.12'),
  ('23', '10.10.1.12');

The result I can't work out how to produce is:

         ids         |                         ips
---------------------+------------------------------------------------------
 {0a,0b,0d,12,1a,1e} | {10.10.1.1,10.10.1.2,10.10.1.3,10.10.1.4,10.10.1.11}
 {0c}                | {10.10.1.5}
 {0e}                | {10.10.1.6,10.10.1.7}
 {0f,10}             | {10.10.1.8,10.10.1.9}
 {11}                | {10.10.1.10}
 {1f,23}             | {10.10.1.12}

The logic here is that any ids with ips in common are grouped together, transitively. For instance, 0a has an ip in common with 0b; 0b has one in common with 12; 12 has one in common with 1e, and so forth.

There are tens of thousands of rows, no specific limit on how many ips for any given id, and no specific limit on how many ids may show any given ip.

I know how to aggregate by ip, or aggregate by id, but doing both transitively is giving me trouble. I tried a recursive CTE but I couldn't seem to get it right, and I'm not sure if that was the right approach in the first place. (If I could first group by id and then group by overlapping array of ips, and avoid duplicates in the aggregation, I would be all set, but there may be a better approach.)

Is there a way to produce the above results with standard SQL? Or at least with standard Postgres? (I'm using 9.6.6.)

Here is a failed attempt. (It is a legal query that does return results, but not the desired results.) It fails because:

1. It includes intermediate results rather than replacing them with later results, and
2. It doesn't sort the array concatenations, so it includes each result multiple times.

This is also a hugely slow query for the actual dataset I am working with, since it returns each result up to n! times.
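The duplicates arise because PostgreSQL compares arrays element by element, in order, so two concatenations holding the same ids in a different order count as different rows for UNION. A minimal sketch of that behaviour, using only built-in array functions and made-up literal values:

```sql
-- Same elements, different order: the arrays are not equal,
-- so UNION would keep both as separate rows.
select array['0a', '0b'] = array['0b', '0a'] as equal_as_is;  -- false

-- Sorting the elements first gives one canonical form per set of ids.
select
  (select array_agg(x order by x) from unnest(array['0b', '0a']) as u (x))
  =
  (select array_agg(x order by x) from unnest(array['0a', '0b']) as u (x))
  as equal_when_sorted;  -- true
```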

Here is the query:

with recursive b as (
  select
    array[myid] as ids,
    array_agg(ip) as ips 
  from a
  group by myid
), c as (
  select
    ids,
    ips
  from b
  union
  select
    b.ids || c.ids,
    b.ips || c.ips
  from
    b
    join c on
      (not b.ids && c.ids)
      and (b.ips && c.ips)
)
select * from c
;

Best Answer

One of the key parts of Jack Douglas's solution in Group by array overlapping is the | (pipe) operator used on arrays in the recursive part of the recursive t CTE like this:

...
select t.id, a.id, t.clst | a.clst
...

This operator concatenates two arrays, suppressing duplicate items. That answer cannot be applied directly to your setup because the | operator is apparently defined for int arrays only (it is provided by the intarray extension), while you need a way to perform the same operation on inet arrays.

You can do that by treating the arrays as row sets. If you notice, what the | operator produces is effectively a union of two sets. Therefore, if you unnest both arrays, union them and aggregate the combined set back as an array, you will get the same result. So, this expression,

t.clst | a.clst

can be replaced with a correlated subquery:

(
  select
    array_agg(sub.n)
  from
    (
      select unnest(t.clst)
      union
      select unnest(a.clst)
    ) as sub (n)
)

Yes, the substitution is quite unwieldy in comparison, but it does the job, and that is something to start with.
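To see the substitution in isolation, here is a small sketch that runs the same unnest/union/array_agg pattern on two literal inet arrays. The literal values are illustrative stand-ins for t.clst and a.clst, and the order by inside array_agg is an addition that is not part of the expression above:

```sql
-- Literal inet arrays stand in for t.clst and a.clst (illustrative values only).
select
  array_agg(sub.n order by sub.n) as merged
from
  (
    select unnest(array['10.10.1.1', '10.10.1.2']::inet[])
    union  -- union (not union all) drops the duplicate 10.10.1.2
    select unnest(array['10.10.1.2', '10.10.1.4']::inet[])
  ) as sub (n);
-- merged: {10.10.1.1,10.10.1.2,10.10.1.4}
```

Without the order by, the element order of the aggregated array is not guaranteed. That can matter when the arrays are later compared or grouped on, because arrays with the same elements in a different order are not equal.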

Adapting the solution to your example (and adding a bit of white space to the original code), the complete query would look like this:

with recursive
  cte_a as
  (
    select
      myid,
      array_agg(distinct ip) as ip
    from
      a
    group by
      myid
  )
, cte_t (myid, pmyid, ip) as
  (
    select
      myid,
      myid,
      ip
    from
      cte_a

    union all

    select
      t.myid,
      a.myid,

      (  /* this is the replacement expression */
        select
          array_agg(sub.n)
        from
          (
            select unnest(t.ip)

            union

            select unnest(a.ip)
          ) as sub (n)
      )

    from
      cte_t as t
      join cte_a as a
        on a.myid <> t.pmyid and t.ip && a.ip and not t.ip @> a.ip
  )
, cte_d as
  (
    select distinct on (myid)
      myid,
      ip
    from
      cte_t
    order by
      myid,
      cardinality(ip) desc
  )
select
  array_agg(myid),
  ip
from
  cte_d
group by
  ip
;

You can test the query in this demo at db<>fiddle.uk.

Note also that Jack's word of caution probably applies to your situation as well:

Bear in mind that this is unlikely to perform well on millions of rows.

Related Solutions

MySQL – Recursive Query in MySQL using stored procedure and CURSOR

So the three questions I'm hoping someone can answer are:

1. Should I be using a CURSOR here or is there an alternative that will get the same recursive result?

No, you should not. The CURSOR keeps being opened over and over, and the thought of that overhead makes me cringe. Personally, I stay away from CURSORs.

2. How can I get the results back in a single result set so that it can be used in the same fashion as a subselect?

3. What is the proper way of using the results from the CALL? I can't, at least as far as I've tried, get the sample SELECT statement above to work. I believe this is because I can't use CALL inline, but I'm not sure.

I would like to bail you out of this by suggesting something I learned from my college days when learning about data structures. You need to perform tree traversal using a queue. This allows you to start at a root and express all descendants of a tree.

The algorithm goes like this

STEP 01 : Start with a queue containing only the root node
STEP 02 : Dequeue Node From the Front of the Queue
STEP 03 : Enqueue All Children of the Latest Node
STEP 04 : Process Info From the Latest Node
STEP 05 : If the Queue is Not Empty, Go Back to STEP 02
STEP 06 : All Done

This allows you to traverse a recursive structure without using Programmatic Recursion. At this point, you are probably asking: How can I traverse a tree structure without recursion?

I wrote a post about how to script three Stored Procedures that use a loop, in a single CALL, to traverse a table in which each row stores a node and its parent:

GetParentIDByID
GetAncestry
GetFamilyTree

The post is Find highest level of a hierarchical field: with vs without CTEs (Oct 24, 2011). It contains the Stored Procedures already written that will traverse the following table structure:

+------------+--------------+------+-----+---------+----------------+
| Field      | Type         | Null | Key | Default | Extra          |
+------------+--------------+------+-----+---------+----------------+
| id         | int(11)      | NO   | PRI | NULL    | auto_increment | 
| parent_id  | int(11)      | YES  |     | NULL    |                | 
| name       | varchar(255) | YES  |     | NULL    |                | 
| notes      | text         | YES  |     | NULL    |                | 
+------------+--------------+------+-----+---------+----------------+

Please read the code carefully and apply the principles.
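As a rough illustration of the queue-based steps listed earlier, here is a minimal MySQL sketch. It is not one of the stored procedures from the linked post. The table name tree_nodes, the procedure name GetDescendants, and the temporary tables bfs_queue and bfs_result are all hypothetical names chosen for this example. The sketch assumes the data really is a tree; a cycle in parent_id would make the loop run forever.

```sql
-- Hypothetical table with the column layout shown above.
CREATE TABLE tree_nodes (
  id        INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  parent_id INT NULL,
  name      VARCHAR(255) NULL,
  notes     TEXT NULL
);

INSERT INTO tree_nodes (id, parent_id, name) VALUES
  (1, NULL, 'root'),
  (2, 1, 'child a'),
  (3, 1, 'child b'),
  (4, 2, 'grandchild');

DELIMITER $$

CREATE PROCEDURE GetDescendants(IN root_id INT)
BEGIN
  DECLARE current_id INT;
  DECLARE queue_len  INT;

  DROP TEMPORARY TABLE IF EXISTS bfs_queue;
  DROP TEMPORARY TABLE IF EXISTS bfs_result;
  CREATE TEMPORARY TABLE bfs_queue
    (pos INT AUTO_INCREMENT PRIMARY KEY, node_id INT);
  CREATE TEMPORARY TABLE bfs_result
    (visit_order INT AUTO_INCREMENT PRIMARY KEY, node_id INT);

  -- Seed the queue with the starting node.
  INSERT INTO bfs_queue (node_id) VALUES (root_id);
  SELECT COUNT(*) INTO queue_len FROM bfs_queue;

  WHILE queue_len > 0 DO
    -- Dequeue the node at the front of the queue.
    SELECT node_id INTO current_id FROM bfs_queue ORDER BY pos LIMIT 1;
    DELETE FROM bfs_queue ORDER BY pos LIMIT 1;

    -- Enqueue all children of that node.
    INSERT INTO bfs_queue (node_id)
      SELECT id FROM tree_nodes WHERE parent_id = current_id ORDER BY id;

    -- "Process" the node: here, just record the visit.
    INSERT INTO bfs_result (node_id) VALUES (current_id);

    SELECT COUNT(*) INTO queue_len FROM bfs_queue;
  END WHILE;

  -- Return everything as a single result set, in visit order.
  SELECT r.visit_order, n.id, n.parent_id, n.name
  FROM bfs_result AS r
    JOIN tree_nodes AS n ON n.id = r.node_id
  ORDER BY r.visit_order;
END$$

DELIMITER ;

CALL GetDescendants(1);
```

With the sample rows above, the call should list the root first, then its two children, then the grandchild. The result still comes back from a CALL, so it cannot be used inline as a subselect; one workaround is to leave the rows in a temporary table such as bfs_result and query that table afterwards.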

Give it a Try !!!

PostgreSQL return second ‘Group’ as an array

Maybe something like this:

with ranked_visits as (
  SELECT w.website_id, 
         v.visitor_id, 
         count(wv.visit_id) as visits,
         row_number() over (partition by w.website_id order by count(wv.visit_id) desc) as rnk
  FROM website_visits wv
    JOIN websites w ON wv.website_id = w.website_id
    JOIN visitors v ON wv.visitor_id = v.visitor_id
  GROUP BY w.website_id, v.visitor_id
  ORDER BY w.website_id ASC, count(wv.visit_id) DESC
)
select website_id, string_agg('visitor_id: '||visitor_id||',visits:'||visits, ', ')
from ranked_visits
where rnk <= 10
group by website_id;
