MySQL – Counting Relationship Rows Between Three Tables Without Duplicates

countjoin;MySQL

I have a (local) MySQL database, which is a Drupal 6 export. It has these relevant tables:

  1. node – Contains articles, plus also information about their department and sections. I'm interested in nid and title columns.
  2. content_field_department – Has the relationships between article nodes and department nodes. I'm interested in nid (the article node) and field_department_nid (the department node) columns.
  3. content_field_section – Has the relationships between article nodes and section nodes. I'm interested in nid (the article node) and field_section_nid (the section node) columns.

I'm trying to get an accurate COUNT for how many times a given article node is related to each section or department node (performance is not relevant for me at this time, since it's local and it can take as long as it needs to without causing problems). That is, I'd like data like this:

+-----------------+-------------------+
| Section Name    | Count of Articles |
+-----------------+-------------------+
| Department Name | Count of Articles |
+-----------------+-------------------+

The issue

One problem is when the interface was originally built, a department choice for an article also included all of the sections. A section choice, though, never included the departments. So in cases where an article node is related to another node twice, both as a section and as a department, I would like to COUNT that only once.

My current try

My current attempt is like this:

  • To get a department COUNT:

    SELECT DISTINCT d.field_department_nid as tid, n.title as name, (
            SELECT COUNT(DISTINCT nid, field_department_nid)
            FROM content_field_department d2
            WHERE d2.field_department_nid = d.field_department_nid
            AND nid NOT IN (
                SELECT s.nid from content_field_section s
                LEFT OUTER JOIN node n2 ON s.nid = n2.nid
                WHERE field_section_nid IS NOT NULL
            )
        ) as drupal_department_count
    FROM content_field_department d
    LEFT OUTER JOIN node n ON d.field_department_nid = n.nid
    ORDER BY drupal_department_count DESC
    
  • To get a section COUNT:

    SELECT DISTINCT s.field_section_nid as tid, n.title as name, (
        SELECT COUNT(DISTINCT nid, field_section_nid)
            FROM content_field_section s2
            WHERE s2.field_section_nid = s.field_section_nid
            AND nid NOT IN (
                SELECT d.nid from content_field_department d
                LEFT OUTER JOIN node n2 ON d.nid = n2.nid
                WHERE field_department_nid IS NOT NULL
            )
        ) as drupal_section_count
    FROM content_field_section s
    LEFT OUTER JOIN node n ON s.field_section_nid = n.nid
    ORDER BY drupal_section_count DESC
    

I had also tried without the NOT IN additions, but the COUNT is too high, I think because the same node combination can be COUNTed in both tables.

What direction do I need to go with this?

Best Answer

I would need more information about your business restrictions to write an ideal query, but here's my advice given my interpretation of what you've said so far.

First thing you need to do is more clearly define what you are trying to achieve.

So in cases where an article node is related to another node twice, both as a section and as a department, I would like to COUNT that only once.

Since your final result set will be grouped by departments and sections, "counting it only once" means you need to pick where they should be counted. For example:

   content_field_department         content_field_section
+-----+----------------------+   +-----+-------------------+
| nid | field_department_nid |   | nid | field_section_nid |
+-----+----------------------+   +-----+-------------------+
|  1  |          2           |   |  1  |         3         |
+-----+----------------------+   +-----+-------------------+

should nid 1 be counted in the list for department nid 2 or counted for section nid 3? Based on your question, you may want to only count it for departments. In your current query drafts, you are trying to exclude the issue case from both counts, meaning you are effectively counting it 0 times instead of 1 (your goal) or 2 (what you are trying to avoid).

Here's a draft query as a starting point:

select
    DeptOrSect,
    nid,
    title,
    count(1)
from (
    select
        'DEPT' as DeptOrSect,
        dept.nid,
        dept.title,
        link.nid as article_nid
    from node dept
    join content_field_department link on link.field_department_nid = dept.nid
    union all
    select
        'SECT' as DeptOrSect,
        sect.nid,
        sect.title,
        link.nid as article_nid
    from node sect
    join content_field_section link on link.field_section_nid = sect.nid
    where not exists (
        select 1
        from content_field_department dept
        where dept.nid = link.nid
    )
) as x
group by DeptOrSect,
    nid,
    title;