MariaDB – Can CTE Simplify Repeated and Recursive Joins?

ctejoin;mariadbrecursive

I have three tables organized as depicted below, showing foreign keys and unique fields (green):

Simplified tables

That represents email messages received by a server. (Sorry I couldn't simplify more. Full version here.) Each message is authenticated according to SPF and DKIM, and thereby linked to the corresponding domains. An SPF authentication can be linked to two domains, the so-called helo domain, which is the name of the sending server, and the bounce address domain, which usually is the domain-part (the part after the at-sign) of the author's email address. DKIM, instead, permits any domain through which the message passed to add one or more signatures, in order to claim some responsibility for the message itself. Usually, a message carries one or two DKIM signatures, but there can be more of them.

The many-to-many msg_ref table is created as follows (see db<>fiddle for tables creation and example data):

CREATE TABLE msg_ref (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  message_in INT UNSIGNED NOT NULL COMMENT 'Foreign key to message_in',
  domain INT UNSIGNED NOT NULL COMMENT 'Foreign key to domain',
  auth SET ('author', 'spf_helo', 'spf', 'dkim', 'org', 'dmarc') NOT NULL,
  spf ENUM ('none', 'neutral', 'pass', 'fail', 'softfail', 'temperror', 'permerror') NOT NULL,
  dkim ENUM ('none', 'pass', 'fail', 'policy', 'neutral', 'temperror', 'permerror') NOT NULL,
  dkim_order TINYINT UNSIGNED NOT NULL DEFAULT 0,
  INDEX by_dom_msg(domain, message_in),
  INDEX by_msg_auth(message_in, auth)
)

In order to provide feedback about authentication, DMARC introduced aggregate reports to be sent at end-of-day to the domains who request them. A report communicates to the sender the results of authentication checks carried out by the receiver. It consists of various rows containing the source IP of the last relay (from message_in, converted), the number of messages having the same authentication results, and the results proper (from msg_ref).

Here's a query to extract an SPF and up to four DKIM results:

SELECT INET_NTOA(CONV(HEX(m.ip),16,10)) AS source, COUNT(DISTINCT(m.id)) AS count,\
m.dmarc_dispo AS disposition, \
da.domain AS author,\
dspf.domain AS spf, rspf.spf AS spf_result,\
d1.domain AS dkim1, r1.dkim AS dkim1_result,\
d2.domain AS dkim2, r2.dkim AS dkim2_result,\
d3.domain AS dkim3, r3.dkim AS dkim3_result,\
d4.domain AS dkim4, r4.dkim AS dkim4_result\
FROM message_in AS m\
LEFT JOIN (msg_ref AS rd INNER JOIN domain AS dd ON rd.domain = dd.id)\
  ON m.id = rd.message_in AND FIND_IN_SET('dmarc', rd.auth)\
LEFT JOIN (msg_ref AS ra INNER JOIN domain AS da ON ra.domain = da.id)\
  ON m.id = ra.message_in AND FIND_IN_SET('author', ra.auth)\
LEFT JOIN (msg_ref AS rspf INNER JOIN domain AS dspf ON rspf.domain = dspf.id)\
  ON m.id = rspf.message_in AND FIND_IN_SET('spf', rspf.auth)\
LEFT JOIN (msg_ref AS rhelo INNER JOIN domain AS dhelo ON rhelo.domain = dhelo.id)\
  ON m.id = rhelo.message_in AND FIND_IN_SET('spf_helo', rhelo.auth)\
LEFT JOIN (msg_ref AS r1 INNER JOIN domain AS d1 ON r1.domain = d1.id)\
  ON m.id = r1.message_in AND r1.dkim_order = 1\
LEFT JOIN (msg_ref AS r2 INNER JOIN domain AS d2 ON r2.domain = d2.id)\
  ON m.id = r2.message_in  AND r2.dkim_order = 2\
LEFT JOIN (msg_ref AS r3 INNER JOIN domain AS d3 ON r3.domain = d3.id)\
  ON m.id = r3.message_in AND r3.dkim_order = 3\
LEFT JOIN (msg_ref AS r4 INNER JOIN domain AS d4 ON r4.domain = d4.id)\
  ON m.id = r4.message_in  AND r4.dkim_order = 4\
GROUP BY source, disposition, author,\
 spf, spf_result,\
 dkim1, dkim1_result,\
 dkim2, dkim2_result,\
 dkim3, dkim3_result,\
 dkim4, dkim4_result

Not shown, there is a WHERE clause that limits the output to a specific rd.domain and an m.mtime lying within the given reporting period. The result is the content of the DMARC report, as described by wikipedia:

+-----------+-------+-------------+-------------+-------------+------------+-------------+--------------+-------------+--------------+-------------+--------------+-------------+--------------+
| source    | count | disposition | Header from | spf         | spf_result | dkim1       | dkim1_result | dkim2       | dkim2_result | dkim3       | dkim3_result | dkim4       | dkim4_result |
+-----------+-------+-------------+-------------+-------------+------------+-------------+--------------+-------------+--------------+-------------+--------------+-------------+--------------+
| 192.0.2.1 |    12 | none        | example.com | example.com | pass       | example.com | pass         | example.com | pass         | example.net | pass         | example.net | pass         |
| 192.0.2.1 |     1 | none        | example.com | example.com | pass       | example.com | pass         | example.net | pass         | example.net | pass         | NULL        | NULL         |
+-----------+-------+-------------+-------------+-------------+------------+-------------+--------------+-------------+--------------+-------------+--------------+-------------+--------------+
2 rows in set (0.004 sec)

Edited Questions:

(Removed non-working attempt and possibly ambiguous wording. Reordered most important first.)

  • How to express the query using Recursive CTE so that the resulting rows contain a variable number of dkim<n> and dkim<n>_result columns, covering all DKIM authentications that appear in the relevant messages.

  • How to significantly simplify the query by factoring repeated expressions.

Note:
The query is filtered through OpenDBX. It can use several DBMS, MariaDB v.15 is a reasonable target.

Best Answer

Like most (all?) SQL queries the number of columns in the result is fixed. Recursive CTEs can only add rows and not columns.

You are fortunate that the result you want to generate a ARF report is a XML. Because of this you can use good old GROUP_CONCAT to generate an XML form of the DKIM.

I haven't looked up the XML spec for the report however something in the form

SELECT ...,
GROUP_CONCAT(
  CONCAT('<dkimresults><dkim_domain>',
    d1.domain,
    '<dkim_domain><dkim_result>',
    r1.dkim,
    '</dkim_result></dkim_results>'
  ) ORDER BY r1.dkim_order SEPARATOR '') dkim_result
FROM message_in AS m
....

Other recommendations:

  • Use InnoDB as the table type and not MyISAM/Aria
  • If MariaDB-10.5 is the sole application use the INET6 datatype introduced in 10.5
  • DKIM enums are probably ok as defined in spec and the DKIM working group of the IETF is "Concluded WG". So nothing more is coming.
  • For non-dkim - probably separate tables of message_id, auth domain columns for each time. Then just join to those as required.

ref: fiddle