Mariadb – Get all rows where specific column value occurs more than once (filter out single occurences)

mariadbmariadb-10.3query

I am trying to filter all sessionIds that occur once out of an existing result set.

This query is being used in a web application and runs on a big dataset (~ 35 million rows), so I want to prevent having subqueries here.

I tried this, which provides a filtered result, except that I now only get one row for each sessionId (and I want every request and response):

CREATE TABLE `api_log` (
  `id` varchar(50) NOT NULL,
  `clientId` varchar(100) DEFAULT NULL,
  `inserted` int(11) DEFAULT NULL,
  `sessionId` mediumtext DEFAULT NULL,
  `stage` varchar(120) DEFAULT NULL,
  `request` longtext CHARACTER SET utf8mb4 DEFAULT NULL,
  `response` longtext CHARACTER SET utf8mb4 DEFAULT NULL,
  PRIMARY KEY (`id`)
);

INSERT INTO api_log
  VALUES
    ("1", "abc", 1621008484, "session1", "production", '{"key":"value"}', '{"key":"value"}'),
    ("2", "abc", 1621008494, "session2", "production", '{"key":"value"}', '{"key":"value"}'),
    ("3", "abc", 1621008584, "session1", "production", '{"key":"value"}', '{"key":"value"}'),
    ("4", "abc", 1621008684, "session2", "production", '{"key":"value"}', '{"key":"value"}'),
    ("5", "abc", 1621008784, "session3", "production", '{"key":"value"}', '{"key":"value"}'),
    ("6", "abc", 1621008884, "session4", "production", '{"key":"value"}', '{"key":"value"}'),
    ("7", "abc", 1621008984, "session5", "production", '{"key":"value"}', '{"key":"value"}'),
    ("8", "abc", 1621009084, "session6", "production", '{"key":"value"}', '{"key":"value"}'),
    ("9", "abc", 1621009184, "session7", "production", '{"key":"value"}', '{"key":"value"}'),
    ("10", "abc", 1621009284, "session8", "production", '{"key":"value"}', '{"key":"value"}');
SELECT
    `clientId`,
    `sessionId`,
    `inserted`,
    `stage`,
    `request`,
    `response`
FROM
    `api_log`
WHERE
    (stage = 'production') 
    AND (clientId = 'abc') 
    AND (
        `inserted` BETWEEN 1621008482 AND 1621009285
    )
GROUP BY
    `clientId`,
    `stage`,
    `sessionId`
HAVING
    COUNT(sessionId) > 1

Is there any trick to get all rows where a sessionId occurs more than once?

In this case, I get two rows, one for session1 and one for session2, but I am missing two more because both mentioned sessionIds have an additional row that should match.

SQL fiddle: http://sqlfiddle.com/#!9/f87bb54/3

Best Answer

If I understand your requirement correctly, this is the kind of problem where window aggregation shines. Use the window version of the COUNT(*) function on the filtered dataset to obtain the counts alongside the other columns. Then filter on the count results to get only the rows you want. Your output can include any or all of the columns your table has:

SELECT
  id
, clientId
, inserted
, sessionId
, stage
, request
, response
FROM
  (
    SELECT
      *
    , COUNT(*) OVER (PARTITION BY sessionId) AS sessionIdCounter
    FROM
      api_log
    WHERE (stage = 'production') 
      AND (clientId = 'abc') 
      AND (inserted BETWEEN 1621008482 AND 1621009285)
  ) AS derived
WHERE
  sessionIdCounter > 1
ORDER BY
  inserted ASC
;

You can play with this solution at dbfiddle.uk: