Mariadb – Get all rows where specific column value occurs more than once (filter out single occurences)

mariadbmariadb-10.3query

I am trying to filter all sessionIds that occur once out of an existing result set.

This query is being used in a web application and runs on a big dataset (~ 35 million rows), so I want to prevent having subqueries here.

I tried this, which provides a filtered result, except that I now only get one row for each sessionId (and I want every request and response):

CREATE TABLE `api_log` (
  `id` varchar(50) NOT NULL,
  `clientId` varchar(100) DEFAULT NULL,
  `inserted` int(11) DEFAULT NULL,
  `sessionId` mediumtext DEFAULT NULL,
  `stage` varchar(120) DEFAULT NULL,
  `request` longtext CHARACTER SET utf8mb4 DEFAULT NULL,
  `response` longtext CHARACTER SET utf8mb4 DEFAULT NULL,
  PRIMARY KEY (`id`)
);

INSERT INTO api_log
  VALUES
    ("1", "abc", 1621008484, "session1", "production", '{"key":"value"}', '{"key":"value"}'),
    ("2", "abc", 1621008494, "session2", "production", '{"key":"value"}', '{"key":"value"}'),
    ("3", "abc", 1621008584, "session1", "production", '{"key":"value"}', '{"key":"value"}'),
    ("4", "abc", 1621008684, "session2", "production", '{"key":"value"}', '{"key":"value"}'),
    ("5", "abc", 1621008784, "session3", "production", '{"key":"value"}', '{"key":"value"}'),
    ("6", "abc", 1621008884, "session4", "production", '{"key":"value"}', '{"key":"value"}'),
    ("7", "abc", 1621008984, "session5", "production", '{"key":"value"}', '{"key":"value"}'),
    ("8", "abc", 1621009084, "session6", "production", '{"key":"value"}', '{"key":"value"}'),
    ("9", "abc", 1621009184, "session7", "production", '{"key":"value"}', '{"key":"value"}'),
    ("10", "abc", 1621009284, "session8", "production", '{"key":"value"}', '{"key":"value"}');

SELECT
    `clientId`,
    `sessionId`,
    `inserted`,
    `stage`,
    `request`,
    `response`
FROM
    `api_log`
WHERE
    (stage = 'production') 
    AND (clientId = 'abc') 
    AND (
        `inserted` BETWEEN 1621008482 AND 1621009285
    )
GROUP BY
    `clientId`,
    `stage`,
    `sessionId`
HAVING
    COUNT(sessionId) > 1

Is there any trick to get all rows where a sessionId occurs more than once?

In this case, I get two rows, one for session1 and one for session2, but I am missing two more because both mentioned sessionIds have an additional row that should match.

SQL fiddle: http://sqlfiddle.com/#!9/f87bb54/3

Best Answer

If I understand your requirement correctly, this is the kind of problem where window aggregation shines. Use the window version of the COUNT(*) function on the filtered dataset to obtain the counts alongside the other columns. Then filter on the count results to get only the rows you want. Your output can include any or all of the columns your table has:

SELECT
  id
, clientId
, inserted
, sessionId
, stage
, request
, response
FROM
  (
    SELECT
      *
    , COUNT(*) OVER (PARTITION BY sessionId) AS sessionIdCounter
    FROM
      api_log
    WHERE (stage = 'production') 
      AND (clientId = 'abc') 
      AND (inserted BETWEEN 1621008482 AND 1621009285)
  ) AS derived
WHERE
  sessionIdCounter > 1
ORDER BY
  inserted ASC
;

You can play with this solution at dbfiddle.uk:

https://dbfiddle.uk/?rdbms=mariadb_10.3&fiddle=94d4a17c3eb18de3cc0a5a39b5b66470

Related Solutions

Mariadb – Deleting rows from table where values less than an aggregate value

You can select the rows to keep using a left join - http://sqlfiddle.com/#!9/3a677/3

select *
from media_views m1
left join media_views m2
  on (m1.companyid = m2.companyid
      and m1.channel = m2.channel
      and m1.date = m2.date
      and m1.views < m2.views)
where m2.companyid is null;

And you can delete the other ones using similar a inner join - http://sqlfiddle.com/#!9/d5d36/2

delete m1 from media_views m1
join media_views m2
  on (m1.companyid = m2.companyid
      and m1.channel = m2.channel
      and m1.date = m2.date
      and m1.views < m2.views);

And do not forget to modify the primary key after deleting them so it is not possible anymore.

Mysql – Retrieving most recent X rows for each given user from a table

I hope you are willing to do some programming outside of SQL.

To find the recipients that need the most pruning:

SELECT recipient
    FROM inbox
    GROUP BY recipient
    HAVING COUNT(*) > 30
    ORDER BY COUNT(*) DESC
    LIMIT 100

Then, for each recipient, first find the 31st id:

SELECT id
    FROM inbox
    WHERE recipient = $recipient
    ORDER BY id DESC
    LIMIT 30,1

And delete the excess baggage:

DELETE FROM inbox
    WHERE recipient = $recipient
      AND id < $id

You would need INDEX(recipient, id).

This process could be running continually. The first SQL would be the slowest, but it would run "Using index" and probably not be too bad, maybe 0.1 sec for 200K rows in inbox.

If you want to stay in SQL, this would let you do one recipient at a time:

BEGIN;
SELECT @recipient := recipient
    FROM inbox
    GROUP BY recipient
    HAVING COUNT(*) > 30
    ORDER BY COUNT(*) DESC
    LIMIT 1
    FOR UPDATE;
SELECT @id := id
    FROM inbox
    WHERE recipient = @recipient
    ORDER BY id DESC
    LIMIT 30,1
    FOR UPDATE;
DELETE FROM inbox
    WHERE recipient = @recipient
      AND id < @id;
COMMIT;

Then, put that in a Stored Procedure with a loop around it. And, if you want to be further 'nice', add a SELECT SLEEP(1); in the loop but outside the transaction.

(Trying to do the entire thing in a single statement makes my brain hurt.)

Best Answer

Related Solutions

Mariadb – Deleting rows from table where values less than an aggregate value

Mysql – Retrieving most recent X rows for each given user from a table

Related Question