Sql-server – select rows on distinct column value without vendor-specific sql syntax

db2distinctMySQLpostgresqlsql server

I'm trying to select distinct rows without any vendor specific syntax. My table layout is as follows:

+--------------------------------------+---------+--------------------------+----------------------------+------------+---------------+---------------+-------------+-----------+--------------------------------------+------------+---------+
|                  id                  | version |        jobAsJson         |        jobSignature        |   state    |   createdAt   |   updatedAt   | scheduledAt | serverTag |                mutex                 | awaitingOn | isBatch |
+--------------------------------------+---------+--------------------------+----------------------------+------------+---------------+---------------+-------------+-----------+--------------------------------------+------------+---------+
| 6b56ef47-46e2-4905-9b4e-d8923306f98d |       0 | {some json value I want} | System.out.println(String) | PROCESSING | 1594385831872 | 1594385831872 | null        | DEFAULT   | resource-a                           | null       |       0 |
| c0514d03-5bd1-494f-9978-dfbc24f06d67 |       0 | {some json value I want} | System.out.println(String) | ENQUEUED   | 1594385832874 | 1594385832874 | null        | DEFAULT   | resource-a                           | null       |       0 |
| 6b33ef73-8aa7-4d83-a171-b30da6a95c5f |       0 | {some json value I want} | System.out.println(String) | ENQUEUED   | 1594385833874 | 1594385833874 | null        | DEFAULT   | resource-a                           | null       |       0 |
| d4d75118-57cc-476c-a86d-a6b73fc30d37 |       0 | {some json value I want} | System.out.println(String) | ENQUEUED   | 1594385834874 | 1594385834874 | null        | DEFAULT   | resource-a                           | null       |       0 |
| 2cfbf072-5233-496d-9236-28e66cddf054 |       0 | {some json value I want} | System.out.println(String) | ENQUEUED   | 1594385835874 | 1594385835874 | null        | DEFAULT   | resource-b                           | null       |       0 |
| d8f2d419-f394-4d4b-b375-4f1611b28170 |       0 | {some json value I want} | System.out.println(String) | ENQUEUED   | 1594385836875 | 1594385836875 | null        | DEFAULT   | resource-b                           | null       |       0 |
| 3cdf7878-0716-4928-a059-2a74b4172c74 |       0 | {some json value I want} | System.out.println(String) | ENQUEUED   | 1594385837875 | 1594385837875 | null        | DEFAULT   | resource-c                           | null       |       0 |
| d0bfffac-9d5b-4c7e-a82b-ca0f93d1a1da |       0 | {some json value I want} | System.out.println(String) | ENQUEUED   | 1594385838875 | 1594385838875 | null        | DEFAULT   | 79333f04-ab41-41dc-b004-8b2d74055d38 | null       |       0 |
| c1d568f3-f1e1-4d4d-9ca5-21129bf6d066 |       0 | {some json value I want} | System.out.println(String) | ENQUEUED   | 1594385839875 | 1594385839875 | null        | DEFAULT   | 1b8b941c-6bdc-45cf-93b5-1856f9de404a | null       |       0 |
+--------------------------------------+---------+--------------------------+----------------------------+------------+---------------+---------------+-------------+-----------+--------------------------------------+------------+---------+

I'm trying to select all rows where state = 'ENQUEUED', servertag is in ('DEFAULT'), the rows should be ordered by createdAt AND the rows should be unique on the field mutex.

The goal is to only have 1 row with state = 'PROCESSING' for each mutex. The selected rows with STATE='ENQUEUED' will be updated to 'PROCESSING'. In the example here, the expected results are:

  • 2cfbf072-5233-496d-9236-28e66cddf054 -> first job with mutex = resource-b
  • 3cdf7878-0716-4928-a059-2a74b4172c74 -> first job with mutex = resource-c
  • d0bfffac-9d5b-4c7e-a82b-ca0f93d1a1da -> no mutex thus unique value to make query easier
  • c1d568f3-f1e1-4d4d-9ca5-21129bf6d066 -> no mutex thus unique value to make query easier

=> no job with mutex resource-a is returned as it is already being used by the first job.

The query I have now works for SQLite but not for other databases (mySQL, oracle, DB2) and is as follows:

select 
    j.jobAsJson 
from 
    (select id 
     from jobrunr_jobs 
     where state = 'ENQUEUED' 
     AND servertag in ('DEFAULT') 
     and mutex NOT IN (select mutex from jobrunr_jobs where state = 'PROCESSING') 
     GROUP BY mutex order by createdat) r 
INNER JOIN 
    jobrunr_jobs j ON r.id = j.id

I would like to have a query that is database vendor agnostic.

Postgres and SQL Server complain about the fact that a GROUP BY is used and id is not part of it.

Best Answer

Errors?

First off, I suspect a subtle error in your query:

...
AND servertag IN ('DEFAULT') 
AND mutex NOT IN (select mutex from jobrunr_jobs where state = 'PROCESSING') 
...

You limit results to mutex with servertag 'DEFAULT', but the restriction is not repeated in the NOT IN subquery. So you exclude mutex that have a row with state 'PROCESSING' and any servertag. May be intentional, but it smells like an error. Even if nothing breaks (because of mutually exclusive sets of mutex values), it is probably still bad for performance.

Also, mutex NOT IN (<subselect>) is a notoriously treacherous construct. If mutex can be NULL, the query will break unexpectedly. In Postgres, this construct is generally discouraged. I assume the same is true for most RDBMS.

Query

Basically, you want the latest row per mutex with state = 'ENQUEUED' where no row with state = 'PROCESSING' exists. I assume you want to restrict both conditions to rows with servertag = 'DEFAULT':

SELECT jobAsJson          -- careful with mixed-case spelling!
FROM   jobrunr_jobs AS j
WHERE  state = 'ENQUEUED'
AND    servertag = 'DEFAULT'
AND    NOT EXISTS (
   SELECT 1
   FROM   jobrunr_jobs AS j2
   WHERE  j2.mutex = j.mutex
   AND    j2.servertag = 'DEFAULT'  -- assuming you want this?
   AND   (j2.state = 'PROCESSING'
       OR j2.state = 'ENQUEUED' AND j2.createdat > j.createdat)
   );

EXISTS is very basic standard SQL, and work in every halfway decent RDBMS - unlike CTEs and window functions, which are later additions and not supported everywhere.

Also removes the need for the dubious NOT IN.

And it should be fast, especially with index support.

Aside 1: to be "DB-agnostic" (which is never fully possible) don't use mixed-case identifiers, which are treated differently across various RDBMS.

Aside 2: Assuming you then want to mark identified rows as 'PROCESSING'. Then you'll want to return more than just jobAsJson. Or UPDATE identified rows right away. See: