Filter rows, but only select most recent row for each Id

cassandra

I have some code that is processing a lot of tasks. These tasks are either connected to a user or an account. Each Task go through many statuses -> Queued, Started, Error, Postponed, Queued, Started, Finished …

Since there will be processed a lot of these Tasks very fast, the goal was to only do writes. As you can see from the comments, the other goal of saving these statuses is to be able to show the Task status in a user interface.

This was solved like so:

-- Used to store processing data for a Task
drop table if exists task;
create table task (
    id uuid,
    data text,
    type text,
    primary key (id)
);

-- Used to find all statuses for a specific task
-- Use case: View one Task i Admin interface
drop table if exists task_status;
create table task_status (
    task_id uuid,
    time_id timeuuid,
    status_enum text,
    status_text text,
    primary key ((task_id), time_id)
) with clustering order by (time_id desc);

-- Used to find Tasks for an Account
-- Use case: Show Account Details > Tasks
drop table if exists account_tasks;
create table account_tasks (
    account_id bigint,
    task_id uuid,
    time_id timeuuid,
    primary key ((account_id), time_id)
) with clustering order by (time_id desc);

-- Used to find Tasks for a User
-- Use case: Show User Details > Tasks
drop table if exists user_tasks;
create table user_tasks (
    user_id bigint,
    task_id uuid,
    time_id timeuuid,
    primary key ((user_id), time_id)
) with clustering order by (time_id desc);

When a Task is created, a row will be written to "task" and depending on which task is it, also to "user_tasks" or "account_tasks".
During processing of the Task multiple rows will be written to "task_status" with different statuses and a user friendly text describing what happened in the step.

I also need to be able to find all the Tasks that currently have a specific status, but also using a filter on type.

I have tried something like this:

-- Used to find tasks with a specific status and type
-- Use case: Admin Task overview showing all tasks
-- select * from task_overview where status_enum = 'Failed' and type = 'ActivateAccount'
drop table if exists task_overview;
create table task_overview (
    status_enum text,
    type text,
    task_id uuid,
    time_id timeuuid,
    primary key ((status_enum, type), time_id)
) with clustering order by (time_id desc);

I would write to this at the same time as "task_status", however when filtering on this table, I am not sure that was actually the last status the Task had. It just shows that the task had this status at some point. I only want the row returned, if for all rows with a specific task_id this is the one with the highest time_id.

I have spend a lot of time thinking of other data structures that could support this only doing writes, but I have not been successful 🙁

Said in other words, I need to be able to do a query where I filter on status_enum and type, that only returns the Tasks that currently have that status_enum.

Best Answer

One thing you can do is create another table which will store the current status of every task. So whenever you are updating your task_overview you will also update this new table keeping only the latest status. This will make table task_overview more or less like a log.