MySQL Queue – How to Merge Multiple Queues Efficiently


Consider the following table:

CREATE TABLE `multiqueue` (
    `ID` BIGINT(20) NOT NULL AUTO_INCREMENT,
    `CustomerID` BIGINT(20) NOT NULL,
    `Volume` INT(11) NOT NULL,
    `Content` MEDIUMTEXT NOT NULL COLLATE 'utf8mb4_unicode_ci',
    `PublishedTS` DATETIME NULL DEFAULT NULL,
    PRIMARY KEY (`ID`) USING BTREE
)
COLLATE='utf8mb4_unicode_ci'
ENGINE=InnoDB
;

This table serves as a multi-queue, meaning that it aggregates the queues of requests coming from multiple customers (denoted by CustomerID), each request having a certain Volume of work.

How can I write a query that selects the top N rows from the table, interleaving the rows from different customers?

If customer 1 sends 100 requests, each of volume 1000, and then customer 2 sends 20 requests, each of volume 300, the query should not leave either customer starving for responses while my program is busy handling the other customer's requests. It should take 1 request of customer 1 and 3-4 requests of customer 2 in the first fetch, process them, then take 1 more request of customer 1 and 3-4 requests of customer 2, and so on.

What I have tried so far:

SET @runtot := 0;
SELECT q1.id1, q1.customerId1, q1.volume1, q1.content1, (@runtot := @runtot + q1.volume1) AS rt
FROM (
  SELECT ID AS id1, CustomerID AS customerId1, Volume AS volume1, Content AS content1
  FROM multiqueue
  ORDER BY id1
) AS q1
WHERE @runtot < 2000

As described here, the code above limits the number of items selected with a running total of some field in the rows selected. In the scenario above, customer 2 would starve when the above query is in use.

The database in use is MariaDB (version 10.4.13, but I can upgrade to the latest if needed), though a solution for MySQL should also work.

Best Answer

For recent versions (MariaDB 10.2+ or MySQL 8.0+), you can use window functions:

SELECT ID, CustomerID, Volume, Content, runtot
FROM ( 
    SELECT ID
         , CustomerID
         , Volume
         , Content
         , sum(Volume) over (
                partition by CustomerID 
                order by ID
           ) as runtot
    FROM multiqueue
) AS q1 
WHERE runtot < 2000;
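
As a rough check of what this filter returns, here is a sketch with hypothetical sample rows (the `Content` values are made up, and the table is assumed to be empty beforehand) that mimic the scenario from the question on a smaller scale. Because the running total is computed per customer, each customer gets its own budget of 2000:

```sql
-- Hypothetical sample data: 3 requests of volume 1000 for customer 1,
-- then 8 requests of volume 300 for customer 2.
INSERT INTO multiqueue (CustomerID, Volume, Content) VALUES
    (1, 1000, 'c1-a'), (1, 1000, 'c1-b'), (1, 1000, 'c1-c'),
    (2, 300, 'c2-a'), (2, 300, 'c2-b'), (2, 300, 'c2-c'), (2, 300, 'c2-d'),
    (2, 300, 'c2-e'), (2, 300, 'c2-f'), (2, 300, 'c2-g'), (2, 300, 'c2-h');

-- Count how many rows per customer pass the per-customer budget.
SELECT CustomerID, COUNT(*) AS picked, MAX(runtot) AS volume_picked
FROM (
    SELECT CustomerID
         , SUM(Volume) OVER (
                PARTITION BY CustomerID
                ORDER BY ID
           ) AS runtot
    FROM multiqueue
) AS q1
WHERE runtot < 2000
GROUP BY CustomerID;
-- Customer 1: running totals 1000, 2000, 3000 -> 1 row  (volume 1000)
-- Customer 2: running totals 300 ... 2400     -> 6 rows (volume 1800)
```

The threshold acts as a per-customer budget, so lowering it changes the mix: with a smaller budget, customer 2 gets fewer rows per fetch.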

EDIT:

If you want at least 1 row for each customer, you can add another window function, ROW_NUMBER(), that numbers each customer's rows, and use it in the filter:

SELECT ID, CustomerID, Volume, Content, runtot
FROM ( 
    SELECT ID
         , CustomerID
         , Volume
         , Content
         , sum(Volume) over (
                partition by CustomerID 
                order by ID
           ) as runtot
         , row_number() over (
                partition by CustomerID 
                order by ID
           ) as rn

    FROM multiqueue
) AS q1 
WHERE runtot < 2000
   OR rn = 1;
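
For repeated fetches, the same idea might be extended as in the sketch below. It rests on assumptions that are not stated in the question: that `PublishedTS IS NULL` marks a request that has not been processed yet, and that the program sets `PublishedTS` after handling each row. The budget of 1200 and the `<=` comparison are illustrative values only.

```sql
SELECT ID, CustomerID, Volume, Content, runtot
FROM (
    SELECT ID
         , CustomerID
         , Volume
         , Content
         , SUM(Volume) OVER (
                PARTITION BY CustomerID
                ORDER BY ID
           ) AS runtot
         , ROW_NUMBER() OVER (
                PARTITION BY CustomerID
                ORDER BY ID
           ) AS rn
    FROM multiqueue
    WHERE PublishedTS IS NULL      -- assumption: NULL means "not processed yet"
) AS q1
WHERE runtot <= 1200               -- hypothetical per-customer budget per fetch
   OR rn = 1                       -- always take at least one row per customer
ORDER BY rn, CustomerID;           -- interleave customers in the result
```

With the volumes from the question (1000 for customer 1, 300 for customer 2), this budget would pick 1 request of customer 1 and 4 requests of customer 2 per fetch, provided the processed rows are marked between fetches.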

Related Solutions

MySQL – Use MySQL to regularly do multi-way joins on 100+ GB tables

Have you tried piling more data and benchmarking it? 100K rows is inconsequential. Try 250M or 500M like you're expecting you'll need to handle and see where the bottlenecks are.

An RDBMS can do a lot of things if you pay careful attention to the limitations and try and work with the strengths of the system. They're exceptionally good at some things, and terrible at others, so you will need to experiment to be sure it's the right fit.

For some batch processing jobs, you really cannot beat flat files, loading the data into RAM, smashing it around using a series of loops and temporary variables, and dumping out the results. MySQL will never, ever be able to match that sort of speed, but if tuned properly and used correctly it can come within an order of magnitude.

What you'll want to do is investigate how your data can be partitioned. Do you have one big set of data with too much in the way of cross-links to be able to split it up, or are there natural places to partition it? If you can partition it you won't have one table with a whole pile of rows, but potentially many significantly smaller ones. Smaller tables, with much smaller indexes, tend to perform better.

From a hardware perspective, you'll need to test to see how your platform performs. Sometimes memory is essential. Other times it's disk I/O. It really depends on what you're doing with the data. You'll need to pay close attention to your CPU usage and look for high levels of IO wait to know where the problem lies.

Whenever possible, split your data across multiple systems. You can use MySQL Cluster if you're feeling brave, or simply spin up many independent instances of MySQL where each stores an arbitrary portion of the complete data set using some partitioning scheme that makes sense.

MySQL – How to Decide Which Execution Plan is Better

Multiplying the rows is invalid for several reasons:

Many times, the rows examined are an approximation (based on statistics, not accurate), good for query plan selection, but not for performance calculation.
The total number of rows examined in a nested loop join (A, B) is not rows_examined_on_table_A * rows_examined_on_table_B, but rows_examined_on_table_A + rows_returned_from_table_A * rows_examined_on_table_B. WHERE clauses can make a huge difference here, although the product is often used as a broad approximation, assuming the indexes were created properly and are the main cause of filtering out results.
Modern MySQL versions do not always use a nested loop join to execute joins and subqueries. Check the 5.6 subquery optimizations and the other optimization documents in the same manual. Additionally, some of the new optimization techniques do not modify the predicted examined rows, so the actual number can at times be far lower than the one printed, even if it has been calculated exactly.

In particular, your first query hits a well-known MySQL bug (or limitation) in which an IN subquery is identified as a DEPENDENT SUBQUERY even when it really isn't, forcing the outermost query to be executed without an index (a full table scan) in order to test all possible values of the first table. That is usually an indicator of a bad query. It does not seem too bad in this case, as the table is small, but it usually signals poor performance.

The other thing that should draw your attention is the Using temporary; Using filesort. Filtering is not the only thing to focus on: these extra pieces of information tell you that a large sort has to be done using a temporary table (which may or may not end up on disk, but at least has to be materialized). That is another indicator of potentially bad performance, which in some cases can be avoided with the right indexes.

I will not tell you which is the right query to use (partly because I do not know all the variables: indexes, table structure, etc., and in most cases it will depend on the particular hardware/resources available), but I will tell you the tools to decide:

Profile the query: obtain the post-execution times and see where the time is being spent. You can use SHOW PROFILES up to 5.5, and the performance_schema starting with 5.6.
Because timings can vary (for example, depending on other queries being executed at the same time, or on the buffer pool contents), also obtain post-execution statistics with SHOW SESSION STATUS. In particular:
```
FLUSH SESSION STATUS;
SELECT ... ;
SHOW STATUS like 'Hand%';
```
will give you the exact number of handler calls done (approximately, the number of rows read and written for that particular query- although that is not 100% accurate, as it depends on the particular engine implementation).

You may also want to monitor other status variables, like the created temporary tables, created temporary tables on disk and sort passes/sorted rows.
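
The status variables mentioned above can be read with the same kind of pattern match. A minimal sketch, assuming it runs in the same session right after the query under test (and after the counters were reset as in the block above):

```sql
-- Temporary tables created by the session, in memory and on disk
-- (Created_tmp_tables, Created_tmp_disk_tables, ...).
SHOW SESSION STATUS LIKE 'Created_tmp%';

-- Sorting work done by the session
-- (Sort_merge_passes, Sort_rows, Sort_scan, Sort_range).
SHOW SESSION STATUS LIKE 'Sort%';
```

A non-zero Created_tmp_disk_tables or Sort_merge_passes suggests that the temporary table or the sort did not fit in memory.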

All of these will give you post-execution, exact, time-independent parameters to evaluate the performance of a query. Percona even has a patch for the slow log to output that information on the logs instead of using performance_schema.

With those extra pieces of information you will be able to evaluate more objectively which query is better, rather than relying exclusively on EXPLAIN, which only provides limited pre-execution information.
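
The profiling step mentioned earlier might look like the sketch below. `SELECT ... ;` stands for the query under test, and the query number 1 is only an example; use the Query_ID that SHOW PROFILES reports. Note that SHOW PROFILE is deprecated in newer MySQL versions in favor of the performance_schema, so treat this as an option for servers where it is still available.

```sql
SET profiling = 1;           -- enable profiling for the current session
SELECT ... ;                 -- the query under test
SHOW PROFILES;               -- recent statements with their Query_ID and duration
SHOW PROFILE FOR QUERY 1;    -- time spent in each execution stage of that statement
SET profiling = 0;           -- disable profiling again
```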
