MySQL – Get first part of result faster in sorting query on hundreds of millions of rows

MySQL · performance · query-performance

I have a single table containing around 280 million rows in MariaDB 10.4.10.

I need to process this full table in an external program, sorted by timestamp. The external program is fast; the query speed is the largest factor. The query takes around 37 minutes to run (an old laptop with an SSD, running a virtual machine with 8 GB RAM), which I think is normal.

However, the first result is only returned after 17 minutes, during which the client program is just waiting.

Is there a way to make MySQL return the first results faster, even if the total query time stays the same or becomes slightly longer?

Table structure:

CREATE TABLE `eventlog` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `moment` datetime NOT NULL,
  `source_id` int(11) NOT NULL,
  `code` binary(5) NOT NULL,
  `category` binary(2) DEFAULT NULL,
  `type` enum('single','double','triple') CHARACTER SET binary NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_moment` (`moment`)
) ENGINE=InnoDB AUTO_INCREMENT=283252852 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

The table is around 12 GB on disk (data length, excluding indexes); the index is another 6.5 GB.

Query:

SELECT
  id,
  moment,
  source_id,
  code
FROM
  eventlog
ORDER BY
  moment ASC

EXPLAIN shows it is not using the index on moment to run the query.

MariaDB [test]> explain select id, moment, source_id, code from eventlog order by moment asc\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: eventlog
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 282499875
        Extra: Using filesort
1 row in set (0.017 sec)

Profiling the query shows it spends practically all of its time in "Creating sort index", even though an index on the ORDER BY column already exists. (All other stages in the profile each take less than 0.02 s):

*************************** 14. row ***************************
             Status: Creating sort index
           Duration: 999.999999
           CPU_user: 999.999999
         CPU_system: 110.469717
  Context_voluntary: 307668
Context_involuntary: 18439
       Block_ops_in: 74089616
      Block_ops_out: 50343784
      Messages_sent: 0
  Messages_received: 0
  Page_faults_major: 111
  Page_faults_minor: 6653
              Swaps: 0
    Source_function: <unknown>
        Source_file: sql_select.cc
        Source_line: 21167
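
For reference, a profile like the one above can be collected directly in the MariaDB client; a minimal sketch (the query number passed to SHOW PROFILE is whatever Query_ID SHOW PROFILES reports for the run):

SET profiling = 1;

SELECT id, moment, source_id, code FROM eventlog ORDER BY moment ASC;

SHOW PROFILES;                 -- lists recent statements with their Query_ID
SHOW PROFILE ALL FOR QUERY 1;  -- use the Query_ID reported above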

I have tried:

  • Forcing the index: I thought this would make sense, since the query is basically returning the rows in index order. EXPLAIN then shows the index will be used, but the first result arrives even later (175 minutes?!).
  • Increasing sort_buffer_size from 2M to 20M, with no effect.
  • Running with LIMIT and OFFSET: the first batch of 500 000 rows is returned in 9 seconds, which is nice, but the total query time (getting all 280 million rows in batches of 500 000) increases drastically as the OFFSET grows, so the total time the external program sits idle is even longer (a keyset variant that avoids this is sketched below).
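
For reference, a keyset ("seek") variant of that batching is sketched below. This is a sketch, not one of the attempts above: instead of an ever-growing OFFSET, each batch resumes after the last (moment, id) the client has seen, so every batch costs roughly the same. It only avoids the sort when an index leads with moment (ideally a composite (moment, id) index); verify with EXPLAIN that a range scan on that index is actually used.

-- First batch
SELECT id, moment, source_id, code
FROM eventlog
ORDER BY moment, id
LIMIT 500000;

-- Every following batch resumes after the last row of the previous batch;
-- the literals below stand for the last (moment, id) the client remembered.
SELECT id, moment, source_id, code
FROM eventlog
WHERE moment > '2019-06-01 12:34:56'
   OR (moment = '2019-06-01 12:34:56' AND id > 123456789)
ORDER BY moment, id
LIMIT 500000;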

How can I make MySQL return the first results faster without making the total runtime several times longer?

Best Answer

First, I have to ask: what will the client do with 280M rows? If it is a web page, that much data will crash the user's machine.

Here are two things to speed up getting data from that query:

  • Change the indexes to these:

    PRIMARY KEY(moment, id),  -- to cluster the data in the order to be fetched
    INDEX(id)   -- to keep AUTO_INCREMENT happy.
    

What you have now fetches all the data, then sorts it by moment, before delivering even the first row. With the indexing change (an ALTER sketch follows the list of drawbacks below), the sort is gone, and the rows can be 'streamed' directly. But will the client take advantage of that? Well...

  • Change the fetching mechanism to receive the data in chunks instead of "all at once". (The details of this choice are buried in the API the client uses to talk to MySQL. I only use "all at once", but then I never ask for 280M rows.)

Drawbacks of "chunking":

  • The entire query takes longer. (That is, more than 37 minutes.)
  • Other connections may be impacted by such a long-running query.
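
To apply the index change from the first point, a minimal sketch (this exact ALTER is an assumption, not something tested on this table; rebuilding a 280M-row InnoDB table takes a long time and needs roughly the table's size again in free disk space, so try it on a copy first):

ALTER TABLE eventlog
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (moment, id),   -- clusters the rows in fetch order
  DROP INDEX idx_moment,          -- redundant once the PK starts with moment
  ADD INDEX (id);                 -- id stays AUTO_INCREMENT, so it still needs its own index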

Consider this alternative: SELECT ... INTO OUTFILE ... and then process the resulting file as you might process a CSV file.
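
A minimal sketch of that alternative (the path, delimiters, and HEX() wrapper are placeholders/assumptions; the file is written on the database server host, the target file must not already exist, the account needs the FILE privilege, and secure_file_priv must allow the directory):

SELECT id, moment, source_id, HEX(code) AS code_hex   -- HEX() keeps the binary column CSV-safe
FROM eventlog
ORDER BY moment
INTO OUTFILE '/tmp/eventlog_sorted.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n';

This still pays the full scan-and-sort cost up front, but afterwards the external program reads a plain file at its own pace instead of holding a huge result set open on the connection.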