MySQL – Get first part of result faster in sorting query on hundreds of millions of rows

MySQL · performance · query-performance

I have a single table containing around 280 million rows in MariaDB 10.4.10.

I need to process this full table in an external program, sorted by timestamp. The external program is fast; the query speed is the largest factor. The query takes around 37 minutes to run (an old laptop with an SSD, running a virtual machine with 8 GB RAM), which I think is normal.

However, the first result is only returned after 17 minutes, during which the client program is just waiting.

Is there a way to make MySQL return the first results faster, even if the total query time stays the same or becomes slightly longer?

Table structure:

CREATE TABLE `eventlog` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `moment` datetime NOT NULL,
  `source_id` int(11) NOT NULL,
  `code` binary(5) NOT NULL,
  `category` binary(2) DEFAULT NULL,
  `type` enum('single','double','triple') CHARACTER SET binary NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_moment` (`moment`)
) ENGINE=InnoDB AUTO_INCREMENT=283252852 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

The table is around 12 GB on disk (data length, excluding indexes); the index is another 6.5 GB.

Query:

SELECT
  id,
  moment,
  source_id,
  code
FROM
  eventlog
ORDER BY
  moment ASC

EXPLAIN shows it is not using the index on moment to run the query.

MariaDB [test]> explain select id, moment, source_id, code from eventlog order by moment asc\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: eventlog
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 282499875
        Extra: Using filesort
1 row in set (0.017 sec)

Profiling the query shows it spends practically all of its time in "Creating sort index", even though an index on the ORDER BY column already exists. (All other stages in the profile each take less than 0.02 s):

*************************** 14. row ***************************
             Status: Creating sort index
           Duration: 999.999999
           CPU_user: 999.999999
         CPU_system: 110.469717
  Context_voluntary: 307668
Context_involuntary: 18439
       Block_ops_in: 74089616
      Block_ops_out: 50343784
      Messages_sent: 0
  Messages_received: 0
  Page_faults_major: 111
  Page_faults_minor: 6653
              Swaps: 0
    Source_function: <unknown>
        Source_file: sql_select.cc
        Source_line: 21167
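
For reference, a profile like the one above can be collected directly in the MariaDB client; a minimal sketch (the query number passed to SHOW PROFILE is whatever Query_ID SHOW PROFILES reports for the run):

SET profiling = 1;

SELECT id, moment, source_id, code FROM eventlog ORDER BY moment ASC;

SHOW PROFILES;                 -- lists recent statements with their Query_ID
SHOW PROFILE ALL FOR QUERY 1;  -- use the Query_ID reported above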

I have tried:

  • Forcing the index: I thought this would make sense, since the query is basically returning the rows in index order. EXPLAIN then shows the index will be used, but the first result arrives even later (175 minutes?!).
  • Increasing sort_buffer_size from 2M to 20M, with no effect.
  • Running with LIMIT and OFFSET: the first batch of 500 000 rows is returned in 9 seconds, which is nice, but the total query time (getting all 280 million rows in batches of 500 000) increases drastically as the OFFSET grows, so the total time the external program sits idle is even longer (a keyset variant that avoids this is sketched below).
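
For reference, a keyset ("seek") variant of that batching is sketched below. This is a sketch, not one of the attempts above: instead of an ever-growing OFFSET, each batch resumes after the last (moment, id) the client has seen, so every batch costs roughly the same. It only avoids the sort when an index leads with moment (ideally a composite (moment, id) index); verify with EXPLAIN that a range scan on that index is actually used.

-- First batch
SELECT id, moment, source_id, code
FROM eventlog
ORDER BY moment, id
LIMIT 500000;

-- Every following batch resumes after the last row of the previous batch;
-- the literals below stand for the last (moment, id) the client remembered.
SELECT id, moment, source_id, code
FROM eventlog
WHERE moment > '2019-06-01 12:34:56'
   OR (moment = '2019-06-01 12:34:56' AND id > 123456789)
ORDER BY moment, id
LIMIT 500000;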

How can I make MySQL return the first results faster without making the total runtime several times longer?

Best Answer

First, I have to ask: what will the client do with 280M rows? If it is a web page, that much data will crash the user's machine.

Here are two things to speed up getting data from that query:

  • Change the indexes to these:

    PRIMARY KEY(moment, id),  -- to cluster the data in the order to be fetched
    INDEX(id)   -- to keep AUTO_INCREMENT happy.
    

What you have now fetches all the data, then sorts it by moment, before delivering even the first row. With the indexing change (an ALTER sketch follows the list of drawbacks below), the sort is gone, and the rows can be 'streamed' directly. But will the client take advantage of that? Well...

  • Change the fetching mechanism to receive the data in chunks instead of "all at once". (The details of this choice are buried in the API the client uses to talk to MySQL. I only use "all at once", but then I never ask for 280M rows.)

Drawbacks of "chunking":

  • The entire query takes longer. (That is, more than 37 minutes.)
  • Other connections may be impacted by such a long-running query.
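
To apply the index change from the first point, a minimal sketch (this exact ALTER is an assumption, not something tested on this table; rebuilding a 280M-row InnoDB table takes a long time and needs roughly the table's size again in free disk space, so try it on a copy first):

ALTER TABLE eventlog
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (moment, id),   -- clusters the rows in fetch order
  DROP INDEX idx_moment,          -- redundant once the PK starts with moment
  ADD INDEX (id);                 -- id stays AUTO_INCREMENT, so it still needs its own index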

Consider this alternative: SELECT ... INTO OUTFILE ... and then process the resulting file as you might process a CSV file.
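
A minimal sketch of that alternative (the path, delimiters, and HEX() wrapper are placeholders/assumptions; the file is written on the database server host, the target file must not already exist, the account needs the FILE privilege, and secure_file_priv must allow the directory):

SELECT id, moment, source_id, HEX(code) AS code_hex   -- HEX() keeps the binary column CSV-safe
FROM eventlog
ORDER BY moment
INTO OUTFILE '/tmp/eventlog_sorted.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n';

This still pays the full scan-and-sort cost up front, but afterwards the external program reads a plain file at its own pace instead of holding a huge result set open on the connection.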