Get qualifying rows only
One way:
WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
, y AS (
SELECT *
FROM x
WHERE x.rn = 1
AND date_out >= '2012-10-01'
AND date_out < '2012-11-01'
)
, z AS (
SELECT x.*
FROM x
JOIN y USING (reference)
WHERE x.rn = 2
AND x.centreid = 1
)
SELECT y.taskid, y.reference, y.centreid, y.date_out
FROM y
JOIN z USING (reference)
UNION ALL
SELECT taskid, reference, centreid, date_out
FROM z
ORDER BY reference, date_out;
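If you want to watch the rn = 1 / rn = 2 pairing work without a Postgres instance at hand, here is a minimal sketch using Python's sqlite3 (needs SQLite 3.25+ for window functions; the sample rows and schema are made up to match the queries above):

```python
import sqlite3

# Assumed illustrative schema: (taskid, reference, centreid, date_out).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tbl (taskid INTEGER, reference INTEGER, centreid INTEGER, date_out TEXT);
INSERT INTO tbl VALUES
  (1, 100, 1, '2012-09-15'),
  (2, 100, 2, '2012-10-05'),   -- rn = 1 for ref 100, falls in October
  (3, 200, 1, '2012-10-10'),   -- rn = 1 for ref 200, but its rn = 2 row fails centreid = 1
  (4, 200, 3, '2012-09-01');
""")
rows = con.execute("""
WITH x AS (
    SELECT *, row_number() OVER (PARTITION BY reference
                                 ORDER BY date_out DESC) AS rn
    FROM tbl
)
SELECT y.taskid, z.taskid          -- latest row paired with its predecessor
FROM x y
JOIN x z USING (reference)
WHERE y.rn = 1
  AND y.date_out >= '2012-10-01' AND y.date_out < '2012-11-01'
  AND z.rn = 2
  AND z.centreid = 1
""").fetchall()
print(rows)  # [(2, 1)] -- only reference 100 qualifies
```

Reference 200 drops out because its second-latest row has centreid = 3, exactly as the WHERE clause on the rn = 2 row demands.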
Another way:
WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
,y AS (
SELECT x.*, y.taskid AS taskid2, y.centreid AS centreid2, y.date_out AS date_out2
FROM x
JOIN x y USING (reference)
WHERE x.rn = 1
AND x.date_out >= '2012-10-01'
AND x.date_out < '2012-11-01'
AND y.rn = 2
AND y.centreid = 1
)
SELECT y.taskid, y.reference, y.centreid, y.date_out
FROM y
UNION ALL
SELECT y.taskid2, y.reference, y.centreid2, y.date_out2
FROM y
ORDER BY reference, date_out;
I'd expect the second one to be faster, but that depends on your data distribution. Test with EXPLAIN ANALYZE.
Get all rows for qualifying references
WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
,y AS (
SELECT reference
FROM x
JOIN x y USING (reference)
WHERE x.rn = 1
AND x.date_out >= '2012-10-01'
AND x.date_out < '2012-11-01'
AND y.rn = 2
AND y.centreid = 1
)
SELECT *
FROM tbl
JOIN y USING (reference)
ORDER BY reference, date_out;
-> sqlfiddle
Answer to follow-up in comment
Separate groups if more than 30 days between entries.
WITH a AS (
SELECT *
,lag(date_out) OVER (PARTITION BY reference ORDER BY date_out DESC) AS last_date
,CASE WHEN date_out >
(lag(date_out) OVER (PARTITION BY reference ORDER BY date_out DESC) - 30)
THEN 0 ELSE 1
END AS step
FROM tbl
)
,b AS (
SELECT *
,sum(step) OVER (PARTITION BY reference ORDER BY date_out DESC) AS grp
FROM a
)
,c AS (
SELECT *
,row_number() OVER (PARTITION BY reference, grp ORDER BY date_out DESC) AS rn
FROM b
)
,d AS (
SELECT reference, grp
FROM c
JOIN c d USING (reference, grp)
WHERE c.rn = 1
AND c.date_out >= '2012-10-01'
AND c.date_out < '2012-11-01'
AND d.rn = 2
AND d.centreid = 1
)
SELECT b.taskid, b.reference, b.centreid, b.date_out
FROM b
JOIN d USING (reference, grp)
ORDER BY reference, date_out;
-> sqlfiddle
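The step/sum "gaps and islands" trick in that query can be tried out with Python's sqlite3 as well (SQLite 3.25+; note that SQLite needs date(..., '-30 days') where Postgres can simply subtract an integer from a date; the sample rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (reference INTEGER, date_out TEXT)")
con.executemany("INSERT INTO tbl VALUES (?, ?)", [
    (100, '2012-10-20'),
    (100, '2012-10-01'),   # 19-day gap to the next entry -> same group
    (100, '2012-07-01'),   # 92-day gap -> starts a new group
])
rows = con.execute("""
WITH a AS (
    SELECT *,
           CASE WHEN date_out >
                date(lag(date_out) OVER (PARTITION BY reference
                                         ORDER BY date_out DESC), '-30 days')
                THEN 0 ELSE 1 END AS step
    FROM tbl
)
SELECT date_out,
       sum(step) OVER (PARTITION BY reference ORDER BY date_out DESC) AS grp
FROM a
ORDER BY date_out DESC
""").fetchall()
print(rows)  # [('2012-10-20', 1), ('2012-10-01', 1), ('2012-07-01', 2)]
```

Each row where the gap exceeds 30 days (or where lag() is NULL, i.e. the newest row) contributes a step of 1, and the running sum turns those steps into group numbers.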
But while pure SQL is a beauty, I would solve this procedurally in a plpgsql function, very similar to this recent answer on SO. It would be faster, because it can be done in a single table scan.
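The shape of that one-pass logic can be sketched outside the database. This is a hypothetical Python analogue, not the plpgsql function itself: rows arrive pre-sorted (as a cursor ordered by reference, date_out DESC would deliver them), and we only ever hold the top two rows per reference:

```python
from itertools import groupby
from operator import itemgetter

# Invented sample rows: (taskid, reference, centreid, date_out),
# already sorted by reference, date_out DESC.
rows = [
    (2, 100, 2, '2012-10-05'),
    (1, 100, 1, '2012-09-15'),
    (3, 200, 1, '2012-10-10'),
    (4, 200, 3, '2012-09-01'),
]

result = []
for reference, grp in groupby(rows, key=itemgetter(1)):
    latest = next(grp)            # what rn = 1 picks out
    previous = next(grp, None)    # what rn = 2 picks out, if it exists
    if (previous is not None
            and '2012-10-01' <= latest[3] < '2012-11-01'   # date_out in October
            and previous[2] == 1):                          # centreid = 1
        result.extend([latest, previous])
print(result)  # the two qualifying rows of reference 100
```

One ordered pass, no self-join, no second window scan; that is why the procedural version can win.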
OBSERVATION #1
You have this:
CREATE TABLE mytable (id1 int, id2 int, score float) ENGINE=MyISAM;
LOAD DATA INFILE '50-billion-records.txt' INTO mytable (id1, id2, score);
ALTER TABLE mytable ADD INDEX id1_index (id1);
This is how you should load the MyISAM table:
CREATE TABLE mytable (id1 int, id2 int, score float,key (id1)) ENGINE=MyISAM;
ALTER TABLE mytable DISABLE KEYS;
LOAD DATA INFILE '50-billion-records.txt' INTO mytable (id1, id2, score);
ALTER TABLE mytable ENABLE KEYS;
When you disable keys, it stops the .MYI file from updating nonunique indexes. During the LOAD DATA INFILE, the .MYI file will not grow because the table contains no unique indexes or primary key, so the load primarily focuses on writing the .MYD.
The ENABLE KEYS phase will then do a read pass through the .MYD and linearly build all the nonunique indexes. In your case, it will build the id1 index.
OBSERVATION #2
As for the bug you seem to be experiencing, think of this:
- 25 billion rows
- 4-byte integer for id1
- 100 billion bytes = 93.13G
- That's bigger than the 80G myisam_sort_buffer_size
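Quick sanity check of that arithmetic:

```python
# 25 billion rows x 4-byte id1 values, expressed in GiB.
rows = 25_000_000_000
bytes_per_id1 = 4
total_bytes = rows * bytes_per_id1    # 100 billion bytes
gib = total_bytes / 2**30
print(f"{gib:.2f} GiB")               # ~93.13 GiB, over the 80G buffer
```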
IMPLICATIONS
- It is possible that incomplete or aborted indexing operations lost index pages that should have been written to the .MYI.
- It is also possible that doing DISABLE KEYS makes your query run because the id1 index is somehow being ignored. This could be the case since the EXPLAIN plan says type ALL and Possible Keys NULL. That's a full table scan, so the SELECT query will work.
- With the keys enabled but incompletely written, missing nonleaf node info would lead to SELECT queries that are destined to fail, because some of the id1 values are not among the nonleaf nodes in the .MYI.
SUGGESTIONS
Try running the LOAD DATA INFILE with my proposed code:
CREATE TABLE mytable (id1 int, id2 int, score float,key (id1)) ENGINE=MyISAM;
ALTER TABLE mytable DISABLE KEYS;
LOAD DATA INFILE '50-billion-records.txt' INTO mytable (id1, id2, score);
ALTER TABLE mytable ENABLE KEYS;
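As an aside, the same build-the-index-after-the-load pattern can be illustrated with Python's sqlite3. SQLite has no DISABLE/ENABLE KEYS, so creating the index after the bulk load stands in for the ENABLE KEYS pass; the row count here is made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (id1 INTEGER, id2 INTEGER, score REAL)")
# Bulk-load first, with no index to maintain row by row.
con.executemany("INSERT INTO mytable VALUES (?, ?, ?)",
                ((i, i * 2, i * 0.5) for i in range(10_000)))
# One linear index build afterwards -- the ENABLE KEYS analogue.
con.execute("CREATE INDEX id1_index ON mytable (id1)")
n = con.execute("SELECT count(*) FROM mytable WHERE id1 = 42").fetchone()[0]
print(n)  # 1 -- the index is complete and usable after the load
```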
Also, raise myisam_sort_buffer_size to 100G.
Give it a Try !!!
UPDATE 2013-08-12 21:37 EDT
Repair By KeyCache is the result of hitting the max sort_buffer_size and myisam_sort_file_size; when that happens, MySQL elects to sift through keys in the key cache instead.
There are three (3) options you could further adjust:
OPTION #1
Set the sort_buffer_size to 16K. That's the minimum value allowed.
OPTION #2
Set the tmp_table_size to 1K. That's the minimum value allowed.
OPTION #3
Set the myisam_sort_buffer_size to its max value of 9223372036854775807
This should further prevent the Repair By KeyCache problem.
Give it a Try !!!
Best Answer
Broadly speaking, on a MyISAM table with a range scan, the process is:
- Find the first matching record on the BTREE (inside the .MYI file) and access the row result (on the .MYD file) - Handler_read_key
- Scan the index sequentially from there and access each subsequent row - Handler_read_next
You can actually confirm this plan by observing that you get a range join type on EXPLAIN and by watching the Handler_* counters on SHOW SESSION STATUS.
Theoretically, the first step is O(log n), where n is the number of records indexed (the table size), while the second is O(m), where m is the number of records returned. So, theoretically, a larger table will take longer to return the records. Why do I say theoretically? Because the O() notation can be deceitful if you do not take the constants into account. Indexes usually end up in memory, while rows (especially on MyISAM, which has no dedicated buffer for data) may be on disk, so the difference in performance between the two operations is large. Also, MyISAM has problems with large tables anyway, so the number of index levels tends not to grow too large.
Let me show you an unrelated graph:
In the above graph, the full table scan (blue line) should be flat, because all rows are examined regardless, but it is not, mainly because at that point reading and returning 16M rows is more costly than returning only 1.
So the answer is: both operations take time, and which one dominates depends on the actual values of m and n, plus the state of the database in terms of hardware speed (memory, disk) and the state of the buffers (filesystem cache, key buffer). In typical usage, an index scan for a single row is a very fast operation, but it depends what you compare it to, and whether you take extreme cases into account, like large tables where the BTREE index doesn't fit into memory.
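To put rough numbers on the O(log n) step: with an assumed fan-out of 1024 keys per BTREE page (an illustrative figure, not a MyISAM constant), the index depth grows very slowly with table size:

```python
import math

# Estimated B-tree depth = ceil(log_fanout(n)); fanout of 1024 is assumed.
fanout = 1024
depths = [math.ceil(math.log(n, fanout)) for n in (10**6, 10**9, 10**12)]
print(depths)  # [2, 3, 4]
```

A million-fold increase in table size adds only a couple of index levels, while the row-fetch cost stays simply proportional to m, the rows returned; that asymmetry is the whole point above.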