Mysql – Simple query returning < 10000 rows taking 3-4 seconds

MySQLperformancequery-performance

I'm having some trouble diagnosing this very simple query which seems quite slow to me, any help would be much appreciated.

The table of relevance for the query at issue is defined as follows:

CREATE TABLE `Episode` (
      `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
      `podcast_id` int(10) unsigned NOT NULL,
      `title` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
      `air_date` datetime NOT NULL,
      `episode_designation` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
      `length` int(11) DEFAULT NULL,
      `audio_quality` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
      `description` mediumtext COLLATE utf8mb4_unicode_ci,
      `image_url` mediumtext COLLATE utf8mb4_unicode_ci,
      `audio_url` mediumtext COLLATE utf8mb4_unicode_ci NOT NULL,
      `file_size` int(11) DEFAULT NULL,
      `created_at` timestamp NULL DEFAULT NULL,
      `updated_at` timestamp NULL DEFAULT NULL,
      `episode_url` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
      `is_explicit` tinyint(1) NOT NULL DEFAULT '0',
      `description_sanitized` longtext COLLATE utf8mb4_unicode_ci,
      `slug` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
      `availability` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
      `media_info` longtext COLLATE utf8mb4_unicode_ci,
      `guid` mediumtext COLLATE utf8mb4_unicode_ci,
      `visibility` varchar(16) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT 'visible',
      `source` varchar(32) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT 'rss',
      `meta_data` longtext COLLATE utf8mb4_unicode_ci,
      PRIMARY KEY (`id`),
      KEY `Episode_podcast_id` (`podcast_id`),
      KEY `Episode_guid` (`guid`(191)),
      CONSTRAINT `_Episode_podcast_id` FOREIGN KEY (`podcast_id`) REFERENCES `Podcast` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
    ) ENGINE=InnoDB AUTO_INCREMENT=40527565 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Entry in slow query log:

# Time: 190613 0:00:02
# Query_time: 3.903592 Lock_time: 0.000017 Rows_sent: 1405 Rows_examined: 1405
SET timestamp=1560384002;
select `guid` from `Episode` where `podcast_id` = 35011;

EXPLAIN for this query:

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE  Episode ref Episode_podcast_id  Episode_podcast_id  4   const   1405    

The table has around 40 million rows. Is this just the time it's taking to retrieve 1405 values from this table? If so, how can I determine what the bottleneck is?

SELECT SQL_NO_CACHE guid from Episode where podcast_id = 35011 runs quite quickly from the console, it's taking no more than 0.01s, which confirms that it's pretty strange for it to ever take 3+ seconds.

The query returned a total of 85,255 characters across the 1405 rows (the guid is per episode, and this podcast has 1405 episodes.).

Interestingly, when I try the same query with different podcast_idvalues I often get a multi-second response time, but even when using SQL_NO_CACHE (including for the initial query) each subsequent query was near instant. No idea what's going on there. No specific timing that I can see, looks like it's fairly common throughout the day.

innodb_buffer_pool_size is 12101615616 (set by AWS RDS based on instance size).

The GUIDs are determined by the podcast publisher in this case so there is potential for quite large strings to be used and we can't risk truncating them. Looking at it now we should probably just be using TEXT, but either way it doesn't seem like the index is necessary to solve this issue.

Sample GUIDs:
http://cdnapikalturacom/p/1423662/sp/142366200/playManifest/entryId/1_vuavk909/flavorId/1_krk2nxm5/protocol/http/format/url/amp3?clientTag=feed:1_sws03aca,
9eb6f9b358aea911a2ae0fa0d4817008,
tag:soundcloud,2010:tracks/240612409

Additional information as requested:

RDS instance has 16GB RAM

Parameters (RDS doesn't have an accessible my.cnf): https://pastebin.com/nhxHszqt

SHOW GLOBAL STATUS: https://pastebin.com/w6BPnR0e

SHOW GLOBAL VARIABLES: https://pastebin.com/wtQXDnyq

SHOW FULL PROCESSLIST: https://pastebin.com/rMBrf6pF

mySQL Tuner Output: https://pastebin.com/dC7ZJytW

SHOW ENGINE INNODB STATUS: https://pastebin.com/XVpA0RG3

Best Answer

Assuming guid is the typical CHAR(36) CHARACTER SET ascii, change INDEX(podcast_id) to INDEX(podcast_id, guid). This will significantly speed up that query.

Guids are hard to deal with due to their randomness.

Now that you have provided SHOW CREATE TABLE Episode, I see that the above INDEX will not help.

However, if your guid does not really need to be MEDIUMTEXT, but is, instead, limited to 767 bytes, then my index is still possible, but only after changing the declaration of guid.

Show us some sample guid values.

Plan B

OK, I give in. Your guids are large. There is nothing practical to do for them. Here's a different approach that might help that query (and other queries that look at all rows for one podcast):

Change

PRIMARY KEY (`id`),
KEY `Episode_podcast_id` (`podcast_id`),

to

PRIMARY KEY(podcast_id,   -- to "cluster" the 1405 rows together
            id),          -- to make the PK unique
INDEX(id),    -- to keep AUTO_INCREMENT happy

Part of the issue is that big columns (all those xxTEXT columns) tend to be stored "off-record", thereby necessitating an extra fetch (probably from disk) to get the value.

Some values will be stored on-record, some off-record. So, by clustering on podcast_id, the on-record guids might be in the same block (some of the time), thereby speeding things up.

Also... Check the ROW_FORMAT. If it is COMPACT, the first 768 bytes is stored on-record, which (for the example given) means that it will fit. In contrast, DYNAMIC will always have to reach off-record.

Caution: Changing the indexes (especially since the PK is involved) will take a long time for 40M rows.