MySQL – Indexes deleted, reduced column size but table size increases

indexinnodbMySQL

Here is the original table definition which contains 107547934 (~107.5M) rows

    CREATE TABLE `twitter_tweets` (
      `id_str` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
      `id_user` int(10) unsigned DEFAULT NULL COMMENT 'Twitter id of the user',
      `created_at` datetime DEFAULT NULL,
      `in_reply_to_user_id` int(10) DEFAULT NULL COMMENT 'Twitter id of the user being replied to, if any',
      `text` varchar(1000) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
      `retweet_count` int(10) DEFAULT NULL,
      `geo` varchar(256) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
      `retweeted` tinyint(1) DEFAULT NULL COMMENT 'Indicates whether the tweet has been retweeted at all',
      `place` varchar(256) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
      `in_reply_to_status_id_str` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
      `in_reply_to_screen_name` varchar(256) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
      `tweet_source` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
      `query` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT 'None',
      `iso_language_code` varchar(5) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
      `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
      PRIMARY KEY (`id_str`,`tweet_source`,`query`),
      KEY `id_user` (`id_user`),
      KEY `id_str` (`id_str`),
      KEY `query` (`query`),
      KEY `created_at` (`created_at`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8        

The disk space usage stats are

twitter_tweets disk usage

I changed the primary key to just id_str, dropped the key named id_str, dropped the column named geo. I also changed the the column tweet_source from varchar(100) to tinyint. However, I did add two float columns coordinates_long and coordinates_lat. The column geo which was dropped only contained null values. The two new columns added also only contain null values.
Here is the defn of the changed table which contains 106766691 (~106.7M) rows:-

    CREATE TABLE `twitter_tweets_new` (
      `id_str` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
      `id_user` int(10) unsigned DEFAULT NULL COMMENT 'Twitter id of the user',
      `created_at` datetime DEFAULT NULL,
      `in_reply_to_user_id` int(10) DEFAULT NULL COMMENT 'Twitter id of the user being replied to, if any',
      `text` varchar(500) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
      `retweet_count` int(10) DEFAULT NULL,
      `coordinates_long` float DEFAULT NULL,
      `coordinates_lat` float DEFAULT NULL,
      `retweeted` tinyint(1) DEFAULT NULL COMMENT 'Indicates whether the tweet has been retweeted at all',
      `place` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
      `in_reply_to_status_id_str` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
      `in_reply_to_screen_name` varchar(256) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
      `tweet_source` tinyint(4) NOT NULL COMMENT '1=user_timeline, 2=search',
      `query` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT 'None',
      `iso_language_code` varchar(5) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
      `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
      PRIMARY KEY (`id_str`),
      KEY `id_user` (`id_user`),
      KEY `query` (`query`),
      KEY `created_at` (`created_at`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8 |

The disk space usage stats are

twitter_tweets_new disk usage

When you compare the size of the 2 tables, the former (named twitter_tweets occupies less space than the latter twitter_tweets_new).

Although there was a net addition of one column, considering that the null values don't take much space and also that I reduced the index size by a lot and also changed one of the columns to a smaller datatype and that the new table has lesser number of rows than the original one, the table size of the new table twitter_tweets_new should have been much lesser than that of twitter_tweets.
Why could this be happening?

Best Answer

Of course, the index size collapsed because you reduced the PRIMARY KEY to a single column and removed KEYid_str(id_str),. This definitely reduced all secondary indexes. That is excellent.

What about the Removal of Columns ?

You removed geo, which were all NULL. There was no shrinkage for the data. Why ? Since all geo columns were NULL, it took no space at all in every row. The mapping of offsets for all fields had geo pointing at nothing at all.

In the MySQL Documentation on FIELD CONTENTS in InnoDB, it has the following example table

CREATE TABLE T
    (FIELD1 VARCHAR(3), FIELD2 VARCHAR(3), FIELD3 VARCHAR(3))
    Type=InnoDB;

Note that both FIELD1 and FIELD3 are VARCHAR fields with NULL values.

From the Helpful Notes About NULL values at the Bottom of that Page

For the third row, I inserted NULLs in FIELD2 and FIELD3. Therefore in the Field Start Offsets the top bit is on for these fields (the values are 94 hexadecimal, 94 hexadecimal, instead of 14 hexadecimal, 14 hexadecimal). And the row is shorter because the NULLs take no space.

Thus, removing a VARCHAR column where every value was NULL will not shrink the table.

What did adding FLOAT columns actually do ?

Let's play a little math game. You said the new table has 106.7M rows.

  • Each FLOAT column occupies 4 bytes
  • You added 2 FLOATs, so that is 8 bytes
  • The amount of space reserved for a FLOAT is still fixed at 4 bytes, even for a NULL FLOAT.
  • That 853.6M bytes which is 814 MiB.

At first glance, you would think that Data should have increased from 20.8 GiB to 21.6 GiB. What should have been 4% increase, increasing by 814MB, grew to 37.3 GiB, which is a 79.3% increase in Data.

Why Did the Data Grow So Much ?

Please be aware that there is an object in the table called gen_clust_index, a.k.a. the table's Clustered Index. This is the table's PRIMARY KEY. You may find this shocking, but the PRIMARY KEY does not reside in the table's index pages. It resides in the table's Data Pages.

Question: What is the size of a page in InnoDB ? Answer: 16KB.

What is inside the 16KB of a Data Page ?

  • Data for one or more rows
  • BTREE information for the PRIMARY KEY
  • Leaf Nodes that point to a row are in the same page

Remember you added two FLOATs ? This makes each row 8 bytes bigger. Each row will expand in the 16KB of the Data Page, leaving less room for BTREE information. What must InnoDB do to maintain the PRIMARY KEY ? Create additional Data pages to accommodate wider rows along with the associated BTREE information.

Since the PRIMARY KEY is a BTREE, you should expect O(n log n) growth in BTREE nodes, along with an overflow of InnoDB Data pages.

Without doing any further math, I will show you how this is the case.

MyISAM has data and indexes stored separately.

Run the following

CREATE TABLE twitter_tweets_myisam LIKE twitter_tweets_new;
ALTER TABLE twitter_tweets_myisam ENGINE=MyISAM;
ALTER TABLE twitter_tweets_myisam DISABLE KEYS;
INSERT INTO twitter_tweets_myisam SELECT * FROM twitter_tweets_new;
ALTER TABLE twitter_tweets_myisam ENABLE KEYS;

I can assure you that the .MYD file of twitter_tweets_myisam will not be 37.3 GiB

Q.E.D.