Here is the original table definition which contains 107547934 (~107.5M) rows
CREATE TABLE `twitter_tweets` (
`id_str` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`id_user` int(10) unsigned DEFAULT NULL COMMENT 'Twitter id of the user',
`created_at` datetime DEFAULT NULL,
`in_reply_to_user_id` int(10) DEFAULT NULL COMMENT 'Twitter id of the user being replied to, if any',
`text` varchar(1000) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`retweet_count` int(10) DEFAULT NULL,
`geo` varchar(256) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`retweeted` tinyint(1) DEFAULT NULL COMMENT 'Indicates whether the tweet has been retweeted at all',
`place` varchar(256) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`in_reply_to_status_id_str` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`in_reply_to_screen_name` varchar(256) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`tweet_source` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`query` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT 'None',
`iso_language_code` varchar(5) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id_str`,`tweet_source`,`query`),
KEY `id_user` (`id_user`),
KEY `id_str` (`id_str`),
KEY `query` (`query`),
KEY `created_at` (`created_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
The disk space usage stats are
I changed the primary key to just id_str
, dropped the key named id_str
, dropped the column named geo
. I also changed the the column tweet_source
from varchar(100) to tinyint. However, I did add two float columns coordinates_long
and coordinates_lat
. The column geo
which was dropped only contained null values. The two new columns added also only contain null values.
Here is the defn of the changed table which contains 106766691 (~106.7M) rows:-
CREATE TABLE `twitter_tweets_new` (
`id_str` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`id_user` int(10) unsigned DEFAULT NULL COMMENT 'Twitter id of the user',
`created_at` datetime DEFAULT NULL,
`in_reply_to_user_id` int(10) DEFAULT NULL COMMENT 'Twitter id of the user being replied to, if any',
`text` varchar(500) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`retweet_count` int(10) DEFAULT NULL,
`coordinates_long` float DEFAULT NULL,
`coordinates_lat` float DEFAULT NULL,
`retweeted` tinyint(1) DEFAULT NULL COMMENT 'Indicates whether the tweet has been retweeted at all',
`place` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`in_reply_to_status_id_str` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`in_reply_to_screen_name` varchar(256) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`tweet_source` tinyint(4) NOT NULL COMMENT '1=user_timeline, 2=search',
`query` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT 'None',
`iso_language_code` varchar(5) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id_str`),
KEY `id_user` (`id_user`),
KEY `query` (`query`),
KEY `created_at` (`created_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
The disk space usage stats are
When you compare the size of the 2 tables, the former (named twitter_tweets
occupies less space than the latter twitter_tweets_new
).
Although there was a net addition of one column, considering that the null values don't take much space and also that I reduced the index size by a lot and also changed one of the columns to a smaller datatype and that the new table has lesser number of rows than the original one, the table size of the new table twitter_tweets_new
should have been much lesser than that of twitter_tweets
.
Why could this be happening?
Best Answer
Of course, the index size collapsed because you reduced the
PRIMARY KEY
to a single column and removedKEY
id_str(
id_str),
. This definitely reduced all secondary indexes. That is excellent.What about the Removal of Columns ?
You removed
geo
, which were allNULL
. There was no shrinkage for the data. Why ? Since allgeo
columns wereNULL
, it took no space at all in every row. The mapping of offsets for all fields hadgeo
pointing at nothing at all.In the MySQL Documentation on FIELD CONTENTS in InnoDB, it has the following example table
Note that both
FIELD1
andFIELD3
areVARCHAR
fields withNULL
values.From the Helpful Notes About
NULL
values at the Bottom of that PageThus, removing a
VARCHAR
column where every value wasNULL
will not shrink the table.What did adding FLOAT columns actually do ?
Let's play a little math game. You said the new table has 106.7M rows.
At first glance, you would think that Data should have increased from 20.8 GiB to 21.6 GiB. What should have been 4% increase, increasing by 814MB, grew to 37.3 GiB, which is a 79.3% increase in Data.
Why Did the Data Grow So Much ?
Please be aware that there is an object in the table called gen_clust_index, a.k.a. the table's Clustered Index. This is the table's
PRIMARY KEY
. You may find this shocking, but thePRIMARY KEY
does not reside in the table's index pages. It resides in the table's Data Pages.Question: What is the size of a page in InnoDB ? Answer: 16KB.
What is inside the 16KB of a Data Page ?
PRIMARY KEY
Remember you added two FLOATs ? This makes each row 8 bytes bigger. Each row will expand in the 16KB of the Data Page, leaving less room for BTREE information. What must InnoDB do to maintain the
PRIMARY KEY
? Create additional Data pages to accommodate wider rows along with the associated BTREE information.Since the PRIMARY KEY is a BTREE, you should expect O(n log n) growth in BTREE nodes, along with an overflow of InnoDB Data pages.
Without doing any further math, I will show you how this is the case.
MyISAM has data and indexes stored separately.
Run the following
I can assure you that the
.MYD
file oftwitter_tweets_myisam
will not be 37.3 GiBQ.E.D.