Of course the index size collapsed: you reduced the PRIMARY KEY to a single column and removed KEY id_str (id_str). Because every InnoDB secondary index carries a copy of the PRIMARY KEY columns, shrinking the PRIMARY KEY shrinks every secondary index as well. That is excellent.
What About the Removal of Columns?
You removed geo, whose values were all NULL, yet the data did not shrink. Why not? Since every geo value was NULL, it took up no space at all in any row: in the row's map of field offsets, geo pointed at nothing at all.
In the MySQL Documentation on FIELD CONTENTS in InnoDB, there is the following example table:
CREATE TABLE T
(FIELD1 VARCHAR(3), FIELD2 VARCHAR(3), FIELD3 VARCHAR(3))
Type=InnoDB;
Note that in the third row, FIELD2 and FIELD3 are VARCHAR fields holding NULL values.
From the helpful notes about NULL values at the bottom of that page:
For the third row, I inserted NULLs in FIELD2 and FIELD3. Therefore in
the Field Start Offsets the top bit is on for these fields (the values
are 94 hexadecimal, 94 hexadecimal, instead of 14 hexadecimal, 14
hexadecimal). And the row is shorter because the NULLs take no space.
Thus, removing a VARCHAR column in which every value was NULL will not shrink the table.
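You can see this for yourself on a scratch table. This is a hypothetical demo (the table name null_demo and its columns are made up for illustration): fill a VARCHAR column entirely with NULLs, check data_length in information_schema, drop the column, and check again.

```sql
-- Hypothetical demo: an all-NULL VARCHAR adds nothing to the row size
CREATE TABLE null_demo (id INT PRIMARY KEY, geo VARCHAR(255) NULL) ENGINE=InnoDB;
INSERT INTO null_demo (id) VALUES (1), (2), (3);   -- geo stays NULL in every row

SELECT data_length FROM information_schema.tables
WHERE table_name = 'null_demo';                    -- note the size

ALTER TABLE null_demo DROP COLUMN geo;

SELECT data_length FROM information_schema.tables
WHERE table_name = 'null_demo';                    -- essentially unchanged
```

Keep in mind data_length is reported in whole 16KB pages, so on a tiny table like this the point shows up as "no change at all".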
What Did Adding FLOAT Columns Actually Do?
Let's play a little math game. You said the new table has 106.7M rows.
- Each FLOAT column occupies 4 bytes
- You added 2 FLOATs, so that is 8 bytes per row
- The space reserved for a FLOAT is fixed at 4 bytes, even for a NULL FLOAT
- 106.7M rows × 8 bytes = 853.6M bytes, which is about 814 MiB
At first glance, you would expect Data to grow from 20.8 GiB to roughly 21.6 GiB, an increase of about 4%. Instead, it grew to 37.3 GiB, which is a 79.3% increase in Data.
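The arithmetic can be checked straight from the MySQL console (the row count and table sizes are the figures quoted above):

```sql
-- Back-of-envelope check of the numbers above
SELECT
    106700000 * 8 AS added_bytes,                      -- 2 FLOATs x 4 bytes x 106.7M rows
    ROUND(106700000 * 8 / POW(1024, 2)) AS added_mib,  -- ~814 MiB expected
    ROUND((37.3 - 20.8) / 20.8 * 100, 1) AS actual_growth_pct;  -- ~79.3% observed
```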
Why Did the Data Grow So Much?
Please be aware that there is an object in the table called gen_clust_index, a.k.a. the table's Clustered Index. This is the table's PRIMARY KEY. You may find this shocking, but the PRIMARY KEY does not reside in the table's index pages: it resides in the table's Data Pages.
Question: What is the size of a page in InnoDB? Answer: 16KB.
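You can confirm the page size on your own server (16KB is the default; it became configurable in MySQL 5.6):

```sql
-- Default is 16384 bytes (16 KB)
SHOW VARIABLES LIKE 'innodb_page_size';
```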
What is inside the 16KB of a Data Page?
- Data for one or more rows
- BTREE information for the PRIMARY KEY
- Leaf nodes that point to a row, stored in the same page as the row
Remember you added two FLOATs? That makes each row 8 bytes bigger. Each row expands within the 16KB of its Data Page, leaving less room for BTREE information. What must InnoDB do to maintain the PRIMARY KEY? Create additional Data Pages to accommodate the wider rows along with the associated BTREE information.
Since the PRIMARY KEY is a BTREE, you should expect O(n log n) growth in BTREE nodes, along with an overflow of InnoDB Data pages.
Without doing any further math, here is a way to demonstrate that this is the case.
MyISAM has data and indexes stored separately.
Run the following:
CREATE TABLE twitter_tweets_myisam LIKE twitter_tweets_new;
ALTER TABLE twitter_tweets_myisam ENGINE=MyISAM;
ALTER TABLE twitter_tweets_myisam DISABLE KEYS;
INSERT INTO twitter_tweets_myisam SELECT * FROM twitter_tweets_new;
ALTER TABLE twitter_tweets_myisam ENABLE KEYS;
I can assure you that the .MYD file of twitter_tweets_myisam will not be 37.3 GiB.
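Once the copy finishes, you can put the two tables side by side (table names as used above; restrict by your own schema name if needed):

```sql
-- Compare data and index footprints of the InnoDB original vs the MyISAM copy
SELECT table_name,
       engine,
       ROUND(data_length  / POW(1024, 3), 2) AS data_gib,
       ROUND(index_length / POW(1024, 3), 2) AS index_gib
FROM information_schema.tables
WHERE table_name IN ('twitter_tweets_new', 'twitter_tweets_myisam');
```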
Based on my experience, I would use LOAD DATA INFILE to import your CSV File.
The LOAD DATA INFILE statement reads rows from a text file into a
table at a very high speed.
Here is a Load Data example I found on the internet. I tested this example on my box and it worked fine.
Example Table
CREATE TABLE example (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`Column2` varchar(14) NOT NULL,
`Column3` varchar(14) NOT NULL,
`Column4` varchar(14) NOT NULL,
`Column5` DATE NOT NULL,
PRIMARY KEY (`Id`)
) ENGINE=InnoDB;
Example CSV File
# more /tmp/example.csv
Column1,Column2,Column3,Column4,Column5
1,A,Foo,sdsdsd,4/13/2013
2,B,Bar,sdsa,4/12/2013
3,C,Foo,wewqe,3/12/2013
4,D,Bar,asdsad,2/1/2013
5,E,FOObar,wewqe,5/1/2013
Import statement, to be run from the MySQL console:
LOAD DATA LOCAL INFILE '/tmp/example.csv'
-> INTO TABLE example
-> FIELDS TERMINATED BY ','
-> LINES TERMINATED BY '\n'
-> IGNORE 1 LINES
-> (id, Column3,Column4, @Column5)
-> set
-> Column5 = str_to_date(@Column5, '%m/%d/%Y');
Result
MySQL [testcsv]> select * from example;
+----+---------+---------+---------+------------+
| Id | Column2 | Column3 | Column4 | Column5 |
+----+---------+---------+---------+------------+
| 1 | | Column2 | Column3 | 0000-00-00 |
| 2 | | B | Bar | 0000-00-00 |
| 3 | | C | Foo | 0000-00-00 |
| 4 | | D | Bar | 0000-00-00 |
| 5 | | E | FOObar | 0000-00-00 |
+----+---------+---------+---------+------------+
IGNORE 1 LINES simply skips the first line of the file, which contains the column headers.
After IGNORE, we specify the columns to import (skipping Column2), which matches one of the criteria in your question.
Here is another example, directly from Oracle: LOAD DATA INFILE example.
This should be enough to get you started.
Best Answer
Some options:
Use InnoDB. If not for these reasons, then for compression! With InnoDB you can:
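The statement being alluded to here is presumably InnoDB table compression; a minimal sketch under that assumption (table name taken from earlier in the thread, KEY_BLOCK_SIZE=8 inferred from the "Replace 8" remark below):

```sql
-- Rebuild the table with compressed pages; requires innodb_file_per_table=ON
-- (and innodb_file_format=Barracuda on older MySQL versions)
ALTER TABLE twitter_tweets_new
    ROW_FORMAT=COMPRESSED
    KEY_BLOCK_SIZE=8;
```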
Replace 8 with 4, 2, or 1 to (hopefully) get better compression. Since the table is static, I think this is a great solution for you (with strong compression writes become slower, but you don't care about that).
Not only the data is compressed; the indexes are, too. I would suggest this is the easiest option you have.
Only index part of your column. You agree that the column is not UNIQUE but can be assumed to be. OK; is your index a UNIQUE index? If not, try to only index the first 8 bytes of the column.
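A prefix index along those lines might look like this (the table and column names are taken from earlier in the thread, and the prefix length of 8 follows the suggestion above; adjust both to your schema):

```sql
-- Index only the first 8 bytes/characters of id_str instead of the full value
ALTER TABLE twitter_tweets_new
    DROP INDEX id_str,
    ADD INDEX id_str (id_str(8));
```

A prefix index this short cannot enforce uniqueness, but since the index is not UNIQUE anyway, it trades a little selectivity for a much smaller index.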
Try TokuDB, an alternative storage engine to InnoDB which has amazing compression (I've seen data get 20 times smaller than InnoDB with TokuDB's aggressive compression).