Because this is really logging data that you are capturing, I would first store it in its raw form, then ETL/normalize it as needed.
For searching, you can index just the first few characters. Every URL starts with http:// or https://, so a prefix index on the URL itself would spend its bytes on the protocol; instead, limit the index to a few characters past the protocol part -- the example below indexes the first three characters of the domainname column.
CREATE TABLE log (
    datetime_created DATETIME,
    url VARCHAR(1024),
    domainname VARCHAR(255),
    someotherdata VARCHAR(255),
    ...
    INDEX `idx_domain` (domainname(3))
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;
- InnoDB will allow you to search the table w/o locking it.
- Compression will help with disk space.
- `INDEX idx_domain (domainname(3))` will index the first 3 characters of the domain name column and speed up searches. However, this only helps when the search matches on the front of the domain name; a REGEXP that matches mid-string cannot use the index.
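To make that tradeoff concrete, here is a sketch (using the `log` table above) of a query shape that can use the prefix index versus one that cannot; the literal patterns are placeholders:

```sql
-- Anchored at the start of the column: MySQL can narrow the search
-- using the 3-character prefix in idx_domain, then verify full matches.
SELECT COUNT(*) FROM log WHERE domainname LIKE 'exa%';

-- Not anchored at the start: REGEXP (like a leading-wildcard LIKE)
-- cannot use idx_domain and falls back to scanning every row.
SELECT COUNT(*) FROM log WHERE domainname REGEXP 'ample\\.com$';
```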
Scale will eventually be an issue if the site becomes popular, so buyer beware.
The implication is that you need ~1144 MB of extra storage to index the time column for ~100M rows...
Because the InnoDB engine stores a copy of the PRIMARY (or UNIQUE) key columns inside every secondary index, your secondary index will become larger.
How much larger? You can calculate it with this formula:
int = 4 bytes
datetime = 8 bytes
100000000 records * (4 + 8 bytes) =
100000000 * 12 bytes ~ 1200000000 bytes (~1144 MB) of extra storage (note: index record and page overhead are not included in the calculation)
A larger index will slow down inserts and deletes, and updates only when you update an indexed value.
A larger index can in theory also slow down selects, because InnoDB index pages are 16K (read http://www.ovaistariq.net/733/).
But it still depends on the InnoDB configuration and how much data is cached in the InnoDB buffer pool.
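If you want to check the estimate against reality, MySQL exposes approximate table and index sizes in `information_schema` (the figures are estimates, not exact byte counts):

```sql
-- Approximate size of the clustered index (data_length) and all
-- secondary indexes (index_length) for the log table, in MB.
SELECT table_name,
       ROUND(data_length  / 1024 / 1024, 1) AS data_mb,
       ROUND(index_length / 1024 / 1024, 1) AS index_mb
FROM information_schema.TABLES
WHERE table_name = 'log';
```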
Or you can take your approach and use a lookup table:
CREATE TABLE tracker_snapshot_lookup (
tracker_date DATE NOT NULL
, tracker_snapshot_start_id INT UNSIGNED NOT NULL
, tracker_snapshot_end_id INT UNSIGNED NOT NULL
, PRIMARY KEY(tracker_date)
-- Covering index below is overkill...
-- , PRIMARY KEY(tracker_date, tracker_snapshot_start_id, tracker_snapshot_end_id)
) ENGINE = InnoDB;
insert into tracker_snapshot_lookup values('2013-11-13', 1, 10000);
insert into tracker_snapshot_lookup values('2013-11-14', 10001, 20000);
If you use a JOIN or a derived table, the MySQL optimizer will in the worst case need:
- 1 index key lookup (random disk I/O) on tracker_snapshot_lookup.tracker_date (assuming WHERE tracker_date = '2013-11-13')
- 1 table record lookup (random disk I/O) for tracker_snapshot_start_id and tracker_snapshot_end_id (not necessary if you make it a covering index)
Based on tracker_snapshot_start_id and tracker_snapshot_end_id, MySQL will most likely choose a range scan (sequential disk I/O, which has low I/O wait cost) on the tracker_snapshot table.
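A sketch of that join, assuming the snapshot table is named `tracker_snapshot` with an auto-increment `id` primary key (the `id` column name is an assumption; only the table is named above):

```sql
-- One key lookup on the lookup table, then a range scan on the
-- snapshot table between the day's start and end ids.
SELECT s.*
FROM tracker_snapshot_lookup AS l
JOIN tracker_snapshot AS s
  ON s.id BETWEEN l.tracker_snapshot_start_id
              AND l.tracker_snapshot_end_id
WHERE l.tracker_date = '2013-11-13';
```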
Your savings
DATE 3 bytes
INT NOT NULL 4 bytes
So in one year you spend only this much extra storage:
Table data
365 days * (3 + 4 + 4 bytes)
365 * 11 = 4015 bytes (~0.004 MB)
Index data
365 days * 3 bytes = 1095 bytes (~0.001 MB)
It feels like magic, because you free up that ~1144 MB of storage space for more important data.
Best Answer
First approximation: An InnoDB table or index will be 2-3 times as large as you would expect from adding up the column sizes.
I get 53 for `VARCHAR(1023)` with an average of 51 characters. `VARCHAR` with a non-small max needs 2 bytes for the length. This also assumes there are no non-English letters in the URL. For example, each Chinese character takes 3-4 bytes. The index size needs to be computed in bytes.
For your 3 examples, there won't be a lot of difference, since most of the 2x-3x comes from BTree overhead, padding, row overhead, etc.
Another thing that can make a significant difference in index size is whether the rows are inserted in sorted order -- or not. That leads to well packed BTree blocks -- or not.
Also, your numbers are incomplete. Three different cases: MyISAM index, InnoDB PRIMARY KEY, InnoDB secondary key.
For an InnoDB secondary key, the `PRIMARY KEY` is tacked on. So, you need to add whatever size those column(s) are. Typical case: 4 bytes for `INT`.
Even if you get past all that, there will be anomalies. A 1-row table will have 16KB for each secondary index. At a certain table size, the "allocation unit" switches from 16KB to 4MB. After that, adding a row will rarely change the index size, but sometimes will show a big jump.
Etc, etc.
An aside... Indexes using `SHA2` (or other digests/hashes) are notoriously bad for performance on huge tables. They lead to random accesses, which blows out the cache, which leads to reads and writes being I/O bound, which is deadly for huge tables.