MySQL Index – Is CREATE INDEX a Linear Operation?

What I mean is the following:

If creating an index on a table with n rows takes time t, will creating the same index on a table with 1000*n rows take approximately 1000*t?

What I'm trying to achieve is to estimate the time it takes to create the index on the production database by creating the same index on the much smaller test database.

Best Answer

Index creation is essentially a sort operation, so at best it has a growth complexity of the order n log n on average (you might find it does better in some cases, and it is not likely to do much worse).

If all your relevant data pages fit into RAM and are already in RAM, the index fits as well, and your DBMS does not force index pages to be written before the creation is complete (so index blocks are not updated on disk multiple times during the operation), then the time spent writing the resulting index to disk will be more significant than the time taken to perform the sort. In that case you might get closer to a linear relationship between the number of rows and the time the index creation takes. But if you assume the worst case, you are less likely to be unpleasantly surprised!
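To get a feel for how far apart the two growth models are, here is a minimal Python sketch that scales a measured test time under both a linear and an n log n assumption. The function name and the sample numbers are made up for illustration, and the result is only a rough guide, since caching, disk speed and concurrent load are not modelled.

```python
import math

def extrapolate_index_time(test_rows, test_seconds, prod_rows):
    """Scale a measured index build time to a larger table.

    Returns (linear_estimate, nlogn_estimate) in seconds.
    Both values are rough guides, not guarantees.
    """
    if test_rows < 2 or prod_rows < 2:
        raise ValueError("row counts must be at least 2")
    ratio = prod_rows / test_rows
    # Linear model: time grows in proportion to the row count.
    linear = test_seconds * ratio
    # n log n model: the linear estimate times the ratio of the logarithms.
    nlogn = linear * (math.log(prod_rows) / math.log(test_rows))
    return linear, nlogn

# Hypothetical numbers: 1 million test rows indexed in 10 seconds,
# production table 1000 times larger.
linear, nlogn = extrapolate_index_time(1_000_000, 10.0, 1_000_000_000)
print(f"linear: {linear:.0f} s, n log n: {nlogn:.0f} s")
# prints "linear: 10000 s, n log n: 15000 s"
```

With these made-up inputs the n log n estimate is about 1.5 times the linear one, so the extra logarithmic factor is modest compared with what happens once the data no longer fits in RAM.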

Remember that unless you are going to stop access to the production database during the operation, any index creation will be competing for IO bandwidth and/or locks with other activity, so you should try to account for this if you are doing your timing estimation tests on another system, even if it is identically configured.

Related Solutions

Mysql – How to make CREATE INDEX in InnoDB table faster

RHEL5 ships with MySQL 5.0, unless you installed something else. So I would focus on increasing these three variables. They are not specific to either InnoDB or MyISAM.

read_buffer_size        = #of Mb
sort_buffer_size        = #of Mb
tmp_table_size          = #of Mb

http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_read_buffer_size

http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_sort_buffer_size

http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_tmp_table_size

Example:

read_buffer_size        = 16M
sort_buffer_size        = 64M
tmp_table_size          = 256M

Increasing the buffer_pool won't help you to speed up index creation. It is used for caching.

http://dev.mysql.com/doc/refman/5.0/en/innodb-buffer-pool.html

Also, if possible, I would try to put the tmpdir on a faster disk.

Postgresql – Create index on very large table with many shared values

For starters, gid should probably be a numeric type. integer should be good enough, or bigint if the integer key space is not big enough. That gives a much smaller footprint, faster processing than with character data, and faster and smaller indexes.

More importantly, to improve performance I suggest database normalization.

Quote:

There is a fairly regular pattern where each word appears about 1000 times.

Create a separate table for unique words:

CREATE TABLE word (
   word_id serial
 , word    text
);

Fill it with unique instances of word in your big_tbl:

INSERT INTO word (word)
SELECT DISTINCT word
FROM   big_tbl
ORDER  BY word;

ORDER BY is optional, not needed for query at hand. But it speeds up index creation and might be cheaper overall.

The table should be small in comparison: only ~ 50k rows for 50M rows in your big table.
Add indexes after filling the table:

ALTER TABLE word
    ADD CONSTRAINT word_word_uni UNIQUE (word) -- essential
  , ADD CONSTRAINT word_word_id_pkey PRIMARY KEY (word_id);  -- expendable?

If those are read-only tables, you can do without the pk. It's not relevant to the operations at hand.

Replace your big table with a much smaller new table. You may have to lock the big table to avoid concurrent writes. Concurrent reads are not a problem.

CREATE TABLE big_tbl_new AS
SELECT b.gid      -- or the suggested smaller, faster numeric replacement
     , w.word_id, b.stat
FROM   big_tbl b
JOIN   word w USING (word)
ORDER  BY word;   -- sorting by word helps query at hand

ORDER BY clusters the data (once), making the query at hand faster, because far fewer blocks have to be read (unless your data is mostly clustered already). The sort carries a cost, so weigh cost and benefit once more.

DROP TABLE big_tbl;     -- make sure your new table has all data!
ALTER TABLE big_tbl_new RENAME TO big_tbl;

Recreate indexes:

ALTER TABLE big_tbl ADD CONSTRAINT big_tbl_gid_pkey PRIMARY KEY (gid);  -- expendable?
CREATE INDEX big_tbl_word_id_idx ON big_tbl (word_id);  -- essential

Your query looks like this now and should be faster:

SELECT b.*
FROM   word w
JOIN   big_tbl b USING (word_id)
WHERE  w.word = 'something';

The reorganization is meant to be a one-time operation. Keep the new form and also consider keeping indexes permanently.

All of this together (including new indexes) should occupy about half of what you had before on disk, also cutting the time for creation in half (at least). Index creation should be considerably faster, the query as well. If RAM is a limiting factor, these modifications pay off doubly.

If you have to write to the table as well, it becomes more expensive (but you did not mention anything about that). You'd need to adjust your logic for DELETE / UPDATE / INSERT:
Example for INSERT: Fetch word_id for existing words or insert a new row in word returning the new word_id. Details for this:
How do I insert a row which contains a foreign key?
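The INSERT logic described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL, the helper name get_or_create_word_id is hypothetical, the column types (including an integer stat) are guesses, and the select-then-insert pattern is only safe with a single writer.

```python
import sqlite3

def get_or_create_word_id(conn, word):
    """Return word_id for `word`, inserting a new row in `word` if needed.

    Single-writer sketch: concurrent writers would need extra care.
    """
    row = conn.execute(
        "SELECT word_id FROM word WHERE word = ?", (word,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute("INSERT INTO word (word) VALUES (?)", (word,))
    # For an INTEGER PRIMARY KEY column, lastrowid is the new word_id.
    return cur.lastrowid

# In-memory stand-in schema; column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE)")
conn.execute(
    "CREATE TABLE big_tbl ("
    "gid INTEGER PRIMARY KEY, "
    "word_id INTEGER REFERENCES word (word_id), "
    "stat INTEGER)"
)

# Made-up rows: 'foo' appears twice and must map to one word_id.
for gid, word, stat in [(1, "foo", 10), (2, "bar", 20), (3, "foo", 30)]:
    word_id = get_or_create_word_id(conn, word)
    conn.execute(
        "INSERT INTO big_tbl (gid, word_id, stat) VALUES (?, ?, ?)",
        (gid, word_id, stat),
    )
conn.commit()
```

In PostgreSQL itself the same idea is usually expressed in a single statement, as covered in the linked answer.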
