Mysql – index all columns of a table in MySQL safely

indexmyisamMySQLperformance

I have one table with 4 columns. One is primary key and other three columns are references to columns of other tables. I am using join on these four tables for a search filter. These joins are taking time.

So I was thinking to have indexes on columns because I read that I should create indexes on columns of join conditions. My question is will there be problem if all columns of the table are indexed? Is there any other way to decrease time complexity of these search filters.

More Hint:

Table1(main search)-1000 million entries primary_key fk1 fk2 fk3
Table2-800 million entries pk1 ..(8-9 columns)
Table3-700 million entries pk2 ..(10-12 columns)
Table2-850 million entries pk3 ..(7-8 columns)

I am using MyISAM engine.

Best Answer

When it comes to querying, indexing of a table should never be your first concern.

The queries you plan to use should dictate the indexes you need.

Based on the queries, some columns can be individually indexed. Other queries require compound indexes. The ORDER BY and GROUP BY clauses should provide immediate hints for indexes to make. Not using such hints may result in temp table sorting rather than using the indexes for data in the desired order needed.

Low cardinality of column values should eliminate the need for an index.

Even with these things taken into consideration, you may find that query may need some adjustment (a.k.a. refactoring) for performance gains.

When you reach the point of having the right indexes, not you have to worry about the size of those indexes. For a MyISAM table, this would mean that the .MYI file may grow significantly.

The size of the index file as well as the number of indexes should now be weighed against the performance of your queries, especially if the indexes provide the proper ordering of data and fastest retrieval.

Explain plans for queries may change over time depending on the number of rows, cardinality of columns, number of DELETEs and UPDATEs. Once a query's explain plan changes from what it looked like months ago, you should explore the need to add or remove indexes.

innoDB

is bad for dml statement, #MyISAM is better. Innodb is try to commit all INSERTing data to disk and returns success after it pined.

when you need a big insert data you can avoid auto-commit=0 to get faster. or you can check this for hint.

Mysql – How to setup complex multi-column index for massive table

To answer your second question:

MySQL does not have a parallel query execution engine, so even if you partition the query, you are still single threaded. This will eventually kill your scale.

However, you could partition the table by visitor_id. This would allow you to run several queries (one per partition) in parallel, all of them form:

SELECT COUNT(DISTINCT visitor_id) 
FROM table WHERE location_id = # 
AND region_id = # 
AND action_id = # AND ts BETWEEN x AND y
AND visitor_id BETWEEN <partition_start> and <partition_end>

The output of these parallel queries (which you could store in a temp table as they run) is trivially combinable into the final result by simply adding the distinct counts together.

This is very similar to sharding, but instead of doing it across machines, you are doing it on the same table. By picking a good hash function to generate visitor_id (for example, a modulo or bit reversal if the original id is generated with a AUTO_INCREMENT) you can ensure that all partitions are approximately equal sized.

The reason you want to partition by visitor_id and not one of the other columns is that it makes the DISTINCT additive across partitions. For example, consider a table with two partitions. One holds visitor_id 0-99 in one holds and 100-199. You can now express two queries that can run in parallel:

INSERT INTO TempResult(visitor_id)
SELECT COUNT(DISTINCT visitor_id) 
    FROM table WHERE location_id = # 
    AND region_id = # 
    AND action_id = # AND ts BETWEEN x AND y
    AND visitor_id BETWEEN 0 and 99

And this one in parallel:

INSERT INTO TempResults (visitor_id)
SELECT COUNT(DISTINCT visitor_id) 
    FROM table WHERE location_id = # 
    AND region_id = # 
    AND action_id = # AND ts BETWEEN x AND y
    AND visitor_id BETWEEN 100 and 199

Because you know the visitor_id is not overlapping between partitions, the final result is:

SELECT SUM(visitor_id) FROM TempResults

You would of course need to pick the partition boundaries in such a way that partitions have approximately the same size.

I will let ypercube file the answer to the indexing question as this is the one that deserves the reward.

Best Answer

Related Solutions

Mysql – poor and/or unreliable InnoDB performance

innoDB

Mysql – How to setup complex multi-column index for massive table

Related Question