To answer your second question:
MySQL does not have a parallel query execution engine, so even if you partition the query, you are still single threaded. This will eventually kill your scale.
However, you could partition the table by visitor_id
. This would allow you to run several queries (one per partition) in parallel, all of them form:
SELECT COUNT(DISTINCT visitor_id)
FROM table WHERE location_id = #
AND region_id = #
AND action_id = # AND ts BETWEEN x AND y
AND visitor_id BETWEEN <partition_start> and <partition_end>
The output of these parallel queries (which you could store in a temp table as they run) is trivially combinable into the final result by simply adding the distinct counts together.
This is very similar to sharding, but instead of doing it across machines, you are doing it on the same table. By picking a good hash function to generate visitor_id (for example, a modulo or bit reversal if the original id is generated with a AUTO_INCREMENT) you can ensure that all partitions are approximately equal sized.
The reason you want to partition by visitor_id
and not one of the other columns is that it makes the DISTINCT additive across partitions. For example, consider a table with two partitions. One holds visitor_id
0-99 in one holds and 100-199. You can now express two queries that can run in parallel:
INSERT INTO TempResult(visitor_id)
SELECT COUNT(DISTINCT visitor_id)
FROM table WHERE location_id = #
AND region_id = #
AND action_id = # AND ts BETWEEN x AND y
AND visitor_id BETWEEN 0 and 99
And this one in parallel:
INSERT INTO TempResults (visitor_id)
SELECT COUNT(DISTINCT visitor_id)
FROM table WHERE location_id = #
AND region_id = #
AND action_id = # AND ts BETWEEN x AND y
AND visitor_id BETWEEN 100 and 199
Because you know the visitor_id
is not overlapping between partitions, the final result is:
SELECT SUM(visitor_id) FROM TempResults
You would of course need to pick the partition boundaries in such a way that partitions have approximately the same size.
I will let ypercube file the answer to the indexing question as this is the one that deserves the reward.
Best Answer
When it comes to querying, indexing of a table should never be your first concern.
The queries you plan to use should dictate the indexes you need.
Based on the queries, some columns can be individually indexed. Other queries require compound indexes. The
ORDER BY
andGROUP BY
clauses should provide immediate hints for indexes to make. Not using such hints may result in temp table sorting rather than using the indexes for data in the desired order needed.Low cardinality of column values should eliminate the need for an index.
Even with these things taken into consideration, you may find that query may need some adjustment (a.k.a. refactoring) for performance gains.
When you reach the point of having the right indexes, not you have to worry about the size of those indexes. For a MyISAM table, this would mean that the .MYI file may grow significantly.
The size of the index file as well as the number of indexes should now be weighed against the performance of your queries, especially if the indexes provide the proper ordering of data and fastest retrieval.
Explain plans for queries may change over time depending on the number of rows, cardinality of columns, number of DELETEs and UPDATEs. Once a query's explain plan changes from what it looked like months ago, you should explore the need to add or remove indexes.