Mysql – Indexing a GROUP BY Select with Multiple Inequalities

group byindex-tuningMySQLoptimizationquery-performance

I have a pair of linked issues I need help with:

Issue #1: How to Index a Table to Optimize for a "Group By" SELECT

I have a table of stock price information with the following Create statement:

CREATE TABLE `price_daily` (
  `Symbol` varchar(20) NOT NULL,
  `Date` date NOT NULL,
  `Price` float DEFAULT NULL,
  `MarketCap` double DEFAULT NULL,
  `PriceToEarnings` double DEFAULT NULL,
  `PriceToSales` double DEFAULT NULL,
  PRIMARY KEY (`Symbol`,`Date`),
  KEY `Date_MCap` (`Date`,`MarketCap`,`PriceToSales`,`Symbol`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This table is fairly expansive; it spans roughly 100 million rows. Thus, imprecise SELECT queries are likely to take a long time to resolve, which I would like to avoid.

I want to be able to search the database for stocks within a specified date range that fall within selected PriceToSales and/or PriceToEarnings ranges. The query below is typical of one I would perform:

SELECT Symbol, Date, MarketCap, Price
FROM price_daily
WHERE Date BETWEEN '2002-07-31'AND '2017-07-31' 
AND MarketCap >= 1000
AND PriceToSales BETWEEN 2 AND 5
GROUP BY Symbol
;

I use GROUP BY because I only need to see one result per Symbol.
The problem is, when I add the 'GROUP BY' clause at the end, the query suddenly takes many times longer to complete than it does without that clause. What am I doing wrong? I assume it has something to do with my indexing, but I have not yet 100% figured out the 'rules' for ordering columns in indexes.
(A couple non-specific specifics: without GROUP BY the query has roughly 0.5s completion time but takes >200s to fetch, as there are potentially millions of results to display.)

EDIT 4/28/20 3pm: I have been told, in multiple locations, that the syntax of the above GROUP BY query is bad because the columns Date, MarketCap and Price are not aggregated. This is presumably a trivial issue to fix; for now, I leave Issue #1 alone to focus solely on the second issue.

Issue #2: SELECTing the Smallest Value From Within a GROUPed Query

A related question to the above:
Suppose, rather than allow the 'GROUP BY Symbol' clause from the above to choose whichever date it wants, I want to guarantee that my query will always SELECT the earliest Date chronologically for each Symbol. How would I do that? I assume the correct query would look something like this:

SELECT Symbol, MIN(Date)
FROM price_daily
WHERE Date BETWEEN '2002-07-31'AND '2017-07-31' 
AND MarketCap >= 1000
AND PriceToSales BETWEEN 2 AND 5
GROUP BY Symbol
;

…but, due to the issue I am having with GROUP BY above, I cannot currently test this.

Any help given on either or both of these problems will be much appreciated.
If you need me to provide more information, I'll edit this post tomorrow morning when I get back to the office.

EDIT 4/28/20 3:30pm: The EXPLAIN for the query given under Issue #2 is as follows:

+------+-------------+-------------+-------+--------------------+-----------+---------+------+----------+-----------------------------------------------------------+
| id   | select_type | table       | type  | possible_keys      | key       | key_len | ref  | rows     | Extra                                                     |
+------+-------------+-------------+-------+--------------------+-----------+---------+------+----------+-----------------------------------------------------------+
|    1 | SIMPLE      | price-daily | range | PRIMARY, date_mcap | date_mcap | 21      | NULL | 50622081 | Using where; using index; using temporary; using filesort |
+------+-------------+-------------+-------+--------------------+-----------+---------+------+----------+-----------------------------------------------------------+
1 row in set (0.00 sec)

The query took 331.1 secs to resolve, generating 2735 rows. An excerpt from the table:

Symbol  Date
------  ----------
A       2003-12-19
AABA    2007-07-26
AACC    2005-10-03
AAN     2010-02-11
AAOI    2017-03-07
AAON    2013-10-01
AAPL    2004-10-05
AAT     2012-07-27
...     ...

Best Answer

It seems there are a lot of rows which fit the conditions and database server needs to sort them for grouping. It probably even needs to create a temp table on disk if your data doesn't fit into memory. In-memory temporary table size is limited by max_heap_table_size and tmp_table_size variables, but what is more important - MEMORY tables use a fixed-length row-storage format. Variable-length types such as VARCHAR are stored using a fixed length. It means every varchar(20) value will consume 20 bytes of RAM. Add 3 bytes for dates and 4 for doubles and your row size will be 31 bytes. If 5% of 100 million rows fit the conditions than you need 150 MB temporary table to group your data. You can start with increasing the max temporary table size and check if it helps.

But also you can try to use another approach:

SELECT Symbol
    ,(SELECT Date FROM price_daily
        WHERE Symbol = s.Symbol
            AND Date BETWEEN '2002-07-31'AND '2017-07-31' 
            AND MarketCap >= 1000
            AND PriceToSales BETWEEN 2 AND 5
        ORDER BY Date LIMIT 1) AS min_date
FROM symbols AS s;

I assumed you have a table with list of symbols (Symbol data type should be the same).