Since there are few distinct values in the pk2 column, it might be more efficient to have an index with the reverse column order (pk2, pk1):
ALTER TABLE `table`
  ADD INDEX tIX (pk2, pk1) ;
and then use the query:
SELECT e3.pk2
FROM `table` e3
GROUP BY e3.pk2
HAVING EXISTS
      ( SELECT 1
        FROM `table` t
        WHERE t.pk2 = e3.pk2
          AND t.pk1 BETWEEN '2013-09-24 12:27:00.0'
                        AND '2013-11-29 12:27:00.0'
      )
-- ORDER BY NULL
 ;
(On MySQL versions before 8.0, GROUP BY implies sorting the result; uncomment the ORDER BY NULL to skip that sort.)
Depending on the size of the tables and the values of the datetime range, you may get a plan that uses the primary index in the dependent subquery:
ID  SELECT_TYPE         TABLE  TYPE   POSSIBLE_KEYS  KEY      KEY_LEN  REF     ROWS  EXTRA
1   PRIMARY             e3     range  (null)         tIX      4        (null)  61    Using index for group-by
2   DEPENDENT SUBQUERY  t      range  PRIMARY,tIX    PRIMARY  4        (null)  4     Using where; Using index
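To check which plan the optimizer picks on your own data, you can prefix the query with EXPLAIN (a minimal sketch, reusing the query and index names from above):

EXPLAIN
SELECT e3.pk2
FROM `table` e3
GROUP BY e3.pk2
HAVING EXISTS
      ( SELECT 1
        FROM `table` t
        WHERE t.pk2 = e3.pk2
          AND t.pk1 BETWEEN '2013-09-24 12:27:00.0'
                        AND '2013-11-29 12:27:00.0'
      ) ;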
But with a larger datetime range or a larger table (we can't test too many rows at SQLFiddle; see the SQL-Fiddle), you may get another possible execution plan. I expect this one to be more efficient when there are few distinct values in the pk2 column: with 318 distinct values, it will do 318 lookups using the newly added index, which should beat the previous plan that first fetches the (possibly millions of) rows matching the datetime range condition and then finds the DISTINCT pk2 values among them:
ID  SELECT_TYPE         TABLE  TYPE   POSSIBLE_KEYS  KEY  KEY_LEN  REF     ROWS  EXTRA
1   PRIMARY             e3     range  (null)         tIX  4        (null)  61    Using index for group-by
2   DEPENDENT SUBQUERY  t      range  PRIMARY,tIX    tIX  4        func    52    Using where; Using index
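To gauge how many index lookups the second plan implies on your data, counting the distinct pk2 values is a quick check (a sketch against the same table; the alias distinct_pk2 is just illustrative):

SELECT COUNT(DISTINCT pk2) AS distinct_pk2   -- one lookup per distinct value in the second plan
FROM `table` ;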
The search is performed on the server host where the database service is installed, and only the result set is sent back to you. It would make no sense for your client to do this work, as your client is just an interface that allows you to interact with your database. This holds for all databases.
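To make that concrete: in a query like the one below, the scanning and filtering happen entirely on the server, and the client receives only the rows that survive the WHERE clause (a sketch, reusing the table and columns from the example above):

-- Executed on the server; only the matching pk2 values cross the network.
SELECT DISTINCT pk2
FROM `table`
WHERE pk1 >= '2013-09-24 12:27:00.0' ;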
Best Answer
Scan operations aren't usually optimal, but 500,000 records isn't anything too crazy for a standard RDBMS. For example, I've seen Microsoft SQL Server serve up 1 million records from a decently wide table (about 100 columns) in a few seconds on a pretty basic server (4 cores, 32 GB of RAM).
I'm not an expert on NoSQL databases or DynamoDB, but I've researched their use cases in the past, and my understanding is that DynamoDB is most performant when caching smaller amounts of data and recalling smaller "singletonish" datasets, such as paging product information in a catalog or a specific user's settings. So I would say NoSQL databases (or at least DynamoDB) are likely not great for scanning large amounts of data, but a modern RDBMS should be able to handle scanning a few million records on basic hardware these days.
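If you do need to scan millions of rows from an RDBMS, chunking the scan with keyset pagination keeps memory use flat on both ends (a minimal sketch; big_table, its auto-increment primary key id, and the payload column are all hypothetical names):

-- big_table, id, and payload are hypothetical; adapt to your schema.
SELECT id, payload
FROM big_table
WHERE id > 0          -- replace 0 with the last id from the previous chunk
ORDER BY id
LIMIT 10000 ;
-- Repeat, substituting the last id seen, until no rows are returned.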