Consistent rows
The important question, which does not seem to be on your radar yet: from each set of rows for the same seriesName, do you want the columns of one row, or just any values from multiple rows (which may or may not go together)?
Your answer does the latter: you combine the maximum dbid with the maximum retreivaltime, which may come from a different row.
To get consistent rows, use DISTINCT ON and wrap it in a subquery to order the result differently:
SELECT * FROM (
SELECT DISTINCT ON (seriesName)
dbid, seriesName, retreivaltime
FROM FileItems
WHERE sourceSite = 'mk'
ORDER BY seriesName, retreivaltime DESC NULLS LAST -- latest retreivaltime
) sub
ORDER BY retreivaltime DESC NULLS LAST
LIMIT 100;
Details for DISTINCT ON:
Aside: the column should probably be named retrievalTime, or better yet: retrieval_time. Unquoted mixed-case identifiers are a common source of confusion in Postgres.
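A minimal sketch of that case-folding behavior (hypothetical table, just to demonstrate):

CREATE TABLE t (retrievalTime timestamptz);  -- unquoted, so folded to retrievaltime
SELECT retrievaltime   FROM t;  -- works
SELECT "retrievalTime" FROM t;  -- ERROR: column "retrievalTime" does not exist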
Better Performance with rCTE
Since we are dealing with a big table here, we'd need a query that can use an index, which is not the case for the above query (except for WHERE sourceSite = 'mk').
On closer inspection, your problem seems to be a special case of a loose index scan. Postgres does not support loose index scans natively, but they can be emulated with a recursive CTE. There is a code example for the simple case in the Postgres Wiki.
Related answer on SO with more advanced solutions, explanation, fiddle:
Your case is more complex, though. But I think I found a variant to make it work for you. Building on this index (without WHERE sourceSite = 'mk'):
CREATE INDEX mi_special_full_idx ON MangaItems
(retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid);
Or (with WHERE sourceSite = 'mk'):
CREATE INDEX mi_special_granulated_idx ON MangaItems
(sourceSite, retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid);
The first index can be used for both queries, but it is not fully efficient with the additional WHERE condition. The second index is of very limited use for the first query. Since you have both variants of the query, consider creating both indexes. I added dbid at the end to allow index-only scans.
This query with a recursive CTE makes use of the index. I tested with Postgres 9.3 and it works for me: no sequential scan, all index-only scans:
WITH RECURSIVE cte AS (
(
SELECT dbid, seriesName, retreivaltime, 1 AS rn, ARRAY[seriesName] AS arr
FROM MangaItems
WHERE sourceSite = 'mk'
ORDER BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST
LIMIT 1
)
UNION ALL
SELECT i.dbid, i.seriesName, i.retreivaltime, c.rn + 1, c.arr || i.seriesName
FROM cte c
, LATERAL (
SELECT dbid, seriesName, retreivaltime
FROM MangaItems
WHERE (retreivaltime, seriesName) < (c.retreivaltime, c.seriesName)
AND sourceSite = 'mk' -- repeat condition!
AND seriesName <> ALL(c.arr)
ORDER BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST
LIMIT 1
) i
WHERE c.rn < 101
)
SELECT dbid
FROM cte
ORDER BY rn;
You need to include seriesName in ORDER BY, since retreivaltime is not unique. "Almost" unique is still not unique.
Explanation
The non-recursive query starts with the latest row. The recursive query adds the next-latest row with a seriesName that's not in the list yet, and so on, until we have 100 rows.
Essential parts are the join condition (retreivaltime, seriesName) < (c.retreivaltime, c.seriesName) and the ORDER BY clause ORDER BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST. Both match the sort order of the index, which allows for the magic to happen.
Collecting seriesName in an array rules out duplicates. The cost of seriesName <> ALL(c.arr) grows progressively with the number of rows, but for just 100 rows it is still cheap.
Just returning dbid, as clarified in the comments.
Alternative with partial indexes:
We have been dealing with similar problems before. Here is a highly optimized complete solution based on partial indexes and a looping function:
Probably the fastest way (except for a materialized view) if done right. But more complex.
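The linked solution builds on partial indexes plus a looping function; just to illustrate the building block, here is a minimal sketch of a single partial index (index name made up):

CREATE INDEX mi_mk_special_idx ON MangaItems
(retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid)
WHERE sourceSite = 'mk';  -- partial: only rows matching the WHERE condition are indexed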
Materialized View
Since you do not have a lot of write operations and they are not performance-critical as stated in the comments (should be in the question), save the top n pre-computed rows in a materialized view and refresh it after relevant changes to the underlying table. Base your performance-critical queries on the materialized view instead.
It could just be a "thin" MV of the latest 1000 dbid or so; in the query, join to the original table. Useful, for instance, if content is sometimes updated but the set of top n rows remains unchanged.
Or a "fat" mv with whole rows to return. Faster, yet. Needs to be refreshed more often, obviously.
Details in the manual here and here.
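A minimal sketch of the "thin" variant, based on the queries above (the MV name and the LIMIT of 1000 are assumptions):

CREATE MATERIALIZED VIEW mv_latest_manga AS
SELECT dbid, retreivaltime
FROM  (
   SELECT DISTINCT ON (seriesName)
          dbid, retreivaltime
   FROM   MangaItems
   WHERE  sourceSite = 'mk'
   ORDER  BY seriesName, retreivaltime DESC NULLS LAST
   ) sub
ORDER  BY retreivaltime DESC NULLS LAST
LIMIT  1000;

-- after relevant writes to the base table:
REFRESH MATERIALIZED VIEW mv_latest_manga;

-- performance-critical queries join back to the base table:
SELECT m.*
FROM   mv_latest_manga v
JOIN   MangaItems m USING (dbid)
ORDER  BY v.retreivaltime DESC NULLS LAST
LIMIT  100;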
Do not use double(10,2); it involves two roundings. Either use DECIMAL(10,2) or plain DOUBLE. For monetary values (price) use DECIMAL(...) so that there is no rounding. DOUBLE is 8 bytes; DECIMAL(10,2) is 5 bytes.
Can you have "negative" Clicks? Suggest INT UNSIGNED.
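For example, a hedged fragment (table name is made up; column names follow the question where mentioned):

CREATE TABLE example_fact (
  price  DECIMAL(10,2) NOT NULL,  -- exact; 5 bytes; no second rounding on monetary values
  Clicks INT UNSIGNED  NOT NULL   -- clicks cannot go negative
);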
It is usually a bad idea to splay arrays across columns (L1, ... and Attribute1, ...). Instead, have another table for them. (OK, I don't know what impact that will have on queries that GROUP BY Attributes.)
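A minimal sketch of such a child table (table and column names are assumptions):

CREATE TABLE FactAttributes (
  fact_id   BIGINT UNSIGNED  NOT NULL,  -- FK to the fact table
  position  TINYINT UNSIGNED NOT NULL,  -- which of the former Attribute1..N slots
  value     VARCHAR(255)     NOT NULL,
  PRIMARY KEY (fact_id, position)
);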
PARTITION BY HASH has no known case of improving performance.
attribute_changed int(11) DEFAULT NULL -- if that is just a flag, make it TINYINT UNSIGNED NOT NULL DEFAULT '0'; that will save 3 bytes, plus the space for NULL.
Will you have a billion different Campaigns? CampaignName varchar(255) should be normalized and replaced by, say, a MEDIUMINT UNSIGNED (3 bytes) to save a lot of space. Ditto for any other varchars that repeat a lot.
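A sketch of that normalization (names are assumptions):

CREATE TABLE Campaigns (
  campaign_id  MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
  CampaignName VARCHAR(255)       NOT NULL,
  PRIMARY KEY (campaign_id),
  UNIQUE KEY (CampaignName)
);
-- the fact table then stores the 3-byte campaign_id instead of the repeated name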
If you are expecting billions of rows, squeezing out a few bytes per row can add up. That, in turn, will decrease the I/O, thereby speeding up queries.
Your covering index on the 5 attributes could consume a huge amount of space. Furthermore, with a billion rows, it may slow down INSERTs to one row per disk hit! On traditional drives, that is only 100/sec. You need more than 300/sec.
Since you say the users must include a Date clause, it may be practical to use PARTITION BY RANGE(TO_DAYS(Date)). It is unclear, but it sounds like user_id is also a requirement in the queries? In that case, I would suggest INDEX(user_id, Date) without partitioning. That composite index is much better than an index on Date alone. Adding more columns to that index will not help.
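A hedged sketch of the two variants (fact table name and date boundaries are made up):

-- Variant 1: range partitioning by day; every query must include the Date clause.
-- Note: the partitioning column must be part of every PRIMARY/UNIQUE key.
ALTER TABLE Facts
  PARTITION BY RANGE (TO_DAYS(`Date`)) (
    PARTITION p201501 VALUES LESS THAN (TO_DAYS('2015-02-01')),
    PARTITION p201502 VALUES LESS THAN (TO_DAYS('2015-03-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
  );

-- Variant 2: no partitioning, composite index instead
ALTER TABLE Facts ADD INDEX (user_id, `Date`);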
Your example has a Date range of one month plus one day; is that reasonable, or just a goof?
One table per client does not help.
Do not partition by user -- 250 partitions have their own performance problems. About 50 is the 'practical' limit for the number of partitions.
Summary tables (plural) are the only way you will make this multi-billion-row table perform adequately. The PRIMARY KEY of each summary table would include user_id and date (probably truncated to the day), plus a couple of other "dimensions". Several aggregations (mostly COUNTs and SUMs) would be the other columns. These tables would be significantly smaller than your 'Fact' table, and they can afford to have multiple indexes. Beware of AVG, because the average of averages is not mathematically correct. Instead, store the SUM and COUNT, then calculate the average as SUM(sums)/SUM(counts).
More on Data Warehousing and Summary Tables.
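A minimal sketch of one such summary table plus the correct average computation (names, dimensions, and precisions are assumptions):

CREATE TABLE DailyUserStats (
  user_id     INT UNSIGNED       NOT NULL,
  dy          DATE               NOT NULL,  -- Date truncated to the day
  campaign_id MEDIUMINT UNSIGNED NOT NULL,  -- one extra "dimension"
  clicks_sum  BIGINT UNSIGNED    NOT NULL,
  price_sum   DECIMAL(14,2)      NOT NULL,
  row_count   INT UNSIGNED       NOT NULL,
  PRIMARY KEY (user_id, dy, campaign_id)
);

-- average computed from sums and counts, not from stored averages:
SELECT user_id, SUM(price_sum) / SUM(row_count) AS avg_price
FROM   DailyUserStats
WHERE  dy >= '2015-01-01' AND dy < '2015-02-01'
GROUP  BY user_id;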
(Yes, some of my statements do disagree with previous Comments and Answers.)
Best Answer
Just addressing this question:
The difference is how the Query Optimizer treats them.
If the WHERE clause has time_field > 'x', this signals the Query Optimizer to try looking for any index so as to take advantage of it doing a range scan.

If the WHERE clause has DATE(time_field) > 'x', this signals the Query Optimizer to throw all indexes under the bus, because the DATE function has to be called across the whole table (in the form of a full table scan) or the join result.

Depending on the key distribution of time_field, DATE(time_field) > 'x' triggering a full table scan just happens to be better than a range scan on an index if the values make up a significant percentage of the index. This is even more true using InnoDB, because a full table scan passes through the clustered index (where the PRIMARY KEY and row data coexist), while a secondary index on time_field would cause a lookup of the secondary index in addition to the clustered index. If such a secondary index had a lopsided key distribution, such an index would be ignored in favor of a full table scan anyway.

This conjecture is only indicative of your current dataset. Someone else's dataset may have a better (evenly distributed, more balanced) key distribution, resulting in time_field > 'x' working better than DATE(time_field) > 'x'.

To see such differences, run EXPLAIN on both SELECT queries. The results may be different.
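For instance (table name and date are placeholders):

EXPLAIN SELECT * FROM mytable WHERE time_field > '2015-01-01';
EXPLAIN SELECT * FROM mytable WHERE DATE(time_field) > '2015-01-01';
-- Compare the type and key columns: the first query can show a range scan on an
-- index covering time_field, while the second typically shows type=ALL (full table scan).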