The query has a lot of bloat:
- Two GROUP BYs; the second is completely unnecessary.
- Three levels of nesting; the last is also completely unnecessary.
- It uses GROUP_CONCAT() to find all the product_id values that have rows with 'southern' and rows with 'proper'. Not the best way, in my opinion.
- The 1189 AS query_id seems to return redundant information (the same value in all the rows of the result).
- Using UNION ALL instead of OR may be OK for efficiency, but the GROUP BY in the next step is probably the performance killer (grouping 400K rows without the use of any index).
Things that are not a problem:
Using where; Using temporary; Using filesort
This looks bad, but sorting 46 rows is not going to be an issue. Using filesort in the MySQL EXPLAIN output does not mean that a file is used (not a very good naming choice indeed).
One thing you can do without changing the query:
- Add an index on (store_id, entity_id, data_index, weight).
This will basically create a copy of the entire table (but a bit narrower, as the primary index in InnoDB tables uses some extra hidden space for internal purposes). The query will be able to use the index: it will do a partial index scan (only the rows with store_id = 1) and then get all the info needed from the index (no table scan at all). Unfortunately, as the query is written now, it will still need 2 partial index scans (and then all the unnecessary bloat that follows).
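For reference, that suggestion as DDL; a sketch only: the table name is taken from the query below, the index name is made up, and it assumes data_index is a VARCHAR short enough to be indexed in full:

ALTER TABLE activo_advancedsearch_weighted_search
    -- covering index: the query can then be answered from the index alone
    ADD INDEX ix_store_entity_data_weight (store_id, entity_id, data_index, weight);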
If you can change the query though, there are various options to try:
First, we could replace UNION ALL with OR, so we only do one table scan (or one index scan, if we have the index suggested above) and not two. Also remove the extra levels of nesting so there is only one level, and only one GROUP BY, by altering the query logic (removing the GROUP_CONCAT() and using HAVING with counts and sums):
SELECT 1189 AS query_id,
i.entity_id AS product_id,
SUM(CASE WHEN i.data_index LIKE '%southern%' THEN i.weight ELSE 0 END)
+ SUM(CASE WHEN i.data_index LIKE '%proper%' THEN i.weight ELSE 0 END)
AS weight
FROM activo_advancedsearch_weighted_search AS i
WHERE i.store_id = 1
AND ( i.data_index LIKE '%southern%'
OR i.data_index LIKE '%proper%'
)
GROUP BY i.entity_id
HAVING SUM(i.data_index LIKE '%southern%') > 0
AND SUM(i.data_index LIKE '%proper%') > 0
ORDER BY weight DESC;
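This works because in MySQL a boolean expression such as i.data_index LIKE '%southern%' evaluates to 1 or 0, so SUM(...) > 0 in the HAVING clause means "at least one row for this entity_id matched that term".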
The whole issue revolves around CASTing and quirky PostgreSQL syntax.
CASTing strings as DATEs:
It's very simple - your first line of code:
ntest=# SELECT EXTRACT(DOW FROM '2011-07-07');
ERROR: function pg_catalog.date_part(unknown, unknown) is not unique
LINE 1: SELECT EXTRACT(DOW FROM '2011-07-07');
^
HINT: Could not choose a best candidate function. You might need to add explicit type casts.
ntest=#
does not work. This is because, as far as PostgreSQL is concerned, '2011-07-07' is just a string literal, not a DATE. This does work:
ntest=# SELECT EXTRACT(DOW FROM DATE '2011-07-07');
date_part
-----------
4
(1 row)
Note the extra DATE that CASTs the string as such. You can also CAST it this way:
ntest=# SELECT EXTRACT(DOW FROM '2011-07-07'::DATE);
date_part
-----------
4
This double colon (::) is very much a PostgreSQL thing. It's really quite "obvious", once you grasp the PostgreSQL syntax (it had me stumped for a couple of minutes!)
A date field:
ntest=# \d orders;
Table "public.orders"
Column | Type | Modifiers
-------------+---------+-----------
ldate | date |
customer_id | integer |
amount | integer |
So, we have a date field - with records as below:
ntest=# SELECT * FROM orders;
ldate | customer_id | amount
------------+-------------+--------
2009-10-08 | 3 | 3000
2009-10-08 | 3 | 1500
2009-11-20 | 2 | 1560
2008-05-20 | 4 | 2060
(4 rows)
So, to get the "day of week" from a date (note the two FROMs; it doesn't work without the FROM between the brackets):
ntest=# SELECT EXTRACT(DOW FROM ldate) FROM orders;
date_part
-----------
4
4
5
2
(4 rows)
Now, for a timestamp field:
ntest=# \d timestamptest;
Table "public.timestamptest"
Column | Type | Modifiers
--------+-----------------------------+-----------
x | timestamp without time zone |
And:
ntest=# SELECT * FROM timestamptest;
x
----------------------------
2016-04-18 22:52:30.020979
(1 row)
And DOW (again, note the two FROMs):
ntest=# SELECT EXTRACT(DOW FROM x) FROM timestamptest;
date_part
-----------
1
Quirky syntax (the double colon is also OK, if you want to be really explicit), but it works fine once you've mastered it.
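For completeness, the standard SQL CAST() spelling does the same job as the double colon (same date as above, so the same result is expected):
ntest=# SELECT EXTRACT(DOW FROM CAST('2011-07-07' AS DATE));
date_part
-----------
4
(1 row)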
Best Answer
Per Verace's suggestion I created a test table with 10M records. The TL;DR: indexes on datetime functions need to be explicit; any conversion of that datetime column for joining/filtering will not be served by an index.
Table size is 1.7 GB. All of the indexes are 214 MB. From my tests the index on obs_ts is never used (probably because these comparisons involve changing the data type of the timestamp column...). The function-specific indexes are used, so there is definitely a performance/index-size tradeoff.
[Charts: Day of Week, Hour of Day]
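As a sketch of the kind of function-specific (expression) indexes being compared here, assuming PostgreSQL: the table name obs is invented, only the obs_ts column comes from the answer, and obs_ts is assumed to be timestamp without time zone (so that EXTRACT is immutable and therefore indexable):

-- Plain index on the raw timestamp:
CREATE INDEX obs_ts_idx ON obs (obs_ts);

-- Expression indexes matching the functions used for filtering
-- (the extra parentheses mark them as expressions):
CREATE INDEX obs_dow_idx  ON obs ((EXTRACT(DOW  FROM obs_ts)));
CREATE INDEX obs_hour_idx ON obs ((EXTRACT(HOUR FROM obs_ts)));

-- Served by obs_dow_idx, because the predicate repeats the indexed
-- expression exactly:
SELECT count(*) FROM obs WHERE EXTRACT(DOW FROM obs_ts) = 1;

-- NOT served by obs_ts_idx: the column is transformed before the
-- comparison, which is why the plain timestamp index is never used:
SELECT count(*) FROM obs WHERE EXTRACT(HOUR FROM obs_ts) = 9;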