I don't see a lot of opportunity for improvement.
The index you added was probably a big help, because it's being used for the range matching on the WHERE clause (type => range, key => tran_date), and it's being used as a covering index (extra => using index), avoiding the need to seek into the table to fetch the row data.
But since you're using functions to construct the financial_year value for the GROUP BY, neither "Using filesort" nor "Using temporary" can be avoided. Those aren't the real cost, though; the real cost is that you're evaluating MONTH(tran_date) 346,485 times and YEAR(tran_date) at least that many times... although ~700,000 function calls in about one second doesn't actually seem too bad.
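To illustrate the shape of the problem, a query along these lines (a hypothetical sketch only; the actual column list, date range, and the April fiscal-year boundary are all assumptions) has to compute the grouping value for every row, so MySQL can't read the rows in already-grouped order:

SELECT YEAR(tran_date) + IF(MONTH(tran_date) >= 4, 1, 0) AS financial_year,
       COUNT(*) AS txn_count
FROM transactions1
WHERE tran_date >= '2013-04-01'
GROUP BY financial_year;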
Plan B: I am definitely not a fan of storing redundant data, and I'm dead-set against making the application responsible for maintaining it... but one option I might be tempted to try would be to create a dashboard_stats_by_financial_year table, and use after-insert/update/delete triggers on the transactions1 table to manage keeping those stats current.
That option has a cost, of course -- adding to the amount of time it takes to update/insert/delete a transaction... but, waiting > 1200 milliseconds for stats for your dashboard is a cost, too. So it may come down to whether you want to pay for it now or pay for it later.
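For what it's worth, a minimal sketch of that idea, assuming an April fiscal-year boundary and tracking only a row count (the UPDATE and DELETE triggers would follow the same pattern, adjusting both the old and the new row's year):

CREATE TABLE dashboard_stats_by_financial_year (
  financial_year SMALLINT NOT NULL PRIMARY KEY,
  txn_count      BIGINT   NOT NULL DEFAULT 0
);

DELIMITER //
CREATE TRIGGER transactions1_ai AFTER INSERT ON transactions1
FOR EACH ROW
BEGIN
  -- bump the counter for the fiscal year this new row belongs to
  INSERT INTO dashboard_stats_by_financial_year (financial_year, txn_count)
  VALUES (YEAR(NEW.tran_date) + IF(MONTH(NEW.tran_date) >= 4, 1, 0), 1)
  ON DUPLICATE KEY UPDATE txn_count = txn_count + 1;
END//
DELIMITER ;

The dashboard then reads a handful of pre-aggregated rows instead of scanning ~350,000 transactions.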
I won't comment on spills, tempdb, or hints, because the query seems too simple to need that much consideration. I think SQL Server's optimizer will do its job quite well if there are indexes suited to the query.
And your splitting of the statement into two queries is good, as it shows which indexes will be useful. The first part:
(select convert(bigint, Value) NodeId
from Oav.ValueArray
where PropertyId = 3331
and ObjectId = 3540233
and Sequence = 2)
needs an index on (PropertyId, ObjectId, Sequence) including the Value. I'd make it UNIQUE to be safe. The query would throw an error at runtime anyway if more than one row were returned, so it's good to ensure in advance, with the unique index, that this can't happen:
CREATE UNIQUE INDEX
PropertyId_ObjectId_Sequence_UQ
ON Oav.ValueArray
(PropertyId, ObjectId, Sequence) INCLUDE (Value) ;
The second part of the query:
select Value
from Oav.ValueArray
where ObjectId = @a
and PropertyId = 2840
needs an index on (PropertyId, ObjectId) including the Value:
CREATE INDEX
PropertyId_ObjectId_IX
ON Oav.ValueArray
(PropertyId, ObjectId) INCLUDE (Value) ;
If efficiency is not improved, or these indexes are not used, or there are still discrepancies in the row estimates, then the query will need further investigation.
In that case, the conversions (a consequence of the EAV design and of storing different datatypes in the same column) are a probable cause, and your solution of splitting the query into two parts (as @Aaron Bertrand and @Paul White commented) seems natural and the way to go. A redesign that stores the different datatypes in their own columns might be another option.
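For completeness, a sketch of the two-part approach wired together with a variable (reusing the @a variable and the literal ids from the queries above), so each SELECT stays simple enough to use the indexes suggested here:

DECLARE @a bigint ;

SELECT @a = convert(bigint, Value)
FROM Oav.ValueArray
WHERE PropertyId = 3331
  AND ObjectId = 3540233
  AND Sequence = 2 ;

SELECT Value
FROM Oav.ValueArray
WHERE ObjectId = @a
  AND PropertyId = 2840 ;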
Best Answer
Giving a complete overview of statistics and how they work feels a bit out of scope for a question here. You can find a ton of information about this by searching with your favorite search engine. This white paper is the first result for me.
However, I can say a little bit about your query, reproduced here:
When trying to figure out how Oracle arrived at a cardinality estimate I often find it helpful to break the query into smaller pieces and to get cardinality estimates on those smaller pieces. Consider this piece:
Statistics (if they exist) for the indicator_flag column store how many NULL values are in that column, so Oracle can use the statistics object directly to derive an estimate. This estimate can be off if the statistics object isn't present, if statistics are outdated, or if statistics were gathered with the wrong sample rate. For your query I'm going to guess that around 24% of the rows in the table have a NULL value for indicator_flag, so the estimate here should be around 220k rows.

Consider this piece:
Here a function is applied to the column. That can make things more complicated, because Oracle won't store statistics on the length of the office_number column directly. In this situation the RDBMS needs to make a guess. On the version of Oracle that I'm using, the guess appears to be pretty simple: 1% of the rows of the table for a comparison of the length against an integer value (comparisons to NULL or NOT NULL use the statistics on the column). This is very likely to be inaccurate, because Oracle is making a guess that isn't based on your data. The estimate as calculated by Oracle should be around 9100 rows.

Consider this piece:
The analysis above applies to this part as well, so the estimate as calculated by Oracle should be around 9100 rows.
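If you want to verify these piece-by-piece numbers yourself, you can look at the column statistics and at the optimizer's estimate for each fragment in isolation. A sketch, with a placeholder table name (your_table) standing in for the real table:

-- what the statistics say about NULLs in indicator_flag
SELECT num_nulls, num_distinct, sample_size, last_analyzed
FROM   user_tab_col_statistics
WHERE  table_name = 'YOUR_TABLE'
AND    column_name = 'INDICATOR_FLAG';

-- the optimizer's estimate for a single predicate in isolation
EXPLAIN PLAN FOR
SELECT * FROM your_table WHERE LENGTH(office_number) = 6;

SELECT * FROM TABLE(dbms_xplan.display);

The Rows column of the plan output is the cardinality estimate for that single predicate; the literal 6 is just a stand-in for whatever length your query actually compares against.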
How does Oracle combine the estimates to get a cardinality estimate for the full query? In my experience it tends to assume the predicates are independent unless it has better information. So your final estimate should be 910087 * 0.01 * 0.01 * 0.24 = 22 rows.
If you need the estimate to be better in order to improve query performance, you have a few options, depending on where the issue is. You can create a virtual column based on the length, create extended (multi-column) statistics to make the relationship between columns more obvious to Oracle, add a dynamic sampling hint to the query, and so on.
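Rough sketches of those options, again with placeholder object names (and keep in mind the extended statistics only help if the columns really are correlated):

-- a virtual column on the length, so statistics can be gathered on it
ALTER TABLE your_table ADD (office_number_len AS (LENGTH(office_number)));
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'YOUR_TABLE');

-- extended (column group) statistics
SELECT DBMS_STATS.CREATE_EXTENDED_STATS(USER, 'YOUR_TABLE', '(INDICATOR_FLAG, OFFICE_NUMBER)')
FROM dual;
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'YOUR_TABLE');

-- dynamic sampling on the problem query
SELECT /*+ DYNAMIC_SAMPLING(t 4) */ COUNT(*)
FROM   your_table t
WHERE  t.indicator_flag IS NULL
AND    LENGTH(t.office_number) = 6;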