What could be the possible reasons for cardinality estimates being very inaccurate in an execution plan

optimizationoracleoracle-11g-r2performancequery-performance

I have a query that is selecting some data out of a table. The execution plan shows that an INDEX (FAST FULL SCAN) is used and the cardinality is 22.

On the running the query it gives 227,652 rows. I understand that query optimiser relies on row count estimates instead of actually counting rows.

So I run a select table_name, num_rows from dba_tables where table_name = 'reservations';

And it gives 910,087. So why is the cardinality estimate in query plan off by such a large margin?

Query as request in comments:

SELECT COUNT(*) FROM reservations WHERE length(office_number=2) and length(regexp_substr(text, '\d*')) = 3 and indicator_flag is null;

create index res_idx on reservations (office_number, text, indicator_flag);

Best Answer

Giving a complete overview of statistics and how they work feels a bit out of scope for a question here. You can find a ton of information about this by searching with your favorite search engine. This white paper is the first result for me.

However, I can say a little bit about your query, reproduced here:

SELECT COUNT(*) 
FROM reservations 
WHERE length(office_number)=2
and length(regexp_substr(text, '\d*')) = 3
and indicator_flag is null;

When trying to figure out how Oracle arrived at a cardinality estimate I often find it helpful to break the query into smaller pieces and to get cardinality estimates on those smaller pieces. Consider this piece:

SELECT COUNT(*) 
FROM reservations 
WHERE indicator_flag is null;

Statistics (if they exist) for the indicator_flag column store how many NULL values are in that column. So Oracle can use the statistics object directly to derive an estimate. This estimate can be off if the statistics object isn't present, if statistics are outdated, or if statistics were gathered with the wrong sample rate. For your query I'm going to guess that around 24% of the rows in the table have a NULL value for indicator_flag, so the estimate here should be around 220k rows.

Consider this piece:

SELECT COUNT(*) 
FROM reservations 
WHERE length(office_number)=2;

Here a function is applied to the column. That can make things more complicated because Oracle won't store statistics on the length of the office_number column directly. In this situation RDBMs need to make a guess. On the version of Oracle that I'm using the guess appears to be pretty simple: 1% of the rows of the table for integer values of length. Comparisons to NULL or NOT NULL use the statistics on the column. This is very likely to be inaccurate because Oracle is making a guess that isn't based on your data. The estimate as calculated by Oracle should be around 9100 rows.

Consider this piece:

SELECT COUNT(*) 
FROM reservations 
WHERE 
length(regexp_substr(text, '\d*')) = 3;

The analysis above applies to this part as well, so the estimate as calculated by Oracle should be around 9100 rows.

How does Oracle combine the estimates to get a cardinality estimate for the full query? In my experience it tends to assume the predicates are independent unless it has better information. So your final estimate should be 910087 * 0.01 * 0.01 * 0.24 = 22 rows.

If you need the estimate to be better to improve the performance of queries you have a few options depending on where the issue is. You can create virtual columns based on the length, you can create multi column statistics to make the relationship between columns more obvious to Oracle, you can add a dynamic sampling hint to the query, and so on.

Related Solutions

MySQL optimization – year column grouping – using temporary table, filesort

I don't see a lot of opportunity for improvement.

The index you added was probably a big help, because it's being used for the range matching on the WHERE clause (type => range, key => tran_date), and it's being used as a covering index (extra => using index), avoiding the need to seek into the table to fetch the row data.

But since you're using functions to construct the financial_year value for the group by, both the "using filesort" and "using temporary" can't be avoided. But, those aren't the real problem. The real problem is that you're evaluating MONTH(tran_date) 346,485 times and YEAR(tran_date) at least that many times... ~700,000 function calls in one second doesn't seem too bad.

Plan B: I am definitely not a fan of storing redundant data, and I'm dead-set against making the application responsible for maintaining it... but one option I might be tempted to try would be to create a dashboard_stats_by_financial_year table, and use after-insert/update/delete triggers on the transactions1 table to manage keeping those stats current.

That option has a cost, of course -- adding to the amount of time it takes to update/insert/delete a transaction... but, waiting > 1200 milliseconds for stats for your dashboard is a cost, too. So it may come down to whether you want to pay for it now or pay for it later.

Sql-server – How to row estimates be improved in order to reduce chances of spills to tempdb

I won't comment about spills, tempdb or hints because the query seems pretty simple to need that much consideration. I think SQL-Server's optimizer will do its job quite good, if there are indexes suited for the query.

And your splitting into two queries is good as it shows what indexes will be useful. The first part:

(select convert(bigint, Value) NodeId
 from Oav.ValueArray
 where PropertyId = 3331  
   and ObjectId = 3540233
   and Sequence = 2)

needs an index on (PropertyId, ObjectId, Sequence) including the Value. I'd make it UNIQUE to be safe. The query would throw error anyway during runtime if more than one rows were returned, so it's good to ensure in advance that this won't happen, with the unique index:

CREATE UNIQUE INDEX
    PropertyId_ObjectId_Sequence_UQ
  ON Oav.ValueArray
    (PropertyId, ObjectId, Sequence) INCLUDE (Value) ;

The second part of the query:

select Value
  from Oav.ValueArray
 where ObjectId = @a               
   and PropertyId = 2840

needs an index on (PropertyId, ObjectId) including the Value:

CREATE INDEX
    PropertyId_ObjectId_IX
  ON Oav.ValueArray
    (PropertyId, ObjectId) INCLUDE (Value) ;

If efficiency is not improved or these indexes were not used or there are still differences in row estimates appearing, then there would be need to look further into this query.

In that case, the conversions (needed from the EAV design and the storing of different datatypes in the same columns) are a probable cause and your solution of splitting (as @AAron Bertrand and @Paul White comment) the query into two parts seems natural and the way to go. A redesign so to have different datatypes in their respective columns might be another.

Best Answer

Related Solutions

MySQL optimization – year column grouping – using temporary table, filesort

Sql-server – How to row estimates be improved in order to reduce chances of spills to tempdb

Related Question