What could be the possible reasons for cardinality estimates being very inaccurate in an execution plan

optimizationoracleoracle-11g-r2performancequery-performance

I have a query that is selecting some data out of a table. The execution plan shows that an INDEX (FAST FULL SCAN) is used and the cardinality is 22.

On the running the query it gives 227,652 rows. I understand that query optimiser relies on row count estimates instead of actually counting rows.

So I run a select table_name, num_rows from dba_tables where table_name = 'reservations';

And it gives 910,087. So why is the cardinality estimate in query plan off by such a large margin?

Query as request in comments:

SELECT COUNT(*) FROM reservations WHERE length(office_number=2) and length(regexp_substr(text, '\d*')) = 3 and indicator_flag is null;

create index res_idx on reservations (office_number, text, indicator_flag);

Best Answer

Giving a complete overview of statistics and how they work feels a bit out of scope for a question here. You can find a ton of information about this by searching with your favorite search engine. This white paper is the first result for me.

However, I can say a little bit about your query, reproduced here:

SELECT COUNT(*) 
FROM reservations 
WHERE length(office_number)=2
and length(regexp_substr(text, '\d*')) = 3
and indicator_flag is null;

When trying to figure out how Oracle arrived at a cardinality estimate I often find it helpful to break the query into smaller pieces and to get cardinality estimates on those smaller pieces. Consider this piece:

SELECT COUNT(*) 
FROM reservations 
WHERE indicator_flag is null;

Statistics (if they exist) for the indicator_flag column store how many NULL values are in that column. So Oracle can use the statistics object directly to derive an estimate. This estimate can be off if the statistics object isn't present, if statistics are outdated, or if statistics were gathered with the wrong sample rate. For your query I'm going to guess that around 24% of the rows in the table have a NULL value for indicator_flag, so the estimate here should be around 220k rows.

Consider this piece:

SELECT COUNT(*) 
FROM reservations 
WHERE length(office_number)=2;

Here a function is applied to the column. That can make things more complicated because Oracle won't store statistics on the length of the office_number column directly. In this situation RDBMs need to make a guess. On the version of Oracle that I'm using the guess appears to be pretty simple: 1% of the rows of the table for integer values of length. Comparisons to NULL or NOT NULL use the statistics on the column. This is very likely to be inaccurate because Oracle is making a guess that isn't based on your data. The estimate as calculated by Oracle should be around 9100 rows.

Consider this piece:

SELECT COUNT(*) 
FROM reservations 
WHERE 
length(regexp_substr(text, '\d*')) = 3;

The analysis above applies to this part as well, so the estimate as calculated by Oracle should be around 9100 rows.

How does Oracle combine the estimates to get a cardinality estimate for the full query? In my experience it tends to assume the predicates are independent unless it has better information. So your final estimate should be 910087 * 0.01 * 0.01 * 0.24 = 22 rows.

If you need the estimate to be better to improve the performance of queries you have a few options depending on where the issue is. You can create virtual columns based on the length, you can create multi column statistics to make the relationship between columns more obvious to Oracle, you can add a dynamic sampling hint to the query, and so on.