Bloom Filter about filtering in oracle

execution-planoracleoracle-12cparallelism

I read a great blog about bloom filter in Oracle, and I have a few questions about the Bloom filter use case – Filtering in this blog.

    SQL> CREATE TABLE DIMENSION
      (
         col1,
         col2
      ) AS
      SELECT MOD( ROWNUM, 10 ),
             ROWNUM
      FROM   DUAL
      CONNECT BY ROWNUM <= 100;

    Table created.

    SQL> CREATE TABLE FACT
      (
         col1,
         col2
      ) AS
      SELECT MOD( ROWNUM, 25 ),
             ROWNUM
      FROM   DUAL
      CONNECT BY ROWNUM <= 1000000;

    Table created.

    SQL> exec dbms_stats.gather_table_stats(USER, 'DIMENSION');

    PL/SQL procedure successfully completed.

    SQL> exec dbms_stats.gather_table_stats(USER, 'FACT');

    PL/SQL procedure successfully completed.


    SQL> SELECT /*+ parallel(8) */ count( * )
         FROM   FACT,
                DIMENSION
         WHERE  DIMENSION.col1 = 1 AND
                DIMENSION.col2 = FACT.col2;

      COUNT(*)
    ----------
           100

The execution plan is as follows:

    SQL> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS ALL'));

    PLAN_TABLE_OUTPUT
    -------------------------------------------------------------------------------------------------
    SQL_ID  587ty27pjwphu, child number 0
    -------------------------------------
    SELECT /*+ parallel(8) */ count(*) FROM FACT, DIMENSION WHERE DIMENSION.col1 = 1 AND DIMENSION.col2 = FACT.col2

    Plan hash value: 4106007966
    -----------------------------------------------------------------------------------------------------
    | Id  | Operation                | Name      | Starts | E-Rows |   TQ  |IN-OUT| PQ Distrib | A-Rows |
    -----------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT         |           |      1 |        |       |      |            |      1 |
    |   1 |  SORT AGGREGATE          |           |      1 |      1 |       |      |            |      1 |
    |   2 |   PX COORDINATOR         |           |      1 |        |       |      |            |      8 |
    |   3 |    PX SEND QC (RANDOM)   | :TQ10000  |      0 |      1 | Q1,00 | P->S | QC (RAND)  |      0 |
    |   4 |     SORT AGGREGATE       |           |      4 |      1 | Q1,00 | PCWP |            |      4 |
    |*  5 |      HASH JOIN           |           |      4 |    100 | Q1,00 | PCWP |            |    100 |
    |   6 |       JOIN FILTER CREATE | :BF0000   |      4 |    100 | Q1,00 | PCWP |            |    400 |
    |*  7 |        TABLE ACCESS FULL | DIMENSION |      4 |    100 | Q1,00 | PCWP |            |    400 |
    |   8 |       JOIN FILTER USE    | :BF0000   |      4 |   1000K| Q1,00 | PCWP |            |    102 |
    |   9 |        PX BLOCK ITERATOR |           |      4 |   1000K| Q1,00 | PCWC |            |    102 |
    |* 10 |         TABLE ACCESS FULL| FACT      |     71 |   1000K| Q1,00 | PCWP |            |    102 |
    -----------------------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------
       5 - access("DIMENSION"."COL2"="FACT"."COL2")
       7 - filter("DIMENSION"."COL1"=1)
      10 - access(:Z>=:Z AND :Z<=:Z)
           filter(SYS_OP_BLOOM_FILTER(:BF0000,"FACT"."COL2"))

    Note
    -----
       - Degree of Parallelism is 8 because of hint

My question is as follows:

Does the bloom filter prevents all rows or some rows from table FACT that do not join table DIMENSION from being needlessly distributed?
Is the bloom filter created in parallel by the set of parallel execution processes that scanned table DIMENSION or the set of parallel execution processes that later perform hash join?

UPDATED

My second question is not very well, actually I want to confirm my understanding of the execution plan as follows:

the execution process first load DIMENSION table data (producer) and at the same time send the rows to the execution process which later perform hash join (consumer). These consumers execution process first build bloom filter.
the execution process load FACT table data (producer) ant at the same time send the rows to the execution process which later perform hash join (consumer). These consumer execution process probe FACT table by bloom filter created in step 1.
DIMENSION table data which load in step 1 as driving table (build table), and FACT data which filter by bloom filter in step 2 as driven table (probe table) are doing hash join to return result.

So I think the creation of bloom filter happened in the execution process that later perform hash join not the execution process load DIMENSION table data.

Is my understanding correct?

My reference:

Best Answer

Does the bloom filter prevents all rows or some rows from table FACT that do not join table DIMENSION from being needlessly distributed?

Look at Antognini's article! A bloom filter does not prevent false positives. This means exactly that a "bloom filter prevents [...] [only] some rows from table FACT that do not join table DIMENSION from being needlessly distributed". Of course there may be situation where it avoids false positive. This is the case in the example of Joe Obbish: If one has to filter 1 value from only 25 possible values this can be done without false positives with a bloom filter.

Is the bloom filter created in parallel by the set of parallel execution processes that scanned table DIMENSION or the set of parallel execution processes that later perform hash join?

To create the bloom filter one has to scan the table DIMENSION and add every COL2 value to the filter. Look at Antognini's plan at p.5:

You can see that the PX JOIN FILTER CREATE step is done by the Q1,02 process group. This is the group that is responsible for joining the data and not the group Q1,00 that scans the table T1.

Related Solutions

Fighting filter (nested loops) execution plan from Oracle

On 11R2 I usually start with gathering more statistics for the tables:

exec DBMS_STATS.GATHER_TABLE_STATS ('SCHEMA, 'SMALL_TABLE', estimate_percent => '100');
exec DBMS_STATS.GATHER_TABLE_STATS ('SCHEMA, 'LARGE_DATA_TABLE', estimate_percent => '10');

If that helps then table preferences for automatic jobs can be set:

EXEC DBMS_STATS.SET_TABLE_PREFS ('SCHEMA', 'SMALL_TABLE', 'ESTIMATE_PERCENT', '100');
EXEC DBMS_STATS.SET_TABLE_PREFS ('SCHEMA', 'LARGE_DATA_TABLE', 'ESTIMATE_PERCENT', '10');

If large_data_table is really big (tens GB and more) then 1% or something like that may be needed.

And do not believe that in dba_tables sample_size=num_rows. For big tables actual auto sample size is much much lower. I had the SR with Oracle about that. They found actual sample percentage only from session trace file. It was ~0.004% for 170GB table.

Slow running Oracle query caused by unnecessary full table scan

_{Answer left as comments by the question author}

I think I may have found the issue, and it definitely makes sense. The type of the data column on the temp table was nvarchar2(100) while the type of the scheduleid column was char(22).

So apparently with the different types, the index cannot be used. When I added a to_char to the data column, it was able to finally use the index.

This temp table is also used in some other joins, where the data is of type varchar(100). In hindsight, maybe I could have had multiple columns on the temp table for different types, but what I have now works.

Best Answer

Related Solutions

Fighting filter (nested loops) execution plan from Oracle

Slow running Oracle query caused by unnecessary full table scan

Related Question