Bloom Filter about filtering in oracle

execution-planoracleoracle-12cparallelism

I read a great blog about bloom filter in Oracle, and I have a few questions about the Bloom filter use case – Filtering in this blog.

    SQL> CREATE TABLE DIMENSION
      (
         col1,
         col2
      ) AS
      SELECT MOD( ROWNUM, 10 ),
             ROWNUM
      FROM   DUAL
      CONNECT BY ROWNUM <= 100;

    Table created.

    SQL> CREATE TABLE FACT
      (
         col1,
         col2
      ) AS
      SELECT MOD( ROWNUM, 25 ),
             ROWNUM
      FROM   DUAL
      CONNECT BY ROWNUM <= 1000000;

    Table created.

    SQL> exec dbms_stats.gather_table_stats(USER, 'DIMENSION');

    PL/SQL procedure successfully completed.

    SQL> exec dbms_stats.gather_table_stats(USER, 'FACT');

    PL/SQL procedure successfully completed.


    SQL> SELECT /*+ parallel(8) */ count( * )
         FROM   FACT,
                DIMENSION
         WHERE  DIMENSION.col1 = 1 AND
                DIMENSION.col2 = FACT.col2;

      COUNT(*)
    ----------
           100

The execution plan is as follows:

    SQL> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS ALL'));

    PLAN_TABLE_OUTPUT
    -------------------------------------------------------------------------------------------------
    SQL_ID  587ty27pjwphu, child number 0
    -------------------------------------
    SELECT /*+ parallel(8) */ count(*) FROM FACT, DIMENSION WHERE DIMENSION.col1 = 1 AND DIMENSION.col2 = FACT.col2

    Plan hash value: 4106007966
    -----------------------------------------------------------------------------------------------------
    | Id  | Operation                | Name      | Starts | E-Rows |   TQ  |IN-OUT| PQ Distrib | A-Rows |
    -----------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT         |           |      1 |        |       |      |            |      1 |
    |   1 |  SORT AGGREGATE          |           |      1 |      1 |       |      |            |      1 |
    |   2 |   PX COORDINATOR         |           |      1 |        |       |      |            |      8 |
    |   3 |    PX SEND QC (RANDOM)   | :TQ10000  |      0 |      1 | Q1,00 | P->S | QC (RAND)  |      0 |
    |   4 |     SORT AGGREGATE       |           |      4 |      1 | Q1,00 | PCWP |            |      4 |
    |*  5 |      HASH JOIN           |           |      4 |    100 | Q1,00 | PCWP |            |    100 |
    |   6 |       JOIN FILTER CREATE | :BF0000   |      4 |    100 | Q1,00 | PCWP |            |    400 |
    |*  7 |        TABLE ACCESS FULL | DIMENSION |      4 |    100 | Q1,00 | PCWP |            |    400 |
    |   8 |       JOIN FILTER USE    | :BF0000   |      4 |   1000K| Q1,00 | PCWP |            |    102 |
    |   9 |        PX BLOCK ITERATOR |           |      4 |   1000K| Q1,00 | PCWC |            |    102 |
    |* 10 |         TABLE ACCESS FULL| FACT      |     71 |   1000K| Q1,00 | PCWP |            |    102 |
    -----------------------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------
       5 - access("DIMENSION"."COL2"="FACT"."COL2")
       7 - filter("DIMENSION"."COL1"=1)
      10 - access(:Z>=:Z AND :Z<=:Z)
           filter(SYS_OP_BLOOM_FILTER(:BF0000,"FACT"."COL2"))

    Note
    -----
       - Degree of Parallelism is 8 because of hint

My question is as follows:

  • Does the bloom filter prevents all rows or some rows from table FACT that do not join table DIMENSION from being needlessly distributed?
  • Is the bloom filter created in parallel by the set of parallel execution processes that scanned table DIMENSION or the set of parallel execution processes that later perform hash join?

UPDATED

My second question is not very well, actually I want to confirm my understanding of the execution plan as follows:

  1. the execution process first load DIMENSION table data (producer) and at the same time send the rows to the execution process which later perform hash join (consumer). These consumers execution process first build bloom filter.
  2. the execution process load FACT table data (producer) ant at the same time send the rows to the execution process which later perform hash join (consumer). These consumer execution process probe FACT table by bloom filter created in step 1.
  3. DIMENSION table data which load in step 1 as driving table (build table), and FACT data which filter by bloom filter in step 2 as driven table (probe table) are doing hash join to return result.

So I think the creation of bloom filter happened in the execution process that later perform hash join not the execution process load DIMENSION table data.

Is my understanding correct?

My reference:

Best Answer

Does the bloom filter prevents all rows or some rows from table FACT that do not join table DIMENSION from being needlessly distributed?

Look at Antognini's article! A bloom filter does not prevent false positives. This means exactly that a "bloom filter prevents [...] [only] some rows from table FACT that do not join table DIMENSION from being needlessly distributed". Of course there may be situation where it avoids false positive. This is the case in the example of Joe Obbish: If one has to filter 1 value from only 25 possible values this can be done without false positives with a bloom filter.

Is the bloom filter created in parallel by the set of parallel execution processes that scanned table DIMENSION or the set of parallel execution processes that later perform hash join?

To create the bloom filter one has to scan the table DIMENSION and add every COL2 value to the filter. Look at Antognini's plan at p.5:

enter image description here

You can see that the PX JOIN FILTER CREATE step is done by the Q1,02 process group. This is the group that is responsible for joining the data and not the group Q1,00 that scans the table T1.