What slows down this histogram query

optimizationoracleplsql

I have a table my_link_t with columns t.weight type number and categorical speedcat type integer values from 2 to 8. I want to split data in buckets from min=0, max=0.6, step = 0.001 and build a 3D plot to see weight distribution per each category.

initial data looks like
weight speedcat
0.0234 2
0.8643 6
0.1854 7
(hundred million more entries with weight between 0 and 0.6 and speedcat between 2 and 8)

These queries return correct results and complete in less than a minute each:

--repeat for each variable. Here we loook for speedcat =8
--It takes seconds to run this query
create table   histogram_tbl_8  as (
  select  ttt."Start" as bucket_index, ttt.hist_row as bin8  --here
  FROM ((
  SELECT  Bucket*1 "Start" , Bucket "End", Count(Bucket) hist_row
  FROM (SELECT WIDTH_BUCKET (weight, 0, 0.6, 601) Bucket FROM my_link_t  where speedcat=8)
      GROUP BY Bucket ORDER BY Bucket ) ttt   )   );

The above query is repeated seven times for speedcat in range 2..8

--if a bin is empty populate it with zero, don't skip it.
create table histogram_output as (
select  tr.bucket_index,
    CASE
        WHEN 1 > (select count(*) from histogram_tbl_2   htm where htm.bucket_index = tr.bucket_index)   THEN 0
        ELSE (select htm.bin2 from histogram_tbl_2   htm where htm.bucket_index =  tr.bucket_index and rownum = 1)
     END
     as b2,
     --same for b3-b7
      CASE
        WHEN 1 > (select count(*) from histogram_tbl_8   htm where htm.bucket_index = tr.bucket_index)   THEN 0
        ELSE     (select htm.bin8    from histogram_tbl_8   htm where htm.bucket_index =  tr.bucket_index and rownum = 1)
     END
     as b8
     FROM  (SELECT LEVEL as bucket_index, 0 as b2, /* 0 as b3, 0 as b4, 0 as b5, 0 as b6, 0 as b7, */ 0 as b8  FROM DUAL CONNECT BY LEVEL < 600)  tr
     )

Finally

      select sum(b2), sum(b3),sum(b4),sum(b5),sum(b6),sum(b7),sum(b8) from histogram_output
       select bucket_index,
              round(b2 * 1000000 / 12921) as  b2, --normalize so that total is 1000000 ppm
              -- repeat for b3-b7
              round(b8 * 1000000 / 6262) as  b8 --normalize so that total is 1000000 ppm
         from histogram_output

I get a table like

bin_end speedcat_2 speedcat_3 speedcat_4 .. speedcat_8
0.001
0.002
..
0.599
0.600

Showing ppm of objects within this category and this bin
Now, when I combine queries in

-- DONT USE THE EXAMPLE BELOW - it is ineefficient  (runs 2+ hours instead of seconds for the method above)
       SELECT  Bucket_2*1 "Start" , Bucket_2 "End",
       Count(Bucket_2) as b2,
       --same for b3 .. b7
       Count(Bucket_8) as b8

      FROM
      (
          SELECT WIDTH_BUCKET (t2.weight, 0, 0.6, 601) Bucket_2,
                 --same for t3,.. t7
                 WIDTH_BUCKET (t8.weight, 0, 0.6, 601) Bucket_8

           FROM (select weight from my_link_t where  speedcat = 2) t2,
           -- ..speedcat = 3) t3, .. speedcat = 4) t4, etc
            (select weight from my_link_t where  speedcat = 8 ) t8

      )
      GROUP BY Bucket_2 ORDER BY Bucket_2
      ------

Query runs several hours (some 500 times longer runtime than individual queries) until I kill it. Books recommend to do all data slicing in SQL. This example suggests that loading data to Java and slicing it there might be better in case of complex queries.

What can cause the difference?

Best Answer

Short Answer

Your 7-way Cartesian JOIN is going to have some serious performance problems.

Long Answer

Think in Sets.

I am assuming that your Data Set needs to contain: speedcat, bucket_index, count(*)

The simple solution for that Data Set is straight forward:

select t.speedcat
  , WIDTH_BUCKET (t.weight, 0, 0.6, 601) as bucket_index
  , count(*) N
from my_link_t t
group by t.speedcat, WIDTH_BUCKET (t.weight, 0, 0.6, 601)

This is format (x,y,z) is the expect format for most graphing packages.

If you want the results in a "grid" format, then PIVOT the results.

with data as (
  -- same SELECT as previous answer
  select t.speedcat
    , WIDTH_BUCKET (t.weight, 0, 0.6, 601) as bucket_index
    , count(*) N
  from my_link_t t
  group by t.speedcat, WIDTH_BUCKET (t.weight, 0, 0.6, 601)
)
select *
from data
  pivot (
    sum(N)
    for speedcat in ( 2 as "speedcat_2"
                     ,3 as "speedcat_3"
                     ,4 as "speedcat_4"
                     ,5 as "speedcat_5"
                     ,6 as "speedcat_6"
                     ,7 as "speedcat_7"
                     ,8 as "speedcat_8"
                     )
  )