I have a table my_link_t
with columns t.weight
type number
and categorical speedcat
type integer values from 2 to 8. I want to split data in buckets from min=0, max=0.6, step = 0.001 and build a 3D plot to see weight distribution per each category.
initial data looks like
weight speedcat
0.0234 2
0.8643 6
0.1854 7
(hundred million more entries with weight between 0 and 0.6 and speedcat between 2 and 8)
These queries return correct results and complete in less than a minute each:
--repeat for each variable. Here we loook for speedcat =8
--It takes seconds to run this query
create table histogram_tbl_8 as (
select ttt."Start" as bucket_index, ttt.hist_row as bin8 --here
FROM ((
SELECT Bucket*1 "Start" , Bucket "End", Count(Bucket) hist_row
FROM (SELECT WIDTH_BUCKET (weight, 0, 0.6, 601) Bucket FROM my_link_t where speedcat=8)
GROUP BY Bucket ORDER BY Bucket ) ttt ) );
The above query is repeated seven times for speedcat
in range 2..8
--if a bin is empty populate it with zero, don't skip it.
create table histogram_output as (
select tr.bucket_index,
CASE
WHEN 1 > (select count(*) from histogram_tbl_2 htm where htm.bucket_index = tr.bucket_index) THEN 0
ELSE (select htm.bin2 from histogram_tbl_2 htm where htm.bucket_index = tr.bucket_index and rownum = 1)
END
as b2,
--same for b3-b7
CASE
WHEN 1 > (select count(*) from histogram_tbl_8 htm where htm.bucket_index = tr.bucket_index) THEN 0
ELSE (select htm.bin8 from histogram_tbl_8 htm where htm.bucket_index = tr.bucket_index and rownum = 1)
END
as b8
FROM (SELECT LEVEL as bucket_index, 0 as b2, /* 0 as b3, 0 as b4, 0 as b5, 0 as b6, 0 as b7, */ 0 as b8 FROM DUAL CONNECT BY LEVEL < 600) tr
)
Finally
select sum(b2), sum(b3),sum(b4),sum(b5),sum(b6),sum(b7),sum(b8) from histogram_output
select bucket_index,
round(b2 * 1000000 / 12921) as b2, --normalize so that total is 1000000 ppm
-- repeat for b3-b7
round(b8 * 1000000 / 6262) as b8 --normalize so that total is 1000000 ppm
from histogram_output
I get a table like
bin_end speedcat_2 speedcat_3 speedcat_4 .. speedcat_8
0.001
0.002
..
0.599
0.600
Showing ppm of objects within this category and this bin
Now, when I combine queries in
-- DONT USE THE EXAMPLE BELOW - it is ineefficient (runs 2+ hours instead of seconds for the method above)
SELECT Bucket_2*1 "Start" , Bucket_2 "End",
Count(Bucket_2) as b2,
--same for b3 .. b7
Count(Bucket_8) as b8
FROM
(
SELECT WIDTH_BUCKET (t2.weight, 0, 0.6, 601) Bucket_2,
--same for t3,.. t7
WIDTH_BUCKET (t8.weight, 0, 0.6, 601) Bucket_8
FROM (select weight from my_link_t where speedcat = 2) t2,
-- ..speedcat = 3) t3, .. speedcat = 4) t4, etc
(select weight from my_link_t where speedcat = 8 ) t8
)
GROUP BY Bucket_2 ORDER BY Bucket_2
------
Query runs several hours (some 500 times longer runtime than individual queries) until I kill it. Books recommend to do all data slicing in SQL. This example suggests that loading data to Java and slicing it there might be better in case of complex queries.
What can cause the difference?
Best Answer
Short Answer
Your 7-way Cartesian
JOIN
is going to have some serious performance problems.Long Answer
Think in Sets.
I am assuming that your Data Set needs to contain: speedcat, bucket_index, count(*)
The simple solution for that Data Set is straight forward:
This is format (x,y,z) is the expect format for most graphing packages.
If you want the results in a "grid" format, then
PIVOT
the results.