Now Our team are involved in a forum data analysis project. Its target data is so huge(I think). We should find the potential informations from it. Such as the popular products,the customers activity,also the complains of customers.
one case:(we try to analysis the customers' complain post)
we have the following tables:
Threads(ThreadID,TechnologyID,CreateTime,IsAnswer)20 millions of data
Posts(PostID,ThreadID,CreateTime,IsAnswer) 100 millions of data
--(multiply posts can set as Answer)
CustomerConcerns(ThreadID,PostID,NegativeWords,Verbatim)
--(if complained, there will be a record in it.)
We also contain many mapping tables(millions).
(When thread was set as Answered but it still have customers involved in it and it recored in the CustomerConcerns table. It looked as one complain.)
So from the three tables, we need to find the complained threads, also how many compains after answered.
select t1.ThreadID,t1.CreateTime,MaxAnswerDate,MinAnswerDate,p1.MaxReplyDate,p1.Replies,p1.Users
from
(
select th.ThreadID,th.CreateTime,max(p.CreateTime) as MaxAnswerDate,min(p.CreateTime) as MinAnswerDate
from Thread th,TopConcernTest tt,Post p
where tt.ThreadID=th.ThreadID and tt.[Type]=1 and th.LanguageID=58 and th.TechnologyID=''
and th.IsAnswer=1
and th.BigintTime>=20130101 and th.BigintTime<=20130501
and p.ThreadID=th.ThreadID and p.IsAnswer=1
group by th.ThreadID,th.CreateTime
-- First I try to get the MaxAnswerDate and MinAnswerDate of the threads.
-- Because The TopConcerTest is small.(compare with thread, post) So I joined with
-- it.
)t1
cross apply
(
select max(tp.CreateTime) as MaxReplyDate,Count(tp.PostID) as Replies,count(distinct tp.AuthorID) Users
from
(
select p.CreateTime,p.IsAnswer,p.AuthorID,p.PostID from Post p
where t1.threadID=p.ThreadID
)tp
where tp.IsAnswer<>1 and tp.CreateTime>MinAnswerDate
)p1
where p1.MaxReplyDate is not null
-- I use a cross apply here. Because I think I have minimize the data before.
-- And I try to get the MaxReplyDate and How many post and unique users
-- after the first answered reply.
)t2
inner join
(
select tt.ID as TopConcerID,th.ThreadID,th.ThreadTitle,th.ThreadUrl,p.PostID,p.AuthorID,p.CreateTime as PostCreateTime,tt.Entities,tt.NegativeWords,tt.Verbatim
from Thread th,TopConcernTest tt,Post p
where tt.ThreadID=th.ThreadID and tt.[Type]=1 and th.LanguageID=58 and th.TechnologyID=''
and th.BigintTime>=20130101 and th.BigintTime<=20130501
and th.IsAnswer=1
and p.ThreadID=th.ThreadID and tt.PostID=p.PostID
)p2
on t2.ThreadID=p2.ThreadID and p2.PostCreateTime>t2.MinAnswerDate
-- I get the threadTitle,ThreadUrl and the Complain data here.
-- You see I may get these data before instead to query them(ThreadTable,TopConcernTable) again. But I found I will spend much time to group them
-- and order.
)t3
where t3.ID between 1 and 20
-- Actually I have a page split at here. For I still have to query more infos.
This is my query. Sorry it's so complex. But My questions most of the time I can hardly to tun my code.
- I want to minimize the range of my query data. So I use the TopConcernTest table at two place. But I also doubt whether I should scan the table twice?
- You see I scan the Post table 3 times(It a very big table.), because I have to get the MaxAnsweredTime,MinAnsweredTime,MaxReplyTime. The three time are incompatible.
Whether I should scan 3 times? - I had try to get the data from TopConcernTest table one time. But it cost much time to group,or order the data.
Because our customers' query are mostly base on 3 variables:technology,language,time. I double whether I should set these three variables into every table, so I can minimize the range of data. base the results of subquery? (It will make too many duplicate data?
Best Answer
Possible this be helpful for you -