Sql-server – How to enhance the SQL query performance for thousands and millions of huge data

performancesql server

Now Our team are involved in a forum data analysis project. Its target data is so huge(I think). We should find the potential informations from it. Such as the popular products,the customers activity,also the complains of customers.

one case:(we try to analysis the customers' complain post)
we have the following tables:

Threads(ThreadID,TechnologyID,CreateTime,IsAnswer)20 millions of data

Posts(PostID,ThreadID,CreateTime,IsAnswer) 100 millions of data
--(multiply posts can set as Answer)  

CustomerConcerns(ThreadID,PostID,NegativeWords,Verbatim)
--(if complained, there will be a record in it.)  

 We also contain many mapping tables(millions).

(When thread was set as Answered but it still have customers involved in it and it recored in the CustomerConcerns table. It looked as one complain.)

So from the three tables, we need to find the complained threads, also how many compains after answered.

select t1.ThreadID,t1.CreateTime,MaxAnswerDate,MinAnswerDate,p1.MaxReplyDate,p1.Replies,p1.Users
from
(
    select  th.ThreadID,th.CreateTime,max(p.CreateTime) as MaxAnswerDate,min(p.CreateTime) as MinAnswerDate
    from Thread th,TopConcernTest tt,Post p
    where  tt.ThreadID=th.ThreadID and tt.[Type]=1 and th.LanguageID=58 and th.TechnologyID=''
        and th.IsAnswer=1
        and th.BigintTime>=20130101 and th.BigintTime<=20130501
        and p.ThreadID=th.ThreadID and p.IsAnswer=1
        group by th.ThreadID,th.CreateTime
    -- First I try to get the MaxAnswerDate and MinAnswerDate of the threads.
    -- Because The TopConcerTest is small.(compare with thread, post) So I joined with
    -- it.
)t1
cross apply 
(
    select max(tp.CreateTime) as MaxReplyDate,Count(tp.PostID) as Replies,count(distinct tp.AuthorID) Users 
        from 
    (
        select p.CreateTime,p.IsAnswer,p.AuthorID,p.PostID from Post p
        where t1.threadID=p.ThreadID 
    )tp
    where tp.IsAnswer<>1 and tp.CreateTime>MinAnswerDate
                )p1
    where p1.MaxReplyDate is not null 
    -- I use a cross apply here. Because I think I have minimize the data before.
    -- And I try to get the MaxReplyDate and How many post and unique users
    -- after the first answered reply.
)t2
inner join
(
    select tt.ID as TopConcerID,th.ThreadID,th.ThreadTitle,th.ThreadUrl,p.PostID,p.AuthorID,p.CreateTime as PostCreateTime,tt.Entities,tt.NegativeWords,tt.Verbatim
    from Thread th,TopConcernTest tt,Post p
    where  tt.ThreadID=th.ThreadID and tt.[Type]=1 and th.LanguageID=58 and th.TechnologyID=''
        and th.BigintTime>=20130101 and th.BigintTime<=20130501
        and th.IsAnswer=1 
        and p.ThreadID=th.ThreadID and tt.PostID=p.PostID
        )p2
    on t2.ThreadID=p2.ThreadID and p2.PostCreateTime>t2.MinAnswerDate
    -- I get the threadTitle,ThreadUrl and the Complain data here.
    -- You see I may get these data before instead to query them(ThreadTable,TopConcernTable) again. But I found I will spend much time to group them
    -- and order.
        )t3
where t3.ID between 1 and 20
-- Actually I have a page split at here. For I still have to query more infos.

This is my query. Sorry it's so complex. But My questions most of the time I can hardly to tun my code.

  1. I want to minimize the range of my query data. So I use the TopConcernTest table at two place. But I also doubt whether I should scan the table twice?
  2. You see I scan the Post table 3 times(It a very big table.), because I have to get the MaxAnsweredTime,MinAnsweredTime,MaxReplyTime. The three time are incompatible.
    Whether I should scan 3 times?
  3. I had try to get the data from TopConcernTest table one time. But it cost much time to group,or order the data.

Because our customers' query are mostly base on 3 variables:technology,language,time. I double whether I should set these three variables into every table, so I can minimize the range of data. base the results of subquery? (It will make too many duplicate data?

Best Answer

Possible this be helpful for you -

;WITH thread AS 
(
    SELECT 
          th.ThreadID
        , th.CreateTime
        , th.ThreadTitle
        , th.ThreadUrl
    FROM dbo.Thread th
    WHERE th.LanguageID = 58 
        AND th.TechnologyID = ''
        AND th.BigintTime BETWEEN 20130101 AND 20130501
        AND th.IsAnswer = 1
)
SELECT 
      t1.ThreadID
    , t1.CreateTime
    , t1.MaxAnswerDate
    , t1.MinAnswerDate
    , p1.MaxReplyDate
    , p1.Replies
    , p1.Users
FROM
(
    SELECT  
          th.ThreadID
        , th.CreateTime
        , MaxAnswerDate = MAX(p.CreateTime)
        , MinAnswerDate = MIN(p.CreateTime)
    FROM thread th
    JOIN dbo.TopConcernTest tt ON tt.ThreadID = th.ThreadID
    JOIN dbo.Post p ON p.ThreadID = th.ThreadID
    WHERE tt.[Type] = 1
        AND p.IsAnswer = 1
    GROUP BY    
          th.ThreadID
        , th.CreateTime
)t1
CROSS APPLY (
    SELECT 
          MaxReplyDate = MAX(p.CreateTime)
        , Replies = COUNT(p.PostID)
        , Users = COUNT(DISTINCT p.AuthorID) 
    FROM dbo.Post p
    WHERE p.IsAnswer != 1
        AND t1.ThreadID = p.ThreadID
        AND p.CreateTime > t1.MinAnswerDate 
)p1
JOIN (
    SELECT 
          TopConcerID = tt.ID
        , th.ThreadID
        , th.ThreadTitle
        , th.ThreadUrl
        , p.PostID
        , p.AuthorID
        , PostCreateTime = p.CreateTime
        , tt.Entities
        , tt.NegativeWords
        , tt.Verbatim
    FROM thread th
    JOIN TopConcernTest tt ON tt.ThreadID = th.ThreadID
    JOIN Post p ON tt.PostID = p.PostID AND p.ThreadID = th.ThreadID 
    WHERE tt.[Type] = 1
)t3 ON t2.ThreadID = p2.ThreadID AND p2.PostCreateTime > t2.MinAnswerDate
WHERE p1.MaxReplyDate IS NOT NULL 
    AND t3.TopConcerID BETWEEN 1 AND 20