Sql-server – Grouping results to get unique rows after multiple joins

join;sql server

This is my first question on dba.stackexchange; hopefully it will be worth it. I am usually more active on Stack Overflow, as I am a developer.

Disclaimer: I don't have full control over the DB schema, so please don't judge the data structure or the naming conventions 🙂

I am running this large query with multiple joins:

SELECT TOP 30   
iss.iss_lKey as IssueId,  
iss.iss_sName as IssueName,  
con.con_lKey as ContainerId,  
con.con_sName as ContainerName,  
sto.sto_lKey as StoryId,   
sto.sto_sName as StoryName,  
sto.sto_Guid as StoryGuid,  
sto.sto_sByline as Byline,  
sto.sto_created_dWhen as StoryCreatedDate,  
sto.sto_deadline_dWhen as StoryDeadline,  
sto.sto_lType as StoryType,  
sto.sto_sct_lKey as StoryCategory,  
sto.sto_created_use_lKey as CreatedBy,  
sfv.sfv_tValue as FieldValue,  
sf.sfe_lKey as StoryFieldId,  
sf.sfe_sCaption as StoryFieldCaption,   
sre.sre_lIndex as RevisionIndex  
FROM tStory30 sto  
JOIN tContainer30 con ON sto.sto_con_lKey = con.con_lKey  
JOIN tIssue30 iss ON con.con_iss_lKey = iss.iss_lKey  
LEFT OUTER JOIN tStoryRevision30 sre ON sre.sre_sto_lKey = sto.sto_lKey AND sre.sre_lIndex = 0 
LEFT OUTER JOIN tStoryField30 sf ON sre.sre_lKey = sf.sfe_sre_lKey  
LEFT OUTER JOIN tStoryFieldValue30 sfv ON sfv.sfv_sfe_lKey= sf.sfe_lKey  
WHERE sto.sto_sName LIKE '%' + @0 + '%'  
   OR sfv.sfv_tValue LIKE '%' + @0 + '%'

What I really need is only one row per StoryId, including the FieldValue that matched, if there was one. I am currently grouping in application code to produce the output, but that prevents me from paging.

Is there any way to achieve this kind of grouping in SQL, so that I can then page the result properly (using ROW_NUMBER() OVER)?

Also, I am aware that this is bad practice and that I should use full-text search. We plan to set up a Solr instance or use the full-text options in SQL Server. This is a first attempt to get something going.

EDIT:
Here is a verbal description of what I am trying to achieve.

Our app is for magazine editors.

For a given magazine they have many issues.
Each issue has many containers (a sort of logical article group).
Each container has several stories.
A story can have zero or more revisions.
The fields of a story are stored per revision (many fields per revision),
and each field has a field value.

I need to retrieve the stories that have a given text in the name or in a field value of the first revision (that's the revisionIndex = 0 condition).
But I also need to retrieve the associated data for each story (issue ID and name, container ID and name, and so on).

Hope this helps!

EDIT Sample data searching for "test". I simplified the columns to make it easier to understand.

Row | IssueId | IssueName   | ContainerId | StoryId | FieldValue
1   | 11      | IssueName A | 394         | 868     | Test Marsupilami bla bla youpi
2   | 40      | IssueName B | 6           | 631     | story save test
3   | 40      | IssueName B | 6           | 666     | test story
4   | 4       | IssueName c | 30          | 846     | test abs
5   | 4       | IssueName c | 30          | 846     | absc test
6   | 4       | IssueName c | 30          | 846     | hello test

I am able to get the row number in SQL Server for my query, but, as you can see, I get the same story multiple times.
In this case, I could simply have the following result:

Row | IssueId | IssueName   | ContainerId | StoryId | FieldValue
1   | 11      | IssueName A | 394         | 868     | Test Marsupilami bla bla youpi
2   | 40      | IssueName B | 6           | 631     | story save test
3   | 40      | IssueName B | 6           | 666     | test story
4   | 4       | IssueName c | 30          | 846     | test abs

If a story has "test" in the story name, then I am OK with a NULL value in the FieldValue column.
Which field value is selected doesn't matter much.

EDIT
OK, so I got a little further.
The following is a query I can run, and it returns exactly the rows I want. Unfortunately, it is missing the FieldValue column ~~(which is an ntext and can't be grouped)~~.
If I could select the longest field value that matches, that would be perfect.

SELECT ROW_NUMBER() OVER (ORDER BY sto.sto_created_dWhen DESC)
    AS Row,
    iss.iss_lKey                as IssueId,
    iss.iss_sName               as IssueName,
    con.con_lKey                as ContainerId,
    con.con_sName               as ContainerName,
    sto.sto_lKey                as StoryId, 
    sto.sto_sName               as StoryName,
    sto.sto_Guid                as StoryGuid,
    sto.sto_sByline             as Byline,
    sto.sto_created_dWhen       as StoryCreatedDate,
    sto.sto_deadline_dWhen      as StoryDeadline,
    sto.sto_lType               as StoryType,
    sto.sto_sct_lKey            as StoryCategory,
    sto.sto_created_use_lKey    as CreatedBy,
    --sfv.sfv_tValue                as FieldValue,
    --sf.sfe_lKey                   as StoryFieldId,
    --sf.sfe_sCaption               as StoryFieldCaption, 
    sre.sre_lIndex              as RevisionIndex
FROM tStory30 sto
JOIN tContainer30 con ON sto.sto_con_lKey = con.con_lKey
JOIN tIssue30 iss ON con.con_iss_lKey = iss.iss_lKey
LEFT OUTER JOIN tStoryRevision30 sre ON sre.sre_sto_lKey = sto.sto_lKey 
                                    AND sre.sre_lIndex = 0
LEFT OUTER JOIN tStoryField30 sf ON sre.sre_lKey = sf.sfe_sre_lKey 
LEFT OUTER JOIN tStoryFieldValue30 sfv ON sfv.sfv_sfe_lKey = sf.sfe_lKey 
WHERE sto.sto_sName LIKE '%test%'
   OR sfv.sfv_tValue LIKE '%test%'
GROUP BY
    iss.iss_lKey,           
    iss.iss_sName,          
    con.con_lKey,           
    con.con_sName,          
    sto.sto_lKey,           
    sto.sto_sName,          
    sto.sto_Guid,           
    sto.sto_sByline,        
    sto.sto_created_dWhen,
    sto.sto_deadline_dWhen, 
    sto.sto_lType   ,       
    sto.sto_sct_lKey,   
    sto.sto_created_use_lKey,
  --sfv.sfv_tValue,     
  --sf.sfe_lKey,        
  --sf.sfe_sCaption,        
    sre.sre_lIndex

Best Answer

If you just want to return one row per story rather than a set, you can use MIN() or MAX() (with a GROUP BY), or SQL Server's TOP 1 clause within a subquery. See https://stackoverflow.com/questions/4229931/how-to-write-a-query-to-get-only-first-matching-row-while-joining-two-tables .

Without sample data, and with that complex a query, I don't trust myself to write out bug-free code for you, but you should be able to work it out yourself using the technique described in that article. If not, indicate where you get stuck.
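
For illustration only, here is a rough, untested sketch of how the TOP 1 idea might be applied to the query in the question. It assumes the table and column names shown there; the variables @search, @page and @pageSize, the CTE name matches and the alias fv are made up for the example. It uses OUTER APPLY (available from SQL Server 2005) to pick at most one matching field value per story, ordered by DATALENGTH so that the longest match wins, and then pages on ROW_NUMBER().

```sql
-- Hypothetical parameters, introduced only for this sketch.
DECLARE @search   nvarchar(100) = N'test';
DECLARE @page     int = 1;
DECLARE @pageSize int = 30;

WITH matches AS (
    SELECT
        iss.iss_lKey          AS IssueId,
        iss.iss_sName         AS IssueName,
        con.con_lKey          AS ContainerId,
        con.con_sName         AS ContainerName,
        sto.sto_lKey          AS StoryId,
        sto.sto_sName         AS StoryName,
        sto.sto_created_dWhen AS StoryCreatedDate,
        fv.sfv_tValue         AS FieldValue,
        -- sto_lKey is added as a tie-breaker so paging is stable.
        ROW_NUMBER() OVER (ORDER BY sto.sto_created_dWhen DESC,
                                    sto.sto_lKey) AS Row
    FROM tStory30 sto
    JOIN tContainer30 con ON sto.sto_con_lKey = con.con_lKey
    JOIN tIssue30 iss     ON con.con_iss_lKey = iss.iss_lKey
    OUTER APPLY (
        -- At most one matching value from revision 0 of this story.
        SELECT TOP (1) sfv.sfv_tValue
        FROM tStoryRevision30 sre
        JOIN tStoryField30 sf       ON sf.sfe_sre_lKey  = sre.sre_lKey
        JOIN tStoryFieldValue30 sfv ON sfv.sfv_sfe_lKey = sf.sfe_lKey
        WHERE sre.sre_sto_lKey = sto.sto_lKey
          AND sre.sre_lIndex = 0
          AND sfv.sfv_tValue LIKE N'%' + @search + N'%'
        -- DATALENGTH rather than LEN, in case the column is ntext.
        ORDER BY DATALENGTH(sfv.sfv_tValue) DESC
    ) fv
    WHERE sto.sto_sName LIKE N'%' + @search + N'%'
       OR fv.sfv_tValue IS NOT NULL
)
SELECT *
FROM matches
WHERE Row BETWEEN (@page - 1) * @pageSize + 1 AND @page * @pageSize
ORDER BY Row;
```

Because the APPLY returns at most one row per story, no GROUP BY is needed, and a story that matches only on its name comes back with a NULL FieldValue. The remaining story columns from the question can be added to the select list in the same way.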

Related Solutions

Mysql – Grouping results based on link table

You should be able to easily combine your current queries to get the group and users by using the following:

SELECT g.name group_name, u.id, u.name
FROM groups g
INNER JOIN user_groups ug
  ON g.id = ug.group_id
INNER JOIN users u
  on ug.user_id  = u.id;

This will give you multiple rows for each group if there is more than one user per group, similar to this:

| GROUP_NAME | ID | NAME |
--------------------------
|      admin |  1 | dave |
|      admin |  2 | cher |

You could also use the GROUP_CONCAT() function to return a comma separated list of the users with each group:

SELECT g.name group_name, 
  GROUP_CONCAT(u.name SEPARATOR ', ') name
FROM groups g
INNER JOIN user_groups ug
  ON g.id = ug.group_id
INNER JOIN users u
  on ug.user_id  = u.id
GROUP BY g.name;

This returns:

| GROUP_NAME |       NAME |
---------------------------
|      admin | cher, dave |

Then if you want to also include the users that have no group associated with them, you could use a UNION ALL query:

SELECT g.name group_name, u.id, u.name
FROM groups g
INNER JOIN user_groups ug
  ON g.id = ug.group_id
INNER JOIN users u
  on ug.user_id  = u.id
UNION ALL
SELECT '' group_name, u.id, u.name
FROM users u
WHERE id not in (select user_id
                 from user_groups);

This returns:

| GROUP_NAME | ID |  NAME |
---------------------------
|      admin |  1 |  dave |
|      admin |  2 |  cher |
|            |  3 |   meg |
|            |  4 | sarah |
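
As a possible alternative to the UNION ALL, the same rows can usually be produced with outer joins starting from the users table. This is only a sketch against the table names used above, assuming every row in user_groups points at an existing group. It also sidesteps the NOT IN subquery, which returns no rows at all if user_groups.user_id ever contains a NULL.

```sql
-- Sketch only. `groups` is quoted with backticks because GROUPS
-- is a reserved word in MySQL 8.0.
SELECT COALESCE(g.name, '') AS group_name, u.id, u.name
FROM users u
LEFT JOIN user_groups ug
  ON ug.user_id = u.id
LEFT JOIN `groups` g
  ON g.id = ug.group_id;
```

Users without any group come back once, with an empty group_name; add an ORDER BY if the row order matters.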

Sql-server – Index not making execution faster, and in some cases is slowing down the query. Why is it so

Even though the index is suggested by the SQL Server, why does it slow things down by a significant difference?

Index suggestions are made by the query optimizer. If it comes across a logical selection from a table which is not well served by an existing index, it may add a "missing index" suggestion to its output. These suggestions are opportunistic; they are not based on a full analysis of the query, and do not take account of wider considerations. At best, they are an indication that more helpful indexing may be possible, and a skilled DBA should take a look.

The other thing to say about missing index suggestions is that they are based on the optimizer's costing model, and the optimizer estimates by how much the suggested index might reduce the estimated cost of the query. The key words here are "model" and "estimates". The query optimizer knows little about your hardware configuration or other system configuration options - its model is largely based on fixed numbers that happen to produce reasonable plan outcomes for most people on most systems most of the time. Aside from issues with the exact cost numbers used, the results are always estimates - and estimates can be wrong.

What is the Nested Loop join which is taking most of the time and how to improve its execution time?

There is little to be done to improve the performance of the cross join operation itself; nested loops is the only physical implementation possible for a cross join. The table spool on the inner side of the join is an optimization to avoid rescanning the inner side for each outer row. Whether this is a useful performance optimization depends on various factors, but in my tests the query is better off without it. Again, this is a consequence of using a cost model - my CPU and memory system likely has different performance characteristics than yours. There is no specific query hint to avoid the table spool, but there is an undocumented trace flag (8690) that you can use to test execution performance with and without the spool. If this were a real production system problem, the plan without the spool could be forced using a plan guide based on the plan produced with TF 8690 enabled. Using undocumented trace flags in production is not advised because the installation becomes technically unsupported and trace flags can have undesirable side-effects.

Is there something that I am doing wrong or have missed?

The main thing you are missing is that although the plan using the nonclustered index has a lower estimated cost according to the optimizer's model, it has a significant execution-time problem. If you look at the distribution of rows across threads in the plan using the Clustered Index, you will likely see a reasonably good distribution:

Scan plan

In the plan using the Nonclustered Index Seek, the work ends up being performed entirely by one thread:

Seek plan

This is a consequence of the way work is distributed among threads by parallel scan/seek operations. It is not always the case that a parallel scan will distribute work better than an index seek - but it does in this case. More complex plans might include repartitioning exchanges to redistribute work across threads. This plan has no such exchanges, so once rows are assigned to a thread, all related work is performed on that same thread. If you look at the work distribution for the other operators in the execution plan, you will see that all work is performed by the same thread as shown for the index seek.

There are no query hints to affect row distribution among threads; the important thing is to be aware of the possibility and to be able to read enough detail in the execution plan to determine when it is causing a problem.

With the default index (on primary key only) why does it take less time, and with the non clustered index present, for each row in the joining table, the joined table row should be found quicker, because join is on Name column on which the index has been created. This is reflected in the query execution plan and Index Seek cost is less when IndexA is active, but why still slower? Also what is in the Nested Loop left outer join that is causing the slowdown?

It should now be clear that the nonclustered index plan is potentially more efficient, as you would expect; it is just poor distribution of work across threads at execution time that accounts for the performance issue.

For the sake of completing the example and illustrating some of the things I have mentioned, one way to get a better work distribution is to use a temporary table to drive parallel execution:

SELECT
    val1,
    val2
INTO #Temp
FROM dbo.IndexTestTable AS ITT
WHERE Name = N'Name1';

SELECT 
    N'Name1',
    SUM(T.val1),
    SUM(T.val2),
    MIN(I2.Name),
    SUM(I2.val1),
    SUM(I2.val2)
FROM   #Temp AS T
CROSS JOIN IndexTestTable I2
WHERE
    I2.Name = 'Name1'
OPTION (FORCE ORDER, QUERYTRACEON 8690);

DROP TABLE #Temp;

This results in a plan that uses the more efficient index seeks, does not feature a table spool, and distributes work across threads well:

Optimal plan

On my system, this plan executes significantly faster than the Clustered Index Scan version.

If you're interested in learning more about the internals of parallel query execution, you might like to watch my PASS Summit 2013 session recording.
