Optimizing SQL Query: FOR XML PATH or Alternatives

Tags: concat, execution-plan, performance, query-performance, sql server

Is there any way to optimize the FOR XML PATH statement? Or maybe I should use another approach?

The current approach is not acceptable at all: it takes minutes. I know the view is a really big union and takes time to consume, but maybe there is another way…

Here is the query:

SELECT t.serialNumber as TVM
  ,[issuanceDate] as transactionDate
  ,ioy.tvmTransactionId as TVM_TRANS_ID
  ,STUFF((SELECT ', ' + stv.carrierSN
    FROM HERMES.wts.v_SaleTransactionView as stv
    WHERE stv.tvmTransactionId = ioy.tvmTransactionId
    and stv.serialNumber = 'M040'
    FOR XML PATH ('')), 1, 2, '') as [serial_numbers]
  FROM [hermes].[wts].[IOYLog] ioy
  left join Hermes.hermes.Terminals t on ioy.tp_terminalId = t.tp_terminalId
  left join hermes.hermes.POS p on t.tp_POSId = p.tp_POSId
  left join [Hermes].[wts].[IOUPaymentStatus] [is] on [is].[tp_paymentStatusId] = ioy.status
  WHERE [is].includeInReports = 1 
  and (issuanceDate BETWEEN '2017/09/01' AND '2017/09/08') and (t.serialNumber = 'M040')
  ORDER BY transactionDate

Execution plan: https://www.brentozar.com/pastetheplan/?id=Sk7i98Wjb

Best Answer

Looking at this in Sentry One Plan Explorer, there are four major indexing issues, all associated with index scans hiding an expensive residual predicate.

The following four indexes will eliminate these scans and speed up this execution plan:

-- Currently scanning 7 million rows to return 11
CREATE INDEX give_me_a_name
ON [wts].[Prepaid_TransactionLog]
    ([tvmTransactionId])
INCLUDE
    (tp_terminalId, tp_POSId, status);

-- Currently scanning 21.5 million rows to return zero    
CREATE INDEX give_me_a_name
ON [wts].[KBETicketTransactionLog]
    ([tvmTransactionId])
INCLUDE
    (carrierSN, tp_terminalId, tp_POSId, status);

-- Currently scanning 6.25 million rows to return 12
CREATE INDEX give_me_a_name
ON [wts].[Intercity_TicketTransactionLog]
    ([tvmTransactionId])
INCLUDE
    (formType, status, tp_terminalId, tp_POSId);

The fourth index is a little more complicated since the key involves a conversion to integer to match [wts].[IOYLog].[tvmTransactionId]. This means either altering the existing column, or providing a computed column as follows:

-- New computed column, should be instant to add (metadata only)
ALTER TABLE [mazovian].[Transaction]
ADD [SequenceAsInteger] AS CONVERT(integer, [Sequence]);

-- Currently scanning 61.6 million rows to return 70
CREATE INDEX give_me_a_name
ON [mazovian].[Transaction]
    ([SequenceAsInteger])
INCLUDE
    (Id, Sequence, Terminal_id, Detail_id, Location_id);
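
Before adding the computed column, it may be worth checking that every existing [Sequence] value actually converts to an integer, since building an index over the computed column would fail on a value that does not. A minimal sketch, assuming SQL Server 2012 or later (for TRY_CONVERT); note that the check itself scans the whole table:

```sql
-- Count rows whose [Sequence] is present but cannot be converted to integer.
-- A non-zero result means CONVERT(integer, [Sequence]) would raise an error
-- for those rows, so they need attention before the index is created.
SELECT COUNT(*) AS non_convertible_rows
FROM [mazovian].[Transaction]
WHERE [Sequence] IS NOT NULL
  AND TRY_CONVERT(integer, [Sequence]) IS NULL;
```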

Adding those four indexes should speed up the query significantly. As with all indexing changes, you should assess them on a test system before deciding on deployment.
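
One way to assess the change on a test system is to capture logical reads and elapsed time for the query before and after creating the indexes. This is only a sketch of that comparison using the query from the question; the session settings are standard, but what counts as an acceptable result is up to you:

```sql
-- Report per-table logical reads and CPU/elapsed time in the Messages output.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- The query from the question, unchanged.
SELECT t.serialNumber AS TVM
  ,[issuanceDate] AS transactionDate
  ,ioy.tvmTransactionId AS TVM_TRANS_ID
  ,STUFF((SELECT ', ' + stv.carrierSN
    FROM HERMES.wts.v_SaleTransactionView AS stv
    WHERE stv.tvmTransactionId = ioy.tvmTransactionId
    AND stv.serialNumber = 'M040'
    FOR XML PATH ('')), 1, 2, '') AS [serial_numbers]
FROM [hermes].[wts].[IOYLog] ioy
LEFT JOIN Hermes.hermes.Terminals t ON ioy.tp_terminalId = t.tp_terminalId
LEFT JOIN hermes.hermes.POS p ON t.tp_POSId = p.tp_POSId
LEFT JOIN [Hermes].[wts].[IOUPaymentStatus] [is] ON [is].[tp_paymentStatusId] = ioy.status
WHERE [is].includeInReports = 1
  AND (issuanceDate BETWEEN '2017/09/01' AND '2017/09/08') AND (t.serialNumber = 'M040')
ORDER BY transactionDate;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Run it once before and once after adding the indexes and compare the logical reads reported for the tables behind the view.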

There are a number of other small indexing improvements you could consider to avoid the Key Lookups in the plan:

Add [status] to the existing index [IOYLog].[IX_IOYLog_tp_terminalId_tvmTransactionId_issuanceDate_owedAmount].
Add carrierSN, tp_POSId, and status to index [TransportTransactionLog].[IX_TransportTransactionLog_tp_terminalId_tvmTransaction_KK].
Add End_Status to index [Transaction_Distribution].[I_mazovian_Transaction_Distribution_Ticket_id].
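
Adding a column to an existing index means recreating it with the extra column in the INCLUDE list. The sketch below shows the general pattern for the first item only. The key column list is an assumption guessed from the index name, not taken from the plan, so look up the real definition first (the first statement does that) and adjust the second statement to match, including UNIQUE if the existing index is unique.

```sql
-- Step 1: inspect the current key and included columns of the index.
SELECT i.name AS index_name,
       c.name AS column_name,
       ic.key_ordinal,
       ic.is_included_column
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
    ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns AS c
    ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID(N'wts.IOYLog')
  AND i.name = N'IX_IOYLog_tp_terminalId_tvmTransactionId_issuanceDate_owedAmount'
ORDER BY ic.is_included_column, ic.key_ordinal;

-- Step 2: rebuild the index in place with [status] added as an included column.
-- ASSUMPTION: the key columns below are inferred from the index name only.
CREATE INDEX [IX_IOYLog_tp_terminalId_tvmTransactionId_issuanceDate_owedAmount]
ON [wts].[IOYLog]
    (tp_terminalId, tvmTransactionId, issuanceDate, owedAmount)
INCLUDE
    ([status])
WITH (DROP_EXISTING = ON);
```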

Related Solutions

SQL Server – Different Execution Plans Based on Columns Selected from CTE

The plan without the row number is assigned a cost of 44.866.

You have a TOP without ORDER BY, so SQL Server just needs to scan the clustered index and can stop as soon as it finds the first 30,000 rows matching the predicate.

The table has 13,283,300 rows. A full clustered index scan is costed at 730.467 + 14.6118 = 745.0788 but this gets scaled down to 43.9392 because of the TOP.

Applying the same scaling of 5.9% (43.9392 / 745.0788) to the number of rows in the table implies that SQL Server estimates it will only have to scan about 783,350 rows before it finds 30,000 matching the WHERE clause and can stop scanning.

NB: You say that only 474,296 rows match this predicate in the whole table but 508,747 are estimated to. That means that on average one in every 26.1 (13283300/508747) rows is assumed to match the filter. So it is estimated that 30,000 * 26.1 rows ( = 783K) will be read.
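
The arithmetic above can be reproduced directly. This is only a sketch of the calculation using the figures quoted in this answer; the variable names are made up for illustration, and float is used to avoid integer division:

```sql
DECLARE @table_rows     float = 13283300;          -- rows in the table
DECLARE @full_scan_cost float = 730.467 + 14.6118; -- full clustered index scan cost
DECLARE @top_scan_cost  float = 43.9392;           -- scan cost after scaling for TOP
DECLARE @est_matching   float = 508747;            -- rows estimated to match the predicate
DECLARE @top_n          float = 30000;             -- rows requested by TOP

SELECT
    @top_scan_cost / @full_scan_cost               AS scaling_factor,        -- roughly 0.059
    @table_rows * @top_scan_cost / @full_scan_cost AS rows_expected_to_scan, -- roughly 783K
    @table_rows / @est_matching                    AS rows_per_match,        -- roughly 26.1
    @top_n * @table_rows / @est_matching           AS rows_read_for_top;     -- roughly 783K
```

Both routes arrive at roughly the same number of rows, which is why the scan under TOP is costed so cheaply.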

When you select *, the rownum column must be calculated. The plan for this is costed at 69.1185.

You have an index on COLUMNE that can be seeked into. This satisfies the range predicate on COLUMNE >= 1472738400000 AND COLUMNE <= 1475244000000 and also supplies the required ordering for your row numbering.

However it does not cover the query and lookups are needed to return the missing columns. The plan estimates that there will be 30,000 such lookups. There may in fact be more as the predicate on COLUMNF = 1 may mean some rows are discarded after being looked up (though not in this case as you say COLUMNF always has a value of 1).

If the row numbering plan were to use a clustered index scan, it would need to be a full scan followed by a sort of all rows matching the predicate. 69.1185 is considerably cheaper than 745.0788 plus the sort cost, so the plan with lookups is chosen.

You say that the plan with lookups is in fact 5 times faster than the clustered index scan. Likely a much greater proportion of the clustered index needed to be read to find 30,000 matching rows than was assumed in the costings. You are on SQL Server 2014 SP1 CU5. On SQL Server 2014 SP2 the actual execution plan now has a new attribute Actual Rows Read which would tell you how many rows it did actually read. On previous versions you can use OPTION (QUERYTRACEON 9130) to see the same information.
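
As a rough illustration of the trace flag approach, the hint goes at the end of the statement. The table name dbo.YourTable is hypothetical (the real name is not given here), and the predicates are the ones quoted above. Trace flag 9130 is undocumented and QUERYTRACEON normally requires sysadmin membership, so treat this as something for a test system:

```sql
-- dbo.YourTable is a placeholder; substitute the real table.
-- With trace flag 9130 the residual predicate appears as a separate Filter
-- operator in the actual plan, so the row count leaving the scan shows how
-- many rows were really read before 30,000 matches were found.
SELECT TOP (30000) *
FROM dbo.YourTable
WHERE COLUMNE >= 1472738400000
  AND COLUMNE <= 1475244000000
  AND COLUMNF = 1
OPTION (QUERYTRACEON 9130);
```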

SQL Server – Why Are Row Estimates Always Too Low?

(Summarizing my comments and posting them as an answer.)

A query rewrite will solve the issue of getting low row estimates, as Joe Chang explains in his blog post "Query Optimizer Gone Wild - Full-Text".

CONTAINS is "a predicate used in a WHERE clause" per the Microsoft documentation, while CONTAINSTABLE acts as a table.

You get a much better plan (merge join) using CONTAINSTABLE, whereas the actual plan using CONTAINS has a nested loops join with low row estimates.

You can rewrite the query as:

SELECT TOP 30 p.PersonId,
              p.PersonParentId,
              p.PersonName,
              p.PersonPostCode
FROM dbo.People p
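-- CONTAINSTABLE takes the table as its first argument and returns [KEY] and [RANK];
-- this assumes ContactFullText is a column of dbo.People and PersonId is its full-text key.
left join containstable (dbo.People, ContactFullText, '"mr" AND "ch*"') cf on cf.[KEY] = p.PersonId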
WHERE p.PersonDeletionDate IS NULL
      AND p.PersonCustomerId = 24
      --AND CONTAINS(ContactFullText, '"mr" AND "ch*"')
      AND p.PersonGroupId IN(197, 206, 186, 198)
      AND [RANK] > 0
ORDER BY p.PersonParentId,
         p.PersonName;
