SQL Server – Service Broker – Conversation Lifetime

service-broker, sql-server, sql-server-2016

We are trying to get Service Broker working in our environment to solve a business case. I don't know if the title is a good one, but my question is below. It may not be the right question to ask, so after it I describe what we are doing and why I think it is.

How many messages should be sent on a conversation before ending the conversation?

We want to use Service Broker to asynchronously update a result table. The result table is flattened and fast. We have triggers on the base tables that send a message containing the table name and primary key. We have three queues (a setup sketch follows the list):

  • Low Latency – objective is 15 seconds to process. It handles changes that relate to a specific item.
  • Bulk Queue – objective is 5 minutes to process. It handles changes that affect many hundreds (or thousands) of items. It breaks out the list of affected items and feeds them to the Deferred Low Latency queue.
  • Deferred Low Latency – objective is 30 minutes to process. It processes items, but only those fed in from the Bulk queue.
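
For illustration, here is a minimal sketch of what the broker objects behind those three queues could look like. The queue, service, message type, and contract names are all assumptions made up for this example, not our actual objects.

    -- Hypothetical setup sketch: one queue and service per latency tier, plus an
    -- initiator service for the triggers to send from. All names are assumptions.
    CREATE MESSAGE TYPE [//Reporting/ItemChanged] VALIDATION = WELL_FORMED_XML;
    CREATE CONTRACT [//Reporting/ItemChangedContract]
        ([//Reporting/ItemChanged] SENT BY INITIATOR);

    CREATE QUEUE dbo.InitiatorQueue;
    CREATE QUEUE dbo.LowLatencyQueue;
    CREATE QUEUE dbo.BulkQueue;
    CREATE QUEUE dbo.DeferredLowLatencyQueue;

    CREATE SERVICE [//Reporting/Initiator] ON QUEUE dbo.InitiatorQueue;
    CREATE SERVICE [//Reporting/LowLatency]
        ON QUEUE dbo.LowLatencyQueue ([//Reporting/ItemChangedContract]);
    CREATE SERVICE [//Reporting/Bulk]
        ON QUEUE dbo.BulkQueue ([//Reporting/ItemChangedContract]);
    CREATE SERVICE [//Reporting/DeferredLowLatency]
        ON QUEUE dbo.DeferredLowLatencyQueue ([//Reporting/ItemChangedContract]);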

Basically, if a client's information updates, that affects many products, so that gets sent to the bulk queue for slower processing. However, if a product gets updated, that gets sent to the low latency queue.

We reuse conversations similarly to Remus Rusanu's blog post http://rusanu.com/2007/04/25/reusing-conversations/, except that we pick the conversation based on the modulus of the primary key. This has the side benefit of aiding in de-duplication of primary keys.
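
Roughly, the reuse looks like the sketch below, assuming a hypothetical dbo.ConversationPool table keyed on the modulus bucket. The table, procedure, service, and contract names are illustrative only, not our actual code.

    -- Illustrative sketch of conversation reuse keyed on @PrimaryKey % @Buckets.
    -- dbo.ConversationPool and the service/contract names are hypothetical, and
    -- real code would need to handle the race between concurrent callers.
    CREATE PROCEDURE dbo.GetConversationHandle
        @PrimaryKey bigint,
        @Buckets    int,
        @Handle     uniqueidentifier OUTPUT
    AS
    BEGIN
        SET NOCOUNT ON;
        DECLARE @Bucket int = @PrimaryKey % @Buckets;

        SELECT @Handle = ConversationHandle
        FROM dbo.ConversationPool
        WHERE Bucket = @Bucket;

        IF @Handle IS NULL
        BEGIN
            BEGIN DIALOG CONVERSATION @Handle
                FROM SERVICE [//Reporting/Initiator]
                TO SERVICE   '//Reporting/LowLatency'  -- target service hard-coded for brevity
                ON CONTRACT  [//Reporting/ItemChangedContract]
                WITH ENCRYPTION = OFF;

            INSERT INTO dbo.ConversationPool (Bucket, ConversationHandle)
            VALUES (@Bucket, @Handle);
        END
    END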

So, we are re-using conversations and are within our guidelines. With two threads, I was able to burn through 125 messages/second (artificial drop of several thousand messages), which is more than capable of keeping up with production (est. 15 messages/sec).

However, the problem we are experiencing is that after a period of time, ~4 hours or 120K messages, we started seeing blocks and high contention on sysdesend and the queue table. The locks are LCK_M_U and are KEY locks. Sometimes the hobt resolves to sysdesend and other times to the specific queue table (queue_).
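
For reference, resolving the blocked KEY locks back to an object can be done with a query along these lines. This is a sketch of the approach, not our exact script; for KEY locks, resource_associated_entity_id is the hobt_id.

    -- Resolve waiting KEY locks to the object that owns the hobt.
    SELECT tl.request_session_id,
           tl.resource_type,
           tl.request_mode,             -- U for the LCK_M_U waits
           OBJECT_NAME(p.object_id) AS locked_object
    FROM sys.dm_tran_locks AS tl
    JOIN sys.partitions    AS p
      ON p.hobt_id = tl.resource_associated_entity_id
    WHERE tl.resource_type  = 'KEY'
      AND tl.request_status = 'WAIT';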

We have a process in place that ends conversations after either 24 hours or 30 minutes of inactivity; we could simply increase the time before cycling over to new conversations.

We are using SQL Server 2016 Enterprise (13.0.4001.0). The overall flow is:

  1. Trigger fires (sends to either the low latency or bulk queue).
  2. Look up or create a conversation handle.
  3. Send the message.
  4. Queue activation procedure picks up the message (a receive-loop sketch follows this list).
  5. Update the results table.
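
Steps 4 and 5 look roughly like the sketch below. The queue name, message type, and the dbo.UpsertResultRow procedure are placeholders for our actual objects.

    -- Illustrative receive loop for the activated procedure.
    CREATE PROCEDURE dbo.LowLatencyActivation
    AS
    BEGIN
        SET NOCOUNT ON;
        DECLARE @Handle  uniqueidentifier,
                @MsgType sysname,
                @Body    xml;

        WHILE 1 = 1
        BEGIN
            BEGIN TRANSACTION;

            WAITFOR (
                RECEIVE TOP (1)
                    @Handle  = conversation_handle,
                    @MsgType = message_type_name,
                    @Body    = CAST(message_body AS xml)
                FROM dbo.LowLatencyQueue
            ), TIMEOUT 5000;

            IF @@ROWCOUNT = 0
            BEGIN
                ROLLBACK TRANSACTION;
                BREAK;
            END

            IF @MsgType = N'//Reporting/ItemChanged'
                EXEC dbo.UpsertResultRow @Body;   -- step 5: update the results table
            ELSE IF @MsgType IN (N'http://schemas.microsoft.com/SQL/ServiceBroker/EndDialog',
                                 N'http://schemas.microsoft.com/SQL/ServiceBroker/Error')
                END CONVERSATION @Handle;

            COMMIT TRANSACTION;
        END
    END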

A cleanup process runs every 10 minutes to check for idle conversations. If it finds a conversation idle more than three times in a row, it marks it as inactive and ends it.
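
The cleanup is conceptually similar to this sketch, again assuming the hypothetical dbo.ConversationPool table, here with IdleChecks and LastSendTime columns; it is not our actual procedure.

    -- Cleanup sketch, intended to run every 10 minutes.
    CREATE PROCEDURE dbo.CleanupIdleConversations
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Conversations not used since the last run accumulate an idle strike;
        -- conversations that were used again are reset.
        UPDATE dbo.ConversationPool
        SET IdleChecks = CASE
                             WHEN LastSendTime < DATEADD(MINUTE, -10, SYSUTCDATETIME())
                             THEN IdleChecks + 1
                             ELSE 0
                         END;

        -- Three strikes in a row: end the conversation and drop it from the pool.
        DECLARE @Handle uniqueidentifier;
        WHILE 1 = 1
        BEGIN
            SELECT TOP (1) @Handle = ConversationHandle
            FROM dbo.ConversationPool
            WHERE IdleChecks >= 3;

            IF @@ROWCOUNT = 0
                BREAK;

            END CONVERSATION @Handle;
            DELETE FROM dbo.ConversationPool
            WHERE ConversationHandle = @Handle;
        END
    END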

Please let me know if there are any additional details that may be beneficial. I don't have much experience with Service Broker, so I don't know if our messages/sec is low, high or indifferent.

UPDATE

We tried again today and encountered the same problem. We changed the conversation lifetime to 2 hours, which had no effect, and then implemented the 150 trick, which ran into the same issue.

Tons of waits on SEND CONVERSATION, waiting on sysdesend. Does anyone have any further ideas?

UPDATE 2

We ran the test for longer today, and during one 17-minute sample period, we processed 41K messages on 4 conversation handles. We were able to keep up, except towards the end, when the locks on sysdesend and the queue table became too much and we started drifting behind before stopping it. We seem to have no problem processing messages; when nothing new is entering the queue, we can pull messages off and process them at least 5x that fast. Our throughput appears to be limited by adding messages, not by receiving them.

On a later test, we removed the trigger that accounted for 80% of the messages. Even with this much-reduced load, we started seeing the same waits.

UPDATE 3

Thank you, Remus, for your advice (and for posting such excellent blog articles on the subject; they were instrumental in getting to this point).

We ran it again today and did better (as in we went longer before seeing the waits and even longer before it crippled us). So, the details.

We changed:

  • Increased the number of maintained conversations per thread from 1:1 to 2:1. Basically, we had 8 conversation handles for 4 threads.

  • Consolidated the bulk queue's output (because one incoming message could mean hundreds of outgoing messages) into fewer, larger messages.

Notes about this attempt:

  • Disabling the target queue activation procedure made no change in blocking (we waited 5 minutes), and the messages did get sent to sys.transmission_queue.

  • Monitoring sys.conversation_endpoints (with a query like the one after these notes): the count went from 0 to 13K very quickly, then rose more slowly throughout the day, ending up around 25K after ~5 hours. Blocking didn't start occurring until it reached roughly 16K.

  • I went into the DAC and ran the DBREINDEX commands for the queues, although from a query, ghost records never got above 200 or so before the cleanup came along and dropped the count to 0.

  • sysdesend and sysdercv had identical counts of 24,932 when I ended the test.

  • we processed ~310K messages in 5 hours.
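
The monitoring mentioned in those notes was nothing more sophisticated than ad hoc checks shaped roughly like this:

    -- Conversation endpoint counts by state.
    SELECT state_desc, COUNT(*) AS conversation_count
    FROM sys.conversation_endpoints
    GROUP BY state_desc
    ORDER BY conversation_count DESC;

    -- The sysdesend/sysdercv counts are only visible from the DAC:
    -- SELECT COUNT(*) FROM sys.sysdesend;
    -- SELECT COUNT(*) FROM sys.sysdercv;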

We went so far before things fell apart that I really thought we would make it this time. Tomorrow we will try forcing the messages to go through the wire.

Best Answer

I know it's bad form to answer your own question, but I wanted to close this out for anyone who was interested. We finally did manage to resolve the issue, or at least resolve it enough to meet our requirements. I want to thank everyone who contributed comments, especially Remus Rusanu and Kin, as they were very helpful.

Our database is quite busy and runs in RCSI mode. We have many (thousands of) mobile devices that update their location information every 45 seconds. Those updates touch multiple tables (poor design; I would have limited the volatile information to a single table and then joined it in for the results). These tables are the same ones we were attempting to asynchronously generate reporting information for, rather than having the end users go directly against the base tables.

We initially had the triggers running a cursor over the modified records in every update/insert statement (which should have been one row in most cases) and sending each primary key in its own message to Service Broker. Inside Service Broker, especially in the bulk queue, there were further cursors that executed the report upsert procedure (one execution per primary key).

What finally got us working:

  • We removed the cursors and settled on sending larger messages. There is still one message per user transaction per table, but we now send messages containing more than one primary key (a trigger sketch follows this list).

  • The bulk processor also sends multiple keys per message, which reduced the number of SEND CONVERSATION calls taking place as it shuffled messages to the other queue as appropriate.

  • The most volatile table (our mobile device data table) had its triggers removed. We updated the upsert procedure to include the appropriate foreign keys, and now we just join back to that table when fetching results for the users. This table easily contributed 80% of the messages we had to process in a day.
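
The reworked triggers are shaped roughly like the sketch below. The table, column, and procedure names are placeholders, and the bucket choice for multi-key messages is simplified for the sketch.

    -- Rough shape of the reworked triggers: one message per statement, carrying
    -- every affected primary key as XML. Names are hypothetical.
    CREATE TRIGGER trg_Product_Changed ON dbo.Product
    AFTER INSERT, UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;

        DECLARE @Body xml =
            (SELECT i.ProductId AS [@id]
             FROM inserted AS i
             FOR XML PATH('key'), ROOT('keys'), TYPE);

        IF @Body IS NULL
            RETURN;   -- statement affected no rows

        DECLARE @Handle uniqueidentifier,
                @AnyKey bigint = (SELECT MIN(ProductId) FROM inserted);

        EXEC dbo.GetConversationHandle
             @PrimaryKey = @AnyKey,   -- bucket choice is illustrative only
             @Buckets    = 8,
             @Handle     = @Handle OUTPUT;

        SEND ON CONVERSATION @Handle
            MESSAGE TYPE [//Reporting/ItemChanged] (@Body);
    END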

We process ~1M messages a day (without the mobile table), and the vast majority (99%+) of our messages are processed within our objective. We still have the occasional outlier, but given how rare they are, that is deemed acceptable.

Contributing factors:

  • I found a bug in the conversation cleanup procedure mentioned earlier: it was not actually cleaning up conversations appropriately and was prematurely ending them. As a result, our sysdesend count now never rises above a few thousand (most of which comes from using the 150 trick).

  • The cursors in the triggers appeared to hold more locks than anticipated (even with STATIC, FORWARD_ONLY). Removing them seems to have made the locks we do see on SEND CONVERSATION more transient (or at least the wait times we see are much lower).

  • We were essentially running two solutions side by side: the Service Broker backend (being tested under production load) and the current solution (a terrible query that spans many tables).

As a side benefit, this uncovered a ghost record cleanup problem. While it is not on the Service Broker tables (system or queue), it is quite rampant in our system, and the symptoms line up really well with the "no clear cause" issues we experience at times. Investigation is ongoing; we are trying to find the tables that are contributing to it, and we will probably just rebuild their indexes routinely.
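
For anyone chasing similar symptoms, the ghost record counts we watch come from a query shaped like this (a sketch, not our exact monitoring script; the scan mode and threshold are arbitrary):

    -- Ghost record counts per index; SAMPLED mode populates the ghost columns.
    SELECT OBJECT_SCHEMA_NAME(ips.object_id) AS schema_name,
           OBJECT_NAME(ips.object_id)        AS table_name,
           ips.index_id,
           ips.ghost_record_count,
           ips.version_ghost_record_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ips
    WHERE ips.ghost_record_count > 1000
    ORDER BY ips.ghost_record_count DESC;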

Thank you once again.

UPDATE 2021

I got another upvote on this, so I wanted to come back and revisit. We abandoned Service Broker shortly after I asked this question, as our transaction workload was still far too high/fast for it to work effectively. We eventually moved to the style of queue described at the link below for our high-throughput message system, and it works great.

http://kejser.org/implementing-message-queues-in-relational-databases/
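
For reference, the general shape of the table-as-queue pattern described there is a dequeue built on READPAST and OUTPUT. This is my own rough paraphrase, not code from the article; the table and column names are made up, and it ignores the partitioning and ordering details the article covers properly.

    -- Minimal table-as-queue sketch.
    CREATE TABLE dbo.WorkQueue
    (
        QueueId    bigint IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
        Payload    varbinary(max) NOT NULL,
        EnqueuedAt datetime2(3) NOT NULL DEFAULT SYSUTCDATETIME()
    );
    GO

    -- Dequeue a batch: READPAST lets concurrent readers skip rows another session
    -- already has locked, so they do not block each other.
    DELETE TOP (100)
    FROM dbo.WorkQueue WITH (ROWLOCK, READPAST)
    OUTPUT deleted.QueueId, deleted.Payload;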