Service Broker – Certificate Not Found Error

service-broker

My organization has been successfully using service broker, dialog security, for a few years.

I have a new instance of service broker that I would like to configure, but I believe I am running into a situation where the order in which the new broker instance is being incorrectly placed into the databases (with existing broker instances in place) causing the new broker instance to fail.

My error is on the target server; Certificate Not Found. Security context could not be retrieved.

My situation is:
Source server A (dmz) sending to target server B, where there are already existing broker instances sending messages both ways but on different databases.

Test 1: I switched the role of the source and target server around, making server B my source and server A (dmz) my target while using the same databases. This was successful.

Test 2: I dropped the test 1 instance and created 2 new databases on each server. I have again, reconfigured the source and target roles, now sending from server A (dmz) to server B and this was successful. There is only the one service broker instance configured in the new databases, so the order of placement is not an issue.

I have come into this situation in the past when we first set up our dev/test environments; However, the situation was slightly different as many other broker instances were also failing. Not all of them, but some. Once I realized what was happening, dropping all instances and recreating them in a specific order, source instances before target instances, the situation was resolved.

I would think that service broker would add a new instance in the correct place of order, automatically.

Am I correct in thinking that the broker instances must be configured within a specific order? If so, how can I add new broker instances without interrupting what is already there?

Please share your thoughts and suggestions.

Best Answer

Set up a profiler trace and monitor for Security Audit/Audit Broker Conversation. Repro the issue, an event of this class should be produced. Look at the event details. It should shed light. I think two of the event columns (can't remeber which, and MSDN seems to omit the info) contain the problem certificate issuer name/serial number, which should further help narrow down which certificate is being used.

[Edit] Run this query query on all three databases to find out which certificates are used by which service and remote service binding:

select s.name, c.* 
from sys.services s
join sys.certificates c
on s.principal_id = c.principal_id
where c.is_active_for_begin_dialog = 1
and pvt_key_encryption_type != 'NA'
order by c.expiry_date desc


select b.remote_service_name, b.remote_service_binding_id, c.* 
from sys.remote_service_bindings b
join sys.certificates c on b.remote_principal_id = c.principal_id
where c.is_active_for_begin_dialog = 1
order by c.expiry_date desc

[Edit after your data]

1) In PRd2DB services tab: all services in the DB are owned by dbo, and dbo owns 4 certificates that are valid for SSB (certificate ids 280, 288, 327 and 357). The xls did not correctly capture the expiry_date (is all 00:00.0) so I cannot know for sure which one is used, but if the results come from the query I posted above then the first one will be used, certificate id 280 with the serial number 54 89 2c 1a.... So this is the certificate used to identify any service from this db.

2) In Prd2 in the RSB tab, the query shows a mixture of certificates with private key and ones without a private key. This should never happen. An user associated with an RSB represents the identity of a remote service and should own only certificates with no private key. Also, there is only one RSB user (the one with principal_id 58) for all remote services. There should be a different identity user for each remote service. Right now any dialog started in PrdDB2, to any service, will use the certificate with serial number 34 99 ....

3) DMZDB1 services: same as prd2, all services owned by dbo, dbo has 3certificates (ids 335, 341, 371) and the one used will be the one with serial number 74 0a ...

4) DMZDB RSBs: there are two RSB users, one principal id 9 and one 30. The principal id 9 has a mixture of certificates, including some with private key, while principal id 30 has only one certificate but that certificate has a private key.

Overall, the configuration is deeply flawed and and no wonder it doesn't work. You do not understand the role and the proper configuration of a remote service binding.

Please follow the following rules. Consider a remote service 'Foo':

  • create a different user (principal) for each remote service, call it remote_user_Foo
  • create a RSB for each remote service, using the principal for this service (ie. remote_user_Foo)
  • The remote_user_Foo should own exactly one certificate, and it should not have a private key
  • Grant necessary permissions (SEND) to remote_user_Foo

Next time you run the query for showing the RSB it should show a number of rows equal to the number of remote service bindings, the principal_id values should be all distinct and the pvt_key_encryption_type should all be NA.