Running Process Monitor causes application to work

process-monitor, sockets, sysinternals, tcp, wireshark

This is a long shot, but perhaps someone with knowledge of the internal workings of Sysinternals' Process Monitor may have an idea.

Recently we've had a very murky problem at work. We have a piece of software (call it SW1) which opens a socket connection on a particular port to another piece of software (call it SW2) and receives some data from it. SW1 then opens another socket connection to a second process of its own and sends it some data, after which the cycle restarts and it receives more data from SW2.
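For what it's worth, here is roughly what I understand the data flow to be. This is just my sketch of the relay cycle described above; I don't have the source of either application, and the addresses, ports and blocking-socket style are made up for illustration:

```python
import socket

# Hypothetical addresses; the real hosts and port numbers are unknown to me.
SW2_ADDR = ("127.0.0.1", 9001)       # where SW2 supposedly listens
INTERNAL_ADDR = ("127.0.0.1", 9002)  # SW1's second, internal process

def relay_once():
    # Receive a chunk of data from SW2 ...
    with socket.create_connection(SW2_ADDR, timeout=30) as sw2:
        data = sw2.recv(4096)

    # ... then open a second connection to SW1's own internal process
    # and forward the data, after which the cycle restarts.
    with socket.create_connection(INTERNAL_ADDR, timeout=30) as internal:
        internal.sendall(data)

while True:
    relay_once()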

This is a very vague description, and I have nothing to do with either of these applications; however, as the owner of the workstations I've been heavily involved in support. The whole system worked without a hitch on one particular workstation, but refused to work on four other identical workstations. The symptom was a sudden halt of packets being sent between SW1's two processes, naturally followed by a timeout on SW2's side.

Now, for the wacky bit: after weeks of debugging with the relevant teams and running Wireshark, I decided to run Process Monitor to see whether something would show up. Weirdly enough, the socket connections remained established and the whole thing worked! Thinking it was a coincidence, we tried running Process Monitor on the other three workstations and they all started working. Also, it looks like the applications keep working even after rebooting everything.

Of course the question remains: what impact could Process Monitor possibly have on these applications? Due to the nature of the "solution", I can't really analyse a Procmon capture, since running it seems to be what solves the issue…

Thanks!

Best Answer

It sounds like a race condition or deadlock.

That is: SW1 and SW2 must have a communication protocol with requests and acknowledgements. If this protocol is not well designed, there can be a race condition in which packets are not handled in the correct order. SW1 gets stuck waiting for a packet from SW2, but SW2 has already sent that packet (and SW1 missed it) and is not going to send it again, leaving SW1 in a locked state.

If this is the case, the failure depends on the execution speed of SW1 and SW2, and furthermore on the load of the machines. Say both processes are executing slowly: then it is less likely that SW1 misses the packet from SW2 that creates the locked state. Running Process Monitor slows the whole system down slightly, which might be enough to make this work.
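As a rough illustration of what I mean (a "lost wakeup" sketch with made-up delays, not your actual protocol), the packet is "sent" exactly once; whether the receiver gets it depends entirely on whether it was already waiting at that moment:

```python
import threading
import time

cond = threading.Condition()

def sw2(send_delay):
    time.sleep(send_delay)           # how long SW2 takes before sending
    with cond:
        cond.notify()                # the "packet" is sent once and never resent

def sw1(startup_delay):
    time.sleep(startup_delay)        # SW1's own startup work before it listens
    with cond:
        got_it = cond.wait(timeout=2.0)  # BUG: waits blindly, no "already received?" check
    print("SW1 received the packet" if got_it else "SW1 timed out (missed it)")

def run(send_delay, startup_delay):
    t1 = threading.Thread(target=sw1, args=(startup_delay,))
    t2 = threading.Thread(target=sw2, args=(send_delay,))
    t1.start(); t2.start()
    t1.join(); t2.join()

# Fast SW2, slow-starting SW1: the notify fires before SW1 is waiting -> hang/timeout.
run(send_delay=0.0, startup_delay=0.5)
# Slow everything down a little (as Process Monitor's overhead might): now it works.
run(send_delay=0.5, startup_delay=0.1)
```

The first run hangs until the timeout because the notification arrived before SW1 started waiting; the second run succeeds only because the extra delay changed the ordering. That is exactly the kind of bug that "goes away" when you attach a monitoring tool.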

As for the different workstations: if the first one carries more load than the others, then there you have it; that would explain why it works there.
