Logging/profiling an unpredictably slow stored procedure

oracleplsql

I have a stored procedure that sometimes runs significantly slower than others. I have two different timers measuring the execution time at different scopes. One is in the client code, measuring only the OCI execute call. The second is inside the PL/SQL procedure itself, timing the inside of the procedure's main block.

Normally, if the procedure's timer takes, say, 5 seconds to run, the client's timer is only a few milliseconds greater. This is completely reasonable. The problem is that sometimes the client's timer will be much greater than the procedure's, even double.

I've eliminated any issue on the client or in the transport layer. I enabled client and server logging in sqlnet.ora at support level. On the client, the gap between request and response packets is essentially equal to the client code's timer. On the server, a similar gap is seen between the incoming request and the outgoing response.

The question remains, if the overall call takes 10 seconds, and inside the procedure takes only 5 seconds, where did the other 5 seconds go?

I don't expect the exact answer to the above question, but I would like to hear ideas on how to go about finding the answer. What logs can I turn on to give me insight into what Oracle is doing both after the request arrives but before the actual call to the procedure, and after the procedure until the response is sent back to the client?

Edit:
Perhaps there's something wrong with my timing methodology? I'm using dbms_utility.get_cpu_time() (which provides time in hundredths of a second, hence the division by 100.)

PROCEDURE foo (elapsed_time OUT NUMBER) IS
  start_time NUMBER;
BEGIN
  start_time := dbms_utility.get_cpu_time();
  -- ... PL/SQL stuffs
  elapsed_time := (dbms_utility.get_cpu_time() - start_time) / 100;
END;

Best Answer

You can set tracing at the ADMIN level. Your timing is suspiciously similar to an issue I dealt with in April/May this year. The admin level trace file showed a 5 second gap when reading from a socket. (10:44:05 -> 10:44:11) NTTRD is the read from tcp socket call. I'd definitely work this issue with Oracle support, but if you see gaps like the one below then the pause is somewhere in your network stack or hardware (i.e outside Oracle). In our case Oracle support pointed us towards network switch logs etc, but the culprit turned out to be an issue in the Microsoft Windows TCP stack down to the new Receive Window Auto-Tuning feature (see http://support.microsoft.com/kb/983528 for example if on Server 2008). Obviously it would be pure fluke if your issue was the same as ours, but if you can see the gaps in the admin level trace file it's a fair bet that it will turn out to be a network issue.

[08-APR-2011 10:44:05:938] nsprecv: reading from transport...
[08-APR-2011 10:44:05:938] nttrd: entry
[08-APR-2011 10:44:11:095] nttrd: socket 2156 had bytes read=60
[08-APR-2011 10:44:11:095] nttrd: exit
[08-APR-2011 10:44:11:095] nsprecv: 60 bytes from transport
[08-APR-2011 10:44:11:095] nsprecv: tlen=60, plen=60, type=6