If you are distributing the computations with MPI, then using an MPI-aware tool would give you more sensible results: with a distributed application, you might have issues of load imbalance, where one MPI process is idle waiting for data to come from other processes. If you happen to be profiling exactly that MPI process, your performance profile will be all wrong.
So, the first step is usually to find out about the communication and load-balance pattern of your program, and to identify a sample input that gives you the workload you want (e.g., CPU-intensive on rank 0). For instance,
mpiP is an MPI profiling tool that can produce a very complete report about the communication pattern, how much time each MPI call took, etc.
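For example, if your mpiP build supports it, you can preload it without relinking. This is only a sketch: the install path is an assumption, and -x is Open MPI's flag for exporting an environment variable (other launchers forward environment differently):

# Preload mpiP so it can intercept the MPI calls; no relinking needed.
# /opt/mpiP/lib/libmpiP.so is an assumed install path -- adjust it.
mpirun -np 8 -x LD_PRELOAD=/opt/mpiP/lib/libmpiP.so ./my_app
# mpiP then drops a text report (a file ending in .mpiP) with per-call
# timing and the communication pattern into the working directory.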
Then you can run a code-profiling tool on one or more selected MPI ranks. In any case, blindly using perf
on a single MPI rank is likely not a good idea, because its measurements will also include time spent inside the MPI library code, which is probably not what you are looking for.
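If you do decide to profile a selected rank with perf, one common trick is a tiny wrapper script around your application. This is just a sketch, and the rank environment variable depends on your MPI implementation (OMPI_COMM_WORLD_RANK for Open MPI, PMI_RANK for MPICH-derived MPIs, SLURM_PROCID under Slurm):

#!/bin/sh
# profile-rank0.sh (hypothetical name): run perf on rank 0 only,
# and let every other rank run unprofiled.
RANK="${OMPI_COMM_WORLD_RANK:-${PMI_RANK:-$SLURM_PROCID}}"
if [ "$RANK" = "0" ]; then
    exec perf record -g -o perf.rank0.data -- "$@"
else
    exec "$@"
fi

You would then launch with something like mpirun -np 8 ./profile-rank0.sh ./my_app.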
I know this question is pretty old (Feb '16), but here is a response in case it helps someone else.
The problem is that you've entered '-F 999', indicating that you want to sample the events at a frequency of 999 times a second. For 'trace' events, you generally don't want to do sampling: when I select sched:sched_switch, for instance, I want to see every context switch.
If you enter -F 999 then you will get only a sampling of the context switches...
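The exact command from the question isn't shown, but the problematic form would look something like this reconstruction:

# Sampled tracing: perf takes ~999 samples/second per CPU, so most of
# the sched_switch events never make it into the output file.
perf record -a -g -F 999 -e sched:sched_switch -- sleep 5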
If you look at the output of your 'perf record' cmd with something like:
perf script --verbose -I --header -i perf.dat -F comm,pid,tid,cpu,time,period,event,trace,ip,sym,dso > perf.txt
then you would see that the 'period' (the number between the timestamp and the event name) would not (usually) be == 1.
If you use a 'perf record' cmd without '-F 999' (like the one further below), you'll see a period of 1 in the 'perf script' output, like:
Binder:695_5 695/2077 [000] 16231.700440: 1 sched:sched_switch: prev_comm=Binder:695_5 prev_pid=2077 prev_prio=120 prev_state=S ==> next_comm=kworker/u16:17 next_pid=7665 next_prio=120
A long-winded explanation, but basically: don't do that (where 'that' is '-F 999').
If you just do something like:
perf record -a -g -e sched:sched_switch -e sched:sched_blocked_reason -e sched:sched_stat_sleep -e sched:sched_stat_wait sleep 5
then the output would show every context switch with the call stack for each event.
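Since perf record writes to ./perf.data by default, you can then browse the events and their call chains with:

# dump every recorded event plus its call stack
perf script | less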
And you might need to do:
echo 1 > /proc/sys/kernel/sched_schedstats
to get the sched_stat events.
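Equivalently (and a bit easier to put in a script), you can flip the same knob via sysctl:

# same effect as the echo above; read it back with
# 'sysctl kernel.sched_schedstats' to check the current value
sysctl -w kernel.sched_schedstats=1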
This is an old question, but this is now possible with
--call-graph dwarf
(see the perf-record man page). I believe this requires a somewhat recent Linux kernel (>= 3.9? I'm not entirely sure). You can check whether your distro's perf package is linked with libdw or libunwind with
readelf -d $(which perf) | grep -e libdw -e libunwind
On Fedora 20, perf is linked with libdw.
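A typical invocation would then look like this (the binary name is illustrative, and your program should be built with debug info, e.g. -g, so there is DWARF data to unwind with):

# DWARF-based call graphs copy chunks of the stack into perf.data,
# so expect a noticeably bigger file than with frame pointers.
perf record --call-graph dwarf ./my_app
perf report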