Linux – How to analyze profile data from `perf record -a` (system-wide collection)

linux, perf-event, profiling

I am using perf from linux-2.6.36-gentoo-r4. /proc/sys/kernel/perf_event_paranoid is set to 0, so there should be no permission problems coming from there.
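For reference, the setting can be checked and lowered like this (writing it requires root, and the meaning of the individual values differs somewhat between kernel versions):

$ cat /proc/sys/kernel/perf_event_paranoid
0
$ sudo sysctl -w kernel.perf_event_paranoid=0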

Because the long-running application I am profiling sometimes crashes for an undetermined reason (I could not find any information as to why it stops working), I turned to system-wide profiling with perf events.

The application in question does parallelized numerical calculations, using MPI (Message Passing Interface) for communication. Before running the application (with mpirun) I started recording system-wide profile data on one of the nodes it runs on:

$ perf record -o perf.all.cycles,graph.data -g -e cycles -a &

After I realized that the application had crashed, I killed the perf task.
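For completeness, the rest of the workflow was roughly the following; my_app and the rank count are placeholders, and sending SIGINT rather than SIGKILL lets a backgrounded perf record flush and finalize its output file:

$ mpirun -np 16 ./my_app
$ kill -INT %1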

It had left

$ du -sh perf.all.cycles,graph.data 
14G     perf.all.cycles,graph.data

14 GB of data. Unfortunately, perf report doesn't support the -a switch.

How can I analyze system-wide profiling data from perf tool?


Added 2011.08.12

Simply running perf report doesn't produce useful output:

$ perf report -i perf.all.cycles,graph.data
#
# (For a higher level overview, try: perf report --sort comm,dso)
#

That is the whole of the output from 14 GB of profile data…
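For what it is worth, one thing worth trying on system-wide data is the higher-level overview that perf report itself suggests, combined with -i (both flags already appear above, so they should be available in this perf version):

$ perf report -i perf.all.cycles,graph.data --sort comm,dso

That aggregates samples per command and per shared object, which at least shows which processes the 14 GB of samples actually belong to.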

Best Answer

If you are distributing the computations with MPI, then using an MPI-aware tool would give you more sensible results: with a distributed application, you might have issues of load imbalance, where one MPI process is idle waiting for data to come from other processes. If you happen to be profiling exactly that MPI process, your performance profile will be all wrong.

So, the first step is usually to find out about the communication and load balance pattern of your program, and to identify a sample input that gives you the workload you want (e.g., CPU-intensive on rank 0). For instance, mpiP is an MPI profiling tool that can produce a very complete report about the communication pattern, how much time each MPI call took, and so on.
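As a rough sketch only (the library path is a placeholder, preloading only works for dynamically linked binaries built against a matching MPI, and -x is Open MPI's way of exporting an environment variable to the ranks), using mpiP without relinking might look like:

$ mpirun -np 16 -x LD_PRELOAD=/path/to/libmpiP.so ./my_app

mpiP then writes its text report when the program reaches MPI_Finalize.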

Then you can run a code profiling tool on one or more selected MPI ranks. Anyway, using perf on a single MPI rank is likely not a good idea, because its measurements will also contain the performance of the MPI library code, which is probably not what you are looking for.
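If you do end up running perf on selected ranks, a common approach is a small wrapper script passed to mpirun. This is just a sketch; the rank environment variable depends on the MPI implementation (OMPI_COMM_WORLD_RANK for Open MPI, PMI_RANK for MPICH-style launchers), and my_app is a placeholder:

$ cat perf_rank0.sh
#!/bin/sh
# Record with perf only on rank 0; every other rank runs the application unchanged.
RANK=${OMPI_COMM_WORLD_RANK:-${PMI_RANK:-0}}
if [ "$RANK" -eq 0 ]; then
    exec perf record -g -e cycles -o perf.rank0.data "$@"
else
    exec "$@"
fi
$ mpirun -np 16 ./perf_rank0.sh ./my_app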
