Linux Tracing – How to Trace DMA

linuxmonitoringtracing

I am working on software that communicates with a PCI card through direct memory access (DMA) transactions. My programs use a suit of drivers and a library that handles the DMA. Everything runs on a Red Hat Linux.

To test and measure the performance of my programs I would like to trace the start and end of the DMA transactions. Now I do this by looking at a couple of functions in the library:

dma_from_host and dma_to_host that initiate the transactions by configuring the values in the registers of the card and writing 1 to a register called DMA_DESC_ENABLE
dma_wait that waits until the transaction has finished by continuously checking the value of the DMA_DESC_ENABLE register.

But I would like to have a more robust confirmation that a transaction has started and a more precise signal when the transaction has ended. Something from Linux or hardware itself would be the best.

I understand that in principle it is a cumbersome situation. The idea of DMA is that the hardware (the PCI card or the DMA controller on the motherboard) copies things directly into the memory of the process, bypassing the CPU and the OS. But I hope that it does not just copy things into RAM without notifying the CPU somehow. Are there some standard ways to trace these transactions or it is very platform-specific?

Are there some special interrupts that notify the CPU about the start and end of the DMA? I could not spot anything like that in the drivers that I use. But I am not experienced with drivers, so I could have easily looked at wrong places.

Another idea, are there any PMU-like hardware monitors that could provide this information? Something that just counts transactions on PCI lanes?

Also an idea, do I understand right that one could write a custom DMA-tracer as a Linux module or a BPF program that would continuously check the value of that DMA_DESC_ENABLE register? Is this a viable approach? Are there known tracers like that?

Best Answer

Encouraged by the comment from @dirkt, I looked better at the drivers and found the PCI MSI interrupts that correspond to these DMA transactions.

The driver enables these interrupts with a call

pci_enable_msix(.., msixTable,..)

that sets up the struct msix_entry msixTable[MAXMSIX]. Then it asings them to the handler static irqreturn_t irqHandler() by calling request_irq() in a loop:

request_irq(msixTable[interrupt].vector, irqHandler, 0, devName,...)

The handler just counts the interrupts in a local int array. These counters are exported in the /proc/<devName> file that this driver creates for diagnostics etc. In fact, the proc file is from where I started the search for the interrupts.

But there is a better way: the /proc/interrupts file. The enabled MSI-X interrupts show up there in lines like these:

$ cat /proc/interrupts 
            CPU0       CPU1  ...  CPU5       CPU6       CPU7       
  66:          0          0  ...     0          0          0  IR-PCI-MSI-edge      <devName>
  67:          0          0  ...     0          0          0  IR-PCI-MSI-edge      <devName>
  68:         33          0  ...     0          0          0  IR-PCI-MSI-edge      <devName>
  69:          0          0  ...     0          0          0  IR-PCI-MSI-edge      <devName>
  70:          0          0  ...     0          0          0  IR-PCI-MSI-edge      <devName>
  71:          0          0  ...     0          0          0  IR-PCI-MSI-edge      <devName>
  72:          0          0  ...     0          0          0  IR-PCI-MSI-edge      <devName>
  73:          0          0  ...     0          0          0  IR-PCI-MSI-edge      <devName>

And one more way is to find the PCI address of the card in the lspci output and to check the interrupts assigned to the card in the /sys directory:

$ ls  /sys/bus/pci/devices/0000:17:00.0/msi_irqs
66  67  68  69  70  71  72  73

# but these are empty
$ cat  /sys/bus/pci/devices/0000:17:00.0/irq
0

The interrupt number 68 fires up by the end of the transactions. The interrupt handlers have a static tracepoint irq:irq_handler_entry in Linux. The tracepoint parameters in /sys/kernel/debug/tracing/events/irq/irq_handler_entry/format have the interrupt number in the int irq field. Hence, this interrupt can be traced with the standard Linux facilities by this tracepoint with a filter condition:

# setup the ftrace
trace-cmd start -e irq:irq_handler_entry -f "irq == 68"
# for live stream
cat /sys/kernel/debug/tracing/trace_pipe
# or just
trace-cmd stop
trace-cmd show
trace-cmd reset

# with perf
perf record -e "irq:irq_handler_entry" --filter "irq == 68"

With this, one thing that is still worth confirming is that these interrupts are essential to the DMA, to be sure that I monitor something relevant to the system instead of just a handy counter for the proc file that might not be implemented in another situation. But I could not spot that any other relevant interrupts by watching at how they increment in the /proc/interrupts. There are interrupts for the devices dmar[0123] that seem like something about DMA, but they have never incremented. And that is to be expected, as in this case the DMA engine must be implemented as an FPGA core in the PCI card itself.

Terminology, things the question involves

From Intel:

processor is a core device (it's 1 device/process) and a bunch of uncore devices, core is what runs the program (clock, ALU, registers etc), uncore are devices put on die, close to the processor for speed and low latency (the real reason is "because the manufacturer can do it"); as I understood it is basically the Northbridge, like on PC motherboard, plus caches; and AMD actually calls these devices NorthBridgeinstead ofuncore`;
ubox which shows up in my sysfs
```
$ find /sys/devices/ -type d -name events 
/sys/devices/cpu/events
/sys/devices/uncore_cbox_0/events
/sys/devices/uncore_cbox_1/events
```
-- is an uncore device, which manages Last Level Cache (LLC, the last one before hitting RAM); I have 2 cores, thus 2 LLC and 2 ubox;
Processor Monitoring Unit (PMU) is a separate device which monitors operations of a processor and records them in Processor Monitoring Counter (PMC) (counts cache misses, processor cycles etc); they exist on core and uncore devices; the core ones are accessed with rdpmc (read PMC) instruction; the uncore, since these devices depend on actual processor at hand, are accessed via Model Specific Registers (MSR) via rdmsr (naturally);

apparently, the workflow with them is done via pairs of registers -- 1 register sets which events the counter counts, 2 register is the value in the counter; the counter can be configured to increment after a bunch of events, not just 1; + there are some interupts/tech noticing overflows in these counters;
more one can find in Intel's "IA-32 Software Developer's Manual Vol 3B" chapter 18 "PERFORMANCE MONITORING";

also, the MSR's format concretely for these uncore PMCs for version "Architectural Performance Monitoring Version 1" (there are versions 1-4 in the manual, I don't know which one is my processor) is described in "Figure 18-1. Layout of IA32_PERFEVTSELx MSRs" (page 18-3 in mine), and section "18.2.1.2 Pre-defined Architectural Performance Events" with "Table 18-1. UMask and Event Select Encodings for Pre-Defined Architectural Performance Events", which shows the events which show up as Hardware event in perf list.

From linux kernel:

kernel has a system (abstraction/layer) for managing performance counters of different origin, both software (kernel's) and hardware, it is described in linux-source-3.13.0/tools/perf/design.txt; an event in this system is defined as struct perf_event_attr (file linux-source-3.13.0/include/uapi/linux/perf_event.h), the main part of which is probably __u64 config field -- it can hold both a CPU-specific event definition (the 64bit word in the format described on those Intel's figures) or a kernel's event

The MSB of the config word signifies if the rest contains [raw CPU's or kernel's event]

the kernel's event defined with 7 bits for type and 56 for event's identifier, which are enum-s in the code, which in my case are:
```
$ ak PERF_TYPE linux-source-3.13.0/include/
...
linux-source-3.13.0/include/uapi/linux/perf_event.h
29: PERF_TYPE_HARDWARE      = 0,
30: PERF_TYPE_SOFTWARE      = 1,
31: PERF_TYPE_TRACEPOINT    = 2,
32: PERF_TYPE_HW_CACHE      = 3,
33: PERF_TYPE_RAW           = 4,
34: PERF_TYPE_BREAKPOINT    = 5,
36: PERF_TYPE_MAX,         /* non-ABI */
```
(ak is my alias to ack-grep, which is the name for ack on Debian; and ack is awesome);

in the source code of kernel one can see operations like "register all PMUs dicovered on the system" and structure types struct pmu, which are passed to something like int perf_pmu_register(struct pmu *pmu, const char *name, int type) -- thus, one could just call this system "kernel's PMU", which would be an aggregation of all PMUs on the system; but this name could be interpreted as monitoring system of kernel's operations, which would be misleading;

let's call this subsystem perf_events for clarity;
as any kernel subsystem, this subsystem can be exported into sysfs (which is made to export kernel subsystems for people to use); and that's what are those events directories in my /sys/ -- the exported (parts of?) perf_events subsystem;
also, the user-space utility perf (built into linux) is still a separate program and has its' own abstractions; it represents an event requested for monitoring by user as perf_evsel (files linux-source-3.13.0/tools/perf/util/evsel.{h,c}) -- this structure has a field struct perf_event_attr attr;, but also a field like struct cpu_map *cpus; that's how perf utility assigns an event to all or particular CPUs.

Answer

Indeed, Hardware cache event are "shortcuts" to the events of the cache devices (ubox of Intel's uncore devices), which are processor-specific, and can be accessed via the protocol Raw hardware event descriptor. And Hardware event are more stable within architecture, which, as I understand, name the events from the core device. There no other "shortcuts" in my kernel 3.13 to some other uncore events and counters. All the rest -- Software and Tracepoints -- are kernel's events.

I wonder if the core's Hardware events are accessed via the same Raw hardware event descriptor protocol. They might not -- since the counter/PMU sits on core, maybe it is accessed differently. For instance, with that rdpmu instruction, instead of rdmsr, which accesses uncore. But it is not that important.
Kernel PMU event are just the events, which are exported into sysfs. I don't know how this is done (automatically by kernel all discovered PMCs on the system, or just something hard-coded, and if I add a kprobe -- is it exported? etc). But the main point is that these are the same events as Hardware event or any other in the internal perf_event system.

And I don't know what those
```
$ ls /sys/devices/uncore_cbox_0/events
clockticks
```
are.

Details on `Kernel PMU event`

Searching through the code leads to:

$ ak "Kernel PMU" linux-source-3.13.0/tools/perf/
linux-source-3.13.0/tools/perf/util/pmu.c                                                            
629:                printf("  %-50s [Kernel PMU event]\n", aliases[j]);

-- which happens in the function

void print_pmu_events(const char *event_glob, bool name_only) {
   ...
        while ((pmu = perf_pmu__scan(pmu)) != NULL)
                list_for_each_entry(alias, &pmu->aliases, list) {...}
   ... 
   /* b.t.w. list_for_each_entry is an iterator
    * apparently, it takes a block of {code} and runs over some lost
    * Ruby built in kernel!
    */
    // then there is a loop over these aliases and
    loop{ ... printf("  %-50s [Kernel PMU event]\n", aliases[j]); ... }
}

and perf_pmu__scan is in the same file:

struct perf_pmu *perf_pmu__scan(struct perf_pmu *pmu) {
    ...
                pmu_read_sysfs(); // that's what it calls
}

-- which is also in the same file:

/* Add all pmus in sysfs to pmu list: */
static void pmu_read_sysfs(void) {...}

That's it.

Details on `Hardware event` and `Hardware cache event`

Apparently, the Hardware event come from what Intel calls "Pre-defined Architectural Performance Events", 18.2.1.2 in IA-32 Software Developer's Manual Vol 3B. And "18.1 PERFORMANCE MONITORING OVERVIEW" of the manual describes them as:

The second class of performance monitoring capabilities is referred to as architectural performance monitoring. This class supports the same counting and Interrupt-based event sampling usages, with a smaller set of available events. The visible behavior of architectural performance events is consistent across processor implementations. Availability of architectural performance monitoring capabilities is enumerated using the CPUID.0AH. These events are discussed in Section 18.2.

-- the other type is:

Starting with Intel Core Solo and Intel Core Duo processors, there are two classes of performance monitoring capa-bilities. The first class supports events for monitoring performance using counting or interrupt-based event sampling usage. These events are non-architectural and vary from one processor model to another...

And these events are indeed just links to underlying "raw" hardware events, which can be accessed via perf utility as Raw hardware event descriptor.

To check this one looks at linux-source-3.13.0/arch/x86/kernel/cpu/perf_event_intel.c:

/*
 * Intel PerfMon, used on Core and later.
 */
static u64 intel_perfmon_event_map[PERF_COUNT_HW_MAX] __read_mostly =
{
    [PERF_COUNT_HW_CPU_CYCLES]              = 0x003c,
    [PERF_COUNT_HW_INSTRUCTIONS]            = 0x00c0,
    [PERF_COUNT_HW_CACHE_REFERENCES]        = 0x4f2e,
    [PERF_COUNT_HW_CACHE_MISSES]            = 0x412e,
    ...
}

-- and exactly 0x412e is found in "Table 18-1. UMask and Event Select Encodings for Pre-Defined Architectural Performance Events" for "LLC Misses":

Bit Position CPUID.AH.EBX | Event Name | UMask | Event Select
...
                        4 | LLC Misses | 41H   | 2EH

-- H is for hex. All 7 are in the structure, plus [PERF_COUNT_HW_REF_CPU_CYCLES] = 0x0300, /* pseudo-encoding *. (The naming is a bit different, addresses are the same.)

Then the Hardware cache events are in structures like (in the same file):

static __initconst const u64 snb_hw_cache_extra_regs
                            [PERF_COUNT_HW_CACHE_MAX]
                            [PERF_COUNT_HW_CACHE_OP_MAX]
                            [PERF_COUNT_HW_CACHE_RESULT_MAX] =
{...}

-- which should be for sandy bridge?

One of these -- snb_hw_cache_extra_regs[LL][OP_WRITE][RESULT_ACCESS] is filled with SNB_DMND_WRITE|SNB_L3_ACCESS, where from the def-s above:

#define SNB_L3_ACCESS           SNB_RESP_ANY
#define SNB_RESP_ANY            (1ULL << 16)                                                                            
#define SNB_DMND_WRITE          (SNB_DMND_RFO|SNB_LLC_RFO)
#define SNB_DMND_RFO            (1ULL << 1)
#define SNB_LLC_RFO             (1ULL << 8)

which should equal to 0x00010102, but I don't know how to check it with some table.

And this gives an idea how it is used in perf_events:

$ ak hw_cache_extra_regs linux-source-3.13.0/arch/x86/kernel/cpu/
linux-source-3.13.0/arch/x86/kernel/cpu/perf_event.c
50:u64 __read_mostly hw_cache_extra_regs
292:    attr->config1 = hw_cache_extra_regs[cache_type][cache_op][cache_result];

linux-source-3.13.0/arch/x86/kernel/cpu/perf_event.h
521:extern u64 __read_mostly hw_cache_extra_regs

linux-source-3.13.0/arch/x86/kernel/cpu/perf_event_intel.c
272:static __initconst const u64 snb_hw_cache_extra_regs
567:static __initconst const u64 nehalem_hw_cache_extra_regs
915:static __initconst const u64 slm_hw_cache_extra_regs
2364:       memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs,
2365:              sizeof(hw_cache_extra_regs));
2407:       memcpy(hw_cache_extra_regs, slm_hw_cache_extra_regs,
2408:              sizeof(hw_cache_extra_regs));
2424:       memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs,
2425:              sizeof(hw_cache_extra_regs));
2452:       memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs,
2453:              sizeof(hw_cache_extra_regs));
2483:       memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs,
2484:              sizeof(hw_cache_extra_regs));
2516:       memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
$

The memcpys are done in __init int intel_pmu_init(void) {... case:...}.

Only attr->config1 is a bit odd. But it is there, in perf_event_attr (same linux-source-3.13.0/include/uapi/linux/perf_event.h file):

...
    union {
            __u64           bp_addr;
            __u64           config1; /* extension of config */                                                      
    };
    union {
            __u64           bp_len;
            __u64           config2; /* extension of config1 */
    };
...

They are registered in kernel's perf_events system with calls to int perf_pmu_register(struct pmu *pmu, const char *name, int type) (defined in linux-source-3.13.0/kernel/events/core.c: ):

static int __init init_hw_perf_events(void) (file arch/x86/kernel/cpu/perf_event.c) with call perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW);
static int __init uncore_pmu_register(struct intel_uncore_pmu *pmu) (file arch/x86/kernel/cpu/perf_event_intel_uncore.c, there are also arch/x86/kernel/cpu/perf_event_amd_uncore.c) with call ret = perf_pmu_register(&pmu->pmu, pmu->name, -1);

So finally, all events come from hardware and everything is ok. But here one could notice: why do we have LLC-loads in perf list and not ubox1 LLC-loads, since these are HW events and they actualy come from uboxes?

That's a thing of the perf utility and its' perf_evsel structure: when you request a HW event from perf you define the event which processors you want it from (default is all), and it sets up the perf_evsel with the requested event and processors, then at aggregation is sums the counters from all processors in perf_evsel (or does some other statistics with them).

One can see it in tools/perf/builtin-stat.c:

/*
 * Read out the results of a single counter:
 * aggregate counts across CPUs in system-wide mode
 */
static int read_counter_aggr(struct perf_evsel *counter)
{
    struct perf_stat *ps = counter->priv;
    u64 *count = counter->counts->aggr.values;
    int i;

    if (__perf_evsel__read(counter, perf_evsel__nr_cpus(counter),
                           thread_map__nr(evsel_list->threads), scale) < 0)
            return -1;

    for (i = 0; i < 3; i++)
            update_stats(&ps->res_stats[i], count[i]);

    if (verbose) {
            fprintf(output, "%s: %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",
                    perf_evsel__name(counter), count[0], count[1], count[2]);
    }

    /*
     * Save the full runtime - to allow normalization during printout:
     */
    update_shadow_stats(counter, count);

    return 0;
}

(So, for the utility perf a "single counter" is not even a perf_event_attr, which is a general form, fitting both SW and HW events, it is an event of your query -- the same events may come from different devices and they are aggregated.)

Also a notice: struct perf_evsel contains only 1 struct perf_evevent_attr, but it also has a field struct perf_evsel *leader; -- it is nested. There is a feature of "(hierarchical) groups of events" in perf_events, when you can dispatch a bunch of counters together, so that they can be compared to each other and so on. Not sure how it works with independent events from kernel, core, ubox. But this nesting of perf_evsel is it. And, most likely, that's how perf manages a query of several events together.

Best Answer

Related Solutions

Linux – Tool to trace library calls in Linux/ARM

Linux – What are Kernel PMU event-s in perf_events list

Terminology, things the question involves

Answer

Details on Kernel PMU event

Details on Hardware event and Hardware cache event

Related Question

Details on `Kernel PMU event`

Details on `Hardware event` and `Hardware cache event`