Accessing Memory Mapped I/O is slow

arm io virtual-memory

I have a Terasic SoCKit board (FPGA plus ARM Cortex-A9) with Linux running on the HPS. To access the memory-mapped I/O, I wrote a simple character driver that uses request_mem_region() and ioremap().

The memory-mapped I/O sits on an AXI bus, which I use to transmit data to the FPGA. Each write takes almost 6 us, and my application needs it to be under 1 us. Also, the driver stops writing to the mapped I/O after a few writes (I don't see the data changing in the FPGA; is a buffer in the driver filling up?).

My question: am I missing something, or can it not get any faster because the writes go from a virtual address to the physical address? If writing through the virtual address is what slows things down, is there a way to speed it up? I know that the ARM has a DMAC, but I haven't explored it yet.

Thank you,
Karthik

Sorry, I forgot to mention that I was measuring the time in the user-space code. Later I measured the write inside the driver and it took only nanoseconds, so I concluded that most of the time was spent getting from user space into the kernel.

After some further reading I understood that ioremap() maps the physical address to a kernel virtual address, while remap_pfn_range() maps the physical address (I/O memory) into the user virtual address space. The latter is what I need: writing to I/O memory from user space. I used the simple mmap example at http://web.cecs.pdx.edu/~jrb/ui/linux/examples.dir/simple/simple.c as the kernel driver. The following is my user-space code:

    #include <iostream>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <ctime>

    using namespace std;


    #define PAGE_SIZE 4096
    #define HPS2FPGA_BRIDGE_BASE 0xc0000000
    #define BLINK_OFFSET 0x0

    volatile unsigned char *blink_mem;
    void *bridge_map;

    int main()
    {
        int fd, ret = EXIT_FAILURE;
        unsigned int i;
        unsigned char value;
        int dummy;
        off_t blink_base = HPS2FPGA_BRIDGE_BASE;
        clock_t start, stop;
        double duration;

        /* open the memory device file */
        fd = open("/dev/HPS2FPGA", O_RDWR|O_SYNC);
        if (fd < 0) {
            perror("open");
            exit(EXIT_FAILURE);
        }

        /* map the HPS2FPGA bridge into process memory */
        bridge_map = mmap(NULL, PAGE_SIZE, PROT_WRITE|PROT_READ|PROT_EXEC, MAP_SHARED,
                    fd, blink_base);
        if (bridge_map == MAP_FAILED) {
            perror("mmap");
            goto cleanup;
        }


        /* get the peripheral's base address (cast first: no arithmetic on void *) */
        blink_mem = (volatile unsigned char *) bridge_map + BLINK_OFFSET;

        start = clock();
        /* write the value */
        for(i = 0; i < 1000000; i++)
        {
            *blink_mem = i;
            dummy = *blink_mem;
        }
        stop = clock();
        duration = ( stop - start ) / (double) CLOCKS_PER_SEC;
        printf("%f", duration);

        if (munmap(bridge_map, PAGE_SIZE) < 0) {
            perror("munmap");
            goto cleanup;
        }

        ret = 0;

    cleanup:
        close(fd);
        return ret;
    }

I'm writing to the virtual address space mmap returns and I can verify the write by reading the value at that address but I don't see the value getting updated in the FPGA.

How does the physical address get written when I write to the user virtual space? Is there a way to debug and see if the physical address space is actually being written?

Best Answer

OK, it looks like the subject of this question is a furphy ... memory mapped I/O (done correctly) will be as fast as the processor can do it for the hardware being accessed and there will be no overhead for doing this from user mode as opposed to kernel mode (i.e. there is no "write from the userspace to kernel").
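One way to check the per-write cost from user space is to time a burst of writes directly. The following is only a minimal sketch, assuming a POSIX system that provides clock_gettime(); the helper names elapsed_ns and avg_write_ns are hypothetical. It uses CLOCK_MONOTONIC, which tracks elapsed time, whereas the clock() call in the question reports processor time.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* assumption: needed for clock_gettime() in strict modes */
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: nanoseconds elapsed from a to b. */
static int64_t elapsed_ns(struct timespec a, struct timespec b)
{
    return (int64_t)(b.tv_sec - a.tv_sec) * 1000000000LL
           + (int64_t)(b.tv_nsec - a.tv_nsec);
}

/* Hypothetical helper: average nanoseconds per 8-bit write to *reg.
 * count must be nonzero. reg is volatile so every write is really issued. */
static double avg_write_ns(volatile unsigned char *reg, unsigned int count)
{
    struct timespec start, stop;
    unsigned int i;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < count; i++)
        *reg = (unsigned char)i;
    clock_gettime(CLOCK_MONOTONIC, &stop);

    return (double)elapsed_ns(start, stop) / (double)count;
}
```

In the question's program this would be called as avg_write_ns(blink_mem, 1000000) after the mmap() succeeds. Note that a store may be posted by the bus, so the figure is the cost seen by the CPU, not necessarily the time until the FPGA sees the data.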

However, you still have to think about what is happening when you do a read from or a write to an address (this is where the question moved on to). In most architectures, there are two mappings - the virtual to the physical, and the physical to the device. The first is set up in the virtual memory hardware and the second is set up in the memory controller.
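The virtual-to-physical mapping works at page granularity, which is why mmap() expects a page-aligned offset. As a rough illustration, assuming 4 KiB pages as in the question's PAGE_SIZE, a physical register address can be split into the page-aligned base to map and the offset to add to the pointer that mmap() returns. The names mmio_target and split_phys_addr are hypothetical.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, matching PAGE_SIZE in the question's code. */
#define MMIO_PAGE_SIZE 4096u

/* Hypothetical helper type: what to map, and where to look inside the mapping. */
struct mmio_target {
    uint32_t page_base;   /* page-aligned physical address to map */
    uint32_t page_offset; /* byte offset of the register within that page */
};

static struct mmio_target split_phys_addr(uint32_t phys)
{
    struct mmio_target t;

    t.page_base = phys & ~(MMIO_PAGE_SIZE - 1u);  /* clear the low 12 bits */
    t.page_offset = phys & (MMIO_PAGE_SIZE - 1u); /* keep only the low 12 bits */
    return t;
}
```

For a register at 0xC0000010, for example, this gives a page base of 0xC0000000 and an in-page offset of 0x10.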

In addition to the mappings, all accesses usually go via cache hardware, so you have to decide whether you want the accesses to be cached or not. If the underlying device being accessed is RAM of some sort, then you usually want accesses to be cached. For other types of devices, generally you don't.

There may be a lot of other things to think about (e.g. whether the VM mapping is resident in the VM hardware, the width and timing of accesses, priority, permissions, etc) but cache is the first.

In @Karthik's case, because he had not turned off cache in his mapping, depending on the type of cache, either an entire cache line was being written when he wrote to the address (write-through), or the write was being delayed (write-back) (if you want some nitty gritty about cache, try this).

To answer the specific (follow up) questions, once the virtual address mapping is done and the cache has done its job, the access goes to the memory controller - this hardware decides which bus and/or device is being accessed and does the "right thing" for that hardware, usually involving asserting a chip select and/or a write enable signal, maybe copying a part or all of the physical address onto address lines, maybe some setup timing, etc.

... and the best way to debug this stuff is to have an analyser of some sort connected to your device or bus, or if this is too difficult/expensive, there might be some support for debugging in the memory controller.

One other minor but important point: take note of the declaration of blink_mem in the code above. The volatile type qualifier is very important; it tells the compiler not to optimize away or combine accesses through that pointer. In addition to this, you should be aware of any special pipeline instructions to do with memory accesses (check out the eieio instruction on PowerPC; someone has a sense of humour :-).
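A common way to keep the volatile qualifier from being dropped by accident is to route every register access through small accessor functions. This is only a sketch: the names reg_write32, reg_read32 and reg_write32_ordered are hypothetical, the 32-bit register width is an assumption (the question uses 8-bit accesses), and __sync_synchronize() is a GCC/Clang builtin that issues a full memory barrier, standing in here for whatever barrier the platform actually requires.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessor: exactly one 32-bit store per call.
 * offset is in bytes and must be 4-byte aligned. */
static inline void reg_write32(volatile void *base, size_t offset, uint32_t value)
{
    *(volatile uint32_t *)((volatile uint8_t *)base + offset) = value;
}

/* Hypothetical accessor: exactly one 32-bit load per call. */
static inline uint32_t reg_read32(const volatile void *base, size_t offset)
{
    return *(const volatile uint32_t *)((const volatile uint8_t *)base + offset);
}

/* Store, then issue a full barrier so later memory accesses are not
 * reordered ahead of it (GCC/Clang builtin; an assumption about the toolchain). */
static inline void reg_write32_ordered(volatile void *base, size_t offset, uint32_t value)
{
    reg_write32(base, offset, value);
    __sync_synchronize();
}
```

With the question's mapping, reg_write32(bridge_map, BLINK_OFFSET, value) would replace the direct pointer dereference, provided the FPGA register really is 32 bits wide.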

Finally, just to re-iterate what was said in the comments, which turned out to be the real answer to the question, when calling remap_pfn_range() you turn off cache by modifying the page protections specified in the last argument (prot) using the pgprot_noncached() macro. Also read this and this and particularly this. Cheers!
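For reference, an mmap handler along those lines might look roughly like the fragment below. It is a sketch, not the asker's actual driver: it assumes a character driver shaped like the simple.c example linked in the question, the function name hps2fpga_mmap is hypothetical, and it builds only as part of a kernel module rather than as a standalone program.

```c
#include <linux/fs.h>
#include <linux/mm.h>

/* Hypothetical mmap handler for the /dev/HPS2FPGA character device. */
static int hps2fpga_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    /* Turn caching off for this mapping before it is installed. */
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

    /* vm_pgoff is the page frame number derived from the offset that
     * user space passed to mmap(). */
    if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                        size, vma->vm_page_prot))
        return -EAGAIN;

    return 0;
}
```

The handler would be registered through the .mmap member of the driver's struct file_operations.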