Accessing Memory Mapped I/O is slow

arm, io, virtual-memory

I have a Terasic SoCKit (FPGA plus ARM Cortex-A9) with Linux running on the HPS. I'm trying to access memory-mapped I/O, and I wrote a simple character driver that uses request_mem_region() and ioremap().

The memory-mapped I/O sits on an AXI bus, through which I can send data to the FPGA. Each write takes almost 6 us, and my application needs it to be less than 1 us. Also, the driver stops writing to the mapped I/O after a few writes (I don't see the data changing in the FPGA; is a buffer in the driver filling up?).

Am I missing something, or can it not get any faster because the writes go from a virtual address to a physical address? If writing through a virtual address is what slows things down, is there a way to speed it up? I know the ARM has a DMAC, but I haven't explored it yet.

Thank you,
Karthik

Sorry, I forgot to mention that I was measuring the time in the user-space code. Later I measured how long the write took inside the driver, and it was in the nanosecond range. So I concluded that most of the time was spent getting from user space into the kernel.

I did some further reading and understood that ioremap() maps a physical address to a kernel virtual address, while remap_pfn_range() maps a physical address (I/O memory) into the user virtual address space. The latter is what I need: writing to I/O memory from user space. I used the simple mmap example at http://web.cecs.pdx.edu/~jrb/ui/linux/examples.dir/simple/simple.c as the kernel driver. The following is my user-space code:

    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <ctime>

    #define PAGE_SIZE 4096
    #define HPS2FPGA_BRIDGE_BASE 0xc0000000
    #define BLINK_OFFSET 0x0

    /* volatile: every access in the loop below must really be performed */
    volatile unsigned char *blink_mem;
    void *bridge_map;

    int main()
    {
        int fd, ret = EXIT_FAILURE;
        unsigned int i;
        unsigned char dummy;
        off_t blink_base = HPS2FPGA_BRIDGE_BASE;
        clock_t start, stop;
        double duration;

        /* open the device file exported by the mmap driver */
        fd = open("/dev/HPS2FPGA", O_RDWR|O_SYNC);
        if (fd < 0) {
            perror("open");
            exit(EXIT_FAILURE);
        }

        /* map one page of the HPS2FPGA bridge into process memory */
        bridge_map = mmap(NULL, PAGE_SIZE, PROT_WRITE|PROT_READ, MAP_SHARED,
                    fd, blink_base);
        if (bridge_map == MAP_FAILED) {
            perror("mmap");
            goto cleanup;
        }

        /* get the peripheral's base address (cast first: no void* arithmetic) */
        blink_mem = (volatile unsigned char *) bridge_map + BLINK_OFFSET;

        start = clock();
        /* write a value, then read it back */
        for (i = 0; i < 1000000; i++)
        {
            *blink_mem = (unsigned char) i;
            dummy = *blink_mem;
        }
        stop = clock();
        (void) dummy;   /* only read to force the access */

        /* clock() reports CPU time used by the process, in seconds here */
        duration = ( stop - start ) / (double) CLOCKS_PER_SEC;
        printf("%f\n", duration);

        if (munmap(bridge_map, PAGE_SIZE) < 0) {
            perror("munmap");
            goto cleanup;
        }

        ret = EXIT_SUCCESS;

    cleanup:
        close(fd);
        return ret;
    }

I'm writing to the virtual address that mmap returns, and I can verify the write by reading the value back at that address, but I don't see the value getting updated in the FPGA.

How does the physical address get written when I write to the user virtual space? Is there a way to debug and see if the physical address space is actually being written?

Best Answer

OK, it looks like the subject of this question is a furphy ... memory mapped I/O (done correctly) will be as fast as the processor can do it for the hardware being accessed and there will be no overhead for doing this from user mode as opposed to kernel mode (i.e. there is no "write from the userspace to kernel").
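One way to check the per-write cost without any system call in the measured path is to time a loop of stores through a volatile pointer. The sketch below is only an illustration: `average_write_ns` is a hypothetical helper name, and it assumes a POSIX system that provides `clock_gettime()` with `CLOCK_MONOTONIC`. For real MMIO the pointer would be the address returned by mmap(); here it can be any writable byte.

```cpp
#include <cassert>
#include <time.h>

// Hypothetical helper: average wall-clock cost of one store to *reg,
// in nanoseconds. For real MMIO, 'reg' would be the mmap()ed address.
double average_write_ns(volatile unsigned char *reg, unsigned int count)
{
    assert(reg != nullptr && count > 0);

    struct timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (unsigned int i = 0; i < count; i++) {
        // volatile forces one real store per iteration
        *reg = static_cast<unsigned char>(i);
    }
    clock_gettime(CLOCK_MONOTONIC, &stop);

    const double elapsed_ns =
        (stop.tv_sec - start.tv_sec) * 1e9 +
        (stop.tv_nsec - start.tv_nsec);
    return elapsed_ns / count;
}
```

Calling it with a large count (say one million) averages out the cost of the two clock reads. The number it reports depends heavily on whether the mapping is cached, which is where the rest of this answer goes.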

However, you still have to think about what is happening when you do a read from or a write to an address (this is where the question moved on to). In most architectures, there are two mappings - the virtual to the physical, and the physical to the device. The first is set up in the virtual memory hardware and the second is set up in the memory controller.
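The virtual-to-physical mapping works on whole pages, so a register address is usually split into a page-aligned base (what gets mapped) and an offset inside that page (what gets added to the mapped pointer). A minimal sketch of that arithmetic follows; `PageSplit` and `split_address` are names made up for this illustration, and the page size is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper types for illustration only.
struct PageSplit {
    std::uint64_t page_base;  // page-aligned address, suitable as a mapping offset
    std::uint64_t in_page;    // byte offset to add to the mapped pointer
};

PageSplit split_address(std::uint64_t phys_addr, std::uint64_t page_size)
{
    // The mask trick only works for power-of-two page sizes.
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    const std::uint64_t mask = page_size - 1;
    return PageSplit{ phys_addr & ~mask, phys_addr & mask };
}
```

For example, with 4096-byte pages a register at 0xC0000010 splits into the page base 0xC0000000 and the in-page offset 0x10.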

In addition to the mappings, all accesses usually go via cache hardware, so you have to decide whether you want the accesses to be cached or not. If the underlying device being accessed is RAM of some sort, then you usually want accesses to be cached. For other types of devices, generally you don't.

There may be a lot of other things to think about (e.g. whether the VM mapping is resident in the VM hardware, the width and timing of accesses, priority, permissions, etc) but cache is the first.

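In @Karthik's case, because he had not turned off caching in his mapping, what happened depended on the type of cache. With a write-through cache each write does go out on the bus, but reads can be satisfied from the cache and a miss can pull in an entire cache line. With a write-back cache the write stays in the cache until the line is evicted or flushed, so the device may not see it for a long time, if at all. That matches the symptom of reading the value back correctly while the FPGA never changes.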
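The write-back case can be pictured with a toy model of one cache line sitting in front of a device register. This is purely illustrative and not how any real cache is programmed; `WriteBackLine` and its members are invented names.

```cpp
#include <cassert>
#include <cstdint>

// Toy model: one write-back cache line in front of a device register.
struct WriteBackLine {
    std::uint8_t device = 0;  // what the FPGA would see
    std::uint8_t cached = 0;  // what the CPU sees
    bool dirty = false;

    // A CPU write lands in the cache only.
    void write(std::uint8_t v) { cached = v; dirty = true; }

    // A CPU read hits the cache; the device is not consulted.
    std::uint8_t read() const { return cached; }

    // Eviction or an explicit clean finally pushes the data out.
    void flush()
    {
        if (dirty) {
            device = cached;
            dirty = false;
        }
    }
};
```

After `write(42)`, `read()` returns 42 while `device` is still 0; only `flush()` makes the device see 42. An uncached mapping removes the cache from the path so that every store reaches the device directly.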

To answer the specific (follow up) questions, once the virtual address mapping is done and the cache has done its job, the access goes to the memory controller - this hardware decides which bus and/or device is being accessed and does the "right thing" for that hardware, usually involving asserting a chip select and/or a write enable signal, maybe copying a part or all of the physical address onto address lines, maybe some setup timing, etc.

... and the best way to debug this stuff is to have an analyser of some sort connected to your device or bus, or if this is too difficult/expensive, there might be some support for debugging in the memory controller.

One other minor but important point ... take note of the declaration of blink_mem in the code above - the volatile type qualifier is very important. It tells the compiler not to muck with accesses to the address. In addition to this, you should be aware of any special pipeline instructions to do with memory accesses (check out the eieio instruction in powerpc - someone has a sense of humour :-)
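As a rough sketch of that point, register accesses are often wrapped in small helpers that take a volatile pointer and follow the access with a barrier. The helper names below are hypothetical, and `std::atomic_thread_fence` is only a portable stand-in: whether it is the right barrier for a given device and bus is hardware-specific, and real drivers use the architecture's own barrier instructions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: store through a volatile pointer so the compiler
// cannot drop, merge, or reorder the access with other volatile accesses.
inline void reg_write8(volatile std::uint8_t *reg, std::uint8_t value)
{
    assert(reg != nullptr);
    *reg = value;
    // Stand-in fence (assumption): real MMIO may need an
    // architecture-specific barrier instead.
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Hypothetical helper: load through a volatile pointer.
inline std::uint8_t reg_read8(const volatile std::uint8_t *reg)
{
    assert(reg != nullptr);
    const std::uint8_t value = *reg;
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return value;
}
```

With helpers like these, the loop in the question would call `reg_write8(blink_mem, i)` followed by `reg_read8(blink_mem)`. Note that volatile only constrains the compiler; it does nothing about the cache, which still has to be disabled in the mapping.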

Finally, just to reiterate what was said in the comments, which turned out to be the real answer to the question: when calling remap_pfn_range() you turn off caching by modifying the page protections passed as the last argument (prot) with the pgprot_noncached() macro, for example by setting vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot) in the driver's mmap handler before making the call. Cheers!

Related Solutions

Linux – ZONE_NORMAL and it’s association with Kernel/User-pages

On a 32-bit architecture you have 2^32 (4,294,967,296, i.e. 4 GB) linear addresses (not physical space) with which to refer to physical addresses.
Even with only 512 MB of physical storage (the real RAM stick connected to the bus), the kernel still works with the full 4 GB range of linear addresses and translates them to physical ones.

The Linux kernel divides these 4 GB of addresses between user space and kernel space in a 3:1 ratio: user space gets the lower 3 GB, and kernel space gets the upper 1 GB (1,073,741,824 linear addresses).

This 1 GB of linear addresses is accessible only by the kernel, and the physical memory behind it is divided up even further into zones.

ZONE_DMA: Contains page frames of memory below 16 MB. This zone exists for old ISA buses, which can address only the first 16 MB of RAM.

ZONE_NORMAL: Contains page frames of memory at and above 16 MB and below 896 MB. These are the page frames that the kernel can map and access directly.

ZONE_HIGHMEM: Contains page frames of memory at and above 896 MB. Page frames above this boundary are not generally mapped into the kernel space and are therefore not directly accessible by the kernel. Page frames from the user space can be temporarily or permanently mapped here.
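The zone boundaries listed above can be written down as a small classification function. This is only a sketch of the 16 MB and 896 MB limits as described here for a 32-bit system; `Zone`, `zone_for`, and `check_zone_boundaries` are illustrative names, not kernel APIs.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative names, not kernel identifiers.
enum class Zone { Dma, Normal, HighMem };

constexpr std::uint64_t MiB = 1024ULL * 1024ULL;

// Classify a physical address using the boundaries from the text.
Zone zone_for(std::uint64_t phys_addr)
{
    if (phys_addr < 16 * MiB)  return Zone::Dma;      // below 16 MB
    if (phys_addr < 896 * MiB) return Zone::Normal;   // 16 MB up to 896 MB
    return Zone::HighMem;                             // 896 MB and above
}

// Small self-check of the boundary cases.
void check_zone_boundaries()
{
    assert(zone_for(16 * MiB - 1) == Zone::Dma);
    assert(zone_for(16 * MiB) == Zone::Normal);
    assert(zone_for(896 * MiB) == Zone::HighMem);
}
```

On a machine with 512 MB of RAM, for instance, every page frame falls into the first two cases and ZONE_HIGHMEM stays empty.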

How much real, physical RAM space is occupied by the different zones depends on the form and number of processes you run.

If you enter free -ml in your console, you can see the usage including low- and high memory:

                 total       used       free     shared    buffers     cached
    Mem:          3022       2116        905          0        105       1342
    Low:           839        196        642
    High:         2182       1919        263
    -/+ buffers/cache:        667       2354
    Swap:         2859         93       2766

Kernel address space mappings with respect to virtual address space – a question based on text by Robert Love

Physical Address Extension (PAE) sounds exactly like what he's referring to.

A 32-bit CPU can only map 4 GB of memory at a time, even if the system has more. With PAE the system can use more than 4 GB of physical RAM, though only 4 GB of it is mapped at any one time (so a single process can never use more than 4 GB).

So basically when the kernel changes the actively running process, it re-maps the virtual memory to the physical memory which that process is currently using.
