I need to read data sequentially from a file without keeping it in the page cache. The file contents are not expected to be read again, and the box is under memory pressure (I want the precious memory used for caching useful disk I/O).
My question is how to optimize these reads. Since I know the data is laid out sequentially on disk (fragmentation aside), I want to read ahead by increasing /sys/block/sda/queue/read_ahead_kb. I am not sure this will help, though, because I have to keep the data out of the page cache by calling posix_fadvise with the POSIX_FADV_DONTNEED flag.
Will the read ahead data be simply discarded because of the hint to drop the data from the page cache?
Best Answer
Use direct IO:
For example:
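A minimal sketch of opening a file for direct IO, assuming Linux with glibc (where `O_DIRECT` is only visible when `_GNU_SOURCE` is defined). The helper name `open_uncached` and its `is_direct` out-parameter are made up for illustration, and the fallback to a normal open when the file system rejects `O_DIRECT` is my own assumption about what a caller might want, not something the approach requires:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* glibc only exposes O_DIRECT with this defined */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * open_uncached() is a hypothetical helper, not part of any library.
 * It opens `path` read-only with O_DIRECT so reads bypass the page cache.
 * Returns a file descriptor, or -1 with errno set.
 * *is_direct is set to 1 if O_DIRECT was accepted, 0 if we fell back.
 */
int open_uncached(const char *path, int *is_direct)
{
    assert(path != NULL && is_direct != NULL);

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd >= 0) {
        *is_direct = 1;
        return fd;
    }
    if (errno != EINVAL)
        return -1;              /* e.g. ENOENT or EACCES: a real failure */

    /* Assumption: EINVAL here means the file system lacks O_DIRECT support,
     * so fall back to an ordinary buffered open. */
    *is_direct = 0;
    return open(path, O_RDONLY);
}
```

If the fallback path is taken, the reads go through the page cache again, so the caller would still need something like `posix_fadvise` with `POSIX_FADV_DONTNEED` to drop the pages afterwards.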
Direct IO on Linux is quirky and has some restrictions. The application IO buffer must be page-aligned, and some file systems require that each IO request be an exact multiple of the page size. That last restriction can make reading/writing the last portion of a file difficult.
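To make those restrictions concrete, here is a rough sketch of a read loop that respects them: the buffer comes from `posix_memalign` with page alignment, and every request is a whole multiple of the page size. The function name `read_file_direct` is hypothetical. The sketch also assumes the file system returns a short count for the final partial chunk rather than an error, which is worth verifying on the file system you actually use:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for O_DIRECT on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * read_file_direct() is a hypothetical helper: it reads `path` to the end
 * using O_DIRECT and returns the number of bytes read, or -1 with errno set.
 * `chunk` must be a non-zero multiple of the page size.
 */
long long read_file_direct(const char *path, size_t chunk)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && chunk > 0 && chunk % (size_t)page == 0);

    void *buf = NULL;
    int rc = posix_memalign(&buf, (size_t)page, chunk);  /* page-aligned buffer */
    if (rc != 0) {
        errno = rc;            /* posix_memalign returns the error code */
        return -1;
    }

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        int saved = errno;
        free(buf);
        errno = saved;
        return -1;
    }

    long long total = 0;
    for (;;) {
        /* Always request a full chunk, even near the end of the file. */
        ssize_t n = read(fd, buf, chunk);
        if (n < 0) {
            total = -1;
            break;
        }
        if (n == 0)
            break;             /* end of file */
        /* ... process the n bytes in buf here ... */
        total += n;
    }

    int saved = errno;         /* keep the read() error, if any */
    close(fd);
    free(buf);
    errno = saved;
    return total;
}
```

The point of requesting a full chunk every time is that the request size stays aligned; only the returned count is short at the tail.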
An easy-to-code way to handle readahead in your application can be done using `fdopen` and setting a large page-aligned buffer using `posix_memalign` and `setvbuf`.

You can also use `mmap()` to get anonymous memory to use for the buffer. That has the advantage of being naturally page-aligned.

Then just use `fread()`/`fgets()` or any `FILE *`-type read function you want to read from the `file` stream.

You do need to check using a tool such as `strace` that the actual `read` system calls are done with a page-aligned and page-sized buffer - some C library implementations of `FILE *`-based stream processing don't use the buffer specified by `setvbuf` for just IO buffering, so the alignment and size can be off. I don't think Linux/glibc does that, but if you don't check and the size and/or alignment is off, your IO calls will fail.

And again - Linux direct IO can be quirky. Only some file systems support direct IO, and some of them are more particular than others. TEST this thoroughly if you decide to use it.
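A sketch of the `fdopen` + `posix_memalign` + `setvbuf` approach, with the `mmap()` variant alongside. The helper names `open_direct_stream` and `map_io_buffer` and the `BUF_SIZE` constant are illustrative, not from any library, and error handling is kept to the minimum. It assumes Linux/glibc and a 1 MB buffer:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for O_DIRECT on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE (1024UL * 1024UL)   /* 1 MB; assumed to be a multiple of the page size */

/*
 * Hypothetical helper: open `path` with O_DIRECT and wrap it in a FILE *
 * whose stdio buffer is a page-aligned 1 MB block.
 * On success *buf_out receives the buffer; free() it after fclose().
 * Returns NULL on failure.
 */
FILE *open_direct_stream(const char *path, void **buf_out)
{
    void *buf = NULL;

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return NULL;

    if (posix_memalign(&buf, (size_t)sysconf(_SC_PAGESIZE), BUF_SIZE) != 0) {
        close(fd);
        return NULL;
    }

    FILE *fp = fdopen(fd, "rb");
    if (fp == NULL) {
        free(buf);
        close(fd);
        return NULL;
    }

    /* Must be called before any other operation on the stream. */
    if (setvbuf(fp, buf, _IOFBF, BUF_SIZE) != 0) {
        fclose(fp);            /* also closes fd */
        free(buf);
        return NULL;
    }

    *buf_out = buf;
    return fp;
}

/*
 * Alternative buffer source: anonymous mmap() memory, which is page-aligned
 * by construction. Release it with munmap(buf, BUF_SIZE) instead of free().
 */
void *map_io_buffer(void)
{
    void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return buf == MAP_FAILED ? NULL : buf;
}
```

When reading from the returned stream, it is probably safest to request pieces smaller than `BUF_SIZE` (for example with `fgets()` or small `fread()` calls), so that the library has to fill the aligned buffer rather than your own destination; either way, confirm the actual `read` calls with `strace` as described above.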
The posted code will do a 1 MB read-ahead whenever the stream's buffer needs to be filled. You can also implement more sophisticated read-ahead using threads - one thread fills one buffer, other thread(s) read from a full buffer. That would avoid processing "stutters" as the read-ahead is done, but at the cost of a good amount of relatively complex multi-threaded code.