Optimize ext4 for always-full operation

ext3, ext4, filesystems

Our application writes data to disk as a huge ring buffer (30 to 150 TB): it writes new files while deleting old ones, so by definition the disk is always "near full".

The writer process creates various files at a net input rate of about 100-150 Mbit/s. The data is a mixture of 1 GB 'data' files and several smaller metadata files. (The input rate is constant, but note that new file sets are created only once every two minutes.)

There is a separate deleter process which deletes the "oldest" files every 30 s. It keeps deleting until it reaches 15 GB of free-space headroom on the disk.

So in stable operation, all data partitions have only 15 GB of free space.
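For illustration only, here is a minimal shell sketch of such a deleter loop; the /data01 mount point, the df-based headroom check, and the use of mtime to define "oldest" are assumptions for the example, not our actual implementation:

    #!/bin/sh
    # Hypothetical deleter sketch: remove oldest files until at least
    # 15 GB of headroom is free on the volume.
    VOL=/data01                      # placeholder mount point
    HEADROOM=$((15 * 1024 * 1024))   # 15 GB in the 1K blocks df reports

    while [ "$(df --output=avail "$VOL" | tail -n 1)" -lt "$HEADROOM" ]; do
        # "Oldest" here means oldest mtime; the real process keeps its own index.
        oldest=$(find "$VOL" -type f -printf '%T@ %p\n' | sort -n | head -n 1 | cut -d' ' -f2-)
        [ -n "$oldest" ] || break    # nothing left to delete
        rm -f -- "$oldest"
    done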

On this SO question relating to file system slowdown, DepressedDaniel commented:

Sync hanging just means the filesystem is working hard to save the latest operations consistently. It is most certainly trying to shuffle data around on the disk in that time. I don't know the details, but I'm pretty sure if your filesystem is heavily fragmented, ext4 will try to do something about that. And that can't be good if the filesystem is nearly 100% full. The only reasonable way to utilize a filesystem at near 100% of capacity is to statically initialize it with some files and then overwrite those same files in place (to avoid fragmenting). Probably works best with ext2/3.

Is ext4 a bad choice for this application? Since we are running live, what tuning can be done to ext4 to avoid fragmentation, slowdowns, or other performance limitations? Changing from ext4 would be quite difficult…

(and rewriting statically created files means rewriting the entire application)

Thanks!

EDIT I

The server has 50 to 100 TB of disks attached (24 drives). An Areca RAID controller manages the 24 drives as a single RAID-6 set.

From there we divide into several partitions/volumes, with each volume being 5 to 10 TB, so the size of any one volume is not huge.

The "writer" process finds the first volume with "enough" space and writes a file there. After the file is written the process is repeated.

For a brand new machine, the volumes are filled up in order. If all volumes are "full" then the "deleter" process starts deleting the oldest files until "enough" space is available.
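Purely as a sketch of that selection logic (the /data* mount points and the 2 GB "enough space" threshold are made-up values, not our real configuration):

    # Find the first volume with "enough" space for the next ~1 GB file set.
    NEED=$((2 * 1024 * 1024))   # 2 GB in 1K blocks, with some margin
    for vol in /data*; do
        if [ "$(df --output=avail "$vol" | tail -n 1)" -ge "$NEED" ]; then
            echo "next file set goes to $vol"
            break
        fi
    done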

Over a long time, because of the action of other processes, the time sequence of files becomes randomly distributed across all volumes.

EDIT II

Running fsck shows very low fragmentation: 1-2%. However, in the meantime, slow filesystem access has been traced to various C library calls such as fclose(), fwrite() and ftello() taking a very long time to execute (5 to 60 seconds!).

So far there is no solution to this problem. See more details in this SO question: How to debug very slow (200 sec) fwrite()/ftello()/fclose()?
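For anyone wanting to reproduce this kind of measurement: one way, assuming you can attach to the writer process (its name here is a placeholder), is strace with per-syscall timing:

    # -T appends the time spent in each syscall; -f follows threads.
    # fwrite()/fclose() show up as write()/close() at the syscall level.
    strace -f -T -e trace=write,close,fsync -p "$(pidof writer)" 2>&1 |
        awk -F'<' '{ t = $NF; sub(/>/, "", t); if (t + 0 > 1.0) print }'  # calls slower than 1 s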

I've disabled sysstat and raid-check to see if there is any improvement.

Best Answer

In principle, I don't see why strict ring-buffer writes would pose any challenge regarding fragmentation; it seems like it would be straightforward. The quote sounds to me like it is based on advice for more general write workloads. But looking at the linked SO question, I see you have a real problem...

Since you are concerned about fragmentation, you should consider how to measure it! e4defrag exists, and it has only two options: -c only shows the current state and does not defragment, and -v shows per-file statistics. All combinations of options are valid (including no options). Although e4defrag does not provide any explicit method to limit the performance impact on a running system, it can be run on individual files, so you can rate-limit it yourself.
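For example (the mount point and file names are placeholders):

    # Report fragmentation without changing anything:
    e4defrag -c /data01

    # Defragment one file at a time, pacing the work yourself:
    for f in /data01/old-segment-*.dat; do
        e4defrag "$f"
        sleep 10   # crude self-imposed rate limit
    done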

(XFS also has a defrag tool, xfs_fsr, though I haven't used it.)

e2freefrag can show free space fragmentation. If you use the CFQ IO scheduler, then you can run it with a reduced IO priority using ionice.
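For example, against the block device backing one volume (the device name is a placeholder):

    # Idle IO class (-c 3) so this yields to the writer under CFQ:
    ionice -c 3 e2freefrag /dev/sdb1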

The quote guesses wrong; the reply by Stephen Kitt is correct. ext4 does not perform any automatic defragmentation. It does not try to "shuffle around" data which has already been written.

Discarding this strange misconception leaves no reason to suggest "ext2/ext3". Apart from anything else, the ext3 code does not exist in current kernels: the ext4 code is used to mount ext3 filesystems, and ext3 is effectively a subset of ext4. In particular, when you are creating relatively large files, it just seems silly not to use extents, and those are an ext4-specific feature.
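Both points are easy to verify from the shell; for example (the device and file name are placeholders):

    # Is the extent feature enabled on this filesystem?
    tune2fs -l /dev/sdb1 | grep -i 'features'

    # Is this large file extent-mapped, and across how many extents?
    filefrag -v /data01/segment-0001.dat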

I believe "hanging" is more often associated with the journal. See e.g. these comments from (the in-progress filesystem) bcachefs:

Tail latency has been the bane of ext4 users for many years - dependencies in the journalling code and elsewhere can lead to 30+ second latencies on simple operations (e.g. unlinks) on multithreaded workloads. No one seems to know how to fix them.

In bcachefs, the only reason a thread blocks on IO is because it explicitly asked to (an uncached read or an fsync operation), or resource exhaustion - full stop. Locks that would block foreground operations are never held while doing IO. While bcachefs isn't a realtime filesystem today (it lacks e.g. realtime scheduling for IO), it very conceivably could be one day.

Don't ask me to interpret the extent to which using XFS can avoid the above problem. I don't know. But if you were considering testing an alternative filesystem setup, XFS is the first thing I would try.

I'm struggling to find much information about the effects of disabling journalling on ext4. At least it doesn't seem to be one of the common options considered when tuning performance.
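For what it's worth, if you did want to experiment with it, the journal can only be removed offline; a sketch, with the device and mount point as placeholders (an experiment, not a recommendation):

    # Removing the journal requires the filesystem to be unmounted;
    # running fsck afterwards is strongly recommended.
    umount /data01
    tune2fs -O ^has_journal /dev/sdb1
    e2fsck -f /dev/sdb1
    mount /dev/sdb1 /data01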

I'm not sure why you're using sys_sync(). It's usually better avoided (see e.g. here). I doubt that really explains your problem, but it seems an unfortunate thing to come across when trying to narrow this down.
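If you do need periodic flushing, a more targeted option is fsync()/fdatasync() on the specific files from within the writer, or a per-filesystem sync from the shell (the mount point is a placeholder):

    # Flush only the filesystem holding the data, not every mount on the box;
    # --file-system requires coreutils >= 8.24.
    sync --file-system /data01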
