Millions of (small) text files in a folder

Tags: ext4, files, filesystems, performance

We would like to store millions of text files in a Linux filesystem, with the purpose of being able to zip up and serve an arbitrary collection as a service. We've tried other solutions, like a key/value database, but our requirements for concurrency and parallelism make using the native filesystem the best choice.

The most straightforward way is to store all files in a folder:

$ ls text_files/
1.txt
2.txt
3.txt

which should be possible on an ext4 filesystem, which has no limit on the number of files in a single directory.
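
ext4's practical ceiling is instead the filesystem's total inode count, which is fixed when the filesystem is created and consumed at one inode per file. A quick way to check the headroom before committing to this layout:

$ df -i text_files/

Ten million files will need ten million free inodes (the IFree column), so it is worth confirming there is room.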

The two filesystem operations will be:

  1. Write a text file fetched by the web scraper (this shouldn't be affected by the number of files in the folder).
  2. Zip selected files, given a list of filenames (see the sketch after this list).
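
As a minimal sketch of both operations, assuming the scraped content is in $page_content, the file's identifier is in $id, and the selection arrives as a newline-separated list in wanted.txt (all three names are hypothetical):

$ printf '%s' "$page_content" > "text_files/$id.txt"
$ zip archive.zip -@ < wanted.txt

The -@ flag makes zip read the list of filenames from standard input instead of the command line.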

My question is, will storing up to ten million files in a folder affect the performance of the above operations, or general system performance, any differently than making a tree of subfolders for the files to live in?
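
For reference, the subfolder alternative we are considering would shard files by a prefix of the name, along these lines (the layout is illustrative):

$ ls text_files/00/01/
000123.txt
000174.txt

i.e. text_files/<first two digits>/<next two digits>/<name>.txt, so that no single directory holds more than a small slice of the total.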

Best Answer

The ls command, and even TAB-completion or wildcard expansion by the shell, normally presents results in alphanumeric order. That requires reading the entire directory listing and sorting it. With ten million files in a single directory, the sort alone will take a non-negligible amount of time.
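
If you only need to peek into the directory, ls can be told to skip the sort entirely; -f (or -U on GNU ls) streams entries in directory order as they are read:

$ ls -f text_files/ | head

This avoids reading and sorting ten million names just to look at a few of them.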

If you can resist the urge to use TAB-completion and, for example, write the names of the files to be zipped out in full, there should be no problems.

Another problem is that wildcard expansion may produce more filenames than will fit on a maximum-length command line. The typical maximum command line length is more than adequate for most situations, but with millions of files in a single directory it is no longer a safe assumption. When wildcard expansion exceeds the limit, most shells will simply fail the entire command line without executing it.
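
The limit and the failure mode look something like this (the exact number and message vary by system; the output shown is illustrative):

$ getconf ARG_MAX
2097152
$ zip archive.zip text_files/*.txt
bash: /usr/bin/zip: Argument list too long

ARG_MAX is the kernel's ceiling on the combined size of the argument list and environment handed to a new process.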

This can be solved by doing your wildcard operations using the find command:

find <directory> -name '<wildcard expression>' -exec <command> {} +

or a similar syntax whenever possible. find ... -exec ... + automatically takes the maximum command line length into account, and will execute the command as many times as required while fitting as many filenames as possible onto each command line.
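
For the zip step in this question, a concrete instance reusing the directory and pattern from above would be:

$ find text_files/ -name '*.txt' -exec zip archive.zip {} +

Since zip appends to an existing archive by default, the handful of batched invocations that find issues all accumulate into the same archive.zip.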
