Millions of (small) text files in a folder

Tags: ext4, files, filesystems, performance

We would like to store millions of text files in a Linux filesystem, with the purpose of being able to zip up and serve an arbitrary collection as a service. We've tried other solutions, like a key/value database, but our requirements for concurrency and parallelism make using the native filesystem the best choice.

The most straightforward way is to store all files in a folder:

$ ls text_files/
1.txt
2.txt
3.txt

which should be possible on an ext4 filesystem, which has no limit on the number of files in a single directory.
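
ext4's practical ceiling is instead the filesystem's total inode count, which is fixed when the filesystem is created and consumed at one inode per file. A quick way to check the headroom before committing to this layout:

$ df -i text_files/

Ten million files will need ten million free inodes (the IFree column), so it is worth confirming there is room.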

The two filesystem operations will be:

  1. Write a text file fetched by the web scraper (this shouldn't be affected by the number of files in the folder).
  2. Zip selected files, given a list of filenames (see the sketch after this list).
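
As a minimal sketch of both operations, assuming the scraped content is in $page_content, the file's identifier is in $id, and the selection arrives as a newline-separated list in wanted.txt (all three names are hypothetical):

$ printf '%s' "$page_content" > "text_files/$id.txt"
$ zip archive.zip -@ < wanted.txt

The -@ flag makes zip read the list of filenames from standard input instead of the command line.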

My question is, will storing up to ten million files in a folder affect the performance of the above operations, or general system performance, any differently than making a tree of subfolders for the files to live in?
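
For reference, the subfolder alternative we are considering would shard files by a prefix of the name, along these lines (the layout is illustrative):

$ ls text_files/00/01/
000123.txt
000174.txt

i.e. text_files/<first two digits>/<next two digits>/<name>.txt, so that no single directory holds more than a small slice of the total.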

Best Answer

The ls command, and even TAB-completion or wildcard expansion by the shell, normally presents results in alphanumeric order. That requires reading the entire directory listing and sorting it. With ten million files in a single directory, the sort alone will take a non-negligible amount of time.
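
If you only need to peek into the directory, ls can be told to skip the sort entirely; -f (or -U on GNU ls) streams entries in directory order as they are read:

$ ls -f text_files/ | head

This avoids reading and sorting ten million names just to look at a few of them.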

If you can resist the urge to use TAB-completion and, for example, write the names of the files to be zipped out in full, there should be no problems.

Another problem is that wildcard expansion may produce more filenames than will fit on a maximum-length command line. The typical maximum command line length is more than adequate for most situations, but with millions of files in a single directory it is no longer a safe assumption. When wildcard expansion exceeds the limit, most shells will simply fail the entire command line without executing it.
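
The limit and the failure mode look something like this (the exact number and message vary by system; the output shown is illustrative):

$ getconf ARG_MAX
2097152
$ zip archive.zip text_files/*.txt
bash: /usr/bin/zip: Argument list too long

ARG_MAX is the kernel's ceiling on the combined size of the argument list and environment handed to a new process.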

This can be solved by doing your wildcard operations using the find command:

find <directory> -name '<wildcard expression>' -exec <command> {} +

or a similar syntax whenever possible. find ... -exec ... + automatically takes the maximum command line length into account, and will execute the command as many times as required while fitting as many filenames as possible onto each command line.
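
For the zip step in this question, a concrete instance reusing the directory and pattern from above would be:

$ find text_files/ -name '*.txt' -exec zip archive.zip {} +

Since zip appends to an existing archive by default, the handful of batched invocations that find issues all accumulate into the same archive.zip.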
