Shell – Random file selector in a filesystem

Tags: backup, files, python, shell

Context:
In order to test a backup process that is supposed to run every night, I want to select random files from a large amount of data (around 7 million files). This is an NFS server with ~8 TB of data in use, mainly web applications.

The random selection needs to be run twice: first as a pure random selection, and then to pick some fresh files (something like: find /data/ -mtime 1 | shuf -n 1).
I wrote a script that parses the backup configuration files, tries to restore the selected files with the backup tool, compares each original checksum with the restored one, and reports all the tests by mail.
Everything works except the "random selection part": I have performance issues with it.

I've tested several ways to select random files on a large FS. Here are some of my ideas:

  • Select random used inodes and look up the filename associated with each inode (performance issue: a lot of RAM needed, and the process is very slow).
  • find /data/ -type f -mtime 1 | shuf -n 1 → too many files piped to shuf (time $command ≈ 46 seconds).
  • N=$(shuf -i 1-7000000 -n 1) && find /data/ -type f -mtime 1 | head -n ${N} → same performance problem when the random number is > 1000000 (time $command ≈ 49 seconds).
  • Python discovery script with os.listdir → good performance, but as soon as I checked ctime for each file, horrible performance issues (a rough sketch of that approach follows this list).
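
For reference (not my actual script, just a sketch of the idea), a minimal single-pass version of that last approach in Python 3 could look like this, assuming os.walk over the example root /data/ and "fresh" meaning modified within the last day. It uses reservoir sampling to keep one uniformly random file and, in the same pass, one random fresh file, so nothing has to be stored or piped to shuf; the per-file stat() for the freshness check is still the costly part.

#!/usr/bin/env python3
# Sketch only: single-pass random file picker using reservoir sampling.
# /data/ and the one-day "fresh" window are example values.
import os
import random
import time

def pick_random_files(root, fresh_seconds=86400):
    random_file = None      # uniformly random file over the whole tree
    fresh_file = None       # uniformly random file modified recently
    seen = 0
    seen_fresh = 0
    cutoff = time.time() - fresh_seconds

    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            seen += 1
            # Reservoir sampling: keep the current path with probability 1/seen.
            if random.randrange(seen) == 0:
                random_file = path
            try:
                if os.stat(path).st_mtime >= cutoff:    # one stat() per file
                    seen_fresh += 1
                    if random.randrange(seen_fresh) == 0:
                        fresh_file = path
            except OSError:
                pass    # file vanished or unreadable; skip it
    return random_file, fresh_file

if __name__ == "__main__":
    print(pick_random_files("/data/"))

It still walks the whole tree, which is where the time goes, so it only avoids the memory and pipe overhead, not the disk I/O.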

I'm very surprised not to find any library/tool/script (in Python, Bash, C, or whatever) for doing this kind of thing. It doesn't seem like a niche problem; I imagine other admins around the world also want to randomly test whether their backups are working correctly.

So I'm interested in ways to do this with specific GNU/Linux/BSD/*nix tools or a Python script/library. Please keep in mind that I'm looking for something high performance: my script will call the solution once for each path in the backup configuration files.

Thanks in advance

Best Answer

The reason there isn't a 'standard tool' is that the logic is, as you've found, quite simple. The limiting factor is that you must do a deep directory traversal, and that's always an expensive process.

It doesn't really matter which scripting tool you use; the 'cost' is the disk I/O.

So the optimisations I'd suggest would be:

  • Don't walk the whole FS. Bail out of your traversal when you've found enough. (find | shuf and find | head won't do this).
  • You can probably approximate directory sizes from previous traversals, and 'skip ahead' by some margin.
  • Stat files as you go and record mtime, and you can build both selections in a single pass: generate a random number up front, stop once you've seen that many files, then take the last file seen as the random pick and the last recently modified file seen before stopping as the fresh pick.

Something like this (in Perl, but I'm sure you could do it in Python):

#!/usr/bin/env perl
use strict;
use warnings;

use File::Find;

my $random_file;          # last file seen before hitting the limit
my $recent_random_file;   # last file seen that was modified within a day

my $count = 0;
my $limit = rand(7_000_000);    # ideally set to the file count on the FS

sub search {
    if ( $count++ > $limit ) {
        $File::Find::prune = 1;    # seen enough; stop descending into new directories
        return;
    }
    return unless -f;              # only consider plain files
    if ( -M $File::Find::name < 1 ) {    # modified within the last day
        $recent_random_file = $File::Find::name;
    }
    $random_file = $File::Find::name;
}

find ( \&search, "/path/to/search");
print "$recent_random_file $random_file\n";