Shell – Random file selector in a filesystem

Tags: backup, files, python, shell

Context:
In order to test a backup process that is supposed to run every night, I want to select random files from a large amount of data (around 7 million files). This is an NFS server with ~8 TB of data in use, mainly web applications.

The random selection needs to be run twice: first as a pure random selection, and then to pick some fresh files (something like: find /data/ -mtime 1 | shuf -n 1).
I wrote a script that parses the backup configuration files, tries to restore the selected files with the backup tool, compares each original checksum with the restored one, and reports all the tests by mail.
Everything works except the "random selection part": I have performance issues with it.

I've tested several ways to select random files on a large FS. Here are some of my ideas:

  • Select random used inodes and look up the filename associated with each inode (performance issue: a lot of RAM needed, and the process is very slow).
  • find /data/ -type f -mtime 1 | shuf -n 1 → too many files piped to shuf (time $command ≈ 46 seconds).
  • N=$(shuf -i 1-7000000 -n 1) && find /data/ -type f -mtime 1 | head -n ${N} → same performance problem when the random number is > 1000000 (time $command ≈ 49 seconds).
  • Python discovery script with os.listdir → good performance, but as soon as I checked ctime for each file, horrible performance issues (a rough sketch of that approach follows this list).
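
For reference (not my actual script, just a sketch of the idea), a minimal single-pass version of that last approach in Python 3 could look like this, assuming os.walk over the example root /data/ and "fresh" meaning modified within the last day. It uses reservoir sampling to keep one uniformly random file and, in the same pass, one random fresh file, so nothing has to be stored or piped to shuf; the per-file stat() for the freshness check is still the costly part.

#!/usr/bin/env python3
# Sketch only: single-pass random file picker using reservoir sampling.
# /data/ and the one-day "fresh" window are example values.
import os
import random
import time

def pick_random_files(root, fresh_seconds=86400):
    random_file = None      # uniformly random file over the whole tree
    fresh_file = None       # uniformly random file modified recently
    seen = 0
    seen_fresh = 0
    cutoff = time.time() - fresh_seconds

    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            seen += 1
            # Reservoir sampling: keep the current path with probability 1/seen.
            if random.randrange(seen) == 0:
                random_file = path
            try:
                if os.stat(path).st_mtime >= cutoff:    # one stat() per file
                    seen_fresh += 1
                    if random.randrange(seen_fresh) == 0:
                        fresh_file = path
            except OSError:
                pass    # file vanished or unreadable; skip it
    return random_file, fresh_file

if __name__ == "__main__":
    print(pick_random_files("/data/"))

It still walks the whole tree, which is where the time goes, so it only avoids the memory and pipe overhead, not the disk I/O.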

I'm very surprised not to find any library/tool/script (in Python, Bash, C, or whatever) for doing this kind of thing. It doesn't seem like a niche problem; I imagine other admins around the world also want to randomly test whether their backups are working correctly.

So I'm interested in ways to do this with specific GNU/Linux/BSD/*nix tools or a Python script/library. Please keep in mind that I'm looking for something high performance: my script will call the solution once for each path in the backup configuration files.

Thanks in advance

Best Answer

The reason there isn't a 'standard tool' is that the logic is, as you've found, quite simple. The limiting factor is that you must do a deep directory traversal, and that's always an expensive process.

It doesn't really matter which scripting tool you use; the 'cost' is the disk I/O.

So the optimisations I'd suggest would be:

  • Don't walk the whole FS. Bail out of your traversal when you've found enough. (find | shuf and find | head won't do this).
  • You can probably approximate directory sizes from previous traversals, and 'skip ahead' by some margin.
  • Stat files as you go and record mtime, and you can build both selections in a single pass: generate a random number up front, stop once you've seen that many files, then take the last file seen as the random pick and the last recently modified file seen before stopping as the fresh pick.

Something like this (in Perl, but I'm sure you could do it in Python):

#!/usr/bin/env perl
use strict;
use warnings;

use File::Find;

my $random_file;          # last file seen before hitting the limit
my $recent_random_file;   # last file seen that was modified within a day

my $count = 0;
my $limit = rand(7_000_000);    # ideally set to the file count on the FS

sub search {
    if ( $count++ > $limit ) {
        $File::Find::prune = 1;    # seen enough; stop descending into new directories
        return;
    }
    return unless -f;              # only consider plain files
    if ( -M $File::Find::name < 1 ) {    # modified within the last day
        $recent_random_file = $File::Find::name;
    }
    $random_file = $File::Find::name;
}

find ( \&search, "/path/to/search");
print "$recent_random_file $random_file\n";