Context:
To test a backup process that is supposed to run every night, I want to select random files from a large amount of data: around 7 million files on an NFS server with ~8 TB used, mainly web applications.
The random selection needs to be called twice: first as a purely random pick, and then to pick some fresh files (something like: find /data/ -mtime 1 | shuf -n 1).
I wrote a script that parses the backup configuration files, tries to restore files with the backup tool, compares the original checksum with the restored one, and reports all tests by mail.
Everything works except the random selection part, where I have performance issues.
I've tested many ways to select random files on a large filesystem. Here are some of my ideas:
- Select a random used inode and look up the filename associated with that inode → performance issue: it needs a lot of RAM and the process is very long.
- find /data/ -type f -mtime 1 | shuf -n 1
  → too many files are piped to shuf (time: ~46 seconds).
- RANDOM=$(shuf -i 1-7000000 -n 1) && find /data/ -type f -mtime 1 | head -n ${RANDOM}
  → same performance problem when the random number is > 1000000 (time: ~49 seconds).
- A Python discovery script using os.listdir → good performance, but when I used it with ctime, horrible performance.
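One detail worth noting about the shuf approaches above: buffering 7 million paths just to keep one is avoidable with reservoir sampling, which holds a single path in memory no matter how many files stream past (the traversal itself still has to visit every file, so this only addresses the memory side, not the disk IO). This is an illustrative sketch of that technique, not something from the question; the /data root is the mount described above:

```python
import os
import random

def random_file(root):
    """Pick one file uniformly at random from a directory tree
    using reservoir sampling: only one candidate path is ever
    held in memory, however many files the walk visits."""
    choice = None
    seen = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            seen += 1
            # Replace the current choice with probability 1/seen;
            # after the walk, every file had an equal chance.
            if random.randrange(seen) == 0:
                choice = os.path.join(dirpath, name)
    return choice

# random_file("/data")  # the NFS mount from the question
```

(Recent GNU coreutils versions of shuf -n do something similar internally, so memory may not be the bottleneck on every system.)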
I'm very surprised not to find a library/tool/script (in Python, Bash, C, or whatever) for doing things like this. It doesn't seem to be a niche problem; I imagine other admins around the world also want to randomly test whether their backups are working correctly.
So I'm interested in any way to do it, with standard GNU/Linux/BSD/*nix tools or a Python script/library. Please keep in mind that I'm looking for high performance: my script will call the solution for each path in the backup configuration files.
Thanks in advance
Best Answer
The reason there isn't a 'standard tool' is because the logic is - as you've found - quite simple. The limiting factor is that you must do a deep directory traversal, and that's always an expensive process.
It doesn't really matter what approach you take in terms of scripting tools - the 'cost' is the disk IO.
So the optimisations I'd suggest would be:
- Cache the results of the traversal, so the expensive deep walk only runs once per night (re-running find | shuf and find | head each time won't do this).
- Caching mtime alongside each path will help you build both lists: generate a random number, select the file on that line for the pure random pick, and the last recently-modified file before that number for the fresh pick.
Something like this (in perl, but I'm sure you could do it in Python):
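The perl script itself did not survive in this copy of the answer; what follows is a hedged Python sketch of the approach described above, not the original code. One nightly traversal caches every path with its mtime, sorted by mtime, and both the pure random pick and the fresh-file pick then become cheap line selections from the cache. The cache location and the 24-hour freshness window are assumptions:

```python
import os
import random
import time

CACHE = "/var/tmp/backup-test.filelist"  # assumed location, adjust to taste

def build_cache(root):
    """The one expensive deep traversal per night: record (mtime, path)
    for every file, sorted by mtime, so later picks never touch the tree."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                entries.append((os.stat(path).st_mtime, path))
            except OSError:
                pass  # file vanished mid-walk; skip it
    entries.sort()
    with open(CACHE, "w") as f:
        for mtime, path in entries:
            f.write("%d %s\n" % (mtime, path))

def pick_random():
    """Uniform pick over the whole cached list."""
    with open(CACHE) as f:
        lines = f.readlines()
    return random.choice(lines).split(" ", 1)[1].rstrip("\n")

def pick_fresh(max_age=24 * 3600):
    """Pick among files modified within max_age seconds; since the
    cache is sorted by mtime, the fresh files are a tail slice."""
    cutoff = time.time() - max_age
    with open(CACHE) as f:
        lines = f.readlines()
    fresh = [l for l in lines if int(l.split(" ", 1)[0]) >= cutoff]
    return random.choice(fresh).split(" ", 1)[1].rstrip("\n") if fresh else None
```

Note the caveats of a line-oriented cache: paths containing newlines would break it, and reading all 7 million lines costs a few hundred MB of RAM per pick; both are fixable (NUL separators, streaming over the file) if they matter in your environment.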