Is there any Linux command one can use to sample a subset of a file? For instance, a file contains one million lines, and we want to randomly sample only one thousand lines from that file.
By random I mean that every line has the same probability of being chosen and none of the chosen lines is repeated.
head and tail can pick a subset of the file, but not randomly. I know I can always write a Python script to do this, but I am wondering whether there is a command for this usage.
Best Answer
The shuf command (part of coreutils) can do this: its -n option limits the output to the given number of randomly chosen lines. At least in non-ancient versions (the change was added in a commit from 2013), it will use reservoir sampling when appropriate, meaning it shouldn't run out of memory and uses a fast algorithm.
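For the example in the question, a minimal sketch might look like the following. The file names input.txt and sample.txt are placeholders made up for illustration, and the input is generated with seq only so that the snippet runs on its own. It assumes GNU coreutils is installed.

```shell
# Build a hypothetical one-million-line input file (placeholder name).
seq 1 1000000 > input.txt

# Pick 1000 distinct lines at random, each line equally likely.
# -n is the short form of --head-count.
shuf -n 1000 input.txt > sample.txt

# Sanity checks: count the sampled lines, then count lines that
# appear more than once (there should be none for this input,
# because every input line is unique).
wc -l < sample.txt
sort sample.txt | uniq -d | wc -l
```

Lines are never picked twice unless the -r (--repeat) option is given, so repeats in the output can only come from lines that are already duplicated in the input file.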