How to Randomly Sample a Subset of a File – Command Line Guide


Is there a Linux command that can sample a subset of a file? For instance, a file contains one million lines, and we want to randomly sample only one thousand lines from it.

By random I mean that every line has the same probability of being chosen and that no chosen line is repeated.

head and tail can pick a subset of the file, but not randomly. I know I can always write a Python script to do this, but I am wondering whether there is a command for it.

Best Answer

The shuf command (part of coreutils) can do this:

shuf -n 1000 file
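
As a quick sanity check of the "no repeated lines" requirement, here is a minimal sketch. The wrapper name sample_lines and the temporary file are made up for illustration; only shuf -n comes from the answer. It builds a file of one million distinct lines, samples 1000 of them, and counts both the sampled lines and the distinct sampled lines.

```shell
# sample_lines is a hypothetical helper name, not part of coreutils.
# Usage: sample_lines COUNT FILE
sample_lines() {
    shuf -n "$1" -- "$2"
}

# Build a throwaway input file with one million distinct lines.
input=$(mktemp)
seq 1 1000000 > "$input"

# Draw the sample once and keep it so both counts refer to the same draw.
sample=$(sample_lines 1000 "$input")

# Number of sampled lines.
printf '%s\n' "$sample" | wc -l

# Number of distinct sampled lines; equal to the count above when nothing repeats.
printf '%s\n' "$sample" | sort -u | wc -l

rm -f "$input"
```

Because the input lines are all distinct here, the two counts should match. If the input file itself contains duplicate lines, the same text can still appear twice in the sample even though no line position is picked twice.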

At least in non-ancient versions (the change was added in a commit from 2013), shuf uses reservoir sampling when appropriate, which means it shouldn't run out of memory on large inputs and it uses a fast algorithm.
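
To show what reservoir sampling means, here is a rough sketch of the textbook version (often called Algorithm R) written with awk. It is not shuf's actual implementation, and the function name reservoir_sample and its optional seed argument are assumptions made for this example. It keeps only k lines in memory at any time, which is why the technique works on inputs that are too large to load whole.

```shell
# reservoir_sample is a hypothetical helper, not a standard command.
# Usage: reservoir_sample K FILE [SEED]
reservoir_sample() {
    awk -v k="$1" -v seed="$3" '
        BEGIN {
            # Use the given seed if any, otherwise seed from the time of day.
            if (seed != "") srand(seed); else srand()
        }
        # Fill the reservoir with the first k lines.
        NR <= k { r[NR] = $0; next }
        # For line number NR > k, pick j uniformly from 1..NR and
        # overwrite slot j only if it falls inside the reservoir.
        {
            j = int(rand() * NR) + 1
            if (j <= k) r[j] = $0
        }
        END {
            # If the file has fewer than k lines, print all of them.
            n = (NR < k) ? NR : k
            for (i = 1; i <= n; i++) print r[i]
        }
    ' "$2"
}
```

Calling reservoir_sample 1000 file would print 1000 lines chosen without repetition of line positions, in a single pass over the file. For real use shuf is the better choice; awk's rand() is not a high-quality random source, and the output of this sketch is not shuffled beyond the replacement order.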
