Extract random sample of N lines based on pattern

Tags: awk, random, sed, text-processing

I have a file formatted like this:

train/t/temple/east_asia/00000025.jpg 94
train/t/temple/east_asia/00000865.jpg 94
...
train/s/swamp/00000560.jpg 92
train/s/swamp/00000935.jpg 92
...
train/m/mountain/00000428.jpg 68
train/m/mountain/00000126.jpg 68

The last number is the class number. I have 50 different classes, and each class has 1,000 lines. I would like to take a random sample of size N from each class, and store the result in another text file.

Best Answer

Since your lines are grouped by class, you can (with GNU tools) split the file into pieces and use the --filter option to pipe each piece to shuf, which extracts N random lines from it:

split --filter='shuf -n N' infile > outfile

Note that split defaults to pieces of 1,000 lines, which is exactly what you need in this particular case (replace N with your desired sample size). If the requirements change, you'll have to pass the number of lines per piece via -l.
For example, to split into pieces of 200 lines and extract 30 random lines from each piece:

split -l 200 --filter='shuf -n 30' infile > outfile
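As a quick sanity check, here is a self-contained sketch of the same idea on a small generated file (the file name sample.txt, the class labels, and the sample size 2 are made up for illustration; requires GNU split and shuf):

```shell
# Build a toy input: 3 classes, 10 lines each, grouped by class
# (mimicking the path-plus-class-number format from the question).
for c in 1 2 3; do
  for i in $(seq 1 10); do
    printf 'train/c/class%d/%08d.jpg %d\n' "$c" "$i" "$c"
  done
done > sample.txt

# Split into 10-line pieces (one per class); each piece is piped
# to shuf, which keeps 2 random lines from it.
split -l 10 --filter='shuf -n 2' sample.txt > sampled.txt

# The result should contain 2 lines per class, 6 in total.
wc -l < sampled.txt
```

Because the input is grouped by class and each piece holds exactly one class, every class ends up with the same number of sampled lines.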