Shell – Print random lines respecting the order of the source file

awk random sed shell sort

I have a large text file (~500K lines) of short sentences, each a few words long. Most lines also contain some XML markup. The file was sorted before the markup was added, so the markup changes the alphabetical order, but that is intended.

My question is: How can I print random lines respecting the order of the source file?

I know I could just use the shuf command and sort the result. The problem is that the markup will mess up the sort.

I could also write a Python script that loads the text file into a list, generates some random numbers, sorts them and uses them as indices to pull out the lines. If possible, I would prefer standard *nix command-line tools.

Sample data:

<CITY>anaconda</CITY> city is in <STATE>montana</STATE>
let's go to <CITY>rome</CITY>
please find <CITY>berlin</CITY>
where is <CITY>cairo</CITY> in <COUNTRY>egypt</COUNTRY>

For example, it would be great if I could pull out lines 2 and 3. Lines 1, 3 and 4 are also good. Lines 3, 1 and 4, in that order, are not good.

Best Answer

Use this:

nl -ba file | shuf -n2 | sort -n | cut -f2-
  • nl to number the lines (-ba numbers every line; by default nl skips blank lines),
  • shuf to shuffle and limit the output to 2 lines (-n),
  • sort to rebuild the original order,
  • and cut to remove the line numbers added by nl.

This prints 2 random lines of your file, in the order in which they appear in the file. For a different number of lines, use shuf -n X, where X is the number of lines you want.
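
If you need this more than once, the pipeline can be wrapped in a small shell function. The following is only a sketch: the function name sample_in_order and its argument order are made up for illustration, and it assumes GNU coreutils (shuf is not part of POSIX) and a file whose lines nl numbers normally.

```shell
# sample_in_order N FILE
# Print N random lines of FILE, keeping the order they have in FILE.
# The function name and argument order are hypothetical.
sample_in_order() {
    nl -ba "$2" |      # number every line, including blank ones
        shuf -n "$1" |  # keep N numbered lines chosen at random
        sort -n |       # restore source order using the line numbers
        cut -f2-        # drop the number column (nl separates it with a tab)
}
```

For example, sample_in_order 2 file behaves like the one-liner above. Because the sort is done on the line numbers and not on the text, the XML markup has no effect on the resulting order.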