I have a large text file (~500K lines) of short sentences, each a couple of words long. Most of the lines also contain some XML markup. The file was sorted before the markup was added, so with the markup in place the lines are no longer in strict alphabetical order; this is intended.
My question is: How can I print random lines respecting the order of the source file?
I know I could just use the shuf command and sort the result. The problem is that the markup will mess up the sort.
I could also write a Python script which loads the text file into a list, generates some random numbers, sorts them and uses them as indices to pull out the lines. If possible, I would prefer standard *nix command-line tools.
Sample data:
<CITY>anaconda</CITY> city is in <STATE>montana</STATE>
let's go to <CITY>rome</CITY>
please find <CITY>berlin</CITY>
where is <CITY>cairo</CITY> in <COUNTRY>egypt</COUNTRY>
For example, it would be great if I could pull out lines 2 and 3. Lines 1, 3 and 4 are also good. Getting lines 3, 1 and 4, in that order, is not good.
Best Answer
Use this:
nl to number the lines,
shuf to shuffle and limit the output to 2 lines (-n),
sort to rebuild the original order,
cut to remove the numeration of nl.
It will print 2 lines of your file in the original order of the file. Use shuf -n X, where X can be any number.