Ubuntu – Split txt file in half based on pattern

bashsplittext processing

I have a text file which looks like this:

train/a/abbey/00000001.jpg 0
...
train/a/abbey/00000999.jpg 0
train/a/abbey/00001000.jpg 0
train/a/airport_terminal/00000001.jpg 1
train/a/airport_terminal/00000002.jpg 1
...
train/c/corn_field/00000354.jpg 40
train/c/corn_field/00000355.jpg 40
train/c/corn_field/00000356.jpg 40
...
train/y/yard/00000998.jpg 99
train/y/yard/00000999.jpg 99
train/y/yard/00001000.jpg 99

The last number in each line is the category. I have 100 categories (0 to 99) and each one contains 1,000 lines (so 100 * 1,000 = 100,000 lines in total).

I'd like to split this file into two random halves i.e. one half contains 50 random categories and the other half contains the other 50 categories.

Best Answer

This one does it that way, it shuffles each chapter and takes "lineswanted" lines from the result to finaly store it in the both half files:

#!/bin/bash

lineswanted=300
infile="full"
half1="half1"
half2="half2"

# Build chapterlist 0 1 2 3 ....
chapterlist=""
for (( i=0 ; i<100; i=i+1 )) ; do
  chapterlist="$chapterlist $i"
done

# shuffle chapterlist
randomchapterlist="`shuf -e $chapterlist`"

rm -f "$half1" "$half2"

i=0
for chapter in $randomchapterlist ; do
  if [ $i -lt 50 ] ; then
    egrep ".*\ $chapter\$" "$infile" | shuf | head -n $lineswanted >> "$half1"
  else
    egrep ".*\ $chapter\$" "$infile" | shuf | head -n $lineswanted >> "$half2"
  fi
  i=$(( i+1 ));
done