Linux – split a file by a line prefix

My data looks like this:

60  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
61  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
64  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

I want to split it into separate files by line prefix, like this:

file 60 contains all lines prefixed with "60"
file 61 contains all lines prefixed with "61"
...

The best idea I have come up with so far is to grep for all the line prefixes, then loop over them and grep each one out into a separate file. But it's a fairly large file, so that might take a really long time. Is there a better way than looping and grepping, perhaps some way of grouping with grep? I know there is a way to cut the file up if there were markers between each section, but I'm not entirely sure that's the best way either.

Best Answer

If the input file is called data, one solution is:

awk '{print>$1}' data

In awk, the first field (column) is called $1. The above loops through each line of input (awk does this implicitly) and writes that line to a file whose name is the first field.

In more detail:

The command is placed in braces. Since there is no qualifier in front of the braces, the command will be run on every input line.
The command print, with no argument, will print the whole input line.
The symbol > indicates redirection of the output to a file.
The file name is specified as $1 which, again, refers to whatever text was in the first field of the input line.

Thus, this command will create files named 60, 61, etc. which will contain the corresponding lines from the input file.
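
As a quick check, the following sketch runs the command on a small made-up sample in a scratch directory. The scratch directory, the sample lines, and the variable name workdir are assumptions for illustration only; they are not part of the original data.

```shell
# Work in a throwaway directory so the output files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny, made-up sample in the same shape as the question's data.
printf '%s\n' '60  aaa' '61  bbb' '62  ccc' '62  ddd' > data

# Write each line to a file named after its first field.
awk '{print > $1}' data

# Show one of the results: both lines that start with 62.
cat 62
```

After this runs, the scratch directory holds data plus one file per prefix (60, 61 and 62).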

Handling very large datasets

By default, awk keeps all the file handles open until the whole command finishes. Consequently, when the data contains a very large number of distinct prefixes, it is possible to exceed the system limit on the number of open files. The simplest solution is to use append and close each file after writing:

awk '{print>>$1; close($1)}' data

Because this uses >> (append), this will add to existing data files rather than overwrite them. If that isn't what you want, delete them before running this command.
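
If closing the file after every single line turns out to be slow, one possible variation is to close a file only when the prefix changes. This is a sketch rather than part of the answer above: the variable name prev is arbitrary, the sample data is made up, and the approach only pays off when lines with the same prefix are mostly grouped together, as they are in the question's sample.

```shell
# Scratch directory and made-up sample; prefix 60 deliberately reappears.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '60  aaa' '62  bbb' '62  ccc' '60  ddd' > data

# Close the previous output file only when the first field changes.
# Using >> means a prefix that reappears later is appended to, not truncated.
awk '
    NF == 0 { next }                 # skip blank lines: they have no prefix
    $1 != prev {                     # prefix changed since the last line
        if (prev != "") close(prev)  # release the previous file handle
        prev = $1
    }
    { print >> $1 }
' data

# Both lines with prefix 60 end up in the same file.
cat 60
```

As with the one-liner above, >> appends, so stale output files from an earlier run should be removed first.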

Related Solutions

Bash – How to remove lines from large text file using bash

sed --in-place "$filter" "$file"

Using grep to remove lines from a file which contain a string from another file

You can do this using grep's -f option (that's lower-case -f, not -F):

% echo -e 'Dog\nFish\nCat\nShoes' > ./file1.txt 
% echo -e '1,shoes,red\n2,apple,black\n3,fog,blue' > ./file2.csv 

# Grab all lines from the CSV that match a pattern from file1:
% grep -if ./file1.txt ./file2.csv
1,shoes,red

# Grab all lines from the CSV that DON'T match a pattern from file1:
% grep -vif ./file1.txt ./file2.csv
2,apple,black
3,fog,blue

Detailed explanation:

grep — self-explanatory
-v — means 'return lines not matching the input pattern'
-i — means 'use case-insensitive matching' (since your first file had capital letters and the CSV didn't)
-f — means 'interpret each line in the specified file (file1.txt) as a pattern to use for matching'

Depending on the results you want and the contents of your files, you may also want to look into the -F and -w options.
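
To illustrate what those two options change, here is a small sketch using made-up files (the names patterns.txt and rows.csv and their contents are assumptions). -F treats each pattern as a literal string rather than a regular expression, and -w requires the match to be a whole word.

```shell
# Scratch directory with a made-up pattern file and a made-up CSV.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'cat' 'a.c' > patterns.txt
printf '%s\n' '1,cat,red' '2,category,blue' '3,abc,green' '4,a.c,black' > rows.csv

# Plain -f: patterns are regular expressions and may match inside words,
# so "cat" also matches "category" and "a.c" also matches "abc".
grep -f patterns.txt rows.csv

# -F -w: patterns are literal strings that must match whole words,
# so only rows 1 and 4 are selected.
grep -Fwf patterns.txt rows.csv
```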

If you need to edit the file in place, I think you can do this with sed's -f option, but sed interprets each line of the file as a command rather than as a simple pattern like grep does.
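
Another common workaround, sketched here on the assumption that a temporary file is acceptable, is to send grep's output to a temporary file and then move it over the original. The file names and contents are the ones from the example above; the variable names workdir, tmp and status are arbitrary.

```shell
# Recreate the example files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'Dog' 'Fish' 'Cat' 'Shoes' > file1.txt
printf '%s\n' '1,shoes,red' '2,apple,black' '3,fog,blue' > file2.csv

# Keep only the lines that match no pattern from file1.txt.
tmp=$(mktemp)
grep -vif file1.txt file2.csv > "$tmp"
status=$?

# grep exits 0 if lines were selected, 1 if none were, and 2 or more on error;
# replace the original only when grep itself did not fail.
if [ "$status" -le 1 ]; then
    mv "$tmp" file2.csv
else
    rm -f "$tmp"
fi

cat file2.csv
```

The file is only replaced after grep has finished, so a failed run leaves the original untouched.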
