Linux – split a file by a line prefix

bashcommand linegreplinux

My data looks like this:

60  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
61  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
62  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
63  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
64  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

I want to split it out into separate files by the line prefix.. like this:

file 60 contains all lines prefixed with "60"
file 61 contains all lines prefixed with "61"
...

The best idea I came up with so far was to grep for all the line prefixes, then loop through that and grep each one of them out into a separate file, but it's a fairly large file, so that might take a really long time. Perhaps there is a better way than looping and grepping? Some way of grep grouping? I know there is a way to cut the file up if there were markers between each section like — but I'm not entirely sure that's the best way either.

Best Answer

If the input file is called data, one solution is:

awk '{print>$1}' data

In awk, the first field (column) is called $1. The above loops through each line of input (awk does this implicitly) and writes that line to a file whose name is the first field.

In more detail:

  • The command is placed in braces. Since there is no qualifier in front of the braces, the command will be run on every input line.

  • The command print, with no argument, will print the whole input line.

  • The symbol > indicates redirection of the output to a file

  • The file name is specified as $1 which, again, refers to whatever text was in the first field of the input line.

Thus, this command will create files named 60, 61, etc. which will contain the corresponding lines from the input file.

Handling very large datasets

By default, awk keeps all the files handles open until the whole command finishes. Consequently, with very large datasets, it is possible to exceed the system limits on number of open files. The simplest solution is to use append and close each file after writing:

awk '{print>>$1; close($1)}' data

Because this uses >> (append), this will add to existing data files rather than overwrite them. If that isn't what you want, delete them before running this command.

Related Question