Using awk
should be efficient enough: it provides built-in associative arrays, where the key lookup time is at worst logarithmically proportional to the number of keys (of your lookup table, which is relatively small in your example).
For your input this would be:
42M * log2(1.5M) -> 42M * 20 key comparisons
(where M means 10^6)
In case your awk implements its arrays as hash tables (most do), every key lookup costs only a constant amount of time on average.
An example of an efficient awk based solution (using the default field separator):
$ awk 'ARGIND == 1 { a[$1] = 1; next } a[$1] { print $0 }' keys.dat largefile.dat
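Note that ARGIND is a GNU awk extension. A portable sketch of the same lookup-table idea (the sample data here is made up for illustration) uses FNR == NR to detect the first file:

```shell
# Hypothetical sample data: keys.dat holds one key per line,
# largefile.dat holds records whose first field is the key.
printf 'k1\nk3\n' > keys.dat
printf 'k1 aaa\nk2 bbb\nk3 ccc\n' > largefile.dat

# FNR == NR is true only while reading the first file, so the
# array is filled there; afterwards, print lines whose $1 is a key.
awk 'FNR == NR { a[$1] = 1; next } a[$1]' keys.dat largefile.dat
# prints:
# k1 aaa
# k3 ccc
```

This works in any POSIX awk, not just gawk.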
Since both inputs are sorted, you could write a script that processes them in a single pass, with a runtime scaling linearly with the combined input size. But it would take more time to program.
Or you could use join, which expects sorted files as input. The restriction is that your key needs to be lexically sorted, and perhaps you have to tweak the output format. For example:
$ join -j1 keys.dat largefile.dat
Use -t to configure the field separator and -o to adjust the output format.
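As a sketch of those options (with made-up tab-separated sample data; the filenames just mirror the ones above), one might join on the first field and pick specific output columns:

```shell
# Hypothetical sorted, tab-separated inputs.
printf 'k1\tred\nk2\tblue\n' > keys.dat
printf 'k1\t100\nk2\t200\nk3\t300\n' > largefile.dat

# -t sets the tab separator (printf is used to produce a literal tab),
# -j 1 joins on field 1 of both files, and -o selects file.field pairs
# for the output: the key, then field 2 of each file.
join -t "$(printf '\t')" -j 1 -o 1.1,1.2,2.2 keys.dat largefile.dat
```

Only keys present in both files appear in the result, so k3 is dropped here.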
This should run in time linear to the input size.
This will give you all lines whose 9th column matches MEF2:
awk -F"\t" '$9~/MEF2/' file > output
Assuming your file is always tab-delimited, this will work and you can rest safely. That's as close to a zero margin of error as you'll ever get.
If, however, you have tried importing into something like R (presumably using read.table("file", sep="\t")) and that didn't work, you might have some lines with different numbers of fields (see the end for how to check that). If so, and assuming you are always interested in the last field, you can use $(NF) in awk to print the last field, no matter how many fields there are:
awk -F"\t" '$(NF)~/MEF2/' file > output
If you still feel the need to check, you can simply extract all lines that match MEF2, irrespective of where the match is, then compare the results:
grep MEF2 file > output2
Once you have that, you can use wc -l to check whether the two files have the same number of lines. If they don't, find where they differ by running:
grep -vFf output output2
That command will print any lines in output2 that are not present in output. If there are any, they most probably have MEF2 somewhere in the line but not in the 9th field. If the match is in the 9th field, then your file is not tab-separated and there's something wrong with your data.
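Put together, the whole cross-check might look like this (a sketch with made-up data; file, output and output2 are the names used above):

```shell
# Hypothetical sample: line 2 has MEF2 in the 1st field, not the 9th.
printf 'x\tx\tx\tx\tx\tx\tx\tx\tMEF2_A\nMEF2_B\tx\tx\tx\tx\tx\tx\tx\ty\n' > file

awk -F"\t" '$9~/MEF2/' file > output   # matches in field 9 only
grep MEF2 file > output2               # matches anywhere in the line

wc -l output output2                   # counts differ: 1 vs 2 lines
grep -vFf output output2               # shows the stray MEF2_B line
```

Here the counts differ, and the last command pinpoints the line where MEF2 appears outside the 9th field.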
The awk above is probably the simplest solution, but here are a few others that do the same thing:
Perl
perl -F"\t" -lane '$F[8]=~/MEF2/ && print' file
sed
(this one might match wrong lines if you have more than 9 fields)
sed -n '/\t.*\t.*\t.*\t.*\t.*\t.*\t.*\t.*MEF2.*/p' file
grep
grep -P '^.+?\t.*\t.*\t.*\t.*\t.*\t.*\t.*\t.*MEF2.*' file
If these don't all produce the same output, you know there's an issue with your file. One more check is to make sure that all your lines have exactly 9 fields:
awk -F"\t" 'NF!=9' file
The above will print all lines that don't have exactly 9 tab-separated fields. If there is output, the lines it prints are problematic.
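To get an overview rather than just the offending lines, you can also tally how many lines have each field count (a small diagnostic sketch; the sample data is made up):

```shell
# Hypothetical tab-separated file where one line is missing a field.
printf 'a\tb\tc\na\tb\na\tb\tc\n' > file

# Histogram of field counts: a clean file reports a single NF value.
# Here it shows one line with 2 fields and two lines with 3.
awk -F"\t" '{ print NF }' file | sort | uniq -c
```

A single row in the histogram means every line has the same number of fields; multiple rows tell you exactly how widespread the problem is.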