Grep -v: How to exclude only the first (or last) N lines that match

grep, text processing

Sometimes there are a few really annoying lines in otherwise tabular data like

column name | other column name
-------------------------------

I generally prefer to remove garbage lines like these with grep -v and a reasonably unique string. The problem with that approach is that if the string also happens to appear in the data, grep -v silently drops those rows as well.

Is there a way to limit the number of lines that grep -v can remove (say to 1)? For bonus points, is there a way to count the number of lines from the end without resorting to <some command> | tac | grep -v <some stuff> | tac ?

Best Answer

sed provides a simpler way:

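... | sed '/some stuff/ {$d; N; s/^.*\n//; :p; N; $q; bp}' | ...

This deletes only the first matching line; after that, the loop at :p reads the rest of the input and prints it unchanged. The $d covers a match on the very last line, where N would have no next line to read. The one-liner syntax (a label followed directly by }) assumes GNU sed.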

To delete more than one occurrence:

sed '1 {h; s/.*/iiii/; x}; /some stuff/ {x; s/^i//; x; td; b; :d; d}'

where the number of i characters is the number of leading occurrences to delete (one or more, not zero).

Multi-line Explanation

sed '1 {
    # Save the first line in the hold space, replace the pattern space
    # with the counter (one `i` per line to delete), then swap them back
    h
    s/^.*$/iiii/
    x
}

# For every line that matches the regexp
/some stuff/ {
    # Remove one `i` from the counter in the hold space
    x
    s/i//
    x
    # If the substitution succeeded, an `i` was left: jump to `:d` and delete the line
    td
    # Otherwise the counter is used up: end the cycle, which prints the line
    b
    :d
    d
}'
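
The counter idea explained above (use up one unit per match, delete only while units remain) can also be sketched with an integer counter in awk. This is a minimal sketch, not part of the original answer: the function name drop_first and its argument order are hypothetical, and the pattern is treated as an awk regular expression passed with -v, so backslashes in it are subject to awk's escape processing.

```shell
# drop_first N PATTERN: hypothetical helper, not part of the original answer.
# Reads stdin, drops the first N lines matching the awk regex PATTERN,
# and prints every other line unchanged.
drop_first() {
    awk -v n="$1" -v pat="$2" '
        ($0 ~ pat) && (n > 0) { n--; next }   # deletions left: skip this match
        { print }                             # everything else passes through
    '
}

# Example: only the first separator line is removed.
printf 'name | other\n---\nrow 1\n---\nrow 2\n' | drop_first 1 '^---'
```

Unlike the sed version, a count of zero needs no special handling here: the condition n > 0 is simply never true.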

In addition

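This variant is probably faster on long inputs: once the counter is used up, the next matching line sends sed into a loop that reads all remaining lines and prints them in one go (at the cost of holding them all in the pattern space), instead of testing each of them against the regexp.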

sed '1 {h; s/.*/ii/; x}; /a/ {x; s/i//; x; td; :print_all; N; $q; bprint_all; :d; d}'

As a result

You can put this function into your .bashrc (or the startup file of whatever shell you use):

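dtrash() {
    if [ $# -eq 0 ]
    then
        # No arguments: pass the input through unchanged
        cat
    elif [ $# -eq 1 ]
    then
        # One argument (pattern): delete only the first matching line
        sed "/$1/ {\$d; N; s/^.*\n//; :p; N; \$q; bp}"
    else
        # Two arguments (count, pattern): build a counter with one `i` per line to delete
        count=""
        for i in $(seq "$1")
        do
            count="${count}i"
        done
        sed "1 {h; s/.*/$count/; x}; /$2/ {x; s/i//; x; td; :print_all; N; \$q; bprint_all; :d; d}"
    fi
}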

And use it this way:

# Remove the first occurrence
cat file | dtrash 'stuff'
# Remove the first four occurrences
cat file | dtrash 4 'stuff'
# Don't modify anything
cat file | dtrash
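
The bonus part of the question (counting matches from the end without tac) is not covered by the sed commands above. One possible approach, sketched here under the assumption that the input is a regular file that can be read twice rather than a pipe, is a two-pass awk script: the first pass counts the matching lines, the second skips the last N of them. The helper name drop_last and its argument order are hypothetical.

```shell
# drop_last N PATTERN FILE: hypothetical helper, not part of the original answer.
# FILE is read twice, so it must be a regular file, not a pipe.
drop_last() {
    awk -v n="$1" -v pat="$2" '
        FNR == 1 { pass++ }                            # each read of the file starts a new pass
        pass == 1 { if ($0 ~ pat) total++; next }      # pass 1: only count the matches
        ($0 ~ pat) && (++seen > total - n) { next }    # pass 2: skip the last n matches
        { print }                                      # pass 2: print everything else
    ' "$3" "$3"
}

# Example: only the trailing separator line is removed.
tmp=$(mktemp)
printf 'a\n---\nb\n---\n' > "$tmp"
drop_last 1 '^---' "$tmp"
rm -f "$tmp"
```

If N is larger than the number of matching lines, every match is dropped.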

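Related Solutions

Python – Selecting lines in a file that do not contain the value in the other file

Code: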

# read the VINs into a set to allow fast lookup
# (the 'U' mode flag was removed in Python 3.11; the default text mode
# already handles universal newlines)
with open('file3') as f:
    vins = {vin.strip() for vin in f}

# go through the data file one line at a time
with open('file2') as f:
    for line in f:

        # get the VIN: the 9th comma-separated field
        # (a plain split assumes no quoted field contains a comma)
        vin = line.split(',')[8]

        # if the VIN is not in our set, print out the line
        if vin not in vins:
            print(line.strip())

Results:

123,email@example.com,JOE,BLOGGS,123456789,12345-123,"Place Name",12345,1C4NJPBB4DD122174,2014-01-20
123,email@example.com,JOE,BLOGGS,123456789,12345-123,"Place Name",12345,1GMDV33179D147281,2014-01-20
123,email@example.com,JOE,BLOGGS,123456789,12345-123,"Place Name",12345,1FUYDCYB7WP879651,2014-01-20
123,email@example.com,JOE,BLOGGS,123456789,12345-123,"Place Name",12345,5TDBT48A72S003496,2014-01-20
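
For comparison with the shell tools used in the rest of this page, the same filter can be sketched in awk. This is only a sketch: the helper name filter_vins is hypothetical, and unlike the Python version it does not strip surrounding whitespace from the VIN list, so it assumes one clean VIN per line. It shares the assumption that no quoted field contains a comma.

```shell
# filter_vins VIN_FILE DATA_FILE: hypothetical helper mirroring the Python code.
# Prints the rows of DATA_FILE whose 9th comma-separated field is not
# listed in VIN_FILE.
filter_vins() {
    awk -F, '
        FILENAME == ARGV[1] { vins[$1]; next }   # first file: remember each VIN
        !($9 in vins)                            # second file: print rows with an unknown VIN
    ' "$1" "$2"
}
```

With the file names from the Python code, it would be called as filter_vins file3 file2.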
