Grep – How to Match Pattern Exactly from File and Search Only in First Column

command line, grep, regular expression, shell

I have a bigfile like this:

denovo1 xxx yyyy oggugu ddddd
denovo11 ggg hhhh bbbb gggg
denovo22 hhhh yyyy kkkk iiii
denovo2 yyyyy rrrr fffff jjjj
denovo33 hhh yyy eeeee fffff

then my pattern file is:

denovo1
denovo3
denovo22

I'm trying to use fgrep in order to extract only the lines exactly matching the pattern in my file (so I want denovo1 but not denovo11).
I tried to use -x for the exact match, but then I got an empty file.
I tried:

fgrep -x --file="pattern" bigfile.txt > clusters.blast.uniq

Is there a way to make grep search only in the first column?

Best Answer

You probably want the -w flag. From man grep:

   -w, --word-regexp
          Select  only  those  lines  containing  matches  that form whole
          words.  The test is that the matching substring must  either  be
          at  the  beginning  of  the  line,  or  preceded  by  a non-word
          constituent character.  Similarly, it must be either at the  end
          of  the  line  or  followed by a non-word constituent character.
          Word-constituent  characters  are  letters,  digits,   and   the
          underscore.

i.e.

grep -wFf patfile file
denovo1 xxx yyyy oggugu ddddd
denovo22 hhhh yyyy kkkk iiii
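As a minimal sketch of why -x produced an empty file while -w works, the following reproduces the question's data in a scratch directory (the temporary path and variable names are assumptions made for illustration). -x requires a pattern to equal the whole line, so no line qualifies; -w only requires the match to be bounded by non-word characters.

```shell
# Hypothetical scratch files reproducing the question's data.
tmp=$(mktemp -d)
cat > "$tmp/bigfile.txt" <<'EOF'
denovo1 xxx yyyy oggugu ddddd
denovo11 ggg hhhh bbbb gggg
denovo22 hhhh yyyy kkkk iiii
denovo2 yyyyy rrrr fffff jjjj
denovo33 hhh yyy eeeee fffff
EOF
printf '%s\n' denovo1 denovo3 denovo22 > "$tmp/patfile"

# -x: a pattern must equal the entire line, so nothing is selected here.
x_out=$(grep -xFf "$tmp/patfile" "$tmp/bigfile.txt" || true)

# -w: the match only has to be delimited by non-word characters.
w_out=$(grep -wFf "$tmp/patfile" "$tmp/bigfile.txt" || true)

printf 'with -x: [%s]\n' "$x_out"
printf 'with -w:\n%s\n' "$w_out"
rm -rf "$tmp"
```

Note that -w still matches the word anywhere on the line, not only in the first column.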

To enforce matching only in the first column, you would need to modify the entries in the pattern file to add a line anchor (^). You could also use the \b word anchor (a GNU grep extension) instead of the command-line -w switch, e.g. in patfile:

^denovo1\b
^denovo3\b
^denovo22\b

then

grep -f patfile file
denovo1 xxx yyyy oggugu ddddd
denovo22 hhhh yyyy kkkk iiii
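If editing the pattern file by hand is impractical, the anchors could be generated instead. This is only a sketch: the file names ids and patfile and the sample lines are hypothetical, \b assumes GNU grep, and IDs containing regex metacharacters (a dot, for instance) would need escaping first.

```shell
tmp=$(mktemp -d)
# Hypothetical data: the third line has denovo22 outside the first column.
printf '%s\n' 'denovo1 xxx yyyy' 'denovo11 ggg hhhh' \
    'xxx denovo22 kkkk' 'denovo22 hhhh yyyy' > "$tmp/file"
printf '%s\n' denovo1 denovo3 denovo22 > "$tmp/ids"

# Prefix each ID with ^ and append \b, giving lines such as ^denovo1\b
sed -e 's/^/^/' -e 's/$/\\b/' "$tmp/ids" > "$tmp/patfile"

# No -F here: the pattern file now holds regular expressions.
anchored_out=$(grep -f "$tmp/patfile" "$tmp/file" || true)
printf '%s\n' "$anchored_out"
rm -rf "$tmp"
```

The line with denovo22 in the second column is not selected, because ^ ties each pattern to the start of the line.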

Note that you must drop the -F switch if the file contains regular expressions instead of simple fixed strings.
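A different tool, not part of the grep answer above, is awk, which compares the first field directly and so needs no anchors or escaping. A rough sketch, with hypothetical file names and sample data:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'denovo1 xxx yyyy' 'denovo11 ggg hhhh' \
    'xxx denovo22 kkkk' 'denovo22 hhhh yyyy' > "$tmp/file"
printf '%s\n' denovo1 denovo3 denovo22 > "$tmp/ids"

# While reading the first file (NR == FNR), store each ID as an array key.
# For the second file, print lines whose first field is one of those keys.
awk_out=$(awk 'NR == FNR { ids[$1]; next } $1 in ids' "$tmp/ids" "$tmp/file")
printf '%s\n' "$awk_out"
rm -rf "$tmp"
```

This compares whole fields as plain strings, so denovo1 cannot match denovo11, and a match in any other column is ignored.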

Related Solutions

Text Processing with Sed and Grep – Return Only the Portion of a Line After a Matching Pattern

The canonical tool for that would be sed.

sed -n -e 's/^.*stalled: //p'

Detailed explanation:

-n means not to print anything by default.
-e is followed by a sed command.
s is the pattern replacement command.
The regular expression ^.*stalled: matches the pattern you're looking for, plus any preceding text (.* meaning any text, with an initial ^ to say that the match begins at the beginning of the line). Note that if stalled: occurs several times on the line, this will match the last occurrence.
The match, i.e. everything on the line up to stalled:, is replaced by the empty string (i.e. deleted).
The final p means to print the transformed line.
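To see the pieces described above working together, here is a small sketch that feeds a few made-up log lines (hypothetical sample data) to the same command. Lines without stalled: are suppressed by -n, and the third line shows the greedy .* consuming everything up to the last occurrence.

```shell
# Hypothetical input lines piped through the sed command from the answer.
sed_out=$(printf '%s\n' \
    'job 7 stalled: disk full' \
    'job 8 finished' \
    'stalled: a stalled: b' |
    sed -n -e 's/^.*stalled: //p')
printf '%s\n' "$sed_out"
```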

If you want to retain the matching portion, use a backreference: \1 in the replacement part designates what is inside a group \(…\) in the pattern. Here, you could write stalled: again in the replacement part; this feature is useful when the pattern you're looking for is more general than a simple string.

sed -n -e 's/^.*\(stalled: \)/\1/p'

Sometimes you'll want to remove the portion of the line after the match. You can include it in the match by including .*$ at the end of the pattern (any text .* followed by the end of the line $). Unless you put that part in a group that you reference in the replacement text, the end of the line will not be in the output.
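The two variations just described can be tried side by side. This is only an illustrative sketch on a made-up input line: the first command keeps the match through a backreference, the second adds .*$ outside the group so the rest of the line is dropped.

```shell
line='job 7 stalled: disk full'   # hypothetical sample input

# Keep the match: text before "stalled: " is deleted, \1 puts the match back.
keep_match=$(printf '%s\n' "$line" | sed -n -e 's/^.*\(stalled: \)/\1/p')

# Drop the tail: .*$ is matched but lies outside the group, so it is removed.
drop_tail=$(printf '%s\n' "$line" | sed -n -e 's/^\(.*stalled: \).*$/\1/p')

printf '[%s]\n' "$keep_match" "$drop_tail"
```

The brackets in the output make the trailing space of the second result visible.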

As a further illustration of groups and backreferences, this command swaps the part before the match and the part after the match.

sed -n -e 's/^\(.*\)\(stalled: \)\(.*\)$/\3\2\1/p'
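Run on a made-up line (the sample text is an assumption), the swap looks like this. The three groups capture "job 7 ", "stalled: " and "disk full", and the replacement emits them in reverse order.

```shell
# \1 = "job 7 ", \2 = "stalled: ", \3 = "disk full"
swapped=$(printf '%s\n' 'job 7 stalled: disk full' |
    sed -n -e 's/^\(.*\)\(stalled: \)\(.*\)$/\3\2\1/p')
printf '[%s]\n' "$swapped"
```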

Bash – Recursive search for a pattern, then for each match print out the specific SEQUENCE: line number, file name, and no file contents

Using grep

Why not just use grep's -r switch to recurse through the filesystem instead of using find? There are 2 additional switches I'd use too, instead of the -l switch.

$ grep -rHn PATTERN <DIR> | cut -d":" -f1-2

Example #1

$ grep -rHn PATH ~/.bashrc | cut -d":" -f1-2
/home/saml/.bashrc:25

Details

-r - recursively search through files + directories
-H - prints the name of the file if it matches (less restrictive than -l) i.e. it works with grep's other switches
-n - display the line number of the match

Example #2

$ grep -rHn PATH ~/.bash* | cut -d":" -f1-2
/home/saml/.bash_profile:10
/home/saml/.bash_profile:12
/home/saml/.bash_profile_askapache:99
/home/saml/.bash_profile_askapache:101
/home/saml/.bash_profile_askapache:118
/home/saml/.bash_profile_askapache:166
/home/saml/.bash_profile_askapache:218
/home/saml/.bash_profile_askapache:250
/home/saml/.bash_profile_askapache:314
/home/saml/.bash_profile_askapache:2317
/home/saml/.bash_profile_askapache:2323
/home/saml/.bashrc:25
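For a reproducible variant of the examples above, the following sketch builds a small scratch tree (all file names and contents are hypothetical) and runs the same pipeline on it. The sort is added only because grep -r visits files in directory order, which is not guaranteed to be stable.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/sub"
printf '%s\n' 'alias ll="ls -l"' 'export PATH=$HOME/bin:$PATH' > "$tmp/a.rc"
printf '%s\n' '# nothing here' > "$tmp/b.rc"
printf '%s\n' 'PATH=/opt/bin:$PATH' 'echo done' 'echo $PATH' > "$tmp/sub/c.rc"

# cd first so the printed names are short relative paths.
# cut keeps only the first two colon-separated fields: file name and line number.
grep_out=$(cd "$tmp" && grep -rHn PATH . | cut -d":" -f1-2 | LC_ALL=C sort)
printf '%s\n' "$grep_out"
rm -rf "$tmp"
```

Because cut splits on colons, a file name that itself contains a colon would be truncated by this pipeline.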

Using find

$ find . -type f -exec sh -c 'grep -Hn PATTERN "$@" | cut -d":" -f1-2' sh {} +

Example

$ find ~/.bash* -type f -exec sh -c 'grep -Hn PATH "$@" | cut -d":" -f1-2' sh {} +
/home/saml/.bash_profile:10
/home/saml/.bash_profile:12
/home/saml/.bash_profile_askapache:99
/home/saml/.bash_profile_askapache:101
/home/saml/.bash_profile_askapache:118
/home/saml/.bash_profile_askapache:166
/home/saml/.bash_profile_askapache:218
/home/saml/.bash_profile_askapache:250
/home/saml/.bash_profile_askapache:314
/home/saml/.bash_profile_askapache:2317
/home/saml/.bash_profile_askapache:2323
/home/saml/.bashrc:25

If you truly want to use find, the commands above show one way to exec grep on the files that it finds.
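One detail worth checking with find -exec sh -c '…' {} +: the first argument after the command string becomes $0 rather than part of "$@", so a placeholder word (sh here) is usually placed before {}. The sketch below uses a hypothetical scratch directory with two made-up files to show every file found being searched.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'PATH=/opt/bin' > "$tmp/one.txt"
printf '%s\n' 'nothing' 'PATH=/usr/bin' > "$tmp/two.txt"

# The word "sh" after the quoted script fills $0, so all found
# file names end up in "$@" and none is silently skipped.
find_out=$(cd "$tmp" && find . -type f -exec sh -c \
    'grep -Hn PATH "$@" | cut -d":" -f1-2' sh {} + | LC_ALL=C sort)
printf '%s\n' "$find_out"
rm -rf "$tmp"
```

Without the placeholder, the first file name passed by find would take the $0 slot and never reach grep.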