Ubuntu – How to grep for groups of n digits, but no more than n

command line, grep, text processing

I'm learning Linux, and I have a challenge that I seem to fail to solve on my own. Here it is:

grep a line from a file which contains 4 numbers in a row but not more than 4.

I'm not sure how to approach this. I can search for specific numbers but not their amount in a string.

Best Answer

There are two ways to interpret this question; I'll address both cases. You might want to display lines:

  1. that contain a sequence of four digits that is itself not part of any longer sequence of digits, or
  2. that contain a four-digit sequence but no longer sequence of digits (not even separately).

For example, (1) would display 1234a56789, but (2) wouldn't.


If you want to display all lines that contain a sequence of four digits that is itself not part of any longer sequence of digits, one way is:

grep -P '(?<!\d)\d{4}(?!\d)' file

This uses Perl regular expressions, which Ubuntu's grep (GNU grep) supports via -P. It won't match text like 12345, nor will it match the 1234 or 2345 that are part of it. But it will match the 1234 in 1234a56789.

In Perl regular expressions:

  • \d means any digit (it's a short way to say [0-9] or [[:digit:]]).
  • x{4} matches x 4 times. ({ } syntax isn't specific to Perl regular expressions; it's in extended regular expressions via grep -E as well.) So \d{4} is the same as \d\d\d\d.
  • (?<!\d) is a zero-width negative look-behind assertion. It means "unless preceded by \d."
  • (?!\d) is a zero-width negative look-ahead assertion. It means "unless followed by \d."

(?<!\d) and (?!\d) don't match text outside the sequence of four digits; instead, they will (when used together) prevent a sequence of four digits from itself being matched if it is part of a longer sequence of digits.

Using just the look-behind or just the look-ahead is insufficient. With only the look-behind, the leftmost four digits of a longer sequence would still be matched; with only the look-ahead, the rightmost four digits would be.
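A quick way to see why both assertions are needed is to run each variant with -o against a single five-digit run. This is only a sketch: the input 12345 and the variable names are chosen for illustration, and it assumes a grep built with -P support (as GNU grep on Ubuntu is).

```shell
# Sample input: one run of five digits (chosen for illustration).
input='12345'

# Look-behind only: the leftmost four digits (1234) still match.
only_behind=$(printf '%s\n' "$input" | grep -oP '(?<!\d)\d{4}' || true)

# Look-ahead only: the rightmost four digits (2345) still match.
only_ahead=$(printf '%s\n' "$input" | grep -oP '\d{4}(?!\d)' || true)

# Both assertions together: nothing in 12345 matches, so this is empty.
# grep exits with status 1 on no match, hence the "|| true".
both=$(printf '%s\n' "$input" | grep -oP '(?<!\d)\d{4}(?!\d)' || true)

printf 'look-behind only: %s\n' "$only_behind"
printf 'look-ahead only:  %s\n' "$only_ahead"
printf 'both:             [%s]\n' "$both"
```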

One benefit of using look-behind and look-ahead assertions is that your pattern matches only the four-digit sequences themselves, and not the surrounding text. This is helpful when using color highlighting (with the --color option).

ek@Io:~$ grep -P '(?<!\d)\d{4}(?!\d)' <<< 12345abc789d0123e4
12345abc789d0123e4

By default in Ubuntu, each user has alias grep='grep --color=auto' in their ~/.bashrc file. So you get color highlighting automatically when you run a simple command starting with grep (this is when aliases are expanded) and standard output is a terminal (this is what --color=auto checks for). Matches are typically highlighted in a shade of red (close to vermilion). In the output above, the only match, and so the only highlighted text, is 0123.

[Screenshot: that grep command, with 12345abc789d0123e4 as output and the 0123 highlighted in red.]

And you can even make grep print only matching text, and not the whole line, with -o:

ek@Io:~$ grep -oP '(?<!\d)\d{4}(?!\d)' <<< 12345abc789d0123e4
0123

Alternative Way, Without Look-Behind and Look-Ahead Assertions

However, if you:

  1. need a command that will also run on systems where grep doesn't support -P or otherwise don't want to use a Perl regular expression, and
  2. don't need to match the four digits specifically--which is usually the case if your goal is simply to display lines containing matches, and
  3. are okay with a solution that is a bit less elegant

...then you can achieve this with an extended regular expression instead:

grep -E '(^|[^0-9])[0-9]{4}($|[^0-9])' file

This matches four digits and the non-digit character--or beginning or end of the line--surrounding them. Specifically:

  • [0-9] matches any digit (like [[:digit:]], or \d in Perl regular expressions) and {4} means "four times." So [0-9]{4} matches a four-digit sequence.
  • [^0-9] matches characters not in the range of 0 through 9. It is equivalent to [^[:digit:]] (or \D, in Perl regular expressions).
  • ^, when it doesn't appear in [ ] brackets, matches the beginning of a line. Similarly, $ matches the end of a line.
  • | means or and parentheses are for grouping (as in algebra). So (^|[^0-9]) matches the beginning of the line or a non-digit character, while ($|[^0-9]) matches the end of the line or a non-digit character.

So matches occur only in lines containing a four-digit sequence ([0-9]{4}) that is simultaneously:

  • at the beginning of the line or preceded by a non-digit ((^|[^0-9])), and
  • at the end of the line or followed by a non-digit (($|[^0-9])).
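To make those conditions concrete, here is a small sketch that runs the extended regular expression over a few made-up lines. The sample text and variable names are hypothetical. The second command uses -o to show that, unlike the look-around version, this pattern's match includes the neighbouring non-digit characters.

```shell
# Hypothetical sample lines, one per case.
sample='1234
abc1234def
12345
1234a56789
12a34'

pattern='(^|[^0-9])[0-9]{4}($|[^0-9])'

# Lines that contain a standalone four-digit run are printed:
# 1234, abc1234def and 1234a56789 (but not 12345 or 12a34).
selected=$(printf '%s\n' "$sample" | grep -E "$pattern")
printf '%s\n' "$selected"

# With -o, the surrounding non-digits are part of the match: x1234y
matched=$(printf 'x1234y\n' | grep -oE "$pattern")
printf '%s\n' "$matched"
```

For simply selecting lines this makes no difference, which is why the approach is fine when you don't need the four digits on their own.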

If, on the other hand, you want to display all lines that contain a four-digit sequence, but do not contain any sequence of more than four digits (even one that is separate from another sequence of only four digits), then conceptually your goal is to find lines that match one pattern but not another.

Therefore, even if you know how to do it with a single pattern, I'd suggest using something like matt's second suggestion: grepping for the two patterns separately.

You don't strongly benefit from any of the advanced features of Perl regular expressions when doing that, so you might prefer not to use them. But in keeping with the above style, here's a shortening of matt's solution using \d (and braces) in place of [0-9]:

grep -P '\d{4}' file | grep -Pv '\d{5}'

Since it uses [0-9], matt's way is more portable--it will work on systems where grep doesn't support Perl regular expressions. If you use [0-9] (or [[:digit:]]) instead of \d, but continue to use { }, you get the portability of matt's way a bit more concisely:

grep -E '[0-9]{4}' file | grep -Ev '[0-9]{5}'
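As a rough illustration of the two-step filter, the sketch below feeds a few invented lines through the pipeline instead of reading from file. The sample text and the variable name kept are assumptions for the example only.

```shell
# Hypothetical sample lines.
sample='1234
x0001 y9999
1234a56789
12345
ab12cd'

# Step 1 keeps lines with at least four digits in a row;
# step 2 (-v) then drops any of those that also have five in a row.
kept=$(printf '%s\n' "$sample" | grep -E '[0-9]{4}' | grep -Ev '[0-9]{5}')

# Prints 1234 and "x0001 y9999"; 1234a56789 is dropped because of 56789.
printf '%s\n' "$kept"
```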

Alternative Way, With a Single Pattern

If you really do prefer a grep command that

  1. uses a single regular expression (not two greps separated by a pipe, as above)
  2. to display lines that contain at least one sequence of four digits,
  3. but no sequences of five (or more) digits,
  4. and you don't mind matching the whole line, not just the digits (you probably don't mind this)

...then you can use:

grep -Px '(\d{0,4}\D)*\d{4}(\D\d{0,4})*' file

The -x flag makes grep display only lines where the entire line matches (rather than any line containing a match).
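The effect of -x is easiest to see with a much simpler pattern. In this sketch (the two sample lines are made up), the same pattern is run with and without -x:

```shell
# Two hypothetical lines: one is exactly four digits, one merely contains them.
sample='1234
ab1234'

# Without -x, any line containing a match is printed: both lines.
without_x=$(printf '%s\n' "$sample" | grep -E '[0-9]{4}')

# With -x, the pattern must match the entire line: only 1234.
with_x=$(printf '%s\n' "$sample" | grep -Ex '[0-9]{4}')

printf 'without -x:\n%s\n' "$without_x"
printf 'with -x:\n%s\n' "$with_x"
```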

I've used a Perl regular expression because I think the brevity of \d and \D substantially increases clarity in this case. But if you need something portable to systems where grep doesn't support -P, you can replace them with [0-9] and [^0-9] (or with [[:digit:]] and [^[:digit:]]):

grep -Ex '([0-9]{0,4}[^0-9])*[0-9]{4}([^0-9][0-9]{0,4})*' file

The way these regular expressions work is:

  • In the middle, \d{4} or [0-9]{4} matches one sequence of four digits. We may have more than one of these, but we need to have at least one.

  • On the left, (\d{0,4}\D)* or ([0-9]{0,4}[^0-9])* matches zero or more (*) instances of not more than four digits followed by a non-digit. Zero digits (i.e., nothing) is one possibility for "not more than four digits." This matches (a) the empty string or (b) any string ending in a non-digit and not containing any sequences of more than four digits.

    Since the text immediately to the left of the central \d{4} (or [0-9]{4}) must either be empty or end with a non-digit, this prevents the central \d{4} from matching four digits that have another (fifth) digit just to the left of them.

  • On the right, (\D\d{0,4})* or ([^0-9][0-9]{0,4})* matches zero or more (*) instances of a non-digit followed by not more than four digits (which, like before, could be four, three, two, one, or even none at all). This matches (a) the empty string or (b) any string beginning in a non-digit and not containing any sequences of more than four digits.

    Since the text immediately to the right of the central \d{4} (or [0-9]{4}) must either be empty or start with a non-digit, this prevents the central \d{4} from matching four digits that have another (fifth) digit just to the right of them.

This ensures a four-digit sequence is present somewhere, and that no sequence of five or more digits is present anywhere.
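One way to gain some confidence in the single-pattern version is to check that it selects the same lines as the two-grep pipeline on a handful of cases. The sketch below does that with the portable -E form. The sample lines are invented, and agreement on them is only a spot check, not a proof of equivalence.

```shell
# Hypothetical sample lines.
sample='2024-01-15
id 123456
pin 0042 ok
no digits
99 9999 99999'

# Single whole-line pattern.
single=$(printf '%s\n' "$sample" |
  grep -Ex '([0-9]{0,4}[^0-9])*[0-9]{4}([^0-9][0-9]{0,4})*')

# Two separate greps: has four digits in a row, but not five.
piped=$(printf '%s\n' "$sample" | grep -E '[0-9]{4}' | grep -Ev '[0-9]{5}')

# Both select 2024-01-15 and "pin 0042 ok".
printf '%s\n' "$single"
if [ "$single" = "$piped" ]; then
  echo 'same lines selected'
fi
```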

It is not bad or wrong to do it this way. But perhaps the most important reason to consider this alternative is that it clarifies the benefit of using grep -P '\d{4}' file | grep -Pv '\d{5}' (or similar) instead, as suggested above and in matt's answer.

With that way, it's clear your goal is to select lines that contain one thing but not another. Plus the syntax is simpler (so it may be more quickly understood by many readers/maintainers).
