Regular Expressions: How is group matching useful?

regex

I've decided to learn some regular expression basics. I am using the Regex One lessons online and was stuck at lesson 11 for a while, but I think I've got it now.

This was the task.

"Write a regular expression that matches only the filenames (not including extension) of the PDF files below."

task            text                     capture
capture text    file_a_record_file.pdf   file_a_record_file
capture text    file_yesterday.pdf       file_yesterday
skip text       testfile_fake.pdf.tmp

There is an input field where you type in the pattern to complete the task. After some trial and error, this is what I came up with.

^(file_a_record_file)\.pdf$

This will match the file name file_a_record_file.pdf but only "capture" file_a_record_file. What's the difference between matching and "capturing"? How is this useful, and how is this "group matching"?

Now this does work for the first file, but not for the second. The task says I need a pattern that will match both files and capture each file name, excluding the extension. So this is what I came up with next.

^(file_.*)\.pdf$

Since both file names start with file_, I thought it would be a good idea to match against that, then match any characters that follow, then close the group with a parenthesis (the "group" is what's inside the parentheses, right?), escape the dot with a backslash, and end with the file name extension.

Can this be written in a tighter way? The correct solutions are not given on the website, so I have nothing to check my answers against. It's a pity, because I think this is a good introduction to regular expressions. The examples given for each lesson are sometimes hard to understand.

And again, how is this useful? He mentions something about the command line. I think he means that it can be used to re-use commands or something, but I don't really get what he's saying.

Imagine that we have a command line tool that copies each file in a
directory up to a server only if it doesn't exist there already, and
prints each filename as a result. Now if I want to do another task on
each of those filenames, then I will not only need a regular
expression that will match the filename, but also some way to extract
that information.

Extracting information? What is he talking about? Can someone please tell me how this is useful and give me a real-world example?

Best Answer

In the lesson you linked to, you are asked to write a regex that captures the file names of these two files

file_a_record_file.pdf
file_yesterday.pdf

and skips

testfile_fake.pdf.tmp

The simplest regex to do that is

(.*)\.pdf$

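This means: match any name that ends in .pdf, but capture only the part before the extension. The $ anchors the match to the end of the name, which is why testfile_fake.pdf.tmp is skipped.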
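To make the difference between matching and capturing concrete, here is a minimal sketch in Python (my choice of language, not something the lesson uses). It applies the same pattern to the three file names from the task; the function name match_and_capture is made up for illustration.

```python
import re

# Same pattern as above: capture everything before a trailing ".pdf".
PDF_NAME = re.compile(r"(.*)\.pdf$")

filenames = [
    "file_a_record_file.pdf",
    "file_yesterday.pdf",
    "testfile_fake.pdf.tmp",
]

def match_and_capture(names):
    """Return (whole match, captured group) for every name that matches."""
    results = []
    for name in names:
        m = PDF_NAME.search(name)
        if m:
            # group(0) is everything the pattern matched;
            # group(1) is only what the parentheses captured.
            results.append((m.group(0), m.group(1)))
    return results

for whole, captured in match_and_capture(filenames):
    print(whole, "->", captured)
# file_a_record_file.pdf -> file_a_record_file
# file_yesterday.pdf -> file_yesterday
```

The whole match still includes .pdf; only the captured group drops it. The .tmp file produces no match at all, so it never appears in the output.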

So, why is capturing useful? That depends on the program you are using these regexes with. A capture group lets you save the text it matched in a variable. For example, in Perl, the first captured group is $1, the second $2, etc.:

echo "Hello world" | perl -ne '/(.+) (.+)/; print "$2 $1\n"'

This will print "world Hello" because the first pair of parentheses captured Hello and the second captured world, but we then print $2 $1, so the two captures are swapped.

Other regex implementations let you refer to the captured groups as \1, \2, etc. For example, this GNU sed command also prints "world Hello":

echo "Hello world" | sed 's/\(.*\) \(.*\)/\2 \1/'
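The same \1, \2 syntax shows up in many other tools. As a rough equivalent, assuming Python's standard re module (swap_words is a made-up helper name):

```python
import re

def swap_words(text):
    # In the replacement string, \1 and \2 refer to the first and
    # second captured groups of the pattern.
    return re.sub(r"(.+) (.+)", r"\2 \1", text)

print(swap_words("Hello world"))  # world Hello
```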

So, in general, capturing is useful when you need to refer to the captured text later on. This is known as back-referencing and is briefly explained a little later in the tutorial you are doing.
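As for the command line tool in the passage you quoted, "extracting information" could look something like the sketch below. Everything specific here is an assumption for illustration: the tool's output is faked as a string with one file name per line, and the follow-up task (building a thumbnail name for each PDF) is invented.

```python
import re

# Hypothetical output of the copy tool: one file name per line.
tool_output = """file_a_record_file.pdf
file_yesterday.pdf
testfile_fake.pdf.tmp
"""

PDF_NAME = re.compile(r"^(.*)\.pdf$")

def thumbnail_names(output):
    """Build a (made-up) thumbnail name for every PDF the tool printed."""
    names = []
    for line in output.splitlines():
        m = PDF_NAME.match(line)
        if m:
            # Reuse the captured base name to build a new file name.
            names.append(m.group(1) + "_thumb.png")
    return names

print(thumbnail_names(tool_output))
# ['file_a_record_file_thumb.png', 'file_yesterday_thumb.png']
```

Matching alone would only tell you which lines are PDFs. The capture group is what hands you the base name so you can do something with it.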