Regular Expressions: How is group matching useful?

regex

I've decided to learn some regular expression basics. I am using the Regex One lessons online and I was stuck at lesson 11 for a while, but I think I've got it now.

This was the task.

"Write a regular expression that matches only the filenames (not including extension) of the PDF files below."

task            text                     capture
capture text    file_a_record_file.pdf   file_a_record_file
capture text    file_yesterday.pdf       file_yesterday
skip text       testfile_fake.pdf.tmp

There is an input field where you type in the pattern to complete the task. After some trial and error, this is what I came up with.

^(file_a_record_file)\.pdf$

This will match the file name file_a_record_file.pdf but only "capture" file_a_record_file. What's the difference between matching and "capturing"? And how is this useful? How is this "group matching"?

Now this does work for the first file, but not for the second file. The task says I need to make a pattern that will match and capture the file name of both files, excluding the extension. So this is what I came up with next.

^(file_.*)\.pdf$

Since both file names start with file_, I thought it would be a good idea to match against that, then tell it to match any characters that follow, then close the group with a parenthesis (the "group" is what's inside the parentheses, right?), escape the dot with a backslash, and end with the file name extension.

Can this be written in a tighter way? The correct solutions are not given on the website, so I have nothing to check my answers against. It's a pity, because I think this is a good introduction to regular expressions. The examples given for each lesson are sometimes hard to understand.

And again, how is this useful? He mentions something about the command line; I think he means that it can be used to re-use commands or something… well, I don't really get what he's saying.

Imagine that we have a command line tool that copies each file in a
directory up to a server only if it doesn't exist there already, and
prints each filename as a result. Now if I want to do another task on
each of those filenames, then I will not only need a regular
expression that will match the filename, but also some way to extract
that information.

Extracting information? What is he talking about? Can someone please tell me how this is useful and give me a real-world example?

Best Answer

In the lesson you linked to, you are asked to write a regex that captures the file names of these two files

file_a_record_file.pdf
file_yesterday.pdf

and skips

testfile_fake.pdf.tmp

The simplest regex to do that is

(.*)\.pdf$

This means match everything that ends in .pdf but capture only the file name.
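To see the difference between matching and capturing outside the lesson's input box, here is a minimal Python sketch. Python's re module is my choice for illustration (the lesson itself is language-neutral), and the helper name capture_pdf_name is made up for this example. The whole match is everything the pattern consumed, while group 1 is only the part inside the parentheses.

```python
import re

# The pattern from the answer: capture everything before a trailing ".pdf".
PDF_NAME = re.compile(r"(.*)\.pdf$")

def capture_pdf_name(filename):
    """Return the captured name without extension, or None if there is no match."""
    match = PDF_NAME.search(filename)
    return match.group(1) if match else None

for name in ["file_a_record_file.pdf", "file_yesterday.pdf", "testfile_fake.pdf.tmp"]:
    match = PDF_NAME.search(name)
    if match:
        # group(0) is the whole match, group(1) is only the captured part.
        print(f"{name}: matched {match.group(0)!r}, captured {match.group(1)!r}")
    else:
        print(f"{name}: skipped")
```

The first two names match in full but only the part before .pdf is captured; the third is skipped because it does not end in .pdf.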

So, why is capturing useful? That depends on the program you are using these regexes with. Capturing patterns allows you to save what you have captured as a variable. For example, using Perl, the first captured pattern is $1, the second $2 etc:

echo "Hello world" | perl -ne '/(.+) (.+)/; print "$2 $1\n"'

This will print "world Hello" because the first pair of parentheses captured Hello and the second captured world, but we then print $2 $1, so the two captures come out in reverse order.

Other regex implementations allow you to refer to the captured patterns using \1, \2 etc. For example, GNU sed:

echo "Hello world" | sed 's/\(.*\) \(.*\)/\2 \1/'

So, in general, capturing patterns is useful when you need to refer to these patterns later on. This is known as referencing and is briefly explained a little later in the tutorials you are doing.
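The same swap can be sketched in Python, assuming its re module as a stand-in for Perl or sed. The function name swap_words is hypothetical; \1 and \2 in the replacement string refer back to the two captured groups, just as in the sed example.

```python
import re

def swap_words(text):
    """Swap the two space-separated parts of text using backreferences."""
    # (.+) (.+) captures the text before and after a space;
    # the replacement emits group 2 first, then group 1.
    return re.sub(r"(.+) (.+)", r"\2 \1", text)

# The captured groups are also available directly on the match object.
match = re.search(r"(.+) (.+)", "Hello world")
first, second = match.groups()

print(swap_words("Hello world"))  # prints: world Hello
print(first, second)              # prints: Hello world
```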

Related Solutions

Windows: File copy/move with filename regular expressions

I like using all Powershell commands when I can. After a bit of testing, this is the best I can do.

$source = "C:\test" 
$destination = "C:\test2" 
$filter = [regex] "^[0-9]{6}\.(jpg|gif)"

$bin = Get-ChildItem -Path $source | Where-Object {$_.Name -match $filter} 
foreach ($item in $bin) {Copy-Item -Path $item.FullName -Destination $destination}

The first three lines are just there to make this easier to read; you can define the variables inside the actual commands if you want. The key to this code sample is the "Where-Object" command, which is a filter that accepts regular expression matching. It should be noted that regular expression support is a little weird. I found a PDF reference card here that has the supported characters on the left side.

[EDIT]

As "@Johannes Rössel" mentioned, you can also reduce the last two lines down to a single line.

((Get-ChildItem -Path $source) -match $filter) | Copy-Item -Destination $destination

The main difference is that Johannes's way does object filtering and my way does text filtering. When working with Powershell, it's almost always better to use objects.

[EDIT2]

As @smoknheap mentioned, the above scripts will flatten out the folder structure and put all your files in one folder. I'm not sure if there is a switch that retains folder structure. I tried the -Recurse switch and it doesn't help. The only way I got this to work is to go back to string manipulation and add in folders to my filter.

$bin = Get-ChildItem -Path $source -Recurse | Where-Object {($_.Name -match $filter) -or ($_.PSIsContainer)}
foreach ($item in $bin) {
    Copy-Item -Path $item.FullName -Destination $item.FullName.ToString().Replace($source,$destination).Replace($item.Name,"")
    }

I'm sure that there is a more elegant way to do this, but from my tests it works. It gathers everything and then filters for both name matches and folder objects. I had to use the ToString() method to gain access to the string manipulation.
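For comparison, the same idea (filter file names with a regex, copy the matches, keep the folder layout) can be sketched in Python. This is not a translation of the PowerShell above, only an illustration under my own assumptions: the function name copy_matching is made up, and unlike the script it creates only the folders that contain a matching file rather than copying every folder.

```python
import re
import shutil
from pathlib import Path

# Same pattern as the PowerShell filter: six digits, a dot, then jpg or gif.
NAME_FILTER = re.compile(r"^[0-9]{6}\.(jpg|gif)")

def copy_matching(source, destination):
    """Copy files whose names match NAME_FILTER, keeping their relative folders."""
    source, destination = Path(source), Path(destination)
    copied = []
    for path in sorted(source.rglob("*")):
        if path.is_file() and NAME_FILTER.match(path.name):
            # Rebuild the file's path relative to source underneath destination.
            target = destination / path.relative_to(source)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(target)
    return copied
```

Working with the path relative to the source folder avoids the chained string Replace calls, which is the part the answer itself calls inelegant.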

[EDIT3]

Now, if you want to report the paths to make sure you have everything correct, you can use the "Write-Host" command. Here's code that will give you some hints as to what's going on.

cls
$source = "C:\test"
$destination = "C:\test2"
$filter = [regex] "^[0-9]{6}\.(jpg|gif)"

$bin = Get-ChildItem -Path $source -Recurse | Where-Object {($_.Name -match $filter) -or ($_.PSIsContainer)}
foreach ($item in $bin) {
    # Work out the destination folder once so it can be both reported and used
    $target = $item.FullName.ToString().Replace($source,$destination).Replace($item.Name,"")
    Write-Host "
----
Obj: $item
Path: $($item.FullName)
Destination: $target"
    Copy-Item -Path $item.FullName -Destination $target
    }

This should return the relevant strings. If you get nothing somewhere, you'll know which item is having problems.

Hope this helps

Perl for matching with regular expressions in Terminal

Well, here's a Wikipedia page for matching or replacing with Perl one-liners. I did this in Cygwin:

Perl can behave like grep or like sed.

The /s makes dot match new line.

The -0777 makes it apply the regular expression to the whole thing instead of line by line.

\n can match new line as well.

$ echo -e 'a\nb\nc\nd' | perl -0777 -pe 's/.*c//s'

d

user@comp ~
$ echo -e 'a\nb\nc\nd' | perl -pe 's/.*c//s'
a
b

d
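The contrast between the two runs above can be mimicked in Python, purely as an illustration: re.DOTALL plays the role of /s, and applying the substitution to the whole string versus one line at a time plays the role of -0777 versus the default line-by-line mode. The function names are hypothetical.

```python
import re

text = "a\nb\nc\nd\n"

def strip_whole(text):
    # Like perl -0777 -pe 's/.*c//s': one substitution over the whole input,
    # with DOTALL so "." also matches newlines.
    return re.sub(r".*c", "", text, count=1, flags=re.DOTALL)

def strip_per_line(text):
    # Like perl -pe 's/.*c//s': the substitution sees one line at a time,
    # so "." never gets the chance to cross a line boundary.
    return "".join(
        re.sub(r".*c", "", line, count=1, flags=re.DOTALL)
        for line in text.splitlines(keepends=True)
    )

print(repr(strip_whole(text)))     # prints: '\nd\n'
print(repr(strip_per_line(text)))  # prints: 'a\nb\n\nd\n'
```

In the per-line version the DOTALL flag changes nothing, which matches the observation that /s alone does not make a line-by-line run see across lines.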

Here is the other form, -ne with print $1:

user@comp ~
$ echo -e 'a\nb\nc\nd' | perl -ne 'print $1 if /(.*c)/s'
c
user@comp ~
$ echo -e 'a\nb\nc\nd' | perl -0777 -ne 'print $1 if /(.*c)/s'
a
b
c
user@comp ~
$

Also

$ echo xxx|perl -lne 'print ""'

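Perl's equivalent of \0 or &, i.e. the whole match, is $&. In the examples here the pattern matches the whole line, so $_ (the current line) holds the same text; to be able to put text before and after it without a space, write ${_}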

$ echo xxx|perl -lne 'print "a${_}${_}a"'
axxxxxxa

and

$  echo xxx|perl -lpe 's/.*/a${_}${_}a"/'
axxxxxxa"

Some further examples

$ cat t.t
<ul>
 <li>item 1</li>
 <li>item 2</li>
</ul>

$ perl -0777 -ne 'print $1 if /\<ul\>(.*?)\<\/ul>/s' t.t

 <li>item 1</li>
 <li>item 2</li>

user@comp ~
$ perl -0777 -ne 'print $1 if /(.*)/s' t.t
<ul>
 <li>item 1</li>
 <li>item 2</li>
</ul>

user@comp ~
$

An example of Global for the -ne one (change "if" to "while"):

$ echo -e 'bbb' | perl -0777 -ne 'print $1 while /(b)/sg'
bbb

For the -pe one, just add the g at the end (/sg or /gs, same thing):

$  echo -e 'aaa' | perl -0777 -pe 's/a/z/s'
zaa

user@comp ~
$  echo -e 'aaa' | perl -0777 -pe 's/a/z/sg'
zzz
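As a rough Python analogue of the /g examples (an illustration only, with made-up function names): re.sub replaces every match by default and needs count=1 to behave like a substitution without /g, and re.finditer walks all matches the way the while loop does.

```python
import re

def replace_first(text):
    # Like s/a/z/: only the first match is replaced.
    return re.sub("a", "z", text, count=1)

def replace_all(text):
    # Like s/a/z/g: every match is replaced (re.sub's default).
    return re.sub("a", "z", text)

def all_captures(text):
    # Like 'print $1 while /(b)/g': collect group 1 from every match.
    return [m.group(1) for m in re.finditer(r"(b)", text)]

print(replace_first("aaa"))           # prints: zaa
print(replace_all("aaa"))             # prints: zzz
print("".join(all_captures("bbb")))   # prints: bbb
```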

Note- This question contrasts /s and -0777

Those print $1 examples don't show the whole line. This link, https://dzone.com/articles/perl-as-a-better-grep, has an example that does: perl -wln -e "/RE/ and print;" foo.txt
