Understanding Regular Expressions 2: Groups and Captures

March 05, 2015

In the first post of this series, we looked at the basic building blocks of regular expressions: the characters and quantities. The next major building blocks are groupings and captures. Both features are built on parentheses, and can be used to great effect.

Groupings

Groups allow a regular expression to treat a set of [character][quantity] pairs as if they were a single expression. For example, suppose an app needs to confirm that a line of text has 5 groups of 3-digit numbers, separated by commas. Our basic tools are already enough for this:

line_is_valid = line.match( /\d{3},\d{3},\d{3},\d{3},\d{3}/ )

With groups, we can surround the repeated term \d{3}, or, 3 consecutive digits followed by a comma, within parens, and add a quantity to the group itself:

line_is_valid = line.match( /(\d{3},){4}\d{3}/ )

Certainly shorter, but is it more understandable? To be seen.

\d{3}, or, 3 consecutive digits followed by a comma is surrounded in a grouping. This implies the group will itself be used as a singular expression. In this case, it is repeated.

(\d{3},){4} or, 4 consecutive sets of 3 digits and a comma. Since the group has a quantity, {4}, it will repeat the interior expression 4 times. This makes (\d{3},){4} equivalent to \d{3},\d{3},\d{3},\d{3},. We use 4 instead of the 5 desired groups, because the last number should not end in a comma.

\d{3} or, 3 consecutive digits will finish off our match, ending with 3 numbers and no trailing comma.
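As a quick sanity check, the sketch below runs both forms of the expression against a few sample lines. The constant names, the helper method, and the sample strings are made up for illustration; only the two patterns come from the text above.

```ruby
# Hypothetical names for illustration; the two patterns are the ones shown above.
LONG_FORM  = /\d{3},\d{3},\d{3},\d{3},\d{3}/
GROUP_FORM = /(\d{3},){4}\d{3}/

# Returns true when the line matches the given pattern.
def valid_line?(line, pattern = GROUP_FORM)
  pattern.match?(line)
end

samples = ["123,456,789,012,345", "123,456,789", "12,345,678,901,234"]

samples.each do |line|
  # Both forms should always agree with each other.
  puts "#{line}: long=#{valid_line?(line, LONG_FORM)} group=#{valid_line?(line, GROUP_FORM)}"
end
```

Only the first sample has five complete groups, so it is the only one either form accepts. Neither pattern is anchored, so a longer line that merely contains five valid groups would also pass.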

The or

Groups also allow us to employ an "or" function. Regex has a built-in "or" for individual characters: surround them in brackets. [ab] is a great way to say a or b, but how would a regex say yes or no? For this, we need to define each word as an expression, and allow either. So:

yes = maybe.match( /yes/ )	# => matches "yes"
no = maybe.match( /no/ )	# => matches "no"
yes_or_no = maybe.match( /(yes|no)/ )	# => matches "yes" or "no"

This is, of course, an overly simple example. We will see in later posts how to use the "or" in a much more impactful manner.
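To see why the parentheses matter here, consider a slightly longer pattern. In this sketch (the constant names and sample strings are invented), the group limits how far the | reaches; without it, the | splits the whole expression in two.

```ruby
# With a group, the | only chooses between "yes" and "no".
GROUPED   = /say (yes|no) now/

# Without a group, the | splits the entire pattern: "say yes" OR "no now".
UNGROUPED = /say yes|no now/

puts GROUPED.match?("say no now")   # => true
puts GROUPED.match?("no now")       # => false
puts UNGROUPED.match?("no now")     # => true
```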

Replacements

The true power of regex is unlocked with replacements and captures. Thus far, we have looked exclusively at whether or not a string matches an expression. What would be possible if we instead chose to replace that match? If we return to the example from part 1 of a list of names, in the form FirstName MiddleInitial. LastName, we will see.

If tasked with removing the middle initial from each line, replacement becomes necessary:

if full_name.match( /[A-Z][a-z]+ [A-Z]\. [A-Z][a-z]+/ )
	first_last = full_name.gsub( / [A-Z]\./, '' )
end

/ [A-Z]\./ or, space, a capital letter, a literal period, matches our middle initial. The second string, '', is used as the replacement. Since it is empty, our match is replaced with nothing, thereby deleting it.

Note: for those unfamiliar, gsub() is Ruby's replace function, standing for "global substitution". JavaScript uses replace(), Python uses re.sub(), etc.
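Here is the same replacement applied to a small list. The sample names and the helper method are invented for illustration. A name without a middle initial simply comes back unchanged, which is why the surrounding if is not strictly required.

```ruby
# Sample data invented for illustration.
names = ["John R. Smith", "Mary K. Jones", "Prince"]

# Same middle-initial pattern as above: space, capital letter, literal period.
MIDDLE_INITIAL = / [A-Z]\./

# Hypothetical helper: returns a new string and leaves the original untouched.
def strip_middle_initial(full_name)
  full_name.gsub(MIDDLE_INITIAL, '')
end

names.each { |name| puts strip_middle_initial(name) }
# Prints:
#   John Smith
#   Mary Jones
#   Prince
```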

Captures

Finally, captures are the amalgamation of replacements and groupings. Captures allow groups to be "captured" and referenced within the replacement string. These references are numbered, and are usually written as either \1, \2, etc. or $1, $2, etc. Captures are automatic, meaning that every grouping in a regex can be referenced in the replacement.

This can be a bit confusing, so let's look through some examples and shed some clarity on the power this offers.
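As a warm-up before the name examples, here is a minimal sketch using a date string. The pattern, method names, and sample date are all invented for illustration. It shows the numbered captures on their own, then both Ruby spellings of a reference: \1 inside a replacement string, and $1 inside a replacement block.

```ruby
# Three groups: year, month, day.
DATE = /(\d{4})-(\d{2})-(\d{2})/

# Replacement string form: captures are written \1, \2, \3.
# Single quotes matter here; in double quotes, "\1" is an escape sequence.
def us_style(iso_date)
  iso_date.sub(DATE, '\2/\3/\1')
end

# Block form: the same captures are available as $1, $2, $3.
def us_style_block(iso_date)
  iso_date.sub(DATE) { "#{$2}/#{$3}/#{$1}" }
end

m = DATE.match("2015-03-05")
puts m[1]                          # => 2015
puts m[2]                          # => 03
puts us_style("2015-03-05")        # => 03/05/2015
puts us_style_block("2015-03-05")  # => 03/05/2015
```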

Examples

Last name first

Keeping with the list of names example, let's say we need to turn the first-middle-last name file into the format LastName, FirstName. Describing the process, we must: identify the first name, identify the last name, swap their order, add a comma between them, and discard the rest of the name.

To do this, we must place a group around our selector for the first name, [A-Z][a-z]+, and another around the selector for the last name. These groups will later be referenced as \1 and \2 respectively.

name = "John R. Smith"
last_first = name.sub( /([A-Z][a-z]+) [A-Z]\. ([A-Z][a-z]+)/, '\2, \1' )
puts last_first	# => "Smith, John"

([A-Z][a-z]+) or, capture the group: a capital letter followed by 1 or more lowercase letters stores our first name to be referenced later as \1.

/ [A-Z]\. / or, space, a capital letter, a literal period, space will match the middle initial. Since we are not capturing it, it will be dropped when the replacement call completes.

([A-Z][a-z]+) or, capture the group: a capital letter followed by 1 or more lowercase letters stores the last name in our second capture, \2. At this point, the entire name has been matched, and the replace will run.

'\2, \1', or, place the second capture, comma, space, place the first capture references our last and first names, and separates them with a comma and space.
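Since the task describes a whole file of names, the same expression can be run over several lines at once with gsub. This is only a sketch: the roster text and the method name are made up, and it assumes every line follows the first-middle-last form exactly.

```ruby
# Same pattern as above, with the first and last names captured.
FULL_NAME = /([A-Z][a-z]+) [A-Z]\. ([A-Z][a-z]+)/

# Hypothetical helper: rewrites every matching name in the text.
def last_name_first(text)
  text.gsub(FULL_NAME, '\2, \1')
end

# Invented sample standing in for the contents of the name file.
roster = "John R. Smith\nMary K. Jones\n"

puts last_name_first(roster)
# Prints:
#   Smith, John
#   Jones, Mary
```

The pattern only contains literal spaces between its parts, so a match never runs across a line break into the next name.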

A decent Pig Latin translator

A tried and true interview question is to ask programmers to write a Pig Latin translator. This is a great challenge made much simpler by the use of regular expressions. Let's look at the definition of Pig Latin, and then consider how to tackle such a problem.

Pig Latin takes the first consonant (or consonant cluster) of an English word, moves it to the end of the word and suffixes an "ay", or if a word begins with a vowel you just add "way" to the end.

The approach to this problem will be a couple of parts: we will need to consider words starting with consonants and vowels separately. This requires an expression to find words that begin with vowels or not.

[aeiou] or, a, or e, or i, or o, or u is a basic expression to match any vowel. If we follow it with \w* or, 0 or more word characters we should be able to capture whole words. Once captured, we simply need to replace a word with itself followed by "way".

Finally, there is a regex operator we have yet to explore that we will need: \b. This selector, known as a "zero-width" selector, does not match a letter or character but matches the start and end of words. This will guarantee that our vowel selector [aeiou] will only match the first letter in a word, instead of the first vowel in a word.

text.gsub!( /\b([aeiou]\w*)\b/, '\1way' )

Again, notice how we capture the entire word in our group, then replace it with itself, \1 followed by "way": \1way. Once the words starting with vowels are updated, it is time to update words beginning with consonants.
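To see what \b buys us, the sketch below runs the vowel expression with and without it on the same sentence. The method names and the sample sentence are invented for illustration.

```ruby
# Without \b, the match can start at the first vowel anywhere inside a word.
def add_way_anywhere(text)
  text.gsub(/([aeiou]\w*)/, '\1way')
end

# With \b, the match must start at the beginning of a word.
def add_way_to_vowel_words(text)
  text.gsub(/\b([aeiou]\w*)\b/, '\1way')
end

sentence = "an apple is tasty"

puts add_way_anywhere(sentence)        # => anway appleway isway tastyway
puts add_way_to_vowel_words(sentence)  # => anway appleway isway tasty
```

Without the boundary, "tasty" is picked up from its first vowel onward and wrongly gains a "way"; with it, the word is left for the consonant rule.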

Finding words that begin with consonants is the same as finding words that begin with non-vowels. You may recall that bracket selectors can begin with ^ to function as a "not". This means that [^aeiou] actually means not a, and not e, and not i, and not o, and not u. Keep in mind that "not a vowel" covers more than consonants: it also matches digits, spaces, and punctuation.

Unlike the vowel words, however, we need to match consonant clusters for Pig Latin. For example "truck" becomes "ucktray", not "rucktay". This means we will need to match all consecutive non-vowels at the start of a word. [^aeiou]+ should do the trick. Again, we will use the still-confusing \b to ensure we look at words as a whole.

text.gsub!( /\b([^aeiou]+)(\w+)\b/, '\2\1ay' )

Here, notice we capture the word in two parts: the consonant cluster, ([^aeiou]+), and the remainder of the word, (\w+). In our replacement, we rearrange these groups, \2\1, then append "ay": '\2\1ay'.

This leaves us with a solution:

# Translate words beginning with vowels
text.gsub!( /\b([aeiou]\w*)\b/, '\1way' )

# Translate words beginning with consonants
text.gsub!( /\b([^aeiou]+)(\w+)\b/, '\2\1ay' )

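While this works on a single lowercase word, there are some issues still. Because [^aeiou] also matches spaces and punctuation, the consonant expression can swallow the space in front of a word when it is run on a whole sentence. Words starting with capital letters are also mishandled: "Apple", for example, slips past the vowel expression and is then treated as if it began with a consonant cluster. The next post will dive into how we can solve this, as well as what is behind the "zero-width" selectors.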
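One way to keep the consonant expression from reaching across spaces is to list the consonants explicitly instead of negating the vowels. The sketch below is one possible variation, not the approach this series settles on: the bracket ranges, constant names, and method name are assumptions made for illustration. It still leaves capitalized words alone.

```ruby
# The consonant expression from above, kept for comparison.
ORIGINAL_CONSONANT = /\b([^aeiou]+)(\w+)\b/

# Assumed alternative: spell out the lowercase consonants as ranges,
# so spaces, digits, and punctuation can never be part of the cluster.
STRICT_CONSONANT = /\b([b-df-hj-np-tv-z]+)(\w+)\b/

# Hypothetical helper combining the vowel rule with the stricter consonant rule.
def pig_latin(text)
  text
    .gsub(/\b([aeiou]\w*)\b/, '\1way')     # words beginning with vowels
    .gsub(STRICT_CONSONANT, '\2\1ay')      # words beginning with consonants
end

# The original expression pulls the space into the second word's "cluster".
puts "hello world".gsub(ORIGINAL_CONSONANT, '\2\1ay')  # => ellohayorld way

puts pig_latin("hello world")    # => ellohay orldway
puts pig_latin("eat the truck")  # => eatway ethay ucktray
```

The vowel rule has to run first here. Running the consonant rule first would produce words that start with a vowel, which the vowel rule would then translate a second time.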

Like this post? Read the other posts in the Understand Regex series:

Understanding Regular Expressions 1: Characters

Understanding Regular Expressions 3: Zero Width Selectors

Understanding Regular Expressions 4: Lookarounds