Understanding Regular Expressions 3: Zero Width Selectors
Readers of the previous post in this series will remember a Pig Latin translator we created using groups and captures. That translator included a selector, \b, that we had not yet explored. It was described as a "zero-width" selector, a concept that can be a bit foreign.
Up to this point, every selector we have looked at matched some number of characters. Zero-width selectors, on the other hand, match a position instead. Let's look at the list of selectors and see the positions they select.
The zero-widths
Symbol | Position Matched |
---|---|
^ | Start of the line |
$ | End of the line |
\A | Start of the string |
\Z | End of the string |
\b | Word boundary (the start or the end of a word) |
These selectors do not match the character at these positions, but rather the position itself. Let's work through some examples and see how these work out functionally.
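One way to see that a position is matched, rather than a character, is to insert a marker wherever a selector matches. The following is only a minimal sketch: the helper name mark_positions and the sample strings are made up for illustration.

```ruby
# Insert a visible marker at every position a zero-width selector matches.
# Nothing is removed from the text, because the match itself is empty.
def mark_positions(text, pattern)
  text.gsub(pattern, "|")
end

puts mark_positions("hello world", /\b/) # word boundaries
puts mark_positions("one\ntwo", /^/)     # start of each line
puts mark_positions("one\ntwo", /\A/)    # start of the string only
```

Every original character survives in the output; only the markers are new.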
Examples
^ and $
^ and $ act as place markers for the start and end of a line, respectively. Personally, I find myself using these most frequently when looking for file extensions or lines that begin with variables. Let's consider both of these use cases.
If I am reading a list of files, separated by newlines, and want to check the extension of each file, I could start with the basic expression: \.\w+ or, a literal period followed by 1 or more word characters. This will fail to correctly match stacked extensions, such as index.html.haml or stuff.tar.gz, because it also matches the inner .html and .tar. To match these, we need to specify we want only the extension that ends at the end of the line. We specify the end of the line by adding a $ to the end of the expression.
/\.\w+$/
or, a period followed by 1 or more word characters at the end of a line, will ensure our expression matches only the final extension. Notice that the $ will not match a character at the end of the line, but rather the position itself.
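To make the difference concrete, here is a small sketch in Ruby. The file list and the method names are assumptions made up for this example.

```ruby
# Hypothetical file list, one name per line (not from a real project).
FILE_LIST = "index.html.haml\nstuff.tar.gz\nnotes.txt\nREADME"

# Without an anchor, every dotted segment counts as an extension.
def all_extensions(list)
  list.scan(/\.\w+/)
end

# With $, only the segment that ends at the end of a line is matched.
def final_extensions(list)
  list.scan(/\.\w+$/)
end

p all_extensions(FILE_LIST)   # => [".html", ".haml", ".tar", ".gz", ".txt"]
p final_extensions(FILE_LIST) # => [".haml", ".gz", ".txt"]
```

README has no extension, so it contributes nothing to either list.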
As for ^, it can be used in much the same way. I use it most often when searching code, and almost always in combination with \s* or, 0 or more whitespace characters. Using these, we can match all lines that begin with a desired expression, with or without leading tabs/spaces. Consider the expression:
/^\s*my_variable/
or, a line starting with 0 or more whitespace characters followed by my_variable. It can find every line of code that begins by referencing my_variable. This is particularly useful when browsing code with the command-line tool grep.
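The same expression can be tried from Ruby. This is just a sketch: the lines of "source code" below are invented, and the method name is hypothetical.

```ruby
# Hypothetical lines of source code, used only for illustration.
SOURCE = [
  "my_variable = 1",
  "    my_variable += 2",
  "\tmy_variable.freeze",
  "puts my_variable",
  "other = my_variable"
]

# Keep the lines that begin with my_variable, ignoring leading whitespace.
def lines_beginning_with_my_variable(lines)
  lines.grep(/^\s*my_variable/)
end

puts lines_beginning_with_my_variable(SOURCE)
```

Only the first three lines are kept. The last two mention my_variable, but not at the start of the line.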
\A and \Z
\A and \Z work the same as ^ and $, but function on the entire string instead of each line. For strings without newline characters, they are functionally identical. Because of this, \A and \Z are rarely used. They are nice to remember, though, for that once or twice a year when you need to match the final line in a string.
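A short sketch of the line anchors next to the string anchors, assuming Ruby's regex engine. The sample text and method names are made up. One Ruby detail worth knowing: \Z also matches just before a trailing newline, while lowercase \z matches only the very end of the string.

```ruby
# Hypothetical multi-line string with a trailing newline.
TEXT = "first line\nsecond line\nlast line\n"

def first_word_of_each_line(text)
  text.scan(/^\w+/)
end

def first_word_of_string(text)
  text.scan(/\A\w+/)
end

def last_word_of_each_line(text)
  text.scan(/\w+$/)
end

# \Z still matches here, because it tolerates the final newline.
def last_word_of_string(text)
  text.scan(/\w+\Z/)
end

p first_word_of_each_line(TEXT) # => ["first", "second", "last"]
p first_word_of_string(TEXT)    # => ["first"]
p last_word_of_each_line(TEXT)  # => ["line", "line", "line"]
p last_word_of_string(TEXT)     # => ["line"]
```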
\b
\b is a unique selector that indicates a word boundary: the start or the end of a word. This is useful when you are looking for words that may also appear inside other words. For example, if you wish to find all instances where the variable width is used, you have a few options. Let's use grep -r and see how our expressions would look.
grep -r 'width' *
This will match variables named width, but it will also match variables whose names include "width", e.g. min_width, max_width, rect_width. To avoid this, we could say we want to match "width", where the characters touching both sides are not word characters:
grep -r '\Wwidth\W' *
This expression, read as a non-word character, width, a non-word character, comes closer to our need. However, it fails to match lines starting or ending with "width", as there would be no non-word character before or after the word. What we need, ideally, is the ability to match the word "width". That is, "width" bordered by either non-word characters or the start/end of a line. This is precisely the job of \b.
grep -r '\bwidth\b' *
By swapping from \W to \b, we will match all usages of the word "width" with the added bonus of not matching surrounding characters we aren't concerned with.
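The three searches can be compared side by side in Ruby. This is only a sketch: the sample lines and the method name are invented, and Ruby's regex engine stands in for grep here.

```ruby
# Hypothetical lines of code to search through.
CODE_LINES = [
  "width = 10",
  "min_width = 2",
  "area = width * height",
  "rect_width = max_width",
  "return width"
]

# Return the lines that the given pattern matches.
def matching_lines(lines, pattern)
  lines.grep(pattern)
end

# Plain text: matches all five lines, including min_width and friends.
p matching_lines(CODE_LINES, /width/)

# \W on both sides: only "area = width * height" qualifies, because the
# other uses of width touch the start or the end of the line.
p matching_lines(CODE_LINES, /\Wwidth\W/)

# \b on both sides: every standalone use of width, and nothing else.
p matching_lines(CODE_LINES, /\bwidth\b/)
```

The underscore counts as a word character, which is why min_width and rect_width are skipped by the last pattern.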
The Pig Latin usage
In the Pig Latin example, we arrived at the following code:
# Translate words beginning with vowels
text.gsub!( /\b([aeiou]\w+)\b/, '\1way' )
# Translate words beginning with consonants
text.gsub!( /\b([^aeiou]+)(\w+)\b/, '\2\1ay' )
Without \b to specify we only want complete words, we would need to define that constraint ourselves. The most complete way would be to require bounding non-word characters on each side of a full word:
text.gsub!( /\W([^aeiou]+)(\w+)\W/, '\2\1ay' )
\W([^aeiou]+)(\w+)\W or, a non-word character, 1 or more consonants, 1 or more word characters and another non-word character, matches full words. However, since we were using a replace function, gsub, all matched characters would be replaced. This means that the bounding characters, be they spaces, tabs or punctuation, would be deleted. To retain them, we need to capture them as well, and reuse them within the replace:
text.gsub!( /(\W)([^aeiou]+)(\w+)(\W)/, '\1\3\2ay\4' )
Being required to identify and preserve bounding characters has caused a major decline in the readability of our expression. This is the use case of \b. We don't care what the bounding characters are, we simply require our word to be between them. What we need is a zero-width position to begin from and end at. Again:
text.gsub!( /\b([^aeiou]+)(\w+)\b/, '\2\1ay' )
Since \b matches the start/end of the word and not the bounding characters, we don't need to capture and carry them into the replace string. The result is a much simpler expression, as well as a shorter replace statement.
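Running the three consonant variants over one sentence shows what happens to the bounding characters. Treat this as a sketch under assumptions: the sample sentence and method names are made up, and the consonant group is written as [^aeiou\W] (a word character that is not a vowel), because a bare [^aeiou] also matches spaces and could swallow the space in front of a word.

```ruby
# Assumption: [^aeiou\W] is used for the consonant group so that it can
# only match word characters, never the space before a word.

# Bounding characters are matched by \W, so the replace discards them.
def translate_losing_bounds(text)
  text.gsub(/\W([^aeiou\W]+)(\w+)\W/, '\2\1ay')
end

# Bounding characters are captured and written back by hand.
def translate_keeping_bounds(text)
  text.gsub(/(\W)([^aeiou\W]+)(\w+)(\W)/, '\1\3\2ay\4')
end

# \b matches positions only, so there is nothing to carry over.
def translate_with_boundaries(text)
  text.gsub(/\b([^aeiou\W]+)(\w+)\b/, '\2\1ay')
end

sample = "the quick brown fox"
puts translate_losing_bounds(sample)   # theuickqaybrown fox
puts translate_keeping_bounds(sample)  # the uickqay brown fox
puts translate_with_boundaries(sample) # ethay uickqay ownbray oxfay
```

The \W versions also skip "the" and "fox", which sit at the edges of the string, and "brown", whose leading space was already consumed by the match before it. The \b version has none of these problems.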
The vast majority
With zero-width selectors, we have covered the tools for roughly 98% of the regular expressions I have ever encountered in code. That is not to say they are the end of the tools regex provides. As they get more complicated, regular expressions can make otherwise daunting problems easily manageable.
In the final post of this series, we will explore the most confusing of all the regex tools: lookaheads and lookbehinds.
Like this post? Read the other posts in the Understand Regex series:
Understanding Regular Expressions 1: Characters