Understanding Regular Expressions 4: Lookarounds

August 26, 2015

Lookarounds, like regular expressions in general, are too often avoided. Why? We avoid them because we don't understand them.

The time has come.

The Lookarounds

There are four types of lookarounds. They are categorized in two ways: direction (ahead and behind) and positivity (positive and negative). Each combination of direction and positivity gives one lookaround, for a total of four.

Positivity	Look Direction	Symbol
Positive	Ahead	(?=)
Positive	Behind	(?<=)
Negative	Ahead	(?!)
Negative	Behind	(?<!)

Lookarounds are zero-width selectors. This means they do not actually retrieve the context they match. This can be tricky to conceptualize, so let's look at an example.
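Before the longer walkthrough, here is a minimal sketch of all four symbols from the table in action. The sample string and variable names are invented for illustration, and the lookbehind patterns assume a JavaScript engine with lookbehind support (ES2018 or later). In every case only the digits are returned; the surrounding context is checked but never included.

```javascript
// Hypothetical sample text; not from the original post.
const text = "price: $42, weight: 17 kg";

// Positive lookahead: digits that are followed by " kg".
const positiveAhead = text.match(/\d+(?= kg)/g);
// Positive lookbehind: digits that are preceded by "$".
const positiveBehind = text.match(/(?<=\$)\d+/g);
// Negative lookahead: whole numbers that are not followed by " kg".
const negativeAhead = text.match(/\b\d+\b(?! kg)/g);
// Negative lookbehind: whole numbers that are not preceded by "$".
const negativeBehind = text.match(/(?<!\$)\b\d+/g);

console.log(positiveAhead, positiveBehind, negativeAhead, negativeBehind);
// [ '17' ] [ '42' ] [ '42' ] [ '17' ]
```

The word boundaries (\b) in the two negative patterns keep the engine from settling for part of a number, such as the "2" in "42".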

A positive lookahead

The most intuitive of the lookarounds is the positive lookahead. Consider the following sentence:

This sentence, yes, this one, contains a number of commas, words, and an Oxford comma.

If we were asked to match all of the words that are followed by a comma, we might hastily write the expression:

sentence.match( /\w+,/g )

Haste makes waste. This expression:
\w+ or, 1 or more word characters, will match the words in this sentence.
, or, a comma, will match the comma.
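To see what the hasty expression returns, here is a small runnable sketch. It assumes the sentence is stored in a variable named sentence and uses the g flag so that match returns every match rather than only the first.

```javascript
const sentence =
  "This sentence, yes, this one, contains a number of commas, words, and an Oxford comma.";

// \w+ grabs a word, and the literal comma is consumed as part of the match.
const hasty = sentence.match(/\w+,/g);

console.log(hasty);
// [ 'sentence,', 'yes,', 'one,', 'commas,', 'words,' ]
```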

While we have matched the correct words, we have erroneously included the commas as well. To exclude the comma, we need a selector that confirms the comma is there without actually including it. We can do this with a positive lookahead.

sentence.match( /\w+(?=,)/g )

Remember: because lookarounds are zero-width selectors, they do not include the expressions they match in the result. So, by wrapping our , in a positive lookahead (?=) we get the expression: (?=,) or, literally, immediately followed by a comma.

This means our new expression literally reads: \w+ one or more word characters (?=,) immediately followed by a comma. Bingo.
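A short sketch of the lookahead version, again assuming a variable named sentence and the g flag. The replace call is an extra illustration, not part of the original example: it uppercases each matched word and shows that the commas stay untouched, because the lookahead never consumed them.

```javascript
const sentence =
  "This sentence, yes, this one, contains a number of commas, words, and an Oxford comma.";

// (?=,) checks that a comma comes next but leaves it out of the match.
const words = sentence.match(/\w+(?=,)/g);
console.log(words);
// [ 'sentence', 'yes', 'one', 'commas', 'words' ]

// Only the words are replaced; every comma is still in place afterwards.
const shouted = sentence.replace(/\w+(?=,)/g, (word) => word.toUpperCase());
console.log(shouted);
// This SENTENCE, YES, this ONE, contains a number of COMMAS, WORDS, and an Oxford comma.
```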

Tricky phones

Now that we have some basis to work from, let's try a more complicated example. Consider this problem: we would like to find all instances of the string "phone" that are not part of the word "telephone". Let's look at the following list:

phone
cellphone
telephone
payphone
mobile phone
phone cord
some other words
because test cases

This is surprisingly tricky, even given a short list. Lookarounds, though, will see us through.

To find the right lookaround, we must identify the direction and positivity of our search.

Direction

The direction is decided by the position of the context with respect to the match. In this case, "tele" comes before "phone". This means we need a lookbehind.

Positivity

We are asked to ignore instances of "tele". Because we are negating a case, we will need a negative lookaround. Combined with our direction, this means we will need to use a negative lookbehind.

The expression

Now that we have identified that we need a negative lookbehind, we can use it to match instances of "phone" in the proper context. To do this we will need to surround the excluded context, "tele", in the intimidating looking negative lookbehind: (?<!).

list.match( /(?<!tele)phone/g )

Because we are using a lookbehind, let's break this down in reverse:
phone, or, the substring "phone", matches all instances of the string "phone".
(?<!tele) or, not preceded by "tele", excludes the unwanted context.
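Here is a runnable sketch of the negative lookbehind against the list above. It assumes the list is held as a single newline-separated string named list, and that the JavaScript engine supports lookbehind (ES2018 or later). The per-line filter is added only to make it easy to see which entries qualify.

```javascript
const list = [
  "phone", "cellphone", "telephone", "payphone",
  "mobile phone", "phone cord", "some other words", "because test cases",
].join("\n");

// Every "phone" that is not immediately preceded by "tele".
const matches = list.match(/(?<!tele)phone/g);
console.log(matches.length); // 5

// Which lines contain at least one qualifying "phone"?
const matchingLines = list
  .split("\n")
  .filter((line) => /(?<!tele)phone/.test(line));
console.log(matchingLines);
// [ 'phone', 'cellphone', 'payphone', 'mobile phone', 'phone cord' ]
```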

On the command line

I most often find lookarounds pulling their weight when searching codebases.

I was recently asked to identify all link URLs in outgoing emails that did not include UTM parameters. This was a perfect opportunity to use a negative lookahead. Using ack, an amazing grep replacement that supports full regex, I put together the following command.

ack 'link_to(?![^\n]+utm_\w+)'

The expression reads funny because it has a couple of negatives in it. link_to or, the string "link_to" matches the Rails <a> function. (?!...) or, not immediately followed by is our negative lookahead. This wraps the regular expression [^\n]+ or, 1 or more characters that are not new-lines followed by utm_\w+ or, "utm_" followed by 1 or more word characters.

More simply: any line containing a "link_to" that is not followed by a "utm_" on the same line.
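The same expression can be tried outside of ack. The sketch below runs it in JavaScript against a few made-up template lines; the lines and variable names are hypothetical and stand in for whatever the real email templates contain.

```javascript
// Hypothetical template lines, invented for illustration.
const lines = [
  '<%= link_to "Shop", shop_url(utm_source: "email") %>',
  '<%= link_to "Unsubscribe", unsubscribe_url %>',
  "<p>No links here, but utm_campaign is mentioned.</p>",
];

// Keep lines where "link_to" appears with no "utm_" parameter after it.
const missingUtm = lines.filter((line) =>
  /link_to(?![^\n]+utm_\w+)/.test(line)
);

console.log(missingUtm);
// [ '<%= link_to "Unsubscribe", unsubscribe_url %>' ]
```

The third line is skipped even though it mentions "utm_campaign", because the lookahead is only evaluated where "link_to" has already matched.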

Quantity makes quality

As with so many things, repeated use improves our abilities. At first, lookarounds are an intimidating feature of an already intimidating technology. But they exist, and persist, for a reason. Their flexibility, when called upon, is unmatched.

Commit the 4 wrappers to memory: (?=), (?!), (?<=), (?<!). Challenge yourself to use them. Learn the unnecessary things.

Like this post? Read the previous posts in the Understand Regex series:

Understanding Regular Expressions 1: Characters

Understanding Regular Expressions 2: Groups and Captures

Understanding Regular Expressions 3: Zero-width Selectors

Written by Ben

https://www.shootskyward.com

Ben is the co-founder of Skyward. He has spent the last 10 years building products and working with startups.