Understanding Regular Expressions 1: Characters and Quantities

March 03, 2015

Regular expressions, like CSS or the workings of Agile process, are some of the least understood coding aspects among young programmers, and for good reason. Regex use a syntax entirely of their own, they are rarely a necessity when building a solution, and they are infamous for their inability to parse HTML.

It could be said that it's a wonder the language still exists at all. In fact, the only logical reason such a technology could remain relevant would be if it bestowed a sort of flexible productivity unparalleled in the industry.

And that is precisely what it does.

The structure

A regex, at its core, is a repeated pattern enclosed in a pair of / characters. In the case of small regular expressions, that pattern is simply: [characters][quantity]. Let's look through some examples to see how this works.

"hello".match( /a/ )  # => nil (no match, falsy)
"bat".match( /a/ )  # => #<MatchData "a"> (truthy)

In this most basic example, we define a set of characters, "a", but omit a quantity. The processor assumes this to mean "1". Thus, our regex says: match the letter a. Since "hello" contains no letter "a", match returns nil, which Ruby treats as false. "bat", on the other hand, returns a MatchData object, which Ruby treats as true.

Let's make this more robust with some character quantities and see how it behaves:

"hello".match( /a*/ )  # => #<MatchData ""> (truthy)
"hello".match( /a+/ )  # => nil (falsy)
"hello".match( /l*/ )  # => #<MatchData ""> (truthy)
"hello".match( /l+/ )  # => #<MatchData "ll"> (truthy)

You will notice that somehow adding a * has caused /a*/ to match "hello", which contains no "a" at all. This is because * represents the quantity 0 or more. Since "hello" does indeed contain 0 or more "a"s, the match succeeds by matching an empty string at the very start. To understand this better, we must first understand the basic symbols.
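To see why a pattern like /a*/ succeeds, it helps to look at what was actually matched rather than only whether a match occurred. The snippet below is a minimal sketch; the helper name matched_text is made up for illustration. It relies on String#match returning either nil or a MatchData object, whose element 0 is the matched text.

```ruby
# Hypothetical helper: returns the text a pattern matched,
# or nil when nothing matched.
def matched_text(string, pattern)
  result = string.match(pattern)  # MatchData on success, nil on failure
  result && result[0]             # result[0] is the full matched substring
end

p matched_text("hello", /a*/)  # => ""   (zero "a"s, matched at the very start)
p matched_text("hello", /a+/)  # => nil  (at least one "a" is required)
p matched_text("hello", /l*/)  # => ""   (zero "l"s, matched before the "h")
p matched_text("hello", /l+/)  # => "ll"
```

Note that /l*/ does not reach the "ll" in the middle of the word: the engine takes the first position where the pattern can succeed, and matching zero characters at position 0 already counts as success.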

The symbols

Regex use symbols to represent common quantities and character groupings. These keep regular expressions both readable and maintainable as they grow.

For now, let's look at the symbols for character quantities and character groups.

Quantity Symbol	Actual Quantity
?	0 to 1
*	0 to many
+	1 to many
{2}	exactly 2
{1,3}	between 1 and 3
{3,}	3 or more
{,4}	up to 4
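As a quick illustration of the quantity symbols, here is a small sketch that applies each one to a run of five "a"s. It uses String#[] with a regex, which returns the matched text or nil; the variable name text is only for this example. Quantities are greedy by default, so each one takes as many characters as it is allowed to.

```ruby
text = "aaaaa"

p text[/a?/]      # => "a"      (0 to 1)
p text[/a*/]      # => "aaaaa"  (0 to many)
p text[/a+/]      # => "aaaaa"  (1 to many)
p text[/a{2}/]    # => "aa"     (exactly 2)
p text[/a{1,3}/]  # => "aaa"    (between 1 and 3)
p text[/a{3,}/]   # => "aaaaa"  (3 or more)
p text[/a{,4}/]   # => "aaaa"   (up to 4)
p "aa"[/a{3,}/]   # => nil      (only 2 "a"s available)
```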

Character Symbol	Characters Selected
.	any character (except a newline, by default)
\w	Word characters (a-z, A-Z, 0-9, _)
\W	Non Word characters
\d	Digits (0-9)
\D	Non Digits
\s	WhiteSpace characters (space, tab, newline, etc)
\S	Non WhiteSpace characters
[abc]	a or b or c
[^abc]	any character except a, b, or c

For character groups, a capital letter effectively means not, e.g. \w vs \W. Putting characters within square brackets means match any one of these characters.

Also of note is the . and the *. When combined, the resulting expression /.*/ effectively means match 0 or more of anything. While it sounds profoundly useless, there are indeed times it comes in handy.
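The character symbols can be tried out the same way. This sketch uses String#scan, which returns every match of a pattern as an array; the sample string is invented for illustration.

```ruby
sample = "Room 42, Floor_3!"

p sample.scan(/\d/)       # => ["4", "2", "3"]
p sample.scan(/\d+/)      # => ["42", "3"]
p sample.scan(/\w+/)      # => ["Room", "42", "Floor_3"]
p sample.scan(/\s/)       # => [" ", " "]
p sample.scan(/[aeiou]/)  # => ["o", "o", "o", "o"]
p sample.scan(/[^\w\s]/)  # => [",", "!"]
p sample[/.*/]            # => "Room 42, Floor_3!"
```

The last line shows /.*/ swallowing the entire string, which works here because the sample contains no newline.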

Now that we have some tools, let's put them to use.

The usage

Phone numbers

Phone numbers are a great example of a simple regex used for validation. A typical American number is generally of the form "(123) 456-7890". How would code validating that format look without regular expressions? Less than ideal. With regular expressions, though:

phone_is_valid = phone.match( /\(\d{3}\) \d{3}-\d{4}/ )

Simple, but nearly a quarter of those characters are backslashes, and that seems excessive. Is it really necessary?

\(\d{3}\) or, a literal open paren, exactly 3 digits, a literal closing paren. In regular expressions, parentheses are used for groups, so we must escape them with a \ to check for a literal paren.

" \d{3}" or, 1 space followed by exactly 3 digits (the quotes only make the leading space visible). Spaces in regex are just another character to match. Put a space, and it will match a space. We follow that up with the 3 digits following the area code.

-\d{4} or, 1 hyphen followed by exactly 4 digits. Again, lacking a quantity means regex will assume exactly 1. Thus, the singular hyphen will match exactly one, where the {4} will require exactly 4 more digits (\d).

Success! A basic one-liner to handle what would otherwise be a series of string splits and algorithmic handwaving.
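One caveat: as written, the pattern succeeds if a phone number appears anywhere inside the string. If the whole string must be a phone number, the pattern can be pinned to both ends. The sketch below assumes that stricter requirement and uses Ruby's \A (start of string) and \z (end of string) anchors, which this post has not covered; the method name valid_phone? is hypothetical.

```ruby
# Hypothetical validator: true only when the ENTIRE string has the
# form "(123) 456-7890". \A and \z anchor the pattern to both ends.
def valid_phone?(phone)
  pattern = /\A\(\d{3}\) \d{3}-\d{4}\z/
  !phone.match(pattern).nil?
end

p valid_phone?("(123) 456-7890")        # => true
p valid_phone?("123-456-7890")          # => false
p valid_phone?("call (123) 456-7890!")  # => false

# For comparison, the unanchored pattern still finds a match here:
p "call (123) 456-7890!".match(/\(\d{3}\) \d{3}-\d{4}/).nil?  # => false
```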

Name list

Say we have a list of names in a file. We want to find all the names that have the form FirstName MiddleInitial. LastName. We will assume "happy path" names, for this example's sake.

names.each do |name|
	if name.match( /[A-Z][a-z]+ [A-Z]\. [A-Z][a-z]+/ )
		puts "#{name} matches"
	end
end

Fewer backslashes, but a bit daunting at first glance. Let's break it down.

[A-Z][a-z]+ or, 1 capital letter followed by 1 or more lowercase letters. This will match our first names. Names begin with capital letters, which is handled by the [A-Z]. The range behaves intuitively, functioning as a shorthand for any letter between capital "A" and capital "Z". Again, because we omit a quantity, 1 is assumed. Next, [a-z]+ enforces that the letters that follow are all lowercase. Additionally, the use of + instead of * requires every matching name to be at least 2 letters long.

" [A-Z]\. " or, space, capital letter, literal period, space (the quotes only make the surrounding spaces visible). As before, a space in our regex will match a space in the string. Again, a single capital letter is used. Next, a literal period. Since . itself means any character, we escape it with a \. This is followed by another space, as expected between the period and the last name.

[A-Z][a-z]+ or, 1 capital letter followed by 1 or more lowercase letters. A repeat of the first name expression, this time to match the last name. If we wanted to include hyphenated names, or names with multiple capital letters like "McAllister", this particular expression would become more complex.
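To try the whole expression out, here is a self-contained sketch that runs it over a handful of sample names. The names and the variable names are invented for illustration; select keeps only the entries for which match succeeds.

```ruby
# Sample data, made up for this example.
names = [
  "John Q. Public",   # First Middle. Last
  "jane d. doe",      # no capital letters
  "Mary Smith",       # no middle initial
  "A B. Sure",        # first name is only 1 letter
  "Ada M. Lovelace"   # First Middle. Last
]

name_pattern = /[A-Z][a-z]+ [A-Z]\. [A-Z][a-z]+/

matching = names.select { |name| name.match(name_pattern) }
p matching  # => ["John Q. Public", "Ada M. Lovelace"]
```

"A B. Sure" is rejected because [a-z]+ demands at least one lowercase letter after the capital, which is the 2-letter minimum described above.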

What's next

Regular expressions are hailed for their power and flexibility. Learning them can take time and experimentation, though. A wonderful resource for experimenting and seeing results in real time is Rubular. While it is Ruby-branded, the basics covered here will work in most any language you use, though each language's regex flavor differs in some details.

In the next part of this series, we will examine groups, captures, and substitution. We will also try to write a one-line solution to an old interview question: a Pig Latin translator.

Like this post? Read the next posts in the Understand Regex series:

Understanding Regular Expressions 2: Groups and Captures

Understanding Regular Expressions 3: Zero Width Selectors

Understanding Regular Expressions 4: Lookarounds