Regular Expressions in Ruby

Regular expressions can be both terribly awkward and extremely useful. In this introductory post, we will learn the basics of regular expressions in the Ruby programming language and how to use them for routine programming tasks.

Merry Christmas!

In my programming career so far, I've always used regular expressions on a need-to-know basis, and never really took the time to learn them properly. Now that I've been programming in Ruby for the past year and a half (and really enjoying it), I decided to take a deeper look and understand the Ruby way to do them properly.

So I spent the Christmas weekend reading about and playing with regular expressions in Ruby. It was time well spent. Ruby makes them much less intimidating than C# or JavaScript do, and fun too. I've learned a lot, and I can use what I learned not only when writing Rails apps, but also to improve my productivity in VS Code's find-and-replace feature.

What follows are my notes from the weekend learning session. I hope you find them useful and you learn something new.  


What is a regular expression?

A regular expression is a pattern describing the contents of a string. Regular expressions offer a very powerful way to match, search, and replace a pattern of characters in text.

Why do we need them?

  1. To test if a string contains a given pattern.
  2. To extract the portions matching the given pattern.

How to create them?

Ruby offers three syntaxes to create a regular expression object.

pattern_one = /hello/ 
pattern_two = %r{hello} 
pattern_three = Regexp.new("hello")

They all create an instance of the Regexp class. Personally, I like the first syntax as it keeps the code concise. You can think of the slashes (/) as the quotes around strings.

puts pattern_one.class   # Regexp 
puts pattern_two.class   # Regexp 
puts pattern_three.class # Regexp
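One case where I'd reach for the other two syntaxes is a pattern that contains slashes, or one that is built from a string. Here's a small sketch; the path is made up purely for illustration.

```ruby
# A made-up path, used only for illustration.
path = "/posts/2022/regex"

# With slashes as delimiters, every / inside the pattern must be escaped.
with_slashes = /\/posts\/\d+/

# With %r{}, the same pattern needs no escaping.
with_percent_r = %r{/posts/\d+}

# With Regexp.new, the backslash itself must be escaped inside a
# double-quoted string, hence "\\d".
from_string = Regexp.new("/posts/\\d+")

puts with_slashes.match(path)[0]    # /posts/2022
puts with_percent_r.match(path)[0]  # /posts/2022
puts from_string.match(path)[0]     # /posts/2022
```

All three find the same text; the difference is only in how much escaping you have to type.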

You can pattern match using the =~ operator or the Regexp#match method.

=~ operator

The =~ operator returns the index (counting from 0) of the first position at which the given regular expression pattern occurs. If the pattern doesn't match, it returns nil.

/my/ =~ "Hi, my name is Akshay. This is my blog." 
=> 4 

/my/ =~ "Hi, I am Akshay" 
=> nil

This operator is defined on both String and Regexp, so for simple matches like these the order of the operands doesn't matter.
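A quick sketch to confirm that both orders give the same index, and that the result can be used directly in a condition:

```ruby
line = "Hi, my name is Akshay. This is my blog."

regexp_first = /my/ =~ line
string_first = line =~ /my/

p regexp_first  # 4
p string_first  # 4

# nil is falsy and every integer (including 0) is truthy in Ruby,
# so the return value of =~ works as a condition.
if line =~ /blog/
  puts "mentions a blog"
end
```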

MatchData

The other way to match a pattern against a string is the match method, which returns a MatchData object. It encapsulates the result of matching a regular expression against a string.

Calling the match method on either a Regexp or a String returns an instance of MatchData, or nil if the pattern doesn't match.

line = "this is my blog" 

line.match(/my/) 
=> #<MatchData "my"> 

/my/.match(line) 
=> #<MatchData "my">

MatchData behaves like an array. The first element is the entire matched string.

In this example, the match is my, since the . matches any single character.

md = /m./.match("this is my blog") 

md[0] 
=> "my"

You can access the matched string by calling to_s on the MatchData.

md.to_s 
=> "my"

The subsequent elements contain the values captured by the parenthesized groups in the pattern.

md = /age: (\d+)/.match("age: 31 years") 

md 
=> #<MatchData "age: 31" 1:"31"> 

md[0] 
=> "age: 31" 

md[1] 
=> "31"

The captures method returns the array of captured values.

line = "age: 31 location: victoria" 
pattern = /age: (\d+) location: (\w+)/ 

md = pattern.match(line) 
=> #<MatchData "age: 31 location: victoria" 1:"31" 2:"victoria"> 

md.captures 
=> ["31", "victoria"]
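Since captures returns a plain array, it can be destructured into variables. A minimal sketch, reusing the same line and pattern:

```ruby
line = "age: 31 location: victoria"
md = /age: (\d+) location: (\w+)/.match(line)

# Captured values are always strings, so convert where a number is needed.
age, location = md.captures
puts age.to_i + 1  # 32
puts location      # victoria

# match returns nil when the pattern is not found, so check for nil
# before calling captures on the result.
missing = /age: (\d+)/.match("no age here")
puts missing.nil?  # true
```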

How to check if the text matches a pattern?

Use the match? method when you only need to know whether the pattern matches. It returns true or false instead of a MatchData object.

/R/.match? "Ruby"
=> true

/R/.match? "ruby"
=> false
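Because match? returns a boolean, it fits naturally into conditions and filters. A small sketch with a made-up word list:

```ruby
# A made-up list of words, used only for illustration.
words = ["Ruby", "ruby", "Rails", "crystal"]

# Keep only the words that contain an uppercase R.
with_capital_r = words.select { |word| /R/.match?(word) }
p with_capital_r  # ["Ruby", "Rails"]

# match? also reads well in a plain condition.
puts "found it" if /R/.match?("Ruby")
```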

Metacharacters

These characters have a special meaning in regular expressions. I've put each one on its own line, as using commas to separate them always confused me. The list isn't exhaustive: ^, $, | and the backslash itself are also special.

( 
) 
[ 
] 
{ 
} 
. 
? 
+ 
*

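To match one of them literally, escape it with a backslash (\). In the first example, the unescaped + is read as "one or more of the preceding space", so the pattern never matches the literal plus sign.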

/1 + 2/ =~ "expression: 1 + 2" 
=> nil 

/1 \+ 2/ =~ "expression: 1 + 2" 
=> 12
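When the text to match comes from a variable, escaping by hand isn't practical. Ruby's Regexp.escape does it for you. A minimal sketch, assuming the literal text is the same "1 + 2" as above:

```ruby
text = "expression: 1 + 2"
literal = "1 + 2"

# Building the pattern straight from the string leaves + as a metacharacter.
unescaped = Regexp.new(literal)
p(unescaped =~ text)  # nil

# Regexp.escape returns a copy of the string with special characters escaped.
escaped = Regexp.new(Regexp.escape(literal))
p(escaped =~ text)    # 12
```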

Character Classes

A character class is a set of characters enclosed within square brackets. It specifies the characters that will successfully match a single character from a given input string.

Use character classes to tell the regex engine to match only one out of several characters.

/[bcr]at/.match("cat") 
=> #<MatchData "cat"> 

/[bcr]at/.match("that") 
=> nil 

/gr[ae]y/.match("gray") 
=> #<MatchData "gray"> 

/gr[ae]y/.match("grey") 
=> #<MatchData "grey"> 

/gr[ae]y/.match("gruy") 
=> nil

Some important points about character classes:

  • A character class only matches a single character.
  • The order of characters inside the brackets doesn't matter.

For one or two characters, it's enough to list them in the character class. What if there are many characters? For example, if you want to match all letters from d to m, writing out each letter gets tedious.

In this case, use a range as follows:

md = /[d-m]at/.match("mat") 
=> #<MatchData "mat"> 

md = /[d-m]at/.match("cat") 
=> nil 

# /[a-d]/ is the same as /[abcd]/

To match any character except those listed in the brackets, place a ^ at the beginning. This is known as negation: the character class matches any single character that is not listed in it.

md = /[^bcr]at/.match("hat") 
=> #<MatchData "hat"> 
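One thing that tripped me up: a negated class still has to match one character. It means "some character other than these", not "no character at all". A short sketch:

```ruby
# h is not b, c, or r, so the class matches it.
p(/[^bcr]at/.match?("hat"))  # true

# c is excluded, and no other position in "cat" is followed by "at".
p(/[^bcr]at/.match?("cat"))  # false

# There is no character before "at" for the class to consume.
p(/[^bcr]at/.match?("at"))   # false
```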

A number of common character classes have their own built-in shortcuts.

  • /./ - Any character except a newline
  • /./m - Any character, including a newline (the m modifier enables multiline mode)
  • /\w/ - A word character: [a-zA-Z0-9_]
  • /\W/ - A non-word character: [^a-zA-Z0-9_]
  • /\d/ - A digit character: [0-9]
  • /\D/ - A non-digit character: [^0-9]
  • /\s/ - A whitespace character: [ \t\r\n\f\v]
  • /\S/ - A non-whitespace character: [^ \t\r\n\f\v]
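To see the shortcuts in action, here's a rough sketch that pulls the first matching character out of a made-up sample string, and then checks how the m modifier changes what the dot matches:

```ruby
# A made-up sample string, used only for illustration.
sample = "Order #42 shipped"

p(/\d/.match(sample)[0])  # "4"  first digit
p(/\D/.match(sample)[0])  # "O"  first non-digit
p(/\w/.match(sample)[0])  # "O"  first word character
p(/\W/.match(sample)[0])  # " "  first non-word character
p(/\s/.match(sample)[0])  # " "  first whitespace character
p(/\S/.match(sample)[0])  # "O"  first non-whitespace character

# Without the m modifier, the dot refuses to match a newline.
p(/a.b/.match?("a\nb"))   # false
p(/a.b/m.match?("a\nb"))  # true
```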

Repetition

So far, we've matched a single character. What if we want to match a sequence of one or more characters?

/\d/.match("37signals") 
=> #<MatchData "3">

You can use a repetition metacharacter (also called a quantifier) to specify how many times the preceding element needs to occur. For example, + matches the preceding element one or more times.

/\d+/.match("37signals") 
=> #<MatchData "37">

Here is a list of the available quantifiers.

  • * - Zero or more times
  • + - One or more times
  • ? - Zero or one times (optional)
  • {n} - Exactly n times
  • {n,} - n or more times
  • {,m} - m or fewer times
  • {n,m} - At least n and at most m times
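Here's a quick sketch that exercises each quantifier from the list. The input strings are made up; the comments show the text each pattern matches.

```ruby
p(/\d*/.match("signals")[0])        # ""       zero digits still counts as a match
p(/\d+/.match("37signals")[0])      # "37"     one or more digits
p(/colou?r/.match("color")[0])      # "color"  the u is optional
p(/colou?r/.match("colour")[0])     # "colour"
p(/\d{4}/.match("year 2022")[0])    # "2022"   exactly four digits
p(/\d{2,}/.match("7 and 1234")[0])  # "1234"   the lone 7 is too short
p(/\d{,2}/.match("12345")[0])       # "12"     at most two digits
p(/\d{2,3}/.match("12345")[0])      # "123"    between two and three digits
```

Quantifiers are greedy by default, which is why the last two patterns grab as many digits as they are allowed to.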

Alright, that's all for now. I will continue learning more about regular expressions this week, especially how to capture and group pattern-matched data. So expect a second post in this series soon.

Hope you found this article useful and learned something new.

Merry Christmas,

Akshay
