PHP regex with word boundaries

Overview
Understanding Regex Word Boundaries
Basic Regex with Word Boundaries
Finding Words at the Start or End of a String
Matching Whole Words in a Case-Insensitive Manner
Using Word Boundaries with Character Classes
Advanced Regex: Capturing Groups with Word Boundaries
Regex Word Boundaries and Unicode Characters
Conclusion

Overview

Regular expressions (regex) in PHP provide a powerful method for pattern matching in strings, including the ability to match entire words by using word boundaries. This tutorial will walk you through the essentials of utilizing word boundaries in PHP regex through increasingly complex examples.

Understanding Regex Word Boundaries

In regex, \b denotes a word boundary: the position between a word character (a letter, digit, or underscore) and a non-word character, or between a word character and the start or end of the string. Note that \b is zero-width; it matches a position and does not consume any characters in the string.

$pattern = '/\bword\b/';

Understanding where word boundaries lie is critical for crafting effective regex patterns in PHP. For instance, the pattern above will match ‘word’ in ‘This is a word.’ but not in ‘Swordfish’.
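As a quick check of the claim above, the following sketch runs the same kind of pattern against both strings. The helper name containsWholeWord is made up for this illustration; preg_quote and preg_match are standard PHP functions.

```php
<?php
// Hypothetical helper (not part of PHP): true if $word occurs as a whole word in $text.
function containsWholeWord(string $word, string $text): bool
{
    // preg_quote escapes regex metacharacters in $word; '/' is the delimiter used here.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';
    return preg_match($pattern, $text) === 1;
}

var_dump(containsWholeWord('word', 'This is a word.')); // bool(true)
var_dump(containsWholeWord('word', 'Swordfish'));       // bool(false)
```

In ‘Swordfish’ the letters on both sides of ‘word’ are word characters, so there is no boundary for \b to match.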

Basic Regex with Word Boundaries

Let’s start with a simple example:

$string = 'The quick brown fox jumps over the lazy dog';
$pattern = '/\bquick\b/';
if (preg_match($pattern, $string)) {
  echo 'Match found!';
} else {
  echo 'No match found.';
}

This code will output ‘Match found!’ because ‘quick’ is a separate word bounded by spaces, which are non-word characters.

Finding Words at the Start or End of a String

Regex word boundaries can also be used to match words at the beginning or end of a string.

$string = 'Welcome to the club';
$pattern = '/\bWelcome\b/';

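In this case, the start of the string counts as the boundary before ‘Welcome’, and the position before the following space is the boundary after it. Note that \b alone does not anchor the match: this pattern finds ‘Welcome’ wherever it appears as a whole word. To require it at the start of the string, add the ^ anchor, as in '/^Welcome\b/'. We can match the last word ‘club’ at the end of a string in a similar fashion: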

$pattern = '/\bclub\b$/';

Here, the dollar sign $ anchors the match to the end of the string (by default it also matches just before a trailing newline).
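The difference between a plain word boundary and an anchored one can be sketched as follows. The variable names are illustrative only; ^ and \z are standard PCRE anchors, with \z matching strictly at the very end of the subject.

```php
<?php
$string = 'Welcome to the club';

// ^ anchors the match to the start of the string; \b closes the word.
$startsWithWelcome = preg_match('/^Welcome\b/', $string) === 1;

// \z matches only at the very end of the subject (no trailing newline allowed).
$endsWithClub = preg_match('/\bclub\z/', $string) === 1;

// Without an anchor, \b matches the whole word anywhere in the string.
$clubAnywhere = preg_match('/\bclub\b/', 'The club is open') === 1;
// With $, the same word must also be the last thing in the string.
$clubAtEnd = preg_match('/\bclub\b$/', 'The club is open') === 1;

var_dump($startsWithWelcome, $endsWithClub, $clubAnywhere, $clubAtEnd);
// prints bool(true), bool(true), bool(true), bool(false)
```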

Matching Whole Words in a Case-Insensitive Manner

Sometimes your search needs to be case-insensitive. This is where modifiers come into play.

$string = 'Case matters not.';
$pattern = '/\bmatters\b/i';

Adding the i modifier after the closing delimiter makes the pattern case-insensitive, so it will match ‘matters’, ‘Matters’, ‘maTTers’, etc.
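A small sketch of the same pattern applied to a few made-up subjects shows that the i modifier relaxes letter case but leaves the word boundaries in force:

```php
<?php
$pattern = '/\bmatters\b/i';

// Sample subjects invented for this illustration.
$lower  = preg_match($pattern, 'Case matters not.') === 1;  // whole word, lower case
$upper  = preg_match($pattern, 'MATTERS of state') === 1;   // whole word, upper case
$inside = preg_match($pattern, 'Greymatters Ltd') === 1;    // no boundary before 'm'

var_dump($lower, $upper, $inside);
// prints bool(true), bool(true), bool(false)
```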

Using Word Boundaries with Character Classes

Character classes can also be used with word boundaries. Here’s an example where we want to match any word that starts with a ‘c’ or ‘C’.

$string = 'Cats and dogs.';
$pattern = '/\b[cC]\w+\b/';

The \w+ part matches one or more word characters following ‘c’ or ‘C’, so a lone ‘c’ is not matched. In ‘Cats and dogs.’ the pattern matches ‘Cats’.
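To collect every such word rather than just test for one, preg_match_all can be used with the same pattern. The sample sentence below is invented for this sketch:

```php
<?php
$string = 'Cats and cuddly dogs chase a cat, c and Doc.';

// preg_match_all stores all full-pattern matches in $matches[0].
preg_match_all('/\b[cC]\w+\b/', $string, $matches);

print_r($matches[0]);
// $matches[0] is ['Cats', 'cuddly', 'chase', 'cat']
```

The lone ‘c’ is skipped because \w+ needs at least one more character, and the ‘c’ in ‘Doc’ is skipped because there is no word boundary in front of it.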

Advanced Regex: Capturing Groups with Word Boundaries

By using parentheses, you can create capturing groups whose contents can be referenced later.

$string = 'When in Rome, do as the Romans.';
$pattern = '/\b(Romans?)\b/';
if (preg_match($pattern, $string, $matches)) {
  echo 'Match found: ' . $matches[1];
}

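The question mark makes only the preceding ‘s’ optional, so the pattern matches ‘Roman’ or ‘Romans’ as whole words, but not ‘Rome’. In this string, preg_match finds ‘Romans’, stores it in the first capturing group, and the code prints ‘Match found: Romans’.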
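If the goal is to capture both ‘Rome’ and ‘Romans’, one possible variation (a sketch, not the only way to write it) uses alternation inside the group together with preg_match_all:

```php
<?php
$string = 'When in Rome, do as the Romans.';

// 'Rom' followed by either 'e' or 'an' with an optional 's'.
// (?: ... ) is a non-capturing group, so only the outer parentheses form group 1.
preg_match_all('/\b(Rom(?:e|ans?))\b/', $string, $matches);

print_r($matches[1]);
// $matches[1] is ['Rome', 'Romans']
```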

Regex Word Boundaries and Unicode Characters

To handle word boundaries around non-ASCII characters, add the u modifier to the pattern.

$string = 'Déjà vu';
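$pattern = '/\bD\xE9j\xE0\b/u';

The u modifier makes PHP treat both the pattern and the subject as UTF-8, so \xE9 and \xE0 stand for the characters é and à rather than single bytes, and \b recognizes accented letters as word characters. The pattern is case-sensitive, so it is written with a capital D to match ‘Déjà’; add the i modifier to match ‘déjà’ as well.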
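The effect of the u modifier on \b can be checked with a short sketch. It assumes PHP 7.0 or later for the "\u{...}" string escapes, which are used here only so that the example does not depend on the encoding of the source file.

```php
<?php
// Build the UTF-8 subject 'Déjà vu' from code point escapes (PHP 7.0+).
$subject = "D\u{E9}j\u{E0} vu";

// With u, \x{E9} denotes code point U+00E9 and \b sees 'à' as a word character,
// so there is a boundary between 'à' and the following space.
$wholeWord = preg_match('/\bD\x{E9}j\x{E0}\b/u', $subject, $m);

// 'j' sits between two letters, so with u there is no boundary in front of it.
$boundaryInsideWord = preg_match('/\bj/u', $subject);

var_dump($wholeWord, $boundaryInsideWord);
// prints int(1), int(0)
```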

Conclusion

Word boundaries are a fundamental tool for precise, whole-word string matching in PHP regex. As demonstrated, \b combines with anchors, modifiers, character classes, and capturing groups, each adding control over what is matched. By understanding and applying these principles, you’ll be able to perform more sophisticated text processing tasks in PHP.
