PHP: Extract URLs from a string

Updated: January 9, 2024 By: Guest Contributor

Introduction

As you navigate the web programmatically or parse text data in PHP, you will often need to extract URLs from strings. This is particularly useful for web scraping, data migration, and building SEO tools. In this tutorial, we will explore several ways to do this in PHP, moving from basic techniques to more advanced ones.

Basic URL Extraction

To start, we will look at the simplest way to extract URLs using PHP’s built-in functions. The preg_match_all() function searches a string for every match of a regular expression. A basic regular expression for URL extraction looks like this:

// The input string containing URLs
$string = 'Check out https://www.example.com and http://www.foo.com.';

// Regular Expression Pattern for a basic URL
$pattern = '/\b(?:https?:\/\/)[a-zA-Z0-9\.\-]+(?:\.[a-zA-Z]{2,})(?:\/\S*)?/';

// Array to hold the matched URLs
$matches = [];

// Perform the pattern match
preg_match_all($pattern, $string, $matches);

// Print the matches
print_r($matches[0]);
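
For reuse, the steps above can be wrapped in a small function. This is a minimal sketch: the name extractUrls() is hypothetical (it is not part of PHP), and removing duplicates with array_unique() is an added assumption about what callers usually want.

```php
<?php
// Hypothetical helper: returns every http(s) URL found in $text, without duplicates.
function extractUrls(string $text): array
{
    // Same basic pattern as in the example above
    $pattern = '/\b(?:https?:\/\/)[a-zA-Z0-9\.\-]+(?:\.[a-zA-Z]{2,})(?:\/\S*)?/';

    // preg_match_all() returns the number of matches, or false on error
    if (preg_match_all($pattern, $text, $matches) === false) {
        return [];
    }

    // $matches[0] holds the full matches; drop repeats and reindex from 0
    return array_values(array_unique($matches[0]));
}

print_r(extractUrls('Check out https://www.example.com and http://www.foo.com.'));
```

With the sample string, this should yield https://www.example.com and http://www.foo.com. The final period is left out because the pattern only accepts letters after the last dot of the host name.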

Improved URL Extraction with Regex

We can loosen the regular expression so that it also matches URLs written without a protocol, and so that the path may carry a query string:

// Pattern with an optional protocol and an optional "www." prefix
$pattern = '/\b(?:https?:\/\/)?(?:www\.)?[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}(?:\/[\w\/.?=&%#~+-]*)?/';
// Rest of the code is the same...

This regex treats the protocol and the www. prefix as optional and accepts common path and query-string characters. The trade-off is more false positives, since any text shaped like name.ext now matches as well.
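
To see what such a pattern actually returns, it helps to run it on text that mixes real URLs with look-alikes. The sketch below uses an illustrative pattern and sample text (both assumptions made for this example), trims trailing punctuation from each match, and then keeps only the matches that carry an explicit http or https protocol.

```php
<?php
// Illustrative pattern: optional protocol, optional "www.", host, optional path/query.
$pattern = '/\b(?:https?:\/\/)?(?:www\.)?[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}(?:\/[\w\/.?=&%#~+-]*)?/';

// Sample text, made up for this example
$text = 'Docs at example.com/guide?id=7. Also see notes.txt and https://www.foo.com/a/b.';

preg_match_all($pattern, $text, $matches);

// Strip punctuation that the path part may have swallowed at the end of a sentence
$candidates = array_map(function (string $url): string {
    return rtrim($url, '.,;:!?');
}, $matches[0]);

// Keep only candidates that start with an explicit protocol
$withProtocol = array_values(array_filter($candidates, function (string $url): bool {
    return preg_match('/^https?:\/\//i', $url) === 1;
}));

print_r($candidates);
print_r($withProtocol);
```

Here $candidates should contain example.com/guide?id=7, notes.txt, and https://www.foo.com/a/b, which shows a file name slipping through. $withProtocol keeps only the last of the three.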

Using PHP Filters

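Beyond regex, PHP provides filters that can validate and sanitize data, including URLs. Here we split the string into tokens and use filter_var() with the FILTER_VALIDATE_URL filter to keep only the tokens that are valid URLs. Note that this filter requires a scheme, so a bare www.example.com will not pass: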

// Split the input string on whitespace (adjust the delimiter to suit your input)
$parts = preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);

// Array to hold valid URLs
$validURLs = [];

foreach ($parts as $part) {
    // Strip punctuation that trails a URL at the end of a sentence or clause
    $part = rtrim($part, '.,;:!?');
    if (filter_var($part, FILTER_VALIDATE_URL) !== false) {
        $validURLs[] = $part;
    }
}

// Print the valid URLs
print_r($validURLs);
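
FILTER_VALIDATE_URL accepts any syntactically valid URL regardless of scheme, so ftp:// or mailto: addresses pass too. If only web links are wanted, one option is to check the scheme with parse_url() after validation. The helper name isHttpUrl() below is hypothetical, and the sample inputs are made up for illustration.

```php
<?php
// Hypothetical helper: true only for syntactically valid http:// or https:// URLs.
function isHttpUrl(string $candidate): bool
{
    if (filter_var($candidate, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // parse_url() returns the scheme as a string, or null when there is none
    $scheme = parse_url($candidate, PHP_URL_SCHEME);

    return in_array(strtolower((string) $scheme), ['http', 'https'], true);
}

// Sample inputs, made up for this example
$samples = ['https://www.example.com', 'ftp://example.com/file.zip', 'www.example.com'];
$webUrls = array_values(array_filter($samples, 'isHttpUrl'));

print_r($webUrls);
```

Of the three samples, only https://www.example.com should remain: the FTP address fails the scheme check, and the bare host name fails validation because it has no scheme at all.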

Advanced URL Extraction

In more complex scenarios, such as dealing with encoded URLs or URLs embedded within scripts or styles, additional parsing logic is required. Libraries or functions capable of more deeply understanding the structure of HTML can help:

// For example, using the PHP Simple HTML DOM Parser:

// Assume we're using the simple_html_dom library available through Composer. Be sure you have included the library in your project.

// Create a DOM object from a string of HTML (unlike the earlier examples, $string must contain markup)
$html = str_get_html($string);

// str_get_html() returns false when the input cannot be parsed (for example, an empty string)
if ($html !== false) {
    // Find all the links
    foreach ($html->find('a') as $element) {
        echo $element->href . "\n";
    }
}

// URLs inside script or style blocks, and encoded URLs, are not covered by this loop
// Additional parsing logic here

For robust HTML parsing you will need to handle more cases than this, and possibly bring in additional libraries.
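
If adding a third-party parser is not an option, PHP’s bundled DOM extension can do a similar job. The sketch below swaps simple_html_dom for DOMDocument; the function name extractLinksFromHtml() and the sample markup are assumptions made for illustration. It relies on the DOM extension being enabled, which is the default in most PHP builds.

```php
<?php
// Hypothetical helper: returns the href value of every <a> element in $html.
function extractLinksFromHtml(string $html): array
{
    // loadHTML() rejects empty input, so return early
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them for imperfect markup
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }

    return $links;
}

// Sample markup, made up for this example
$html = '<p>See <a href="https://www.example.com/docs?a=1&amp;b=2">docs</a> and <a href="/about">about</a>.</p>';
print_r(extractLinksFromHtml($html));
```

Attribute values come back entity-decoded, so &amp; in the markup is returned as a plain &. Relative links such as /about are returned as written and would still need to be resolved against a base URL.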

Conclusion

In this tutorial, we covered ways to extract URLs from strings in PHP, starting from a simple regex and progressing to advanced methods utilizing PHP’s native functions and external libraries. By now, you should have a good understanding of how to approach this common task and adapt the examples to fit more complex scenarios or specific requirements in your projects.