This concise article shows you how to get the hostname and the protocol from a given URL in Python.
Before writing code, let me clarify some things. Suppose we have a URL like this:
https://www.slingacademy.com/cat/sample-data
Then the components of the URL are:
- https: The protocol.
- www.slingacademy.com: The hostname.
- www: The subdomain.
- slingacademy.com: The domain (domain name).
- /cat/sample-data: The path segment.
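To make the breakdown concrete, here is a minimal sketch that pulls the same pieces out of the sample URL with the standard library's urlparse function (covered in the next section). The subdomain/domain split is only an illustrative assumption: it treats everything before the first dot as the subdomain, which is not correct for hosts without a subdomain or for suffixes such as co.uk.

```python
from urllib.parse import urlparse

url = "https://www.slingacademy.com/cat/sample-data"
parts = urlparse(url)

protocol = parts.scheme    # the part before "://"
hostname = parts.hostname  # the full host name
path = parts.path          # everything after the host, up to any "?" or "#"

# Naive split (an assumption for illustration only): the first label is
# treated as the subdomain and the remainder as the domain.
subdomain, _, domain = hostname.partition(".")

print(protocol)   # https
print(hostname)   # www.slingacademy.com
print(subdomain)  # www
print(domain)     # slingacademy.com
print(path)       # /cat/sample-data
```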
Using the urllib.parse module
Python's built-in urllib.parse module provides tools for URL parsing. It handles a variety of URL formats and protocols.
Here are the steps to extract the hostname and protocol from a URL:
- Import the urlparse function from the urllib.parse module.
- Parse the URL using the urlparse function.
- Access the hostname and scheme attributes of the parsed URL object.
A code example is worth more than thousands of boring words:
from urllib.parse import urlparse


def extract_hostname_and_protocol(url):
    parsed_url = urlparse(url)
    hostname = parsed_url.hostname
    protocol = parsed_url.scheme
    return hostname, protocol


# Test it out
url1 = 'https://www.slingacademy.com/cat/sample-data/'
hostname_1, protocol_1 = extract_hostname_and_protocol(url1)
print(f"Hostname: {hostname_1}")
print(f"Protocol: {protocol_1}")

url2 = "https://api.slingacademy.com/v1/examples/sample-page.html"
hostname_2, protocol_2 = extract_hostname_and_protocol(url2)
print(f"Hostname: {hostname_2}")
print(f"Protocol: {protocol_2}")
Output:
Hostname: www.slingacademy.com
Protocol: https
Hostname: api.slingacademy.com
Protocol: https
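A few behaviors of urlparse are worth knowing before relying on it. The sketch below uses a made-up URL (the credentials, host, and port are assumptions for illustration) to show that hostname is lowercased and excludes credentials and the port, while netloc keeps them. It also shows that a string without a scheme is treated as a path, so hostname comes back as None.

```python
from urllib.parse import urlparse

# Hypothetical URL with credentials, a mixed-case host, and an explicit port
parsed = urlparse("https://user:secret@WWW.Example.com:8443/path?q=1")
print(parsed.scheme)    # https
print(parsed.hostname)  # www.example.com
print(parsed.port)      # 8443
print(parsed.netloc)    # user:secret@WWW.Example.com:8443

# Without "scheme://", the whole string is parsed as a path
bare = urlparse("www.slingacademy.com/cat/sample-data")
print(repr(bare.scheme))  # ''
print(bare.hostname)      # None
print(bare.path)          # www.slingacademy.com/cat/sample-data
```

If your input may lack a scheme, check for a None hostname before using the result.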
Using regular expressions
The preceding approach is elegant and works well. However, it isn’t the only possible way to get the job done. An alternative solution is to use a regular expression.
The steps are:
- Import the re module for regular expressions.
- Define a regular expression pattern to match the hostname and protocol.
- Use the re.search function to find the pattern within the URL string.
- Extract the matched groups for hostname and protocol.
Here’s the pattern we’ll use:
pattern = r"^(?P<protocol>https?)://(?P<hostname>[^/]+)"
Let me explain the pattern above:
- ^: Start-of-string anchor.
- (?P<protocol>https?): Named capturing group protocol to match the protocol. It matches http or https, using the ? quantifier to make the s optional.
- ://: Matches the colon and double slashes.
- (?P<hostname>[^/]+): Named capturing group hostname to match the hostname. It matches one or more characters that are not a forward slash (/), indicating the hostname portion of the URL.
Code example:
import re


def extract_hostname_and_protocol(url):
    pattern = r"^(?P<protocol>https?)://(?P<hostname>[^/]+)"
    match = re.search(pattern, url)
    if match:
        protocol = match.group("protocol")
        hostname = match.group("hostname")
        return hostname, protocol
    return None, None


# Test it out
url1 = 'https://www.slingacademy.com/cat/sample-data/'
hostname_1, protocol_1 = extract_hostname_and_protocol(url1)
print(f"Hostname: {hostname_1}")
print(f"Protocol: {protocol_1}")

url2 = "http://api.slingacademy.com/v1/examples/sample-page.html"
hostname_2, protocol_2 = extract_hostname_and_protocol(url2)
print(f"Hostname: {hostname_2}")
print(f"Protocol: {protocol_2}")
Output:
Hostname: www.slingacademy.com
Protocol: https
Hostname: api.slingacademy.com
Protocol: http
Regular expressions let you handle rare and specific use cases by crafting a custom pattern. However, writing a correct pattern can be tough, even for experienced programmers.
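As an illustration of how easily a pattern misses cases: the pattern above stops only at a forward slash, so for a URL such as http://api.slingacademy.com:8080/v1 it would return api.slingacademy.com:8080 as the hostname. The following is one possible, more permissive sketch. The pattern, the URL_PATTERN name, and the extract_parts helper are assumptions made for this example, not part of any library. It accepts any scheme, skips optional credentials, and separates the port, but it still does not handle bracketed IPv6 hosts.

```python
import re

# Hypothetical, more permissive pattern (an assumption for illustration)
URL_PATTERN = re.compile(
    r"^(?P<protocol>[A-Za-z][A-Za-z0-9+.-]*)://"  # any scheme, e.g. http, ftp
    r"(?:[^/?#@]*@)?"                              # optional user:password@
    r"(?P<hostname>[^/?#:]+)"                      # host stops at / ? # or :
    r"(?::(?P<port>\d+))?"                         # optional :port
)


def extract_parts(url):
    """Return a dict with protocol, hostname, and port, or None if no match."""
    match = URL_PATTERN.match(url)
    if match is None:
        return None
    port = match.group("port")
    return {
        "protocol": match.group("protocol"),
        "hostname": match.group("hostname"),
        "port": int(port) if port else None,
    }


print(extract_parts("http://api.slingacademy.com:8080/v1/examples"))
print(extract_parts("https://www.slingacademy.com?page=2"))
print(extract_parts("not a url"))
```

For general-purpose parsing, urlparse remains the safer choice; a custom pattern like this makes sense mainly when you need to enforce your own rules about which URLs are acceptable.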