Python: Get Hostname and Protocol from a URL

Updated: June 2, 2023 By: Khue

This concise article shows you how to get the hostname and the protocol from a given URL in Python.

Before writing code, let me clarify some things. Suppose we have a URL like this:

https://www.slingacademy.com/cat/sample-data

Then the components of the URL are:

  • https: The protocol.
  • www.slingacademy.com: The hostname.
  • www: The subdomain.
  • slingacademy.com: The domain (domain name).
  • /cat/sample-data: The path segment.
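As a quick illustration of the breakdown above, here is a minimal sketch that pulls those pieces out of the same URL. The subdomain/domain split is a naive assumption that only holds for hostnames shaped like this one (a single subdomain label in front of the domain); it is not a general rule.

```python
from urllib.parse import urlparse

url = "https://www.slingacademy.com/cat/sample-data"
parts = urlparse(url)

protocol = parts.scheme    # the part before "://"
hostname = parts.hostname  # the full host
path = parts.path          # everything after the host

# Naive split (assumption): the first label is the subdomain,
# and the rest is the domain. This breaks for hosts such as "example.co.uk".
subdomain, _, domain = hostname.partition(".")

print(protocol)   # https
print(hostname)   # www.slingacademy.com
print(subdomain)  # www
print(domain)     # slingacademy.com
print(path)       # /cat/sample-data
```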

Using the urllib.parse module

Python’s built-in urllib.parse module provides ready-made tools for URL parsing. It is not tied to a particular protocol, so the same code works for http, https, ftp, and other URLs.

Here are the steps to extract the hostname and protocol from a URL:

  1. Import the urlparse function from the urllib.parse module.
  2. Parse the URL using the urlparse function.
  3. Access the hostname and scheme attributes from the parsed URL object.

A code example is worth more than thousands of boring words:

from urllib.parse import urlparse

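def extract_hostname_and_protocol(url):
    """Return (hostname, protocol) for the given URL string."""
    parsed_url = urlparse(url)
    # hostname is None when the URL has no network location part
    hostname = parsed_url.hostname
    # scheme is an empty string when the URL has no protocol
    protocol = parsed_url.scheme
    return hostname, protocol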


# Test it out
url1 = 'https://www.slingacademy.com/cat/sample-data/'
hostname_1, protocol_1 = extract_hostname_and_protocol(url1)
print(f"Hostname: {hostname_1}")
print(f"Protocol: {protocol_1}")

url2 = "https://api.slingacademy.com/v1/examples/sample-page.html"
hostname_2, protocol_2 = extract_hostname_and_protocol(url2)
print(f"Hostname: {hostname_2}")
print(f"Protocol: {protocol_2}")

Output:

Hostname: www.slingacademy.com
Protocol: https
Hostname: api.slingacademy.com
Protocol: https
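The example URLs above are well behaved. The sketch below shows how urlparse treats two less tidy inputs. Both URLs are hypothetical and chosen only for illustration: one carries credentials, mixed-case letters, and a port, and the other has no protocol at all.

```python
from urllib.parse import urlparse

# Hypothetical URL with credentials, mixed case, and a port
full = urlparse("https://user:secret@API.SlingAcademy.com:8443/v1/examples")
print(full.netloc)    # user:secret@API.SlingAcademy.com:8443
print(full.hostname)  # api.slingacademy.com (lowercased, no credentials or port)
print(full.port)      # 8443

# Without "scheme://", the whole string is treated as a path
bare = urlparse("www.slingacademy.com/cat/sample-data")
print(repr(bare.scheme))  # ''
print(bare.hostname)      # None
print(bare.path)          # www.slingacademy.com/cat/sample-data
```

If your input may lack a protocol, check for an empty scheme or a None hostname before using the result.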

Using regular expressions

The preceding approach is elegant and works well. However, it isn’t the only possible way to get the job done. An alternative solution is to use a regular expression.

The steps are:

  1. Import the re module for regular expressions.
  2. Define a regular expression pattern to match the hostname and protocol.
  3. Use the re.search function to find the pattern within the URL string.
  4. Extract the matched groups for hostname and protocol.

Here’s the pattern we’ll use:

pattern = r"^(?P<protocol>https?)://(?P<hostname>[^/]+)"

Let me explain the pattern above:

  • ^ : Start of the string anchor.
  • (?P<protocol>https?) : Named capturing group protocol to match the protocol. It matches http or https using the ? quantifier to make the s optional.
  • ://: Matches the colon and double slashes.
  • (?P<hostname>[^/]+) : Named capturing group hostname to match the hostname. It matches one or more characters that are not a forward slash (/), that is, everything between :// and the first slash. If the URL contains a port or credentials, they are captured here as well.
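To make the explanation concrete, here is a small sketch that applies the pattern directly and inspects the named groups. The second and third URLs are hypothetical inputs added to show where the pattern's boundaries lie.

```python
import re

pattern = r"^(?P<protocol>https?)://(?P<hostname>[^/]+)"

match = re.search(pattern, "https://www.slingacademy.com/cat/sample-data")
print(match.groupdict())
# {'protocol': 'https', 'hostname': 'www.slingacademy.com'}

# Hypothetical URL with a port: the port lands in the hostname group
with_port = re.search(pattern, "http://localhost:8000/docs")
print(with_port.group("hostname"))  # localhost:8000

# Hypothetical non-HTTP URL: the pattern only accepts http and https
no_match = re.search(pattern, "ftp://files.example.com/pub")
print(no_match)  # None
```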

Code example:

import re

def extract_hostname_and_protocol(url):
    pattern = r"^(?P<protocol>https?)://(?P<hostname>[^/]+)"
    match = re.search(pattern, url)
    if match:
        protocol = match.group("protocol")
        hostname = match.group("hostname")
        return hostname, protocol
    return None, None


# Test it out
url1 = 'https://www.slingacademy.com/cat/sample-data/'
hostname_1, protocol_1 = extract_hostname_and_protocol(url1)
print(f"Hostname: {hostname_1}")
print(f"Protocol: {protocol_1}")

url2 = "http://api.slingacademy.com/v1/examples/sample-page.html"
hostname_2, protocol_2 = extract_hostname_and_protocol(url2)
print(f"Hostname: {hostname_2}")
print(f"Protocol: {protocol_2}")

Output:

Hostname: www.slingacademy.com
Protocol: https
Hostname: api.slingacademy.com
Protocol: http

Regular expressions let you handle rare and specific use cases by crafting your own pattern. However, writing a correct pattern can be tough, even for experienced programmers.
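As one example of such a custom pattern, the sketch below accepts any protocol and leaves credentials and the port out of the hostname. The pattern, the BROAD_PATTERN name, and the extract_any_scheme function are illustrative assumptions rather than part of the approach above, and the pattern does not handle bracketed IPv6 hosts.

```python
import re

# Hypothetical, broader pattern (an assumption for illustration)
BROAD_PATTERN = re.compile(
    r"^(?P<protocol>[A-Za-z][A-Za-z0-9+.-]*)://"  # any scheme, e.g. ftp
    r"(?:[^/?#@]*@)?"                              # optional user:password@
    r"(?P<hostname>[^/?#:]+)"                      # host, stopping at port, path, query, or fragment
)

def extract_any_scheme(url):
    match = BROAD_PATTERN.search(url)
    if match:
        return match.group("hostname"), match.group("protocol")
    return None, None

print(extract_any_scheme("ftp://user:pw@files.example.com:21/pub"))
# ('files.example.com', 'ftp')
print(extract_any_scheme("https://www.slingacademy.com?page=2"))
# ('www.slingacademy.com', 'https')
print(extract_any_scheme("not a url"))
# (None, None)
```

Each extra case makes the pattern harder to read, which is a good reason to prefer urllib.parse unless you need behavior it does not offer.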