
How to crawl web pages using Go

Last updated: November 29, 2024

Introduction

Web crawling is the process of programmatically accessing web pages to extract data or gather information. Go is well suited to this task, thanks to its efficiency and built-in concurrency support. In this article, we will build a simple web crawler in Go.

Getting Started with Go

First, ensure you have Go installed on your system. You can download Go from the official website and follow the installation instructions.

$ go version

This command will display the current version of Go installed on your system.

Basic HTTP Request

We will use Go’s net/http package to send HTTP requests. Let's start by making a simple GET request to a web page.

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    response, err := http.Get("http://example.com")
    if err != nil {
        fmt.Printf("Request failed: %s\n", err)
        return
    }
    defer response.Body.Close()

    // io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+).
    body, err := io.ReadAll(response.Body)
    if err != nil {
        fmt.Printf("Reading body failed: %s\n", err)
        return
    }
    fmt.Println(string(body))
}

This code snippet makes a GET request to example.com, reads the response body, and prints it out.
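In practice, a crawler should also identify itself with a User-Agent header and check the status code before reading the body. Here is a minimal sketch; the User-Agent string and helper name are illustrative placeholders, not a real product.

```go
package main

import (
    "fmt"
    "net/http"
)

// newCrawlerRequest builds a GET request with a custom User-Agent,
// which many sites expect from well-behaved crawlers.
// "MyCrawler/1.0" is a placeholder identifier.
func newCrawlerRequest(rawURL string) (*http.Request, error) {
    req, err := http.NewRequest(http.MethodGet, rawURL, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent", "MyCrawler/1.0")
    return req, nil
}

func main() {
    req, err := newCrawlerRequest("http://example.com")
    if err != nil {
        fmt.Println("building request failed:", err)
        return
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()

    // Only proceed if the server actually returned the page.
    if resp.StatusCode != http.StatusOK {
        fmt.Println("unexpected status:", resp.Status)
        return
    }
    fmt.Println("fetched OK")
}
```

Sending the request through an explicit http.Request (rather than http.Get) is what makes the header customizable.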

Parsing HTML Content

To extract specific elements from the HTML content, we'll use the golang.org/x/net/html package, which provides utilities for parsing HTML documents. It lives outside the standard library, so install it first with go get golang.org/x/net/html. Here's an example of how to collect all anchor tags (links) from a page.

package main

import (
    "fmt"
    "golang.org/x/net/html"
    "strings"
)

func extractLinks(body string) []string {
    tokenizer := html.NewTokenizer(strings.NewReader(body))
    var links []string

    for {
        tokenType := tokenizer.Next()
        switch tokenType {
        case html.ErrorToken:
            return links
        case html.StartTagToken:
            token := tokenizer.Token()
            if token.Data == "a" {
                for _, attr := range token.Attr {
                    if attr.Key == "href" {
                        links = append(links, attr.Val)
                    }
                }
            }
        }
    }
}

func main() {
    // Fetch the page like in the previous example
    // Then: 
    body := "..." // Assuming this is the page body you fetched.
    links := extractLinks(body)
    fmt.Println("Found links:", links)
}

The code defines an extractLinks function that returns a list of URLs from anchor tags in the HTML content.
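Note that href attributes are often relative (e.g. ../about or /contact), so before following them a crawler should resolve each one against the URL of the page it came from. A small helper using the standard net/url package can do this:

```go
package main

import (
    "fmt"
    "net/url"
)

// resolveLink turns a possibly relative href into an absolute URL,
// resolved against the page the link appeared on.
func resolveLink(base, href string) (string, error) {
    b, err := url.Parse(base)
    if err != nil {
        return "", err
    }
    h, err := url.Parse(href)
    if err != nil {
        return "", err
    }
    // ResolveReference applies the RFC 3986 resolution rules.
    return b.ResolveReference(h).String(), nil
}

func main() {
    abs, _ := resolveLink("http://example.com/docs/", "../about")
    fmt.Println(abs) // http://example.com/about
}
```

Absolute hrefs pass through unchanged, since ResolveReference returns the reference itself when it is already absolute.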

Building the Crawler

Now, let's bring everything together to build a simple crawler that visits each link it finds and extracts links from those pages recursively.

package main

import (
    "net/url"
    "sync"
)

// visited tracks crawled URLs; the embedded mutex lets concurrent
// goroutines share the map safely.
var visited = struct {
    sync.Mutex
    m map[string]bool
}{m: make(map[string]bool)}

// fetchAndExtractLinks stands in for the fetch-and-parse logic from the
// earlier examples; replace this stub with your own implementation.
var fetchAndExtractLinks = func(u string) []string { return nil }

func crawl(u string, wg *sync.WaitGroup) {
    defer wg.Done()

    // Skip URLs that don't parse.
    if _, err := url.Parse(u); err != nil {
        return
    }

    // Check and mark the URL as visited inside one critical section,
    // so two goroutines can't both claim the same page.
    visited.Lock()
    if visited.m[u] {
        visited.Unlock()
        return
    }
    visited.m[u] = true
    visited.Unlock()

    for _, link := range fetchAndExtractLinks(u) {
        wg.Add(1)
        go crawl(link, wg)
    }
}

func main() {
    var wg sync.WaitGroup
    wg.Add(1)
    crawl("http://example.com", &wg)
    // Wait for every in-flight crawl instead of exiting immediately.
    wg.Wait()
}

In this enhanced crawler, we use goroutines and a mutex-guarded map to track and manage visited URLs. This simple setup can be improved further by respecting robots.txt, restricting the crawl to particular domains, prioritizing new links, and extracting richer data.
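A polite crawler usually stays on one site. One way to sketch that restriction is a small helper that compares the hosts of two URLs with net/url, which the crawl loop can consult before following a link:

```go
package main

import (
    "fmt"
    "net/url"
)

// sameHost reports whether two URLs point at the same host, which is
// how a crawler can keep itself on a single site.
func sameHost(a, b string) bool {
    ua, err := url.Parse(a)
    if err != nil {
        return false
    }
    ub, err := url.Parse(b)
    if err != nil {
        return false
    }
    // An empty hostname (e.g. a relative URL) never matches.
    return ua.Hostname() != "" && ua.Hostname() == ub.Hostname()
}

func main() {
    fmt.Println(sameHost("http://example.com/a", "http://example.com/b")) // true
    fmt.Println(sameHost("http://example.com", "http://other.org"))       // false
}
```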

Conclusion

This guide provided a basic walkthrough of creating a simple web crawler in Go. While this code can serve as a starting point, web crawling involves handling numerous practical challenges like pagination, errors, duplicate content, and dynamic data. Further enhancements can include adding concurrency controls, integrating databases for storage, and implementing sophisticated data extraction methods.

