
How to crawl web pages using Go

Last updated: November 29, 2024

Introduction

Web crawling is the process of programmatically accessing web pages to extract data or gather information. Go is well suited to this task, thanks to its efficiency and built-in concurrency support. In this article, we will build a simple web crawler in Go.

Getting Started with Go

First, ensure you have Go installed on your system. You can download Go from the official website and follow the installation instructions.

$ go version

This command will display the current version of Go installed on your system.

Basic HTTP Request

We will use Go’s net/http package to send HTTP requests. Let's start by making a simple GET request to a web page.

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    response, err := http.Get("http://example.com")
    if err != nil {
        fmt.Printf("Request failed: %s\n", err)
        return
    }
    defer response.Body.Close()

    // io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+).
    body, err := io.ReadAll(response.Body)
    if err != nil {
        fmt.Printf("Reading body failed: %s\n", err)
        return
    }
    fmt.Println(string(body))
}

This code snippet makes a GET request to example.com, reads the response body, and prints it out.
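In practice, a crawler should also identify itself with a User-Agent header and check the status code before reading the body. Here is a minimal sketch; the User-Agent string and helper name are illustrative placeholders, not a real product.

```go
package main

import (
    "fmt"
    "net/http"
)

// newCrawlerRequest builds a GET request with a custom User-Agent,
// which many sites expect from well-behaved crawlers.
// "MyCrawler/1.0" is a placeholder identifier.
func newCrawlerRequest(rawURL string) (*http.Request, error) {
    req, err := http.NewRequest(http.MethodGet, rawURL, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent", "MyCrawler/1.0")
    return req, nil
}

func main() {
    req, err := newCrawlerRequest("http://example.com")
    if err != nil {
        fmt.Println("building request failed:", err)
        return
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()

    // Only proceed if the server actually returned the page.
    if resp.StatusCode != http.StatusOK {
        fmt.Println("unexpected status:", resp.Status)
        return
    }
    fmt.Println("fetched OK")
}
```

Sending the request through an explicit http.Request (rather than http.Get) is what makes the header customizable.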

Parsing HTML Content

To extract specific elements from the HTML content, we'll use the golang.org/x/net/html package, which provides utilities for parsing HTML documents. It lives outside the standard library, so install it first with go get golang.org/x/net/html. Here's an example of how to collect all anchor tags (links) from a page.

package main

import (
    "fmt"
    "golang.org/x/net/html"
    "strings"
)

func extractLinks(body string) []string {
    tokenizer := html.NewTokenizer(strings.NewReader(body))
    var links []string

    for {
        tokenType := tokenizer.Next()
        switch tokenType {
        case html.ErrorToken:
            return links
        case html.StartTagToken:
            token := tokenizer.Token()
            if token.Data == "a" {
                for _, attr := range token.Attr {
                    if attr.Key == "href" {
                        links = append(links, attr.Val)
                    }
                }
            }
        }
    }
}

func main() {
    // Fetch the page like in the previous example
    // Then: 
    body := "..." // Assuming this is the page body you fetched.
    links := extractLinks(body)
    fmt.Println("Found links:", links)
}

The code defines an extractLinks function that returns a list of URLs from anchor tags in the HTML content.
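Note that href attributes are often relative (e.g. ../about or /contact), so before following them a crawler should resolve each one against the URL of the page it came from. A small helper using the standard net/url package can do this:

```go
package main

import (
    "fmt"
    "net/url"
)

// resolveLink turns a possibly relative href into an absolute URL,
// resolved against the page the link appeared on.
func resolveLink(base, href string) (string, error) {
    b, err := url.Parse(base)
    if err != nil {
        return "", err
    }
    h, err := url.Parse(href)
    if err != nil {
        return "", err
    }
    // ResolveReference applies the RFC 3986 resolution rules.
    return b.ResolveReference(h).String(), nil
}

func main() {
    abs, _ := resolveLink("http://example.com/docs/", "../about")
    fmt.Println(abs) // http://example.com/about
}
```

Absolute hrefs pass through unchanged, since ResolveReference returns the reference itself when it is already absolute.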

Building the Crawler

Now, let's bring everything together to build a simple crawler that visits each link it finds and extracts links from those pages recursively.

package main

import (
    "net/url"
    "sync"
)

// visited tracks crawled URLs; the embedded mutex lets concurrent
// goroutines share the map safely.
var visited = struct {
    sync.Mutex
    m map[string]bool
}{m: make(map[string]bool)}

// fetchAndExtractLinks stands in for the fetch-and-parse logic from the
// earlier examples; replace this stub with your own implementation.
var fetchAndExtractLinks = func(u string) []string { return nil }

func crawl(u string, wg *sync.WaitGroup) {
    defer wg.Done()

    // Skip URLs that don't parse.
    if _, err := url.Parse(u); err != nil {
        return
    }

    // Check and mark the URL as visited inside one critical section,
    // so two goroutines can't both claim the same page.
    visited.Lock()
    if visited.m[u] {
        visited.Unlock()
        return
    }
    visited.m[u] = true
    visited.Unlock()

    for _, link := range fetchAndExtractLinks(u) {
        wg.Add(1)
        go crawl(link, wg)
    }
}

func main() {
    var wg sync.WaitGroup
    wg.Add(1)
    crawl("http://example.com", &wg)
    // Wait for every in-flight crawl instead of exiting immediately.
    wg.Wait()
}

In this enhanced crawler, we use goroutines and a mutex-guarded map to track and manage visited URLs. This simple setup can be improved further by respecting robots.txt, restricting the crawl to particular domains, prioritizing new links, and extracting richer data.
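A polite crawler usually stays on one site. One way to sketch that restriction is a small helper that compares the hosts of two URLs with net/url, which the crawl loop can consult before following a link:

```go
package main

import (
    "fmt"
    "net/url"
)

// sameHost reports whether two URLs point at the same host, which is
// how a crawler can keep itself on a single site.
func sameHost(a, b string) bool {
    ua, err := url.Parse(a)
    if err != nil {
        return false
    }
    ub, err := url.Parse(b)
    if err != nil {
        return false
    }
    // An empty hostname (e.g. a relative URL) never matches.
    return ua.Hostname() != "" && ua.Hostname() == ub.Hostname()
}

func main() {
    fmt.Println(sameHost("http://example.com/a", "http://example.com/b")) // true
    fmt.Println(sameHost("http://example.com", "http://other.org"))       // false
}
```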

Conclusion

This guide provided a basic walkthrough of creating a simple web crawler in Go. While this code can serve as a starting point, web crawling involves handling numerous practical challenges like pagination, errors, duplicate content, and dynamic data. Further enhancements can include adding concurrency controls, integrating databases for storage, and implementing sophisticated data extraction methods.

