Fetching and Parsing HTML Pages with `goquery`

When scraping the web or extracting data in Go, the goquery library is a convenient tool for parsing and querying HTML pages. It offers a jQuery-like API for selecting and extracting elements from an HTML document. The page itself is fetched separately, typically with Go's net/http package.

Installation
Fetching HTML Content
Parsing HTML Content
Working with Selections
Conclusion

Installation

Before using goquery, add it to your project. From inside a Go module (a directory with a go.mod file), run the following command:

go get github.com/PuerkitoBio/goquery

Fetching HTML Content

The first step is to fetch the HTML content you want to parse, which you can do with Go's net/http package. The example below requests a page, checks the status code, and passes the response body to goquery.NewDocumentFromReader:


package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // Request the HTML page.
    res, err := http.Get("https://example.com")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()

    // res.Status already contains the numeric code, e.g. "404 Not Found".
    if res.StatusCode != http.StatusOK {
        log.Fatalf("status code error: %s", res.Status)
    }

    // Parse the response body into a goquery document.
    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Use the document, for example to print the page title.
    fmt.Println(doc.Find("title").Text())
}
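One caveat with http.Get is that it uses the default client, which has no timeout, so a slow server can block the program indefinitely. The following is a minimal sketch of a fetch helper with a bounded timeout. The names fetchBody and checkStatus are hypothetical, and a local httptest server stands in for a real site so the example runs offline.

```go
package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "net/http/httptest"
    "time"
)

// checkStatus (hypothetical helper) returns an error unless code is 200 OK.
func checkStatus(code int) error {
    if code != http.StatusOK {
        return fmt.Errorf("unexpected status: %d %s", code, http.StatusText(code))
    }
    return nil
}

// fetchBody (hypothetical helper) GETs url with a bounded timeout and
// returns the response body as a string.
func fetchBody(url string, timeout time.Duration) (string, error) {
    client := &http.Client{Timeout: timeout}
    res, err := client.Get(url)
    if err != nil {
        return "", err
    }
    defer res.Body.Close()

    if err := checkStatus(res.StatusCode); err != nil {
        return "", err
    }
    body, err := io.ReadAll(res.Body)
    if err != nil {
        return "", err
    }
    return string(body), nil
}

func main() {
    // A local test server replaces a real website for this sketch.
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprint(w, "<html><body><h1>Hello</h1></body></html>")
    }))
    defer srv.Close()

    body, err := fetchBody(srv.URL, 5*time.Second)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(body)
}
```

When the goal is parsing rather than keeping the raw text, the response body can be passed straight to goquery.NewDocumentFromReader instead of being read into a string.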

Parsing HTML Content

Once the document is parsed, you can select elements with CSS selectors using Find and iterate over the matches with Each. Here is an example that extracts all headings from a page:


// doc is the *goquery.Document created in the fetching example above.

// Find every heading element and print its position and text.
doc.Find("h1, h2, h3, h4, h5, h6").Each(func(index int, item *goquery.Selection) {
    text := item.Text()
    fmt.Printf("Heading %d: %s\n", index, text)
})

Working with Selections

A goquery selection exposes the matched elements' text, attributes, and more. Attr returns the attribute value together with a boolean that reports whether the attribute is present. The following example extracts the href attribute from every hyperlink:


doc.Find("a").Each(func(index int, item *goquery.Selection) {
    // Skip anchors that have no href attribute.
    link, exists := item.Attr("href")
    if !exists {
        return
    }
    fmt.Println(link)
})

Conclusion

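With net/http to fetch a page and goquery to parse and query it, Go developers can scrape HTML with very little code. This brief introduction covered building a document from a response body, selecting elements with Find, iterating with Each, and reading text and attributes. goquery offers many more traversal and filtering methods beyond these.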

