Sling Academy

Fetching and Parsing HTML Pages with `goquery`

Last updated: November 27, 2024

When scraping the web or extracting data in Go, the goquery library is a convenient tool for parsing and querying HTML pages. It offers a jQuery-like API for selecting and extracting elements from an HTML document. Fetching the page itself is typically done with Go's net/http package, and the response body is then handed to goquery.

Installation

Before you can use goquery, add it to your project. From inside your module directory (the one containing go.mod), run the following command:

go get github.com/PuerkitoBio/goquery

Fetching HTML Content

The first step in using goquery is to fetch the HTML content you want to parse. You can do this using Go's net/http package. Below is an example of how to fetch a webpage:


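package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // Request the HTML page.
    res, err := http.Get("https://example.com")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()

    // Stop early on any non-200 response. res.Status already
    // contains the numeric code, e.g. "404 Not Found".
    if res.StatusCode != http.StatusOK {
        log.Fatalf("status code error: %s", res.Status)
    }

    // Parse the response body into a goquery document.
    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Use the document, for example by printing the page title.
    // (Go refuses to compile a program with an unused variable or import.)
    fmt.Println(doc.Find("title").Text())
}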

Parsing HTML Content

After fetching the HTML content, you can query the parsed document with CSS selectors. The following example extracts all headings from a page:


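func main() {
    // Omitting initial fetching code for brevity; doc is the
    // *goquery.Document created in the previous example.

    // Find every heading element. Matches are visited in document order.
    doc.Find("h1, h2, h3, h4, h5, h6").Each(func(index int, item *goquery.Selection) {
        // Text() returns the combined text of the element and its descendants.
        text := item.Text()
        // index is zero-based.
        fmt.Printf("Heading %d: %s\n", index, text)
    })
}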

Working with Selections

The goquery library allows you to refine your selection and extract attributes, text, and more. The following example demonstrates how to extract href attributes from all hyperlinks:


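doc.Find("a").Each(func(index int, item *goquery.Selection) {
    // Attr returns the value and a flag reporting whether the attribute exists.
    link, exists := item.Attr("href")
    if !exists {
        return // skip anchors that have no href
    }
    fmt.Println(link)
})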

Conclusion

Using goquery allows Go developers to effectively scrape and parse HTML pages with relative ease. This brief introduction provides just a glimpse of the capabilities goquery offers for web scraping tasks.
