String Length in Go: Counting Characters vs Bytes

In Go, counting the characters in a string is less obvious than it looks. A common misconception is that the length of a string always equals the number of characters it contains. However, the built-in len function returns the number of bytes, and because Go strings normally hold UTF-8 encoded text, a single character may occupy more than one byte. Let's explore how to measure both.

Understanding Strings in Go

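A string in Go is a read-only sequence of bytes that, by convention, holds UTF-8 encoded text. Go source code is itself UTF-8, so string literals written as plain text are UTF-8 too. This means each character can consist of one or more bytes. For example, ASCII characters require one byte, while other Unicode characters take two, three, or four bytes.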
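As a minimal sketch of the point above, the following program reports how many bytes UTF-8 needs for each character of a sample string. The helper name runeSizes and the sample characters are illustrative choices, not something from the standard library.

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// runeSizes (a hypothetical helper) returns the UTF-8 encoded size,
// in bytes, of each character in s. It assumes s is valid UTF-8.
func runeSizes(s string) []int {
    sizes := make([]int, 0, utf8.RuneCountInString(s))
    for _, r := range s {
        sizes = append(sizes, utf8.RuneLen(r))
    }
    return sizes
}

func main() {
    // One ASCII letter, one Latin-1 letter, one CJK ideograph, one emoji.
    fmt.Println(runeSizes("aß世🙂")) // Outputs: [1 2 3 4]
}
```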

Basic Usage

Getting the length of a string in bytes is straightforward using Go's built-in len function:

package main

import "fmt"

func main() {
    str := "hello"
    fmt.Println("Length in bytes:", len(str)) // Outputs: 5
}

In this example, each character ('h', 'e', 'l', 'l', 'o') is one byte in UTF-8 encoding.
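When a string is pure ASCII, its byte length and character count are always equal. A quick way to check for that case might look like the sketch below; isASCII is a hypothetical helper, and utf8.RuneSelf is the standard library constant (0x80) below which every byte is a complete one-byte character.

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// isASCII (a hypothetical helper) reports whether every byte in s
// is a single-byte ASCII character.
func isASCII(s string) bool {
    for i := 0; i < len(s); i++ {
        if s[i] >= utf8.RuneSelf {
            return false
        }
    }
    return true
}

func main() {
    fmt.Println(isASCII("hello"))      // Outputs: true
    fmt.Println(isASCII("こんにちは")) // Outputs: false
}
```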

Intermediate Usage

Now, let's look at a string with characters beyond the basic ASCII set:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "こんにちは" // "Hello" in Japanese
    fmt.Println("Length in bytes:", len(str)) // Outputs: 15
    fmt.Println("Number of characters:", utf8.RuneCountInString(str)) // Outputs: 5
}

In this case, the string "こんにちは" is 15 bytes long because each character is represented by 3 bytes in UTF-8 encoding. However, there are only 5 characters.
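The 3-bytes-per-character layout can be made visible with a range loop, which walks a string one character at a time and yields the byte offset where each character starts. The helper runeOffsets below is a hypothetical name used only for this sketch.

```go
package main

import "fmt"

// runeOffsets (a hypothetical helper) returns the byte offset at which
// each character of s begins.
func runeOffsets(s string) []int {
    var offsets []int
    for i := range s {
        offsets = append(offsets, i)
    }
    return offsets
}

func main() {
    str := "こんにちは"
    // range decodes UTF-8: i is a byte offset, r is the decoded character.
    for i, r := range str {
        fmt.Printf("byte offset %2d: %c\n", i, r)
    }
    fmt.Println(runeOffsets(str)) // Outputs: [0 3 6 9 12]
}
```

Note that indexing with str[i] returns a single byte, not a character, which is why the offsets jump by 3 here.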

Advanced Usage

The next example uses emoji, which take four bytes each in UTF-8. It also converts the string to a slice of runes so that each character can be inspected individually, along with its Unicode code point and encoded size.

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "🙂😀😁"
    byteCount := len(str)
    charCount := utf8.RuneCountInString(str)

    fmt.Println("The string is:", str)
    fmt.Println("Length in bytes:", byteCount)     // Outputs: 12
    fmt.Println("Number of characters:", charCount) // Outputs: 3

    // Convert to runes slice
    runes := []rune(str)
    for i, r := range runes {
        fmt.Printf("Character %d: %c (Unicode code point: %U, Size: %d bytes)\n", i, r, r, utf8.RuneLen(r))
    }
}

Here, the string consists of emoji that take 4 bytes each. The total length is therefore 12 bytes, while the rune count is only 3.
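Because a Go string can hold arbitrary bytes, it may also contain data that is not valid UTF-8. The sketch below, using a hypothetical helper named inspect, shows one way to detect that with utf8.ValidString. According to the unicode/utf8 documentation, utf8.RuneCountInString treats each invalid byte as a single rune of width 1, so the rune count alone does not reveal bad input.

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// inspect (a hypothetical helper) returns the byte length, the rune count,
// and whether s is valid UTF-8.
func inspect(s string) (byteCount, runeCount int, valid bool) {
    return len(s), utf8.RuneCountInString(s), utf8.ValidString(s)
}

func main() {
    good := "🙂"
    bad := "a\xffb" // the byte 0xff never appears in valid UTF-8

    for _, s := range []string{good, bad} {
        b, r, ok := inspect(s)
        fmt.Printf("%q: %d bytes, %d runes, valid UTF-8: %t\n", s, b, r, ok)
    }
}
```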

Conclusion

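In Go, it is essential to distinguish between a string's length in bytes and its number of characters, especially when your application involves internationalization or other Unicode content. The built-in len function returns the byte count, while utf8.RuneCountInString returns the number of Unicode code points (runes), which matches the visible character count for text like the examples above. Keep in mind that some user-perceived characters, such as a letter followed by a combining accent, are made of several runes, so the rune count is not always what a reader would count.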
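One caveat is worth a short illustration: a rune is a Unicode code point, which is not always the same as a character the user sees. The sketch below compares two ways of writing "é"; bytesAndRunes is a hypothetical helper. The standard library does not count user-perceived characters directly, so normalization (for example with golang.org/x/text/unicode/norm) or a text segmentation library would be needed for that.

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// bytesAndRunes (a hypothetical helper) returns the byte length and
// the rune count of s.
func bytesAndRunes(s string) (int, int) {
    return len(s), utf8.RuneCountInString(s)
}

func main() {
    precomposed := "\u00e9" // "é" as a single code point
    combining := "e\u0301"  // "e" followed by a combining acute accent

    b, r := bytesAndRunes(precomposed)
    fmt.Println(b, r) // Outputs: 2 1

    b, r = bytesAndRunes(combining)
    fmt.Println(b, r) // Outputs: 3 2
}
```

Both strings render as one visible character, yet they differ in both byte length and rune count.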
