Pandas UnicodeDecodeError: 'utf-8' codec can't decode

Updated: February 21, 2024 By: Guest Contributor

Table Of Contents

1 Understanding the Error

1.1 Why the Error Occurs?

2 Solutions

2.1 Solution #1 – Specifying the Encoding Explicitly

2.2 Solution #2 – Detecting Encoding Automatically

2.3 Solution #3 – Convert File Encoding to UTF-8

Understanding the Error

The UnicodeDecodeError: 'utf-8' codec can't decode error in Pandas usually occurs when you try to read a file that is not encoded in UTF-8. It can be frustrating, but once you understand the cause it is easy to fix. This tutorial explains why the error occurs and walks through several solutions.

Why the Error Occurs?

This error arises when Pandas decodes a file as UTF-8 (the default for pd.read_csv) but the file contains byte sequences that are not valid UTF-8. UTF-8 can represent every Unicode character, but it is not the only encoding in use. A file saved in a different encoding, such as Latin-1 or Windows-1252, stores non-ASCII characters as bytes that the UTF-8 decoder rejects, and Pandas raises this error.
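
To see the failure in isolation, here is a minimal sketch that encodes a short string as Latin-1 and then tries to decode the bytes as UTF-8. The sample string is only an illustration; Pandas is not involved because the error comes from Python's own decoder.

```python
# Illustrative sample text; any non-ASCII character would do.
text = "café"

# Latin-1 stores "é" as the single byte 0xE9.
latin1_bytes = text.encode("latin-1")

try:
    latin1_bytes.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # In UTF-8, 0xE9 starts a multi-byte sequence, but no
    # continuation bytes follow it here, so decoding fails.
    error_message = str(exc)

print(error_message)

# Decoding with the encoding that was actually used works.
assert latin1_bytes.decode("latin-1") == text
```

The printed message begins with the same "'utf-8' codec can't decode byte" text that Pandas reports.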

Solutions

Solution #1 – Specifying the Encoding Explicitly

The first and most straightforward approach is to explicitly specify the file’s encoding if it’s known.

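Step 1: Use pd.read_csv or another text-based reader such as pd.read_table. (pd.read_excel reads binary workbooks and does not take an encoding parameter.)
Step 2: Pass the encoding parameter with the correct encoding of your file.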

Code Example:

import pandas as pd

# Replace 'latin1' with the actual encoding of your file.
df = pd.read_csv('yourfile.csv', encoding='latin1')

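Notes: This solution is straightforward but requires knowing the file's encoding. 'latin1' (also known as 'iso-8859-1') and 'cp1252' are common encodings for files that trigger this error. Keep in mind that 'latin1' maps every possible byte to a character, so it never raises a decoding error even when it is the wrong choice. Check the loaded text for garbled characters.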
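
When the encoding is only partly known, one option is to try a short list of candidates in order. The sketch below uses a hypothetical helper, read_csv_with_fallback, and a candidate list chosen purely for illustration; neither is part of Pandas. Because 'latin1' accepts every byte value, it is placed last as a catch-all that may still produce wrong characters.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252", "latin1")):
    """Hypothetical helper: return (DataFrame, encoding) for the first
    candidate encoding that decodes the file without an error."""
    if not encodings:
        raise ValueError("at least one candidate encoding is required")

    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as exc:
            # Remember the failure and move on to the next candidate.
            last_error = exc

    # Every candidate failed; re-raise the most recent decoding error.
    raise last_error
```

A call such as df, used = read_csv_with_fallback('yourfile.csv') also returns the encoding that succeeded, which is worth logging so the choice can be checked later.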

Solution #2 – Detecting Encoding Automatically

When the file encoding is unknown, the third-party chardet or cchardet library can be used to guess it.

Step 1: Install chardet or cchardet using pip (for example, pip install chardet).
Step 2: Use the library to detect the encoding of your file.
Step 3: Read the file with the detected encoding.

Code Example:

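import chardet
import pandas as pd

# Read the raw bytes and let chardet guess the encoding.
with open('yourfile.csv', 'rb') as f:
    result = chardet.detect(f.read())

# 'encoding' is None if detection fails; 'confidence' ranges from 0 to 1.
encoding = result['encoding']

df = pd.read_csv('yourfile.csv', encoding=encoding)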

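Notes: While this method is automated, the example above reads the entire file into memory for detection, which can slow down your workflow significantly with large files. The detected encoding is also a statistical guess rather than a certainty, so check result['confidence'] and verify the loaded data.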
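
For large files, one way to bound the cost is to feed the detector in chunks and stop once it is confident or a byte limit is reached. The sketch below uses chardet's UniversalDetector; the helper name detect_encoding and the max_bytes and chunk_size defaults are assumptions for illustration. A guess based on a sample can be wrong if the distinguishing bytes appear later in the file.

```python
from chardet.universaldetector import UniversalDetector


def detect_encoding(path, max_bytes=1_000_000, chunk_size=64 * 1024):
    """Hypothetical helper: guess a file's encoding from at most
    roughly max_bytes of its content."""
    detector = UniversalDetector()
    bytes_read = 0
    with open(path, "rb") as f:
        # Stop early when the detector is confident or the limit is hit.
        while bytes_read < max_bytes and not detector.done:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file
            detector.feed(chunk)
            bytes_read += len(chunk)
    # close() finalizes the guess based on the data fed so far.
    detector.close()
    # A dict with 'encoding' and 'confidence' keys.
    return detector.result
```

The 'encoding' value from the returned dict can then be passed to pd.read_csv. If it is None, detection failed and the encoding has to be chosen manually.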