Git: How to remove a very large file from commit history

Updated: January 27, 2024 By: Guest Contributor

Introduction

Version control systems like Git are essential for modern software development, providing a means to track changes, revert to previous states, and collaborate with others. A common mistake is accidentally committing a large file. Because Git keeps every committed version in its history, the file continues to inflate the repository and slow down clones and fetches even after you delete it in a later commit. This tutorial walks through removing a large file from your Git commit history.

Identifying the Problematic File

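Before removing anything, identify the file(s) that are inflating the repository. git-sizer is useful for this task: it reports size statistics for a repository, including its largest blobs. Building it from source requires a Go toolchain, and the resulting binary must be run from inside the repository you want to inspect (the paths below are placeholders):

git clone https://github.com/github/git-sizer.git
cd git-sizer
make
cd /path/to/your-repo
/path/to/git-sizer/bin/git-sizer --verbose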

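Alternatively, you can combine git rev-list with git verify-pack to list the ten largest objects stored in pack files, together with their paths. This version relies on bash process substitution and only sees objects that have already been packed (for example by git gc):

git rev-list --objects --all |
  grep -F -f <(git verify-pack -v .git/objects/pack/pack-*.idx |
    sort -k3nr | head -n 10 |
    cut -f1 -d ' ')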
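The same listing can also be produced without reading pack index files directly. The following is a minimal sketch, assuming a POSIX shell; the function name largest_blobs is made up for illustration. It asks git cat-file for the size of every object reachable from any ref, so it also covers objects that have not been packed yet.

```shell
# Hypothetical helper: list the N largest blobs reachable from any ref.
# Output columns: size in bytes, object ID, path. Largest first.
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    awk '$1 == "blob"' |      # keep blobs, drop commits and trees
    cut -d ' ' -f 2- |        # strip the object type column
    sort -k1,1nr |            # sort numerically by size, descending
    head -n "${1:-10}"        # default to the top 10
}
```

Run it inside the repository, for example as largest_blobs 5, and note the paths that appear at the top before rewriting anything.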

Preliminary Steps

Before making any changes, it’s crucial to inform your team about the process, as altering commit history can disrupt their workflows.

Next, ensure you have a full backup of your repository. This can be done by cloning your repository:

git clone --mirror your-repo-url backup-repo.git
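Before relying on the backup, it may be worth checking that it is intact. The sketch below is one possible check, not a required step: backup_summary is a hypothetical helper that runs git fsck on the mirror and prints how many branches and tags it holds, so you can compare the numbers with what you expect.

```shell
# Hypothetical helper: sanity-check a mirror clone.
# Argument: path to the mirror, e.g. backup-repo.git
backup_summary() (
  mirror=$1
  # Verify object integrity; stop if the mirror is corrupt.
  git -C "$mirror" fsck --full >/dev/null 2>&1 || {
    echo "fsck failed for $mirror" >&2
    exit 1
  }
  # Count the refs that were copied.
  echo "branches: $(git -C "$mirror" for-each-ref refs/heads | wc -l | tr -d ' ')"
  echo "tags: $(git -C "$mirror" for-each-ref refs/tags | wc -l | tr -d ' ')"
)
```

For example, backup_summary backup-repo.git prints two lines with the branch and tag counts, or returns a non-zero status if the integrity check fails.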

Removing a Single Large File

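If you know the path of the large file, you can remove it with git filter-branch. The command below rewrites every commit on every branch and tag and drops the file wherever it appears, so you do not need to know which commit introduced it. Be aware that the Git documentation itself discourages git filter-branch because it is slow and easy to misuse, and recommends git filter-repo (covered in the next section) instead: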

git filter-branch --force --index-filter "git rm --cached --ignore-unmatch path_to_your_file" --prune-empty --tag-name-filter cat -- --all

Replace path_to_your_file with the path to the file you wish to remove.
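To see how much history a rewrite will affect, you can list the commits that touch the file first. This is a small sketch; commits_touching is a hypothetical name, and the path is interpreted relative to the current directory.

```shell
# Hypothetical helper: print every commit, on any ref, that added,
# modified, or deleted the given path (short hash, date, subject).
commits_touching() {
  git log --all --date=short --format='%h %ad %s' -- "$1"
}
```

Running commits_touching path_to_your_file before the rewrite shows which commits will change. An empty result means the path never appears in history, which usually points to a typo in the path.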

Removing the File from the Entire History

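The git-filter-repo utility is the tool the Git project recommends for rewriting history, and it is considerably faster than git filter-branch on long histories. By default it expects to run in a fresh clone and stops if the repository does not look like one, and it removes the origin remote when it finishes so that the rewritten history cannot be pushed by accident: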

python3 -m pip install --user git-filter-repo
git filter-repo --invert-paths --path path_to_your_file

Again, replace path_to_your_file with the relevant file path.

Purging the File from All References

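git filter-repo rewrites every ref, including tags, by default, so no extra option is needed. The --tag-name-filter option belongs to git filter-branch, where it is already part of the command shown earlier. Do not run git filter-repo with --path alone: without --invert-paths it keeps only the named path and deletes everything else. To confirm that no branch or tag still reaches the file, check that the following prints nothing:

git log --all --oneline -- path_to_your_file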

Cleaning Up and Reducing Repository Size

After the history has been rewritten to exclude the large file, you need to clean up the repository:

rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now --aggressive

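This series of commands removes the backup references that git filter-branch leaves under refs/original, expires reflog entries, and runs garbage collection so that the now-unreferenced blobs are actually deleted from disk. git filter-repo expires the reflog and runs garbage collection on its own, so after using it these commands are normally unnecessary, though harmless.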
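To confirm that the cleanup actually reclaimed space, compare the size of the packed objects before and after. A minimal sketch, with pack_size_kib as a hypothetical helper that reads the size-pack field reported by git count-objects:

```shell
# Hypothetical helper: print the on-disk size of all pack files in KiB,
# taken from the "size-pack" line of `git count-objects -v`.
pack_size_kib() {
  git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}
```

Record the value before the rewrite and again after git gc has run. If the number barely moves, something (a leftover ref, a reflog entry, or a stash) probably still references the large blob.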

Reflecting the Changes on Remote Repositories

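Finally, push the rewritten history to the remote repository with the --force option. If you used git filter-repo, first re-add the remote with git remote add origin your-repo-url, since the tool removes it: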

git push origin --force --all
git push origin --force --tags

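This overwrites the history on the remote repository. It does not change the clones that collaborators already have: each of them must re-clone or reset their branches to the rewritten history, because pushing or merging a branch based on the old history would reintroduce the large file. Some hosting services may also retain unreachable objects for a while, so the file can remain stored on the server until the provider runs its own cleanup.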
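Collaborators who prefer not to re-clone can realign an existing clone with the rewritten remote. The sketch below assumes the remote is named origin; resync_branch is a hypothetical helper. It discards local commits and uncommitted changes on the branch, so anything worth keeping should be saved elsewhere first.

```shell
# Hypothetical helper: make a local branch identical to the rewritten
# remote branch. WARNING: local commits and uncommitted changes on
# that branch are discarded.
resync_branch() {
  git fetch origin &&
    git checkout "$1" &&
    git reset --hard "origin/$1"
}
```

A collaborator would run resync_branch once per branch they have checked out locally, substituting the actual branch name. Work based on the old history should be re-applied on top of the new branch (for example with git cherry-pick) rather than merged.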

Using the BFG Repo-Cleaner

As an alternative to git filter-branch, the simpler BFG Repo-Cleaner can be used to remove large files:

java -jar bfg.jar --strip-blobs-bigger-than 100M

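This command removes blobs larger than 100 MB; adjust the threshold for your case. BFG's documentation recommends running it against a fresh mirror clone, passing the repository path as the final argument. By default BFG protects the latest commit on HEAD and leaves its contents untouched, so delete the large file from your current files in an ordinary commit first. After BFG finishes, run the git reflog expire and git gc commands from the cleanup section to actually free the space.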

Conclusion

Removing large files from Git history helps keep your repository lean and fast to clone. Because the process rewrites history, it requires a backup, coordination with your team, and a forced push, but applied carefully, the tools presented here leave a smaller repository for you and your collaborators.