Introduction
Version control systems like Git are essential for modern software development, providing a means to track changes, revert to previous states, and collaborate with others. A common mistake, however, is accidentally committing a large file to a Git repository. Because Git keeps every version of every file, the file stays in the history even after it is deleted, which inflates the repository and slows down cloning and fetching. This tutorial will guide you through the process of removing a large file from your Git commit history.
Identifying the Problematic File
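Before we proceed with the actual removal, it’s important to identify the file(s) that are unnecessarily bulking up your repository. A useful tool for this task is git-sizer, which reports size statistics for a repository. The commands below build it from source, which requires a Go toolchain; to analyze your own project, run the resulting binary from inside that repository rather than from the git-sizer checkout: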
git clone https://github.com/github/git-sizer.git
cd git-sizer
make
./git-sizer
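Alternatively, you can combine git rev-list with git verify-pack to pinpoint the ten largest objects in your packfiles. This needs bash for the <(...) process substitution, and it only sees objects that have already been packed: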
git rev-list --objects --all |
grep -F -f <(git verify-pack -v .git/objects/pack/pack-*.idx |
sort -k3nr | head -10 |
cut -f1 -d ' ')
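If the repository has not been packed yet, the pipeline above may find nothing. As a hedged alternative, the sketch below asks git cat-file for the size of every object reachable from any ref, so it also covers loose objects. The function name largest_blobs is made up for this example and is not part of Git; it assumes bash.

```shell
# Print the N largest blobs reachable from any ref, biggest first.
# Output columns: size in bytes, object id, path.
# "largest_blobs" is a hypothetical helper name, not a Git command.
largest_blobs() {
  local count="${1:-10}"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    awk '$1 == "blob"' |   # keep blobs only, drop commits and trees
    cut -d' ' -f2- |       # drop the object type column
    sort -k1,1nr |         # sort by size, descending
    head -n "$count"
}

# Usage, from inside the repository:
#   largest_blobs 5
```

The sizes reported are uncompressed blob sizes, so they show which files are large rather than exactly how much disk space each one takes inside a packfile.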
Preliminary Steps
Before making any changes, it’s crucial to inform your team about the process, as altering commit history can disrupt their workflows.
Next, ensure you have a full backup of your repository. This can be done by cloning your repository:
git clone --mirror your-repo-url backup-repo.git
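Before relying on the backup, it can be worth checking that it is intact. The following is a minimal sketch, assuming bash and the backup-repo.git directory name used above; verify_backup is a hypothetical helper name.

```shell
# Check the object database of a mirror clone and print how many refs it holds.
# "verify_backup" is a hypothetical helper name, not a Git command.
verify_backup() {
  local backup_dir="$1"
  local count
  # fsck exits non-zero if objects are missing or corrupt.
  if ! git -C "$backup_dir" fsck --no-dangling >/dev/null 2>&1; then
    echo "backup failed integrity check: $backup_dir" >&2
    return 1
  fi
  # Every branch and tag should appear here.
  count=$(git -C "$backup_dir" for-each-ref | wc -l)
  echo $((count))
}

# Usage:
#   verify_backup backup-repo.git
```

Comparing the printed number with the number of branches and tags you expect is a quick sanity check that nothing was left out of the mirror.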
Removing a Single Large File
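If you know the path of the large file, you can use git filter-branch. You only need the path, not the commit that introduced it, because the command below rewrites every branch and tag. Note that the Git documentation now discourages git filter-branch in favor of git filter-repo, which is covered in the next section: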
git filter-branch --force --index-filter "git rm --cached --ignore-unmatch path_to_your_file" --prune-empty --tag-name-filter cat -- --all
Replace path_to_your_file with the path to the file you wish to remove.
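The same command can be wrapped in a small function that also checks the result. This is a sketch rather than a definitive script: purge_path is a hypothetical name, it assumes bash and a path without single quotes, and it should only be run on a repository you have backed up. FILTER_BRANCH_SQUELCH_WARNING=1 skips the deprecation notice that recent Git versions print before starting.

```shell
# Rewrite all branches and tags to drop one path, then verify it is gone.
# "purge_path" is a hypothetical helper name, not a Git command.
purge_path() {
  local target="$1"
  FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
    --index-filter "git rm -q --cached --ignore-unmatch -- '$target'" \
    --prune-empty --tag-name-filter cat -- --all || return 1

  # Look only at branches and tags: filter-branch keeps backups of the old
  # refs under refs/original/, and those still contain the file.
  if [ -n "$(git log --oneline --branches --tags -- "$target")" ]; then
    echo "still referenced: $target" >&2
    return 1
  fi
}

# Usage, from the top level of a clean working tree:
#   purge_path path_to_your_file
```

Commits that only touched the removed file become empty and are dropped by --prune-empty, so the commit count may shrink after the rewrite.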
Removing the File from the Entire History
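For large repositories, or when the file has been committed many times, the git-filter-repo utility is considerably faster and is the tool the Git project recommends. By default it expects to run in a fresh clone and refuses to proceed otherwise unless you pass --force: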
python3 -m pip install --user git-filter-repo
git filter-repo --invert-paths --path path_to_your_file
Again, replace path_to_your_file with the relevant file path.
Purging the File from All References
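git filter-repo rewrites every branch and tag by default, so no extra option is needed. The --tag-name-filter option belongs to git filter-branch (it appears in the filter-branch command earlier), and running git filter-repo with --path but without --invert-paths would do the opposite of what you want: it keeps only that file and discards everything else. To confirm that no branch or tag still references the file, check that the following prints nothing:
git log --oneline --branches --tags -- path_to_your_file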
Cleaning Up and Reducing Repository Size
After the history has been rewritten to exclude the large file, you need to clean up the repository:
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now --aggressive
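This series of commands removes the backup references that git filter-branch leaves under refs/original/, expires reflog entries, and runs garbage collection so that the now-unreachable objects are actually deleted. git filter-repo already expires the reflog and runs garbage collection itself, so these steps matter mainly after git filter-branch.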
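To see whether the cleanup actually freed space, you can compare the size reported by git count-objects before and after. The sketch below assumes bash and a non-bare repository with the current directory at its top level; repo_size_kib and cleanup_and_report are hypothetical helper names.

```shell
# Total size of loose objects plus packfiles, in KiB, as reported by Git.
# "repo_size_kib" and "cleanup_and_report" are hypothetical helper names.
repo_size_kib() {
  git count-objects -v |
    awk '/^size:|^size-pack:/ { total += $2 } END { print total + 0 }'
}

# Run the cleanup steps and print the size before and after.
cleanup_and_report() {
  local before after
  before=$(repo_size_kib)
  rm -rf .git/refs/original/
  git reflog expire --expire=now --all
  git gc --quiet --prune=now --aggressive
  after=$(repo_size_kib)
  echo "before: ${before} KiB, after: ${after} KiB"
}

# Usage, from the top level of the repository:
#   cleanup_and_report
```

If the two numbers are nearly identical, something is probably still referencing the large file, for example a branch or tag that was not rewritten.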
Reflecting the Changes on Remote Repositories
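Finally, push the rewritten history to the remote repository with the --force option. git filter-repo removes the origin remote as a safety measure, so if you used it, re-add the remote with git remote add first: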
git push origin --force --all
git push origin --force --tags
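This overwrites the history on the remote repository. It does not change your collaborators' existing clones: those still contain the old commits, and pushing or merging from them would bring the large file back. Each collaborator should either re-clone or reset their local branches to the rewritten ones. The hosting service may also keep the old objects until it runs its own garbage collection.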
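For a collaborator who prefers not to re-clone, one possible way to move an existing clone onto the rewritten history is sketched below. It assumes bash, a remote named origin, and that any unpushed local work has been saved elsewhere, because git reset --hard discards local commits and uncommitted changes on that branch. sync_to_rewritten_history is a hypothetical helper name.

```shell
# Point a local branch at the rewritten remote branch.
# WARNING: discards local commits and uncommitted changes on that branch.
# "sync_to_rewritten_history" is a hypothetical helper name.
sync_to_rewritten_history() {
  local branch="${1:-main}"
  git fetch --prune origin || return 1
  git checkout "$branch" || return 1
  git reset --hard "origin/$branch"
}

# Usage, inside the collaborator's clone:
#   sync_to_rewritten_history main
```

The old commits remain in that clone's object database until its reflog entries expire and garbage collection runs, so the local disk usage does not drop immediately.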
Using the BFG Repo-Cleaner
As an alternative to git filter-branch, the simpler BFG Repo-Cleaner, a Java tool, can be used to remove large files:
java -jar bfg.jar --strip-blobs-bigger-than 100M
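This command removes blobs larger than 100 MB from the history. Adjust the size as necessary for your particular case. By default BFG leaves the contents of your latest commit untouched, so delete the large file and commit that change before running it. Afterwards, run the git reflog expire and git gc commands from the cleanup section so the stripped blobs are actually deleted.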
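Before choosing a threshold, it can help to preview which files in the history exceed it. The sketch below uses plain Git commands rather than BFG itself, so it is only an approximation of what BFG would strip; blobs_bigger_than is a hypothetical helper name, the threshold is given in bytes, and bash is assumed.

```shell
# List every blob in the history whose size exceeds a byte threshold.
# Output columns: size in bytes, path.
# "blobs_bigger_than" is a hypothetical helper name, not a Git or BFG command.
blobs_bigger_than() {
  local min_bytes="$1"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    awk -v min="$min_bytes" '
      $1 == "blob" && $2 + 0 > min + 0 {
        size = $2
        sub(/^[^ ]+ [^ ]+ [^ ]+ /, "")   # strip type, size, id; keep the path
        print size, $0
      }'
}

# Usage: preview blobs over 100 MiB
#   blobs_bigger_than $((100 * 1024 * 1024))
```

An empty result means no blob in the history exceeds the threshold, in which case a lower value may be needed.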
Conclusion
Removing large files from Git history can help keep your repository lean and efficient. While the process can be complex and requires careful handling to preserve the integrity of your project history, the correct application of the presented tools ensures a clean repository, beneficial for you and your collaborators.