Remove files from your git history with git-filter-repo
Published: November 12, 2022
Modified: November 25, 2022
I recently cleaned a couple of git repos that had large data files committed early in their history, and in the process I learned about git-filter-repo, a tool for cleanly altering git histories.
There are many reasons you might need to modify your git history. For example, consider a local repo you want to push to GitHub that at some point had a file larger than GitHub’s 100 MB cap committed. In order to push, you would need to not only remove the file from your repo with git rm, but also remove it from every commit it showed up in. Another common scenario: you want to purge your git history of accidentally tracked junk files, such as __pycache__ folders or .DS_Store files.
In both scenarios, the goal is to completely remove a file (or directory) from the git history.
The old way: git filter-branch
When you search around for ways to remove files from a repo’s history, you might find a lot of older Stack Exchange posts and tutorial websites with solutions involving git filter-branch. However, according to the git filter-repo readme, filter-branch has numerous problems: it is slow, potentially unsafe for your repository, and clunky to use. For that reason I won’t describe how to use it here.
Enter git filter-repo
People have since built simpler, more effective tools for performing git history manipulations, and the best one I’ve found is git-filter-repo. I picked it after having been convinced by their comparisons to other tools in this area. They cover many use cases in their handbook, which is worth at least glancing over.
Example: removing files from the git history
In this post I’ll focus on the example of removing a file from the git history. However, this works the same with directories and similarly with glob patterns or regex; see --path-glob and --path-regex.
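As a rough sketch of the pattern-based variants, here is how the junk-file scenario from earlier might look. The repo layout and file names are made up for illustration, and the rewrite is skipped if git-filter-repo is not on your PATH.

```shell
# Hypothetical throwaway repo containing the kinds of junk files mentioned above.
junk_repo="$(mktemp -d)"
git init -q "$junk_repo"
mkdir -p "$junk_repo/pkg/__pycache__"
echo "print('hi')" > "$junk_repo/pkg/main.py"
echo "bytecode" > "$junk_repo/pkg/__pycache__/main.cpython-311.pyc"
echo "junk" > "$junk_repo/.DS_Store"
# -f so the files are added even if a global gitignore already excludes them.
git -C "$junk_repo" add -f .
git -C "$junk_repo" -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "Initial commit"

if command -v git-filter-repo >/dev/null 2>&1; then
    (
        cd "$junk_repo" || exit 1
        # Glob: drop every path ending in .DS_Store.
        git filter-repo --path-glob '*.DS_Store' --invert-paths --force
        # Regex: drop anything inside a __pycache__ directory at any depth.
        git filter-repo --path-regex '^(.*/)?__pycache__/' --invert-paths --force
    )
else
    echo "git-filter-repo not found on PATH; skipping the rewrite" >&2
fi

# List what is still tracked.
git -C "$junk_repo" ls-files
```

I ran the two filters separately here to keep each pattern easy to read on its own.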
For illustration, I’ll initialize an empty git repository and add two files, file_1.txt and file_2.txt, in a single commit.
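Roughly, the steps look like the sketch below. Treat it as a reconstruction rather than the exact session: the temporary directory, file contents, and inline commit identity are placeholders, and the git rm --cached "decaching" step is inferred from the discussion further down. The rewrite itself is skipped if git-filter-repo is not on your PATH.

```shell
# Hypothetical reconstruction; directory name and file contents are placeholders.
repo="$(mktemp -d)/filter-repo-demo"
mkdir -p "$repo" && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # call the initial branch "main" on any git version

echo "first file" > file_1.txt
echo "second file" > file_2.txt
git add file_1.txt file_2.txt
# Inline identity so the commit succeeds without a global git config.
git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "Initial commit"

# Assumed "decaching" step: untrack file_2.txt but leave it on disk.
git rm -q --cached file_2.txt

# Rewrite history without file_2.txt.
if command -v git-filter-repo >/dev/null 2>&1; then
    git filter-repo --path file_2.txt --invert-paths --force
else
    echo "git-filter-repo not found on PATH; skipping the rewrite" >&2
fi
```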
Parsed 1 commits
New history written in 0.02 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at 9e2acce Initial commit
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Completely finished after 0.07 seconds.
The --path option specifies the path you’re targeting, and --invert-paths is the logical negation of the filtering condition. With it, filter-repo deletes only file_2.txt; leave the flag off and it instead deletes everything except file_2.txt. You get only the file, or everything but the file.
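To see the two behaviors side by side, here is a small sketch that runs both variants on identical throwaway repos. The make_demo_repo helper and the file contents are hypothetical, and the rewrite is skipped if git-filter-repo is not on your PATH.

```shell
# Hypothetical helper: build a one-commit repo with file_1.txt and file_2.txt.
make_demo_repo() {
    dir="$(mktemp -d)"
    git -C "$dir" init -q
    echo "one" > "$dir/file_1.txt"
    echo "two" > "$dir/file_2.txt"
    git -C "$dir" add .
    git -C "$dir" -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "Initial commit"
    echo "$dir"
}

drop_repo="$(make_demo_repo)"
keep_repo="$(make_demo_repo)"

if command -v git-filter-repo >/dev/null 2>&1; then
    # With --invert-paths: everything except file_2.txt survives.
    (cd "$drop_repo" && git filter-repo --path file_2.txt --invert-paths --force)
    # Without it: only file_2.txt survives.
    (cd "$keep_repo" && git filter-repo --path file_2.txt --force)
else
    echo "git-filter-repo not found on PATH; skipping the rewrite" >&2
fi

# Compare which files each repo still tracks.
git -C "$drop_repo" ls-files
git -C "$keep_repo" ls-files
```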
The --force flag is needed because filter-repo expects us to follow best practices and run it only on a fresh clone. In practice, you would commit all your work, get a clean git state, and make a fresh clone of your repo to operate on with filter-repo.
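A minimal sketch of that fresh-clone workflow might look like the following. The original and fresh directory names are placeholders standing in for a real project. As far as I know, --no-local is the usual way to make a clone of a local path count as fresh, but take that as an assumption. The rewrite is skipped if git-filter-repo is not on your PATH.

```shell
# Hypothetical original repo standing in for your real project.
work="$(mktemp -d)"
git init -q "$work/original"
echo "keep me" > "$work/original/file_1.txt"
echo "remove me" > "$work/original/file_2.txt"
git -C "$work/original" add .
git -C "$work/original" -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "Initial commit"

# Fresh clone; --no-local makes git use its normal transport
# instead of the shortcuts it takes for local paths.
git clone -q --no-local "$work/original" "$work/fresh"

# In a fresh clone the safety check should pass, so --force is left off.
if command -v git-filter-repo >/dev/null 2>&1; then
    (cd "$work/fresh" && git filter-repo --path file_2.txt --invert-paths)
else
    echo "git-filter-repo not found on PATH; skipping the rewrite" >&2
fi

git -C "$work/fresh" log --name-status --oneline
```

Working on the clone also means the original repo stays untouched if the rewrite does not turn out the way you wanted.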
Now check the files in the git log and filesystem:
ls
file_1.txt file_2.txt
git log --name-status --oneline
9e2acce (HEAD -> main) Initial commit
A file_1.txt
The single commit now does not have any information pertaining to file_2.txt, and file_2.txt is still around on the filesystem. (If you just wanted to delete it completely, you could just skip the decaching step altogether.)
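If you want an extra sanity check beyond reading the log, one option is to ask git which commits on any ref ever touched a path. The sketch below uses only plain git on a made-up repo that mirrors the end state above (file_1.txt committed, file_2.txt present on disk but untracked). The touched_in_history helper is hypothetical.

```shell
# Hypothetical repo mirroring the end state: file_1.txt committed, file_2.txt untracked.
check_repo="$(mktemp -d)"
git init -q "$check_repo"
echo "one" > "$check_repo/file_1.txt"
echo "two" > "$check_repo/file_2.txt"
git -C "$check_repo" add file_1.txt
git -C "$check_repo" -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "Initial commit"

# Hypothetical helper: print abbreviated hashes of commits, on any ref, that touched a path.
touched_in_history() {
    git -C "$check_repo" log --all --format=%h -- "$1"
}

if [ -z "$(touched_in_history file_2.txt)" ]; then
    echo "file_2.txt does not appear in any commit"
fi

# Every path that appears anywhere in the history, de-duplicated.
git -C "$check_repo" log --all --format= --name-only | sed '/^$/d' | sort -u
```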