Ken Muse

Migrating Submodules That Use Large File Storage (LFS)


When working in Git, you have the ability to treat external repositories as if they were simply folders. These submodules can be updated and managed just like any other Git repository. You can update and commit code, then update the current repository to use the revised version of the submodule. But what happens when you need to rewrite the history of the submodule? This can happen for a number of reasons, but one of the most common is the need to remove large files from the repository.

In this post, we’ll look at how submodules work and how to handle pointing a submodule to a new repository or updating the associated commit to point to a different commit.

What is a submodule

Conceptually, a submodule is a pointer to a specific commit in a different repository. The implementation, however, is a bit more complex. It is separated into three components:

  • The .gitmodules file
  • The .git/config file
  • The gitlink commit

To explore how these components work, let’s assume that we run the following command to add a submodule to our repository in the my-submodule directory:

git submodule add https://github.com/kenmuse/supersub.git my-submodule

First, a .gitmodules file is created in the root of the repository. It contains all of the key details for any submodules. In this case, it will look something like this:

[submodule "my-submodule"]
	path = my-submodule
	url = https://github.com/kenmuse/supersub.git

This provides the record of which submodules are included in the repository and where they are located. When git submodule init is run, it reads this file and copies the details for each submodule into the .git/config file. Running git submodule update (or git submodule update --init to combine both steps) then clones the submodules into the appropriate directories. The .git/config file is the second component responsible for making submodules work in Git. Its entries track which submodules have been restored and where their sources are located. This allows specific submodules to be updated or restored without affecting the others. It also allows you to change where the current submodule is pointing, so you can update the submodule to use a different commit or repository.
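To see how the two files relate, here is a minimal sketch that runs in a throwaway directory. It assumes a reasonably recent Git, and it uses local repositories as stand-ins for the GitHub URL above. The names supersub, parent, and fresh are illustrative only, and GIT_ALLOW_PROTOCOL is set because newer versions of Git refuse local-path submodules by default.

```shell
set -eu
# Throwaway sandbox; every path and repository name here is hypothetical.
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_ALLOW_PROTOCOL=file   # allow local-path clones for submodules

# Local stand-in for https://github.com/kenmuse/supersub.git
git init -q -b main supersub
echo "hello" > supersub/README.md
git -C supersub add README.md
git -C supersub commit -q -m "Initial commit"

# Parent repository that references it as a submodule
git init -q -b main parent
git -C parent submodule --quiet add "$work/supersub" my-submodule
git -C parent commit -q -m "Add submodule"

# A plain clone of the parent does not restore its submodules.
git clone -q parent fresh
cd fresh

# Before init: .gitmodules knows the URL, .git/config does not.
git config --file .gitmodules --get submodule.my-submodule.url
before=$(git config --get submodule.my-submodule.url || true)

# init copies the entry into .git/config; it does not fetch anything.
git submodule --quiet init
after=$(git config --get submodule.my-submodule.url)
files_after_init=$(ls -A my-submodule)

# update clones the submodule and checks out the commit named by the gitlink.
git submodule --quiet update
ls my-submodule
```

In this sketch the submodule directory stays empty after init and is only populated by update.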

The final component is the gitlink commit. This is a special entry that records the submodule's commit ID in place of the folder's contents, so moving the submodule to a different commit always shows as a change to the folder containing the submodule. When a commit is created, it consists of the trees (loosely, directories) and blobs (files), each with a file mode that indicates the type of record and its permissions. A gitlink is recorded in a tree in the commit with the file mode 160000 and the object type commit. This indicates that this is a directory that should be populated using the contents of the specified commit. For example, running git ls-files --format='%(objectmode) %(objecttype) %(objectname) %(path)' for the current example will show something like this after it is committed:

100644 blob     9da992faf110f01ce1efdfd27e04795a23d97e92    .gitmodules
160000 commit   c931a7bbe7df798d559e172bcf7c80a086c82f1d    my-submodule

This shows that .gitmodules is a standard file without execution permission. It also indicates that the my-submodule directory relies on a gitlink. It should be populated using commit c931a7bbe7df798d559e172bcf7c80a086c82f1d from a repository identified in .gitmodules. This tree record is what ties a specific commit from the submodule repository to the parent repository. This ensures that the repository tracks the specific commit that was used with the code.
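If you want to inspect the gitlink yourself, the following sketch compares the ID recorded in the parent's tree with the commit checked out inside the submodule. It is an illustration under assumptions: the repositories are local stand-ins created in a temporary directory, and the names supersub and parent are hypothetical.

```shell
set -eu
# Throwaway sandbox; every path and repository name here is hypothetical.
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_ALLOW_PROTOCOL=file   # allow local-path clones for submodules

# Local stand-in for the submodule's repository
git init -q -b main supersub
echo "hello" > supersub/README.md
git -C supersub add README.md
git -C supersub commit -q -m "Initial commit"

git init -q -b main parent
cd parent
git submodule --quiet add "$work/supersub" my-submodule
git commit -q -m "Add submodule"

# The tree entry for the submodule path: mode 160000, type commit.
git ls-tree HEAD my-submodule

# The recorded ID is the commit that is checked out inside the submodule.
gitlink=$(git rev-parse HEAD:my-submodule)
checked_out=$(git -C my-submodule rev-parse HEAD)
echo "gitlink:        $gitlink"
echo "submodule HEAD: $checked_out"
```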

What about branches

If a branch is specified (git submodule add -b <branch> <url>), the submodule's entry in the .gitmodules file will include a branch = <branch> configuration line. This associates the submodule with a specific branch in its repository. If the special branch name . is used, the submodule will track the branch that has the same name as the current branch in the parent repository. In both cases the branch is only an association; the specific commit is still always tracked as a gitlink record.
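As a quick check of where each piece of information ends up, this sketch adds a submodule with -b and then reads back both the branch setting and the gitlink. The local repositories and the branch name main are assumptions made for the example.

```shell
set -eu
# Throwaway sandbox; every path and repository name here is hypothetical.
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_ALLOW_PROTOCOL=file   # allow local-path clones for submodules

git init -q -b main supersub
echo "hello" > supersub/README.md
git -C supersub add README.md
git -C supersub commit -q -m "Initial commit"

git init -q -b main parent
cd parent
git submodule --quiet add -b main "$work/supersub" my-submodule
git commit -q -m "Add submodule tracking main"

# The branch is stored as configuration in .gitmodules.
branch=$(git config --file .gitmodules --get submodule.my-submodule.branch)
echo "tracked branch: $branch"

# The tree still records one exact commit for the submodule path.
git ls-tree HEAD my-submodule
```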

The simple migration

The simplest case to handle is when the repository URL is changing, but the contents are unaltered. When this happens, the repository URL associated with the submodule just needs to be updated. This can be done with a simple command: git submodule set-url -- <path-to-submodule> <newurl>. This updates the .gitmodules file and, if necessary, the .git/config. At this point, Git will use the new endpoint to push or pull commits.
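The following sketch simulates that situation by renaming a local repository and then pointing the submodule at the new location. The paths are hypothetical stand-ins for real URLs. It prints the URL from each place Git stores it so you can confirm they agree.

```shell
set -eu
# Throwaway sandbox; every path and repository name here is hypothetical.
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_ALLOW_PROTOCOL=file   # allow local-path clones for submodules

git init -q -b main supersub
echo "hello" > supersub/README.md
git -C supersub add README.md
git -C supersub commit -q -m "Initial commit"

git init -q -b main parent
cd parent
git submodule --quiet add "$work/supersub" my-submodule
git commit -q -m "Add submodule"

# Simulate the repository moving to a new location with identical history.
mv "$work/supersub" "$work/supersub-moved"

git submodule --quiet set-url -- my-submodule "$work/supersub-moved"

git config --file .gitmodules --get submodule.my-submodule.url   # shared setting
git config --get submodule.my-submodule.url                      # local .git/config
git -C my-submodule remote get-url origin                        # the submodule's own remote

# The gitlink is untouched, so only .gitmodules needs to be committed.
git -C my-submodule fetch -q origin
git commit -q -am "Point my-submodule at its new URL"
```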

Migrating with LFS

The simple migration assumes that the associated commit was not changed. Git still expects to be able to resolve the gitlink entry. This is what makes Large File Storage (LFS) so challenging. If the submodule’s repository has large files that have to be converted to LFS, the history will be rewritten to use LFS pointers. That change causes all of the commit IDs (SHAs) to change as well. As a result, the gitlink entry will no longer be valid. This causes an error when trying to update the submodule:

> git submodule update

Cloning into '/repo/my-submodule'...
done.
fatal: git upload-pack: not our ref c931a7bbe7df798d559e172bcf7c80a086c82f1d
fatal: remote error: upload-pack: not our ref c931a7bbe7df798d559e172bcf7c80a086c82f1d
fatal: Fetched in submodule path 'my-submodule', but it did not contain c931a7bbe7df798d559e172bcf7c80a086c82f1d. Direct fetching of that commit failed.

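To fix this issue, we need to also update the gitlink record to request the current commit from the repository: git submodule update --remote. This checks out the latest commit on the submodule's tracked branch; the new gitlink still needs to be staged and committed in the parent repository. If you then use git diff, you will see that the reference has been updated: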

> git diff

diff --git a/my-submodule b/my-submodule
index c931a7b..8fecc5d 160000
--- a/my-submodule
+++ b/my-submodule
@@ -1 +1 @@
-Subproject commit c931a7bbe7df798d559e172bcf7c80a086c82f1d
+Subproject commit 8fecc5d2bf22e5223260647f8468ca222e691671
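Put together, the whole migration might look like the sketch below. It does not run Git LFS. Instead, a second local repository with the same file but different commit IDs stands in for the rewritten history. The names supersub-lfs and parent are hypothetical, and the sketch assumes a Git version in which update --remote follows the remote's default branch.

```shell
set -eu
# Throwaway sandbox; every path and repository name here is hypothetical.
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_ALLOW_PROTOCOL=file   # allow local-path clones for submodules

git init -q -b main supersub
echo "hello" > supersub/README.md
git -C supersub add README.md
git -C supersub commit -q -m "Initial commit"

git init -q -b main parent
cd parent
git submodule --quiet add "$work/supersub" my-submodule
git commit -q -m "Add submodule"
old=$(git rev-parse HEAD:my-submodule)

# Stand-in for the rewritten repository: same content, different commit IDs.
git init -q -b main "$work/supersub-lfs"
echo "hello" > "$work/supersub-lfs/README.md"
git -C "$work/supersub-lfs" add README.md
git -C "$work/supersub-lfs" commit -q -m "Initial commit (rewritten)"

# 1. Point the submodule at the rewritten repository.
git submodule --quiet set-url -- my-submodule "$work/supersub-lfs"
# 2. Move to the tip of the remote branch instead of the stale gitlink.
git submodule --quiet update --remote my-submodule
# 3. Record the new URL and the new gitlink in the parent repository.
git add .gitmodules my-submodule
git commit -q -m "Move my-submodule to the rewritten history"

new=$(git rev-parse HEAD:my-submodule)
echo "gitlink: $old -> $new"
```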

Now that the submodule is able to resolve the gitlink, it’s also possible to cd into the submodule’s directory and use git checkout to change to a different commit or branch. Staging and committing that change in the parent repository updates the gitlink to point to the commit you selected.
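As an illustration of choosing a commit by hand, this sketch pins the submodule to an earlier commit and records the result in the parent. The repositories are local stand-ins and the variable first is just a convenient handle for the commit being pinned.

```shell
set -eu
# Throwaway sandbox; every path and repository name here is hypothetical.
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_ALLOW_PROTOCOL=file   # allow local-path clones for submodules

# Submodule repository with two commits
git init -q -b main supersub
echo "v1" > supersub/README.md
git -C supersub add README.md
git -C supersub commit -q -m "First commit"
first=$(git -C supersub rev-parse HEAD)
echo "v2" > supersub/README.md
git -C supersub commit -q -am "Second commit"

git init -q -b main parent
cd parent
git submodule --quiet add "$work/supersub" my-submodule
git commit -q -m "Add submodule"   # gitlink points at the second commit

# Work inside the submodule like any other repository.
(cd my-submodule && git checkout -q --detach "$first")

# The parent reports the submodule path as modified until it is staged.
git --no-pager diff -- my-submodule
git add my-submodule
git commit -q -m "Pin my-submodule to its first commit"

pinned=$(git rev-parse HEAD:my-submodule)
echo "gitlink now: $pinned"
```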

As you can see, it’s not too difficult to deal with submodules that have been updated to use LFS. The key is to remember to update the gitlink so it points to a valid commit in the repository.

The recursive history

There’s one last thing to understand. Because submodule references include both a pointer to the repository and to the commit, the Git history will show each time those references have changed. While those references may have been valid at the time they were created, you will notice that none of the commands we used altered the history of those references. Instead, we’ve only created new commits that update the repository to point to the latest versions. The original history (what the references pointed to at that moment in time) remains intact.

While it is possible to rewrite the history, you can see that it would require iterating through the entire repository, updating both the .gitmodules and the gitlink to point to the new repository and commit at each point in time. This would, of course, also change the commit IDs throughout the repository. This is a non-trivial task and is not something that Git can do natively, but it could be scripted.

In general, I would recommend against rewriting the history in this way. Instead, consider selectively branching and updating the references. If you find a need to rebuild a certain version of the code, update the references on that branch so that it now points to the correct repository and commit. From there, you can continue with your normal build process. This leaves the original history intact while still providing the option to update the references as needed.
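One way this could look in practice is sketched below. It assumes you already know which rewritten commit corresponds to the old one; here that is trivially the only commit in a stand-in repository. The tag release-1, the branch rebuild/release-1, and the local paths are all hypothetical.

```shell
set -eu
# Throwaway sandbox; every path, tag, and branch name here is hypothetical.
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_ALLOW_PROTOCOL=file   # allow local-path clones for submodules

git init -q -b main supersub
echo "hello" > supersub/README.md
git -C supersub add README.md
git -C supersub commit -q -m "Initial commit"

git init -q -b main parent
cd parent
git submodule --quiet add "$work/supersub" my-submodule
git commit -q -m "Add submodule"
git tag release-1                  # a version we may need to rebuild later
echo "notes" > NOTES.md
git add NOTES.md
git commit -q -m "Later work on main"

# Stand-in for the rewritten submodule repository
git init -q -b main "$work/supersub-lfs"
echo "hello" > "$work/supersub-lfs/README.md"
git -C "$work/supersub-lfs" add README.md
git -C "$work/supersub-lfs" commit -q -m "Initial commit (rewritten)"
rewritten=$(git -C "$work/supersub-lfs" rev-parse HEAD)

# Branch from the old version and repair the references only on that branch.
git checkout -q -b rebuild/release-1 release-1
git submodule --quiet set-url -- my-submodule "$work/supersub-lfs"
git -C my-submodule fetch -q origin
git -C my-submodule checkout -q --detach "$rewritten"
git add .gitmodules my-submodule
git commit -q -m "Rebuild release-1 against the rewritten submodule"
```

The tag and the main branch still record the original gitlink; only the new branch points at the rewritten commit.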

Know your alternatives

You can see that there’s a lot to consider if you’re migrating a repository with submodules. If you’re not using submodules today, I’d recommend instead considering whether package management solutions would provide the same features in a more manageable way. For example, it’s often easier to reference an npm package than it is to create a submodule and link to the code. It can make the code more portable and easier to migrate if future demands require it. If you are using submodules, then moving to packages may also be an alternative to migrating the submodules. Some languages (C/C++ in particular) are still relatively new to the package management experience, so submodules may still be the right choice. For most other languages, however, packages are usually the better long-term investment.

Now that you know a bit more about how it works under the covers, I hope you’ll find it easier to manage and migrate your submodules. Happy DevOp’ing!