Ken Muse

The Ultimate Tips for Working With Large Git Monorepos


If your company has been doing development for a long time, then you may be dealing with a large monorepo. A monorepo is a single repository that contains multiple projects. Monorepos are popular because they make it easier to share code between projects and enforce consistent coding standards. They can work well for some code, especially when the projects are released and versioned together as a single unit. Unfortunately, as the size of the monorepo grows, it becomes more difficult to manage and is likely to have performance issues. Thankfully, there are some things you can do to reduce these issues.

The root of the issue

The core problem is that Git fundamentally only understands the concept of a complete repository. Updating a file in Git requires more than capturing the updated file content. It also involves creating a snapshot of the hashes and names of all of the files and folders in the repository. As the number of items grows, the time needed to catalog these changes increases. Complicating this challenge, Git has to detect which files in the working directory have changed. This can require scanning the entire folder structure to determine what has changed. While this can be optimized a bit by using .gitignore to exclude a folder and its contents, this behavior can still lead to performance issues.
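
To get a feel for what that snapshot looks like, you can inspect the tree behind any commit using standard Git plumbing commands (the output will, of course, depend on your repository):

```bash
# Show the snapshot behind the latest commit: one entry (mode, type,
# hash, and name) per file or folder in the repository root
git ls-tree HEAD

# Recursively list every tracked path and its blob hash
git ls-tree -r HEAD
```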

On Windows and macOS, Git provides a file monitoring component, fsmonitor, that can help reduce the time it takes to determine what has changed in the repository. Linux users will need to use a third-party tool like Watchman to get similar functionality. To enable it, run git config core.fsmonitor true. The next Git operation will start the monitor. It works by listening for notifications from the operating system for any files that have changed in the Git repository’s folder tree. When a change is reported, Git will evaluate the modified files and update its state, avoiding a scan. To maximize the benefits from this setting, be sure to also set core.untrackedCache to true. This setting allows Git to cache the state of untracked files within the repository folder. With these two features enabled, Git can avoid scanning the entire repository for changes, which can significantly improve performance. These features also work with .gitignore. If a folder is excluded from the repository, Git will ignore the notifications for files in that folder. With this enabled, Git will still need to scan the folders when the repository is cloned (to get the initial state) or if the fsmonitor daemon is restarted.
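
As a minimal sketch, enabling both settings looks like this (run the commands inside the repository; on Linux, core.fsmonitor would instead point at a Watchman hook script, as noted later):

```bash
# Start the built-in file system monitor on the next Git command
# (available in the Windows and macOS builds of Git)
git config core.fsmonitor true

# Cache the state of untracked files so full scans can be skipped
git config core.untrackedCache true

# Optionally, confirm the daemon is running after the next `git status`
git fsmonitor--daemon status
```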

Networks make it worse

One of the hardest things you can ask a computer to do is work with large numbers of small files. The nature of this process can substantially slow down the performance of an entire machine. This is one reason that most big data platforms try to merge numerous small files into a single larger file: it requires less effort and CPU. Git does something similar during its garbage collection process. It merges the numerous small "loose" object files that have accumulated into a single pack file.
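
You can observe this in any repository with a couple of standard Git commands: one reports how many loose objects exist, and the other repacks them.

```bash
# Count loose objects and existing pack files
git count-objects -v

# Repack loose objects into pack files (this normally happens
# automatically via `git gc --auto` once thresholds are exceeded)
git gc
```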

I/O operations generally suffer from performance issues when dealing with small data sizes. Unfortunately, this also applies to network operations, which normally makes network shares a bad idea for Git repositories. While a share may provide a boost to the available storage, it comes at a cost: many small requests are more challenging to handle than reading and streaming data from a few larger files. I’ve seen Git repositories on a network share take 10x-300x longer to prepare after a clone due to the mix of small network operations and small disk I/O.

It might seem like fsmonitor can help with some of this. However, support for network-mounted repositories is considered experimental and typically won’t work with many network shares. There are a number of technical reasons for this limitation, including the challenge of getting file change notifications from networked shares. Suffice it to say, it’s best to avoid network shares for your repos!

Partial and sparse checkouts

Git has some other options that can help with large monorepos. The first of these is the partial clone. A partial clone is a way to clone a repository without getting all of the blobs or trees that make up the history. This is done by using the --filter option when cloning (i.e., git clone --filter=blob:none https://path/to/repo.git). Using a blobless clone instructs Git to download all of the commits and trees for the repository while avoiding any blob that isn’t needed for the current operation (such as the checkout of the branch). This can substantially reduce the amount of data being transferred.
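
For example, a blobless clone (and, for comparison, the even more aggressive treeless clone) might look like this; the URL is just a placeholder:

```bash
# Blobless clone: download commits and trees, defer blobs until needed
git clone --filter=blob:none https://example.com/big-monorepo.git

# Treeless clone: defer trees as well; fast to clone, but later
# operations may trigger many on-demand downloads
git clone --filter=tree:0 https://example.com/big-monorepo.git
```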

Another option is to use a sparse checkout. This instructs Git to clone the repository, but only check out and manage the root directory of the repository by default. From there, you can add cones. This allows you to work with specific folders (and their children). This eliminates the need to extract every available file and folder into the working directory. Since there are fewer files and folders to scan, it improves the performance of Git. You can even combine this with a partial clone to get the best of both worlds. For example, you can invoke git clone --sparse --filter=blob:none {repository-url} to get a sparse clone whose working directory only contains the files from the root directory. From there, git sparse-checkout add {folder} can be used to add specific folders to the working directory (downloading the needed blobs as necessary). This can be a great way to work with a large monorepo without having to download the entire repository or extract all of its folders to the file system.
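
Putting those pieces together, a typical workflow might look like the following (the repository URL and folder names are only illustrative):

```bash
# Sparse, blobless clone: only files in the repository root are
# checked out initially
git clone --sparse --filter=blob:none https://example.com/big-monorepo.git
cd big-monorepo

# Add the folders (cones) you actually need; missing blobs are
# downloaded on demand
git sparse-checkout add services/payments shared/libs

# Review which directories are currently part of the sparse checkout
git sparse-checkout list
```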

Tweak the settings

If you have a big repository, I recommend that you consider setting the following values in your .gitconfig (manually or by calling git config). These settings can help improve the performance of Git when working with large repositories.

feature.manyFiles = true
    Sets core.untrackedCache=true, index.version=4, and index.skipHash=true. This enables caching the details for untracked files to minimize scan time and allows Git to skip computing the trailing hash of the index file. In addition, it enables the newest version of the index format. This format supports pathname compression, which can reduce the index size by 30-50%. It has been available since version 1.8.0 (2012), but for compatibility the index version defaults to v2 (or sometimes v3, depending on the features being used).

core.commitGraph = true
    Required for commit graph writes (see fetch.writeCommitGraph). This is already set to true in the latest versions of Git.

fetch.writeCommitGraph = true
    Improves performance for merges and Git history. There’s a great post on Microsoft Learn that explains this feature.

core.fsmonitor = true
    Enables OS file change monitoring. On Linux, set this to the path of a hook script to invoke instead, such as the provided .git/hooks/fsmonitor-watchman.sample (which relies on Watchman).

checkout.workers = 0
    A value less than 1 tells Git to create checkout workers automatically based on the CPU core count; the default is 1 worker. The setting applies when a checkout would involve at least 100 files (or the value configured in checkout.thresholdForParallelism).

pack.writeReverseIndex = true
    Enables reverse index lookups. Reverse indexes are enabled by default starting with Git 2.41.
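
If you would rather script these settings than edit .gitconfig by hand, a minimal sketch looks like this (add --global to apply them to every repository for the current user):

```bash
git config feature.manyFiles true
git config core.commitGraph true
git config fetch.writeCommitGraph true
git config core.fsmonitor true
git config checkout.workers 0
git config pack.writeReverseIndex true
```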

Get rid of some of the bloat

If your repository has binary files (especially large ones), then it can add more overhead to the process. Git has a number of algorithms it can use for compressing data stored in its packfiles. The most efficient of these store the difference (or delta) between versions of files. Unfortunately, delta compression rarely works well for binary files, so Git often ends up storing what amounts to a complete copy of the file each time it changes. Even small files can add up quickly over time! To eliminate this problem, consider using Git LFS to store your binary data. LFS stores the binary files outside of the repository and only stores a pointer to the file in the repository. This can significantly reduce the size of the repository and the time it takes to clone it. In addition, it reduces the size of the Git object database. In many cases, it’s easier to start using LFS for new commits than it is to migrate the old binary files to LFS. While it may be worth the effort to migrate them to LFS (and often is with larger repos), it changes the history of the repository and the SHA hashes for the commits. This can break any references to the old commits (such as pull requests). If you do decide to migrate to LFS, be sure to communicate the change to your team and plan for the migration to avoid any issues!
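
As a sketch, adopting LFS for new commits might look like this (the file patterns are just examples, and the migrate step is the optional, history-rewriting path discussed above):

```bash
# One-time setup, then track binary patterns going forward
git lfs install
git lfs track "*.psd" "*.mp4"
git add .gitattributes
git commit -m "Track large binaries with Git LFS"

# Optional: rewrite existing history to move old binaries into LFS.
# This changes commit SHAs, so coordinate with your team first!
git lfs migrate import --include="*.psd,*.mp4" --everything
```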

That other option

With modern Git, the best practice is to move towards poly-repos. That is, consider breaking a monorepo up into multiple smaller repositories. These repositories can create binary packages that are managed using a package management solution (such as Azure Artifacts, GitHub Packages, or Artifactory). Beyond reducing the size of a monorepo, it has other benefits. Individual components can be unit tested and released whenever bug fixes or feature changes are required. Since the components are separate, the individual build and test times are substantially smaller. In addition, it opens up opportunities for reusing the code in other projects. This can be a great way to improve the performance of your development process and reduce the overhead of managing a large monorepo.

It’s also not something you have to do all at once. You can start by moving the components that are the most independent and have the most reuse. This can help you get a feel for the process and the benefits of poly-repos. As you get more comfortable, you can move more components to their own repositories, gradually reducing the size of the monorepo while improving the performance of your development process.
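
There are several ways to carve a component out of a monorepo while preserving its history. One approach uses git subtree, which ships with most Git installations; the folder name and destination URL below are only placeholders:

```bash
# Split the component's folder (and its history) into its own branch
git subtree split --prefix=services/payments -b payments-only

# Push that branch to a brand-new repository as its main branch
git push https://example.com/payments.git payments-only:main
```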

Stay up to date

If you want the best performance, keep your version of Git up to date. Beyond the security benefits, newer versions of Git include additional performance enhancements. As these get validated, they are enabled by default. I also recommend that you keep an eye on GitHub’s Git blog. Each time a new version of Git is released, the team summarizes the highlights from the release, including new features that can improve performance. For example, at the time of this writing, the Highlights from Git 2.46 detail the new pseudo-merge bitmaps, explaining how they work and how to enable them now. It’s a great way to stay up-to-date with the latest recommendations and settings. It’s also a great way to continue to take advantage of improvements to Git to make it easier to manage your large monorepo.