Consolidating multiple repositories into a Monorepo

A background on the Monorepo concept

At this point, there is very little dispute within the software industry about the benefits of using a Monorepo. There are obviously use cases where this approach doesn’t suit, but on balance, it will often make more sense to use a single repository than many. Some benefits include:

  • Absolute visibility. When a Monorepo is used, any engineer has access to any of the company’s code. Instead of needing to ask another team how something works internally or where to find code, that engineer can immediately look at the internals of another service and figure it out for themselves.

  • Leading on from the previous point, it also means any engineer can easily suggest changes and act with initiative to fix problems they find in other teams’ domains. With multiple repositories, the location of code is often hidden knowledge held by the owning team, which blocks this collaboration.

  • Standardized tooling and conventions are much easier to implement. As per my series on building a complete Kubernetes infrastructure, only one CI system needs to be set up via GitHub Actions, in place of a CI system for each repository.

  • While I don’t have empirical evidence and haven’t looked into it formally, I feel that when both engineers and teams know that their code is visible company-wide, they focus more on quality and readability and stick to company standards rather than team-level norms.

  • Monorepos also allow for better code review, as cross-team reviews are much easier to facilitate, which helps keep styles and standards consistent.

Before jumping in: the question of whether submodules are an alternative comes up quite often when people are looking for a path to a Monorepo-style workflow. I have tried this previously and, in my experience, it is not a good option.

Using submodules takes away the visibility benefit, as new changes are often only seen as an updated hash in the parent repo, and keeping many frequently moving submodules up to date locally requires extra tooling. It also makes any CI system significantly more brittle: all dependencies need to be available and have the proper authorizations set, and each commit to the parent repository must reference child hashes that have already been pushed. It is far too easy to commit a hash to the parent before the child repo has had that hash pushed upstream.

Submodules definitely have a place in some contexts where you want to reduce visibility / noise, and where what is being committed is not a dependency for code in the core Monorepo, but in the context of a Monorepo approach I feel they should be left well alone.

Merging multiple single repositories into a Monorepo

Git natively supports the ability to merge multiple repositories, but it can be difficult and slow. Some approaches include using grafts and filter-branch, but these can break and can be problematic when the previous merge history of a given repository is highly complex and not very linear.

Instead, the git-filter-repo package provides a much quicker and more trustworthy approach to merging multiple existing histories. It can easily be installed via pip3 on macOS with the following:

$ pip3 install git-filter-repo
Collecting git-filter-repo
  Downloading https://files.pythonhosted.org/packages/c7/a3/f5a470387c6b9c6d560b74d6cee21d56d595c970a439e5e702b595e520a8/git_filter_repo-2.28.0-py2.py3-none-any.whl (97kB)
    100% |████████████████████████████████| 102kB 619kB/s
Installing collected packages: git-filter-repo
Successfully installed git-filter-repo-2.28.0

As an example, I’m going to assume that my company owns two products, Vue.js and React. Let’s say both repositories contain a lot of code that I don’t want to import into the Monorepo: tooling and support files which are common to both projects and which I’ll recreate in the Monorepo after I’ve imported both products. So for this example, I really only want what is in the scripts directory of each repo. (Using the same directory name in each is also a good way to show how to get around conflicting file names.)

To start off, I create a new repository which I’ll call simply monorepo and add two commits.

$ mkdir monorepo

$ cd monorepo

$ git init
Initialized empty Git repository in /path/to/monorepo/.git/

$ echo "# Monorepos are great" > README.md

$ git add README.md

$ git commit -m "Add initial commit"
[master (root-commit) e89ea26] Add initial commit
 1 file changed, 1 insertion(+)
 create mode 100644 README.md

$ echo "# Monorepos are practical in many contexts, not all" > README.md

$ git add README.md

$ git commit -m "Change README title to less embellished version"
[master d7d2d73] Change README title to less embellished version
 1 file changed, 1 insertion(+), 1 deletion(-)

$ git log --pretty=oneline
d7d2d737ff4c2aa7d543c37e3b1d92c38dcab55b (HEAD -> master) Change README title to less embellished version
e89ea269cded4eaa521cc922a5cec8aa9355cffd Add initial commit

$ cd ..

Next, I create a local copy of each of the repositories that I’m going to consolidate into the new repository. A fresh clone should be made for this purpose, as the filter-repo package rewrites the repository it is working on, so your local copy is effectively burned in the process. While you’re likely to move to the Monorepo and never use that copy again, a separate clone means you can easily start over if need be.

$ git clone git@github.com:vuejs/vue.git
Cloning into 'vue'...
remote: Enumerating objects: 18, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 56508 (delta 3), reused 0 (delta 0), pack-reused 56490
Receiving objects: 100% (56508/56508), 26.95 MiB | 11.19 MiB/s, done.
Resolving deltas: 100% (39656/39656), done.

$ git clone git@github.com:facebook/react.git
Cloning into 'react'...
remote: Enumerating objects: 182979, done.
remote: Total 182979 (delta 0), reused 0 (delta 0), pack-reused 182979
Receiving objects: 100% (182979/182979), 154.14 MiB | 18.60 MiB/s, done.
Resolving deltas: 100% (127869/127869), done.
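If you already have a working checkout of a project on your machine, the throwaway copy can also be made from that instead of from the remote. The following is a minimal sketch: the directory names are placeholders I made up, and the first few lines only build a small stand-in repository so the snippet runs on its own. As far as I’m aware, filter-repo’s fresh-clone check expects --no-local when the source is a local path, so I pass it here.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"

# Stand-in for an existing working checkout (hypothetical name).
git init -q existing-checkout
git -C existing-checkout -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "Some existing history"

# Make a disposable copy to run the filter against.
# --no-local makes git use its regular transport instead of hard-linking or
# copying object files, so the copy is fully independent of the original.
git clone -q --no-local existing-checkout scratch-copy

# The original stays untouched whatever happens to scratch-copy.
git -C scratch-copy log --oneline
```

If the filter goes wrong, deleting scratch-copy and cloning again is all it takes to start over.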

Now that all three repositories are siblings in the one directory, I can use the filter-repo package via the git command to filter the repository history so that it only includes the files in the scripts directory in each project. The following is for the vue project.

$ cd vue

$ git filter-repo --path scripts/ --path-rename 'scripts':'vue-scripts' --tag-rename '':'vue-'
Parsed 6289 commits
New history written in 0.85 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at 20c5987b chore: remove unused build alias (#9525)
Enumerating objects: 129, done.
Counting objects: 100% (129/129), done.
Delta compression using up to 8 threads
Compressing objects: 100% (67/67), done.
Writing objects: 100% (129/129), done.
Total 129 (delta 54), reused 50 (delta 38)
Completely finished after 1.13 seconds.

$ ls -la
total 0
drwxr-xr-x   4 ianbelcher  staff  128 May  5 12:29 .
drwxr-xr-x   5 ianbelcher  staff  160 May  5 12:22 ..
drwxr-xr-x  14 ianbelcher  staff  448 May  5 12:29 .git
drwxr-xr-x  12 ianbelcher  staff  384 May  5 12:29 vue-scripts

$ cd ..

A few things to note here:

  • The filter runs in place, as previously mentioned. The repository no longer includes any files apart from those that were in the scripts directory.

  • By using the --path-rename switch and providing 'scripts':'vue-scripts', the filter renamed the scripts directory to vue-scripts. This means that we can avoid conflicts when adding the history to the monorepo.

  • By using the --tag-rename switch and providing '':'vue-', tags attached to commits that were not removed are prefixed with vue-. This is another measure to keep as much history as possible without creating conflicts.
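Before merging, it can be worth confirming that the rewritten history really only touches the renamed directory. Here is a rough check using plain git. The helper name paths_outside_prefix is one I made up, and the temporary repository is only a stand-in for the filtered vue clone so that the sketch runs by itself.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print every path ever touched in the history of the
# repository at $1 that does not live under the directory prefix $2.
paths_outside_prefix() {
    git -C "$1" log --all --pretty=format: --name-only \
        | sed '/^$/d' | sort -u | grep -v "^$2/" || true
}

# Stand-in for the filtered clone.
demo=$(mktemp -d)
git init -q "$demo"
mkdir "$demo/vue-scripts"
echo 'echo release' > "$demo/vue-scripts/release.sh"
git -C "$demo" add .
git -C "$demo" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "Add release script"

# Empty output means nothing escaped the prefix.
leftovers=$(paths_outside_prefix "$demo" vue-scripts)
if [ -z "$leftovers" ]; then
    echo "all paths are under vue-scripts/"
else
    echo "unexpected paths:"
    echo "$leftovers"
fi
```

Against the real clone, the call would be paths_outside_prefix vue vue-scripts from the parent directory. Running git tag -l inside the clone is a similarly quick way to eyeball the renamed tags.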

A similar process can then be applied to the React project.

$ cd react

$ git filter-repo --path scripts/ --path-rename 'scripts':'react-scripts' --tag-rename '':'react-'
Parsed 16803 commits
New history written in 2.80 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at 88053ee1b Release script: allow preparing RC from npm
Enumerating objects: 9421, done.
Counting objects: 100% (9421/9421), done.
Delta compression using up to 8 threads
Compressing objects: 100% (2834/2834), done.
Writing objects: 100% (9421/9421), done.
Total 9421 (delta 5126), reused 9333 (delta 5063)
Completely finished after 4.28 seconds.

$ ls -al
total 0
drwxr-xr-x   4 ianbelcher  staff  128 May  5 12:32 .
drwxr-xr-x   5 ianbelcher  staff  160 May  5 12:22 ..
drwxr-xr-x  14 ianbelcher  staff  448 May  5 12:32 .git
drwxr-xr-x  21 ianbelcher  staff  672 May  5 12:32 react-scripts

$ cd ..

The next step is to consolidate these histories into the main repository. This can be done simply by adding each filtered clone as a remote and merging it in with the --allow-unrelated-histories flag.

In the following example, the dev branch of vue and the master branch of react are used.

$ cd monorepo

$ git remote add vue ../vue/

$ git remote add react ../react/

$ git fetch vue
<REDACTED OUTPUT>

$ git fetch react
<REDACTED OUTPUT>

$ git branch vue remotes/vue/dev
Branch 'vue' set up to track remote branch 'dev' from 'vue'.

$ git merge vue -m "Merge vue history" --allow-unrelated-histories
<REDACTED OUTPUT>

$ git branch react remotes/react/master
Branch 'react' set up to track remote branch 'master' from 'react'.

$ git merge react -m "Merge react history" --allow-unrelated-histories
<REDACTED OUTPUT>

$ git remote remove vue

$ git remote remove react

$ git branch -d vue

$ git branch -d react
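The same sequence can be wrapped in a small function when there are more than a couple of repositories to bring in. This is only a sketch under my own assumptions: import_history and make_repo are names I made up, the demo repositories are throwaway stand-ins created in a temporary directory, and the function merges the remote-tracking ref directly rather than creating a local branch first.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: merge branch $3 of the repository at path $2 into the
# current repository, using $1 as a temporary remote name.
import_history() {
    remote_name=$1 source_path=$2 source_branch=$3
    git remote add "$remote_name" "$source_path"
    git fetch -q "$remote_name"
    git merge -q --allow-unrelated-histories \
        -m "Merge $remote_name history" "$remote_name/$source_branch"
    git remote remove "$remote_name"
}

# Demo scaffolding: create a one-commit repository in directory $1 containing
# file $2, on a branch named "main".
make_repo() {
    git init -q "$1"
    git -C "$1" symbolic-ref HEAD refs/heads/main
    mkdir -p "$1/$(dirname "$2")"
    echo "demo" > "$1/$2"
    git -C "$1" add .
    git -C "$1" commit -q -m "Add $2"
}

work=$(mktemp -d)
cd "$work"
make_repo monorepo README.md
make_repo vue vue-scripts/release.sh
make_repo react react-scripts/build.js

cd monorepo
import_history vue ../vue main
import_history react ../react main

# Three root commits joined by two merge commits.
git log --graph --oneline
```

With the real clones, the calls would be import_history vue ../vue dev and import_history react ../react master.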

For the sake of the example, I then make a quick change to the vue-scripts/release.sh file and commit it as well (simply changing the hashbang to /bin/zsh, for no particularly good reason).

$ vi vue-scripts/release.sh

$ git add vue-scripts/release.sh

$ git commit -m "Change hashbang to /bin/zsh"
[master 48093a8] Change hashbang to /bin/zsh
 1 file changed, 1 insertion(+), 1 deletion(-)

At this point, the history tree looks like the following.

[Figure: merging git histories]

Hurrah! At this point, any engineer is able to look at any file in the monorepo and see a complete history of the changes.
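To check this for a particular file, limiting git log to its path should be enough, since the filter rewrote every imported commit to use the new directory name. The snippet below is a small self-contained illustration. The two repositories it builds are stand-ins, the commit messages are invented, and the file name simply mirrors the one used above.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Stand-in for the filtered vue clone, with two commits to one file.
git init -q vue
mkdir vue/vue-scripts
printf '#!/bin/bash\n' > vue/vue-scripts/release.sh
git -C vue add .
git -C vue commit -q -m "Add release script"
printf '#!/bin/bash\nset -e\n' > vue/vue-scripts/release.sh
git -C vue commit -q -am "Fail fast in release script"

# Stand-in for the monorepo, which then pulls that history in.
git init -q monorepo
cd monorepo
git commit -q --allow-empty -m "Add initial commit"
git fetch -q ../vue HEAD
git merge -q --allow-unrelated-histories -m "Merge vue history" FETCH_HEAD

# Every commit that touched the file, including those made before the merge.
git log --oneline -- vue-scripts/release.sh
```

git blame on the same path attributes each line to its original, pre-merge commit in the same way.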

Moving forward, the monorepo can be used as the single source of truth for all code, and the individual repositories can be archived.

It is also worth mentioning that this method still works without the filter-repo step. If a repository doesn’t have any filenames that conflict with what is already in the Monorepo, adding the remote and merging with the --allow-unrelated-histories switch will still work. You may be inclined to do this and then move the files as needed in one large commit. This is also a viable option.
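As a rough illustration of that alternative, the following merges an unfiltered repository and then moves its files into a subdirectory in one commit. Everything named here (app, app-a, package.json, src/index.js) is made up for the example. One trade-off worth knowing: because the older commits still refer to the original paths, git log needs --follow to trace a file back past the move.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Stand-in for an unfiltered project with files at its top level.
git init -q app
mkdir app/src
echo '{ "name": "app-a" }' > app/package.json
echo 'console.log("hello")' > app/src/index.js
git -C app add .
git -C app commit -q -m "Add app"

# Stand-in for the monorepo.
git init -q monorepo
cd monorepo
echo "# Monorepo" > README.md
git add README.md
git commit -q -m "Add initial commit"

# Merge the unrelated history as-is; nothing conflicts with README.md.
git fetch -q ../app HEAD
git merge -q --allow-unrelated-histories -m "Merge app history" FETCH_HEAD

# Move the imported files into their own directory in a single commit.
mkdir app-a
git mv package.json src app-a/
git commit -q -m "Move app into app-a/"

# --follow traces the file through the rename back to its first commit.
git log --follow --oneline -- app-a/package.json
```

Rewriting the paths with filter-repo beforehand avoids the need for --follow, which is one reason I lean towards the filtered approach when the history matters.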

In summary, keeping the histories of multiple single projects when consolidating into a Monorepo is not that difficult a task, and it’s worth the time and effort if your company can gain the benefits.
