Eugene Kirpichov (antilamer) wrote,

In which I REALLY convert a git subdirectory to submodule

...And make it occupy no more disk space than it's supposed to.

So, the other day (actually today) I was converting a repo with a couple dozen subdirectories into a couple dozen submodules.

The guide one usually finds for this is a good start, but insufficient.

Let's follow that guide and see what's wrong with it:

$ ls openstack-copy/
README     concat     git        horizon    lvm        mysql      openstack  selinux    swift      xinetd
apt        examples   glance     keepalived memcached  network    rabbitmq   ssh        sysctl
common     galera     haproxy    keystone   mmm        nova       rsync      stdlib     vcsrepo

$ git clone --no-hardlinks openstack-copy rabbitmq
Cloning into 'rabbitmq'...
done.

$ cd rabbitmq

$ git filter-branch --prune-empty --subdirectory-filter rabbitmq HEAD -- --all
Rewrite 1bc3a45889f9670c05e8db17d520895c0ec347be (4/4)
Ref 'refs/heads/master' was rewritten

$ git reset --hard
HEAD is now at ad041bd mysql,nova,horizon centos integration

$ git gc --aggressive 
Counting objects: 1700, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (1495/1495), done.
Writing objects: 100% (1700/1700), done.
Total 1700 (delta 549), reused 0 (delta 0)

$ git prune

$ git remote rm origin


So far so good. But let's see how much space this thing takes up:
$ du -s .
37048   .

$ du -s .git
36224   .git


WTF has just happened? We see that the .git folder occupies the vast majority of space. Let's see what exactly occupies so much:

$ git verify-pack -v .git/objects/pack/pack-*.idx | sort -k3n | tail -5 
4a28b72f832b0b738a5ea30cad7f8897e05845d0 blob   55937 54729 458412
3ecaa00c9be1406156985367bc53c1ac06319e64 blob   99476 22611 228942
6ef6d4ff148f4bbba71800a8941c26f25bd393d6 blob   151912 149718 308694
fb7bfac9de30d1bcc5610861679c1ee12023ab1b blob   1444026 1436024 683829
f2298756510da416fd53370f637f2ab0b62bb76f blob   16490552 16281402 2119853


Apparently we have a horrendous blob that occupies 16 megabytes. What is it, exactly?

$ git rev-list --all | xargs -n1 -IX sh -c "git ls-tree -r X \
  | grep f2298756510da416fd53370f637f2ab0b62bb76f && echo X"
100644 blob f2298756510da416fd53370f637f2ab0b62bb76f    galera/files/mysql-server-wsrep-5.5.23-23.6-amd64.deb
8450a3cddc3f02753b25abbe92b5ea529b50e3a3
... (same repeated multiple times for different commits)
100644 blob f2298756510da416fd53370f637f2ab0b62bb76f    galera/files/mysql-server-wsrep-5.5.23-23.6-amd64.deb
19a93f30f5949acffde926c952a5e85d4ae5c2e3
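
A possibly quicker route to the same answer: the sketch below lists the largest blobs together with their paths in one pass, so the size lookup and the path lookup don't need two separate commands. The name largest_blobs is made up for illustration, and it assumes a git new enough to support a custom format in git cat-file --batch-check. It only walks history reachable from refs (rev-list --all), not from the reflog.

```shell
# Print the N largest blobs reachable from any ref, smallest first,
# as "blob <size> <sha> <path>". N defaults to 5.
largest_blobs() {
  n="${1:-5}"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    awk '$1 == "blob"' |
    sort -k2,2n |
    tail -n "$n"
}
```

Running largest_blobs 5 inside the cloned repository should put the .deb file on the last line, path included.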


Whew, it's a debian package of Galera! But WTF is it doing in our filtered history for the RabbitMQ module?

The previous rev-list command actually showed the commits in reverse chronological order, so the last commit, 19a93f, is the one which introduced this file. Let's see what kind of commit this is, and why it wasn't wiped from our history by git gc.

$ git show --stat 19a93f
commit 19a93f30f5949acffde926c952a5e85d4ae5c2e3
Author: Eugene Kirpichov 
Date:   Fri Sep 7 17:38:58 2012 -0700

    Initial commit

 apt/.fixtures.yml                                  |    5 +
 apt/.gemfile                                       |    5 +
... (whole repo actually, including this .deb file - this was an initial import of a non-VCSd folder)


So, this is the initial commit, in its unfiltered form. Why wasn't it wiped from the history by git gc? Here's why:

$ git describe --all --contains 19a93f
original/refs/heads/master~58


Very nice. This commit is the 58th-generation ancestor of refs/original/refs/heads/master, i.e. the backup of the pre-rewrite master branch that git filter-branch saves under refs/original/. So the commit is still referenced from there, and git gc therefore counts it as reachable.
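
git describe only names one ref. To list every ref that still reaches a given commit, something like the following should do; refs_containing is a hypothetical helper name, and for-each-ref --contains needs a fairly recent git.

```shell
# Print the full name of every ref whose history includes commit $1.
refs_containing() {
  git for-each-ref --contains "$1" --format='%(refname)'
}
```

For example, refs_containing 19a93f would be expected to print refs/original/refs/heads/master and the refs/remotes/origin entries, but not the rewritten refs/heads/master.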

What other refs have we got? Quite a few.

$ cat .git/info/refs
ad041bd07785cb52e448065fa9d0a7125a780e73        refs/heads/master
67cb7d9fa852bbb5bc4e62cf186d3fc4a29c0ee8        refs/original/refs/heads/master
67cb7d9fa852bbb5bc4e62cf186d3fc4a29c0ee8        refs/remotes/origin/HEAD
67cb7d9fa852bbb5bc4e62cf186d3fc4a29c0ee8        refs/remotes/origin/master

$ cat .git/packed-refs 
# pack-refs with: peeled 
ad041bd07785cb52e448065fa9d0a7125a780e73 refs/heads/master
67cb7d9fa852bbb5bc4e62cf186d3fc4a29c0ee8 refs/original/refs/heads/master


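At first I had no idea what the "refs/remotes/origin" references were doing in .git/info/refs, given that I've removed the remote origin. As far as I can tell, that file is only a listing written by git update-server-info (which the earlier git gc runs), so it dates from before git remote rm origin and is simply stale. Note that packed-refs no longer mentions them.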
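
Rather than reading files under .git, it is probably safer to ask git which refs it currently considers live. A minimal sketch (list_live_refs is an illustrative name, not a git command):

```shell
# Print "<sha> <refname>" for every ref git itself knows about,
# whether it is stored loose or in packed-refs.
list_live_refs() {
  git for-each-ref --format='%(objectname) %(refname)'
}
```

Anything printed here besides refs/heads/master is a candidate for keeping the old history alive.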

Basically, I want to remove all refs except refs/heads/master. I've also no idea how to do this automatically, so I just removed them by hand. Read more about git reachability - it also includes e.g. all branches and tags.
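
One way this could be automated, instead of editing the files by hand, is to delete the unwanted refs with git update-ref. This is a sketch under assumptions: drop_other_refs is a hypothetical helper, and it deletes tags and other branches too, so it only makes sense in a throwaway clone like this one.

```shell
# Delete every ref except the one named in $1 (default: refs/heads/master).
# --no-deref makes symbolic refs such as refs/remotes/origin/HEAD get
# deleted themselves instead of the ref they point at.
drop_other_refs() {
  keep="${1:-refs/heads/master}"
  # Ref names cannot contain whitespace or glob characters,
  # so plain word splitting is safe here.
  for ref in $(git for-each-ref --format='%(refname)'); do
    [ "$ref" = "$keep" ] || git update-ref --no-deref -d "$ref"
  done
}
```

Because update-ref goes through git's own ref storage, it handles both loose refs and packed-refs.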

$ vim .git/info/refs # remove all except refs/heads/master

$ vim .git/packed-refs # remove all except refs/heads/master


Honestly, there are also references from the reflog - I discovered that by guesswork, and I don't know how to show you a command that will expose this explicitly. Let's prune the reflog too.
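
One candidate for such a command is git fsck: with --no-reflogs it stops treating reflog entries as roots, and with --unreachable it prints what is then left unreachable. A sketch, with reflog_only_commits being a made-up name; note it will also list commits that are unreachable for any other reason.

```shell
# Print the IDs of commits that no ref, HEAD or index entry reaches,
# i.e. commits kept alive (if at all) only by the reflog.
reflog_only_commits() {
  git fsck --unreachable --no-reflogs 2>/dev/null |
    awk '$1 == "unreachable" && $2 == "commit" { print $3 }'
}
```

If this prints anything before git reflog expire and nothing afterwards plus a git gc --prune=now, the reflog was what kept those commits around.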

$ git reflog --all
ad041bd refs/heads/master@{0}: filter-branch: rewrite
67cb7d9 refs/heads/master@{1}: clone: from /Users/jkff/projects/work/splitting-openstack/openstack-copy

$ git reflog expire --all --expire=now

$ git reflog --all

$ 


Now what?

$ git gc --aggressive --prune=now
Counting objects: 95, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (77/77), done.
Writing objects: 100% (95/95), done.
Total 95 (delta 14), reused 80 (delta 0)

$ du -s .
1432    .


Yay! The huge file actually became unreachable and was collected by git gc.

Now our "submodule" is compacted and we can proceed to extract it:
$ git remote add origin gitolite@gitolite.mirantis.com:openstack/deployment/puppet/rabbitmq.git

$ git push origin master
Counting objects: 95, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (63/63), done.
Writing objects: 100% (95/95), 222.02 KiB, done.
Total 95 (delta 14), reused 95 (delta 14)
To gitolite@gitolite.mirantis.com:openstack/deployment/puppet/rabbitmq.git
 * [new branch]      master -> master

$ cd ../openstack-fresh/ # An empty directory where I'll put all the submodules

$ git init
Initialized empty Git repository in /Users/jkff/projects/work/splitting-openstack/openstack-fresh/.git/

$ git submodule init

$ git submodule add gitolite@gitolite.mirantis.com:openstack/deployment/puppet/rabbitmq.git rabbitmq
Cloning into 'rabbitmq'...
remote: Counting objects: 95, done.
remote: Compressing objects: 100% (77/77), done.
remote: Total 95 (delta 14), reused 0 (delta 0)
Receiving objects: 100% (95/95), 222.03 KiB | 39 KiB/s, done.
Resolving deltas: 100% (14/14), done.

$ git status
# On branch master
#
# Initial commit
#
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#
#       new file:   .gitmodules
#       new file:   rabbitmq
#

$ git commit -a -m 'Added RabbitMQ submodule'


P.S. I suspect that at some point in all this I could just have git clone'd my submodule and it would probably just pull the needed history, without all the verify-pack stuff. But I tried it right after the first git gc and it didn't help. I'll have to check it later.

P.P.S. Full MINIMAL sequence of commands:

$ git clone openstack-copy/ apt
Cloning into 'apt'...
done.

$ cd apt

$ git filter-branch --prune-empty --subdirectory-filter apt HEAD -- --all
Rewrite 19a93f30f5949acffde926c952a5e85d4ae5c2e3 (1/1)
Ref 'refs/heads/master' was rewritten

$ git reset --hard

$ git remote rm origin

$ sleep 1 # THIS IS NECESSARY in a script! 
$ # Apparently git reflog expire has a precision of 1 second
$ # and fails about 50% of the time if you don't sleep here, preserving
$ # the "clone from original" reflog entry.

$ git reflog expire --all --expire=now

$ git gc --aggressive --prune=now
Counting objects: 1699, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (1494/1494), done.
Writing objects: 100% (1699/1699), done.
Total 1699 (delta 550), reused 0 (delta 0)

$ grep -E '[[:space:]]refs/heads/master$' .git/info/refs > .git/info/refs.new
$ mv .git/info/refs.new .git/info/refs

$ grep ' refs/heads/master' .git/packed-refs > .git/packed-refs.new
$ mv .git/packed-refs.new .git/packed-refs

$ git gc --aggressive --prune=now
Counting objects: 63, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (56/56), done.
Writing objects: 100% (63/63), done.
Total 63 (delta 2), reused 59 (delta 0)
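
With a couple dozen subdirectories to convert, the sequence above is worth wrapping in a function. The following is a sketch, not a tested production script: split_subdir and its arguments are hypothetical, it removes refs with git update-ref rather than by editing files, and it deletes every ref except the current branch (tags included). FILTER_BRANCH_SQUELCH_WARNING is only there to skip the deprecation pause in newer git versions; older ones ignore it.

```shell
# split_subdir <source-repo> <subdir> <dest>
# Clone <source-repo> into <dest>, keep only the history of <subdir>,
# and drop everything that would keep the old objects alive.
split_subdir() {
  src=$1 subdir=$2 dest=$3
  git clone -q --no-hardlinks "$src" "$dest" || return 1
  (
    cd "$dest" || exit 1
    FILTER_BRANCH_SQUELCH_WARNING=1 \
      git filter-branch --prune-empty --subdirectory-filter "$subdir" HEAD || exit 1
    git reset -q --hard
    git remote rm origin
    # Remove refs/original/* and anything else except the current branch.
    branch=$(git symbolic-ref HEAD)
    for ref in $(git for-each-ref --format='%(refname)'); do
      [ "$ref" = "$branch" ] || git update-ref --no-deref -d "$ref"
    done
    sleep 1  # same workaround as above for the 1-second reflog precision
    git reflog expire --all --expire=now
    git gc -q --aggressive --prune=now
  )
}

# Example (directory names taken from the listing at the top):
#   for d in rabbitmq mysql nova; do split_subdir openstack-copy "$d" "$d"; done
```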


P.P.P.S. See http://www.kernel.org/pub/software/scm/git/docs/git-filter-branch.html - "Checklist for Shrinking a Repository" - however for me, this didn't really remove the refs.