leah blogs: Git

05jun2018 · GitHub: quo vadis?

As GitHub user #139 I feel compelled to say something about GitHub getting bought by Microsoft.

I still remember, back when GitHub was founded, I was both thrilled and frightened. Thrilled, because the three founders managed to bootstrap a startup the right way: profitable from day one, focused on a single, successful product that did what people wanted. I told myself that if I’d ever do a startup, I’d do it like them. (Yet they took venture capital in 2012 and 2015 and as a result grew from 50 to over 800 employees by now, operating at a loss at least in 2016 until they changed their business plans.)

Why was I frightened? I saw GitHub was immediately getting very successful, and many people, especially in the Ruby community, moved their projects to it, creating both a monopoly and a single point of failure.

One of my most successful projects, Rack, was converted from Darcs to Git in May 2008. I put it on GitHub (which was only a few months old by then) about that time, but I also provided my own Git mirror on my own infrastructure. However, development quickly shifted to GitHub only, mostly because pull requests and issues were very convenient. Over time, my skepticism vanished, using GitHub was a no-brainer, and while occasional outages reminded us of the central position GitHub is in, we didn’t do anything.

Let me emphasize a few ways GitHub vastly improved my own open source work: finally, it was easy to report issues for many projects, without having to register yet another Bugzilla account, and issues could easily link issues at other projects. GitHub made it simple to quickly look into the actual source of many projects in a straight-forward way, without having to figure out CVS checkouts or fetch tarballs. It was easy to see which people contribute to which projects, and I discovered some cool projects this way.

So, now they are getting bought by Microsoft. I’m sad that they are getting bought at all, because I think it’s very important that such a central piece of the open source community stays independent of major software vendors. As for getting bought by Microsoft, I cannot share the enthusiasm many have: it is still a huge company that makes its profits primarily from proprietary, closed-source software and vendor lock-in, and while their management certainly changed a lot in the last decade, who knows how long this will last. Worse buyers are easily imaginable, however.

It is therefore sensible to think of alternatives to GitHub. Contrary to many, I don’t think switching to alternative offerings such as GitLab.com, BitBucket or SourceForge significantly improves the situation: while GitHub’s monopoly could get whittled down, we are still dependent on another for-profit company that is likely to be acquired by a major corporation sooner or later, too.

As for my own projects, I plan to move most of my recent projects (the so-called leahutils) from GitHub to a self-hosted solution. The details are not fixed yet, but I have enough experience with GitLab that I’m sure I’ll use something else. For these projects, I also have different needs compared to what GitHub offered: I’m often the sole committer, and I prefer receiving patches by mail and refining them myself rather than telling people how to improve their pull requests. So likely, I’ll set up a mix of cgit and public-inbox, and adopt a quite different workflow.

Other projects I’m involved in, most importantly VoidLinux, are far more dependent on outside contributions and having access to CI infrastructure, which already makes it hard to move away from GitHub and Travis. For now we’ve decided to stick to GitHub, as there are more pressing issues currently, and we don’t expect GitHub to go haywire anytime soon. Still, our autonomy as an open source project is something we need to think about more often and take care of.

NP: Julia Weldon—Comatose Hope

03dec2014 · Recovering a Git repository from filesystem corruption

Recently I had to fix a Git repository where something unfortunate happened: probably due to accessing an NTFS partition that was still mounted in a hibernated alternative operating system, several files became corrupted (and actually had their contents exchanged with different files on the disk!).

git fsck discovered corrupted blobs, which we tried to recover first, until we found that their content did not make sense at all. These blobs were irrevocably lost, but we still wanted to get the rest of the Git history out alive.

Usually, applying the following technique is not necessary, because you can either just clone again from your Git upstream or recover the repository from the last backup, but neither existed in this case.

The corrupted blobs actually all belonged to a single commit that happened a few months ago. The solution was thus to remove this commit from the history, keeping all other trees intact (of course, commit ids would change, but the content won’t).

I first tried to do this with git rebase, but it is of course the wrong tool, since it will try to remove the changes of the defect commit from all following history.

Finally, I had a use-case for git filter-branch. To make it short, we can filter out the defect commit using:

git filter-branch --commit-filter '
  # badbadbadbad stands for the full id of the defect commit
  if [ "$GIT_COMMIT" = badbadbadbad ]; then
    skip_commit "$@"
  else
    git commit-tree "$@"
  fi'

This will rewrite all commits after badbadbadbad, but not touch their actual content.
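If you want to convince yourself that only the commit ids change, the technique can be tried on a throwaway repository first. The following is a minimal sketch, not the original session: the scratch directory, the file names and the choice of the middle commit as the "defect" are made up for illustration, and FILTER_BRANCH_SQUELCH_WARNING merely silences the notice that newer Git versions print before running filter-branch.

```shell
#!/bin/sh
# Throwaway demo: drop one commit with filter-branch, keep the other trees intact.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning newer Git versions print

demo=$(mktemp -d)                        # scratch repository, not a real project
cd "$demo"
git init -q .
git config user.name demo
git config user.email demo@example.invalid

for n in 1 2 3; do
  echo "content $n" > "file$n"
  git add "file$n"
  git commit -q -m "commit $n"
done

bad=$(git rev-parse HEAD~1)              # pretend the middle commit is the defect one
tree_before=$(git rev-parse 'HEAD^{tree}')

git filter-branch --commit-filter '
  if [ "$GIT_COMMIT" = "'"$bad"'" ]; then
    skip_commit "$@"
  else
    git commit-tree "$@"
  fi' HEAD >/dev/null 2>&1

tree_after=$(git rev-parse 'HEAD^{tree}')
count=$(git rev-list --count HEAD)
echo "commits left: $count"
echo "tree before:  $tree_before"
echo "tree after:   $tree_after"
```

The skipped commit disappears from the history, while the tree of the newest commit should be the same object before and after the rewrite.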

git fsck was still not happy, so we made a clean copy using

git clone --no-local --no-hardlinks mybrokengit myfixedgit

Now git fsck reported no errors and all other revisions were still ok. (Also, the blobs have been packed, so the next data corruption will be more fatal… ;))

I cannot think of any other version control system where a repair like this would have been possible. Thanks, Git!

NP: Against Me!—Exhaustion & Disgust

06jan2013 · A grab bag of Git tricks

Since its release I’ve been a fan of Git. (I still can remember downloading the initial version.) The thing I like most is that it can be extended and customized in a unixy way. Over time, I have collected some scripts and tricks that I would like to present to a wider audience. Git information online abounds (I especially recommend Mark J. Dominus’ in-depth posts on Git), thus I will only show stuff I haven’t seen elsewhere.

git news

Let’s start with a simple alias which you can simply add to your .gitconfig:

[alias]
        news = log -p HEAD@{1}..HEAD@{0}

I am tracking quite a lot of open source projects by cloning them into ~/src and running git pull on them occasionally. Next, I run git news and see only the commits (with diff) that have arrived since the last pull.

Of course it is a very simplistic alias and it probably won’t do what you want if you actually change the HEAD yourself, e.g. by committing. (A more robust version could, for example, parse the output of git reflog and search for the last pull.) On the other hand, as it is, it also can be useful for showing what came in with a merge. I also use it for repositories into which I git cvsimport, with the same benefits.
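The "more robust version" could look roughly like the following. This is only a sketch under assumptions: the function name news_since_pull is made up, it relies on reflog subjects of pulls starting with the word "pull" and on the default HEAD@{n} selector format, and it has merge-style (fast-forward) pulls in mind rather than pull --rebase.

```shell
# news_since_pull [LOG-OPTIONS]: hypothetical, more robust variant of "git news".
# Finds the newest reflog entry written by "git pull" and shows what it brought in,
# no matter how many commits or checkouts happened afterwards.
news_since_pull() {
  # %gd is the reflog selector (HEAD@{n}), %gs the reflog subject ("pull: Fast-forward", ...)
  n=$(git reflog --format='%gd %gs' HEAD |
      sed -n 's/^HEAD@{\([0-9]*\)} pull.*/\1/p' | head -n 1)
  if [ -z "$n" ]; then
    echo "no pull found in the reflog" >&2
    return 1
  fi
  # The entry right before the pull is the old state, the pull entry the new one.
  git log "$@" "HEAD@{$((n + 1))}..HEAD@{$n}"
}
```

Called as news_since_pull -p, it gives the same kind of output as the alias, but keeps working after you committed on top of the pull.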

git comma

Admittedly, I’m a fan of dirty working trees, which is why—when I don’t use magit or finely-grained git add -p/git commit -p already—I commit whole files at once like git commit foo.c bar.c.

One thing that has always annoyed me is that I cannot git commit files unknown to Git, enforcing an explicit git add step only for these new files! One day I took the plunge and wrote git-comma (a portmanteau of commit and add) which does its best to behave exactly like git commit except for adding the yet-unknown files beforehand. This was a bit more tricky than I expected because I wanted it to work correctly even in the face of partially staged files, thus a stupid git add on all arguments would not work (also, you only want to add explicitly named files, not whole directories and so on). Finally, git comma tries to clean up properly if you decide to abort the commit, unstaging the files again.

(IMO, this should be a flag or configuration option for git commit.)
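To illustrate the idea (this is not the actual git-comma script), here is a much simplified sketch. The function name comma_sketch and its calling convention, with the commit message as first argument, are made up; the real tool mimics the git commit command line instead.

```shell
# comma_sketch MESSAGE FILE...: hypothetical, simplified take on the git-comma idea.
comma_sketch() {
  msg=$1; shift
  newlist=$(mktemp) || return 1           # remembers which files we added ourselves
  for f in "$@"; do
    # Only explicitly named regular files that Git does not know yet are added;
    # tracked files (even partially staged ones) and directories are left alone.
    if [ -f "$f" ] && ! git ls-files --error-unmatch -- "$f" >/dev/null 2>&1; then
      git add -- "$f" && printf '%s\n' "$f" >>"$newlist"
    fi
  done
  if git commit -q -m "$msg" -- "$@"; then
    status=0
  else
    status=$?
    # The commit failed or was aborted: unstage the files added above again.
    while IFS= read -r f; do
      git reset -q -- "$f"
    done <"$newlist"
  fi
  rm -f "$newlist"
  return "$status"
}
```

For example, comma_sketch "Add bar" foo.c bar.c commits the modified foo.c together with a brand-new bar.c in one step.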

git attic

A newer script, but a very useful one, is git attic, whose name perhaps gives you a shiver down the spine, being reminded of this CVS quirk.

Yet, CVS’ manner with deleted files—moving them into a folder called Attic—had one benefit which cannot be denied: it was easy to see what had been removed and to access the contents again.

Of course, Git has no problem with file removal, but having a look at the old contents can be laborious.

Thus I wrote git-attic, which presents you a nice list of files together with their deletion date:

% git attic
2012-08-14 441e782^:Etc/ChangeLog-5.0
2012-05-31 0793393^:Completion/Unix/Command/_systemctl
2012-01-31 6a364de^:Test/Y04compgen.ztst
2012-01-31 6a364de^:Test/compgentest
2011-08-18 f0eaa57^:Completion/Zsh/Command/_schedtool
...

The output is designed to be copy’n’pasted: Pass the second field to git show to display the file contents, or just select the hash without ^ to see the commit where removal happened.

(By default, I don’t detect renames, since I want to see which paths don’t exist anymore. If you are looking for “lost” content, feel free to pass -M to the script to detect renames and only show truly deleted files.)
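A listing in this format can be approximated with plain git log. The following is a rough sketch, not the git-attic script itself; the function name attic_sketch and the "#" marker used to tell header lines from path lines are choices made for this example.

```shell
# attic_sketch [LOG-OPTIONS]: hypothetical approximation of the git-attic listing.
attic_sketch() {
  # Every commit that deleted something prints a marker line "#DATE HASH",
  # followed (via --name-only) by the paths it removed.
  git log --no-renames --diff-filter=D --name-only --date=short \
          --format='#%ad %h' "$@" |
  awk '/^#/ { head = substr($0, 2); next }   # remember date and hash of the commit
       NF   { print head "^:" $0 }'          # one line per deleted path
}
```

Each output line ends in HASH^:PATH, so it can be pasted to git show in the same way as described above.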

A minimalist, yet powerful zsh prompt

As an avid zsh user, I have been using a simple but powerful shell prompt, which looks like hecate src/zsh%, for years (since 2010-02-11 actually, thanks to homegit, see below) and ridiculed experiments to make the zsh prompt a kitchen sink. However, my Git usage grew and I started occasionally mixing up branches.

Thus I decided to grin and bear it and wondered how to make a minimalist yet useful Git-enhanced prompt. One feature of my prompt was that it only shows the last few segments of the current working directory (usually 2, which is enough for me unless I need to work in some javaesque file labyrinth). One day I decided to integrate the current Git branch into these path segments. Now, my prompt looks like this:

hecate src/zsh@master% cd Doc

… and it actually sticks to the repository root:

hecate zsh@master/Doc% cd Zsh

When the level gets too deep, the branch and repository move to the front:

hecate zsh@master Doc/Zsh%

The depth is still configurable:

hecate zsh@master Doc/Zsh% NDIRS=4
hecate src/zsh@master/Doc/Zsh%

I’ve quite come to like this presentation. Additionally, it also works with detached heads (useful when rebasing):

hecate src/zsh@master/Doc/Zsh% git checkout HEAD~42
...
hecate src/zsh@master~42/Doc/Zsh%

For free, you get some feedback when bisecting:

hecate ~/src/zsh@master% git bisect bad
hecate ~/src/zsh@bisect/bad% git bisect good HEAD~42
hecate ~/src/zsh@bisect/bad~21% git bisect good
hecate ~/src/zsh@bisect/bad~5% git bisect reset
hecate ~/src/zsh@master%

This is the code in all its glory:

# gitpwd - print %~, limited to $NDIRS segments, with inline git branch
NDIRS=2
gitpwd() {
  local -a segs splitprefix; local gitprefix branch
  segs=("${(Oas:/:)${(D)PWD}}")

  if gitprefix=$(git rev-parse --show-prefix 2>/dev/null); then
    splitprefix=("${(s:/:)gitprefix}")
    branch=$(git name-rev --name-only HEAD 2>/dev/null)
    if (( $#splitprefix > NDIRS )); then
      print -n "${segs[$#splitprefix]}@$branch "
    else
      segs[$#splitprefix]+=@$branch
    fi
  fi

  print "${(j:/:)${(@Oa)segs[1,NDIRS]}}"
}

Perhaps it turned out to be a bit more challenging than expected. ;) Integration into the prompt is trivial, however:

function cnprompt6 {
  case "$TERM" in
    xterm*|rxvt*)
      precmd() {  print -Pn "\e]0;%m: %~\a" }
      preexec() { printf "\e]0;$HOST: %s\a" $1 };;
  esac
  setopt PROMPT_SUBST
  PS1='%B%m%(?.. %??)%(1j. %j&.)%b $(gitpwd)%B%(!.%F{red}.%F{yellow})%#${SSH_CLIENT:+%#} %b'
  RPROMPT=''
}

cnprompt6

homegit

For the last five years I have used Git to manage my dotfiles and I use the repository on a plethora of machines.

I found the following zsh alias to be the simplest and best method to use Git for this purpose:

alias homegit="GIT_DIR=~/prj/dotfiles/.git GIT_WORK_TREE=~ git"

Why not a function? Because an alias will make zsh autocomplete homegit just like it completes git already, without any additional work.

Why not a ~/.git? I decided against it because I didn’t want to accidentally commit stuff from any subdirectory and feared a git clean could wipe my sweet home directory.

The homegit approach works very well for me and I have not felt a need for more complex solutions which symlink dotfiles or copy them around.

Note that the git-* scripts presented here can be called transparently from homegit as well, e.g. with homegit attic. And since $GIT_DIR is set in the environment, the scripts can just call git and will just work correctly!
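For completeness, bootstrapping such a setup could look like the sketch below. It runs in a scratch directory standing in for the home directory, so all paths are made up. homegit is written as a function here only because aliases are not expanded in non-interactive scripts, and the status.showUntrackedFiles setting is an assumption of mine, not something the setup described above depends on.

```shell
#!/bin/sh
# Sketch: bootstrap a homegit-style repository in a scratch "home" directory.
set -e
fakehome=$(mktemp -d)                    # stand-in for ~, purely illustrative
mkdir -p "$fakehome/prj/dotfiles"
git init -q "$fakehome/prj/dotfiles"

# Same environment as the alias, spelled out as a function for script use.
homegit() {
  GIT_DIR="$fakehome/prj/dotfiles/.git" GIT_WORK_TREE="$fakehome" git "$@"
}

cd "$fakehome"
echo 'export EDITOR=vi' > .profile
homegit config user.name demo
homegit config user.email demo@example.invalid
# Assumption: keeps "homegit status" from listing everything else in the home directory.
homegit config status.showUntrackedFiles no
homegit add .profile
homegit commit -q -m 'Track .profile'

tracked=$(homegit ls-files)
echo "tracked files: $tracked"
```

Since the repository lives under prj/dotfiles and the work tree is only set through the environment, a plain git in the home directory still sees no repository at all.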

411 commits as of now tell me I perhaps should scale back customizing stuff all the time, but it can be very helpful indeed to see how things changed over time. Also, tracking changes other programs make to your files (and being able to revert them) is totally worth it.

git trail

One of the newest additions to my Git zoo is git trail, a tool I wanted for years, really. With many branches, it’s easy to get confused about what branched off where and what actually is part of this topic branch and whether this topic branch has been merged but then forgotten or…

Perhaps you feel my pain. Perhaps you tried git show-branch once to get an overview of such a mess, but I feel it’s easier to see stereographic projections of a T-Rex in its output than the state of your branches.

Thus I wrote git-trail, which shows how to reach commits in the current branch from other branches. Since we don’t have enough local branches to make it interesting, let’s show remote branches too (-r):

hecate tmp/rack@master% git trail -r
2013-01-04 7e1f081 master
2013-01-04 7e1f081 remotes/origin/HEAD
2013-01-04 1e75faa remotes/origin/hijack~2
2013-01-04 1e75faa remotes/origin/master~1
2012-11-03 1824547 remotes/origin/unstandard_uri_escape~1
2012-03-18 7d7977f remotes/origin/rack-1.4~77
2011-05-22 a50dda5 remotes/origin/rack-1.3~99
2010-06-15 dc6b54e remotes/origin/rack-1.2~38
2010-01-03 e6ebd83 remotes/origin/rack-1.1~23
2009-04-25 d221938 remotes/origin/rack-1.0~24
2009-01-05 7fed4c7 remotes/origin/rack-0.9~15
2008-08-09 e9f9f27 remotes/origin/rack-0.4~6

What you see is the first common commit between every branch and the current branch, together with the commit date. If the branch is listed without suffixes, it is completely included. Else, you effectively see how the branch diverges. For example, in rack-1.4, there have been 77 patches since branching from master. The feature branch hijack consists of two commits. Let’s look at the view from that feature branch:

hecate tmp/rack@master% git trail -r origin/hijack
2013-01-04 8a311fb remotes/origin/hijack
2013-01-04 1e75faa master~1
2013-01-04 1e75faa remotes/origin/HEAD~1
2012-11-03 1824547 remotes/origin/unstandard_uri_escape~1
2012-03-18 7d7977f remotes/origin/rack-1.4~77
2011-05-22 a50dda5 remotes/origin/rack-1.3~99
2010-06-15 dc6b54e remotes/origin/rack-1.2~38
2010-01-03 e6ebd83 remotes/origin/rack-1.1~23
2009-04-25 d221938 remotes/origin/rack-1.0~24
2009-01-05 7fed4c7 remotes/origin/rack-0.9~15
2008-08-09 e9f9f27 remotes/origin/rack-0.4~6

We see that there have been commits to master since hijack was branched, and we should perhaps rebase hijack if we wanted to submit it.

Let’s say we simply merged it into master:

hecate tmp/rack@master% git merge origin/hijack
...
hecate tmp/rack@master% git trail -r
2013-01-06 68de794 master
2013-01-04 8a311fb remotes/origin/hijack
2013-01-04 7e1f081 remotes/origin/HEAD
2012-11-03 1824547 remotes/origin/unstandard_uri_escape~1
2012-03-18 7d7977f remotes/origin/rack-1.4~77
...

Now hijack appears undecorated: it is completely contained in the current branch history.

Let’s say we work on the other feature branch next, unstandard_uri_escape:

hecate tmp/rack@master% git checkout unstandard_uri_escape
hecate tmp/rack@unstandard_uri_escape% git trail
2012-11-03 decaa23 unstandard_uri_escape
2012-11-03 1824547 master~10^2~1

We can now rebase it to make it a proper child of master:

hecate tmp/rack@unstandard_uri_escape% git rebase master
hecate tmp/rack@unstandard_uri_escape% git trail
2013-01-06 92b40fa unstandard_uri_escape
2013-01-06 c30da33 master

And then master can be fast-forwarded:

hecate tmp/rack@unstandard_uri_escape% git checkout master
hecate tmp/rack@master% git trail
2013-01-06 c30da33 master
2013-01-06 c30da33 unstandard_uri_escape~1
hecate tmp/rack@master% git merge unstandard_uri_escape 
Updating c30da33..92b40fa
Fast-forward
...
hecate tmp/rack@master% git trail
2013-01-06 92b40fa master
2013-01-06 92b40fa unstandard_uri_escape

I hope this showed how git trail helps me to keep track of my branches.
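The core of such a listing, the first common commit per branch described relative to that branch, can be approximated with git merge-base and git name-rev. What follows is a sketch of the idea for local branches only; the function name trail_sketch is made up, and the real git-trail may well work differently.

```shell
# trail_sketch: hypothetical approximation of the git-trail idea, local branches only.
trail_sketch() {
  git for-each-ref --format='%(refname)' refs/heads |
  while read -r ref; do
    # First common commit between the current branch and $ref.
    mb=$(git merge-base HEAD "$ref") || continue
    # Describe that commit relative to $ref only, e.g. "topic~2".
    name=$(git name-rev --name-only --refs="$ref" "$mb")
    printf '%s %s\n' "$(git log -1 --date=short --format='%cd %h' "$mb")" "$name"
  done | sort -r
}
```

A branch that is printed without a ~N suffix is fully contained in the current one, just like in the listings above.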

git neck

The perfect match for git-trail is git-neck, which shows commits from the HEAD until the first branching point… that should explain the name.

So, what is the “neck” of our master branch as above?

hecate tmp/rack@master% git neck -r
92b40fa Add a decoder that supports ECMA unicode uris
c30da33 Merge remote-tracking branch 'origin/hijack'
7e1f081 Merge pull request #480 from udzura/master
3edd1e8 Add a rackup option for one-liner rack app server
6d41179 Extract Builder.new_from_string from Builder.parse_file

Likewise, let’s have a look at that remote feature branch sticking around:

% git neck -r origin/unstandard_uri_escape
decaa23 Add a decoder that supports ECMA unicode uris

It was just a single commit. We can also look at the neck of an old release branch:

hecate tmp/rack@master% git neck -r origin/rack-0.4
92f79ea Make Rack::Lint::InputWrapper delegate size method to underlying IO object.
e33cc65 Update to version 0.4
ab9a95e Fix packaging script
1ccdf73 Update README
1b56583 Document REQUEST_METHOD future changes
f0977a8 Disarm and document Content-Length checking in Rack::Lint for 0.4

And we see the 6 commits that are only in rack-0.4.

If you remember the situation before merging the feature branches:

hecate tmp/r2@master% git trail -r
2013-01-04 7e1f081 master
2013-01-04 7e1f081 remotes/origin/HEAD
2013-01-04 1e75faa remotes/origin/hijack~2
2013-01-04 1e75faa remotes/origin/master~1
2012-11-03 1824547 remotes/origin/unstandard_uri_escape~1

Here, the neck is the part until master forked off:

hecate tmp/r2@master% git neck -r
7e1f081 Merge pull request #480 from udzura/master
3edd1e8 Add a rackup option for one-liner rack app server
6d41179 Extract Builder.new_from_string from Builder.parse_file

git neck is most useful if you are working in a feature branch which no other branch forks off, because then the neck goes until where you forked it.

Using git diff without Git

Finally, another small trick: git diff works between any two files (or directories), even if you don’t use Git at all to track them. But you gain some advantages over regular diff, like --word-diff, --color or --stat without having additional tools beyond Git installed.

Also, you can use git diff --binary to generate efficient binary deltas which you can apply again provided you have the unpatched file. (Possibly you need to edit the patch to make both filenames the same, so git apply finds everything.)
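A quick way to try this, as a small sketch with made-up file names in a scratch directory. The --no-index option spells out the "no repository involved" mode explicitly, which also helps when the files happen to sit inside some work tree.

```shell
#!/bin/sh
# Sketch: use git diff on two plain files that are not tracked by Git.
scratch=$(mktemp -d)
cd "$scratch"
printf 'hello world\n' > old.txt
printf 'hello there\n' > new.txt

# Like diff(1), git diff --no-index exits with status 1 when the files differ.
status=0
words=$(git diff --no-index --word-diff old.txt new.txt) || status=$?
printf '%s\n' "$words"

# A diffstat summary of the same comparison.
git diff --no-index --stat old.txt new.txt || true
```

In the word diff, removed words are wrapped in [-...-] and added ones in {+...+}, so the changed line reads hello [-world-]{+there+}.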

NP: Sophie Hunger—What it is

04feb2008 · Introducing gitsum

The major showstopper before I was seriously considering going to Git was the lack of a darcsum-like interface for Git.

Yesterday night I finally decided to write it.

git-status (included as git.el in the Git distribution) is usually good enough to use, but I often like to do partial commits, that is, commit only parts of a file. Git has been able to do that for some time now, using git add --interactive or frontends like git-hunk-commit or git-wt-add. Still, there was no way to do it conveniently in Emacs.

Let me introduce gitsum:

Gitsum screenshot

You can freely delete hunks you don’t want to commit, split big changes, or even edit the patch directly if you feel adventurous. It also integrates into git-status so you can easily switch between these frontends.

Gitsum is hosted at http://github.com/chneukirchen/gitsum (which I highly recommend) and is mirrored at http://git.vuxu.org/; patches and additions are welcome! It’s still very fresh and has some rough corners, but I already notice my increase in productivity.

NP: Twelve Tone Failure—As I Hit the Floor

11apr2006 · My DVCS wishlist

After last week’s intermezzo with Git, my curiosity for distributed version control systems (DVCS) was rekindled. I also imported the Ruby CVS history into Monotone, which has a pretty fast CVS importer, and Mercurial, whose CVS importer (cvs20hg) seemed to be even faster, but unfortunately is not complete yet. However, Mercurial also can import from Git, so I went that way.

My projects will continue to be kept in Darcs for the near future, but so far no DVCS really could convince me. Wondering about which lacked what, I thought it would be useful to write up what I want to have. So far, I tried: Darcs, Git/Cogito, Mercurial and Monotone. I also dabbled in Bazaar (seems to be discontinued), Bazaar-NG, FastCST (seems to be discontinued) and SVK (IMO just a hack).

So, here is my wishlist (roughly ordered in decreasing importance):

Prefer file storage over patch storage, it’s just easier to deal with in practice. It took me a long time to figure this out, but I actually think it’s the more pragmatic solution. I noticed this when I saw how the Git repository just merged with the Gitk repository, even if both didn’t share a single revision. Darcs, on the other hand, even had problems doing merges which were factually the same, but just couldn’t be arranged the right way. The theory of patches sounds nice, but it doesn’t work out.

Note that this doesn’t exclude diff storage, this of course should be done to save disk space and bandwidth.

Provided by: Bazaar-NG (I think), Git/Cogito, Mercurial, Monotone.
Revisions need to be identified by a globally unique identifier, e.g. a SHA1-hash or a GUID.

Provided by: Bazaar (theoretically), Bazaar-NG, Darcs, Git/Cogito, Mercurial, Monotone.
Revision storage should be implemented as write-once files. Once a file has been written, it should not be touched afterwards. This eases incremental backup and generally improves safety. Alternatively, if files are append-only, this is acceptable too. Changing files leaves a bad taste. (It’s okay for index files and other unessential information.)

Provided by: Bazaar, Darcs, Git/Cogito, Mercurial.
File permissions must be saved, at least the executable bit. Also, the VCS shouldn’t touch the contents of the files at all (no newline conversion, no keywords by default).

Provided by: Bazaar, Bazaar-NG, Git/Cogito, Mercurial, Monotone.
Easy setup of repositories: Setting up a new repository needs to be possible with a single command, usually that’s xxx init—it will turn the current directory into a fresh repository (or even import the files of the current directory, as Cogito does).

Provided by: Bazaar-NG, Darcs, Git/Cogito, Mercurial.
Support multiple heads of development in a single repository. This encourages microbranching and eases incremental development without keeping loads of working directories around.

Provided by: Git/Cogito, Mercurial [Added 22apr2006, thanks to Daniel Néri for noticing], Monotone.
It has to be possible to export patches with full metadata (e.g. renames) as ASCII files, e.g. to send via mail or share in other ways. It needs to support binary files, too. (Think of contributing graphics to a game.)

Provided by: Bazaar-NG, Darcs (very good), Git/Cogito (no binary, renames partly), Mercurial (bundles, but they are not ASCII, renames partly), Monotone (packets, good).
It needs to be possible to contribute patches via mail. This is the way most non-regular committers send patches.

Provided by: Bazaar-NG, Darcs, Git/Cogito, Mercurial, Monotone.
Serving repositories over dumb HTTP: This is essential to allow people to easily set up repositories on their cheap webspace. Systems that require CGIs would be acceptable too, here (Mercurial without old-http); opening new ports isn’t. It doesn’t need to be the most efficient way of accessing, but must not be unreasonably inefficient.

Provided by: Bazaar-NG, Bazaar (slow), Darcs, Git/Cogito, Mercurial, Monotone (soon).
I definitely need good Emacs integration, preferably with DVC, alternatively, a good standalone-mode can be enough too.

Provided by: Bazaar (DVC), Bazaar-NG (DVC), Darcs (own, partly DVC), Git/Cogito (own, DVC), Mercurial (own, DVC), Monotone (own).
It needs to provide a GUI repository viewer that can show change history as a tree and diffs for each revision. I’ve found such a tool indispensable since I discovered Gitk, especially if you microbranch a lot.

Provided by: Bazaar-NG, Git/Cogito, Mercurial (hack), Monotone.
It needs a good and fast tool to import CVS trees. I’ve found this absolutely needed to convert legacy repositories and capture the history of older projects locally.

Provided by: Git/Cogito (git-cvsimport, parsecvs), Mercurial (cvs20hg, partly), Monotone (own, very good).
A library to access all features of the VCS from other tools. If a very comprehensive set of commands is available, this will be acceptable too.

Provided by: Bazaar (shell), Bazaar-NG (Python), Darcs (shell, XML), Git/Cogito (shell, very good), Mercurial (Python), Monotone (Lua, shell).

If you find any mistakes or misattribution, please post a comment and I’ll correct it.

Writing a good DVCS is not that hard in theory, but very hard in practice, and not only for technical reasons. Implementing a DVCS is a community effort; I’d even state it’s pointless today to start yet another VCS, unless you are a celebrity that already has a big community behind you (cf. Git).

NP: The Smiths—You Just Haven’t Earned It Yet Baby

04apr2006 · Tracking the Ruby CVS with Git

$ du -h ~/projects/Git/ruby.git/
 29M    /Users/chris/projects/Git/ruby.git/

Amazing, but true: the above directory contains the whole history of the Ruby CVS, from January 1998 until today, in less than 30 megabytes. That’s 9325 commits and about 44332 different file versions.

How is this possible? I used Git, the version control system that was written to keep the Linux source, which is “designed to handle absolutely massive projects with speed and efficiency”. And most of the parts are actually pretty efficient and fast.

Not among them is importing from CVS. Not yet, at least. Git includes a Perl script, git-cvsimport, which essentially works like this: check out each revision from CVS, commit to Git, check out the next revision, commit again, lather, rinse and repeat.

Hopelessly slow, especially if the CVS is remote. So let’s fix that: we make a local CVS mirror first. Luckily, the Ruby CVS supports cvsup, which is essentially like a fast rsync for CVS repositories, but also can be used to mirror complete CVSROOTs. Unluckily, this is not documented at the Ruby CVS page. However, with help from Shugo Maeda, I was able to locally mirror the Ruby CVS. You need a cvsup file like this:

*default base=/Users/chris/mess/current/cstest/sups
*default compress delete use-rel-suffix
*default release=cvs
*default host=cvs.ruby-lang.org

*default prefix=/Users/chris/mess/current/cstest/ruby

# Ruby and other modules
cvs-src

Adjust the paths to your local needs, of course. Then, you need to fetch cvsup. If you are lucky, your distribution will have it packaged, else you need to bootstrap a Modula 3 compiler(!) to compile it. Have fun. *sigh* (The compiler is pretty quick, though.) Anyway, at the end of the day, I had my local CVS mirror—let the experiments start.

git-cvsimport depends on cvsps, a tool to analyze CVSROOTs and figure out the actual revisions. This is needed because CVS is a bunch of clunky shit that has no notion of its commits. After that, an almost endless loop of checkout and commit will start. If you want to try it yourself, get a fast computer, a fast, big disk and an efficient file-system. No, doing it on an iBook with only a few gigs free and HFS+ is not a good idea. Actually, it took four days, and I had to do it stepwise.

There could be a better solution in the future, parsecvs by Keith Packard of X.org fame. It’s in very early alpha stage, and will need even more disk space as of now, but ought to be a lot faster in the future. At least one can hope.

After this, you’ll have a Git-controlled tree full of the actual file revisions; it’s hard to estimate how big it would be. To make the handy file shown above, you need to pack the tree. For this, you run:

git repack -d

This will compute for a few minutes/hours/days and spit out a nice file of about 70 megabytes in size. If you want the handy file above, you either need to figure out how to patch git repack to pass the optimization options --window=50 --depth=50 to git-pack-objects, or call the latter low-level tool directly. This way, you’ll get the handy file. Higher argument values will slow down the process a lot, and only result in packs that are maybe half a megabyte smaller. I tried.
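Git versions released since then accept --window and --depth on git repack itself, so no patching is needed with a current installation. A small sketch on a scratch repository (the directory and file names are made up): -a puts everything into a single pack, -d removes what became redundant, and -f recomputes the deltas instead of reusing existing ones.

```shell
#!/bin/sh
# Sketch: aggressive repacking with a current Git, on a throwaway repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name demo
git config user.email demo@example.invalid

# A handful of revisions of a growing file, so there is something to delta-compress.
for n in 1 2 3 4 5; do
  echo "line $n" >> numbers.txt
  git add numbers.txt
  git commit -q -m "revision $n"
done

git repack -a -d -f -q --window=50 --depth=50

packs=$(git count-objects -v | sed -n 's/^packs: //p')
loose=$(git count-objects -v | sed -n 's/^count: //p')
echo "packs: $packs, loose objects: $loose"
```

On a real import the command is the same, it just takes a lot longer.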

The great thing about git-cvsimport is that it can work incrementally, so once we have the pack, we can update directly from Ruby CVS; the changes are small if you do that regularly. For this, I included a small script in the pack, update-ruby-git:

git-cvsimport -d :pserver:anonymous@cvs.ruby-lang.org:/src \
              -k -u -v -m -p -Z,9 ruby

Run this script regularly to keep your tree recent. You don’t need the CVSROOT or cvsup anymore.

Now, how is all of this useful? Obviously, you enjoy all the benefits Git provides for your daily hacking: atomic actions, distributed development, zero cost (almost!) branches and good merges. Also, you have the nice gitk repository browser that allows you to keep track of recent development. Since you can fetch every file at every revision easily, it’s just a matter of time until someone starts datamining… “what percentage of Ruby is really written by matz”?

You can use git bisect to find bugs in Ruby by marking some revision as good, some as bad, and letting Git figure out which revision to try next to find the faulty patch.
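If the check for "good" or "bad" can be scripted, git bisect run does the whole search on its own. Here is a toy sketch, not related to the Ruby tree: the scratch repository, the file names and the "bug" marker are made up, and the test command simply exits non-zero once the marker is present.

```shell
#!/bin/sh
# Sketch: let git bisect run find the commit that introduced a (fake) bug.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name demo
git config user.email demo@example.invalid

echo fine > state.txt
for n in 1 2 3 4 5 6 7 8; do
  echo "$n" > counter.txt
  if [ "$n" -eq 5 ]; then
    echo bug > state.txt                 # the fifth commit breaks things
  fi
  git add .
  git commit -q -m "commit $n"
  if [ "$n" -eq 5 ]; then
    culprit=$(git rev-parse HEAD)
  fi
done

# Newest commit is bad, the root commit is good.
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)" >/dev/null
# Exit status 0 means good, 1 means bad.
git bisect run sh -c '! grep -q bug state.txt' >/dev/null
found=$(git rev-parse refs/bisect/bad)
git bisect reset >/dev/null 2>&1

echo "first bad commit: $found"
```

With eight commits, Git needs only about three test runs to pin down the culprit.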

And if you really want to use CVS, you even can emulate a CVS server (read and write!), with git-cvsserver. Isn’t that impressive?

I probably will make the pack available on the net, but I haven’t yet found a good way to allow others to efficiently (and incrementally) fetch it… hopefully more about that later.

NP: Meat Puppets—Up on the Sun