leah blogs: Tracking the Ruby CVS with Git

$ du -h ~/projects/Git/ruby.git/
 29M    /Users/chris/projects/Git/ruby.git/

Amazing, but true: the directory above contains the whole history of the Ruby CVS, from January 1998 until today, in less than 30 megabytes. That’s 9325 commits and about 44332 different file versions.
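
Numbers like these can be checked on any repository with a couple of plumbing commands. This is a minimal sketch, not necessarily how the figures above were obtained: repo_stats is a made-up helper name, it assumes a Git recent enough to support cat-file --batch-all-objects, and counting blobs is only an approximation of "different file versions".

```shell
# Sketch: count commits and stored file versions (blobs) in a repository.
# repo_stats is a hypothetical helper name, not part of Git.
repo_stats() {
    repo=${1:-.}   # path to the repository, defaults to the current directory
    # Every commit reachable from any ref.
    printf 'commits: %s\n' "$(git -C "$repo" rev-list --all --count)"
    # Every blob in the object database, packed or loose.
    printf 'blobs: %s\n' "$(git -C "$repo" cat-file --batch-all-objects --batch-check | grep -c ' blob ')"
}
```

Calling repo_stats with the path to the imported repository prints both counts, one per line.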

How is this possible? I used Git, the version control system that was written to keep the Linux source, which is “designed to handle absolutely massive projects with speed and efficiency”. And most of the parts are actually pretty efficient and fast.

Not among them is importing from CVS. Not yet, at least. Git includes a Perl script, git-cvsimport, which essentially works like this: check out each revision from CVS, commit it to Git, check out the next revision, commit again; lather, rinse, repeat.

Hopelessly slow, especially if the CVS is remote. So let’s fix that and make a local CVS mirror first. Luckily, the Ruby CVS supports cvsup, which is essentially a fast rsync for CVS repositories and can also mirror complete CVSROOTs. Unluckily, this is not documented on the Ruby CVS page. However, with help from Shugo Maeda, I was able to mirror the Ruby CVS locally. You need a cvsup file like this:

*default base=/Users/chris/mess/current/cstest/sups
*default compress delete use-rel-suffix
*default release=cvs
*default host=cvs.ruby-lang.org

*default prefix=/Users/chris/mess/current/cstest/ruby

# Ruby and other modules
cvs-src

Adjust the paths to your local needs, of course. Then you need to get cvsup. If you are lucky, your distribution will have it packaged; otherwise you need to bootstrap a Modula-3 compiler(!) to compile it. Have fun. *sigh* (The compiler is pretty quick, though.) Anyway, at the end of the day, I had my local CVS mirror. Let the experiments start.

git-cvsimport depends on cvsps, a tool that analyzes CVSROOTs and figures out the actual commits. This is needed because CVS is a bunch of clunky shit that has no concept of its own commits. After that, an almost endless loop of checkout and commit will start. If you want to try it yourself, get a fast computer, a fast, big disk and an efficient file system. No, doing it on an iBook with only a few gigs free and HFS+ is not a good idea. Actually, it took four days, and I had to do it stepwise.

There could be a better solution in the future: parsecvs by Keith Packard of X.org fame. It’s in a very early alpha stage and needs even more disk space as of now, but it ought to be a lot faster in the future. At least one can hope.

After this, you’ll have a Git-controlled tree full of the actual file revisions; it’s hard to estimate how big it will be. To make the handy file shown above, you need to pack the tree. For this, you run:

git repack -d

This will compute for a few minutes/hours/days and spit out a nice file of about 70 megabytes. If you want the handy file above, you either need to figure out how to patch git repack to pass the optimization options --window=50 --depth=50 to git-pack-objects, or call the latter low-level tool directly. Higher argument values slow down the process a lot and only yield packs that are maybe half a megabyte smaller. I tried.
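
Later Git releases let git repack take --window and --depth itself, so no patching is needed there. The following is a minimal sketch under that assumption; repack_tight is a made-up helper name, and the defaults of 50 are simply the values mentioned above.

```shell
# Sketch: repack a repository into a single pack with a wider delta search.
# Assumes a Git whose `git repack` accepts --window and --depth.
# repack_tight is a hypothetical helper name, not part of Git.
repack_tight() {
    repo=${1:-.}      # path to the repository
    window=${2:-50}   # how many candidate objects to compare for each delta
    depth=${3:-50}    # maximum length of a delta chain
    # -a: put all objects into one pack
    # -d: delete packs made redundant by the new one
    # -f: recompute deltas instead of reusing existing ones
    git -C "$repo" repack -a -d -f --window="$window" --depth="$depth"
}
```

Running du -h on the repository before and after calling repack_tight shows how much the settings buy you.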

The great thing about git-cvsimport is that it can work incrementally, so once we have the pack, we can update directly from Ruby CVS; the changes are small if you do that regularly. For this, I included a small script in the pack, update-ruby-git:

git-cvsimport -d :pserver:anonymous@cvs.ruby-lang.org:/src \
              -k -u -v -m -p -Z,9 ruby

Run this script regularly to keep your tree up to date. You don’t need the local CVSROOT or cvsup anymore.

Now, how is all of this useful? Obviously, you enjoy all the benefits Git provides for your daily hacking: atomic actions, distributed development, (almost!) zero-cost branches and good merges. Also, you have the nice gitk repository browser that allows you to keep track of recent development. Since you can fetch every file at every revision easily, it’s just a matter of time until someone starts datamining… “what percentage of Ruby is really written by matz?”
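
As a first, crude stab at that kind of datamining, commits can be counted per author. This is only a sketch: commits_by_author is a made-up helper name, commit counts are a rough proxy for "written by", and in a tree imported from CVS the author names will presumably be CVS account names.

```shell
# Sketch: list authors by number of commits, most active first.
# commits_by_author is a hypothetical helper name, not part of Git.
commits_by_author() {
    repo=${1:-.}   # path to the repository
    # -s: print only the count per author, -n: sort by that count.
    # HEAD is named explicitly because git shortlog reads a log from
    # standard input when no revision is given and stdin is not a terminal.
    git -C "$repo" shortlog -s -n HEAD
}
```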

You can use git bisect to find bugs in Ruby by marking some revision as good and some as bad, and letting Git figure out which revision to try next until you find the faulty patch.

And if you really want to use CVS, you can even emulate a CVS server (read and write!) with git-cvsserver. Isn’t that impressive?

I probably will make the pack available on the net, but I haven’t yet found a good way to allow others to efficiently (and incrementally) fetch it… hopefully more about that later.

NP: Meat Puppets—Up on the Sun

04apr2006 · Tracking the Ruby CVS with Git