
Distributed Source Control Management systems (especially Git)

We recently did an internal knowledge meeting about Distributed Revision Control at work. Rather than post my rough presentation on SlideShare, I figured I'd rather pull together my notes into a blog post.

In our presentation, we were supposed to go head-to-head with Git vs Mercurial (as these were our DSCM finalists). Most of my notes are pro-Git, although I'm equally fine with adopting Mercurial.

So, what's the deal with distributed?
Basically, instead of having a single repository on a central server, everyone has their own repository.

These repositories are equivalent, because internally they share the same identifier keys (SHAs) and history.

Since all bits and pieces are globally identified in relation to each other, you can bring together changes from two separate repositories pretty easily.

Why would anyone want to distribute their repos?
You want to let people work without ruining the main line.

Consider feature branches.

Remember that feature branch we did that time? It was about three weeks of development, and then *bam* merge hell!

With DSCM, after developing a feature for weeks, you continuously pull in latest main development to your feature branch and fix any conflicts as soon as they arise. Then later when you want to apply your new feature back to the main line: Schmock, done. No conflicts.

We do suffer with branching and merging. Merging costs us thousands of € every month (we maintain a development branch and a stable branch). We're basically manually maintaining many versions in parallel. This is expensive.

If you are distributed, branching/merging has to be easy, because everyone effectively has a branch in their local copy. Every "update" is actually a merge from one repository to another.

When you've cloned a repository, you have the whole history inside (compressed, this is just a fraction of the size of the working directory).

Doing a diff takes milliseconds (as opposed to over a minute for our main product in Subversion).

Open source vs companies
Many of the distributed advantages appear in a large network, where you have hierarchies of trust and delegation. People who know each other merge upwards. This works really well; the Linux kernel merges 22,000 files a day (I think Linus said that in his talk, see resources below).

If you get merge conflicts at the top, you simply reject the incoming patches, tell the submitters which repository they have to comply with, let them do the work, and then you can pull in without any conflicts.

If you're in a tightly controlled corporate environment, centralized can work well. But centralized systems have problems with distributed groups, which gets awkward when network access is limited.

In a company you have the feeling that everyone should be committing into the main repository. With distributed SCMs, you fake this by having one repository act as the central one.

A history of version control (for those who are interested)
  • CVS 
    • Dates back to 1986 (the Netherlands, shell scripts, replacing RCS from 1982)
  • Subversion (started in 2000)
    • Was supposed to solve the problems CVS had: no renaming and non-atomic commits
    • But did it really? Still centralized, and a rename is a copy+delete
    • No merge tracking or history-aware merging
    • Merging is a PITA; renames and moves are not supported
    • Dominated the open source/company world from ~2005
  • There was VSS (Microsoft) and ClearCase (IBM), but these sucked. And Perforce (used at Google).
At the same time, something was happening on the distributed side..
  • Gnu-Arch (the original)
    • Started in 2001 by Tom Lord (hence the command tla)
    • Semi-distributed (you could cherry-pick commits from other repos, but there was still one central repository)
    • But: Slow, hard-to-use, funky conventions
    • Declared dead/deprecated 2009
  • Darcs (Built in Haskell to defeat SVN and CVS)
    • Started as an addition of "theory-of-patches" to Gnu-Arch
    • Around 2002, David Roundy
    • My feeling: popular and good; besides, Haskell is cool and fast
    • Unfortunately, there are some algebra bugs in there, e.g. in recursive merging
    • They've had a few major bugs for a long while, but these are being resolved
  • Bazaar (built by Canonical for open source hackers)
    • Its predecessor, baz, was based on Gnu Arch
    • The current Python-powered version started as a prototype in 2004
    • Used by Ubuntu, MySQL, Emacs
    • It's easy and good, but a bit slow and inefficient
  • BitKeeper (proprietary, took the best ideas from DSCM, 1997)
    • Comes from Sun's TeamWare, which comes from SCCS (a Bell Labs original)
    • "Doubles software development productivity"
    • Sold to big businesses
    • Controversially: was free for, and used by, Linux kernel development since 2002
    • (Before this they used tarballs and patches)
    • In 2005, they wanted out of the Linux kernel arrangement (fear of competitors developing from it)
    • Linus started Git
    • Matt Mackall announced Mercurial just a few days after
  • Others: SVK, D-CVS, Monotone (Linus dropped Monotone because the author was on holiday)
(Most of the above is hauled in from wikipedia)

Above: Timeline illustration of roughly when the revision systems were created. Distributed systems above the orange bar, centralized below.
About Git

Nicknamed the fast version control system. Traditionally had very poor windows support, but this has been resolved.

Git has a big presence in open source communities.
Git Concepts:
  • Blob: Chunk of binary data, generally files
  • Tree: Directories
  • Commit: A pointer to a tree at a certain point in time, plus author, message, and parent commit(s)
  • Tag: Pointer to a commit
Git does not track files. It tracks content.

All objects have a SHA key, which is used to identify them across repositories. Powerful concept.
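To make this concrete, here's a minimal sketch (assuming Git is installed; the file names are made up) showing that the object SHA is determined purely by content:

```shell
# Throwaway repo; identical content always yields the same object SHA.
cd "$(mktemp -d)"
git init -q
printf 'hello\n' > a.txt
printf 'hello\n' > b.txt
git hash-object a.txt   # prints the blob's SHA-1
git hash-object b.txt   # same SHA: content, not filename, defines the blob
```

Because the SHA is content-derived, two repositories that contain the same file automatically agree on its identity, with no coordination needed.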

Git does not work on deltas like SVN does. Each time you commit, it stores a snapshot of what all the files in the tree look like. Changes are calculated by diffing the states of two commits.

A blob is defined by its data. Two identical pieces of data will share the same blob, referenced by two trees.

All these objects are stored inside the .git directory in a project. When you jump between branches, Git rewrites your working directory from these objects.

When you commit, you commit the changes that are in the index, a kind of staging area, not directly from the working directory. git add adds files to the index; git commit -a does both in one go.
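A small sketch of the index in action (file names and messages are made up; assumes Git is installed):

```shell
# Demo repo: the index sits between the working directory and history.
cd "$(mktemp -d)"
git init -q
git config user.email demo@example.com
git config user.name Demo
echo one > file.txt
git add file.txt                 # stage the file into the index
git commit -q -m "first commit"  # commits what is in the index
echo two >> file.txt             # change the working copy only
git status --short               # " M file.txt": modified but unstaged
git commit -q -a -m "second"     # -a stages tracked changes, then commits
```

After the last command the working directory, index, and HEAD all agree again.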

Branches are easy. Create one with git branch [branchname], switch to it with git checkout [branchname], and delete it with git branch -d [branchname] (this will enforce that you merge back before deleting, unless you use -D).
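Put together, the whole lifecycle of a branch can look like this (a sketch with made-up names; the initial branch is called master or main depending on your Git version):

```shell
cd "$(mktemp -d)"
git init -q
git config user.email demo@example.com
git config user.name Demo
git commit -q --allow-empty -m "initial"
start=$(git symbolic-ref --short HEAD)   # master or main
git branch feature                       # create
git checkout -q feature                  # switch
git commit -q --allow-empty -m "feature work"
git checkout -q "$start"                 # back to the main line
git merge -q feature                     # merge the feature in
git branch -d feature                    # -d is safe: the branch is merged
```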

Stashing is a nice way to put aside your current work so you can temporarily switch to another branch to do some hot fixes.
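A quick sketch of the stash round-trip (made-up file names; assumes Git is installed):

```shell
cd "$(mktemp -d)"
git init -q
git config user.email demo@example.com
git config user.name Demo
echo base > file.txt
git add file.txt
git commit -q -m "base"
echo wip >> file.txt      # half-done work in progress
git stash                 # working directory is clean again
# ...switch branches, do the hot fix, switch back...
git stash pop             # the half-done work is restored
grep wip file.txt
```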

git merge [branchname]. Pretty straight forward.

Conflicts get conflict markers like those in Subversion, but Git won't let you commit until you have resolved all conflicts (you have to run git add on the resolved files).

If all goes to hell, run

git reset --hard HEAD

If you want the changes from another repository, you pull them in with git pull.
You can also do this in two steps: git fetch (which updates your remote-tracking branches) and git merge.
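Here's a sketch of the two-step variant, using two local repositories to stand in for "remote" and "local" (names are made up):

```shell
cd "$(mktemp -d)"
git init -q upstream
cd upstream
git config user.email demo@example.com
git config user.name Demo
git commit -q --allow-empty -m "initial"
cd ..
git clone -q upstream work
cd upstream
git commit -q --allow-empty -m "upstream change"
cd ../work
branch=$(git symbolic-ref --short HEAD)  # master or main
git fetch -q                   # only updates remote-tracking branches
git merge -q "origin/$branch"  # brings the change into our branch
```

git pull is exactly this fetch+merge, rolled into one command.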

When you want to pretend you have a central repo, users use git push to push in their changes. If you have a patch you want to serve back to the authors of an open source project, you give them a pointer to your repo and ask them to pull. They then pull, try it out locally, and then push to their own repo.

Git has some very powerful logging commands that you can use to search through the history of the repo, shaping the output to your need. Simply git log is usually ok for looking at the last couple of commits, but here's a fancier example:

git log --pretty=format:"%h by %an - %ar" --author=Ferris

There are two kinds of tags: light-weight ones, which are just an alias for a commit (SHA), and tag objects, which can have an author, a message, etc. The latter is useful for signing with GPG keys and so on.
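The difference is easy to see with git cat-file (a sketch with made-up tag names):

```shell
cd "$(mktemp -d)"
git init -q
git config user.email demo@example.com
git config user.name Demo
git commit -q --allow-empty -m "release candidate"
git tag v1.0-rc                     # lightweight: just an alias for the commit
git tag -a v1.0 -m "First release"  # annotated: a real tag object with a message
git cat-file -t v1.0-rc             # prints "commit": the ref points straight at it
git cat-file -t v1.0                # prints "tag": a separate tag object
```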

Rewriting history is where Git really starts to shine. It lets you screw around with your commits as much as you want to fix them up, merge, edit, etc. before pushing up.

It lets you work in your personal style. I used to say "work on one thing at a time, keep in sync with svn, keep it clean". Lots of people find this disturbs their flow. Git allows you to keep coding and VCS separate. If you wanna make a mess, you can clean it up later.

This relates to the "tangled workspace" problem described in the blog post above. You want an SCM that manages revisions for you, not one that makes them a hassle.

Workflow, conventions

PRO: You can adopt any kind of workflow, so it's possible to make it fit for us.

This article shows some *great* examples of branches and workflow:

See also Git commit policies:

Weaknesses of Git

Some ideas on Git's weaknesses are collected elsewhere, although most of the pain points seem to be outdated.

"hg st" and "hg ci" rules over git status og git commit -a (fix with aliases)

$ cat ~/.gitconfig
[alias]
        ci = commit -a
        co = checkout
        st = status

It's complex. There are many top-level commands that you have to navigate around.

Git has poor Eclipse support.

Above: My twitter buddies give their input on how the Eclipse Git plugin is working.

In other words, if we move to Git now, we have to get used to using the command line (yes, we're spoiled with Subversion support in Eclipse).

Mercurial has a great Eclipse plugin though.

Git is slow on windows (see the GStreamer mail discussion referenced below, or the benchmarks).

Pain points about Mercurial, or pro-Git
Linus says they have the same design, but Git does it better.

The hgEclipse Synchronize view is useless to commit from! You have to refresh a lot. (Performance is supposedly fixed in the recent version of the plugin, around the end of May 2010.)

The Mercurial documentation actually points out the differences in a pretty fair and honest way:

Git has had a longer time in the field than Mercurial.

Git is written in C, while hg is written in python. Which is faster? :P

Git "has more legs". It's easier to do something fancy with using a couple of different git-commands, than to start coding Python.

Some performance benchmarks here:

In Git, you can modify previous commits (amend). You can do this with hg too, but you need to understand queues and enable a plugin.
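In Git it's a one-liner; here's a sketch (made-up file and messages) showing that amending rewrites the last commit instead of adding a new one:

```shell
cd "$(mktemp -d)"
git init -q
git config user.email demo@example.com
git config user.name Demo
echo one > file.txt
git add file.txt
git commit -q -m "Add file"
echo two >> file.txt
git add file.txt
git commit -q --amend -m "Add file, both lines"  # rewrites the previous commit
git rev-list --count HEAD                        # prints 1: history was amended
```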

Pain point about Mercurial: every time you want to do something fancy, you have to enable a plugin (not just you; every developer will have to do this on their machine).

Comparing Git and Mercurial:
If you read this far, here's the bottom line: Git and Mercurial have been around for five years now, and are ready for the mainstream. Going distributed will shake up and change your whole way of dealing with iterations, releasing, testing and developing (for the better). Start understanding them now, and start using them for your internal code.

I might repeat the core message of this in a future blog post, but for now, please read Joel Spolsky's reasoning in his Mercurial how-to.

Disclaimer: These notes were lashed together creating an internal presentation. I cannot vouch for their correctness. Please double-check the facts before going into an argument armed with anything from this post. Sorry that I haven't done proper source-referral.
