Version Control/Source Code Management

I imagine many readers are currently thinking that the battle over version control must surely be over by now, and that all developers are using some system. This is, unfortunately, demonstrably untrue. Let me start with an anecdote. It's 2004, and I've just started working as a systems manager in a university computing lab. My job is partly to maintain the computers in the lab, partly to teach programming and numerical computing to physics undergraduates, and partly to write software that will assist in said teaching. As part of this work, I started using version control, both for my source code and for some of the configuration files in /etc on the servers. A more experienced colleague saw me doing this and told me that I was just generating work for myself; that this wasn't necessary for the small things I was maintaining.

Move on now to 2010, and I'm working at a big scientific facility in the UK. Using software and a lot of computers, we've got a process that used to take an entire PhD to complete down to somewhere between one and eight hours. I'm on the software team and, yes, we're using version control to track changes to the software and to understand which version is released. Well, kind of, anyway. The "core" source code is in version control, but one of its main features is to provide a scripting environment and DSL in which scientists at the "lab benches," if you will, can write scripts that automate their experiments. These scripts are not (necessarily) version controlled. Worse, the source code is deployed to the experimental stations, so someone who discovers a bug in the core can fix it locally without the change ever being tracked in version control.

So, a group does an experiment at this facility, and produces some interesting results. You try to replicate this later, and you get different results. It could be software-related, right? All you need to do is to use the same software as the original group used… Unfortunately, you can't. It's vanished.

That's an example of how scientists failing to use the tools from software development could be compromising their science. There's a lot of snake oil in the software field, both from people wanting you to use their tools/methodologies because you'll pay them for it, and from people who have decided that "their" way of working is correct and that any other way is incorrect. You need to be able to cut through all of that nonsense to find out how particular tools and techniques impact the actual work you're trying to do. Philosophy of science currently places a high value on reproducibility and auditing. Version control supports that, so it would be beneficial for programmers working in science to use version control. But they aren't; not consistently, anyway.

In its simplest guise – the one that I was using in 2004 – version control is a big undo stack. Only, unlike a series of undo and redo commands, you can leave messages explaining who made each change and why. Even if you're working on your own, this is a great facility to have – if you try something that gets confusing or doesn't work out, you can easily roll back to a working version and take things from there.
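To make that concrete, here's what the "undo stack" usage looks like at a command line. I'll use Git purely as an illustration – any modern system has equivalents – and the commit identifier is a placeholder:

    git log --oneline           # read the stack of changes and their messages
    git revert a1b2c3d          # undo one change, recording the undo as a new entry
    git checkout a1b2c3d -- .   # or roll the working tree back to a known-good version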

Once you're more familiar with the capabilities of a version control system, it can become a powerful tool for configuration management. Work on different features and bugfixes for the same product can proceed in parallel, with each piece of work being integrated into one or more releases of the product when it's ready. Discussing this workflow in detail is more than I'm going to attempt here: I recommend the Pragmatic Programmer books on version control, such as Pragmatic Version Control Using Git (http://pragprog.com/book/tsgit/pragmatic-version-control-using-git) by Travis Swicegood.
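The general shape of that workflow can still be sketched, though. Assuming Git, with made-up branch and release names:

    git checkout -b fix-memory-leak master       # a bugfix starts from the mainline...
    git checkout -b feature-csv-export master    # ...in parallel with new feature work
    # when each piece is ready, integrate it into whichever releases need it:
    git checkout release-2.1
    git merge fix-memory-leak
    git checkout master
    git merge fix-memory-leak
    git merge feature-csv-export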

On Version Control and Collaboration

Version control is no more of a collaboration tool than other document management systems, such as SharePoint. Integrating (or merging) related work by different people is hard and requires knowledge of the meaning of the code and of how changes interact. Version control systems don't have that knowledge, and as a result cannot simplify the merging process in any but the most trivial cases. They do let you defer the problem until you want to face it, but that's about it.

Some tools – for example, GitHub (http://www.github.com) – provide social features around a core version control system. However, the problems of knowing what to integrate from whom, and when, and resolving conflicts all still exist. The social features give you somewhere to talk about those problems.

Distributed Version Control

I've used a good few version control systems over the years, from simple tools that work with the local filesystem to hugely expensive commercial products. My favored way of working now is with a Distributed Version Control System, or DVCS (though, as promised earlier, I'm not going to suggest that you choose a particular one; with the exception of darcs, they all work in much the same way).

With a DVCS, it's very easy to get a local project into version control, so even toy projects and prototypes can be versioned. A feature that makes them great for this, over earlier systems that version local files, such as RCS (Revision Control System) and SCCS (Source Code Control System), is that the whole repository (that is, all of the files that comprise the versioned project) is treated atomically. In other words, the repository can be at one version or another, but never in an in-between state where some files are at an earlier revision than others.

Earlier systems, like RCS, do not impose this restriction. With RCS, every file is versioned independently, so each can be checked out at a different version. While this is more flexible, it does introduce certain problems. For example, consider the files in the following figure. One of the files contains a function that's used by code in the other two. You need to change the function's signature to add a new parameter, which means changing all three files.

Figure 4.1: A dependency that crosses multiple files

In an atomic version control system, all of the files can be checked out at the revision with one parameter, or all of them at the revision with two parameters. A per-file versioning system will allow any combination of revisions to be checked out, despite the fact that only two of the eight possible combinations make sense.
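To make the contrast concrete, suppose the two states are tagged r1 (one parameter) and r2 (two parameters) in an atomic system – Git here, and the tag names are hypothetical:

    git checkout r1   # all three files together, at the one-parameter signature
    git checkout r2   # all three files together, at the two-parameter signature
    # No command produces the function at r2 with a caller still at r1 –
    # exactly the half-updated state that per-file systems like RCS permit.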

Once you've got a project that's locally versioned in a DVCS repository, sharing it with others is simple and can be done in numerous ways. If you want to back up or share the repository on a hosted service like BitBucket (http://www.bitbucket.org), you set that up as a remote repository and push your content. A collaborator can then clone the repository from the remote version and start working on the code. If they're on the same network as you, then you can just share the folder containing the repository without setting up a remote service.
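As a sketch of that whole round trip – Git again, and the URLs and paths are made up:

    git init                          # put the local project under version control
    git add .
    git commit -m "Initial import"
    git remote add origin https://bitbucket.org/alice/project.git
    git push -u origin master         # back the repository up to the hosted service
    # a collaborator clones from the hosted copy:
    git clone https://bitbucket.org/alice/project.git
    # or, on the same network, straight from your shared folder:
    git clone /shares/alice/project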

Personal Experience

In some situations, a combination of these approaches is required. The DVCS tools that I've used all support that. On one recent project, everything was hosted on a remote service but there were hundreds of megabytes of assets stored in the repository. It made sense for the computers in the office to not only clone the remote repository, but also to peer with each other to reduce the time and bandwidth used when the assets changed. The situation looked like the following figure.

Figure 4.2: A DVCS configuration can break out of the "star" topology required by centralized systems

Doing this with a centralized version control system would've been possible, but ugly. One of the developers would've needed to fully synchronize their working copy with the server, then fully copy the repository and its metadata to all of the other developer systems. This is less efficient than just copying the differences between the repositories. Some centralized version control systems wouldn't even support that way of working, because they track which files they think you have checked out on the server.
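With a DVCS, that peering needs no special support: a colleague's clone is just another remote. A sketch, with hypothetical machine names and URLs:

    git remote add origin https://example.com/team/project.git         # the hosted service
    git remote add carol ssh://carol-desktop.local/home/carol/project  # an office peer
    git fetch carol             # large asset changes arrive over the local network...
    git merge carol/master      # ...and merge like any other work
    git pull origin master      # the hosted repository stays the shared reference point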

Another benefit brought by DVCSes – as much due to improved algorithms as to their distributed nature – is the ease with which you can create and destroy branches. When I mainly worked with centralized version control (primarily Subversion and Perforce), branches were created for particular tasks, such as new releases, and the teams I worked on invented workflows for deciding when code migrated from one branch to another.

With DVCSes, I often create a branch every hour or so. If I want to start some new work, I create a branch in my local version of the repository. After a while, I'm either done, and the branch gets merged and deleted; convinced that the idea was wrong - in which case, it's just deleted; or I want someone else to have a look, and I push that branch without merging it. All of this was possible with centralized VCS, though much slower – and you needed network access to even create the branch.
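In Git terms – one example among several, with a made-up branch name – that hourly cycle looks like this:

    git checkout -b try-faster-parser   # start the new work on its own branch
    # ...some time later, one of three things happens:
    git checkout master && git merge try-faster-parser   # 1. done: merge it back...
    git branch -d try-faster-parser                      #    ...and delete the branch
    git branch -D try-faster-parser     # 2. wrong idea: delete it, unmerged
    git push origin try-faster-parser   # 3. push it, unmerged, for someone else to look at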