Look at the world through your Polaroid glasses
Things’ll look a whole lot better for the working classes.
—Gang of Four, “I Found that Essence Rare”
This chapter is about revision control systems (RCSes), which maintain snapshots of the many different versions of a project as it develops, such as the stages in the development of a book, a tortured love letter, or a program. A revision control system gives us three major powers:
Our filesystem now has a time dimension, so we can see what a file looked like last week and how it changed from then to now. Even without the other powers, this alone makes me a more confident writer.
We can keep track of multiple versions of a project, such as my copy and my coauthor’s copy. Even within my own work, I may want one version of a project (a branch) with an experimental feature, which should be kept segregated from the stable version that needs to be able to run without surprises.
Further, the website http://github.com has about 45,000 projects that self-report as being primarily in C as of this writing, plus other smaller hosts for RCS repositories, such as the GNU’s Savannah. Even if you aren’t going to modify the code, cloning these repositories is a quick way to get the program or library onto your hard drive for your own use. When your own project is ready for public use (or before then), you can make the repository public as another means of distribution.
Now that you and I both have versions of the same project, and both have equal ability to hack our versions of the code base, revision control gives us the power to merge together our multiple threads as easily as possible.
This chapter will cover Git, which is a distributed revision control system, meaning that any given copy of the project works as a standalone repository of the project and its history. There are others, with Mercurial and Bazaar the other front-runners in the category. There is largely a one-to-one mapping among the features of these systems, and what major differences had existed have merged over the years, so you should be able to pick the others up immediately after reading this chapter.
The most rudimentary means of revision control is via
patch, which are POSIX-standard and therefore
most certainly on your system. You probably have two files on your drive
somewhere that are reasonably similar, and if not, grab any text file,
change a few lines, and save the modified version with a new name.
and you will get a listing, a little more machine-readable than
human-readable, that shows the lines that have changed between the two
files. Piping output to a text file via
diffs and then opening
diffs in your text editor may give you a
colorized version that is easier to follow. You will see some lines giving
the name of the file and location within the file, perhaps a few lines of
context that did not change between the two files, and lines beginning
- showing the lines that got added and removed.
diff with the
-u flag to get a few lines of context around the
Given two directories holding two versions of your project,
a single diff file in the unified diff format for the entire directories
via the recursive (
to apply all of your changes to her copy of
Or, if you have no friends, you can run
time to time on your own code and thus keep a record of the changes you
have made over time. If you find that you have inserted a bug in your
code, the diffs are the first place to look for hints about what you
touched that you shouldn’t have. If that isn’t enough, and you already
v1, you could run the patch in reverse
patch -R <
diff-v1v2, reverting version 2 back to version
1. If you were at version 4, you could even conceivably run a whole
sequence of diffs to move further back in time:
v4patch -R <
diff-v3v4patch -R <
diff-v2v3patch -R <
I say conceivably because maintaining a sequence of diffs like this is tedious and error-prone, to say the least.
Thus, the revision control system, which will make and track the diffs for you.
Git is a C program like any other, and is based on a small set of objects. The key object is the commit object, which is a set of diffs, very much in the style of the unified diffs we saw earlier. Given a previous commit object and some changes from that baseline, a new commit object encapsulates the information. It gets some support from the index, a list of the changes registered since the last commit object, the primary use of which will be in generating the next commit object.
The commit objects link together to form a tree much like any other
tree. Each commit object will have (at least) one parent commit object.
Stepping up and down the tree is akin to using
-R to step among versions.
The repository itself is not formally a single object in the Git source code, but I think of it as an object, because the usual operations one would define, such as new, copy, and free, apply to the entire repository. Get a new repository in the directory you are working in via:
OK, you now have a revision control system in place. You might not
see it, because Git stores all its files in a directory named
.git, where the dot means that all the usual
ls will take it to be
hidden. You can look for it via, e.g.,
-a or via a show hidden files option in
your favorite file manager.
git clone git://github.com/gitster/git.git
If you want to test something on a repository in
~/myrepo and are worried that you might break
something, go to a temp directory (say
~/tmp; cd ~/tmp), clone your repository with
and experiment away. Deleting the clone when done (
has no effect on the original.
Given that all the data about a repository is in the .git subdirectory of your project directory, the analog to freeing a repository is simple:
rm -rf .git
Having the whole repository so self-contained means that you can make spare copies to shunt between home and work, copy everything to a temp directory for a quick experiment, and so on, without much hassle.
We’re almost ready to generate some commit objects, but because they
summarize diffs since the starting point or a prior commit, we’re going to
have to have on hand some diffs to commit. The index (Git source:
struct index_state) is a list of changes that
are to be bundled into the next commit. It exists because we don’t
actually want every change in the project directory to be recorded. For
example, gnomes.c and gnomes.h
will beget gnomes.o and the executable
gnomes. Your RCS should track
gnomes.c and gnomes.h and let
the others get regenerated as needed. So the key operation with the index
is adding elements to its list of changes. Use:
to add these files to the index. Other typical changes to the list of files tracked also need to be added to the index:
Changes you made to files that are already tracked by Git are not
automatically added to the index, which might be a surprise to users of
other RCSes (but see below). Add each individually via
git add -u
to add to the index changes to all the files Git already tracks.
At some point you have enough changes listed in the index that they should be recorded as a commit object in the repository. Generate a new commit object via:
git commit -a -m "
here is an initial commit."
-m flag attaches a message to
the revision, which you’ll read when you run
log later on. If you omit the message, then Git will start the
text editor specified in the environment variable
EDITOR so you can enter it (the default editor
is typically vi; export that variable in your shell’s startup script,
e.g., .bashrc or .zshrc, if you
want something different).
-a flag tells git that there
are pretty good odds that I forgot to run
-u, so please run it just before committing. In practice, this
means that you never have to run
-u explicitly, as long as you always remember the
-a flag in
It is easy to find Git experts who are concerned with generating a
coherent, clean narrative from their commits. Instead of commit messages
like “added an index object, plus some bug fixes along the way,” an
expert Git author would create two commits, one with the message “added
an index object” and one with “bug fixes.” Our authors have such control
because nothing is added to the index by default, so they can add only
enough to express one precise change in the code, write the index to a
commit object, then add a new set of items to a clean index to generate
the next commit object. I found one blogger who took several pages to
describe his commit routine: “For the most complicated cases, I will
print out the diffs, read them over, and mark them up in six colors of
highlighter…” However, until you become a Git expert, this will be much
more control over the index than you really need or want. That is, not
git commit is an advanced use that many people
never bother with. In a perfect world, the
-a would be the default, but it isn’t, so
don’t forget it.
git commit -a writes a
new commit object to the repository based on all of the changes the index
was able to track, and clears the index. Having saved your work, you can
now continue to add more. Further—and this is the real, major benefit of
revision control so far—you can delete whatever you want, confident that
it could be recovered if you need it back. Don’t clutter up the code with
large blocks of commented-out obsolete routines—delete!
Having generated a commit object, your interactions with it will
mostly consist of looking at its contents. You’ll use
git diff to see the diffs that are the core of
the commit object and
git log to see
The key metadata is the name of the object, which is assigned via an
unpleasant but sensible naming convention: the SHA1 hash, a 40-digit
hexadecimal number that can be assigned to an object, in a manner that
lets us assume that no two objects will have the same hash, and that the
same object will have the same name in every copy of the repository. When
you commit your files, you’ll see the first few digits of the hash on the
screen, and you can run
git log to see
the list of commit objects in the history of the current commit object,
listed by their hash and the human language message you wrote when you did
the commit (and see
git help log for
the other available metadata). Fortunately, you need only as much of the
hash as will uniquely identify your commit. So if you look at the log and
decide that you want to check out revision number
fe9c49cddac5150dc974de1f7248a1c5e3b33e89, you can do so with:
git checkout fe9c4
This does the sort of time-travel via diffs that
patch almost provided, rewinding to the state of
the project at commit
Because a given commit only has pointers to its parents, not its
children, when you check
git log after
checking out an old commit, you will see the trace of objects that led up
to this commit, but not later commits. The rarely used
git reflog will show you the full list of commit
objects the repository knows about, but the easier means of jumping back
to the most current version of the project is via a
tag, a human-friendly name that you won’t have to
look up in the log. Tags are maintained as separate objects in the
repository and hold a pointer to a commit object being tagged. The most
frequently used tag is
refers to the last commit object on the master branch (which, because we
haven’t covered branching yet, is probably the only branch you have).
Thus, to return from back in time to the latest state, use:
git checkout master
Getting back to
git diff, it
shows what changes you have saved since the last committed revision. The
output is what would be written to the next commit object via
-a. As with the output from the plain
git diff >
will write to a file that may be more legible in your colorized text
git diff shows
the diff between the last commit and what is in the project directory.
With one commit object name,
shows the sequence of changes between that commit and what is in the
project directory. With two names, it shows the sequence of changes from
one commit to the other:
git diff Show the diffs between the working directory and the last commit. git diff 234e2a Show the diffs between the working directory and the given commit object. git diff 234e2a 8b90ac Show the changes from one commit object to another.
At this point, you know how to:
Save frequent incremental revisions of your project.
Get a log of your committed revisions.
Find out what you changed or added recently.
Check out earlier versions so that you can recover earlier work if needed.
Having a backup system organized enough that you can delete code with confidence and recover as needed will already make you a better writer.
It doesn’t take long working with Git to discover that it doesn’t like doing anything when there are uncommitted changes in the current working directory. It typically asks you to commit your work, and then perform the operation that you had intended. This makes sense, because otherwise, your unsaved work might be lost. You need to store it somewhere so you can get it back after you go spelunking in your archive.
Thus we employ the stash, a special commit
object, mostly equivalent to what you would get from
git commit -a, but with a few special
features, such as retaining all the untracked junk in your working
directory. Here is the typical procedure:
git stash # Code is now as it was at last checkin. git checkout
fe9c4# Look around here. git checkout
master# Or whatever commit you had started with # Code is now as it was at last checkin, so replay stashed diffs with: git stash pop
Another sometimes-appropriate alternative for checking out given
changes in your working directory is
--hard, which takes the working directory back to the state it
was in when you last checked out. The command sounds severe because it
is: you are about to throw away all work you had done since the last
There is one tree in a repository, which got generated when the
first author of a new repository ran
init. You are probably familiar with tree data structures,
consisting of a set of nodes, where each node has links to some number of
children and a link to a parent (and in exotic trees like Git’s, possibly
Indeed, all commit objects but the initial one have a parent, and
the object records the diffs between itself and the parent commit. The
terminal node in the sequence, the tip of the branch, is tagged with a
branch name. For our purposes, there is a one-to-one correspondence
between branch tips and the series of diffs that led to that branch. The
one-to-one correspondence means we can interchangeably refer to branches
and the commit object at the tip of the branch. Thus, if the tip of the
master branch is commit
234a3d are entirely equivalent (until a new commit gets
written, and that takes on the
label). It also means that list of commit objects on a branch can be
rederived at any time by starting at the commit at the named tip and
tracing back to the origin of the tree.
The typical custom is to keep the master branch fully functional at all times. When you want to add a new feature or try a new thread of inquiry, create a new branch for it. When the branch is fully functioning, you will be able to merge the new feature back into the master using the methods to follow.
new_leaf# Create a new branch... git checkout
new_leaf# then check out the branch you just created. #or execute both steps at once with the equivalent: git checkout -b
Having created the new branch, switch between the tips of the two
git checkout master and
What branch are you on right now? Find out with:
which will list all branches and put a
* by the one that is currently active.
What would happen if you were to build a time machine, go back to
before you were born, and kill your parents? If we learned anything from
science fiction, it’s that if we change history, the present doesn’t
change, but a new alternate history splinters off. So if you check out an
old version, make changes, and check in a new commit object with your
newly made changes, then you now have a new branch distinct from the
master branch. You will find via
branch that when the past forks like this, you will be on
(no branch). Untagged branches tend to
create problems, so if ever you find that you are on
(no branch), then run
git branch -m
new_branch_name to name the branch to which
you’ve just splintered.
So far, we have generated new commit objects by starting with a
commit object as a starting point and applying a list of diffs from the
index. A branch is also a series of diffs, so given an arbitrary commit
object and a list of diffs from a branch, we should be able to create a
new commit object in which the branch’s diffs are applied to the
existing commit object. This is a merge. To merge
all the changes that occurred over the course of
newleaf back into
master, switch to
master and use
git checkout master git merge
For example, you have used a branch off of
master to develop a new feature, and it
finally passes all tests; then applying all of the diffs from the
development branch to
create a new commit object with the new feature soundly in place.
Let us say that, while working on the new feature, you never
master and so made no
changes to it. Then applying the sequence of diffs from the other branch
would simply be a fast replay of all of the changes recorded in each
commit object in the branch, which in Git terminology is a
But if you made any changes to
master, then this is no longer a simple
question of a fast application of all of the diffs. For example, say
that at the point where the branch split off,
short int height_inches;
master, you removed the
The purpose of
newleaf was to convert
short int height_cm;
At this point, Git is stymied. Knowing how to combine these lines requires knowing what you as a human intended. Git’s solution is to modify your text file to include both versions, something like:
<<<<<<< HEAD int height_inches; ======= short int height_cm; >>>>>>> 3c3c3c
The merge is put on hold, waiting for you to edit the file to express the change you would like to see. In this case, you would probably reduce the five-line chunk Git left in the text file to:
In all likelihood, get told that there are conflicts you have to resolve.
Check the list of unmerged files using
Pick a file to manually check on. Open it in a text editor and find the merge-me marks if it is a content conflict. If it’s a filename or file position conflict, move the file into place.
Repeat steps 3−5 until all unmerged files are checked in.
git commit to finalize
Take comfort in all this manual work. Git is conservative in merging and won’t automatically do anything that could, under some storyline, cause you to lose work.
When you are done with the merge, all of the relevant diffs that occurred in the side branch are represented in the final commit object of the merged-to branch, so the custom is to delete the side branch:
other_branch tag is deleted, but
the commit objects that led up to it are still in the repository for
Say you have a main branch and split off a testing branch from it on Monday. Then on Tuesday through Thursday, you make extensive changes to both the main and testing branch. On Friday, when you try to merge the test branch back into the main, you have an overwhelming number of little conflicts to resolve.
Let’s start the week over. You split the testing branch off from
the main branch on Monday, meaning that the last commits on both
branches share a common ancestor of Monday’s commit on the main branch.
On Tuesday, you have a new commit on the main branch; let it be commit
abcd123. At the end of the day, you replay all the
diffs that occurred on the main branch onto the testing branch:
git branch testing # get on the testing branch git rebase abcd123 # or equivalently: git rebase main
rebase command, all of the changes
made on the main branch since the common ancestor are replayed on the
testing branch. You might need to manually merge things, but by only
having one day’s work to merge, we can hope that the task of merging is
Now that all changes up to
abcd123 are present
in both branches, it is as if the branches had actually split off from
that commit, rather than Monday’s commit. This is where the name of the
procedure comes from: the testing branch has been rebased to split off
from a new point on the main branch.
You also perform rebases at the end of Wednesday, Thursday, and Friday, and each of them is reasonably painless, as the testing branch kept up with the changes on the main branch throughout the week.
Rebases are often cast as an advanced use of Git, because other systems that aren’t as capable with diff-application don’t have this technique. But in practice rebasing and merging are about on equal footing: both apply diffs from another branch to produce a commit, and the only question is whether you are tying together the ends of two branches (in which case, merge) or want both branches to continue their separate lives for a while longer (in which case, rebase). The typical usage is to rebase the diffs from the master into the side branch, and merge the diffs from the side branch into the master, so there is a symmetry between the two in practice. And as noted, letting diffs pile up on multiple branches can make the final merge a pain, so it is good form to rebase reasonably often.
Everything to this point has been occurring within one tree. If you cloned a repository from elsewhere, then at the moment of cloning, you and the origin both have identical trees with identical commit objects. However, you and your colleagues will continue working, so you will all be adding new and different commit objects.
Your repository has a list of remotes, which
are pointers to other repositories related to this one elsewhere in the
world. If you got your repository via
clone, then the repository from which you cloned is named
origin as far as the new repository is
concerned. In the typical case, this is the only remote you will ever
When you first clone and run
branch, you’ll see one lonely branch, regardless of how many
branches the origin repository had. But run
branch -a to see all the branches that Git knows about, and you
will see those in the remote as well as the local ones. If you cloned a
repository from Github, et al, you can use this to check whether other
authors had pushed other branches to the central repository.
Those copies of the branches in your local repository are as of the
first time you pulled. Next week, to update those remote branches with the
information from the origin repository, run
Now that you have up-to-date remote branches in your repository, you
could merge using the full name of the branch, for example,
git merge remotes/origin/master.
As a shortcut to
git fetch; git merge
git pull origin master
to fetch the remote changes and merge them into your current repository all at once.
The converse is
you’ll use to update the repository with your last commit (not the state
of your index or working directory). If you are working on a branch named
bbranch and want to push to the remote with the same
git push origin bbranch
There are good odds that when you push your changes, applying the diffs from your branch to the remote branch will not be a fast-forward (if it is, then your colleagues haven’t been doing any work). Resolving a non-fast-forward merge typically requires human intervention, and there is probably not a human at the remote. Thus, Git will allow only fast-forward pushes. How can you guarantee that your push is a fast forward?
git pull origin to get
the changes made since your last pull.
Merge as seen earlier, wherein you as a human resolve those changes a computer cannot.
git commit -a -m
git push origin master,
because now Git only has to apply a single diff, which can be done
The structure of a Git repository is not especially complex: there
are commit objects representing the changes since the parent commit
object, organized into a tree, with an index gathering together the
changes to be made in the next commit. But with these elements, you can
organize multiple versions of your work, confidently delete things, create
experimental branches and merge them back to the main thread when they
pass all their tests, and merge your colleagues’ work with your own. From
git help and your favorite
Internet search engine will teach you a whole lot of ways to do these
things more smoothly, as well as many more tricks.