No cautious, creative person starts a project nowadays without a back-up strategy. Because data is ephemeral and can be lost easily—through an errant code change or a catastrophic disk crash, say—it is wise to maintain a living archive of all work.
For text and code projects, the back-up strategy typically includes version control, or tracking and managing revisions. Each developer can make several revisions per day, and the ever increasing corpus serves simultaneously as repository, project narrative, communication medium, and team and product management tool. Given its pivotal role, version control is most effective when tailored to the working habits and goals of the project team.
A tool that manages and tracks different versions of software or other content is referred to generically as a version control system (VCS), a source code manager (SCM), a revision control system (RCS), and several other permutations of the words “revision,” “version,” “code,” “content,” “control,” “management,” and “system.” Although the authors and users of each tool might debate esoterics, each system addresses the same issue: develop and maintain a repository of content, provide access to historical editions of each datum, and record all changes in a log. In this book, the term version control system (VCS) is used to refer generically to any form of revision control system.
This book covers Git, a particularly powerful, flexible, and low-overhead version control tool that makes collaborative development a pleasure. Git was invented by Linus Torvalds to support the development of the Linux® kernel, but it has since proven valuable to a wide range of projects.
Often, when there is discord between a tool and a project, the developers simply create a new tool. Indeed, in the world of software, creating new tools can be deceptively easy and inviting. In the face of many existing version control systems, the decision to create another shouldn’t be made casually. However, given a critical need, a bit of insight, and a healthy dose of motivation, forging a new tool can be exactly the right course.
Git, affectionately termed “the information manager from hell” by its creator (Linus is known for both his irascibility and his dry wit), is such a tool. Although the precise circumstances and timing of its genesis are shrouded in political wrangling within the Linux kernel community, there is no doubt that what came from that fire is a well-engineered version control system capable of supporting the worldwide development of software on a large scale.
Prior to Git, the Linux kernel was developed using the commercial BitKeeper VCS, which provided sophisticated operations not available in then-current, free software VCSs such as RCS and the concurrent version system (CVS). However, when the company that owned BitKeeper placed additional restrictions on its “free as in beer” version in the spring of 2005, the Linux community realized that BitKeeper was no longer a viable solution.
Linus looked for alternatives. Eschewing commercial solutions, he studied the free software packages but found the same limitations and flaws that led him to reject them previously. What was wrong with the existing VCSs? What were the elusive missing features or characteristics that Linus wanted and couldn’t find?
There are many facets to “distributed development,” and Linus wanted a new VCS that would cover most of them. It had to allow parallel as well as independent and simultaneous development in private repositories without the need for constant synchronization with a central repository, which could form a development bottleneck. It had to allow multiple developers in multiple locations even if some of them were offline temporarily.
It isn’t enough just to have a distributed development model. Linus knew that thousands of developers contribute to each Linux release. So any new VCS had to handle a very large number of developers whether they were working on the same or different parts of a common project. And the new VCS had to be able to integrate all of their work reliably.
Linus was determined to ensure that a new VCS was fast and efficient. In order to support the sheer volume of update operations that would be made on the Linux kernel alone, he knew that both individual update operations and network transfer operations would have to be very fast. To save space and thus transfer time, compression and “delta” techniques would be needed. Using a distributed model instead of a centralized model also ensured that network latency would not hinder daily development.
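The space savings behind compression and “delta” techniques can be illustrated with a small sketch. It uses Python's standard difflib and zlib modules purely for illustration; this is not Git's actual storage or transfer format, and the 200-line file and the name make_delta are hypothetical.

```python
import difflib
import zlib


def make_delta(old: str, new: str) -> str:
    """Return a unified diff describing how to turn `old` into `new`."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile="old",
            tofile="new",
        )
    )


# Two revisions of a hypothetical 200-line file that differ in a single line.
old_revision = "".join(f"line {i}\n" for i in range(200))
new_revision = old_revision.replace("line 100\n", "line 100 (edited)\n")

# A delta records only what changed between the two revisions.
delta = make_delta(old_revision, new_revision)

# Compression shrinks a full revision without losing any information.
compressed = zlib.compress(new_revision.encode("utf-8"))

print(len(new_revision), len(delta), len(compressed))
```

Both the delta and the compressed copy are smaller than the full revision, which is the property that matters when many revisions must be stored and sent over a network.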
Because Git is a distributed revision control system, it is vital to obtain absolute assurance that data integrity is maintained and is not somehow being altered. How do you know the data hasn’t been altered in transit from one developer to the next? Or from one repository to the next? Or, for that matter, that the data in a Git repository is even what it purports to be?
Git uses a common cryptographic hash function, the Secure Hash Algorithm (SHA1), to name and identify objects within its database. Though perhaps not absolute, in practice it has proven to be solid enough to ensure integrity and trust for all of Git’s distributed repositories.
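As a minimal sketch of how a hash can serve as a name, the following Python function computes the SHA1 identifier that Git assigns to a file's contents (a “blob”). It assumes the blob layout used by Git, in which a short header of the form "blob <size>" and a NUL byte precede the data; the function name git_blob_id is hypothetical.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA1 hex digest Git would use to name `data` as a blob."""
    # The hash covers a small header plus the raw content, not the file name.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


print(git_blob_id(b""))
print(git_blob_id(b"hello world\n"))
```

The identifier depends only on the bytes themselves, so any change to the content, however small, yields a different name. For an empty file this should match what `git hash-object` reports, e69de29bb2d1d6434b8b29ae775ad8c2e48c5391.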
One of the key aspects of a version control system is knowing who changed files and, if at all possible, why. Git enforces a change log on every commit that changes a file. The information stored in that change log is left up to the developer, project requirements, management, convention, and so on. Git ensures that changes will not happen mysteriously to files under version control because there is an accountability trail for all changes.
Git’s repository database contains data objects that are immutable. That is, once they have been created and placed in the database, they cannot be modified. They can be recreated differently, of course, but the original data cannot be altered without consequences. The design of the Git database means that the entire history stored within the version control database is also immutable. Using immutable objects has several advantages, including quick comparison for equality.
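A toy store can show why immutable, content-named objects make equality checks cheap. This is an illustration only, not Git's implementation: the class ObjectStore and its methods are hypothetical, and for brevity it hashes the raw bytes rather than Git's real object encoding.

```python
import hashlib
from typing import Dict


class ObjectStore:
    """Hypothetical in-memory store whose objects are named by their content."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store `data` and return its identifier."""
        object_id = hashlib.sha1(data).hexdigest()
        # setdefault never overwrites: an existing object is left untouched.
        self._objects.setdefault(object_id, bytes(data))
        return object_id

    def get(self, object_id: str) -> bytes:
        """Return the content stored under `object_id`."""
        return self._objects[object_id]


store = ObjectStore()
first = store.put(b"int main(void) { return 0; }\n")
second = store.put(b"int main(void) { return 0; }\n")
changed = store.put(b"int main(void) { return 1; }\n")

# Equal content yields equal identifiers, so comparing two short strings
# stands in for comparing the full contents.
print(first == second, first == changed)
```

“Modifying” an object in such a store really means adding a new object under a new identifier; the original remains exactly as it was.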
With atomic transactions, a number of different but related changes are performed either all together or not at all. This property ensures that the version control database is not left in a partially changed or corrupted state while an update or commit is happening. Git implements atomic transactions by recording complete, discrete repository states that cannot be broken down into individual or smaller state changes.
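The all-or-nothing behavior can be sketched with a toy in-memory repository. It is a simplified illustration under stated assumptions, not a description of Git's internals: the class TinyRepo, its head attribute, and the snapshot format are all hypothetical. New objects are prepared in a scratch area, and the repository's visible state changes only in the final step.

```python
import hashlib
from typing import Dict, Optional


class TinyRepo:
    """Hypothetical repository used only to illustrate atomic commits."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.head: Optional[str] = None  # identifier of the current snapshot

    def commit(self, files: Dict[str, bytes]) -> str:
        """Record a complete snapshot of `files`, or change nothing at all."""
        # Stage everything first. Nothing visible has changed yet.
        staged: Dict[str, bytes] = {}
        listing = []
        for path in sorted(files):
            data = files[path]
            if not isinstance(data, bytes):
                raise TypeError(f"{path}: content must be bytes")
            blob_id = hashlib.sha1(data).hexdigest()
            staged[blob_id] = data
            listing.append(f"{blob_id} {path}")
        snapshot = "\n".join(listing).encode("utf-8")
        snapshot_id = hashlib.sha1(snapshot).hexdigest()
        staged[snapshot_id] = snapshot

        # Publish: add the new objects, then move head in a single assignment.
        self.objects.update(staged)
        self.head = snapshot_id
        return snapshot_id


repo = TinyRepo()
good = repo.commit({"README": b"hello\n"})

try:
    # The second file is invalid, so the whole commit must be rejected.
    repo.commit({"README": b"changed\n", "notes.txt": "not bytes"})
except TypeError:
    pass

print(repo.head == good)
```

Because the failed commit never reaches the publishing step, the repository still points at the last complete snapshot and contains no half-written state.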
Almost all VCSs can name different genealogies of development within a single project. For instance, one sequence of code changes could be called “development” while another is referred to as “test.” Each version control system can also split a single line of development into multiple lines and then unify, or merge, the disparate threads. As with most VCSs, Git calls a line of development a branch and assigns each branch a name.
Along with branching comes merging. Just as Linus wanted easy branching to foster alternate lines of development, he also wanted to facilitate easy merging of those branches. Because branch merging has often been a painful and difficult operation in version control systems, it would be essential to support clean, fast, easy merging.
So that individual developers needn’t query a centralized repository server for historical revision information, it was essential that each repository have a complete copy of all historical revisions of every file.
Even though end users might not be concerned about a clean internal design, it was important to Linus and ultimately to other Git developers as well. Git’s object model has simple structures that capture fundamental concepts for raw data, directory structure, recording changes, and so forth. Coupling the object model with a globally unique identifier technique allowed a very clean data model that could be managed in a distributed development environment.
The complete history of VCSs is beyond the scope of this book. However, there are several landmark, innovative systems that set the stage for or directly led to the development of Git. (This section is selective, hoping to record when new features were introduced or became popular within the free software community.)
The Source Code Control System (SCCS) was one of the original systems on Unix® and was developed by M. J. Rochkind in the very early 1970s. [“The Source Code Control System,” IEEE Transactions on Software Engineering 1(4) (1975): 364-370.] This is arguably the first VCS available on any Unix system.
The central store that SCCS provided was called a repository, and that fundamental concept remains pertinent to this day. SCCS also provided a simple locking model to serialize development. If a developer needed files to run and test a program, he or she would check them out unlocked. However, in order to edit a file, he or she had to check it out with a lock (a convention enforced through the Unix file system). When finished, he or she would check the file back into the repository and unlock it.
The Revision Control System (RCS) was introduced by Walter F. Tichy in the early 1980s. [“RCS: A System for Version Control,” Software Practice and Experience 15(7) (1985): 637-654.] RCS introduced both forward and reverse delta concepts for the efficient storage of different file revisions.
The Concurrent Version System (CVS), designed and originally implemented by Dick Grune in 1986 and then crafted anew some four years later by Berliner and colleagues, extended and modified the RCS model with great success. CVS became very popular and was the de facto standard within the open source (http://www.opensource.org) community for many years. CVS provided several advances over RCS, including distributed development and repository-wide change sets for entire “modules.”
Furthermore, CVS introduced a new paradigm for the lock. Whereas earlier systems required a developer to lock each file before changing it and thus forced one developer to wait for another in serial fashion, CVS gave each developer write permission in his or her private working copy. Thus, changes by different developers could be merged automatically by CVS unless two developers tried to change the same line. In that case, the conflict was flagged and the developers were left to work out the solution. The new rules for the lock allowed different developers to write code concurrently.
As often occurs, perceived shortcomings and faults in CVS eventually led to a new VCS. Subversion (SVN), introduced in 2001, quickly became popular within the free software community. Unlike CVS, SVN committed changes atomically and had significantly better support for branches.
BitKeeper and Mercurial were radical departures from all the aforementioned solutions. Each eliminated the central repository; instead, the store was distributed, providing each developer with his or her own shareable copy. Git is derived from this peer-to-peer model.
Finally, Mercurial and Monotone contrived a hash fingerprint to uniquely identify a file’s content. The name assigned to the file is a moniker and a convenient handle for the user and nothing more. Git features this notion as well. Internally, the Git identifier is based on the file’s contents, a concept known as a content-addressable file store. The concept is not new. [See “The Venti Filesystem,” (Plan 9), Bell Labs, http://www.usenix.org/events/fast02/quinlan/quinlan_html/index.html.] Git immediately borrowed the idea from Monotone, according to Linus. Mercurial was implementing the concept simultaneously with Git.
Git became self-hosted on April 7, 2005, with this commit:
commit e83c5163316f89bfbde7d9ab23ca2e25604af29
Author: Linus Torvalds <firstname.lastname@example.org>
Date:   Thu Apr 7 15:13:13 2005 -0700

    Initial revision of "git", the information manager from hell
Shortly thereafter, the first Linux commit was made:
commit 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
Author: Linus Torvalds <email@example.com>
Date:   Sat Apr 16 15:20:36 2005 -0700

    Linux-2.6.12-rc2

    Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!
That one commit introduced the bulk of the entire Linux kernel into a Git repository. It consisted of:
17291 files changed, 6718755 insertions(+), 0 deletions(-)
Yes, that’s an introduction of 6.7 million lines of code!
It was just three minutes later when the first patch using Git was applied to the kernel. Convinced that it was working, Linus announced it on April 20, 2005, to the Linux Kernel Mailing List.
Knowing full well that he wanted to return to the task of developing the kernel, Linus handed the maintenance of the Git source code to Junio Hamano on July 25, 2005, announcing that “Junio was the obvious choice.”
Linus himself rationalizes the name “Git” by claiming “I’m an egotistical bastard, and I name all my projects after myself. First Linux, now git.” Granted, the name “Linux” for the kernel was sort of a hybrid of Linus and Minix. The irony of using a British term for a silly or worthless person was not missed, either.
Since then, others have suggested some alternative and perhaps more palatable interpretations: the Global Information Tracker seems to be the most popular.
 Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
 UNIX is a registered trademark of The Open Group in the United States and other countries.
 Private email.