Version Control with Git, 2nd Edition

Cover of Version Control with Git, 2nd Edition by Jon Loeliger... Published by O'Reilly Media, Inc.

Chapter 1. Introduction

Background

No cautious, creative person starts a project nowadays without a back-up strategy. Because data is ephemeral and can be lost easily—through an errant code change or a catastrophic disk crash, say—it is wise to maintain a living archive of all work.

For text and code projects, the back-up strategy typically includes version control, or tracking and managing revisions. Each developer can make several revisions per day, and the ever-increasing corpus serves simultaneously as repository, project narrative, communication medium, and team and product management tool. Given its pivotal role, version control is most effective when tailored to the working habits and goals of the project team.

A tool that manages and tracks different versions of software or other content is referred to generically as a version control system (VCS), a source code manager (SCM), a revision control system (RCS), and several other permutations of the words revision, version, code, content, control, management, and system. Although the authors and users of each tool might debate esoterics, each system addresses the same issue: develop and maintain a repository of content, provide access to historical editions of each datum, and record all changes in a log. In this book, the term version control system (VCS) is used to refer generically to any form of revision control system.

This book covers Git, a particularly powerful, flexible, and low-overhead version control tool that makes collaborative development a pleasure. Git was invented by Linus Torvalds to support the development of the Linux®[1] kernel, but it has since proven valuable to a wide range of projects.

The Birth of Git

Often, when there is discord between a tool and a project, the developers simply create a new tool. Indeed, in the world of software, creating a new tool can seem deceptively easy and inviting. In the face of many existing version control systems, the decision to create another shouldn’t be made casually. However, given a critical need, a bit of insight, and a healthy dose of motivation, forging a new tool can be exactly the right course.

Git, affectionately termed “the information manager from hell” by its creator (Linus is known for both his irascibility and his dry wit), is such a tool. Although the precise circumstances and timing of its genesis are shrouded in political wrangling within the Linux kernel community, there is no doubt that what came from that fire is a well-engineered version control system capable of supporting the worldwide development of software on a large scale.

Prior to Git, the Linux kernel was developed using the commercial BitKeeper VCS, which provided sophisticated operations not available in then-current, free software VCSs such as RCS and the Concurrent Versions System (CVS). However, when the company that owned BitKeeper placed additional restrictions on its “free as in beer” version in the spring of 2005, the Linux community realized that BitKeeper was no longer a viable solution.

Linus looked for alternatives. Eschewing commercial solutions, he studied the free software packages but found the same limitations and flaws that led him to reject them previously. What was wrong with the existing VCSs? What were the elusive missing features or characteristics that Linus wanted and couldn’t find?

Facilitate Distributed Development

There are many facets to distributed development, and Linus wanted a new VCS that would cover most of them. It had to allow parallel as well as independent and simultaneous development in private repositories without the need for constant synchronization with a central repository, which could form a development bottleneck. It had to allow multiple developers in multiple locations even if some of them were offline temporarily.

Scale to Handle Thousands of Developers

It isn’t enough just to have a distributed development model. Linus knew that thousands of developers contribute to each Linux release. So any new VCS had to handle a very large number of developers whether they were working on the same or different parts of a common project. And the new VCS had to be able to integrate all of their work reliably.

Perform Quickly and Efficiently

Linus was determined to ensure that a new VCS was fast and efficient. In order to support the sheer volume of update operations that would be made on the Linux kernel alone, he knew that both individual update operations and network transfer operations would have to be very fast. To save space and thus transfer time, compression and delta techniques would be needed. Using a distributed model instead of a centralized model also ensured that network latency would not hinder daily development.

Maintain Integrity and Trust

Because Git is a distributed revision control system, it is vital to obtain absolute assurance that data integrity is maintained and that the data is not somehow being altered. How do you know the data hasn’t been altered in transit from one developer to the next? Or from one repository to the next? Or, for that matter, that the data in a Git repository is even what it purports to be?

Git uses a common cryptographic hash function, called Secure Hash Algorithm (SHA1), to name and identify objects within its database. Though perhaps not absolute, in practice it has proven to be solid enough to ensure integrity and trust for all Git’s distributed repositories.
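
As a rough illustration of this content-derived naming, the following Python sketch computes the identifier Git assigns to a file's contents (a blob): the SHA1 of a short header, `blob <size>` followed by a NUL byte, and then the raw bytes. The function name `git_blob_id` is invented for this example and is not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the 40-hex-digit SHA1 name Git gives a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    original = b"hello world\n"
    tampered = b"hello w0rld\n"
    print(git_blob_id(original))
    # Any change to the content, however small, yields a different name.
    print(git_blob_id(original) == git_blob_id(tampered))  # False
```

For content stored in a file, the result should match what `git hash-object <file>` reports, which is one way to check the sketch against a real Git installation.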

Enforce Accountability

One of the key aspects of a version control system is knowing who changed files and, if at all possible, why. Git enforces a change log on every commit that changes a file. The information stored in that change log is left up to the developer, project requirements, management, convention, and so on. Git ensures that changes will not happen mysteriously to files under version control because there is an accountability trail for all changes.

Immutability

Git’s repository database contains data objects that are immutable. That is, once they have been created and placed in the database, they cannot be modified. They can be recreated differently, of course, but the original data cannot be altered without consequences. The design of the Git database means that the entire history stored within the version control database is also immutable. Using immutable objects has several advantages, including quick comparison for equality.

Atomic Transactions

With atomic transactions, a number of different but related changes are performed either all together or not at all. This property ensures that the version control database is not left in a partially changed or corrupted state while an update or commit is happening. Git implements atomic transactions by recording complete, discrete repository states that cannot be broken down into individual or smaller state changes.

Support and Encourage Branched Development

Almost all VCSs can name different genealogies of development within a single project. For instance, one sequence of code changes could be called “development” while another is referred to as “test.” Each version control system can also split a single line of development into multiple lines and then unify, or merge, the disparate threads. As with most VCSs, Git calls a line of development a branch and assigns each branch a name.

Along with branching comes merging. Just as Linus wanted easy branching to foster alternate lines of development, he also wanted to facilitate easy merging of those branches. Because branch merging has often been a painful and difficult operation in version control systems, it would be essential to support clean, fast, easy merging.

Complete Repositories

So that individual developers needn’t query a centralized repository server for historical revision information, it was essential that each repository have a complete copy of all historical revisions of every file.

A Clean Internal Design

Even though end users might not be concerned about a clean internal design, it was important to Linus and ultimately to other Git developers as well. Git’s object model has simple structures that capture fundamental concepts for raw data, directory structure, recording changes, and so forth. Coupling the object model with a globally unique identifier technique allowed a very clean data model that could be managed in a distributed development environment.

Be Free, as in Freedom

’Nuff said.

Given a clean slate to create a new VCS, many talented software engineers collaborated and Git was born. Necessity was the mother of invention again!

Precedents

The complete history of VCSs is beyond the scope of this book. However, there are several landmark, innovative systems that set the stage for or directly led to the development of Git. (This section is selective, hoping to record when new features were introduced or became popular within the free software community.)

The Source Code Control System (SCCS) was one of the original systems on Unix®[2] and was developed by M. J. Rochkind in the very early 1970s. [The Source Code Control System, IEEE Transactions on Software Engineering 1(4) (1975): 364-370.] This is arguably the first VCS available on any Unix system.

The central store that SCCS provided was called a repository, and that fundamental concept remains pertinent to this day. SCCS also provided a simple locking model to serialize development. If a developer needed files to run and test a program, he or she would check them out unlocked. However, in order to edit a file, he or she had to check it out with a lock (a convention enforced through the Unix file system). When finished, he or she would check the file back into the repository and unlock it.

The Revision Control System (RCS) was introduced by Walter F. Tichy in the early 1980s. [RCS: A System for Version Control, Software Practice and Experience 15(7) (1985): 637-654.] RCS introduced both forward and reverse delta concepts for the efficient storage of different file revisions.

The Concurrent Versions System (CVS), designed and originally implemented by Dick Grune in 1986 and then crafted anew some four years later by Berliner and colleagues, extended and modified the RCS model with great success. CVS became very popular and was the de facto standard within the open source (http://www.opensource.org) community for many years. CVS provided several advances over RCS, including distributed development and repository-wide change sets for entire modules.

Furthermore, CVS introduced a new paradigm for the lock. Whereas earlier systems required a developer to lock each file before changing it and thus forced one developer to wait for another in serial fashion, CVS gave each developer write permission in his or her private working copy. Thus, changes by different developers could be merged automatically by CVS unless two developers tried to change the same line. In that case, the conflict was flagged and the developers were left to work out the solution. The new rules for the lock allowed different developers to write code concurrently.
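
The merge rule described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm CVS actually uses: it assumes all three versions of the file have the same number of lines, and the function name `merge_lines` is hypothetical.

```python
from typing import List, Tuple


def merge_lines(
    base: List[str], mine: List[str], theirs: List[str]
) -> Tuple[List[str], List[int]]:
    """Merge two edited copies of `base`, line by line.

    Returns the merged lines and the indexes of conflicting lines.
    Simplifying assumption: all three versions have the same length.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles versions of equal length")
    merged: List[str] = []
    conflicts: List[int] = []
    for i, (b, m, t) in enumerate(zip(base, mine, theirs)):
        if m == t:        # both sides agree (same change, or untouched)
            merged.append(m)
        elif m == b:      # only the other developer changed this line
            merged.append(t)
        elif t == b:      # only this developer changed this line
            merged.append(m)
        else:             # both changed the same line differently
            merged.append(b)
            conflicts.append(i)
    return merged, conflicts


if __name__ == "__main__":
    base = ["a", "b", "c"]
    # Changes to different lines merge automatically.
    print(merge_lines(base, ["A", "b", "c"], ["a", "b", "C"]))
    # Changes to the same line are flagged for the developers to resolve.
    print(merge_lines(base, ["x", "b", "c"], ["y", "b", "c"]))
```

Real merge tools first align the versions with a diff so that inserted and deleted lines are handled as well; the fixed-length comparison here only shows the decision made for each line.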

As often occurs, perceived shortcomings and faults in CVS eventually led to a new VCS. Subversion (SVN), introduced in 2001, quickly became popular within the free software community. Unlike CVS, SVN committed changes atomically and had significantly better support for branches.

BitKeeper and Mercurial were radical departures from all the aforementioned solutions. Each eliminated the central repository; instead, the store was distributed, providing each developer with his or her own shareable copy. Git is derived from this peer-to-peer model.

Finally, Mercurial and Monotone contrived a hash fingerprint to uniquely identify a file’s content. The name assigned to the file is a moniker and a convenient handle for the user and nothing more. Git features this notion as well. Internally, the Git identifier is based on the file’s contents, a concept known as a content-addressable file store. The concept is not new. [See The Venti Filesystem, (Plan 9), Bell Labs, http://www.usenix.org/events/fast02/quinlan/quinlan_html/index.html.] Git immediately borrowed the idea from Monotone, according to Linus.[3] Mercurial was implementing the concept simultaneously with Git.
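
To make the idea of a content-addressable store concrete, here is a toy Python sketch. It is not how Git lays out its database: the class `ContentStore` and the `names` mapping are invented for illustration, and the key is a plain SHA1 of the bytes with no Git-specific framing.

```python
import hashlib
from typing import Dict


class ContentStore:
    """A toy content-addressable store: objects are keyed by a hash of their bytes."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        """Store the content and return its identifier."""
        object_id = hashlib.sha1(content).hexdigest()
        # Storing identical content again reuses the same key.
        self._objects[object_id] = content
        return object_id

    def get(self, object_id: str) -> bytes:
        """Look up content by its identifier."""
        return self._objects[object_id]

    def __len__(self) -> int:
        return len(self._objects)


if __name__ == "__main__":
    store = ContentStore()
    # File names are kept separately, as handles that point at object IDs.
    names = {
        "README": store.put(b"same text\n"),
        "docs/intro.txt": store.put(b"same text\n"),
    }
    print(names["README"] == names["docs/intro.txt"])  # True
    print(len(store))  # 1
```

Because the identifier depends only on the bytes, two files with identical contents share one stored object no matter what they are called, which is the sense in which the file name is only a handle.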

Timeline

With the stage set, a bit of external impetus, and a dire VCS crisis imminent, Git sprang to life in April 2005.

Git became self-hosted on April 7 with this commit:

    commit e83c5163316f89bfbde7d9ab23ca2e25604af290
    Author: Linus Torvalds <torvalds@ppc970.osdl.org>
    Date:   Thu Apr 7 15:13:13 2005 -0700

        Initial revision of "git", the information manager from hell

Shortly thereafter, the first Linux commit was made:

    commit 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
    Author: Linus Torvalds <torvalds@ppc970.osdl.org>
    Date:   Sat Apr 16 15:20:36 2005 -0700

        Linux-2.6.12-rc2

        Initial git repository build. I'm not bothering with the full history,
        even though we have it. We can create a separate "historical" git
        archive of that later if we want to, and in the meantime it's about
        3.2GB when imported into git - space that would just make the early
        git days unnecessarily complicated, when we don't have a lot of good
        infrastructure for it.

        Let it rip!

That one commit introduced the bulk of the entire Linux Kernel into a Git repository.[4] It consisted of

     17291 files changed, 6718755 insertions(+), 0 deletions(-)

Yes, that’s an introduction of 6.7 million lines of code!

It was just three minutes later when the first patch using Git was applied to the kernel. Convinced that it was working, Linus announced it on April 20, 2005, to the Linux Kernel Mailing List.

Knowing full well that he wanted to return to the task of developing the kernel, Linus handed the maintenance of the Git source code to Junio Hamano on July 25, 2005, announcing that Junio was the obvious choice.

About two months after that first Linux commit, Version 2.6.12 of the Linux Kernel was released using Git.

What’s in a Name?

Linus himself rationalizes the name Git by claiming, “I’m an egotistical bastard, and I name all my projects after myself. First Linux, now git.”[5] Granted, the name Linux for the kernel was sort of a hybrid of Linus and Minix. The irony of using a British term for a silly or worthless person was not missed, either.

Since then, others have suggested some alternative and perhaps more palatable interpretations: “the Global Information Tracker” seems to be the most popular.



[1] Linux® is the registered trademark of Linus Torvalds in the United States and other countries.

[2] UNIX is a registered trademark of The Open Group in the United States and other countries.

[3] Private email.

[4] See http://kerneltrap.org/node/13996 for a starting point on how the old BitKeeper logs were imported into a Git repository for older history (pre-2.5).
