Git Internals: Understanding the Architecture of Distributed Version Control

Last updated Feb 1, 2026 Published Nov 21, 2016

The content here is under the Attribution 4.0 International (CC BY 4.0) license

Understanding Git’s internal architecture transforms it from an opaque tool with memorized commands into a transparent system whose behavior becomes predictable and debuggable. While Git’s porcelain commands (git add, git commit, git merge) provide a user-friendly interface, the plumbing commands and underlying data structures reveal an elegant content-addressable filesystem built on cryptographic hashing and directed acyclic graphs (Chacon & Straub, 2014).

This deep understanding proves essential when troubleshooting complex scenarios: merge conflicts, repository corruption, performance issues, or unusual workflow requirements. As Torvalds emphasized in his 2007 tech talk, Git was designed as “a content tracker” and “a stupid file system” that higher-level tools could leverage (Torvalds, 2007). The abstractions Git provides—branches, merges, rebases—are consequences of its underlying object model rather than first-class primitives.

This article examines Git’s internal mechanisms through both theoretical foundations and practical demonstrations. We explore how Git stores data, manages references, optimizes storage, and executes common operations. The analysis bridges academic research on distributed version control systems (Bird et al., 2009; Mackall, 2006) with hands-on examples demonstrating Git’s plumbing commands.

Prerequisites and Related Reading

This article assumes familiarity with basic Git concepts.

The .git Directory: Repository Structure

Every Git repository centers on the .git directory, created when running git init or cloning a repository. This directory contains all metadata, object database entries, and configuration needed to track repository history independently of the working tree. Understanding this structure provides insight into Git’s architecture and enables advanced troubleshooting.

$ git init example-repo
Initialized empty Git repository in /path/to/example-repo/.git/

$ tree -L 2 .git/
.git/
├── HEAD                    # Current branch reference
├── config                  # Repository-specific configuration
├── description            # Repository description (for GitWeb)
├── hooks/                 # Client/server-side hook scripts
│   ├── pre-commit.sample
│   ├── pre-push.sample
│   └── ...
├── info/                  # Repository-local exclude patterns
│   └── exclude
├── objects/               # Object database (all content)
│   ├── info/
│   └── pack/              # Packfiles for compression
├── refs/                  # References (branches, tags)
│   ├── heads/             # Local branches
│   ├── remotes/           # Remote-tracking branches
│   └── tags/              # Tags
└── logs/                  # Reflogs (reference history)
    ├── HEAD
    └── refs/

Critical Components

HEAD: A symbolic reference pointing to the current branch (usually refs/heads/main or refs/heads/master). When in “detached HEAD” state, HEAD points directly to a commit SHA-1 rather than a branch reference (Git SCM, 2024).

objects/: The object database storing all repository content using content-addressable storage. Objects are organized by their SHA-1 hash (first 2 characters determine subdirectory, remaining 38 characters form filename).

refs/: Directory storing references—human-readable names pointing to commit objects. Branches (refs/heads/), remote branches (refs/remotes/), and tags (refs/tags/) reside here.

config: Repository configuration overriding global and system settings. Contains remote definitions, branch tracking relationships, and repository-specific options.

logs/: Reflogs recording historical values of references, enabling recovery of seemingly lost commits through git reflog.

hooks/: Scripts triggered at specific points in Git’s execution (pre-commit, post-commit, pre-push, etc.), enabling workflow automation and policy enforcement (Git SCM, 2024).

Git’s Object Model: The Foundation of Version Control

Git’s architecture centers on four object types stored in the object database: blobs (file content), trees (directories), commits (snapshots), and tags (annotated references). These immutable objects form a directed acyclic graph (DAG) representing repository history (Loeliger & McCullough, 2012).

Content-Addressable Storage

Git employs content-addressable storage where each object’s identifier is the SHA-1 hash of its content. This design provides several properties:

  1. Integrity verification: Any corruption is immediately detectable through hash mismatch
  2. Deduplication: Identical content is stored once regardless of filename or location
  3. Immutability: Modifying content changes its hash, creating a new object
  4. Efficient comparison: Hash comparison determines content equality without reading entire objects

The SHA-1 hash function produces a 160-bit (20-byte) hash, typically represented as a 40-character hexadecimal string (“US Secure Hash Algorithm 1 (SHA1),” 2001). A practical SHA-1 collision has been demonstrated (Stevens et al., 2017), although producing one still requires substantial computation. In response, Git adopted a SHA-1 implementation that detects known collision attack patterns, and it is gradually adding support for SHA-256 as an alternative object format.

Blob Objects: File Content Storage

Blobs (binary large objects) store file content without metadata—no filename, permissions, or directory information. A blob represents pure content, identified by the SHA-1 hash of its contents.

$ echo "Hello, Git internals" > hello.txt
$ git hash-object -w hello.txt
f572d396fae9206628714fb2ce00f72e94f2258f

$ git cat-file -t f572d396fae9206628714fb2ce00f72e94f2258f
blob

$ git cat-file -p f572d396fae9206628714fb2ce00f72e94f2258f
Hello, Git internals

$ git cat-file -s f572d396fae9206628714fb2ce00f72e94f2258f
21

The git hash-object command computes the SHA-1 hash and, with -w, writes the object to the database. The git cat-file command inspects objects: -t shows type, -p prints content, -s shows size in bytes.

Object Storage Format: Git stores objects in a compressed format using zlib. The object format consists of:

<object-type> <content-length>\0<content>

For our example blob:

blob 21\0Hello, Git internals

This entire content is compressed and stored at .git/objects/f5/72d396fae9206628714fb2ce00f72e94f2258f.
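The object format above can be reproduced outside Git. The following is a minimal Python sketch rather than Git's own code; the function names git_object_id, loose_object_path, and loose_object_bytes are made up for illustration, and a SHA-1 repository is assumed.

```python
import hashlib
import zlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 object ID for content of the given type."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """First two hex characters name the directory, the remaining 38 the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def loose_object_bytes(obj_type: str, content: bytes) -> bytes:
    """Return a zlib stream that decompresses to header + content.

    Git may pick a different compression level, so the compressed bytes
    need not match Git's file byte for byte.
    """
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return zlib.compress(header + content)


# The empty blob has a well-known ID in every SHA-1 repository.
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Note that echo appends a newline, so the content hashed for hello.txt is 21 bytes, matching the size reported by git cat-file -s.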

Tree Objects: Directory Structure

Trees represent directories, mapping filenames to blob SHA-1s (for files) or other tree SHA-1s (for subdirectories), along with file mode information. Trees capture the complete directory structure at a point in time (Git SCM, 2024).

$ git mktree <<EOF
100644 blob f572d396fae9206628714fb2ce00f72e94f2258f	hello.txt
100644 blob 2f424382d75e5dbf0b74ebc4d7bf820c3e3aecea	README.md
040000 tree 8d8c4f6e5e5f7a5c8d7e6f5a4b3c2d1e0f1a2b3c	docs
EOF
c3a7f7c9d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9

$ git cat-file -p c3a7f7c9d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9
100644 blob 2f424382d75e5dbf0b74ebc4d7bf820c3e3aecea	README.md
040000 tree 8d8c4f6e5e5f7a5c8d7e6f5a4b3c2d1e0f1a2b3c	docs
100644 blob f572d396fae9206628714fb2ce00f72e94f2258f	hello.txt

File Modes: Git stores Unix file permissions in a simplified form:

  • 100644: Regular file (not executable)
  • 100755: Executable file
  • 120000: Symbolic link
  • 040000: Directory (tree object)
  • 160000: Git submodule

Trees enable efficient directory comparison: if two commits reference the same tree SHA-1 for a subdirectory, that subdirectory’s contents are identical without examining individual files.
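The pretty-printed listing hides the binary layout of a tree. As a rough sketch (the helper names are hypothetical, and SHA-1 object IDs are assumed), each entry is stored as the mode, a space, the name, a NUL byte, and the 20 raw hash bytes. Directory modes are stored without the leading zero ("40000"), and entries are ordered as if directory names ended in "/".

```python
import hashlib


def tree_sort_key(entry):
    """Order entries by name, comparing directories as if they ended in '/'."""
    mode, name, _ = entry
    return (name + "/" if mode == "40000" else name).encode()


def serialize_tree(entries):
    """entries: iterable of (mode, name, hex_object_id). Returns the tree body."""
    body = b""
    for mode, name, hex_id in sorted(entries, key=tree_sort_key):
        # "<mode> <name>\0" followed by the 20 raw bytes of the object ID
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return body


def tree_id(entries):
    """Hash the serialized tree the same way blobs are hashed."""
    body = serialize_tree(entries)
    return hashlib.sha1(b"tree %d\x00" % len(body) + body).hexdigest()


# The empty tree also has a well-known ID.
print(tree_id([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Because a subtree's ID is embedded in its parent's entry, any change deep in the hierarchy changes every tree ID up to the root, which is what makes the subtree comparison described above a single hash check.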

Commit Objects: Repository Snapshots

Commits represent snapshots of the repository at specific points in time. A commit object contains:

  • A reference to the root tree object (repository state)
  • Parent commit reference(s) (zero for initial commit, one for normal commit, multiple for merge commit)
  • Author information (name, email, timestamp)
  • Committer information (can differ from author for patches)
  • Commit message

$ git cat-file -p HEAD
tree c3a7f7c9d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9
parent 8e3c7d2b1a0f9e8d7c6b5a4f3e2d1c0b9a8f7e6
author Matheus Marabesi <matheus@marabesi.com> 1699776000 +0000
committer Matheus Marabesi <matheus@marabesi.com> 1699776000 +0000

Add documentation for Git internals

This commit introduces comprehensive documentation explaining
Git's object model, content-addressable storage, and internal
architecture with academic references and practical examples.

$ git log --format=raw HEAD~1..HEAD
commit a7c9e3f2d4b5c6e7f8a9b0c1d2e3f4a5b6c7d8e9
tree c3a7f7c9d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9
parent 8e3c7d2b1a0f9e8d7c6b5a4f3e2d1c0b9a8f7e6
author Matheus Marabesi <matheus@marabesi.com> 1699776000 +0000
committer Matheus Marabesi <matheus@marabesi.com> 1699776000 +0000

    Add documentation for Git internals
    
    This commit introduces comprehensive documentation explaining
    Git's object model, content-addressable storage, and internal
    architecture with academic references and practical examples.

The commit structure creates an immutable chain where each commit references its parent(s), forming a directed acyclic graph. This structure enables efficient traversal, bisection, and historical analysis (Zeller, 1999).
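The chain is immutable because the parent IDs are part of the content that gets hashed. A small sketch, using a hypothetical commit_id helper and placeholder field values, shows that changing only the parent yields a different commit ID:

```python
import hashlib


def commit_id(tree, parents, author, committer, message):
    """Hash a commit object assembled from its fields (values are illustrative)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode()
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()


who = "Example Author <author@example.com> 1699776000 +0000"  # placeholder identity
tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"             # the empty tree

on_a = commit_id(tree, ["a" * 40], who, who, "Same change\n")
on_b = commit_id(tree, ["b" * 40], who, who, "Same change\n")
print(on_a != on_b)  # True: same tree and message, different parent, different ID
```

This is also why rewriting one commit forces new IDs for all of its descendants.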

Merge Commits: When merging branches, Git creates a commit with multiple parents, explicitly recording the integration point:

$ git cat-file -p <merge-commit-hash>
tree f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0
parent a7c9e3f2d4b5c6e7f8a9b0c1d2e3f4a5b6c7d8e9
parent b8d0f4e3c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0
author Matheus Marabesi <matheus@marabesi.com> 1699776000 +0000
committer Matheus Marabesi <matheus@marabesi.com> 1699776000 +0000

Merge branch 'feature' into main

Tag Objects: Annotated References

While lightweight tags are simple references to commits, annotated tags are full objects containing:

  • Tagged object reference (usually a commit, but can be any Git object)
  • Tag name
  • Tagger information (name, email, timestamp)
  • Tag message
  • Optional GPG signature for verification

$ git tag -a v1.0.0 -m "Release version 1.0.0"

$ git cat-file -p v1.0.0
object a7c9e3f2d4b5c6e7f8a9b0c1d2e3f4a5b6c7d8e9
type commit
tag v1.0.0
tagger Matheus Marabesi <matheus@marabesi.com> 1699776000 +0000

Release version 1.0.0

$ git verify-tag v1.0.0  # If GPG-signed
gpg: Signature made Sun Nov 12 08:00:00 2023 UTC
gpg: Good signature from "Matheus Marabesi <matheus@marabesi.com>"

Annotated tags persist in the object database independently of branches, providing stable reference points for releases, milestones, or significant commits.

References and the Refspec System

References (refs) provide human-readable names for commits, abstracting the need to remember SHA-1 hashes. References are organized hierarchically under .git/refs/ and can be symbolic (pointing to another reference) or direct (pointing to a commit SHA-1) (Git SCM, 2024).

Branch References (refs/heads/)

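In the default loose-ref layout, a branch is simply a file containing a commit SHA-1 (after git pack-refs or git gc, refs may instead be listed in .git/packed-refs). When you commit to a branch, Git updates the ref with the new commit’s hash: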

$ cat .git/refs/heads/main
a7c9e3f2d4b5c6e7f8a9b0c1d2e3f4a5b6c7d8e9

$ git commit -m "Update documentation"
[main b3c4d5e] Update documentation

$ cat .git/refs/heads/main
b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2

$ git log --oneline --graph main
* b3c4d5e (HEAD -> main) Update documentation
* a7c9e3f Add documentation for Git internals
* 8e3c7d2 Initial commit

Creating a branch is simply creating a reference file:

$ echo "a7c9e3f2d4b5c6e7f8a9b0c1d2e3f4a5b6c7d8e9" > .git/refs/heads/feature
$ git branch
  feature
* main

# Equivalent to:
$ git branch feature a7c9e3f2d4

Remote-Tracking Branches (refs/remotes/)

Remote-tracking branches track the state of branches in remote repositories. These are updated during git fetch and are read-only from the user’s perspective:

$ cat .git/refs/remotes/origin/main
a1b2c3d4e5f6a7b8c9d0e1f2a3b4b5c6d7e8f9a0

$ git fetch origin
From https://github.com/user/repo
   a1b2c3d..d7e8f9a  main       -> origin/main

$ cat .git/refs/remotes/origin/main
d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6

Tags (refs/tags/)

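Tag refs point at the tagged object. For a lightweight tag the file holds the commit hash itself (as shown here); for an annotated tag it holds the hash of the tag object, which in turn records the commit: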

$ cat .git/refs/tags/v1.0.0
a7c9e3f2d4b5c6e7f8a9b0c1d2e3f4a5b6c7d8e9

$ git show-ref --tags
a7c9e3f2d4b5c6e7f8a9b0c1d2e3f4a5b6c7d8e9 refs/tags/v1.0.0
c2d3e4f5a6b7c8d9e0f1a2b3c4b5c6d7e8f9a0b1 refs/tags/v1.1.0

HEAD: The Current Branch Pointer

HEAD is a special symbolic reference indicating the current branch. It typically contains a reference to a branch:

$ cat .git/HEAD
ref: refs/heads/main

$ git symbolic-ref HEAD
refs/heads/main

In detached HEAD state, HEAD points directly to a commit rather than a branch:

$ git checkout a7c9e3f2d4
Note: switching to 'a7c9e3f2d4'.

You are in 'detached HEAD' state...

$ cat .git/HEAD
a7c9e3f2d4b5c6e7f8a9b0c1d2e3f4a5b6c7d8e9

Detached HEAD state occurs when checking out commits, tags, or remote branches directly. Commits made in this state are not associated with any branch and may be lost when switching away unless explicitly referenced through a branch or tag.
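The two forms of HEAD are easy to tell apart programmatically. The sketch below uses a hypothetical parse_head helper and assumes SHA-1 (40 hex character) object IDs:

```python
import re


def parse_head(text: str):
    """Classify the contents of .git/HEAD as ('branch', ref) or ('detached', hash)."""
    value = text.strip()
    if value.startswith("ref: "):
        return ("branch", value[len("ref: "):])
    if re.fullmatch(r"[0-9a-f]{40}", value):
        return ("detached", value)
    raise ValueError(f"unrecognized HEAD contents: {value!r}")


print(parse_head("ref: refs/heads/main\n"))  # ('branch', 'refs/heads/main')
```

In a real repository the input would come from reading .git/HEAD; git symbolic-ref HEAD performs the same distinction and fails when HEAD is detached.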

Refspecs: Mapping Between Local and Remote

Refspecs define the relationship between remote and local references during fetch and push operations:

$ cat .git/config
[remote "origin"]
	url = https://github.com/user/repo.git
	fetch = +refs/heads/*:refs/remotes/origin/*

# Refspec format: [+]<source>:<destination>
# An optional leading + allows non-fast-forward updates
# * is a wildcard matching multiple refs

Common refspec patterns:

  • +refs/heads/*:refs/remotes/origin/* - Fetch all branches
  • refs/heads/main:refs/remotes/origin/main - Fetch specific branch
  • refs/heads/main:refs/heads/main - Push to remote branch
  • :refs/heads/feature - Delete remote branch (empty source)
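To make the wildcard mapping concrete, here is a simplified sketch of how a refspec translates a source ref into a destination ref. The helper names are hypothetical, and the sketch ignores refinements such as negative refspecs.

```python
def parse_refspec(spec: str):
    """Split '[+]<src>:<dst>' into (force, src, dst)."""
    force = spec.startswith("+")
    body = spec[1:] if force else spec
    src, _, dst = body.partition(":")
    return force, src, dst


def map_ref(spec: str, ref: str):
    """Return the destination for ref, or None if the refspec does not match."""
    _, src, dst = parse_refspec(spec)
    if "*" not in src:
        return dst if ref == src else None
    prefix, suffix = src.split("*", 1)
    if len(ref) < len(prefix) + len(suffix):
        return None
    if not (ref.startswith(prefix) and ref.endswith(suffix)):
        return None
    matched = ref[len(prefix):len(ref) - len(suffix)]
    return dst.replace("*", matched, 1)  # substitute what the wildcard matched


print(map_ref("+refs/heads/*:refs/remotes/origin/*", "refs/heads/main"))
# refs/remotes/origin/main
```

The empty source in :refs/heads/feature parses to src == "", which is how a push expresses deletion.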

The Reflog: Reference History and Recovery

The reflog (reference log) records every change to branch tips and HEAD, providing a safety net for recovering from mistakes. Unlike commit history, which forms a permanent DAG, reflogs are local and expire after a configurable period (default 90 days for reachable commits, 30 days for unreachable).

$ git reflog
b3c4d5e (HEAD -> main) HEAD@{0}: commit: Update documentation
a7c9e3f HEAD@{1}: commit: Add documentation for Git internals
8e3c7d2 HEAD@{2}: commit (initial): Initial commit

$ cat .git/logs/HEAD
0000000000000000000000000000000000000000 8e3c7d2b1a0f9e8d7c6b5a4f3e2d1c0b9a8f7e6 Matheus Marabesi <matheus@marabesi.com> 1699776000 +0000	commit (initial): Initial commit
8e3c7d2b1a0f9e8d7c6b5a4f3e2d1c0b9a8f7e6 a7c9e3f2d4b5c6e7f8a9b0c1d2e3f4a5b6c7d8e9 Matheus Marabesi <matheus@marabesi.com> 1699776000 +0000	commit: Add documentation for Git internals
a7c9e3f2d4b5c6e7f8a9b0c1d2e3f4a5b6c7d8e9 b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2 Matheus Marabesi <matheus@marabesi.com> 1699776000 +0000	commit: Update documentation
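Each line of the log file follows a fixed layout: old hash, new hash, identity, timestamp, timezone, then a tab and the message. A minimal parser might look like the sketch below; ReflogEntry and parse_reflog_line are illustrative names, not part of Git.

```python
from typing import NamedTuple


class ReflogEntry(NamedTuple):
    old: str
    new: str
    identity: str
    timestamp: int
    tz: str
    message: str


def parse_reflog_line(line: str) -> ReflogEntry:
    """Parse one line from a file under .git/logs/."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old, new, rest = meta.split(" ", 2)
    # The identity may contain spaces, so split the two trailing fields from the right.
    identity, timestamp, tz = rest.rsplit(" ", 2)
    return ReflogEntry(old, new, identity, int(timestamp), tz, message)
```

An all-zero old hash marks the creation of the reference, as in the first line of the log above.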

Recovery Scenarios Using Reflog

Recovering from hard reset:

$ git reset --hard HEAD~2  # Accidentally go back 2 commits
HEAD is now at 8e3c7d2 Initial commit

$ git reflog
8e3c7d2 (HEAD -> main) HEAD@{0}: reset: moving to HEAD~2
b3c4d5e HEAD@{1}: commit: Update documentation
a7c9e3f HEAD@{2}: commit: Add documentation for Git internals

$ git reset --hard HEAD@{1}  # Restore to before the reset
HEAD is now at b3c4d5e Update documentation

Recovering deleted branch:

$ git branch feature
$ git checkout feature
$ echo "Feature work" > feature.txt
$ git add feature.txt && git commit -m "Add feature"
$ git checkout main
$ git branch -D feature  # Accidentally delete branch
Deleted branch feature (was c5d6e7f).

$ git reflog | grep feature
a7c9e3f HEAD@{0}: checkout: moving from feature to main
c5d6e7f HEAD@{1}: commit: Add feature
a7c9e3f HEAD@{2}: checkout: moving from main to feature

$ git branch feature c5d6e7f  # Recreate branch at commit
$ git log feature --oneline
c5d6e7f (feature) Add feature
a7c9e3f Add documentation for Git internals

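Reflogs are kept per reference. HEAD has its own log in .git/logs/HEAD, and each branch has one under .git/logs/refs/heads/<branch-name> recording every change to that branch’s tip. Deleting a branch also deletes its reflog, which is why the recovery above searches HEAD’s reflog instead.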

Pack Files and Storage Optimization

Git initially stores objects as loose objects: individual zlib-compressed files under .git/objects/. As repositories grow, this becomes inefficient in both disk space and file count. Git therefore combines objects into pack files and applies delta compression between similar objects, which often shrinks a repository considerably (Git SCM, 2024).

When Packing Occurs

Git automatically packs objects during:

  • git gc (garbage collection)
  • git push (creating packfiles for network transfer)
  • Automatic maintenance when loose objects exceed threshold (default 6700)
  • Manual invocation: git repack

$ git count-objects -v
count: 23
size: 148
in-pack: 0
packs: 0
size-pack: 0
prune-packable: 0
garbage: 0

$ git gc
Enumerating objects: 23, done.
Counting objects: 100% (23/23), done.
Delta compression using up to 8 threads
Compressing objects: 100% (18/18), done.
Writing objects: 100% (23/23), done.
Total 23 (delta 5), reused 0 (delta 0), pack-reused 0

$ git count-objects -v
count: 0
size: 0
in-pack: 23
packs: 1
size-pack: 12
prune-packable: 0
garbage: 0

After packing, loose objects are removed, and objects reside in pack files under .git/objects/pack/:

$ ls -lh .git/objects/pack/
-r--r--r-- 1 user user  12K pack-a1b2c3d4...f9a0.idx
-r--r--r-- 1 user user  45K pack-a1b2c3d4...f9a0.pack

Pack File Structure

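A pack file contains multiple objects compressed together with delta compression. When building a pack, Git sorts candidate objects by type, name, and size, then tries to express each object as a delta against nearby objects within a sliding window (pack.window, 10 by default), keeping the smallest result. The underlying idea of storing differences goes back to classic diff algorithms (Hunt & McIlroy, 1976; Miller & Myers, 1985).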

Index file (.idx): Provides fast lookup mapping SHA-1 hashes to byte offsets within the pack file.

Pack file (.pack): Contains compressed and delta-encoded objects. Git uses delta chains where objects store differences from a base object:

Object A (base): Full content (10 KB)
Object B (delta from A): +5 lines, -2 lines (1 KB)
Object C (delta from B): +3 lines (0.5 KB)

Storage: 11.5 KB instead of 30 KB if stored separately

Git limits delta chain depth (pack.depth, 50 by default) to prevent excessive decompression overhead when accessing objects deep in chains.
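A toy model shows why chain depth matters: reading an object at the end of a chain means rebuilding every object before it. Git's packed deltas consist of copy and insert instructions in a compact binary encoding; the tuple form, the store layout, and the function names below are simplifications chosen for illustration.

```python
def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild an object from a base plus copy/insert instructions.

    ops is a list of ("copy", offset, length) or ("insert", data) tuples.
    """
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def resolve(name, store, max_depth=50):
    """Follow a delta chain down to a full object, refusing overly deep chains."""
    kind, payload = store[name]
    if kind == "full":
        return payload
    if max_depth == 0:
        raise RuntimeError("delta chain too deep")
    base_name, ops = payload
    return apply_delta(resolve(base_name, store, max_depth - 1), ops)


store = {
    "A": ("full", b"line1\nline2\nline3\n"),
    "B": ("delta", ("A", [("copy", 0, 12), ("insert", b"new\n"), ("copy", 12, 6)])),
    "C": ("delta", ("B", [("copy", 0, 16), ("insert", b"tail\n")])),
}
print(resolve("C", store))  # b'line1\nline2\nnew\ntail\n'
```

Resolving C requires rebuilding B, which requires reading A, so the cost of a read grows with the length of the chain.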

Pack File Format Details

Objects in pack files have types:

  • OBJ_COMMIT, OBJ_TREE, OBJ_BLOB, OBJ_TAG - Undeltified objects
  • OBJ_OFS_DELTA - Delta encoded, referring to base by offset within pack
  • OBJ_REF_DELTA - Delta encoded, referring to base by SHA-1

Git prioritizes OBJ_OFS_DELTA over OBJ_REF_DELTA for efficiency, as offset-based references avoid hash lookups.
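Each object in a pack starts with a variable-length header that encodes one of these types together with a size. The sketch below decodes that header; the function name and return shape are illustrative choices. The bit layout is the one described in Git's pack format documentation: three type bits and four size bits in the first byte, then seven size bits per continuation byte.

```python
OBJ_TYPES = {1: "OBJ_COMMIT", 2: "OBJ_TREE", 3: "OBJ_BLOB", 4: "OBJ_TAG",
             6: "OBJ_OFS_DELTA", 7: "OBJ_REF_DELTA"}


def decode_entry_header(data: bytes):
    """Decode the type/size header preceding an object in a .pack file.

    Returns (type_name, size, header_length_in_bytes).
    """
    byte = data[0]
    obj_type = (byte >> 4) & 0x07      # bits 6-4: object type
    size = byte & 0x0F                 # bits 3-0: low four bits of the size
    shift, used = 4, 1
    while byte & 0x80:                 # high bit set: another size byte follows
        byte = data[used]
        size |= (byte & 0x7F) << shift
        shift += 7
        used += 1
    return OBJ_TYPES[obj_type], size, used


# A 21-byte blob: 0xB5 = continuation bit | type 3 | low size bits 5, then 0x01.
print(decode_entry_header(bytes([0xB5, 0x01])))  # ('OBJ_BLOB', 21, 2)
```

For the two delta types, the header is followed by the base reference (an offset or an object ID) before the compressed delta data.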

How Git Commands Work Internally

Understanding how porcelain commands map to plumbing operations clarifies Git’s behavior and enables troubleshooting.

git add Internals

git add performs several operations:

  1. Compute SHA-1 hash of file content
  2. Write blob object to object database (if not already present)
  3. Update the index (staging area) with filename → blob SHA-1 mapping

$ echo "New content" > example.txt

# Internally equivalent to:
$ git hash-object -w example.txt
2f65e8e9f7c6d5b4a3f2e1d0c9b8a7f6e5d4c3b2

$ git update-index --add --cacheinfo 100644 2f65e8e9f7c6d5b4a3f2e1d0c9b8a7f6e5d4c3b2 example.txt

# The index (.git/index) now contains this mapping
$ git ls-files --stage
100644 2f65e8e9f7c6d5b4a3f2e1d0c9b8a7f6e5d4c3b2 0	example.txt

The index is a binary file tracking the staging area’s state. The --cacheinfo flag provides metadata (mode, hash, filename) to update the index without filesystem interaction.
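The index begins with a 12-byte header: the signature DIRC, a version number, and the entry count, all big-endian. As a small sketch (parse_index_header is a hypothetical helper), the header can be read like this:

```python
import struct


def parse_index_header(data: bytes):
    """Return (version, entry_count) from the first 12 bytes of an index file."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return version, count


# Synthetic header for illustration: version 2, one entry.
print(parse_index_header(struct.pack(">4sII", b"DIRC", 2, 1)))  # (2, 1)
```

Against a real repository, the input would be the first 12 bytes of .git/index, and the count should match the number of lines printed by git ls-files --stage when there are no conflicted entries.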

git commit Internals

git commit executes:

  1. Create tree object from index (staging area)
  2. Create commit object referencing tree and parent commit
  3. Update current branch reference to new commit
  4. Update reflog

# Manually creating a commit:
$ git write-tree  # Create tree from index
c3a7f7c9d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9

$ echo "Manual commit message" | git commit-tree c3a7f7c9d4 -p HEAD
a9b8c7d6e5f4a3b2c1d0e9f8a7b6c5d4e3f2a1b0

$ git update-ref refs/heads/main a9b8c7d6e5f4a3b2c1d0e9f8a7b6c5d4e3f2a1b0

# Equivalent to:
$ git commit -m "Manual commit message"

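The visible part of the commit is atomic: the tree and commit objects are written first, and only then is the branch reference updated, by writing a lock file and renaming it into place. If the process is interrupted before that rename, the branch still points at the old commit and the new objects are simply unreferenced until garbage collection removes them.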
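The reference update can be pictured as a write to a lock file followed by a rename. The sketch below illustrates that pattern for a loose ref; update_ref here is a hypothetical Python function, and it omits what Git additionally does, such as verifying the old value and appending to the reflog.

```python
import os


def update_ref(git_dir: str, ref: str, new_hash: str) -> None:
    """Write a loose ref through a lock file, then rename it into place."""
    path = os.path.join(git_dir, ref)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    lock = path + ".lock"
    # O_EXCL makes creation fail if another writer already holds the lock.
    fd = os.open(lock, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    with os.fdopen(fd, "w") as f:
        f.write(new_hash + "\n")
    # Readers see either the old file or the new one, never a partial write.
    os.replace(lock, path)
```

A leftover .lock file after a crash is the reason Git sometimes reports that it cannot lock a ref until the stale file is removed.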

git checkout / git switch Internals

Checking out a branch involves:

  1. Read the tree object of the target commit
  2. Update the index to match that tree
  3. Update the working directory to match the index (preserving uncommitted changes if possible)
  4. Update HEAD to point to the target branch (or commit for detached HEAD)

$ git checkout feature

# Roughly equivalent plumbing:
$ git read-tree -m -u HEAD feature          # Two-tree merge: update index and working tree, keep local changes
$ git symbolic-ref HEAD refs/heads/feature  # Rewrite .git/HEAD to "ref: refs/heads/feature"

Git prevents checkout if working directory changes would be overwritten, requiring explicit handling through stashing, committing, or forcing the checkout.

git merge Internals

Merging involves identifying the common ancestor and performing a three-way merge (Mens, 2002):

$ git merge feature

# Git finds merge base (common ancestor):
$ git merge-base main feature
8e3c7d2b1a0f9e8d7c6b5a4f3e2d1c0b9a8f7e6

# Three-way merge using:
# - Common ancestor (merge base)
# - Current branch tip (HEAD)
# - Feature branch tip

# If no conflicts:
$ git commit-tree <merged-tree> -p HEAD -p feature -m "Merge feature"

Git employs multiple merge strategies:

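  • ort (default since Git 2.34): Three-way merge that, when there are several merge bases, first merges them into a virtual common ancestor; it replaced the older recursive strategy
  • Octopus: Merge more than two branches at once (refuses merges that need manual conflict resolution)
  • Ours: Record a merge but keep the current branch’s tree unchanged, ignoring the other branch’s content entirely (not to be confused with the -X ours option, which only favors our side in conflicting hunks)
  • Subtree: Modified three-way strategy for subtree merges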
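The two ingredients of a merge, finding the merge base and applying the three-way rule, can be sketched on a toy commit graph. The function names and the dictionary representation of history are assumptions made for illustration; real strategies additionally merge file contents hunk by hunk when both sides changed the same path.

```python
def ancestors(commit, parents):
    """Return the set containing commit and everything reachable from it."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, []))
    return seen


def merge_bases(a, b, parents):
    """Common ancestors of a and b that are not ancestors of another common ancestor."""
    common = ancestors(a, parents) & ancestors(b, parents)
    return {c for c in common
            if not any(c != d and c in ancestors(d, parents) for d in common)}


def merge_path(base, ours, theirs):
    """Three-way rule for one path, given blob IDs (None means the path is absent)."""
    if ours == theirs:
        return ours          # both sides agree
    if ours == base:
        return theirs        # only their side changed
    if theirs == base:
        return ours          # only our side changed
    return "CONFLICT"        # both changed differently: needs a content-level merge


# History: A---B---C (main) with D---E (feature) branching from B
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"]}
print(merge_bases("C", "E", parents))  # {'B'}
```

When merge_bases returns more than one commit (a criss-cross history), the default strategy first merges those bases with each other to obtain a single virtual ancestor.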

git rebase Internals

Rebase re-applies commits atop a new base, creating new commit objects that carry the same changes but have different parents, usually different trees, and hence different SHA-1s:

$ git rebase main feature

# Git identifies commits unique to feature:
$ git log --oneline main..feature
c5d6e7f Add feature implementation
b4c5d6e Add feature tests

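# Starting from main, for each commit (oldest first):
# 1. Compute the changes the commit introduced relative to its parent
# 2. Apply those changes on top of the current tip
# 3. Create a new commit whose parent is the previous replayed commit (main for the first one)

# Result: New commits with the same changes, different parents/SHA-1s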
$ git log --oneline feature
e7f8a9b Add feature implementation  # New SHA-1
d6e7f8a Add feature tests           # New SHA-1
a7c9e3f (main) Update documentation

Rebase maintains linear history by avoiding merge commits, but rewrites history—problematic for published branches since collaborators’ commits reference different SHA-1s (Driessen, 2010).

Rebase vs Merge: Internal Differences

While both integrate changes from one branch into another, rebase and merge differ fundamentally in their approach and resulting history structure.

Merge: Preserving History

Merge creates a new commit with multiple parents, explicitly recording the integration point:

    A---B---C main
         \
          D---E feature

$ git checkout main
$ git merge feature

    A---B---C---M main
         \     /
          D---E feature

The merge commit M has parents C and E, preserving the complete development history. This approach:

  • Maintains explicit record of branch integration
  • Preserves context of when features were developed
  • Creates non-linear history reflecting parallel development
  • Enables easy identification of feature boundaries
  • Allows reverting entire feature branches via merge commit reversion

Rebase: Linearizing History

Rebase replays commits atop a new base, creating new commits with identical changes but different parents:

    A---B---C main
         \
          D---E feature

$ git checkout feature
$ git rebase main

    A---B---C main
             \
              D'---E' feature

Commits D' and E' introduce the same changes as D and E, but their trees also contain C’s changes and their parents differ, so their SHA-1s differ. This approach:

  • Creates linear history appearing as if work was done sequentially
  • Simplifies graph visualization and log reading
  • Enables fast-forward merges
  • Loses explicit record of when branches diverged
  • Rewrites history, potentially problematic for shared branches
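A toy model makes it concrete what is and is not preserved when commits are replayed. In this sketch a tree is a plain dictionary from path to content, the helper names are hypothetical, and conflicts are ignored entirely.

```python
def diff(parent_tree: dict, tree: dict) -> dict:
    """Changes that turn parent_tree into tree (None marks a deleted path)."""
    paths = parent_tree.keys() | tree.keys()
    return {p: tree.get(p) for p in paths if parent_tree.get(p) != tree.get(p)}


def apply(tree: dict, changes: dict) -> dict:
    """Return a copy of tree with the changes applied."""
    result = dict(tree)
    for path, content in changes.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


def rebase(commit_trees, old_base_tree, new_base_tree):
    """Replay each commit's changes on top of new_base_tree; return the new trees."""
    new_trees, prev_old, prev_new = [], old_base_tree, new_base_tree
    for tree in commit_trees:
        prev_new = apply(prev_new, diff(prev_old, tree))
        new_trees.append(prev_new)
        prev_old = tree
    return new_trees


B = {"a.txt": "1"}
C = {"a.txt": "1", "main.txt": "m"}   # main moved on after the fork
D = {"a.txt": "1", "f.txt": "x"}
E = {"a.txt": "1", "f.txt": "y"}
D2, E2 = rebase([D, E], B, C)
print(D2 == D, diff(C, D2) == diff(B, D))  # False True
```

The replayed snapshot differs from the original because it now includes main.txt, while the change it introduces relative to its parent is the same.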

Practical Example: Comparing Histories

# Setup scenario
$ git checkout -b feature
$ echo "Feature line 1" >> file.txt && git commit -am "Feature commit 1"
$ echo "Feature line 2" >> file.txt && git commit -am "Feature commit 2"
$ git checkout main
$ echo "Main line 1" > main.txt && git add main.txt && git commit -m "Main commit 1"

# Merge approach
$ git merge feature
Merge made by the 'ort' strategy.

$ git log --oneline --graph
*   7c8d9e0 (HEAD -> main) Merge branch 'feature'
|\
| * 6b7c8d9 (feature) Feature commit 2
| * 5a6b7c8 Feature commit 1
* | 4f5a6b7 Main commit 1
|/
* 3e4f5a6 Initial commit

# Rebase approach (reset to before merge)
$ git reset --hard 4f5a6b7
$ git checkout feature
$ git rebase main
$ git log --oneline --graph feature
* 8d9e0f1 (feature) Feature commit 2
* 9e0f1a2 Feature commit 1
* 4f5a6b7 (main) Main commit 1
* 3e4f5a6 Initial commit

The rebase history appears linear, while merge history explicitly shows parallel development. Choose based on project workflow needs: historical accuracy (merge) versus readability (rebase).

Interactive Rebase: History Editing

Interactive rebase (git rebase -i) enables editing commit history before replaying:

$ git rebase -i HEAD~3

# Editor opens with:
pick a1b2c3d Commit 1
pick d4e5f6a Commit 2
pick a7b8c9d Commit 3

# Reorder commits
# Can also use commands like:
# - pick: use commit
# - reword: change commit message
# - edit: amend commit
# - squash: combine with previous commit
# - fixup: like squash but discard message
# - drop: remove commit

Interactive rebase provides powerful history editing but should only be used on local, unpublished commits to avoid collaboration issues.

Git Hooks: Automating Workflows

Git hooks are scripts triggered at specific points in Git’s execution, enabling workflow automation, policy enforcement, and integration with external systems (Git SCM, 2024). Hooks reside in .git/hooks/ and must be executable.

Client-Side Hooks

pre-commit: Runs before commit is created. Use for linting, tests, or validation:

#!/bin/bash
# .git/hooks/pre-commit

# Run linter; a non-zero exit status from the hook aborts the commit
if ! npm run lint; then
    echo "Linting failed. Commit aborted."
    exit 1
fi

# Run tests
if ! npm test; then
    echo "Tests failed. Commit aborted."
    exit 1
fi

prepare-commit-msg: Modifies commit message before editor opens. Use for message templates:

#!/bin/bash
# .git/hooks/prepare-commit-msg

# Add branch name to commit message
BRANCH=$(git branch --show-current)
if [ -n "$BRANCH" ] && [ "$BRANCH" != "main" ]; then
    echo "[$BRANCH] $(cat "$1")" > "$1"
fi

commit-msg: Validates commit message format:

#!/bin/bash
# .git/hooks/commit-msg

# Enforce conventional commits format on the first line (subject)
if ! head -n 1 "$1" | grep -qE "^(feat|fix|docs|style|refactor|test|chore): .+"; then
    echo "Commit message must follow conventional commits format:"
    echo "  type: description"
    echo "  where type is: feat, fix, docs, style, refactor, test, or chore"
    exit 1
fi

post-commit: Runs after commit. Use for notifications or logging:

#!/bin/bash
# .git/hooks/post-commit

# Log commit to external system
COMMIT=$(git rev-parse HEAD)
curl -X POST https://api.example.com/commits -d "{\"sha\": \"$COMMIT\"}"

pre-push: Runs before push. Use for additional validation:

#!/bin/bash
# .git/hooks/pre-push

# Run integration tests before pushing
if ! npm run test:integration; then
    echo "Integration tests failed. Push aborted."
    exit 1
fi

Server-Side Hooks

pre-receive: Runs on remote before refs are updated. Use for access control:

#!/bin/bash
# Server: .git/hooks/pre-receive

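# Reject force push to main (a non-fast-forward update or a deletion)
zero=0000000000000000000000000000000000000000
while read -r oldrev newrev refname; do
    # An all-zero oldrev means the branch is being created, so there is nothing to compare
    if [ "$refname" = "refs/heads/main" ] && [ "$oldrev" != "$zero" ]; then
        if [ "$newrev" = "$zero" ] || ! git merge-base --is-ancestor "$oldrev" "$newrev"; then
            echo "Force push to main rejected"
            exit 1
        fi
    fi
done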

update: Runs for each ref being updated. Use for branch-specific policies:

#!/bin/bash
# Server: .git/hooks/update

REFNAME=$1
OLDREV=$2
NEWREV=$3

# Only allow annotated tags
if [[ $REFNAME == refs/tags/* ]]; then
    if [ "$(git cat-file -t "$NEWREV")" != "tag" ]; then
        echo "Only annotated tags allowed"
        exit 1
    fi
fi

post-receive: Runs after refs are updated. Use for notifications or deployment:

#!/bin/bash
# Server: .git/hooks/post-receive

# Deploy on push to main
while read oldrev newrev refname; do
    if [ "$refname" = "refs/heads/main" ]; then
        echo "Deploying to production..."
        ssh production "cd /var/www && git pull && systemctl restart app"
    fi
done

Hooks provide extensive automation capabilities but must be managed carefully—they can block operations if they fail or hang, and they execute with repository access permissions.

Advanced Topics

Partial Clones and Sparse Checkouts

Git supports partial clones, which fetch only the objects that are needed rather than the entire repository history. This matters most for large repositories and CI pipelines.

# Blobless clone: full commit and tree history, file contents fetched on demand
$ git clone --filter=blob:none https://github.com/user/large-repo.git

# Treeless clone: commits only; trees and blobs fetched on demand
$ git clone --filter=tree:0 https://github.com/user/large-repo.git

# Shallow clone (limited history depth)
$ git clone --depth=1 https://github.com/user/repo.git

Sparse checkouts enable checking out only specific directories:

$ git clone --filter=blob:none --sparse https://github.com/user/mono-repo.git
$ cd mono-repo
$ git sparse-checkout set services/api services/auth

These techniques reduce clone time and disk usage for large monorepos or repositories with extensive binary content (Kalliamvakou et al., 2014).

Git Replace: Object Substitution

The git replace mechanism allows substituting one object for another, useful for grafting histories or replacing corrupted objects:

# Replace commit with another (e.g., fixing author)
$ git replace <bad-commit> <good-commit>

# Git transparently uses good-commit when bad-commit is referenced
$ git cat-file -p <bad-commit>  # Shows good-commit content

Replace references are stored in .git/refs/replace/ and can be pushed/fetched, enabling coordinated history modification across repositories.

Commit Graph: Accelerating Git Operations

Modern Git versions use a commit-graph file (.git/objects/info/commit-graph) caching commit metadata for faster graph traversal:

$ git commit-graph write --reachable

# Accelerates operations like:
# - git log --graph
# - git merge-base
# - git branch --contains

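The commit-graph stores each commit’s parents, root tree SHA-1, commit date, and a generation number (in its basic form, 1 for a root commit, otherwise one more than the largest generation number among its parents). Git can then walk history without decompressing and parsing commit objects, and it can cut walks short: a commit can only reach commits with a smaller generation number, so reachability and merge-base queries skip large parts of the graph.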
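Generation numbers speed up graph walks by letting them stop early. The sketch below computes the basic topological generation number on a toy graph and uses it to prune an ancestry check. The function names and dictionary layout are illustrative, and newer Git versions may store a date-based variant of the number instead.

```python
def generation_numbers(parents):
    """1 for root commits, otherwise 1 + the maximum over the parents."""
    gen = {}

    def visit(c):
        if c not in gen:
            gen[c] = 1 + max((visit(p) for p in parents[c]), default=0)
        return gen[c]

    for c in parents:
        visit(c)
    return gen


def can_reach(start, target, parents, gen):
    """Is target an ancestor of start? Skip commits whose generation is too low."""
    stack, seen = [start], set()
    while stack:
        c = stack.pop()
        if c == target:
            return True
        if c in seen or gen[c] <= gen[target]:
            continue  # ancestors of c have even smaller generations than c
        seen.add(c)
        stack.extend(parents[c])
    return False


# A---B---C---M with D---E branching from B and merged into M
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"], "M": ["C", "E"]}
gen = generation_numbers(parents)
print(gen["M"], can_reach("C", "D", parents, gen))  # 5 False
```

In the last call the walk ends immediately: C and D share generation 3, so C cannot have D as an ancestor and none of C's parents need to be visited.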

Alternates: Shared Object Databases

Git can reference objects from other repositories via alternates, reducing disk usage when maintaining multiple similar repositories:

$ echo "/path/to/reference/repo/.git/objects" > .git/objects/info/alternates

# This repository now uses reference repo's objects without duplicating them

Alternates are used extensively in Git hosting platforms for fork management, where forks share underlying object storage with source repositories.

Resources and Further Reading

Books and Academic Resources

  • Pro Git by Scott Chacon and Ben Straub (Chacon & Straub, 2014) - Comprehensive Git reference with detailed internals coverage
  • Version Control with Git by Jon Loeliger (Loeliger & McCullough, 2012) - Deep dive into Git’s technical implementation
  • Bird et al., “The promises and perils of mining git” (Bird et al., 2009) - Research on Git repository analysis

Conclusion

Git’s internal architecture reflects careful design decisions balancing performance, integrity, and distributed collaboration needs. The content-addressable object model, directed acyclic graph structure, and sophisticated compression techniques enable efficient version control at scales from individual developers to repositories containing millions of commits.

Understanding these internals transforms Git from an opaque tool requiring command memorization into a transparent system whose behavior becomes predictable. When unusual scenarios arise—corrupted repositories, complex merges, performance issues, or unconventional workflows—knowledge of Git’s plumbing commands and underlying data structures enables diagnosis and resolution.

The abstractions Git provides—branches as lightweight pointers, commits as immutable snapshots, merges as explicit integration points—emerge naturally from the object model rather than being imposed externally. This elegant design has enabled Git to dominate version control and serve as a foundation for research in software evolution, code quality analysis, and developer collaboration patterns (Bird et al., 2009; Tornhill, 2018).

For software engineers, architects, and technical leads, Git internals knowledge facilitates better tooling decisions, workflow design, and repository management strategies. The investment in understanding Git’s architecture pays dividends through improved productivity, better debugging capabilities, and enhanced collaboration practices.

References

  1. Chacon, S., & Straub, B. (2014). Pro Git (2nd ed.). Apress. https://doi.org/10.1007/978-1-4842-0076-6
  2. Torvalds, L. (2007). Tech talk: Linus Torvalds on git. https://www.youtube.com/watch?v=4XpnKHJAok8
  3. Bird, C., Rigby, P. C., Barr, E. T., Hamilton, D. J., Germán, D. M., & Devanbu, P. (2009). The promises and perils of mining git. 2009 6th IEEE International Working Conference on Mining Software Repositories, 1–10. https://doi.org/10.1109/MSR.2009.5069475
  4. Mackall, M. (2006). Towards a better SCM: Revlog and mercurial. Linux Symposium, 2, 83–92.
  5. Git SCM. (2024). Git Internals - Git References. https://git-scm.com/book/en/v2/Git-Internals-Git-References
  6. Git SCM. (2024). Git Hooks. https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks
  7. Loeliger, J., & McCullough, M. (2012). Version Control with Git: Powerful Tools and Techniques for Collaborative Software Development (2nd ed.). O’Reilly Media.
  8. Eastlake, D., & Jones, P. (2001). US Secure Hash Algorithm 1 (SHA1). RFC 3174.
  9. Stevens, M., Bursztein, E., Karpman, P., Albertini, A., & Markov, Y. (2017). The first collision for full SHA-1. Annual International Cryptology Conference, 570–596. https://doi.org/10.1007/978-3-319-63688-7_19
  10. Git SCM. (2024). Git Internals - Git Objects. https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
  11. Zeller, A. (1999). Yesterday, my program worked. Today, it does not. Why? ACM SIGSOFT Software Engineering Notes, 24(6), 253–267. https://doi.org/10.1145/318774.318946
  12. Git SCM. (2024). Git Internals - Packfiles. https://git-scm.com/book/en/v2/Git-Internals-Packfiles
  13. Hunt, J. W., & McIlroy, M. D. (1976). An algorithm for differential file comparison. Computing Science Technical Report, 41.
  14. Miller, W., & Myers, E. W. (1985). A file comparison program. Software: Practice and Experience, 15(11), 1025–1040. https://doi.org/10.1002/spe.4380151102
  15. Mens, T. (2002). A state-of-the-art survey on software merging. IEEE Transactions on Software Engineering, 28(5), 449–462. https://doi.org/10.1109/TSE.2002.1000449
  16. Driessen, V. (2010). A successful Git branching model. https://nvie.com/posts/a-successful-git-branching-model/
  17. Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., Germán, D. M., & Damian, D. (2014). The promises and perils of mining GitHub. Proceedings of the 11th Working Conference on Mining Software Repositories, 92–101. https://doi.org/10.1145/2597073.2597074
  18. Tornhill, A. (2018). Software Design X-Rays: Fix Technical Debt with Behavioral Code Analysis. Pragmatic Bookshelf.
