GIT: How does it work internally?

Introduction

In modern software development, managing and tracking changes to source code can be very challenging without a reliable VCS (Version Control System) like GIT.

In this article, we will learn about VCS, GIT, and how GIT works internally. We’ll also cover what the hidden .git folder is, how it’s used, which data structures GIT relies on, and what really happens inside GIT when we perform common operations.

What is VCS (Version Control System)?

Version Control, also known as Source Control, is the practice of tracking and managing changes to software code.

Version Control Systems are software tools that help software teams manage changes to source code over time.

List of VCS:

There are many VCS available, both open source and commercial (Perforce, for example, is a proprietary product). Below are a few popular ones:

Subversion (SVN) - A centralized VCS released on October 20, 2000. Originally developed by CollabNet, and now licensed under Apache
Perforce - A centralized VCS founded in 1995 by Christopher Seiwald. It was renamed Perforce Helix in 2015 and then Helix Core in 2017
Mercurial - A distributed VCS released on April 19, 2005 by Matt Mackall
GIT - A distributed VCS released on April 7, 2005 by Linus Torvalds

GIT is the most popular among all of them. Let’s see why teams choose GIT over other VCS.

Why is GIT better than others? What are the key benefits of GIT?

Distributed Nature

Each developer clones a full copy of the remote repository to their local system and can work independently, even offline, which enables fast commits without relying on a central server.

Branching & Merging

Developers can create their own branches and merge code, facilitating parallel development on features with minimal conflicts.

Performance

Operations such as committing, diffing, and viewing logs and history are very fast because they run against the local repository.

Collaboration

Teams can collaborate by sharing code changes through pull requests and can easily resolve merge conflicts.

Backup & Redundancy

The local copy of the repository that each developer holds serves as a backup in itself.

Large community support

GIT has a very large community as well as extensive documentation, which makes it easier to learn and troubleshoot.

History of GIT

Created in 2005 by Linus Torvalds, Git was designed to maintain the development of the Linux kernel. Git was built upon the pillars of full distribution, speed, simple design, ability to handle large projects, and strong support for non-linear development.

Let’s start with initializing a fresh GIT repository

To initialize a new GIT repository locally, we simply need to run the below GIT command:

git init

The result is as below:

Now, let’s see what exists in this newly initialized local repository.

So, a hidden .git folder has been added. Now, let’s check what’s inside this hidden .git folder.

There are a bunch of files and folders present, but don’t worry, we’ll look in depth at how they help us in our day-to-day work when we use basic GIT commands.

Understanding the structure of hidden GIT files and folders

First of all, let’s take a look at a brief explanation of these files and folders.

Eraser Link: https://app.eraser.io/workspace/Rmi4nAZ7iiNvTb8K6qF5

We now have a basic understanding of what kind of information is stored and where. The next questions are: what kind of data structures are used to store this information, and what rules does GIT follow to achieve all this?

Most common Data Structures used in GIT internally

In Git, the most common objects are:

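Commits: Structures that hold metadata about your commit, as well as pointers to the parent commit(s) and to the root tree of the files underneath.
Trees: Directory listings that map file names and permissions to blobs (files) and to other trees (subdirectories).
Blobs: The contents of a single file, stored without its name. The name lives in the tree that points to the blob.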

Deep dive into the internal functioning of GIT

Let’s check the initial status of our local GIT repository using the command below:

git status

Now, let’s create a text file.

Now, the GIT status is:

file1.txt is marked as an Untracked file. In the file explorer of VS Code, you will see U beside the file name, which indicates that the file is untracked; other IDEs show a similar marker.

To track it, we will run the below command:

git add file1.txt

Look at the file explorer now, you will see something like this:

We can see A beside the file name, which indicates that the file is indexed and now present in the Staging area.

But what does that mean? Let’s take a look at the index file inside the .git folder.

The raw content of the index file looks unreadable. That is expected: it is stored in a binary format that holds references to the tracked files. To inspect the content of the index in a readable form, run the below GIT command instead:

git ls-files --stage

The result will look like this:
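Each line of that output carries the file mode, the object hash, a stage number, and then a tab followed by the path. As a small illustration, here is a sketch of how such a line could be split into its fields. The names StagedEntry and parse_stage_line are made up for this example, and the sample hash is simply the well-known blob hash for the text "hello world" plus a newline, not the hash from our repository. Paths with unusual characters may be quoted by GIT, which this sketch does not handle.

```python
from typing import NamedTuple


class StagedEntry(NamedTuple):
    """One line of `git ls-files --stage` (hypothetical helper type)."""
    mode: str       # e.g. "100644" for a regular, non-executable file
    object_id: str  # 40-character hex hash of the blob
    stage: int      # 0 for a normal entry; 1-3 appear during merge conflicts
    path: str       # path relative to the repository root


def parse_stage_line(line: str) -> StagedEntry:
    # The metadata and the path are separated by a single tab.
    meta, path = line.rstrip("\n").split("\t", 1)
    mode, object_id, stage = meta.split(" ")
    return StagedEntry(mode, object_id, int(stage), path)


# Sample line built by hand for illustration (not real output from our repo).
sample = "100644 3b18e512dba79e4c8300dd08aeb37f8e728b8dad 0\tfile1.txt"
entry = parse_stage_line(sample)
print(entry.mode, entry.object_id[:7], entry.path)
```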

So, whenever a file is staged, GIT stores its content as a Blob object and identifies it by an SHA-1 hash computed from that content; a commit later gets a hash of its own in the same way. Now, if you check the .git/objects directory, you will find that the first two characters of the hash are used as a directory name and the remaining characters as the file name of the Blob created, as you can see in the screenshot below:
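To make the hashing rule concrete, here is a minimal Python sketch of how a blob hash and its location under .git/objects can be derived. It assumes a repository that uses SHA-1 (the default) and a loose, unpacked object. The function names blob_object_id and loose_object_path are our own and not part of GIT.

```python
import hashlib


def blob_object_id(content: bytes) -> str:
    # Git hashes a small header ("blob <size>\0") followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    # First two hex characters become the directory, the other 38 the file name.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


empty_id = blob_object_id(b"")
print(empty_id)                     # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(loose_object_path(empty_id))
```

The same hash for a real file can be checked with git hash-object <file>, which should print the identical value for identical content.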

The content of the Blob is also compressed (GIT uses zlib for this), as we can see:
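The compression step can be sketched in a few lines of Python. This is only an illustration of the loose object layout (a header, a NUL byte, the content, all zlib-compressed); the helper names encode_loose_object and decode_loose_object are assumptions made for this example.

```python
import zlib
from typing import Tuple


def encode_loose_object(kind: str, content: bytes) -> bytes:
    # Layout before compression: "<kind> <size>\0<content>"
    raw = kind.encode("ascii") + b" " + str(len(content)).encode("ascii") + b"\0" + content
    return zlib.compress(raw)


def decode_loose_object(data: bytes) -> Tuple[str, bytes]:
    raw = zlib.decompress(data)
    header, _, body = raw.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return kind.decode("ascii"), body


stored = encode_loose_object("blob", b"hello world\n")
print(decode_loose_object(stored))
```

To try it on a real object, one could read the bytes of a file under .git/objects/ and pass them to decode_loose_object, although git cat-file -p is the safer everyday tool.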

Now, let’s check the status of our GIT working directory.

Our changes are in the staging area but not yet committed, so let’s commit them using the below GIT command:

git commit -m "our first commit"

The result will be something like this:

Let’s check the logs, using GIT command:

git log

We can see the list of commits, although for now we have only one. Each log entry shows the commit hash ID, the HEAD reference to the checked-out branch, the author’s name and email ID, the date of the commit, and the commit message.

We can run the command below to see only the most essential information in the logs:

git log --oneline

The result will be:

As we can see, it only shows the first few characters of the full commit hash ID. We can use just these first few characters to refer to a particular commit, as long as they identify it uniquely.
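The idea behind short hashes can be sketched as a simple prefix lookup. The code below is a simplified model, not GIT’s actual implementation: resolve_prefix is a hypothetical helper, and the list of known hashes is passed in by hand. The minimum length of 4 mirrors the shortest abbreviation GIT accepts.

```python
from typing import Iterable

MIN_PREFIX_LEN = 4  # Git rejects abbreviations shorter than 4 hex digits


def resolve_prefix(prefix: str, object_ids: Iterable[str]) -> str:
    """Return the single full hash that starts with `prefix`."""
    if len(prefix) < MIN_PREFIX_LEN:
        raise ValueError("prefix is too short")
    prefix = prefix.lower()
    matches = [oid for oid in object_ids if oid.startswith(prefix)]
    if not matches:
        raise KeyError(f"no object starts with {prefix}")
    if len(matches) > 1:
        raise ValueError(f"prefix {prefix} is ambiguous")
    return matches[0]


known = [
    "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391",
    "3b18e512dba79e4c8300dd08aeb37f8e728b8dad",
]
print(resolve_prefix("3b18e", known))
```

The more objects a repository holds, the longer a prefix needs to be before it is unique, which is why the abbreviated hashes shown by git log --oneline can grow over time.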

Alright, next let’s take a look at the objects folder again.

What’s this? Let’s recap. We initially staged a file named file1.txt, which created an entry in the index file and a subfolder 35 under the objects folder with a blob object inside it. Then, after our first commit, we got a hash ID starting with 48 for our committed changes.

Now, in the objects folder, we can see another subfolder 48 for this commit. It contains the commit object, which holds the details of the commit. But wait, what about subfolder 68? Where did it come from? Huh..

Let’s dive a little deeper here. We’ll first check the contents of the commit object created by our first commit. To do so, run the below command:

git cat-file -p <provide_a_commit_hash_here>

The result is as below for our first commit hash:

It has a reference to a tree object with a hash starting with 68. So there is a link between the commit hash and this other object.

We will now actually see how GIT manages the references to our commits and tags using trees, blobs and commit hashes.

First, let’s summarize everything we have done so far.

When we moved our changes from the working directory to the staging area, GIT created a “cache entry” for that Blob object in the staging area, i.e., inside the .git/index file, which stores the filename/path, permissions, and hash of the blob.

The actual Blob content can be seen inside .git/objects folder (in our case, it’s Blob object inside subfolder 35).
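As a small peek into the index file itself, the sketch below reads only its fixed 12-byte header: the signature DIRC, a version number, and the number of cache entries, each stored as big-endian values. The function name parse_index_header is our own, and the sketch deliberately stops before the entries, whose layout is more involved.

```python
import struct
from typing import Tuple


def parse_index_header(data: bytes) -> Tuple[int, int]:
    """Return (version, entry_count) from the first 12 bytes of an index file."""
    if len(data) < 12:
        raise ValueError("index header needs at least 12 bytes")
    # 4-byte signature, then two unsigned 32-bit big-endian integers.
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return version, entry_count


# Hand-built header for illustration: version 2, one staged entry.
example = b"DIRC" + struct.pack(">II", 2, 1)
print(parse_index_header(example))
```

On a real repository, the bytes could come from opening .git/index in binary read mode; after staging only file1.txt, the entry count would be expected to be 1.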

Now, when we committed our changes for the first time, GIT built a tree structure in memory from the cache entries in the .git/index file, and this tree is referred to as the “root tree” in the new commit object.

Note: Each commit points to a "root tree" object that represents a snapshot of the project contents, as staged, at that time.
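Here is a rough sketch of how a tree object is put together and hashed. Each entry is written as the mode, a space, the file name, a NUL byte, and the 20 raw bytes of the object hash. The helper tree_object_id is hypothetical, and it only sorts plain file names; GIT’s real ordering treats subdirectories slightly differently, so take this as a simplified model for a flat folder.

```python
import hashlib
from typing import List, Tuple


def tree_object_id(entries: List[Tuple[str, str, str]]) -> str:
    """entries: (mode, name, hex object id), e.g. ("100644", "file1.txt", "<hash>")."""
    body = b""
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode("utf-8")):
        # The hash is stored as 20 raw bytes, not as 40 hex characters.
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0" + bytes.fromhex(hex_id)
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# A tree with no entries always hashes to the well-known "empty tree" id.
print(tree_object_id([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Because the tree hash depends on the blob hashes inside it, changing a single file changes the root tree hash, and therefore the hash of any commit that points to it.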

So, that’s why we saw a reference to the root tree object in the content of the commit object.

In our case, the commit object is inside subfolder 48, and it contains a reference to the root tree object, i.e., the tree object inside subfolder 68.

Now, let’s check the content of the root tree object.

Wow, so this root tree object contains a reference to the blob object that stores the actual content of the committed file (in our case, the Blob object inside subfolder 35).

Now please look at the flow chart below:

Eraser Link: https://app.eraser.io/workspace/L4kY2PcSnhrO6KZ7eWmv

It shows how trees, blobs and commit hashes are interlinked, and how GIT internally keeps references to the content of our working directory, which is stored on our local system inside the hidden .git folder.

What is the difference between a tree and a commit?

Trees and commits are different types of Git objects. While a tree links together a set of blobs with their names and permissions, a commit connects a "root tree" with an author, committer, datetime information, commit message, and commit parents - which effectively turns it into a snapshot of your project a certain point in time.
- From initialcommit.com/blog/what-is-a-tree-in-git
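To round this off, here is a sketch of what a commit object looks like before it is hashed: plain text lines for the tree, any parents, the author and the committer, then a blank line and the message. The helper names and the sample author line are made up for illustration, and the sketch ignores optional extras such as signatures or encoding headers.

```python
import hashlib
from typing import Sequence


def commit_object_body(tree_id: str, parent_ids: Sequence[str],
                       author: str, committer: str, message: str) -> bytes:
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]  # none for the very first commit
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    text = "\n".join(lines) + "\n\n" + message + "\n"
    return text.encode("utf-8")


def commit_object_id(body: bytes) -> str:
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical identity and timestamp, in Git's "Name <email> <unix time> <tz>" form.
who = "Jane Doe <jane@example.com> 1700000000 +0000"
body = commit_object_body("4b825dc642cb6eb9a060e54bf8d69288fbee4904", [], who, who, "our first commit")
print(body.decode("utf-8"))
print(commit_object_id(body))
```

Since the author, committer and timestamps are part of the hashed text, two commits with the same tree and message still get different hashes when they are made at different times or by different people.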

We are now clear on the basic internal working of GIT, and we understand what happens under the hood when we run GIT commands.

Summary

In this article, we learned about GIT, its history, the benefits of GIT over other VCS, the structure of the .git folder, and the internal functioning of GIT along with a few basic GIT commands.

We have only covered the most basic internal functioning of GIT, and there is a lot more to it. I encourage you to keep learning and to explore more of GIT yourself.

Note: We should never make manual changes inside the .git folder. Everything in GIT can be handled through its wide variety of commands. Changing any of the files inside the hidden folder by hand is not recommended at all.

Lastly, I would like to request your feedback on this article. Please point out any mistakes you find.

Thank you so much for reading this article till the end.
