Version Control Part 2 - Lewis Does Data

git --recap

In the first post in this series we covered how to install Git, some essential configuration, and a basic Git workflow, including how to initialise a project repository and how to track, stage and commit changes to files in a repository. In this post we are going to cover how to instruct a repository to ignore certain files or directories and how to view the changes made to files over time.


How to .gitignore Files and Directories

Data analysis often produces temporary or intermediate files that you don’t want to save. Those of you who are members of the Mac master race will also be all too aware of the scourge of the perpetually irksome “.DS_Store”. Luckily, good old Linus thought to implement a way of instructing a repository to ignore those files and directories that we don’t want Git to track. Genius.

All we have to do is create a hidden “.gitignore” file in the root directory of a repository and within this store a list of filenames, directories or wildcard patterns that specify the objects that we don’t want Git to pay attention to. For example, in the previous post I kicked off by navigating from my home directory to my Desktop and then initialising a Git repository called git_tutorial. The next step I would usually immediately undertake would be to create a “.gitignore” file to specify items I didn’t want Git to track:

# change directory from home dir to desktop
cd Desktop/

# initialise a git repository for a project called "git_tutorial"
git init git_tutorial

# navigate into the root directory of the newly created repository
cd git_tutorial/

# create and edit .gitignore file using the nano text editor
nano .gitignore

Executing the final command will open the nano text editor; you might remember our brief encounter with nano from last time when we set it as the default text editor and used it to write a detailed commit message. When the newly created file is opened in nano we can specify the files and folders we want Git to ignore. For example, you might add the following:

# files
.DS_Store

# directories
temp_data/

The search patterns included here would ensure that the git_tutorial repository does not track any “.DS_Store” files nor any files in the temp_data sub-directory. Slashes “/” control the behaviour of file and directory tracking; ending a pattern with a slash ensures only directories are matched. When a directory is ignored, all of its files and sub-directories are also ignored.

For multiple similar files or folders we can specify wildcard patterns and exceptions. So you might include some further lines in your file to do this:

# wildcard patterns
*.tar.gz
sample_?.fastq

# exceptions
!track_me.tar.gz

The “*” symbol matches zero or more characters (anything except a slash), so the pattern included here stipulates that any files of the “.tar.gz” type will be ignored. Similarly, the “?” symbol matches exactly one character, so a file such as “sample_1.fastq” or “sample_A.fastq” will be ignored, whereas “sample_10.fastq” will not, because two characters sit where the “?” is. The “!” symbol allows us to specify exceptions to the rules defined above, so the file “track_me.tar.gz” would still be tracked, despite the repository ignoring all other files ending in “.tar.gz”. Order matters here: an exception must come after the pattern it overrides.
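
If you want to check how these patterns behave before relying on them, the sketch below builds a throwaway repository and asks Git which files it would ignore. The file names are made up for illustration, and git ls-files with these flags is just one way of listing ignored files; treat it as a quick experiment rather than a required step.

```shell
# Build a throwaway repository so nothing in a real project is touched
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# The same patterns as in the .gitignore described above
cat > .gitignore <<'EOF'
.DS_Store
temp_data/
*.tar.gz
sample_?.fastq
!track_me.tar.gz
EOF

# Hypothetical files to test the patterns against
mkdir temp_data
touch .DS_Store temp_data/scratch.csv archive.tar.gz track_me.tar.gz \
      sample_1.fastq sample_10.fastq

# List the untracked files that Git is ignoring
ignored=$(git ls-files --others --ignored --exclude-standard)
echo "$ignored"
```

The listing should include “.DS_Store”, “archive.tar.gz”, “sample_1.fastq” and the file inside temp_data/, but neither “track_me.tar.gz” (the exception) nor “sample_10.fastq” (two characters where “?” allows one). To find out which pattern is responsible for ignoring a particular file, git check-ignore -v followed by the file name will tell you.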

For more in-depth coverage of these topics, you can consult the gitignore documentation.

Once you have finished adding the contents of your “.gitignore” file, hit “Ctrl + O” to write out the changes, hit “Return” to confirm the file name, and then “Ctrl + X” to close the nano editor. You can then either stage and commit this file now or once you have made further changes to the files in the repository.

Viewing Repository History

Once you have been using Git to track, stage and commit changes for a little while, you might want to view the repository history. To do this you can run the command git log. Log entries are listed in reverse chronological order, with the most recent shown first:

commit ee4476a0fde3b9e5df5d95946b0771b9e205be81
Author: lquayle88 <drlquayle@gmail.com>
Date:   Wed Apr 6 07:00:00 2022 +0100

    added example.txt

commit 679eze7cfb3f558178b20a6c4e9c6150ea8a7b85
Author: lquayle88 <drlquayle@gmail.com>
Date:   Wed Mar 15 23:58:00 2022 +0000

    maiden commit

The first line of each commit displays a unique alphanumeric identifier for the commit called a hash. More on these shortly. The remaining lines indicate who made the change, when they made the change, and what commit message they wrote.

If you have several commits and the log is relatively large, Git will automatically use a pager to show one screen of output at a time. If this is the case, you can use the arrow keys to scroll or hit “space” to move down an entire page. Pressing “q” at any time will exit back to the command line.

The entire log for a project can be overwhelmingly large, especially when you’ve been working on a project for a long time and have tens or hundreds of commits. Thankfully there are a couple of options that make viewing history more manageable in such instances:

Inspect only the changes to particular files or directories. This can be achieved using git log path, where path is the path to a specific file or directory. The log for a file shows only the commits that changed that file, while the log for a directory shows every commit that changed any file inside that directory. If a path could be mistaken for a branch name or an option, separate it with a double dash, as in git log -- path.
Restrict the number of log entries output. This can be achieved using the -n flag followed by the number of entries you want. It works for the entire repository using git log -n x, or for a specific file or directory using git log -n x path, where x is the desired number of entries and path is the path to a specific file or directory.
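
To see both options in action, here is a small sketch that creates a throwaway repository with three commits and then filters the log. The file names, commit messages and user details are invented for the example, and the --oneline flag (which prints each commit on a single line) is my addition to keep the output short.

```shell
# Throwaway repository with a local identity so commits work anywhere
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"
git config user.email "user@example.com"

# Three commits touching a file and a directory (all names are hypothetical)
echo "first" > notes.txt
git add notes.txt
git commit --quiet -m "add notes.txt"

mkdir data
echo "a,b" > data/table.csv
git add data/table.csv
git commit --quiet -m "add data/table.csv"

echo "second" >> notes.txt
git add notes.txt
git commit --quiet -m "update notes.txt"

# The two most recent commits in the whole repository
git log -n 2 --oneline

# Only the commits that changed notes.txt
git log --oneline -- notes.txt

# Only the commits that changed something inside data/
git log --oneline -- data/
```

The first command should print two entries, the second should print the two commits that touched notes.txt, and the third should print the single commit that added the file in data/.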

Viewing Details for a Specific Commit

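A commit identifier, or hash, is a unique 40-character hexadecimal string. Git generates it by running the contents of the commit (the snapshot of the files, the author, date and message, and the hash of the parent commit) through a hashing algorithm, which by default is SHA-1. Although a hash looks random, it is entirely deterministic: the same input always produces the same hash.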

Hashes are what enable Git to share data efficiently between repositories. If two files are the same, their hashes are guaranteed to be the same. Similarly, if two commits contain the same files, carry the same metadata (such as author, date and message) and have identical ancestry, their hashes will also be identical. Git can therefore tell what information needs to be saved where by comparing hashes rather than entire files.
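
You can watch this happen with git hash-object, which prints the hash Git would assign to a file’s contents. The sketch below uses three made-up files in a throwaway repository; it only illustrates file hashes, not commit hashes.

```shell
# Throwaway repository to experiment in
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Two hypothetical files with identical contents and one that differs
printf 'a single line of example text\n' > one.txt
printf 'a single line of example text\n' > two.txt
printf 'a different line of text\n' > three.txt

# Ask Git for the hash of each file's contents
hash_one=$(git hash-object one.txt)
hash_two=$(git hash-object two.txt)
hash_three=$(git hash-object three.txt)

echo "one.txt   $hash_one"
echo "two.txt   $hash_two"
echo "three.txt $hash_three"
```

The first two hashes should be identical even though the files have different names, because only the contents are hashed, while the third should differ.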

In order to view the details of a specific commit, the git show command can be used, passing the first 6-8 characters of a commit’s hash as an argument; any prefix will do as long as it is unique within the repository. For example, running git show ee4476a0 in my git_tutorial repository produces this:

commit ee4476a0fde3b9e5df5d95946b0771b9e205be81
Author: lquayle88 <drlquayle@gmail.com>
Date:   Wed Apr 6 07:00:00 2022 +0100

    added example.txt

diff --git a/example.txt b/example.txt
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/example.txt
@@ -0,0 +1 @@
+a single line of example text

The first part is the same as the log entry shown by git log. The second part shows the actual changes that were recorded during the commit. In this case the file “example.txt” is newly created and the only change to this file was the addition of “a single line of example text” on line 1. We will cover these records in more detail shortly.

While the hash for a commit unambiguously identifies a specific commit like an absolute path, a second way to view the details of a recent commit exists, which is akin to a relative path. The special name HEAD refers to the most recent commit on the branch you currently have checked out, so we can view it using git show HEAD. If I ran this command in my git_tutorial repository this would produce the same output as shown above for git show ee4476a0 i.e. the last commit made in this repository.

Commits further back can be reached using the HEAD~n notation, where n is the number of steps to take back from the most recent commit, so HEAD~1 is the commit before the most recent, HEAD~2 is the one before that, and so on. Note that the symbol between HEAD and n is a tilde “~”. Running git show HEAD~1 in my git_tutorial repository would show details of the commit before the most recent i.e. the first commit, made immediately after initialising the repository and creating the “.gitignore” file:

commit 679eze7cfb3f558178b20a6c4e9c6150ea8a7b85
Author: lquayle88 <drlquayle@gmail.com>
Date:   Wed Mar 15 23:58:00 2022 +0000

    maiden commit

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..948ec8f
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,12 @@
+# files
+.DS_Store
+
+# directories
+temp_data/
+
+# wildcard patterns
+*.tar.gz
+sample_?.fastq
+
+# exceptions
+!track_me.tar.gz
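
If you would like to try both ways of naming a commit yourself, the following sketch recreates a two-commit repository similar to mine and views the commits by hash and by relative name. The user details are placeholders, and git rev-parse --short is simply a convenient way to look up an abbreviated hash without copying it from the log.

```shell
# Throwaway repository with a local identity so commits work anywhere
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"
git config user.email "user@example.com"

# Two commits, loosely mirroring the git_tutorial example
echo ".DS_Store" > .gitignore
git add .gitignore
git commit --quiet -m "maiden commit"

echo "a single line of example text" > example.txt
git add example.txt
git commit --quiet -m "added example.txt"

# Look up the abbreviated hash of the most recent commit
short_hash=$(git rev-parse --short HEAD)

# These two commands name the same commit, so they print the same details
git show "$short_hash"
git show HEAD

# One step back from HEAD is the first commit in this repository
git show HEAD~1
```

The hashes you see will differ from mine because they depend on the author and date as well as the files.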

Now we know how to view the details of specific commits but what about comparing changes between commits?

Viewing git diff..erences

For the purpose of comparing the difference between files and commits, we can use…🥁…git diff. It is easily one of the most useful and most used Git commands, and we will be seeing more of git diff in a later post where we will cover branching and merging projects in Git.

In order to view changes made between two commits using their hashes we would run the command git diff hash_1..hash_2, where hash_1 and hash_2 identify the two commits you’re interested in; these might be several commits apart. Similarly, we could run git diff HEAD~3..HEAD~1 to show the differences between the commit three steps back from the most recent and the commit one step back. Note that the two hashes or HEAD labels are joined by a pair of dots with no spaces; for git diff, separating them with a single space instead gives the same result.
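
As a quick illustration, the sketch below makes four small commits in a throwaway repository and then compares two of them, first by relative name and then by abbreviated hash. The file name, commit messages and user details are all invented for the example.

```shell
# Throwaway repository with a local identity so commits work anywhere
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"
git config user.email "user@example.com"

# Four commits, each appending one line to a hypothetical file
for n in 1 2 3 4; do
    echo "line $n" >> example.txt
    git add example.txt
    git commit --quiet -m "add line $n"
done

# Changes between the commit three steps back and the commit one step back
git diff HEAD~3..HEAD~1

# The same comparison using abbreviated hashes instead of relative names
old=$(git rev-parse --short HEAD~3)
new=$(git rev-parse --short HEAD~1)
git diff "$old..$new"
```

Both commands should report the same thing: “line 2” and “line 3” were added between those two commits.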


The git diff command also has some other useful functions that allow you to compare files, directories or repository states. Running git diff on its own within a repository shows the changes to your working files that have not yet been staged. You can limit this to a specific file or the files in a directory by running git diff path, where path is the path to a specific file or directory. In order to compare your working files with the most recent commit, so that staged changes are included as well, we can use git diff -r HEAD, and we can restrict the results to a file or directory using git diff -r HEAD path where, as before, path is the path to a specific file or directory. The -r flag is optional here, so git diff HEAD does the same job. To see only what is sitting in the staging area relative to the last commit, use git diff --staged.
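
The difference between these comparisons is easiest to see when a file has one staged change and one unstaged change at the same time. The sketch below sets that situation up in a throwaway repository; the file contents and user details are placeholders.

```shell
# Throwaway repository with a local identity so commits work anywhere
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"
git config user.email "user@example.com"

echo "a single line of example text" > example.txt
git add example.txt
git commit --quiet -m "added example.txt"

# Stage one change, then make a second change without staging it
echo "a staged line" >> example.txt
git add example.txt
echo "an unstaged line" >> example.txt

# Working files vs staging area: only the unstaged line is an addition
git diff

# Staging area vs last commit: only the staged line is an addition
git diff --staged

# Working files vs last commit: both lines are additions
git diff HEAD
```

Running the three commands one after another should make it clear which comparison each one performs.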

The last thing we are going to cover today is a more detailed look at the output produced upon running any of the variants of a call to the git diff command, which is a formatted display of the differences between two sets of files:

diff --git a/example.txt b/example.txt
index 21bce37..bbe90e3 100644
--- a/example.txt
+++ b/example.txt
@@ -1 +1,3 @@
-a single line of example text
+a new single line of example text
+
+a second line of example text

This example output shows:

The command used to produce the output (i.e. diff --git) and the file placeholders “a” and “b”, meaning “the first version” and “the second version”.
An index line showing keys into Git’s internal database of changes, followed by the file’s mode; no need to worry about these.
The two versions of the file being compared and their order, where “---” marks the version whose lines are being removed and “+++” marks the version whose lines are being added.
A line starting with “@@” that tells where the changes were made. Each pair of numbers is a start line and a number of lines, first for the old version (prefixed with “-”) and then for the new version (prefixed with “+”); a lone number means a single line. This output indicates changes starting at line 1, with 3 lines where there was once 1.
A line-by-line listing of the changes with deletions and additions indicated by “-” and “+”, respectively. Lines that haven’t changed in the indicated range have neither symbol in front of them.
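
To reproduce an output like the one above and read it for yourself, the sketch below commits a one-line file and then replaces that line with three. It assumes a throwaway repository and placeholder user details; the hashes on your index line will not match the ones in my example.

```shell
# Throwaway repository with a local identity so commits work anywhere
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"
git config user.email "user@example.com"

# Commit a file containing a single line
printf 'a single line of example text\n' > example.txt
git add example.txt
git commit --quiet -m "added example.txt"

# Replace the single line with three lines (the middle one is blank)
printf 'a new single line of example text\n\na second line of example text\n' > example.txt

# Compare the working file with the last committed version
git diff example.txt
```

The line starting with “@@” should read @@ -1 +1,3 @@, followed by one line prefixed with “-” and three prefixed with “+”.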

git --review

In this post we have covered how to instruct a repository to ignore certain files or directories using a “.gitignore” file, how to view repository history and retrieve commit hashes using git log, and how to view details of commits and changes made to files over time using git show and git diff.

The next post in this series will cover how to undo changes to files, staged changes and commits. See you then.

. . . . .

Thanks for reading. I hope you enjoyed the article and that it helps you to get a job done more quickly or inspires you to further your data science journey. Please do let me know if there’s anything you want me to cover in future posts.

Happy Data Analysis!

