Git for Scientists: A Tutorial

John McDonnell, July 12, 2012. [twitter] [github]

Octocat, Github's mascot.

What is this?
This tutorial is targeted at users who are new to version control systems, or just new to git. It goes over the basics of how to save and share your work in git, and gives a conceptual explanation of branching. This is an adaptation of the slides I used to present a basic tutorial on git to the NYUCCL.
I'm a solo researcher, why would I use git?
The short answer is that it gives you a powerful toolkit to back up your work, keep it organized, and collaborate with others. There's a lively and informative discussion of git's merits for researchers on Stack Overflow here and here.
Sounds pretty basic, do you go in depth at all?
More advanced users may be most interested in the section on branching, in which I explain how branching works conceptually and how it can help scientists manage their research projects.
How should I use this tutorial?
It's best to follow along with the commands, so you can see how stuff works in practice. I've also made them available in this script. You'll need to install git on your system, so check here for instructions. (Update: Or follow along online, using tryGit!) The commands here will work on Macs and Linux without modification. Windows doesn't have a "/tmp" directory, so just use a different directory if you're on Windows.

Table of Contents

An Introduction

Why use git?

  1. Backups
  2. Collaboration
  3. Organization

What is git?

threecommits

Let's try.

Here, we set up a new git repo, add a file, and commit it to the repo.


    mkdir /tmp/myfirstrepo
    cd /tmp/myfirstrepo
    echo "This is my first file." > myfile.txt
    git init
    ls .git # A look at what we've done.
    git add myfile.txt # New files must be added before they can be committed.
    git commit -m 'Initial commit.'
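
To check that a first commit really recorded the file, you can ask git what it is tracking and what the latest commit says. This is a minimal sketch, assuming the file was staged with "git add" before committing; the scratch directory and the "Demo User" identity are placeholders, not part of the tutorial's repo.

```shell
# Sketch: confirm that a first commit recorded the file.
repo=$(mktemp -d)                         # hypothetical scratch directory
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "This is my first file." > myfile.txt
git add myfile.txt                        # start tracking the file
git commit -q -m 'Initial commit.'

tracked=$(git ls-files)                   # files git is tracking
subject=$(git log -1 --format=%s)         # subject line of the newest commit
echo "tracked: $tracked"
echo "last commit: $subject"
```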
    

File lifecycle.

file lifecycle

Files go through the following stages as you work on your code:

Untracked
Git doesn't know about these files
Tracked, unmodified
Git knows about them, and there have been no changes.
Tracked, modified
File is in git, but it has changes that haven't been staged yet.
Staged
Changes to this file will be committed next time you commit.
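
One way to watch a file move through these stages is to check the two-letter codes printed by "git status --porcelain" after each step. This is a sketch in a throwaway repo; the file name "notes.txt" and the "Demo User" identity are made up for illustration.

```shell
# Sketch: one file moving through the lifecycle stages.
repo=$(mktemp -d)                         # hypothetical scratch directory
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "draft" > notes.txt
s_untracked=$(git status --porcelain)     # "?? notes.txt" : untracked

git add notes.txt
s_staged=$(git status --porcelain)        # "A  notes.txt" : staged (newly added)

git commit -q -m "Add notes"
s_clean=$(git status --porcelain)         # empty          : tracked, unmodified

echo "more" >> notes.txt
s_modified=$(git status --porcelain)      # " M notes.txt" : tracked, modified

printf '%s\n' "$s_untracked" "$s_staged" "[$s_clean]" "$s_modified"
```

The left column of the code describes the staging area and the right column describes the working directory, which is why a staged file and a modified file look different.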

Staging, illustrated.

Let's go through an example of this staging business. Here we'll create a new file. All files start out untracked. When we add it with "git add," it becomes staged, ready to be committed.

echo "This is my second file." > myfile2.txt
git status # UNTRACKED
git add myfile2.txt
git status  # STAGED 

If we modify the file, our new changes aren't staged yet.

echo "A change to this file." >> myfile2.txt # MODIFIED
git status # STAGED *and* MODIFIED 
git diff # Diff between unstaged vs staged changes
git diff --cached # Diff between staged changes vs. commit 

A commit at this point will commit the file as it was when we added it, without the later changes.

git commit -m "Any guess what gets committed here?"
git status 
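
If you want to confirm the answer yourself, "git show HEAD:<file>" prints a file exactly as the last commit stored it. The sketch below replays the steps in a throwaway repo (the scratch directory and identity are placeholders) and compares the committed version with what is left over.

```shell
# Sketch: a commit records the staged version, not later edits.
repo=$(mktemp -d)                         # hypothetical scratch directory
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "This is my second file." > myfile2.txt
git add myfile2.txt                               # stage the one-line version
echo "A change to this file." >> myfile2.txt      # edit again, without staging
git commit -q -m "Any guess what gets committed here?"

committed=$(git show HEAD:myfile2.txt)    # the file as stored in the commit
leftover=$(git status --porcelain)        # the later edit is still unstaged

echo "committed: $committed"
echo "status:    $leftover"
```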

Undo!

Oops! We made a mistake there, by not adding our new changes into the commit. Fortunately for us, git can rewrite history! If we commit those changes with the "--amend" option, we're actually changing the last commit. We can also change its commit message.

echo "Forgot this file!" > forgot.txt
git add forgot.txt
git commit --amend -m "Here's a replacement commit message." 
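
To see that "--amend" replaces the last commit rather than adding a new one, compare the commit ID and the commit count before and after. This is a sketch in a throwaway repo with made-up file names. Because the commit ID changes, amending is safest for commits you haven't pushed anywhere yet.

```shell
# Sketch: --amend swaps out the last commit instead of adding another.
repo=$(mktemp -d)                         # hypothetical scratch directory
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "one" > a.txt; git add a.txt; git commit -q -m "First commit"
echo "two" > b.txt; git add b.txt; git commit -q -m "Second commit"
before=$(git rev-parse HEAD)              # ID of the commit we are about to replace

echo "Forgot this file!" > forgot.txt
git add forgot.txt
git commit -q --amend -m "Here's a replacement commit message."

after=$(git rev-parse HEAD)               # a different ID: history was rewritten
count=$(git rev-list --count HEAD)        # still two commits in total
files=$(git ls-tree --name-only HEAD)     # the amended commit now includes forgot.txt

echo "commits: $count"
echo "$files"
```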

If we accidentally stage something, we can unstage it with git reset. If we have unstaged changes to a file that we want to discard, we can restore that file with git checkout. Bear in mind, you will lose those changes forever if you do this!

echo "Don't really need this line." >> myfile.txt
git add myfile.txt
git reset HEAD myfile.txt    # Unstage.
git checkout -- myfile.txt   # Discard the change.
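
Newer versions of git (2.23 and later) also offer "git restore", which does the same two jobs under one name. The sketch below assumes such a version is installed and uses a throwaway repo with placeholder contents.

```shell
# Sketch: unstage and discard with git restore (needs git 2.23 or newer).
repo=$(mktemp -d)                         # hypothetical scratch directory
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "This is my first file." > myfile.txt
git add myfile.txt
git commit -q -m "Initial commit."

echo "Don't really need this line." >> myfile.txt
git add myfile.txt                        # oops, staged by accident

git restore --staged myfile.txt           # unstage; the edit stays in the file
after_unstage=$(git status --porcelain)   # " M myfile.txt"

git restore myfile.txt                    # discard the edit for good
after_discard=$(git status --porcelain)   # empty
content=$(cat myfile.txt)

echo "after unstage: $after_unstage"
echo "file now:      $content"
```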

You just learned...

Remotes / Github

Remotes

codercat

Creating a github repo

Just go to github's website, make an account, and follow the instructions to create a new repo! Then in your practice repo, type out the following commands:

git remote add origin git@github.com:johnmcdonnell/demo.git
git push -u origin master 

Cloning a github repo

This is the process you would go through to download the repository on a new computer.

cd /tmp  # Using the '/tmp' directory to keep your computer clean.
git clone git@github.com:johnmcdonnell/demo.git myclone 

Making changes and sending them back.

Hopefully committing changes is old hat now. The 'push' command is what sends your data back to github.

cd /tmp/myclone
echo "Changes to our file, in the clone." >> myfile.txt

git commit -a -m "Remote changes made."
git push origin master 

Getting our changes, with a twist.

Now we're going to make our own changes before getting the changes someone else made from github. This will result in a conflict! No big deal: you just fix the file in question, then add and commit. The "-a" in "git commit -a" just means, "automatically add every tracked file I've made changes to."

cd /tmp/myfirstrepo
echo "Changes to our file, in our own repo." >> myfile.txt
git commit -a -m "Local changes made."

git pull origin master

$EDITOR myfile.txt # fix the conflict.
git commit -a

git push origin master 
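
If you'd like to rehearse this without touching github, a local bare repository can stand in for the remote. The sketch below is one way to set that up; the directory names "hub.git", "myrepo" and "myclone" and the "Demo User" identity are placeholders. The "--no-rebase" flag is an addition: recent versions of git may otherwise ask how to reconcile the two diverging histories before merging.

```shell
# Sketch: reproduce the pull conflict with a local bare repo in place of github.
work=$(mktemp -d)                         # hypothetical scratch directory
cd "$work"
git init -q --bare hub.git
git --git-dir=hub.git symbolic-ref HEAD refs/heads/master

# First working copy: make the initial commit and publish it.
git clone -q hub.git myrepo 2>/dev/null   # hides the "empty repository" warning
cd myrepo
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is master
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo "This is my first file." > myfile.txt
git add myfile.txt
git commit -q -m "Initial commit."
git push -q -u origin master

# Second working copy: change the file and push.
cd "$work"
git clone -q hub.git myclone
cd myclone
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "Changes to our file, in the clone." >> myfile.txt
git commit -q -a -m "Remote changes made."
git push -q origin master

# Back in the first copy: make a competing change, then pull.
cd "$work/myrepo"
echo "Changes to our file, in our own repo." >> myfile.txt
git commit -q -a -m "Local changes made."
git pull -q --no-rebase origin master     # reports a conflict in myfile.txt
markers=$(grep -c '^<<<<<<<' myfile.txt)  # git marked the conflicting region

# Fix the conflict by keeping both lines, then commit the merge and push.
printf '%s\n' "This is my first file." \
  "Changes to our file, in our own repo." \
  "Changes to our file, in the clone." > myfile.txt
git commit -q -a -m "Merge remote changes."
git push -q origin master
echo "conflict regions found: $markers"
```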

You just learned...

Branches

treebranch

What is a branch?

What is a commit, really?

Commit Anatomy

In essence, a commit is a snapshot of the files stored in a tree, with some other metadata as well, such as the author and date.

Repo as collection of commits

Three commits

One important piece of metadata is the "parent," which is the preceding commit that was changed to make the present commit.
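
You can look inside a commit with "git cat-file -p", which prints the tree it points to, its parent, and the author information. A minimal sketch in a throwaway repo, with a made-up file name and identity:

```shell
# Sketch: inspect a commit's tree, parent, and author.
repo=$(mktemp -d)                         # hypothetical scratch directory
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "v1" > data.txt
git add data.txt
git commit -q -m "First snapshot"
first=$(git rev-parse HEAD)               # ID of the first commit

echo "v2" >> data.txt
git commit -q -a -m "Second snapshot"

git cat-file -p HEAD                      # prints tree, parent, author, committer, message
parent=$(git rev-parse HEAD^)             # HEAD^ means "the parent of HEAD"
```

The very first commit in a repo has no "parent" line at all, since nothing came before it.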

A branch is a pointer to a commit.

Master branch

The default branch is "master." If you don't ever touch branches, master will always point to the most recent commit.

The files in our working directory reflect HEAD

HEAD

HEAD can be thought of as a designator for the "active" branch. By default, HEAD points to master. In this diagram, master points to the snapshot "f30ab".

Two branches can point to the same commit.

New branch

If you type the following:

git branch testing

You will create a new branch pointed at the same commit that the HEAD branch is pointed to (here, master).

Move HEAD with checkout

HEAD_testing

When you use the "checkout" command on a branch, as here:

git checkout testing

...you move HEAD to point to that branch. If that branch is on a different commit than your current HEAD branch, this also changes all the files in your directory to reflect that branch's snapshot.

New commits move the HEAD branch

Branch Commit

git commit...
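
The pointer picture can be checked directly: "git rev-parse" prints the commit a branch points to, and "git symbolic-ref" prints the branch HEAD points to. This sketch uses a throwaway repo; the file name and identity are placeholders, and the first branch is forced to be called master to match the diagrams.

```shell
# Sketch: branches are pointers, and only the HEAD branch moves on commit.
repo=$(mktemp -d)                         # hypothetical scratch directory
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is master
git config user.name "Demo User"          # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "start" > file.txt
git add file.txt
git commit -q -m "Base commit"

git branch testing                        # new pointer at the same commit
same_before=no
[ "$(git rev-parse master)" = "$(git rev-parse testing)" ] && same_before=yes

git checkout -q testing                   # move HEAD to the testing branch
head_branch=$(git symbolic-ref --short HEAD)

echo "more" >> file.txt
git commit -q -a -m "Work on testing"     # only testing moves forward

master_tip=$(git rev-parse master)
testing_tip=$(git rev-parse testing)
echo "HEAD -> $head_branch"
echo "master:  $master_tip"
echo "testing: $testing_tip"
```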

A merge scenario

Merge Scenario

If we make changes to branch "iss53", then also make changes to master like so:

git checkout master
git commit...

We may want to merge these changes together, pulling the changes we made in iss53 into our master branch.

A merge scenario

Merge Aftermath

To do this, just check out "master" then type:

git merge iss53

If there were no conflicts, this should merge "cleanly." The ability to branch and merge so effortlessly is a key feature of git!
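
Here's a small sketch of that scenario in a throwaway repo. The file names and identity are invented for the example, and "--no-edit" is added so the merge accepts the default message instead of opening an editor. When both branches have new commits, the merge creates a commit with two parents.

```shell
# Sketch: merge iss53 into master when both branches have moved on.
repo=$(mktemp -d)                         # hypothetical scratch directory
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is master
git config user.name "Demo User"          # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "Base commit"

git checkout -q -b iss53                  # create the branch and switch to it
echo "fix for issue 53" > fix.txt
git add fix.txt
git commit -q -m "Work on iss53"

git checkout -q master
echo "other work" > other.txt
git add other.txt
git commit -q -m "Work on master"

git merge -q --no-edit iss53              # touches different files, so it merges cleanly
parents=$(git cat-file -p HEAD | grep -c '^parent ')
echo "merge commit has $parents parents"
ls
```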

Using branches and tags in a scientific context

The following is an example of how a psychologist might want to use these features in coding an experiment.

Imagine we're working on code for a psychology experiment. We've just run a pilot experiment, and now we want to deploy two more experiments based on that first experiment, each manipulating a different variable.

First we should tag our existing experiment so we can find it again later.

git tag -a pilot -m "Our initial pilot experiment" 

Then we make new branches, and add our new manipulations to each:

git branch exp1
git branch exp2

git checkout exp1
echo "Exp 1 makes manipulation A" >> myfile.txt
git commit -a -m "Added manipulation A"

git checkout exp2
echo "Exp 2 makes manipulation B" >> myfile.txt
git commit -a -m "Added manipulation B" 

Now imagine we realize there's a bug in our code! With git, we can fix the bug just once, and apply it to both of our experiments. First we fix it in master:

git checkout master
$EDITOR myfile.txt  # Make an important bugfix
git commit -a -m "Important bugfix" 

...and then check out each of our new experiment branches and merge in the new changes in master:

git checkout exp1
git merge master

git checkout exp2
git merge master
git diff master # the only difference now is our exp2 manipulation.  
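
To double-check that the bugfix reached both experiments, "git merge-base --is-ancestor" succeeds when one commit is contained in another branch's history. The sketch below replays the workflow in a throwaway repo. It departs from the example above in one way: the bugfix and the manipulations live in separate, made-up files ("code.txt" and "design.txt") so the merges can't conflict.

```shell
# Sketch: fix a bug once on master, merge it into both experiment branches.
repo=$(mktemp -d)                         # hypothetical scratch directory
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is master
git config user.name "Demo User"          # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "run trial" > code.txt
echo "baseline design" > design.txt
git add code.txt design.txt
git commit -q -m "Pilot experiment"
git branch exp1
git branch exp2

git checkout -q exp1
echo "Exp 1 makes manipulation A" >> design.txt
git commit -q -a -m "Added manipulation A"

git checkout -q exp2
echo "Exp 2 makes manipulation B" >> design.txt
git commit -q -a -m "Added manipulation B"

git checkout -q master
echo "run trial (fixed)" > code.txt
git commit -q -a -m "Important bugfix"

for b in exp1 exp2; do
  git checkout -q "$b"
  git merge -q --no-edit master           # bring the bugfix into this experiment
done

# Exit status 0 means everything on master is part of the branch's history.
git merge-base --is-ancestor master exp1 && in_exp1=yes || in_exp1=no
git merge-base --is-ancestor master exp2 && in_exp2=yes || in_exp2=no
changed=$(git diff --name-only master exp2)   # only the manipulation differs
echo "bugfix in exp1: $in_exp1, in exp2: $in_exp2, exp2 differs in: $changed"
```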

Maybe at this point we want to take a look at our pilot again and see if the bug caused problems there. Thanks to our tag, taking a second look at the pilot is easy:

git show pilot       # if we just want to see the note.
git checkout pilot   # to actually check out the snapshot.
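
One thing to know: checking out a tag leaves you in what git calls a "detached HEAD" state, where HEAD points straight at a commit instead of a branch. That's fine for looking around; just check out a branch again before doing new work. A sketch in a throwaway repo with placeholder contents:

```shell
# Sketch: visit a tagged snapshot, then return to master.
repo=$(mktemp -d)                         # hypothetical scratch directory
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is master
git config user.name "Demo User"          # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "pilot code" > myfile.txt
git add myfile.txt
git commit -q -m "Pilot experiment"
git tag -a pilot -m "Our initial pilot experiment"

echo "later work" >> myfile.txt
git commit -q -a -m "Work after the pilot"

git checkout -q pilot                     # files now match the pilot snapshot
if git symbolic-ref -q HEAD >/dev/null; then state=attached; else state=detached; fi
at=$(git rev-parse HEAD)
pilot_commit=$(git rev-parse 'pilot^{commit}')   # the commit the tag points to
snapshot=$(cat myfile.txt)

git checkout -q master                    # back to the branch
back=$(git symbolic-ref --short HEAD)
echo "while on the tag: $state; now on: $back"
```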

I hope this example lets researchers see how git can be useful for their own projects.

You just learned...

Conclusion

Resources

Tutorials

Software

Configuration

You may want to use the following commands to configure git once you install it:

Turn on color
git config --global color.ui "auto"
Your signature
git config --global user.name "Your Name"
git config --global user.email "your.email@nyu.edu"
Your editor
git config --global core.editor "mate -w"
Password caching for https
git config --global credential.helper cache
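
Settings can also be made for a single repo by leaving off "--global"; a repo-level value takes priority over the global one inside that repo. A small sketch, using a throwaway repo and a made-up name:

```shell
# Sketch: a per-repo setting, stored in that repo's .git/config.
repo=$(mktemp -d)                         # hypothetical scratch directory
cd "$repo"
git init -q

git config user.name "Project Pseudonym"  # placeholder; applies to this repo only
effective=$(git config --get user.name)           # the value git will use here
local_only=$(git config --local --get user.name)  # the value stored in this repo
echo "name used in this repo: $effective"
```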