Introduction to version control and Git

Radovan Bast

Licensed under CC BY 4.0. Code examples: OSI-approved MIT license.

What we will learn in this lesson

Why version control
Why Git
Configuring Git
Basic Git workflow
Linear and local development
Using the staging area
Undoing things

Why version control and why Git?

Version control

Very often we work on projects where versions are useful or important
- Programs
- Scripts
- Configuration files
- Websites
- Manuscripts
Have you ever made a backup copy of source files before modifying them?

$ cp complicated.py complicated.py.backup
$ vi complicated.py

Have you ever wished you had done a backup copy before you completely broke a source code and had to start over?
Programming is an iterative process
When you code typically there comes a point where the code gets broken
It would be nice to be able to go back

How many of you have used the "undo" button when you edit a document?

Version control

A version control system (VCS; or revision control) is a framework that tracks the versions of a project

[1] --> [2] --> [3] --> [4] --> [5] --> [6] --> [7] --> [8] --> [9] --> ...

Like save points in computer games
We archive the current state
We keep all past revision
We can go back to arbitrary revision
We can compare revisions
Compare with Apple Time Machine
The simplest VCS would be to copy the entire project somewhere and give it a revision number/name

Example: Poor man version control

MAG-DKS2-RI_CP_10.8.07.tgz        ReSpect-AFDZ-1.2.4_18.3.07.tgz
MAG-DKS2-RI_CP_17.5.07.tgz        ReSpect-AFDZ-1.2.4_27.7.07.tgz
MAG-DKS2-RI_CP_23.8.07_final.tgz  ReSpect-AFDZ-1.2.4_29.4.08.tgz
MAG-DKS2-RI_CP_24.5.07.tgz        ReSpect-AFDZ-1.2.4_6.10.07.tgz
MAG-DKS2-RI_CP_25.5.07.tgz        ReSpect-AFDZ-1.2.5_23.4.08.tgz
MAG-DKS2-RI_CP_29.5.07.tgz        ReSpect-AFDZ-1.2.5_25.5.07.tgz
MAG-DKS2-RI_CP_30.5.07.tgz        ReSpect-AFDZ-1.2.5_6.6.07.tgz
MAG-DKS2-RI_CP_6.10.07.tgz        ReSpect-AFDZ-1.2.5_bexC.tgz
MAG-DKS2-RI_CP_6.6.07.tgz         ReSpect-AFDZ-1.2.5_D0.tgz
MAG-DKS2-RI_CP_8.6.07.tgz         ReSpect-AFDZ-1.3.0_4.4.08.tgz
MAG-DKS2-RI_KT.tgz                ReSpect-AFDZ-1.3.1_4.4.08.tgz
MAG-DKS2-RI_PI1_2007.tgz          ReSpect-AFDZ-1.3.2_22.4.08.tgz
MAG-DKS2-RI_PI_2007.tgz           ReSpect-AFDZ-1.3.2_4.4.08.tgz
MAG-DKS2-RI_PI2_2007.tgz          ReSpect-AFDZ-1.3.2_5.4.08.tgz
MAG-DKS2-RI_PI_CP_18.3.07.tgz     ReSpect-AFDZ-1.3.3_1.5.08.tgz
MAG-mDKS_11.5.08.tgz              ReSpect-AFDZ-1.3.3_20.5.08.tgz
MAG-mDKS_15.4.08.tgz              ReSpect-AFDZ-1.3.3_TSTrm_27.6.08.tgz
MAG-mDKS_17.6.09_unfinished.tgz   ReSpect-AFDZ-1.3.3_WK_10.8.08.tgz
MAG-mDKS_19.7.09.tgz              ReSpect-AFDZ-1.3.3_WK_11.8.08.tgz
MAG-mDKS-20.7.09.tgz              ReSpect-AFDZ-1.3.3_WK_13.8.08.tgz
...

Merges need to be done manually
Difficult to inspect the project history (e.g.: at which point was a bug introduced?)
Almost impossible to keep track of patches

Motivation for version control: collaboration

Collaborative work

Programs are often developed by many people in parallel
It would be extremely tedious to synchronize this work manually
We write manuscripts with collaborators ("can you please send me the last version?")
Have you ever waited for your collaborator before making changes to a manuscript?
Imagine you are the corresponding author and publish a paper with 20 collaborators
Version controls enables us to work with several people on the same code at the same time
Without the need for manual synchronization
Without the risk of undoing the work of others by accident
For manuscripts we recommend https://www.overleaf.com

Motivation for version control

Sharing code with others

Many of us distribute our programs to users
Git simplifies this process
Easy to share updates and patches
Lowers the barrier for new developers to contribute to our code

Scientific reproducibility

Versions are essential for reproducibility of published computational results
We use scientific code to produce scientific results
These programs evolve, bugs appear and get fixed
It is essential that we can easily access and compare versions of our program

Additional motivation for version control

Version control is also great in a one-(wo)man universe

Often we work on the same thing from different computers/devices
There are many USB sticks that commute between home and office
We can use Git as a "Dropbox"

Version control market

https://www.openhub.net/repositories/compare
Not fully representative but gives an idea

Centralized vs. decentralized vs. distributed

CVS and Subversion are centralized (one server keeps track of versions, working copies are clients)
Git, Mercurial, and Bazaar are distributed (every working copy can keep track of versions)
We will see later why distributed is often better for scientific code development

Why we will choose Git (1/2)

Git is a distributed VCS
But supports any workflow (also legacy centralized modes of operation)
Written by Linus Torvalds
Fast and lightweight
Nearly every operation is done offline on your local disk
You can do commits, diffs, logs, branches, merges, annotation and more entirely offline
Merging development lines is trivial and fun
You get reasonable backup for free (because entire project history is distributed)

Why we will choose Git (2/2)

Prominent companies and projects using Git: Linux Kernel, Google, Facebook, Microsoft, Twitter, Linkedin, Netflix, Perl, PostgreSQL, ALSA, Android, Fedora, GCC, GNU Autotools, GNOME, phpMyAdmin, Ruby on Rails, Samba, VLC, Wine, X11, Yum, ...
It is the version control tool with the most traction
GitHub
"Git is a four-handle, dual boiler espresso machine, not instant coffee".
DAG
Immutable objects

Configuring Git

Before we start working with Git

Before we use Git, let us configure Git on our machine for optimum Git experience
Colorize your life

$ git config --global color.branch auto
$ git config --global color.diff   auto
$ git config --global color.status auto

Identify yourself, set your name and your e-mail (this will show up in the log history)

$ git config --global user.name "Slim Shady"
$ git config --global user.email slim.shady@gmail.com

Set the default mode for git push
Avoids typing git push origin <branch>

$ git config --global push.default current

These settings are stored in ~/.gitconfig

Before we start working with Git

Set your favourite editor

$ git config --global core.editor "vim"

Here are my settings

$ cat ~/.gitconfig

[color]
        branch = auto
        diff = auto
        grep = auto
        status = auto
[user]
        name = Radovan Bast
        email = rbast@users.noreply.github.com
[push]
        default = current
[core]
        editor = vim

We set these only once on a computer and the settings will be global to all Git projects

Basic Git workflow

Git basics

How to initialize a new Git repository
How to add and commit files
How to inspect the project history
How to write useful commit log messages

Exercising a basic Git cycle

Write a haiku and track it with Git (we do this together interactively in the terminal)

On a branch ...
        by Kobayashi Issa

    On a branch
    floating downriver
    a cricket, singing.

Exercising a basic Git cycle

$ git init
$ git add
$ git status
$ git commit -m "commit message"
$ git diff
$ git log
$ git show

Everything in this course we will do with Git to get it into muscle memory

Git basics

We can browse the development and access each state that we have committed
The long hashes uniquely label a state of the code
They are non-incremental (why?)
We will use them when comparing versions and when going back in time
git log --oneline is nice to get an overview
git log --oneline only shows the first 7 characters of the commit hash
If the first characters of the hash are unique it is not necessary to type the entire hash
git log --stat is nice to show which files have been modified (not shown because here we only have one file)

Commit messages

We now understand that the first line of the commit message is very important
Good example

implement Pulay DIIS algorithm

implement Pulay DIIS algorithm to accelerate SCF
convergence and set it as default
this is based on [REF]
this option can be deactivated with
.NODIIS
...

Convention: one line summarizing the commit, then one empty line, then paragraph(s) with more details in free form, if necessary
Not so good example (everything in one long line):

implement Pulay DIIS algorithm to accelerate SCF convergence and set it ...

This is also important for web based repository browsing

Commit messages

Another bad example

rbast:

fixed an important bug for contracted basis sets
...

Other bad commit messages: "fix", "oops", "save work", "foobar", "toto", "qppjdfjd", ""
http://whatthecommit.com
Write commit messages in english that will be understood 15 years from now by someone else than you
Many projects start out as projects "just for me" and end up to be successful projects that are developed by 50 people over decades

Commit messages

It is possible to commit and set commit message at the same time

$ git commit -m "here I have changed this and that"

This does not open any editor and commits directly

Git basics

At any moment we can inspect individual commits

$ git show 49dc419

commit 49dc419c8a44051cfe7826b85ee0a23e5faf3975
Author: Radovan Bast <rbast@users.noreply.github.com>
Date:   Sat Nov 22 15:33:15 2014 +0100

    do not recompute powers

diff --git a/triangle.py b/triangle.py
index cc52fe2..fa35eab 100644
--- a/triangle.py
+++ b/triangle.py
@@ -6,7 +6,10 @@ m = int(sys.argv[1])

 # loop over all a < b < c <= m
 for c in xrange(1, m+1):
+    cp = c*c
     for b in xrange(1, c):
+        bp = b*b
         for a in xrange(1, b):
-            if a*a + b*b == c*c:
+            ap = a*a
+            if ap + bp == cp:
                 print("(%i, %i, %i)" % (a, b, c))

We see that the start of the hash is enough if it is unique

Git basics

Now we know how to save versions

$ git add <file(s)>
$ git commit

And this is what we do as we program
Every state is then saved and later we will learn how to go back to these "checkpoints" and how to undo things
We could live a fulfilled life with the following few Git commands

$ git init       # initialize new repository
$ git add        # add files or stage file(s)
$ git commit     # commit staged file(s)
$ git status     # see what is going on
$ git log        # see history
$ git diff       # show unstaged/uncommitted modifications
$ git show       # show the change for a specific commit
$ git mv         # move tracked files
$ git rm         # remove tracked files

Where is the Git repository?

All the magic is under .git, all the history, all snapshot, all branches, everything
When we staged and committed files, we "copied" them into .git
Here we only track one file but we can track entire file trees
Git does not pollute subdirectories
If we remove .git, we remove the repository (but of course keep the working directory)
It is very easy to create (and remove) a Git repository to track something that you work on
.git uses relative paths (very convenient), you can move the whole thing somewhere else and it will still work

Ignoring files

$ git status

# On branch master
...
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#       a.out

Files that are not supposed to be tracked belong in .gitignore (object files, compiled code, generated files, *.pyc files, generated *.pdf files)
.gitignore understands regular expressions and is valid for all subdirectories
Use .gitignore otherwise git status is flooded with untracked files and useless
People who do not use git status make mistakes (e.g. forget to add files)
You should track .gitignore (do not ignore it)

Ignoring files

Here is an example .gitignore file from another project

$ cat .gitignore

 _build/
 build*/
 *.pyc

It is possible to ignore entire directories

Clean working area

Use git status a lot
Use .gitignore
Untracked files belong to .gitignore
All files should either be tracked or ignored

Using the staging area

Three level system

Before we continue we recall that we have committed changes in two steps
We have used git add and git commit
In Git we work in a 3-level system (for very good reasons as we shall see)
- Working directory (your usual directories and files)
- Staging area (preparation for next commit)
- HEAD (last commit)

$ git add          # working directory --> staging area
$ git commit       # staging area becomes the new commit (HEAD)

It is useful to have a nice and readable history

Bad example

b135ec8 now feature A should work
72d78e7 fix for parallel compilation
bf39f9d bugfix
49dc419 removed too much
45831a5 removing debug print
bddb280 more work on feature B
72e0211 another fix to make it compile
e2073c3 oops! forgot another file
61dd3a3 forgot file
a9f5172 save work on feature A
6fe2f23 save work on feature B

Very often you will be obliged to do archaelogy in your code
Imagine that in few months you discover that feature B was a mistake
It is very difficult to find and revert this in this example

Master should have a nice and readable history

Good example

6f0d49f feature C
fee1807 feature B
6fe2f23 feature A

We want to have nice commits
But we also want to "save often" (checkpointing) - how can we have both?
We will now learn to fabricate nice commits using the staging area

Checkpointing using the staging area

                working    staging     HEAD
command        directory    area         |   english
                   |          |          |

*git add file(s)    |--------->|          |   stage file
*git commit         |          |--------->|   commit staged file(s)
git commit file(s) |-------------------->|   commit file(s) directly

*git diff           |<-------->|          |   between workdir and staged
git diff --cached  |          |<-------->|   between staged and last commit
git diff HEAD      |<------------------->|   between workdir and last commit
git diff           |<------------------->|   if nothing is staged

git reset          |<---------|          |   unstage
git reset --soft   |          |<---------|   "uncommit" and stage
git reset --hard   |<--------------------|   discard

*git checkout       |<---------|          |   undo unstaged modifications
git checkout       |<--------------------|   if nothing is staged

git add every change that improves the code
git checkout every change that made things worse
git commit as soon as you have created a nice self-contained unit (not too large, not too small)
Discuss/think about what is too large or too small

Checkpointing using the staging area

We want to do many small commits (checkpoints)
But at the end we want to commit in one nice commit
With git add we can prepare commits

$ git add file.py                 # checkpoint 1
$ git add file.py                 # checkpoint 2
$ git add another_file.py         # checkpoint 3
$ git add another_file.py         # checkpoint 4
$ git diff another_file.py        # diff w.r.t. checkpoint 4
$ git checkout another_file.py    # oops go back to checkpoint 4
$ git commit                      # commit everything that is staged

git diff gives differences with respect to the staging area, this is very practical
Using git add we can fabricate very nice coherent commits

Staging everything

Sometimes you want to stage all modifications
No need to stage them one by one

$ git add -u

Also removals of tracked files are then automatically staged

Working without the staging area

                working    staging     HEAD
command        directory    area         |   english
                   |          |          |

git add file(s)    |--------->|          |   stage file
git commit         |          |--------->|   commit staged file(s)
*git commit file(s) |-------------------->|   commit file(s) directly

git diff           |<-------->|          |   between workdir and staged
git diff --cached  |          |<-------->|   between staged and last commit
git diff HEAD      |<------------------->|   between workdir and last commit
*git diff           |<------------------->|   if nothing is staged

git reset          |<---------|          |   unstage
git reset --soft   |          |<---------|   "uncommit" and stage
git reset --hard   |<--------------------|   discard

git checkout       |<---------|          |   undo unstaged modifications
*git checkout       |<--------------------|   if nothing is staged

Undoing things

In the following (interactive demo) we will learn how to revert code changes that
- have not yet been staged
- have been staged but not committed
- have been committed

Correcting incomplete commits

Imagine we just committed something but realize that the commit is incomplete
For instance we forgot to add a file
git commit --amend adds staged changes to the previous commit
If nothing is staged, we can use git commit --amend to modify the last commit message (e.g. to fix a horrible typo)
This does not modify the actual commit content but opens up the message editor and lets you change it
git commit --amend replaces the last state with a new commit (and a new hash)
Never use git commit --amend on commits that you have shared with others (more about this later)

"Deleting" commits

In Git it is possible to remove commits

$ git log --oneline

ce373ff another terribly terrible error
87c9a94 a horribly embarrassing mistake
0a31903 make it possible to test Fermat Theorem
bf39f9d break loop a if triple found
49dc419 do not recompute powers
45831a5 read upper limit from stdin
4fc4b95 print Pythagorean triples up to c = 20

Imagine we want to go back to commit 0a31903 and completely remove commits 87c9a94 and ce373ff
This is possible with git reset --hard and git reset --soft

$ git reset --hard 0a31903
$ git log --oneline

0a31903 make it possible to test Fermat Theorem
bf39f9d break loop a if triple found
49dc419 do not recompute powers
45831a5 read upper limit from stdin
4fc4b95 print Pythagorean triples up to c = 20

"Deleting" commits

It is called --hard because it is dangerous, use with caution!
The repository and the working tree is reset to state 0a31903
All uncommitted changes will be lost for good!
It can be also useful if you want to reset your working tree to last committed state (HEAD)

$ git reset --hard HEAD        # DANGEROUS

With git reset --soft you can "delete" commits, but you keep the code changes
git reset --soft puts your deleted commits into the staging area
Never use git reset --soft or --hard on commits that you have shared with others
Doing so would create conflicts for people who base their work on commits that you have deleted (more about it later)
You can always do a git revert (it does not replace old commits, it does not change history)
git revert is the only safe option to undo changes that are shared with others

Summary

We have learned basic Git commands
We have practiced the basic git init; git add; git commit workflow
We have not explored the true power of Git: branches
In the following we will learn how to:
- How to work with branches
- How to work with others
- How to go back in time
- And much more

Backup and cloud

Git is not ideal for large binary files (for this consider http://git-annex.branchable.com/)
To synchronize unversionned large binary files among computers
- Google drive
- Dropbox
- SpiderOak
Free cloud hosting of Git projects
We will learn about GitHub later during the course

Git and Subversion

With git-svn you can use Git commands/workflows together with a Subversion server

Git and CVS

git-cvsserver - A CVS server emulator for Git

Files

git-intro.mkd

Latest commit

History

git-intro.mkd

File metadata and controls

Introduction to version control and Git

Radovan Bast

What we will learn in this lesson

Why version control and why Git?

Version control

How many of you have used the "undo" button when you edit a document?

Version control

Example: Poor man version control

Motivation for version control: collaboration

Motivation for version control: collaboration

Collaborative work

Motivation for version control

Sharing code with others

Scientific reproducibility

Additional motivation for version control

Version control is also great in a one-(wo)man universe

Version control market

Centralized vs. decentralized vs. distributed

Why we will choose Git (1/2)

Why we will choose Git (2/2)

Configuring Git

Before we start working with Git

Before we start working with Git

Basic Git workflow

Git basics

Exercising a basic Git cycle

Exercising a basic Git cycle

Git basics

Commit messages

Commit messages

Commit messages

Git basics

Git basics

Where is the Git repository?

Ignoring files

Ignoring files

Clean working area

Using the staging area

Three level system

It is useful to have a nice and readable history

Master should have a nice and readable history

Checkpointing using the staging area

Checkpointing using the staging area

Staging everything

Working without the staging area

Undoing things

Undoing things

Correcting incomplete commits

"Deleting" commits

"Deleting" commits

Summary

Summary

Backup and cloud

Git and Subversion

Git and CVS