Beyond learning to code: Writing programs


Introduction

This week, we've shown you a large fraction of the core Python language. With enough patience, you could read through most of the Python documentation on your own and write code to do whatever you want. However, just as there's more to being a scientist than learning how to pipette (important though that skill may be), there's more to writing software than learning the syntax of a language. This afternoon, I'll introduce you to a couple of important skills that will serve you no matter what language you ultimately decide to program in.

The Project


Coordinate linkage of HIV evolution reveals regions of immunological vulnerability, V Dahirel, et al., PNAS (2011).

For the first half of next week, we'll be using a paper on HIV evolution to introduce you to several libraries and software packages that you may find helpful going forward.

The basic idea behind the paper borrows ideas from the finance industry. In addition to whole-market movements (something you might be able to get an insight into from looking at something like the Dow Jones or S&P 500 indexes), there are large sectors of the market that tend to track together: manufacturing stocks might be relatively independent from tech stocks, and each of those independent from airline stocks.

Similarly, not all of the residues in a protein evolve independently: some are likely coupled together into sectors, where knowing the amino acid at one site can tell you something about what is likely to be at other sites. One simple scenario is if two residues are partially redundant: mutating either one is fine, but mutating both has serious fitness consequences. In that case, knowing that one site is mutated tells you that the other is less likely to also be mutated than you would expect by chance.
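
To make the redundancy scenario concrete, here is a minimal sketch using made-up data. Nothing in it comes from the paper: the two sites, the ten toy "sequences" and the function name are all invented for illustration. It compares how often both sites are mutated with how often we would expect that if the sites were independent.

```python
# Toy illustration only: the data and the function name are invented for this
# example and are not taken from the paper.

def double_mutant_frequencies(sequences):
    """Return (observed, expected) frequency of both sites being mutated.

    Each sequence is a pair (a_mutated, b_mutated) of booleans. The expected
    value assumes the two sites mutate independently of each other.
    """
    n = float(len(sequences))
    freq_a = sum(1 for a, b in sequences if a) / n
    freq_b = sum(1 for a, b in sequences if b) / n
    observed = sum(1 for a, b in sequences if a and b) / n
    expected = freq_a * freq_b
    return observed, expected


# Ten made-up sequences: site A is mutated in 4, site B in 4, both in none.
toy_sequences = (
    [(True, False)] * 4 +
    [(False, True)] * 4 +
    [(False, False)] * 2
)

observed, expected = double_mutant_frequencies(toy_sequences)
print("observed double mutants: %.2f" % observed)
print("expected if independent: %.2f" % expected)
```

In this toy data set the double mutant never appears, even though independence would predict it in about 16% of sequences. That gap is the kind of signal that suggests two sites are coupled.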


Homework


Over the weekend, we'd like you to read the paper for the project (if you haven't already) and stub out the pieces of a larger program that will generate the data for the figures in the paper, given a collection of HIV sequences. You don't need to worry about actually writing the code itself, but try to come up with a relatively detailed plan for what you'll need to do to the data. Then, on Monday morning we'll discuss your designs and then start showing you the tools you'll need to actually implement it.

Source code control

Good record-keeping is of the utmost importance in science, and it turns out to be really, really helpful in programming too. Sometimes, the "improvements" you make to a large piece of software actually break something that you weren't thinking about, so it's nice to have a record of what you did when, and easily be able to go back to previous versions. Alternatively, you could have multiple different versions of the same piece of software floating around lab, and if you're not careful, it's easy to make one set of modifications to one version, and a different set of modifications to another version.

This problem isn't specific to scientists, and the software engineering community has come up with a number of tools to help keep track of the changes that get made to source code. These are called Version Control Systems, and today I'll give you a brief overview of one, called Git.

git init

The first thing you'll want to do is initialize a new repository. A repository is just the term for a collection of files that Git will keep track of. From the command line, the way to do this is straightforward.
$ git init
Initialized empty Git repository in /Users/pcombs/Documents/PythonCourse2011/.git/
What it's saying is that it's created a directory called ".git" inside of the PythonCourse2011 directory. By convention, Unix-derived operating systems (like Linux and Mac OS X) hide things that begin with a . by default, though you can get them to show up by using ls -a instead of just ls. For the most part, though, you won't need to muck about in .git directly anyways, so don't worry too much about it.

Also, if you already had a Git repository set up in the current folder, running git init won't overwrite it; it just "reinitializes" it. Apart from printing a slightly different message, this leaves your existing history alone (according to the Git documentation, the main reason to rerun git init is to pick up newly added templates). So you don't need to worry if you think you might already have a repository there: you can init one anyways and it won't break anything. In fact, all but a very few git commands are safe, and won't destroy data.

git add

So now that we have our shiny new repository, what do we do with it? Git will only keep track of things that we tell it to track, and the way we do that is by using git add. I'm first going to make a really simple file, and then git add it.

hello.py
print "Hello, world!"

$ git add hello.py


git commit

At this point, we're almost tracking hello.py, but not quite! Git uses a 2-step process: first, you add files to a staging area (called the index); then, once you've added the files you want to the index, you commit them to the repository. A commit (Computer Scientists aren't the best at grammar, so what used to be a verb is now a noun) is a snapshot of your code at a particular time. Each commit has an associated message. Traditionally, the first line is a brief, one-line summary of the changes you've made; after a blank line, you can write as long or as short a message as you'd like to explain the changes in more detail. This is like your lab notebook, so be as verbose as you need to explain why you did what you did.

$ git commit
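
As a small illustration of that message layout (summary line, blank line, longer explanation), here is a sketch that splits a message into its two parts. The helper split_commit_message and the example message are hypothetical, written only for this page; they are not part of Git.

```python
# Hypothetical helper, written for this example; it is not part of Git.

def split_commit_message(message):
    """Split a commit message into (summary, body).

    The summary is the first line. The body is everything after it, with
    the separating blank line removed, or an empty string if there is none.
    """
    lines = message.strip().split("\n")
    summary = lines[0]
    body = "\n".join(lines[1:]).strip()
    return summary, body


example_message = """Add hello.py

This is the first file in the repository. It only prints a greeting,
but it gives us something to track while we learn Git.
"""

summary, body = split_commit_message(example_message)
print(summary)
print(body)
```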

So now let's make some more changes to our code:
hello.py
print "Hello, world!"
x = 3
print x

status and diff

Now let's say we made those changes last night right before going home, and we don't remember if we added them to the index and/or committed them. There are a couple of commands you can use to check:

$ git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#    modified:   hello.py
#
no changes added to commit (use "git add" and/or "git commit -a")

In this case, we see that we've modified hello.py, but we haven't added it to the index. If we want to find out what changes we've made, exactly, we can do:
$ git diff
diff --git a/hello.py b/hello.py
index 4351743..2aae829 100644
--- a/hello.py
+++ b/hello.py
@@ -1 +1,3 @@
 print "Hello, world!"
+x = 3
+print x

Now by default, git diff will tell you the difference between what is in the working directory and what is in the index. That is, it shows the changes that we could *add*. If you want to find out what changes we've already added (the difference between the index and the most recent commit), you can give it the --staged flag, so:

$ git add hello.py
$ git diff
(nothing gets displayed, so there aren't any more changes we can add)
$ git diff --staged
diff --git a/hello.py b/hello.py
index 4351743..2aae829 100644
--- a/hello.py
+++ b/hello.py
@@ -1 +1,3 @@
 print "Hello, world!"
+x = 3
+print x

$ git commit
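
If the +/- notation in that output is unfamiliar, Python's standard difflib module can produce the same unified diff format. The sketch below compares the two versions of hello.py as plain lists of strings. It only illustrates the format; it does not call Git, and the file names passed in are just labels.

```python
import difflib

# The two versions of hello.py from the text, as lists of lines.
old_version = [
    'print "Hello, world!"\n',
]
new_version = [
    'print "Hello, world!"\n',
    'x = 3\n',
    'print x\n',
]

# unified_diff yields the two header lines, then the hunk, one line at a time.
diff_lines = list(difflib.unified_diff(
    old_version, new_version,
    fromfile="a/hello.py", tofile="b/hello.py"))

print("".join(diff_lines))
```

Lines starting with a space are unchanged context, lines starting with + were added, and the @@ line says which line ranges of the old and new file the hunk covers.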

Branches


So now let's say your lab mate (let's call him "Aaron") comes up to you after you show off the results of your program in lab meeting and says, "That program's really cool, but to use it for my project, I'd want to print out 3 squared instead of 3." Now, your project relies on plain 3, so you'd need to either
  1. Print out both 3 and 3 squared and rely on the user to figure out which one to use. That might work in this case, but maybe Aaron asked for modifications that aren't compatible with that approach.
  2. Copy the whole folder full of code elsewhere, and then make the change there. The problem with that approach is that if you discover a bug in the original program, you have to fix it in both places, which won't necessarily be trivial or obvious, and then you're never quite sure whether you've actually made the fix in both places, and ...
  3. Make a new branch of the repository. The code is allowed to diverge, but by storing the two branches in the same repository, you can keep track of the changes, and merge the changes from one to the other.

$ git branch xsquared
$ git checkout xsquared

hello.py
print "Hello, world!"
x = 3
print x**2

$ git add hello.py
$ git commit -m "Prints x^2 instead of x"

Now we have both branches of code running in parallel to each other, and we can make changes in one without affecting the other. If you're ever not sure what branch you're on, you can do:
$ git branch # Note the lack of a name for the branch
  master
* xsquared

Merging


As we work some more, we realize perhaps that something is wrong. Our program isn't nearly excited enough. That's an easy change, though:

$ git checkout master
hello.py -- on branch master
print "Hello, world!!!!"
x = 3
print x
$ git add hello.py
$ git commit -m "Getting excited"

We are really excited and want to make this change apply to both branches, though, so it would be nice to have some way to merge the changes into the xsquared branch.
$ git checkout xsquared # First, we switch back over to xsquared
$ git merge master # We say what branch we want to merge the changes from.
Auto-merging hello.py
Merge made by recursive.
 hello.py |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

Now, when we take a look at the code, we see that the program has automatically done the Right Thing™, and made the changes it was supposed to.

hello.py -- on branch xsquared
print "Hello, world!!!!"
x = 3
print x**2
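
To give a feel for how a merge can work this out, here is a deliberately simplified sketch of a three-way merge. It is not Git's actual algorithm: it assumes all three versions of the file have the same number of lines and compares them one line at a time, and the function name merge_lines is made up for this example. The idea is that a line changed on only one branch is taken from that branch, while a line changed differently on both is reported as a conflict.

```python
# A deliberately simplified sketch of a three-way merge. It assumes all three
# versions have the same number of lines, which real Git does not require.

def merge_lines(base, ours, theirs):
    """Merge two edited versions of a file, line by line.

    base is the common ancestor; ours and theirs are the two branch tips.
    Returns (merged_lines, conflicted_line_numbers).
    """
    merged = []
    conflicts = []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), 1):
        if o == t:
            merged.append(o)       # both sides agree (changed or not)
        elif o == b:
            merged.append(t)       # only "theirs" changed this line
        elif t == b:
            merged.append(o)       # only "ours" changed this line
        else:
            merged.append(None)    # both changed it differently
            conflicts.append(number)
    return merged, conflicts


# The versions of hello.py from the text: the common starting point,
# the xsquared branch, and the more excited master branch.
base = ['print "Hello, world!"', 'x = 3', 'print x']
xsquared = ['print "Hello, world!"', 'x = 3', 'print x**2']
master = ['print "Hello, world!!!!"', 'x = 3', 'print x']

merged, conflicts = merge_lines(base, xsquared, master)
print("\n".join(merged))
print(conflicts)
```

Here the first line changed only on master and the last line changed only on xsquared, so both changes are kept and the list of conflicts is empty.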

Advanced Merging


Sometimes, though, it's not possible for git to know what changes to make, and sometimes it does guess wrong. Let's work through an example where that happens.

Let's say that in two different branches, we make a change to the same line of code:

hello.py -- on branch master
print "Hello, world!!!! Let's print x"
x = 3
print x
$ git add hello.py
$ git commit -m "More descriptive message on master"

$ git checkout xsquared
hello.py -- on branch xsquared
print "Hello, world!!!! Let's print x**2."
x = 3
print x**2

print "Goodbye, cruel world..."

Now we add this in with two separate commits, one for the introductory message, and one for the sign-off message:
$ git add -p # The -p flag lets us do things piecewise
diff --git a/hello.py b/hello.py
index e9692b1..57e3f45 100644
--- a/hello.py
+++ b/hello.py
@@ -1,3 +1,6 @@
-print "Hello, world!!!!"
+print "Hello, world!!!! Let's print x**2."
 x = 3
 print x**2
+
+print "Goodbye, cruel world..."
+
Stage this hunk [y,n,q,a,d,/,s,e,?]? ?
y - stage this hunk
n - do not stage this hunk
q - quit; do not stage this hunk nor any of the remaining ones
a - stage this hunk and all later hunks in the file
d - do not stage this hunk nor any of the later hunks in the file
g - select a hunk to go to
/ - search for a hunk matching the given regex
j - leave this hunk undecided, see next undecided hunk
J - leave this hunk undecided, see next hunk
k - leave this hunk undecided, see previous undecided hunk
K - leave this hunk undecided, see previous hunk
s - split the current hunk into smaller hunks
e - manually edit the current hunk
? - print help
@@ -1,3 +1,6 @@
-print "Hello, world!!!!"
+print "Hello, world!!!! Let's print x**2."
 x = 3
 print x**2
+
+print "Goodbye, cruel world..."
+
Stage this hunk [y,n,q,a,d,/,s,e,?]? s
Split into 2 hunks.
@@ -1,3 +1,3 @@
-print "Hello, world!!!!"
+print "Hello, world!!!! Let's print x**2."
 x = 3
 print x**2
Stage this hunk [y,n,q,a,d,/,j,J,g,e,?]? y
@@ -2,2 +2,5 @@
 x = 3
 print x**2
+
+print "Goodbye, cruel world..."
+
Stage this hunk [y,n,q,a,d,/,K,g,e,?]? n
$ git commit -m "More descriptive intro message on xsquared"
$ git diff
diff --git a/hello.py b/hello.py
index e5df48d..57e3f45 100644
--- a/hello.py
+++ b/hello.py
@@ -1,3 +1,6 @@
 print "Hello, world!!!! Let's print x**2."
 x = 3
 print x**2
+
+print "Goodbye, cruel world..."
+
$ git add hello.py
$ git commit -m "Added sign-off message"

So now we have some code (the sign-off message) from xsquared that we want to merge back into the master branch.

$ git checkout master
$ git merge xsquared
Auto-merging hello.py
CONFLICT (content): Merge conflict in hello.py
Automatic merge failed; fix conflicts and then commit the result.

So let's take a look at the difference between the code now and our last commit:
$ git diff
diff --cc hello.py
index 0a78149,57e3f45..0000000
--- a/hello.py
+++ b/hello.py
@@@ -1,3 -1,6 +1,10 @@@
++<<<<<<< HEAD
 +print "Hello, world!!!! Let's print x"
++=======
+ print "Hello, world!!!! Let's print x**2."
++>>>>>>> xsquared
  x = 3
- print x
+ print x**2
+
+ print "Goodbye, cruel world..."
+

So we see a few things here:
  • The first line has two different versions. Because the same line was changed, it has no way to know what the Right Thing™ is, so it just gives us both options and makes us manually make the change.
  • It's been a little overzealous with the changes, and turned the "print x" into "print x**2". This is easy to fix by hand.
  • It added in the sign-off message. That we'll just leave there.

Once we make those changes, we can add them to the index and then commit them.
$ git add hello.py
$ git commit -m "Resolved merge"

By the way, this style of having an "experimental" branch and a "master" branch can be a good way to go about things. That way, you always have a branch that works, but you still have a place to add in new features and whatnot.

Collaboration

Even if you're going to be the only person touching your code, some kind of version control will likely be helpful, but if you're going to be working on it with other people, it's nearly essential. Git was designed by Linus Torvalds to help with the development of Linux, which has hundreds of individual contributors. (He also named it after himself: "I'm an egotistical bastard, and I name all my projects after myself. First Linux, now git.")

Teaching you how to do this is outside the scope of this course, but Git is able to deal with it. Unlike some other Version Control Systems, Git is distributed, meaning that it doesn't require a central copy that everyone agrees on. Each copy of a repository is just as valid as any other, and they can be merged at will. If you do find yourself collaborating with someone else (and maybe even if you don't), I'd encourage you to look at GitHub, a Git-based code server. At the time of writing, all your repositories in the free level are openly displayed (though only you can modify them, unless you give other users permission), but there are also relatively cheap options for having closed-source repositories, if you're concerned about getting scooped on something. It's also possible to set up a Git server on a central lab server, but setting that up is way outside the scope of this course.

Stubbing and the 'pass' statement


When we write complicated code, we need to decompose it into simpler parts. This is an intuitive concept, and one that we've touched on before. Let's say that we want to make a program that gambles online and makes money for you so that you are free to pursue the standard academic career path of postdoctoral positions ad infinitum.

Stubbing is writing what your program should be doing, without actually getting around to filling in the details. It's like writing an outline of a paper. In this case:

gambler.py

#!/usr/bin/env python
import sys
import internet
import gambling
 
accountID = sys.argv[1]
password = sys.argv[2]
 
[balance,sessionInfo] = internet.loginToIllegalGamblingServer(accountID,password)
while balance > 0:
    # playGame takes the current balance and returns the updated one
    balance = gambling.playGame('poker',sessionInfo,balance)
    if balance > 1000000000:
        print 'Congratulations: you are a billionaire.'
        internet.logoffFromIllegalGamblingServer()
        sys.exit()
print 'Darn!'
internet.logoffFromIllegalGamblingServer()

internet.py
def loginToIllegalGamblingServer(accountID,password):
    pass

def logoffFromIllegalGamblingServer():
    pass

gambling.py
def playGame(gameType,sessionInfo,balance):
    if gameType == 'poker':
        hand = requestHand(sessionInfo)
        if handIsGood(hand):
            amountWon = goAllIn(hand,balance,sessionInfo)
            if amountWon:
                return amountWon + balance
            else:
                return 0
        else:
            return balance
    else:
        print "I don't know how to play that game yet"
        return balance

def requestHand(sessionInfo):
    pass

def handIsGood(hand):
    pass

def goAllIn(hand,balance,sessionInfo):
    pass


We're free to write the easy parts first, saving the hard parts for later. We've already created the logical flow of the program, and by doing this early, we can keep it organized.

The only new thing that we've covered here is the statement pass. It's pretty simple: it does nothing. Although this sounds somewhat pointless, in this case it allows you to write little function stubs without Python (or your text editor) complaining. However, it pops up in other places as well, usually as a shortcut where you mean to write more code later. This could be in an if or else statement, or in an except clause when handling an exception. Each of those cases requires something after the colon for it to be valid Python, and pass is a valid way to put in something that does nothing. We won't cover those applications here, but keep them in mind while you're programming: we encourage you to try it out.
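
For reference, here is a minimal sketch of pass in three places: a function stub, an if branch, and an except clause. The functions describe and parseBalance are hypothetical, made up for this example, and print is written with parentheses so that the sketch runs under both Python 2 and Python 3.

```python
# Hypothetical stubs for illustration; none of these names come from a real
# library.

def handIsGood(hand):
    pass  # placeholder: a function body that does nothing yet


def describe(balance):
    if balance > 0:
        pass  # TODO: decide what to report while we're still in the game
    else:
        return "Out of money"


def parseBalance(text):
    try:
        return int(text)
    except ValueError:
        pass  # ignore text that isn't a number; fall through to the default
    return 0


print(handIsGood("any hand"))
print(describe(0))
print(parseBalance("not a number"))
```

Note that a function whose body is only pass returns None, so code that uses the result of a stub may still fail until the stub is filled in.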