Numerical Python (NumPy)

Topics

array data types; the numpy and scipy packages; for loops

Introduction

As Python matured into the multifunctional language it is today, more and more features were added. We've seen most of the original functionality of the language, which was intended for scripting UNIX and web processes and for writing user interfaces. But as Python became popular as a scripting language, it found its way into the arms of scientists, who, generally speaking, care more about efficient data manipulation than web programming. The big piece missing from Python was a numerical library: a collection of tools for dealing with large amounts of numbers at once. After a few efforts, an integrated one was developed. Numerical Python, or NumPy, provides a large collection of mathematical functions similar to those found in languages like MATLAB or R, and, like those languages, provides data types that can be manipulated as vectors and matrices. We'll introduce these data types, then their associated functions.

NumPy Basics

Numerical Python is a powerful library of functions, methods, and data types we can use to analyze our data. Unfortunately for those of us whose heads continue to spin in a crash course of syntax, it also uses a different set of rules. I hope you'll understand why when you see the power and speed NumPy's data types afford us. Let's start off by creating some arrays, which look sorta like lists, and are in fact vectors.

They differ in a few fundamental ways from lists:

Arrays cannot be of mixed types. They can be all integers, floats, strings, logical (or boolean) values, or other immutable values. But they cannot be some characters, some numbers, or any other olio of data types. They also cannot contain mutable types such as lists. So, we can have a list of lists, but not an array of lists. We can, however, have an array of arrays. Which brings us to:

Arrays can be multidimensional, but they are not sparse structures.

We can perform vector operations on them, which can be algebraic functions (like a dot product) or simple replacements of values in a slice of the array.
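
To make the first and third points concrete, here's a quick interpreter sketch (the values are just illustrative): NumPy upcasts mixed numeric input to a single type, and a single arithmetic statement applies to every element at once:

>>> import numpy as np
>>> np.array([1, 2.5, 3])    # ints and floats are upcast to a single type
array([ 1. ,  2.5,  3. ])
>>> np.array([1, 2, 3]) * 2  # one statement operates on every element
array([2, 4, 6])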

Arrays

Here's one way: start with a list and change it:

>>> import numpy as np
>>> a = [0] * 40
>>> a = np.array(a)

Or this can be shortened:

>>> a = np.array([0] * 40)

But there's a better way to get a vector of zeros:

>>> a = np.zeros(40)

Note that the default type when declaring an array is float64:

>>> type(a[0])
<type 'numpy.float64'>

And here's the simplest way to change that:

>>> a = np.zeros(40, int)
>>> type(a[0])
<type 'numpy.int64'>

And here's how to declare something that's not all zeros:

>>> a = np.arange(40)
>>> a
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39])
>>> type(a[0])
<type 'numpy.int64'>

Notice the int type.

What if we want a float? There are a couple of ways to do it:

>>> a = np.arange(40, dtype=float)  # Explicitly tell it to make a float
>>> type(a[0])
<type 'numpy.float64'>
>>> a = np.arange(40.0)             # Implicitly give a float, and it will still work
>>> type(a[0])
<type 'numpy.float64'>

Like with range(), you can also give arange() more parameters:

>>> np.arange(40, 50)       # Start and stop
array([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
>>> np.arange(40, 50, 1)    # Start, stop, and increment
array([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
>>> np.arange(40, 50, .25)  # If any of the parameters are floats, the output will be a float
array([ 40.  ,  40.25,  40.5 ,  40.75,  41.  ,  41.25,  41.5 ,  41.75,
        42.  ,  42.25,  42.5 ,  42.75,  43.  ,  43.25,  43.5 ,  43.75,
        44.  ,  44.25,  44.5 ,  44.75,  45.  ,  45.25,  45.5 ,  45.75,
        46.  ,  46.25,  46.5 ,  46.75,  47.  ,  47.25,  47.5 ,  47.75,
        48.  ,  48.25,  48.5 ,  48.75,  49.  ,  49.25,  49.5 ,  49.75])

For our project, we're going to need to work with vectors of more than one dimension:

>>> a = np.zeros((10, 10))  # Note the inner set of parentheses
>>> a
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

And you can even modify a particular element with the same syntax, or a subtly different syntax, as our list-of-lists:

>>> a[5][5] = 3
>>> a[6,6] = 42
>>> a
array([[  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   3.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,  42.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])

So far, the coolest thing I've shown you isn't really that exciting: a range function that can have floats. The real power of arrays is the ability to have one statement affect a large chunk of an array:

>>> a[1,:] = 1
>>> a
array([[  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   3.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,  42.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])
>>> a[:,0] = 7
>>> a
array([[  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   3.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,  42.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])

Let us pause for a moment and think about how we would do this with for loops and lists:

LoL = [[0] * 10 for i in range(10)]
for i, elem in enumerate(LoL[1]):
    LoL[1][i] = 1
for L in LoL:
    L[0] = 7

We can also take slices of arrays, just as if they were lists:

>>> a = np.arange(10)
>>> a[2:5]
array([2, 3, 4])
>>> a[-1]
9
>>> a[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

Maybe you can see the advantage of the array syntax. But wait, there's more! Act now, and we'll throw in math operations for free!

Vector Math with Arrays

We can do math on many values at once with arrays, no for loop required.

>>> a = np.arange(0, 100, 2)
>>> b = np.arange(50)
# Simple math across the whole array:
>>> a
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,
       34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66,
       68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98])
>>> a + 10
array([ 10,  12,  14,  16,  18,  20,  22,  24,  26,  28,  30,  32,  34,
        36,  38,  40,  42,  44,  46,  48,  50,  52,  54,  56,  58,  60,
        62,  64,  66,  68,  70,  72,  74,  76,  78,  80,  82,  84,  86,
        88,  90,  92,  94,  96,  98, 100, 102, 104, 106, 108])
>>> b
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
>>> b / 2.0
array([  0. ,   0.5,   1. ,   1.5,   2. ,   2.5,   3. ,   3.5,   4. ,
         4.5,   5. ,   5.5,   6. ,   6.5,   7. ,   7.5,   8. ,   8.5,
         9. ,   9.5,  10. ,  10.5,  11. ,  11.5,  12. ,  12.5,  13. ,
        13.5,  14. ,  14.5,  15. ,  15.5,  16. ,  16.5,  17. ,  17.5,
        18. ,  18.5,  19. ,  19.5,  20. ,  20.5,  21. ,  21.5,  22. ,
        22.5,  23. ,  23.5,  24. ,  24.5])
# Pairwise multiplication; consider the for loop alternative
>>> a * b
array([   0,    2,    8,   18,   32,   50,   72,   98,  128,  162,  200,
        242,  288,  338,  392,  450,  512,  578,  648,  722,  800,  882,
        968, 1058, 1152, 1250, 1352, 1458, 1568, 1682, 1800, 1922, 2048,
       2178, 2312, 2450, 2592, 2738, 2888, 3042, 3200, 3362, 3528, 3698,
       3872, 4050, 4232, 4418, 4608, 4802])
>>>
>>> a = np.arange(0, 100, 2)
>>> b = np.arange(50)
>>>
>>> sum(a * b)  # alternatively, the function np.dot(a, b)
80850

Boolean (logical) Values with Arrays

>>> a = np.zeros(10, dtype=bool)
>>> a
array([False, False, False, False, False, False, False, False, False, False], dtype=bool)
# Slicing and mass-assignment still work
>>> a[2:5] = True
>>> a
array([False, False,  True,  True,  True, False, False, False, False, False], dtype=bool)
# Comparison and assignment
>>> b = (a == False)
>>> b
array([ True,  True, False, False, False,  True,  True,  True,  True,  True], dtype=bool)
>>> b = ~a
>>> b
array([ True,  True, False, False, False,  True,  True,  True,  True,  True], dtype=bool)
>>> b = not a  # It would be nice if this worked, but no such luck
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Basic Statistics with NumPy

NumPy is freaking huge, with around 1200 pages of reference documentation, but all of you will, at some point, use some basic statistics to get a feel for your data. So let's make sure we hit some of those functions:

Summary Statistics

>>> a = np.arange(99)
>>> np.mean(a)
49.0
# Standard deviation
>>> np.std(a)
28.577380332470412
# Variance
>>> np.var(a)
816.66666666666674

Random distributions

>>> a = np.random.uniform(0, 100, 10)  # Low, high, size of output
>>> a
array([ 59.29058916,   6.92921008,  91.93188435,  45.38511999,
        34.46457059,  41.29478046,  33.51351871,  42.5493656 ,
        50.54698303,  19.79408663])
>>> a = np.random.uniform(0, 100, (3, 3))
>>> a
array([[ 12.06197457,  90.90232851,  51.27616458],
       [ 61.12517983,  88.14112687,  29.36606864],
       [ 77.35499074,  53.7975086 ,  72.63088545]])
>>> a = np.random.uniform(0, 100, 1000000)
>>> np.mean(a)
50.022576176522485
# An exponential sample with mean = e
>>> np.mean(np.random.exponential(np.exp(1), 1000000))
2.7188135956330033

SciPy and Fitting

SciPy (pronounced "Sigh Pie") is a collection of libraries that builds on NumPy, with lots of convenient, fast functions for working with large amounts of scientific data. It's slightly smaller than NumPy, with only 900-odd pages of documentation. That includes sections on integrating C or Fortran code into Python, which is way outside the scope of this course, but if you ever do get to the point where you need a super-efficient implementation of something, you're covered. Especially given the one-off nature of academic science, you're often better served spending less time writing code that takes longer to run than spending lots and lots of time writing code that runs slightly faster.

The stats module of SciPy has functions for even more statistical distributions, statistical tests, and other random functions that a good statistician might need. As an example, let's see how we might use the linregress function, which does a linear regression on some data. Linear regression is the process of finding a line that minimizes the sum of the square of the vertical distances from each point to the line.

First, we'll set up some noisy data:

import numpy as np

slope = .5
intercept = -10
x = np.arange(0, 100)
y = slope * x + intercept
noise = 5 * np.random.randn(len(x))
y = y + noise

Aaron's going to show you how to generate plots like this in the afternoon, but if you know any MATLAB, it's eerily similar.

This isn't a math class, so we're going to start with the equations for slope, intercept, and correlation coefficient of the best-fit line as given:
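
Written out (these are the standard least-squares formulas, reconstructed here in LaTeX to match the code below):

$$m = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$

$$b = \frac{\sum y_i - m\sum x_i}{n}$$

$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}}$$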

Now we just translate the math to Python code:

n = len(x)
m = (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x**2) - (sum(x))**2)
b = (sum(y) - m * sum(x)) / n
r = (n * sum(x * y) - sum(x) * sum(y)) / np.sqrt((n * sum(x**2) - sum(x)**2)
                                                 * (n * sum(y**2) - sum(y)**2))
print m, b, r

0.475921780193 -8.40815519235 0.937965104934

This gives us pretty much the right result, but it was kind of a pain to type in. If only the libraries had some sort of function that could do linear regression for us...
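
Fortunately, they do. Here's a sketch of what that looks like with scipy.stats.linregress on the same x and y (the print labels are mine, but the values match the hand-rolled fit above):

from scipy import stats

slope, intercept, r, p, stderr = stats.linregress(x, y)
print "Regression Slope:", slope
print "Regression Intercept:", intercept
print "Regression correlation:", r
print "R^2:", r**2
print "p(slope is 0):", p

Regression Slope: 0.475921780193
Regression Intercept: -8.40815519235
Regression correlation: 0.937965104934
R^2: 0.879778538074
p(slope is 0): 7.10492025205e-47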

HIV Project

Now that we've learned all of these high-throughput ways of dealing with data, let's put it to use on the HIV project. First off, let's standardize our libraries, so download these files and put them in your Library/Python/Modules folder (or whatever it is that your PYTHONPATH is set to use):

Let's start by stubbing out the project based on the Supplemental Information of the paper. No doubt we'll need to add more as we go along, but this will give us a pretty good idea of the top-level functions we'll need to write.

import numpy as np
from sequence_tools import read_fasta


def generate_binary_matrix(data, consensus):
    """ This generates the binary matrix x_i(s) in S1.

    x will have num_residues rows and num_strains columns, and x[i,s] will
    be 1 if residue i of strain s is equal to residue i of the consensus
    sequence.
    """
    pass


def get_consensus(strains):
    """ Get the consensus sequence of aligned strains

    This assumes that all strains are the same length, and defines consensus
    sequence as the single most common residue at each position. If two or
    more residues are equally common, it chooses arbitrarily (but not
    necessarily randomly)
    """
    pass


def generate_correlation_matrix(binary_matrix):
    """ The other part of S1, which actually generates the correlation matrix
    """
    pass


def clean_matrix(correlation_matrix, lambda_cutoff):
    """ Uses RMT to clean the correlation matrix, as in S2

    Every eigenvector with an eigenvalue greater than the cutoff is used to
    generate a new correlation matrix
    """
    pass


def clean_phylogeny(binary_matrix):
    """ Cleans the binary matrix by removing the contribution of phylogeny

    This is section S4.
    """
    pass


def find_distinct_evo(binary_matrix):
    """ Removes evolutionarily distinct sequences

    Calculates a correlation matrix for the strains as they relate to each
    other, and then removes those that are significantly different

    This is section S5
    """
    pass


def determine_sectors(correlation_matrix, lambda_cutoff):
    """ Determines the sectors of the protein

    See sections S6 and S7 of the supplementals.

    This function returns both the strongest eigenvalue at a given residue
    and a list of counters with the projection of each residue onto
    significant eigenvectors
    """
    pass


################################################################
# Main Program Start
################################################################

gag_seq_file = '../data/HIV1_ALL_2009_GAG_PRO.fasta'
gag_data_full = read_fasta(gag_seq_file)
# We don't actually need the names of the sequences, and a list is more
# convenient for what we're doing than a dictionary
gag_data = [gag_data_full[name] for name in gag_data_full
            if 'B' in name.split('.')[0]]
consensus = get_consensus(gag_data)
x = generate_binary_matrix(gag_data, consensus)
distinct_strains = find_distinct_evo(x)
# Somehow remove the distinct strains
# The consensus may have changed, so let's recalculate it:
consensus = get_consensus(gag_data)
x = generate_binary_matrix(gag_data, consensus)
x = clean_phylogeny(x)
# We could write this by hand, but using the library function is likely
# faster for the computer and well-debugged, so we won't have to worry
# about my tendency to make stupid errors
corr_matrix = np.corrcoef(x)
# Determine the lambda cutoff here
corr_matrix_clean = clean_matrix(corr_matrix, lambda_cutoff)
best, loadings = determine_sectors(corr_matrix_clean, lambda_cutoff)

Reading the math

The supplementals in the paper play a little fast and loose with the notation, and they borrow some of it from physics, which you might not have seen even if you've taken other linear algebra classes. Fortunately, I was a physics major in undergrad. I'll spend a little time on the "what" of the premise behind the paper, though the "why" is best left to more theoretical folks than myself.

If you see a variable with two subscripts, like Cij, that typically means it's a matrix, and the element in the ith row and jth column is defined by the stuff after the equals sign. In code, this would be C[i,j]. At the moment, they're using the <angle brackets> to mean "average over", and the variable that they're averaging over is the subscript, so the first term means "average the product of the ith and jth residues of x over all sequences". In code, this would be mean(x[i,:] * x[j,:]).
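
As a sanity check, here's a minimal sketch of that definition on toy data (the full equation S1 also divides by the standard deviations, which is exactly what np.corrcoef does for us):

x = np.random.uniform(0, 1, (3, 100))  # toy data: 3 "residues", 100 "sequences"
i, j = 0, 1
# <x_i x_j> - <x_i><x_j>, normalized by the standard deviations
c_ij = ((np.mean(x[i,:] * x[j,:]) - np.mean(x[i,:]) * np.mean(x[j,:]))
        / (np.std(x[i,:]) * np.std(x[j,:])))
print c_ij, np.corrcoef(x)[i, j]  # the two numbers should agree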

Here, they use the angle brackets to mean something else entirely. They are abstract representations of vectors in what's called bra-ket notation, or sometimes Dirac notation after the physicist who invented it. <k| is a "bra", or row vector, and |k> is a "ket", or column vector. A bra times a ket is the same as a dot product (also called an inner product) of two vectors, which is a scalar. However, in the equation above, we have a ket times a bra, which is an outer product, and takes two vectors and makes a matrix. Don't worry too much if you don't have an intuitive understanding of what's going on, since I'm not certain I do. I just don't want you to be too blown away when we turn the math into code.
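
In NumPy terms, with a 3-vector standing in for |k>:

>>> k = np.array([1, 2, 3])
>>> np.dot(k, k)    # <k|k>: bra times ket, an inner product (a scalar)
14
>>> np.outer(k, k)  # |k><k|: ket times bra, an outer product (a matrix)
array([[1, 2, 3],
       [2, 4, 6],
       [3, 6, 9]])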

Building the Correlation Matrix

from collections import Counter

def get_consensus(strains):
    """ Get the consensus sequence of aligned strains

    This assumes that all strains are the same length, and defines consensus
    sequence as the single most common residue at each position. If two or
    more residues are equally common, it chooses arbitrarily (but not
    necessarily randomly)
    """
    # Set up a list of counters equal to the length of the sequence
    residue_counters = [Counter() for residue in strains[0]]
    # Loop over each strain
    for strain in strains:
        # Loop over each residue and count how many times that residue has
        # appeared at that position
        for index, residue in enumerate(strain):
            residue_counters[index][residue] += 1
    # Use a list comprehension to get the most common residue at each position
    consensus_list = [counter.most_common()[0][0]
                      for counter in residue_counters]
    # If there's no variation at a residue, then it will mess with the
    # corrcoef function later. Accumulate these in a list to return as well
    novars = []
    for i, counter in enumerate(residue_counters):
        if len(counter) == 1:
            novars.append(i)
    # Efficiently convert a list into a string
    consensus = ''.join(consensus_list)
    return consensus, novars

The reason we also want a list of non-varying residues is that, if the variance is 0, then the denominator of equation S1 goes to zero.

Let's also write a function that will remove these non-varying residues from the alignment.

def strip_positions(data, consensus, novar):
    """Remove positions given in novar from all of the strains as well as the
    consensus strain.
    """
    data = [strip_single_position(seq, novar) for seq in data]
    consensus = strip_single_position(consensus, novar)
    return data, consensus


def strip_single_position(string, novar):
    "Remove positions given in novar from a single string"
    # Sets allow constant-time tests of membership, as opposed to lists,
    # where lookup time depends on the length of the list
    novar = set(novar)
    return "".join([char for i, char in enumerate(string)
                    if i not in novar])

Now that we have our consensus sequence in place, and we've stripped out our perfectly conserved residues (as it turns out, there really aren't that many in our dataset), let's actually make the binary matrix x.

def generate_binary_matrix(data, consensus):
    """ Generates a binary array x_i(s), where:
      * Each column corresponds to a particular strain
      * Each row corresponds to a particular site
      * The element is 1 if that strain at that site is identical to the
        consensus sequence at that site
    """
    x = np.zeros((len(consensus), len(data)), dtype=bool)
    for s, strain in enumerate(data):
        for i, site in enumerate(strain):
            x[i, s] = (site == consensus[i])
    return x
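
A quick toy run (three made-up three-residue "strains", with 'ACA' as the consensus) shows the orientation: rows are sites, columns are strains:

>>> generate_binary_matrix(['ACA', 'ACC', 'ATA'], 'ACA')
array([[ True,  True,  True],
       [ True,  True, False],
       [ True, False,  True]], dtype=bool)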

Random Matrix Theory to Clean the Matrix

def clean_matrix(correlation_matrix, lambda_cutoff):
    """ Uses RMT to clean the correlation matrix

    Every eigenvector with an eigenvalue greater than the cutoff is used to
    generate a new correlation matrix
    """
    # Calculate the eigenvalues and eigenvectors
    # (the h at the end means the input is Hermitian; for real values, that
    # means symmetric)
    eigvals, vecs = np.linalg.eigh(correlation_matrix)
    # The "clean" matrix starts out as just zeros
    clean = np.zeros_like(correlation_matrix)
    for k, eigval in enumerate(eigvals):
        if eigval > lambda_cutoff:
            # For each eigenvalue larger than the cutoff, compute the outer
            # product, and add it to the clean matrix. This is equation S5
            clean += eigval * np.outer(vecs[:, k], vecs[:, k])
    return clean
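
One way to convince yourself this really is equation S5's spectral reconstruction: with a cutoff below every eigenvalue, nothing is thrown away, and the "cleaned" matrix rebuilds the original exactly. A minimal check on toy data (assuming np is imported):

c = np.corrcoef(np.random.uniform(0, 1, (5, 20)))
print np.allclose(clean_matrix(c, -np.inf), c)  # True: every eigenvector kept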

I think this is a fairly straightforward translation of equation S5. The real hard part is: how do we know what that cutoff is supposed to be? In the paper, they do this "by randomly permuting the values of x_i(s) across different sequences s. For example, if there were only 3 sequences in the MSA, and a particular position had A, F, L as the amino acids in the 3 sequences, one random permutation would be F, L, A." So we can write a function to do this. For convenience's sake, we don't actually have to recalculate the consensus sequence, since we don't change the number of each type of amino acid at a given residue.

def find_cutoff(alignment):
    eigs = []
    # We don't want our permutations to mess with the original alignment,
    # which they might do if we don't make a copy of it. Remember the
    # interlude from day 2?
    alignment = alignment.copy()
    nresidues, nstrains = np.shape(alignment)
    for i in range(100):
        # Shuffle the residues at each position
        for r in range(nresidues):
            alignment[r, :] = np.random.permutation(alignment[r, :])
        # Calculate the correlation coefficient
        corr = np.corrcoef(alignment, bias=1)
        # Add the eigenvalues to the running list of eigenvalues
        eigs.extend(np.linalg.eigvalsh(corr))
        # Poor man's progress bar
        if i % 10 == 9:
            print '.',
        if i % 100 == 99:
            print ''
    return eigs

Note that this typically takes a while to run, so I'll just tell you that I typically get around 3.45 as the highest value in the list of eigenvalues, which is somewhat higher than the paper claims, but shouldn't affect the results too much.
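
In practice, the call looks like the (slow) block that's commented out in the final script:

eigs = find_cutoff(x)
lambda_cutoff = max(eigs)  # around 3.45 for this alignment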

Cleaning of Phylogeny

The authors find that "the eigenvector corresponding to the highest eigenvalue […] largely reflects correlations due to phylogeny", and that "it would be preferable to have a way to recompute the eigenspectrum after removal of the highest eigenvalue". They thus calculate a value for each sequence that can be thought of as representing the evolutionary distance from the consensus sequence, and then they do a residue-by-residue fit to remove this evolutionary distance variable.

from scipy import stats

def clean_phylogeny(binary_matrix):
    """ Cleans the binary matrix by removing the contribution of phylogeny

    This is section S4.
    """
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(binary_matrix))
    # "u_i^1 are the components of the eigenvector corresponding to the
    # largest eigenvalue"
    u1 = eigvecs[:, eigvals.argmax()]
    num_residues, num_strains = np.shape(binary_matrix)
    # Equation S11
    M = np.array([sum(u1[i] * binary_matrix[i, s]
                      for i in range(num_residues))
                  for s in range(num_strains)])
    # Alpha could be a 1D array, but this is more convenient since we will
    # need to force it into a 2D array eventually
    alpha = np.zeros((num_residues, num_strains))
    beta = np.zeros(num_residues)
    epsilon = np.zeros((num_residues, num_strains))
    for i in range(num_residues):
        # "The value of the parameters alpha_i and beta_i are estimated
        # through a least square regression..."
        slope, intercept, rval, pval, stderr = stats.linregress(
            M, binary_matrix[i, :])
        alpha[i, :] = intercept
        beta[i] = slope
    # Equation S10: x_i(s) = alpha_i + beta_i M(s) + epsilon_i(s)
    epsilon = binary_matrix - alpha - np.outer(beta, M)
    # Equation S12: y_i(s) = alpha_i + epsilon_i(s)
    return alpha + epsilon

Removing evolutionarily distinct sequences

The procedure above only stands a chance of working well if the evolutionary distance is roughly evenly distributed. If some of the strains are grouped together into sub-clades, then M will likely have a multi-modal distribution. The authors claim that they find these sub-clades, and while I can't get results as clean as figure S3, there does seem to be a central group of sequences and then some that are farther out.

def find_distinct_evo(binary_matrix):
    """ Removes evolutionarily distinct sequences

    Calculates a correlation matrix for the strains as they relate to each
    other, and then removes those that are significantly different

    This is section S5
    """
    gamma = np.cov(binary_matrix.T)
    eigvals, vecs = np.linalg.eigh(gamma)
    vecs = vecs.T
    # Here there be dragons
    # Using the projections along eigenvector 2 and the cutoff of -.1 was
    # empirically determined. Your mileage may vary
    proj1 = [np.dot(gamma[i], vecs[-1]) for i in range(len(eigvals))]
    proj2 = [np.dot(gamma[i], vecs[-2]) for i in range(len(eigvals))]
    return [pos for pos, proj in enumerate(proj2) if proj > -.1]

Determining Sectors

We've done all this work to clean away various sources of noise, but we still haven't actually gotten the sectors. The way we'll go about identifying them is by the similarity of each residue's row of the correlation matrix to the eigenvectors of the matrix. Fortunately, we have a really convenient way to calculate the similarity of two vectors: the dot product. We can visualize this easily in two or three dimensions: if two vectors point in roughly the same direction, they will have a higher dot product than if they point nearly perpendicular to each other. This is the same idea, except in 500 dimensions instead of 3.
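
A two-dimensional illustration of that idea:

>>> np.dot([1.0, 1.0], [0.9, 1.1])   # nearly parallel: large dot product
2.0
>>> np.dot([1.0, 1.0], [1.0, -1.0])  # perpendicular: dot product of zero
0.0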

def determine_sectors(correlation_matrix, lambda_cutoff):
    """ Determines the sectors of the protein

    See sections S6 and S7 of the supplementals.

    This function returns both the strongest eigenvalue at a given residue
    and a list of counters with the projection of each residue onto
    significant eigenvectors
    """
    eigvals, vecs = np.linalg.eigh(correlation_matrix)
    n_residues = n_vectors = len(eigvals)
    loadings = [Counter() for i in range(n_residues)]
    # Remove the autocorrelations, which should typically be much higher
    # than the inter-residue correlations. This works by multiplying by the
    # logical negation of the identity matrix, which zeroes the diagonal
    othercorrs = abs(correlation_matrix
                     * (~np.identity(n_residues, dtype=bool)))
    for r in range(n_residues):
        if max(othercorrs[r]) < 0.15:
            # "We chose to exclude from sectors those residues that did not
            # show any correlation higher than 0.1 (in absolute magnitude)
            # with any other sector residue"
            # I thought their threshold was a little low, since in a random
            # matrix of about ~500x500, you would expect to see over 1000
            # correlations higher than that by chance, if you believe their
            # P < 5x10^-3 statistic
            continue
        for k in range(n_vectors):
            if eigvals[k] > lambda_cutoff:
                # The loading is simply the projection of the correlation
                # values for each residue onto a given eigenvector
                loading = np.dot(correlation_matrix[:, r], vecs[:, k])
                loadings[r][k] = abs(loading)
    best = [(l.most_common()[0][0] if (len(l) > 0) else None)
            for l in loadings]
    return best, loadings

Putting it all together

correlation_matrix.py

import numpy as np
from sequence_tools import read_fasta
from collections import Counter
from scipy import stats


def filter_and_generate_binary(data):
    """Removes malformed strains and all sites that have non-varying residues
    or gaps.

    The return value x is a 2D boolean array, where:
      * each row corresponds to a residue
      * each column corresponds to a different strain
      * a 1 means that the given strain at that residue is identical to the
        consensus sequence
    """
    # Remove strains that aren't similar enough to each other
    data = filter_strains(data)
    # Find the most common residue at each location
    consensus_sequence, novars = get_consensus(data)
    # For scientific and practical reasons, we should exclude all sites that
    # are a gap in the consensus strain. These won't be as useful for, say,
    # identifying locations for antibodies, and if we leave them in, we get
    # way too many sectors.
    gaps = [idx for idx, res in enumerate(consensus_sequence) if res == '-']
    novars.extend(gaps)
    data, consensus_sequence = strip_positions(data, consensus_sequence,
                                               novars)
    x = generate_binary_matrix(data, consensus_sequence)
    return x


def filter_strains(data):
    """ Filters out strains that are sufficiently different from the rest

    On my first look-through of this data, some of the aligned proteins are
    different lengths. Since we don't know whether they were clipped at the
    beginning, the end, or what, we'll just take strains that are the most
    common length.

    In principle, you could have other filter conditions in here too!
    """
    # A Counter is a Python data type that counts things, and can tell you
    # the most common element
    length_counter = Counter([len(strain) for strain in data])
    # A Counter's most_common method returns a list of tuples, with the first
    # element as the item, and the second element as the count of that item,
    # with the list sorted in descending order by counts
    most_common = length_counter.most_common()[0][0]
    # Collect only strains that are the right length into the return variable
    good_data = []
    for sequence in data:
        if len(sequence) == most_common:
            good_data.append(sequence)
    # Must not forget the return statement
    return good_data


def get_consensus(strains):
    """ Get the consensus sequence of aligned strains

    This assumes that all strains are the same length, and defines consensus
    sequence as the single most common residue at each position. If two or
    more residues are equally common, it chooses arbitrarily (but not
    necessarily randomly)
    """
    # Set up a list of counters equal to the length of the sequence
    residue_counters = [Counter() for residue in strains[0]]
    # Loop over each strain
    for strain in strains:
        # Loop over each residue and count how many times that residue has
        # appeared at that position
        for index, residue in enumerate(strain):
            residue_counters[index][residue] += 1
    # Use a list comprehension to get the most common residue at each position
    consensus_list = [counter.most_common()[0][0]
                      for counter in residue_counters]
    # If there's no variation at a residue, then it will mess with the
    # corrcoef function later. Accumulate these in a list to return as well
    novars = []
    for i, counter in enumerate(residue_counters):
        if len(counter) == 1:
            novars.append(i)
    # Efficiently convert a list into a string
    consensus = ''.join(consensus_list)
    return consensus, novars


def strip_positions(data, consensus, novar):
    """Remove positions given in novar from all of the strains as well as the
    consensus strain.
    """
    data = [strip_single_position(seq, novar) for seq in data]
    consensus = strip_single_position(consensus, novar)
    return data, consensus


def strip_single_position(string, novar):
    "Remove positions given in novar from a single string"
    # Sets allow constant-time tests of membership, as opposed to lists,
    # where lookup time depends on the length of the list
    novar = set(novar)
    return "".join([char for i, char in enumerate(string)
                    if i not in novar])


def generate_binary_matrix(data, consensus):
    """ Generates a binary array x_i(s), where:
      * Each column corresponds to a particular strain
      * Each row corresponds to a particular site
      * The element is 1 if that strain at that site is identical to the
        consensus sequence at that site
    """
    x = np.zeros((len(consensus), len(data)), dtype=bool)
    for s, strain in enumerate(data):
        for i, site in enumerate(strain):
            x[i, s] = (site == consensus[i])
    return x


def find_cutoff(alignment):
    eigs = []
    # We don't want our permutations to mess with the original alignment,
    # which they might do if we don't make a copy of it. Remember the
    # interlude from day 2?
    alignment = alignment.copy()
    nresidues, nstrains = np.shape(alignment)
    max_corr = 0
    global allcorrs
    allcorrs = np.empty(100 * nresidues**2)
    for i in range(100):
        # Shuffle the residues at each position
        for r in range(nresidues):
            alignment[r, :] = np.random.permutation(alignment[r, :])
        # Calculate the correlation coefficient
        corr = np.corrcoef(alignment, bias=1)
        # Add the eigenvalues to the running list of eigenvalues
        eigs.extend(np.linalg.eigvalsh(corr))
        allcorrs[i * nresidues**2:(i + 1) * nresidues**2] = \
            abs((corr * ~np.identity(nresidues, dtype=bool)).ravel())
        # Poor man's progress bar
        if i % 10 == 9:
            print '.',
        if i % 100 == 99:
            print ''
    return eigs


def clean_matrix(correlation_matrix, lambda_cutoff):
    """ Uses RMT to clean the correlation matrix

    Every eigenvector with an eigenvalue greater than the cutoff is used to
    generate a new correlation matrix
    """
    # Calculate the eigenvalues and eigenvectors
    # (the h at the end means the input is Hermitian; for real values, that
    # means symmetric)
    eigvals, vecs = np.linalg.eigh(correlation_matrix)
    # The "clean" matrix starts out as just zeros
    clean = np.zeros_like(correlation_matrix)
    for k, eigval in enumerate(eigvals):
        # Note that this version also skips the largest eigenvalue, which
        # mostly reflects phylogeny (see S4)
        if eigval > lambda_cutoff and eigval != max(eigvals):
            # For each eigenvalue larger than the cutoff, compute the outer
            # product, and add it to the clean matrix. This is equation S5
            clean += eigval * np.outer(vecs[:, k], vecs[:, k])
    return clean


def clean_phylogeny(binary_matrix):
    """ Cleans the binary matrix by removing the contribution of phylogeny

    This is section S4.
    """
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(binary_matrix))
    # "u_i^1 are the components of the eigenvector corresponding to the
    # largest eigenvalue"
    u1 = eigvecs[:, eigvals.argmax()]
    num_residues, num_strains = np.shape(binary_matrix)
    # Equation S11
    M = np.array([sum(u1[i] * binary_matrix[i, s]
                      for i in range(num_residues))
                  for s in range(num_strains)])
    # Alpha could be a 1D array, but this is more convenient since we will
    # need to force it into a 2D array eventually
    alpha = np.zeros((num_residues, num_strains))
    beta = np.zeros(num_residues)
    epsilon = np.zeros((num_residues, num_strains))
    for i in range(num_residues):
        # "The value of the parameters alpha_i and beta_i are estimated
        # through a least square regression..."
        slope, intercept, rval, pval, stderr = stats.linregress(
            M, binary_matrix[i, :])
        alpha[i, :] = intercept
        beta[i] = slope
    # Equation S10: x_i(s) = alpha_i + beta_i M(s) + epsilon_i(s)
    epsilon = binary_matrix - alpha - np.outer(beta, M)
    # Equation S12: y_i(s) = alpha_i + epsilon_i(s)
    return alpha + epsilon


def find_distinct_evo(binary_matrix):
    """ Removes evolutionarily distinct sequences

    Calculates a correlation matrix for the strains as they relate to each
    other, and then removes those that are significantly different

    This is section S5
    """
    gamma = np.cov(binary_matrix.T)
    eigvals, vecs = np.linalg.eigh(gamma)
    vecs = vecs.T
    # Here there be dragons
    # Using the projections along eigenvector 2 and the cutoff of -.1 was
    # empirically determined. Your mileage may vary
    proj1 = [np.dot(gamma[i], vecs[-1]) for i in range(len(eigvals))]
    proj2 = [np.dot(gamma[i], vecs[-2]) for i in range(len(eigvals))]
    return [pos for pos, proj in enumerate(proj2) if proj > -.1]


def determine_sectors(correlation_matrix, lambda_cutoff):
    """ Determines the sectors of the protein

    See sections S6 and S7 of the supplementals.

    This function returns both the strongest eigenvalue at a given residue
    and a list of counters with the projection of each residue onto
    significant eigenvectors
    """
    eigvals, vecs = np.linalg.eigh(correlation_matrix)
    n_residues = n_vectors = len(eigvals)
    loadings = [Counter() for i in range(n_residues)]
    # Remove the autocorrelations, which should typically be much higher
    # than the inter-residue correlations. This works by multiplying by the
    # logical negation of the identity matrix, which zeroes the diagonal
    othercorrs = abs(correlation_matrix
                     * (~np.identity(n_residues, dtype=bool)))
    for r in range(n_residues):
        if max(othercorrs[r]) < 0.15:
            # "We chose to exclude from sectors those residues that did not
            # show any correlation higher than 0.1 (in absolute magnitude)
            # with any other sector residue"
            # I thought their threshold was a little low, since in a random
            # matrix of about ~500x500, you would expect to see over 1000
            # correlations higher than that by chance, if you believe their
            # P < 5x10^-3 statistic
            continue
        for k in range(n_vectors):
            if eigvals[k] > lambda_cutoff:
                # The loading is simply the projection of the correlation
                # values for each residue onto a given eigenvector
                loading = np.dot(correlation_matrix[:, r], vecs[:, k])
                loadings[r][k] = abs(loading)
    best = [(l.most_common()[0][0] if (len(l) > 0) else None)
            for l in loadings]
    return best, loadings


################################################################
# Main Program Start
################################################################

gag_seq_file = '../data/HIV1_ALL_2009_GAG_PRO.fasta'
gag_data_full = read_fasta(gag_seq_file)
# We don't actually need the names of the sequences, and a list is more
# convenient for what we're doing than a dictionary
gag_data = [gag_data_full[name] for name in gag_data_full
            if 'B' in name.split('.')[0]]

print "First Pass Filtering"
x = filter_and_generate_binary(gag_data)
distinct_strains = find_distinct_evo(x)
gag_data2 = [strain for idx, strain in enumerate(gag_data)
             if idx in distinct_strains]
rows, cols = np.shape(x)
print "Found %d locations in %d strains" % (rows, cols)

print "Second Pass filtering"
# The claim in S4 is that the method works best if we remove the contribution
# from evolutionarily distinct sequences, so we'll have to re-run once we've
# taken those out.
x = filter_and_generate_binary(gag_data2)
x = clean_phylogeny(x)
rows, cols = np.shape(x)
print "Found %d locations in %d strains" % (rows, cols)

print "Building matrix"
corr_matrix = np.corrcoef(x, bias=1)

# It actually takes a while for this to run, so I'm going to leave it
# commented out, and just put in the result that I happen to know it will
# give us.
#print "Finding Cutoff"
#eigs = find_cutoff(x)
#lambda_cutoff = max(eigs)
lambda_cutoff = 3.45

print "Cleaning matrix"
corr_matrix_clean = clean_matrix(corr_matrix, lambda_cutoff)
best, loadings = determine_sectors(corr_matrix_clean, lambda_cutoff)

Exercises

1) Writing Mathematical Functions
a) Write a function that accepts an array of floats. Return an array where every value of the input array is divided by 1.5.
b) Use the random functions to generate an array of floats. Write a function that accepts this array, and returns a list of values that are more than one standard deviation from the mean of the array.
c) Write a function that evaluates the function y = e^x for values from 0 to 10 and stores the values in an array. The function should also generate a random exponential sample of the same length. Sort the random sample, then return the sum of the distances at each point from the random sample to the exponential model.

2) Comparing the Sectors
At the end of all of this, we have identified which sector (if any) each site belongs to. Compare the sites we found to those that Dahirel et al. found. For instance, we might be curious how many sites we identified as significant that they didn't (and vice versa), and how many we agree on (hint: see sets from the second day). Feel free to compare the sectors in other ways too.

For your convenience, I've pulled out the sets from the supplemental information and put them into Python lists below:

# From Dahirel et al:
sec1 = [1,88,2,94,3,97,4,99,5,100,6,108,8,118,9,120,11,122,12,123,14,128,16,129,19,131,20,133,21,134,24,135,27,136,29,138,32,139,33,141,35,142,36,143,38,144,39,145,41,148,45,149,48,150,50,151,51,152,52,153,57,154,60,155,63,156,73,158,77,160,79,251,83,276,86,279,87,433]
sec2 = [23,446,37,450,178,452,379,455,381,457,386,459,391,461,392,462,393,463,394,464,395,489,399,400,402,405,406,407,408,412,413,414,416,417,419,420,423,430,431,432,434,435,437,438,439,440,442,443,444,445]
sec3 = [53,305,140,306,163,310,167,316,169,317,170,319,171,323,172,326,174,338,175,344,179,345,180,346,181,182,185,186,187,189,191,198,199,212,221,225,229,233,240,243,245,249,257,260,263,265,269,284,288,291,295,347,363,364,365,366,367]
sec4 = [166,197,211,222,236,237,308,318,354,396]
sec5 = [17,31,47,137,161,261,275,278,290,298,324,334,337,343]
secQ = [18,30,54,62,69,90,125,130,146,147,159,173,176,200,218,219,223,224,228,230,234,242,248,252,255,256,264,267,268,273,280,281,286,312,341,357,362,374,375,376,401,403]

3) Sorting by sector

Using some of the plotting commands we'll show you this afternoon, we can generate a heatmap of the correlation matrix.

$ ipython -pylab
Enthought Python Distribution -- http://www.enthought.com

Python 2.7.1 |EPD 7.0-2 (64-bit)| (r271:86832, Dec 3 2010, 15:56:20)
Type "copyright", "credits" or "license" for more information.

IPython 0.10.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object'. ?object also works, ?? prints more.

Welcome to pylab, a matplotlib-based Python environment.
For more information, type 'help(pylab)'.

In [1]: run correlation_matrix.py

In [2]: imshow(corr_matrix_clean, vmin=0, vmax=.25)

We get the figure on the left, which looks only vaguely similar to the figure from the paper on the right.

Reading through the supplementals even more thoroughly, we see this line: "...the cleaned correlation matrices are represented by heat maps. However, the positions are grouped according to sites contained in each sector, not the linear sequence of the protein". If you have a list of sites you want to pull out of a correlation matrix, you can do something like:
corr_matrix_clean[np.meshgrid(sites, sites)]
to group those sites together and ignore the rest. (The function meshgrid returns two matrices that are x- and y-values corresponding to its two arguments, and something called "fancy indexing" lets us pull only those sites out of the correlation matrix in the proper way).
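
For example, on a toy 4x4 matrix, pulling out sites 0 and 2 (with the older NumPy we're using here, the result comes back transposed, which is harmless for a symmetric correlation matrix):

>>> a = np.arange(16).reshape((4, 4))
>>> sites = [0, 2]
>>> a[np.meshgrid(sites, sites)]
array([[ 0,  8],
       [ 2, 10]])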

Sort the sites in each sector according to their loadings along the sector's best eigenvector, combine the lists of sites using the NumPy function hstack, then use imshow to plot only the sectors.

Solutions

1) Functions

# a)
def two_thirds(an_array):
    return an_array / 1.5

# b)
from numpy import mean, std
from pylab import normal

def reject_outliers(an_array):
    middle = mean(an_array)
    spread = std(an_array)
    # The | combines the two boolean arrays
    return an_array[((middle - spread) > an_array)
                    | (an_array > (middle + spread))]
    ## OR ##
    ret = []
    for el in an_array:
        if not (-spread < el - middle < spread):
            ret.append(el)
    return ret

print reject_outliers(normal(0, 1, 1000))

# c)
from pylab import arange, exp, uniform

x = arange(0, 10, .01)
y = exp(x)
z = uniform(0, 10, len(y))
z = exp(z)
z.sort()  # Note that z.sort() sorts in place and returns None
print sum(y - z) / len(y)

2) Comparing the sectors

secs = []
for sec in set(best):
    if sec is None:
        continue
    # best is a list, so convert it to an array for np.where to work
    sites = np.where(np.asarray(best) == sec)[0]
    ## OR ##
    sites = []
    for res in range(len(best)):
        if best[res] == sec:
            sites.append(res)
    secs.append(sites)
secs_me = np.hstack(secs)

# From Dahirel et al:
sec1 = [1,88,2,94,3,97,4,99,5,100,6,108,8,118,9,120,11,122,12,123,14,128,16,129,19,131,20,133,21,134,24,135,27,136,29,138,32,139,33,141,35,142,36,143,38,144,39,145,41,148,45,149,48,150,50,151,51,152,52,153,57,154,60,155,63,156,73,158,77,160,79,251,83,276,86,279,87,433]
sec2 = [23,446,37,450,178,452,379,455,381,457,386,459,391,461,392,462,393,463,394,464,395,489,399,400,402,405,406,407,408,412,413,414,416,417,419,420,423,430,431,432,434,435,437,438,439,440,442,443,444,445]
sec3 = [53,305,140,306,163,310,167,316,169,317,170,319,171,323,172,326,174,338,175,344,179,345,180,346,181,182,185,186,187,189,191,198,199,212,221,225,229,233,240,243,245,249,257,260,263,265,269,284,288,291,295,347,363,364,365,366,367]
sec4 = [166,197,211,222,236,237,308,318,354,396]
sec5 = [17,31,47,137,161,261,275,278,290,298,324,334,337,343]
secQ = [18,30,54,62,69,90,125,130,146,147,159,173,176,200,218,219,223,224,228,230,234,242,248,252,255,256,264,267,268,273,280,281,286,312,341,357,362,374,375,376,401,403]

secs_them = sec1 + sec2 + sec3 + sec4 + sec5 + secQ
me_only = set(secs_me).difference(secs_them)
them_only = set(secs_me).symmetric_difference(secs_them) - me_only
print "I did find %3d that Dahirel et al didn't" % len(me_only)
print "Did not find %3d that Dahirel et al did" % len(them_only)

3) Sorting the sectors

bestloadings = []
for i, l in enumerate(loadings):
    if len(l) > 0:
        # Append the loading (rather than the number of the eigenvector)
        # to bestloadings
        bestloadings.append(l.most_common()[0][1])
    else:
        # There wasn't a good loading here
        bestloadings.append(0)
bestloadings = np.array(bestloadings)

sorted_secs = []
for sec in secs:
    sec = np.array(sec)
    ls = bestloadings[sec]
    sorted_secs.append(sec[ls.argsort()])
seclist = np.hstack(sorted_secs)
imshow(corr_matrix_clean[meshgrid(seclist, seclist)])
