Numerical Python (NumPy)

Topics

  • array data types
  • Conversion between standard Python data types and numpy data types
  • Introduction to numpy and scipy packages
  • Basic statistical analysis with numpy tools
  • Compare performance using vector math and for loops.
  • The math-heavy part of the project


Introduction

As Python matured into the multifunctional language it is today, more and more features were added. We've seen most of the original functionality of the language, which was originally intended for scripting UNIX and web processes and writing user interfaces. But as Python became popular as a scripting language, it found its way into the arms of scientists, who, generally speaking, care more about efficient data manipulation than web programming. The big piece missing from Python was a numerical library: a collection of tools for dealing with large amounts of numbers at once. After a few efforts, an integrated one was developed. Numerical Python, or NumPy, provides a large collection of mathematical functions similar to those found in a language like MATLAB or R, and, like those languages, it provides data types that can be manipulated as vectors and matrices. We'll introduce these data types, then their associated functions.


NumPy Basics



Numerical Python is a powerful library of functions, methods, and data types we can use to analyze our data. Unfortunately for those of us whose heads are still spinning from this crash course in syntax, it also uses a slightly different set of rules. I hope you'll understand why when you see the power and speed NumPy's data types afford us. Let's start off by creating some arrays of zeros, which look sort of like lists, and are in fact vectors.

They differ in a few fundamental ways from lists:

  1. Arrays cannot be of mixed types. They can be all integers, floats, strings, logical (boolean) values, or other immutable values, but they cannot be some characters, some numbers, or any other olio of data types (the short demonstration after this list shows what happens if you try). They also cannot contain mutable types such as lists, so we can have a list of lists, but not an array of lists. We can, however, have an array of arrays. Which brings us to:
  2. Arrays can be multidimensional, but they are not sparse structures.
  3. We can perform vector operations on them, which can be algebraic functions (like a dot product) or simple replacements of values in a slice of the array.
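
Here's a minimal interactive sketch of points 1 and 3 (the exact dtype string and spacing you see may differ slightly between NumPy versions):

>>> import numpy as np
>>> np.array([1, 2, 3.5])       # mixing ints and floats: everything becomes a float
array([ 1. ,  2. ,  3.5])
>>> np.array([1, 2, 'three'])   # mixing in a string: everything becomes a string
array(['1', '2', 'three'], dtype='|S5')
>>> np.array([1, 2, 3]) * 10    # a vector operation touches every element at once
array([10, 20, 30])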

Arrays

Here's one way: start with a list and change it:

import numpy as np
 
>>> a = [0] * 40
>>> a = np.array(a)

Or this can be shortened:

>>> a = np.array([0] * 40)

But there's a better way to get a vector of zeros:
>>> a = np.zeros(40)

Note that the default data type for np.zeros (and most of NumPy's other array-creation functions) is float64:
>>> type(a[0])
<type 'numpy.float64'>

And here's the simplest way to change that:
>>> a = np.zeros(40, int)
>>> type(a[0])
<type 'numpy.int64'>

And here's how to declare something that's not all zeros:
>>> a = np.arange(40)
>>> a
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39])
>>> type(a[0])
<type 'numpy.int64'>
Notice the int type.

What if we want a float? There's a couple ways to do it:
>>> a = np.arange(40, dtype=float)  #Explicitly tell it to make a float
>>> type(a[0])
<type 'numpy.float64'>
>>> a = np.arange(40.0)  #Implicitly give a float, and it will still work
>>> type(a[0])
<type 'numpy.float64'>
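
One of today's topics is converting between standard Python types and NumPy types, so here's a quick sketch of the usual round trips (the exact dtype you see, e.g. int64 versus int32, depends on your platform):

>>> a = np.arange(5)
>>> a.dtype                    # the dtype attribute is quicker than checking type(a[0])
dtype('int64')
>>> a.astype(float)            # make a float copy of an integer array
array([ 0.,  1.,  2.,  3.,  4.])
>>> a.tolist()                 # back to a plain Python list of plain Python ints
[0, 1, 2, 3, 4]
>>> int(a[0]), float(a[1])     # single elements convert with the usual built-ins
(0, 1.0)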

Like with range(), you can also give arange() more parameters:
>>> np.arange(40, 50)  # Start and Stop
array([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
>>> np.arange(40, 50, 1) # Start, Stop, and increment
array([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
>>> np.arange(40,50,.25)  # If any of the parameters are floats, the output will be a float
array([ 40.  ,  40.25,  40.5 ,  40.75,  41.  ,  41.25,  41.5 ,  41.75,
        42.  ,  42.25,  42.5 ,  42.75,  43.  ,  43.25,  43.5 ,  43.75,
        44.  ,  44.25,  44.5 ,  44.75,  45.  ,  45.25,  45.5 ,  45.75,
        46.  ,  46.25,  46.5 ,  46.75,  47.  ,  47.25,  47.5 ,  47.75,
        48.  ,  48.25,  48.5 ,  48.75,  49.  ,  49.25,  49.5 ,  49.75])

For our project, we're going to need to work with vectors of more than one dimension:

>>> a = np.zeros( (10, 10) )  # Note the inner set of parentheses
>>> a
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

And you can even modify a particular element with the same syntax as our list-of-lists, or with a subtly different one:

>>> a[5][5] = 3
>>> a[6,6] = 42
>>> a
array([[  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   3.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,  42.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])

So far, the coolest thing I've shown you isn't really that exciting: a range function that can have floats. The real power of arrays is the ability to have one statement affect a large chunk of an array:

>>> a[1,:] = 1
>>> a
array([[  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   3.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,  42.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])
>>> a[:,0] = 7
>>> a
array([[  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   3.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,  42.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])


Let us pause for a moment and think about how we would do this with a for loop in lists:

LoL = [[0]*10 for i in range(10)]
 
for i, elem in enumerate(LoL[1]):
    LoL[1][i] = 1
 
for L in LoL:
    L[0] = 7

We can also take slices of arrays, just as if they were lists:
>>> a = np.arange(10)
>>> a[2:5]
array([2, 3, 4])
>>> a[-1]
9
>>> a[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

Maybe you can see the advantage of the array syntax. But wait, there's more! Act now, and we'll throw in math operations for free!

Vector Math with Arrays

We can do math on many values at once with arrays, no for loop required.

>>> a = np.arange(0,100,2)
>>> b = np.arange(50)
# Simple math across the whole array:
>>> a
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,
       34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66,
       68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98])
>>> a + 10
array([ 10,  12,  14,  16,  18,  20,  22,  24,  26,  28,  30,  32,  34,
        36,  38,  40,  42,  44,  46,  48,  50,  52,  54,  56,  58,  60,
        62,  64,  66,  68,  70,  72,  74,  76,  78,  80,  82,  84,  86,
        88,  90,  92,  94,  96,  98, 100, 102, 104, 106, 108])
>>> b
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
>>> b/2.0
array([  0. ,   0.5,   1. ,   1.5,   2. ,   2.5,   3. ,   3.5,   4. ,
         4.5,   5. ,   5.5,   6. ,   6.5,   7. ,   7.5,   8. ,   8.5,
         9. ,   9.5,  10. ,  10.5,  11. ,  11.5,  12. ,  12.5,  13. ,
        13.5,  14. ,  14.5,  15. ,  15.5,  16. ,  16.5,  17. ,  17.5,
        18. ,  18.5,  19. ,  19.5,  20. ,  20.5,  21. ,  21.5,  22. ,
        22.5,  23. ,  23.5,  24. ,  24.5])
 
# pairwise multiplication, consider the for loop alternative
>>> a * b
array([   0,    2,    8,   18,   32,   50,   72,   98,  128,  162,  200,
        242,  288,  338,  392,  450,  512,  578,  648,  722,  800,  882,
        968, 1058, 1152, 1250, 1352, 1458, 1568, 1682, 1800, 1922, 2048,
       2178, 2312, 2450, 2592, 2738, 2888, 3042, 3200, 3362, 3528, 3698,
       3872, 4050, 4232, 4418, 4608, 4802])
>>>
>>> a = np.arange(0,100,2)
>>> b = np.arange(50)
>>>
>>> sum(a * b)  # alternatively, the function np.dot(a,b)
80850
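
The topics list promised a comparison of vector math with for loops, so here's a rough timing sketch using the standard time module (the exact numbers depend entirely on your machine; the point is the relative difference, which is usually a couple of orders of magnitude):

import time
import numpy as np
 
a = np.arange(0, 2000000, 2)
b = np.arange(1000000)
 
start = time.time()
total = 0
for i in range(len(a)):        # the pure-Python way: loop and accumulate
    total += a[i] * b[i]
print "for loop:", time.time() - start, "seconds"
 
start = time.time()
total = np.dot(a, b)           # the NumPy way: one vectorized call
print "np.dot:  ", time.time() - start, "seconds"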

Boolean (logical) Values with Arrays


>>> a = np.zeros(10, dtype=bool)
>>> a
array([False, False, False, False, False, False, False, False, False, False], dtype=bool)
 
# Slicing and mass-assignment still works
>>> a[2:5] = True
>>> a
array([False, False,  True,  True,  True, False, False, False, False, False], dtype=bool)
 
# Comparison and assignment
>>> b = (a == False)
>>> b
array([ True,  True, False, False, False,  True,  True,  True,  True,  True], dtype=bool)
>>> b = ~a
>>> b
array([ True,  True, False, False, False,  True,  True,  True,  True,  True], dtype=bool)
>>> b = not a # It would be nice if this worked, but no such luck
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
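
Beyond comparisons, the really handy thing about boolean arrays is that you can use one directly as an index (a mask) into another array. This is standard NumPy boolean indexing, and it's the trick behind the outlier exercise at the end of this lesson:

>>> values = np.arange(10)
>>> mask = values > 6
>>> mask
array([False, False, False, False, False, False, False,  True,  True,  True], dtype=bool)
>>> values[mask]              # keep only the elements where the mask is True
array([7, 8, 9])
>>> values[values % 2 == 0]   # or build the mask inline
array([0, 2, 4, 6, 8])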

Basic Statistics with Numpy


NumPy is freaking huge, with around 1200 pages of reference documentation, but all of you will, at some point, use some basic statistics to get a feel for your data. So let's make sure we hit some of those functions:

Summary Statistics
>>> a = np.arange(99)
>>> np.mean(a)
49.0
# Standard Deviation
>>> np.std(a)
28.577380332470412
# Variance
>>> np.var(a)
816.66666666666674

Random distributions
>>> a = np.random.uniform(0, 100, 10) # Low, High, Size of Output
>>> a
array([ 59.29058916,   6.92921008,  91.93188435,  45.38511999,
        34.46457059,  41.29478046,  33.51351871,  42.5493656 ,
        50.54698303,  19.79408663])
>>> a = np.random.uniform(0, 100, (3,3))
>>> a
array([[ 12.06197457,  90.90232851,  51.27616458],
       [ 61.12517983,  88.14112687,  29.36606864],
       [ 77.35499074,  53.7975086 ,  72.63088545]])
 
>>> a = np.random.uniform(0,100,1000000)
>>> np.mean(a)
50.022576176522485
 
# exponential sample with mean = e
>>> np.mean(np.random.exponential(np.exp(1),1000000))
2.7188135956330033

SciPy and Fitting


SciPy (pronounced "Sigh Pie") is a collection of libraries that builds on NumPy, and has lots of convenient, fast functions for working with large amounts of scientific data. It's slightly smaller than NumPy, with only 900-odd pages of documentation. That includes sections on integrating C or Fortran code into Python, which is way outside the scope of this course, but if you ever do get to the point where you need a super-efficient implementation of something, you're covered. Especially in the one-off nature of academic science, you're often better served spending less time writing code that takes longer to run, compared to spending lots and lots of time writing code that runs slightly faster.

The stats module of SciPy has functions for even more statistical distributions, statistical tests, and other random functions that a good statistician might need. As an example, let's see how we might use the linregress function, which does a linear regression on some data. Linear regression is the process of finding a line that minimizes the sum of the square of the vertical distances from each point to the line.

First, we'll set up some noisy data:

import numpy as np
 
slope = .5
intercept = -10
 
x = np.arange(0,100)
y = slope*x + intercept
noise = 5 * np.random.randn(len(x))
 
y = y + noise
[Figure: 7.1-fitting-points.png, a scatter plot of the noisy data]

Aaron's going to show you how to generate plots like this in the afternoon, but if you know any Matlab, it's eerily similar.

This isn't a math class, so we're going to start with the equations for slope, intercept, and correlation coefficient of the best-fit line as given:

slope:        m = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
intercept:    b = (sum(y) - m*sum(x)) / n
correlation:  r = (n*sum(xy) - sum(x)*sum(y)) / sqrt( (n*sum(x^2) - (sum(x))^2) * (n*sum(y^2) - (sum(y))^2) )
Now we just translate the math to Python code:
n = len(x)
 
m = (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x**2) - (sum(x))**2)
b = (sum(y) - m * sum(x))/n
r = (n * sum(x * y) - sum(x) * sum(y)) / np.sqrt((n*sum(x**2) - sum(x)**2) * (n * sum(y**2) - sum(y)**2))
 
print m, b, r
 

0.475921780193 -8.40815519235 0.937965104934

This gives us pretty much the right result, but it was kind of a pain to type in. If only the libraries had some sort of function that could do linear regression for us...

from scipy import stats
 
r_slope, r_int, r_rval, r_pval, r_stderr = stats.linregress(x, y)
 
print "Regression Slope: ", r_slope
print "Regression Intercept: ", r_int
print "Regression correlation: ", r_rval
print "R^2:, ", r_rval**2
print "p(slope is 0): ", r_pval
Regression Slope: 0.475921780193
Regression Intercept: -8.40815519235
Regression correlation: 0.937965104934
R^2: 0.879778538074
p(slope is 0): 7.10492025205e-47

[Figure: 7.1-fitting.png, the data with the fitted regression line]

HIV Project


Now that we've learned all of these high throughput ways of dealing with data, let's put it to use on the HIV project. First off, let's standardize our libraries, so download these files and put them in your Library/Python/Modules folder (or whatever it is that your PYTHONPATH is set to use):



Let's start by stubbing out the project based on the Supplemental Information of the paper. No doubt we'll need to add more as we go along, but this will give us a pretty good idea of the top-level functions we'll need to write.

def generate_binary_matrix(data, consensus):
    """ This generates the binary matrix x_i(s) in S1.
 
     x will have num_residues rows and num_strains columns, and x[i,s] will be 1 if residue i of strain s is equal to residue i of the consensus sequence.
 
    """
    pass
 
 
def get_consensus(strains):
    """ Get the consensus sequence of aligned strains
 
    This assumes that all strains are the same length, and defines consensus
    sequence as the single most common residue at each position. If two or more
    residues are equally common, it chooses arbitrarily (but not necessarily
    randomly)
 
    """
    pass
 
 
 
def generate_correlation_matrix(binary_matrix):
    """ The other part of S1, which actually generates the correlation matrix
 
    """
    pass
 
 
def clean_matrix(correlation_matrix, lambda_cutoff):
    """ Uses RMT to clean the correlation matrix, as in S2
 
    Every eigenvector with an eigenvalue greater than the cutoff is used to
    generate a new correlation matrix
    """
    pass
 
def clean_phylogeny(binary_matrix):
    """ Cleans the binary matrix by removing the contribution of phylogeny
 
    This is section S4.
 
    """
    pass
 
def find_distinct_evo(binary_matrix):
    """ Removes evolutionarily distinct sequences
 
    Calculates a correlation matrix for the strains as they relate to each
    other, and then removes those that are significantly different
 
    This is section S5
    """
    pass
 
def determine_sectors(correlation_matrix, lambda_cutoff):
    """ Determines the sectors of the protein
 
    See sections S6 and S7 of the supplementals.
 
    This function returns both the strongest eigenvalue at a given residue and
    a list of counters with the projection of each residue onto significant
    eigenvectors
    """
    pass
 
 
#############################################################
### Main Program Start
#############################################################
 
gag_seq_file = '../data/HIV1_ALL_2009_GAG_PRO.fasta'
gag_data_full = read_fasta(gag_seq_file)
 
# We don't actually need the names of the sequences, and a list is more
# convenient for what we're doing than a dictionary. The check on the name
# keeps only the clade B strains.
gag_data = [gag_data_full[name] for name in gag_data_full
            if 'B' in name.split('.')[0]]
 
consensus = get_consensus(gag_data)
 
x = generate_binary_matrix(gag_data, consensus)
 
distinct_strains = find_distinct_evo(x)
# Somehow remove the distinct strains
# The consensus may have changed, so let's recalculate it:
 
consensus = get_consensus(gag_data)
 
x = generate_binary_matrix(gag_data, consensus)
x = clean_phylogeny(x)
 
 
 
corr_matrix = np.corrcoef(x)
# We could write this by hand, but using the library function is likely
# faster for the computer and well-debugged, so we won't have to
# worry about my tendency to make stupid errors
 
# Determine the lambda cutoff here
 
corr_matrix_clean = clean_matrix(corr_matrix, lambda_cutoff)
 
best, loadings = determine_sectors(corr_matrix_clean, lambda_cutoff)

Reading the math


The supplementals in the paper play a little fast and loose with the notation, and they borrow some notation from physics that you might not have seen, even if you've taken other linear algebra classes. Fortunately, I was a physics major in undergrad. I'll spend a little time on the "what" of the premise behind the paper, though the "why" is best left to more theoretical folks than myself.


Equation S1 (the correlation matrix):
C_ij = ( <x_i x_j> - <x_i><x_j> ) / sqrt( (<x_i^2> - <x_i>^2) * (<x_j^2> - <x_j>^2) ),
with all the averages <...> taken over sequences s.

If you see a variable with two subscripts, like C_ij, that typically means it's a matrix, and the element in the ith row and jth column is defined by the stuff after the equals sign. In code, this would be C[i,j]. Here, they're using the <angle brackets> to mean "average over", and the variable they're averaging over is the subscript, so the first term means "average the product of the ith and jth residues of x over all sequences". In code, this would be mean(x[i,:] * x[j,:]).
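
To make that notation concrete, here's a small sketch with a made-up matrix (the numbers in x here are just random values for illustration, not our real data):

import numpy as np
 
x = np.random.uniform(0, 1, (5, 20))   # pretend: 5 residues, 20 sequences
i, j = 2, 3
 
# <x_i(s) x_j(s)>_s : average the element-wise product over sequences
print np.mean(x[i,:] * x[j,:])
 
# <x_i(s)>_s <x_j(s)>_s : the product of the individual averages
print np.mean(x[i,:]) * np.mean(x[j,:])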

Equation S5 (in bra-ket notation): C_clean = sum_k lambda_k |k><k|, where the sum runs over the eigenvectors |k> whose eigenvalues lambda_k exceed the cutoff.

Here, they use the angle brackets to mean something else entirely: they are abstract representations of vectors in what's called bra-ket notation (sometimes Dirac notation, after the physicist who invented it). <k| is a "bra", or row vector, and |k> is a "ket", or column vector. A bra times a ket is the same as a dot product (also called an inner product) of two vectors, which is a scalar. However, in the equation above, we have a ket times a bra, which is an outer product: it takes two vectors and makes a matrix. Don't worry too much if you don't have an intuitive understanding of what's going on, since I'm not certain I do. I just don't want you to be too blown away when we turn the math into code.
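
If the bra-ket picture is unfamiliar, a couple of lines of NumPy make the inner/outer distinction concrete:

>>> u = np.array([1.0, 2.0, 3.0])
>>> v = np.array([4.0, 5.0, 6.0])
>>> np.dot(u, v)       # <u|v>: inner product, a single number
32.0
>>> np.outer(u, v)     # |u><v|: outer product, a full matrix
array([[  4.,   5.,   6.],
       [  8.,  10.,  12.],
       [ 12.,  15.,  18.]])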

Building the Correlation Matrix

from collections import Counter   # Counter tallies how many times each item appears
 
def get_consensus(strains):
    """ Get the consensus sequence of aligned strains
 
    This assumes that all strains are the same length, and defines consensus
    sequence as the single most common residue at each position. If two or more
    residues are equally common, it chooses arbitrarily (but not necessarily
    randomly)
 
    """
 
    # Set up a list of counters equal to the length of the sequence
    residue_counters = [Counter() for residue in strains[0]]
 
    for strain in strains:
        # Loop over each strain
        for index, residue in enumerate(strain):
            # Loop over each residue and count how many times that residue has
            # appeared at that position
            residue_counters[index][residue] += 1
 
    # Use a list comprehension to get the most common residue at each position
    consensus_list = [counter.most_common()[0][0]
                      for counter in residue_counters]
 
    # If there's no variation at a residue, then it will mess with the corrcoef
    # function later. Accumulate these in a list to return as well
    novars = []
    for i, counter in enumerate(residue_counters):
        if len(counter) == 1:
            novars.append(i)
 
    # Efficiently convert a list into a string
    consensus = ''.join(consensus_list)
 
    return consensus, novars

The reason we also want a list of non-varying residues is that, if the variance is 0, then the denominator of equation S1 goes to zero.
(Equation S1 again: the variance terms under the square root are the denominator that blows up.)
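
You can see the problem directly with a tiny example (depending on your NumPy version, you may also get a runtime warning about invalid values):

>>> a = np.array([[1., 2., 3., 4.],
...               [5., 5., 5., 5.]])   # the second "residue" never varies
>>> np.corrcoef(a)
array([[  1.,  nan],
       [ nan,  nan]])
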
Let's also write a function that will remove these non-varying residues from the alignment.

def strip_positions(data, consensus, novar):
    """Remove positions given in novar from all of the strains as well as the
    consensus strain.
    """
    data = [strip_single_position(seq, novar) for seq in data]
    consensus = strip_single_position(consensus, novar)
 
    return data, consensus
 
def strip_single_position(string, novar):
    "Remove positions given in novar from a single string"
    novar = set(novar)
    # Sets allow constant-time test of membership, as opposed to lists which
    # depend on the length of the list
 
    return "".join([char for i, char in enumerate(string)
                    if i not in novar])
 

Now that we have our consensus sequence in place, and we've stripped out our perfectly conserved residues (as it turns out, there really aren't that many in our dataset), let's actually make the binary matrix x.

def generate_binary_matrix(data, consensus):
    """ Generates a binary array x_i(s), where:
        * Each column corresponds to a particular strain
        * Each row corresponds to a particular site
        * The element is 1 if that strain at that site is identical to the
           consensus sequence at that site
 
    """
 
    x = np.zeros( (len(consensus), len(data)), dtype=bool)
    for s, strain in enumerate(data):
        for i, site in enumerate(strain):
            x[i,s] = (site == consensus[i])
 
    return x

Random Matrix Theory to Clean the Matrix


def clean_matrix(correlation_matrix, lambda_cutoff):
    """ Uses RMT to clean the correlation matrix
 
    Every eigenvector with an eigenvalue greater than the cutoff is used to
    generate a new correlation matrix
    """
    # Calculate the eigenvalues and eigenvectors
    # the h at the end means it is Hermitian (for real values: it's symmetric)
    eigvals, vecs = np.linalg.eigh(correlation_matrix)
 
    # The "clean" matrix starts out as just zeros
    clean = np.zeros_like(correlation_matrix)
    for k, eigval in enumerate(eigvals):
        if eigval > lambda_cutoff:
            # For each eigenvalue larger than the cutoff, compute the outer
            # product, and add it to the clean matrix. This is equation S5
            clean += eigval * np.outer(vecs[:,k], vecs[:,k])
    return clean

I think this is a fairly straightforward translation of equation S5. The real hard part is: how do we know what that cutoff is supposed to be? In the paper, they do this "by randomly permuting the values of x_i(s) across different sequences s. For example, if there were only 3 sequences in the MSA, and a particular position had A, F, L as the amino acids in the 3 sequences, one random permutation would be F, L, A." So we can write a function to do this. For convenience's sake, we don't actually have to recalculate the consensus sequence, since we don't change the number of each type of amino acid at a given residue.

def find_cutoff(alignment):
    eigs = []
 
    # We don't want our permutations to mess with the original alignment, which
    # it might do if we don't make a copy of it.  Remember the interlude from
    # day 2?
    alignment = alignment.copy()
 
    nresidues, nstrains = np.shape(alignment)
 
    for i in range(100):
        # Shuffle the residues at each position
        for r in range(nresidues):
            alignment[r, :] = np.random.permutation(alignment[r, :])
 
        # Calculate the correlation coefficient
        corr = np.corrcoef(alignment, bias=1)
 
        # Add the eigenvalues to the running list of eigenvalues
        eigs.extend(np.linalg.eigvalsh(corr))
 
        # Poor-man's Progress bar
        if i%10 == 9:
            print '.',
        if i%100 == 99:
            print ''
    return eigs
 

Note that this typically takes a while to run, so I'll just tell you that I typically get around 3.45 as the highest value in the list of eigenvalues, which is somewhat higher than the paper claims, but shouldn't affect the results too much.

Cleaning of Phylogeny

The authors find that "the eigenvector corresponding to the highest eigenvalue […] largely reflects correlations due to phylogeny", and that "it would be preferable to have a way to recompute the eigenspectrum after removal of the highest eigenvalue". They therefore calculate a value for each sequence that can be thought of as representing its evolutionary distance from the consensus sequence, and then do a residue-by-residue fit to remove this evolutionary-distance variable.

def clean_phylogeny(binary_matrix):
    """ Cleans the binary matrix by removing the contribution of phylogeny
 
    This is section S4.
 
    """
 
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(binary_matrix))
 
    # "u_i^1 are the components of the eigenvector corresponding to the largest
    # eigenvalue"
    u1 = eigvecs[:,eigvals.argmax()]
 
    num_residues, num_strains = np.shape(binary_matrix)
 
    # Equation S11
    M = np.array([sum(u1[i] * binary_matrix[i,s]
                      for i in range(num_residues))
                  for s in range(num_strains)])
 
    # Alpha could be a 1D array, but this is more convenient since we will need
    # to force it into a 2D array eventually
    alpha = np.zeros( (num_residues, num_strains) )
    beta = np.zeros( num_residues )
    epsilon = np.zeros( (num_residues, num_strains) )
 
    for i in range(num_residues):
        # "The value of the parameters alpha_i and beta_i are estimated through
        # a least square regression..."
        slope, intercept, rval, pval, stderr = stats.linregress(M, binary_matrix[i,:])
        alpha[i,:] = intercept
        beta[i] = slope
 
    # Equation S10:
    # x_i(s) = alpha_i + beta_i M(s) + epsilon_i(s)
    epsilon = binary_matrix - alpha - np.outer(beta, M)
 
    # Equation S12:
    # y_i(s) = alpha_i + epsilon_i(s)
    return alpha + epsilon
 

Removing evolutionarily distinct sequences

The procedure above only stands a chance of working well if the evolutionary distance is roughly evenly distributed. If some of the strains are grouped together into sub-clades, then M will likely have a multi-modal distribution. The authors claim that they find these sub-clades, and while I can't get results as clean as figure S3, there does seem to be a central group of sequences and then some that are farther out.

def find_distinct_evo(binary_matrix):
    """ Removes evolutionarily distinct sequences
 
    Calculates a correlation matrix for the strains as they relate to each
    other, and then removes those that are significantly different
 
    This is section S5
    """
    gamma = np.cov(binary_matrix.T)
    eigvals, vecs = np.linalg.eigh(gamma)
    vecs = vecs.T
 
    # Here there be dragons
    # Using the projections along eigenvector 2 and the cutoff of -.1 was
    # empirically determined. Your mileage may vary
    proj1 = [np.dot(gamma[i], vecs[-1]) for i in range(len(eigvals))]
    proj2 = [np.dot(gamma[i], vecs[-2]) for i in range(len(eigvals))]
    return [pos for pos, proj in enumerate(proj2) if proj > -.1]
 
 

Determining Sectors

We've done all this work to clean away various sources of noise, but we still haven't actually gotten the sectors. The way we'll go about identifying them is by the similarity of each residue's row of the correlation matrix to the eigenvectors of the matrix. Fortunately, we have a really convenient way to calculate the similarity of two vectors: the dot product. We can visualize this easily in two or three dimensions: if two vectors point in roughly the same direction, they will have a higher dot product than if they point nearly perpendicular to each other. This is the same idea, except in 500 dimensions instead of 3.
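
Here's that intuition in two dimensions (toy numbers only):

>>> np.dot([1.0, 0.5], [2.0, 0.5])   # roughly the same direction: big dot product
2.25
>>> np.dot([1.0, 0.0], [0.0, 1.0])   # perpendicular: dot product of zero
0.0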

def determine_sectors(correlation_matrix, lambda_cutoff):
    """ Determines the sectors of the protein
 
    See sections S6 and S7 of the supplementals.
 
    This function returns both the strongest eigenvalue at a given residue and
    a list of counters with the projection of each residue onto significant
    eigenvectors
    """
 
    eigvals, vecs = np.linalg.eigh(correlation_matrix)
 
    n_residues = n_vectors = len(eigvals)
 
    loadings = [Counter() for i in range(n_residues)]
 
    # removes the autocorrelations, which should typically be much higher than
    # the inter-residue correlations
    # This works by multiplying element-wise by the logical NOT of the
    # identity matrix, which zeroes out the diagonal
    othercorrs = abs(correlation_matrix
                     * (~ np.identity(n_residues, dtype=bool)))
 
    for r in range(n_residues):
        if max(othercorrs[r]) < 0.15:
            # "We chose to exclude from sectors those residues that did not
            # show any correlation higher than 0.1 (in absolute magnitude) with
            # any other sector residue"
            # I thought their threshold was a little low, since in a given
            # random matrix of about ~500x500, you would expect to see over 1000
            # correlations higher than that by chance, if you believe their P <
            # 5x10^-3 statistic
            continue
 
        for k in range(n_vectors):
            if eigvals[k] > lambda_cutoff:
                # The loading is simply the projection of the correlation values for
                # each residue onto a given eigenvector
                loading = np.dot(correlation_matrix[:,r], vecs[:,k])
                loadings[r][k] = abs(loading)
 
    best = [(l.most_common()[0][0] if (len(l) > 0) else None)
            for l in loadings]
 
    return best, loadings

Putting it all together


correlation_matrix.py
import numpy as np
from sequence_tools import read_fasta
from collections import Counter
from scipy import stats
 
def filter_and_generate_binary(data):
    """Removes malformed strains and all sites that have non-varying residues or
    gaps.
 
    The return value x is a 2D boolean array, where:
        *each row corresponds to a residue
        *each column corresponds to a different strain
        *a 1 means that the given strain at that residue is identical to the
            consensus sequence
    """
    # Remove strains that aren't similar enough to each other
    data = filter_strains(data)
 
    # Find the most common residue at each nucleotide location
    consensus_sequence, novars = get_consensus(data)
 
    # For scientific and practical reasons, we should exclude all sites that
    # are a gap in the consensus strain. These won't be as useful for, say,
    # identifying locations for antibodies, and if we leave them in, we get way
    # too many sectors.
    gaps = [idx for idx, res in enumerate(consensus_sequence) if res == '-']
    novars.extend(gaps)
 
 
    data, consensus_sequence = strip_positions(data, consensus_sequence,
                                                     novars)
 
    x = generate_binary_matrix(data, consensus_sequence)
 
    return x
 
 
def filter_strains(data):
    """ Filters out strains that are sufficiently different from the rest
 
    On my first look-through of this data, some of the aligned proteins are
    different lengths.  Since we don't know whether they were clipped at the
    beginning, the end, or what, we'll just take strains that are the most
    common length.
 
    In principle, you could have other filter conditions in here too!
    """
    # This is a python data type that counts things, and can tell you the most
    # common element
    length_counter = Counter([len(strain) for strain in data])
 
 
    # A Counter's most_common method returns a list of tuples, with the first
    # element as the item, and the second element as the count of that item,
    # with the list sorted in descending order by counts
    most_common = length_counter.most_common()[0][0]
 
    # Collect only strains that are the right length into the return variable
    good_data = []
    for sequence in data:
        if len(sequence) == most_common:
            good_data.append(sequence)
 
    # Must not forget the return statement
    return good_data
 
def get_consensus(strains):
    """ Get the consensus sequence of aligned strains
 
    This assumes that all strains are the same length, and defines consensus
    sequence as the single most common residue at each position. If two or more
    residues are equally common, it chooses arbitrarily (but not necessarily
    randomly)
 
    """
 
    # Set up a list of counters equal to the length of the sequence
    residue_counters = [Counter() for residue in strains[0]]
 
    for strain in strains:
        # Loop over each strain
        for index, residue in enumerate(strain):
            # Loop over each residue and count how many times that residue has
            # appeared at that position
            residue_counters[index][residue] += 1
 
    # Use a list comprehension to get the most common residue at each position
    consensus_list = [counter.most_common()[0][0]
                      for counter in residue_counters]
 
    # If there's no variation at a residue, then it will mess with the corrcoef
    # function later. Accumulate these in a list to return as well
    novars = []
    for i, counter in enumerate(residue_counters):
        if len(counter) == 1:
            novars.append(i)
 
    # Efficiently convert a list into a string
    consensus = ''.join(consensus_list)
 
    return consensus, novars
 
def strip_positions(data, consensus, novar):
    """Remove positions given in novar from all of the strains as well as the
    consensus strain.
    """
    data = [strip_single_position(seq, novar) for seq in data]
    consensus = strip_single_position(consensus, novar)
 
    return data, consensus
 
def strip_single_position(string, novar):
    "Remove positions given in novar from a single string"
    novar = set(novar)
    # Sets allow constant-time test of membership, as opposed to lists which
    # depend on the length of the list
 
    return "".join([char for i, char in enumerate(string)
                    if i not in novar])
 
def generate_binary_matrix(data, consensus):
    """ Generates a binary array x_i(s), where:
        * Each column corresponds to a particular strain
        * Each row corresponds to a particular site
        * The element is 1 if that strain at that site is identical to the
           consensus sequence at that site
 
    """
 
    x = np.zeros( (len(consensus), len(data)), dtype=bool)
    for s, strain in enumerate(data):
        for i, site in enumerate(strain):
            x[i,s] = (site == consensus[i])
 
    return x
 
def find_cutoff(alignment):
    eigs = []
 
    # We don't want our permutations to mess with the original alignment, which
    # it might do if we don't make a copy of it.  Remember the interlude from
    # day 2?
    alignment = alignment.copy()
 
    nresidues, nstrains = np.shape(alignment)
    max_corr = 0
    global allcorrs
    allcorrs = np.empty(100*nresidues**2)
 
    for i in range(100):
        # Shuffle the residues at each position
        for r in range(nresidues):
            alignment[r, :] = np.random.permutation(alignment[r, :])
 
        # Calculate the correlation coefficient
        corr = np.corrcoef(alignment, bias=1)
 
        # Add the eigenvalues to the running list of eigenvalues
        eigs.extend(np.linalg.eigvalsh(corr))
 
        allcorrs[i*nresidues**2:(i+1)*nresidues**2] = \
            abs((corr*~np.identity(nresidues,dtype=bool)).ravel())
 
        # Poor-man's Progress bar
        if i%10 == 9:
            print '.',
        if i%100 == 99:
            print ''
    return eigs
 
def clean_matrix(correlation_matrix, lambda_cutoff):
    """ Uses RMT to clean the correlation matrix
 
    Every eigenvector with an eigenvalue greater than the cutoff is used to
    generate a new correlation matrix
    """
    # Calculate the eigenvalues and eigenvectors
    # the h at the end means it is Hermitian (for real values: it's symmetric)
    eigvals, vecs = np.linalg.eigh(correlation_matrix)
 
    # The "clean" matrix starts out as just zeros
    clean = np.zeros_like(correlation_matrix)
    for k, eigval in enumerate(eigvals):
        if eigval > lambda_cutoff and eigval != max(eigvals):
            # For each eigenvalue larger than the cutoff, compute the outer
            # product, and add it to the clean matrix. This is equation S5
            clean += eigval * np.outer(vecs[:,k], vecs[:,k])
    return clean
 
def clean_phylogeny(binary_matrix):
    """ Cleans the binary matrix by removing the contribution of phylogeny
 
    This is section S4.
 
    """
 
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(binary_matrix))
 
    # "u_i^1 are the components of the eigenvector corresponding to the largest
    # eigenvalue"
    u1 = eigvecs[:,eigvals.argmax()]
 
    num_residues, num_strains = np.shape(binary_matrix)
 
    # Equation S11
    M = np.array([sum(u1[i] * binary_matrix[i,s]
                      for i in range(num_residues))
                  for s in range(num_strains)])
 
    # Alpha could be a 1D array, but this is more convenient since we will need
    # to force it into a 2D array eventually
    alpha = np.zeros( (num_residues, num_strains) )
    beta = np.zeros( num_residues )
    epsilon = np.zeros( (num_residues, num_strains) )
 
    for i in range(num_residues):
        # "The value of the parameters alpha_i and beta_i are estimated through
        # a least square regression..."
        slope, intercept, rval, pval, stderr = stats.linregress(M, binary_matrix[i,:])
        alpha[i,:] = intercept
        beta[i] = slope
 
    # Equation S10:
    # x_i(s) = alpha_i + beta_i M(s) + epsilon_i(s)
    epsilon = binary_matrix - alpha - np.outer(beta, M)
 
    # Equation S12:
    # y_i(s) = alpha_i + epsilon_i(s)
    return alpha + epsilon
 
def find_distinct_evo(binary_matrix):
    """ Removes evolutionarily distinct sequences
 
    Calculates a correlation matrix for the strains as they relate to each
    other, and then removes those that are significantly different
 
    This is section S5
    """
    gamma = np.cov(binary_matrix.T)
    eigvals, vecs = np.linalg.eigh(gamma)
    vecs = vecs.T
 
    # Here there be dragons
    # Using the projections along eigenvector 2 and the cutoff of -.1 was
    # empirically determined. Your mileage may vary
    proj1 = [np.dot(gamma[i], vecs[-1]) for i in range(len(eigvals))]
    proj2 = [np.dot(gamma[i], vecs[-2]) for i in range(len(eigvals))]
    return [pos for pos, proj in enumerate(proj2) if proj > -.1]
 
def determine_sectors(correlation_matrix, lambda_cutoff):
    """ Determines the sectors of the protein
 
    See sections S6 and S7 of the supplementals.
 
    This function returns both the strongest eigenvalue at a given residue and
    a list of counters with the projection of each residue onto significant
    eigenvectors
    """
 
    eigvals, vecs = np.linalg.eigh(correlation_matrix)
 
    n_residues = n_vectors = len(eigvals)
 
    loadings = [Counter() for i in range(n_residues)]
 
    # removes the autocorrelations, which should typically be much higher than
    # the inter-residue correlations
    # This works by multiplying element-wise by the logical NOT of the
    # identity matrix, which zeroes out the diagonal
    othercorrs = abs(correlation_matrix
                     * (~ np.identity(n_residues, dtype=bool)))
 
    for r in range(n_residues):
        if max(othercorrs[r]) < 0.15:
            # "We chose to exclude from sectors those residues that did not
            # show any correlation higher than 0.1 (in absolute magnitude) with
            # any other sector residue"
            # I thought their threshold was a little low, since in a given
            # random matrix of about ~500x500, you would expect to see over 1000
            # correlations higher than that by chance, if you believe their P <
            # 5x10^-3 statistic
            continue
 
        for k in range(n_vectors):
            if eigvals[k] > lambda_cutoff:
                # The loading is simply the projection of the correlation values for
                # each residue onto a given eigenvector
                loading = np.dot(correlation_matrix[:,r], vecs[:,k])
                loadings[r][k] = abs(loading)
 
    best = [(l.most_common()[0][0] if (len(l) > 0) else None)
            for l in loadings]
 
    return best, loadings
 
#############################################################
### Main Program Start
#############################################################
 
gag_seq_file = '../data/HIV1_ALL_2009_GAG_PRO.fasta'
gag_data_full = read_fasta(gag_seq_file)
 
# We don't actually need the names of the sequences, and a list is more
# convenient for what we're doing than a dictionary. The check on the name
# keeps only the clade B strains.
gag_data = [gag_data_full[name] for name in gag_data_full
            if 'B' in name.split('.')[0]]
 
print "First Pass Filtering"
x = filter_and_generate_binary(gag_data)
 
distinct_strains = find_distinct_evo(x)
gag_data2 = [strain for idx, strain in enumerate(gag_data) if idx in
             distinct_strains]
 
rows, cols = np.shape(x)
print "Found %d locations in %d strains" % (rows, cols)
 
print "Second Pass filtering"
 
# The claim in S4 is that the method works best if we remove the contribution
# from evolutionarily distinct sequences, so we'll have to re-run once we've
# taken those out.
 
 
x = filter_and_generate_binary(gag_data2)
x = clean_phylogeny(x)
 
rows, cols = np.shape(x)
print "Found %d locations in %d strains" % (rows, cols)
 
 
print "Building matrix"
 
corr_matrix = np.corrcoef(x, bias=1)
 
 
# It actually takes a while for this to run, so I'm going to leave it commented
# out, and just put in the result that I happen to know it will give us.
 
#print "Finding Cutoff"
#eigs = find_cutoff(x)
#lambda_cutoff = max(eigs)
lambda_cutoff = 3.45
 
print "Cleaning matrix"
corr_matrix_clean = clean_matrix(corr_matrix, lambda_cutoff)
 
best, loadings = determine_sectors(corr_matrix_clean, lambda_cutoff)
 
 
 



Exercises


1) Writing Mathematical Functions
a) Write a function that accepts an array of floats. Return an array where every value of the input array is divided by 1.5.
b) Use the random functions to generate an array of floats. Write a function that accepts this array, and returns a list of values that are more than one standard deviation from the mean of the array.
c) Write a function that evaluates y = e^x for values from 0 to 10 and stores the values in an array. The function should also generate a random exponential sample of the same length. Sort the random sample, then return the sum of the distances at each point from the random sample to the exponential model.

2) Comparing the Sectors
At the end of all of this, we have identified which sector (if any) each site belongs to. Compare the sites we found to those that Dahirel et al. found. For instance, we might be curious how many sites we identified as significant that they didn't (and vice versa), and how many we agree on (hint: remember sets from the second day). Feel free to compare the sectors in other ways too.

For your convenience, I've pulled out the sets from the supplemental information and put them into python lists below:

# From Dahirel et al:
 
sec1 = [1, 88, 2, 94, 3, 97, 4, 99, 5, 100, 6, 108, 8, 118, 9, 120, 11, 122, 12,
        123, 14, 128, 16, 129, 19, 131, 20, 133, 21, 134, 24, 135, 27, 136, 29,
        138, 32, 139, 33, 141, 35, 142, 36, 143, 38, 144, 39, 145, 41, 148, 45,
        149, 48, 150, 50, 151, 51, 152, 52, 153, 57, 154, 60, 155, 63, 156, 73,
        158, 77, 160, 79, 251, 83, 276, 86, 279, 87, 433]
 
sec2 = [23, 446, 37, 450, 178, 452, 379, 455, 381, 457, 386, 459, 391, 461, 392,
        462, 393, 463, 394, 464, 395, 489, 399, 400, 402, 405, 406, 407, 408,
        412, 413, 414, 416, 417, 419, 420, 423, 430, 431, 432, 434, 435, 437,
        438, 439, 440, 442, 443, 444, 445]
 
sec3 = [53, 305, 140, 306, 163, 310, 167, 316, 169, 317, 170, 319, 171, 323,
        172, 326, 174, 338, 175, 344, 179, 345, 180, 346, 181, 182, 185, 186,
        187, 189, 191, 198, 199, 212, 221, 225, 229, 233, 240, 243, 245, 249,
        257, 260, 263, 265, 269, 284, 288, 291, 295, 347, 363, 364, 365, 366,
        367]
 
sec4 = [166, 197, 211, 222, 236, 237, 308, 318, 354, 396]
 
sec5 = [17, 31, 47, 137, 161, 261, 275, 278, 290, 298, 324, 334, 337, 343]
 
secQ = [18, 30, 54, 62, 69, 90, 125, 130, 146, 147, 159, 173, 176, 200, 218,
        219, 223, 224, 228, 230, 234, 242, 248, 252, 255, 256, 264, 267, 268,
        273, 280, 281, 286, 312, 341, 357, 362, 374, 375, 376, 401, 403]
 

3) Sorting by sector

Using some of the plotting commands we'll show you this afternoon, we can generate a heatmap of the correlation matrix.

$ ipython -pylab
Enthought Python Distribution -- http://www.enthought.com

Python 2.7.1 |EPD 7.0-2 (64-bit)| (r271:86832, Dec 3 2010, 15:56:20)
Type "copyright", "credits" or "license" for more information.

IPython 0.10.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object'. ?object also works, ?? prints more.

Welcome to pylab, a matplotlib-based Python environment.
For more information, type 'help(pylab)'.
In[1]: run correlation_matrix.py

In[2]: imshow(corr_matrix_clean, vmin=0, vmax=.25)


[Figures: 7.1-cleaned.png, our cleaned correlation matrix (left), and the corresponding heat map from the paper (right)]

We get the figure on the left, which looks only vaguely similar to the figure from the paper on the right.

Reading through the supplementals even more thoroughly, we see this line: "...the cleaned correlation matrices are represented by heat maps. However, the positions are grouped according to sites contained in each sector, not the linear sequence of the protein". If you have a list of sites you want to pull out of a correlation matrix, you can do something like:
corr_matrix_clean[np.meshgrid(sites, sites)]
to group those sites together and ignore the rest. (The function meshgrid returns two matrices whose entries are the x- and y-coordinates corresponding to its two arguments, and something called "fancy indexing" lets us pull just those sites out of the correlation matrix in the proper way.)
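
To see what that indexing trick does, here's a small sketch on a 4x4 matrix (with newer NumPy versions this style of indexing may complain, and np.ix_(sites, sites) is the more standard way to pull out the same block):

>>> m = np.arange(16).reshape((4, 4))
>>> m
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
>>> sites = [0, 2]
>>> m[np.meshgrid(sites, sites)]   # rows 0 and 2 crossed with columns 0 and 2
array([[ 0,  8],
       [ 2, 10]])

The result comes out as the transpose of the block you might have expected, which doesn't matter here because a correlation matrix is symmetric.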

Sort the sites in each sector according to its loading along its best eigenvector, combine the list of sites together using the numpy function hstack, then use imshow to plot only the sectors.


Solutions

1) Functions
# a)
def two_thirds(an_array):
    return an_array/1.5
 
# b)
from numpy import mean, std
from pylab import normal
def reject_outliers(an_array):
    middle = mean(an_array)
    spread = std(an_array)
    return an_array[((middle - spread) > an_array) | (an_array  > (middle + spread))]
    # the | combines the two boolean arrays element-wise
    ## OR ##
    ret = []
    for el in an_array:
        if abs(el - middle) > spread:
            ret.append(el)
    return ret
 
print reject_outliers(normal(0, 1, 1000))
 
# c)
 
from pylab import arange, exp, uniform
 
x = arange(0,10,.01)
y = exp(x)
 
z = uniform(0,10, len(y))
z = exp(z)
z.sort()  # note that z.sort() returns None
 
print sum(y - z) / len(y)
2) Comparing the sectors

secs = []
for sec in set(best):
    if sec is None:
        continue
    sites = np.where(np.array(best) == sec)[0]
    ## OR ##
    sites = []
    for res in range(len(best)):
        if best[res] == sec:
            sites.append(res)
    secs.append(sites)
 
 
secs_me = np.hstack(secs)
 
# From Dahirel et al:
 
sec1 = [1, 88, 2, 94, 3, 97, 4, 99, 5, 100, 6, 108, 8, 118, 9, 120, 11, 122, 12,
        123, 14, 128, 16, 129, 19, 131, 20, 133, 21, 134, 24, 135, 27, 136, 29,
        138, 32, 139, 33, 141, 35, 142, 36, 143, 38, 144, 39, 145, 41, 148, 45,
        149, 48, 150, 50, 151, 51, 152, 52, 153, 57, 154, 60, 155, 63, 156, 73,
        158, 77, 160, 79, 251, 83, 276, 86, 279, 87, 433]
 
sec2 = [23, 446, 37, 450, 178, 452, 379, 455, 381, 457, 386, 459, 391, 461, 392,
        462, 393, 463, 394, 464, 395, 489, 399, 400, 402, 405, 406, 407, 408,
        412, 413, 414, 416, 417, 419, 420, 423, 430, 431, 432, 434, 435, 437,
        438, 439, 440, 442, 443, 444, 445]
 
sec3 = [53, 305, 140, 306, 163, 310, 167, 316, 169, 317, 170, 319, 171, 323,
        172, 326, 174, 338, 175, 344, 179, 345, 180, 346, 181, 182, 185, 186,
        187, 189, 191, 198, 199, 212, 221, 225, 229, 233, 240, 243, 245, 249,
        257, 260, 263, 265, 269, 284, 288, 291, 295, 347, 363, 364, 365, 366,
        367]
 
sec4 = [166, 197, 211, 222, 236, 237, 308, 318, 354, 396]
 
sec5 = [17, 31, 47, 137, 161, 261, 275, 278, 290, 298, 324, 334, 337, 343]
 
secQ = [18, 30, 54, 62, 69, 90, 125, 130, 146, 147, 159, 173, 176, 200, 218,
        219, 223, 224, 228, 230, 234, 242, 248, 252, 255, 256, 264, 267, 268,
        273, 280, 281, 286, 312, 341, 357, 362, 374, 375, 376, 401, 403]
 
secs_them = sec1+sec2+sec3+sec4+sec5+secQ
 
me_only = set(secs_me).difference(secs_them)
them_only = set(secs_me).symmetric_difference(secs_them) - me_only
 
print "I did find   %3d that Dahirel et al didn't" % len(me_only)
print "Did not find %3d that Dahirel et al did" % len(them_only)
 
 
 


3) Sorting the sectors

bestloadings = []
for i,l in enumerate(loadings):
    if len(l) > 0:
        bestloadings.append(l.most_common()[0][1])
        # Append the Loading (rather than the number of the eigenvector) to
        # bestloadings
    else:
        bestloadings.append(0)
        # There wasn't a good loading here
 
bestloadings = np.array(bestloadings)
 
sorted_secs = []
for sec in secs:
    sec = np.array(sec)
    ls = bestloadings[sec]
    sorted_secs.append(sec[ls.argsort()])
 
seclist = np.hstack(sorted_secs)
 
imshow(corr_matrix_clean[meshgrid(seclist, seclist)])