This morning we did some pretty crazy math. My brain is still spinning. Bonus points for finding the eigenvector that best predicts how fast the rotation is! But anyway, it would be really nice if we could see some of the results of our fancy footwork! For this, we are going to turn to the graphing package matplotlib. This is not something that comes with the official Python distribution, but it is bundled with the Enthought distribution. We can use it to create publication-worthy figures directly, and also to export figures in a format that makes touching up with Illustrator or Inkscape a pain-free experience. In my experience, figure making is an incremental process that explores many combinations of colors, axis limits, labels, and graph types. For this reason we are going to be using IPython for the majority of the time.


For our first venture into plotting, let us revisit the PCA method. First make some random but correlated data in IPython. Helpful hint: use %cpaste:

import numpy as np
slope = .5
intercept = -10
x = np.arange(0,100)
y = slope*x + intercept
noise = 5 * np.random.randn(len(x))
y = y + noise
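Before plotting anything, it's worth confirming that the two arrays really are correlated. A quick sanity check with np.corrcoef (the seed here is our addition, purely for reproducibility):

```python
import numpy as np

np.random.seed(0)                 # seeding is an assumption, for reproducibility
slope, intercept = .5, -10
x = np.arange(0, 100)
y = slope*x + intercept + 5 * np.random.randn(len(x))

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is Pearson's r between x and y
r = np.corrcoef(x, y)[0, 1]
print(r)
```

With this noise level r comes out close to 1, so the scatter plot should look like a noisy line.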

And now let's make our first plot:

from matplotlib import pyplot as plt
plt.plot(x, y)

If all went well, a new window should pop up with a line plot of our data. By default, plt.plot() makes a line plot, with blue lines connecting the dots. This is not quite what we want, so clear the figure and try making a scatter plot:

plt.clf()
plt.scatter(x, y)

Notice that plt.scatter() returned an object, a matplotlib.collections.CircleCollection, that we can use to change things like color instead of clearing the whole figure.

points = plt.scatter(x,y,c='b',s=10, label='original') # this draws small blue circles
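To see the point of keeping that return value around, here is a minimal sketch (using the non-interactive Agg backend so it runs without a display; in IPython you would skip that line) that restyles the points through the collection instead of redrawing:

```python
import matplotlib
matplotlib.use('Agg')             # headless backend; skip this in IPython
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 100)
y = .5*x - 10 + 5 * np.random.randn(len(x))

points = plt.scatter(x, y, c='b', s=10, label='original')
points.set_color('g')             # recolor in place -- no plt.clf() needed
points.set_sizes([25])            # the collection controls marker size too
```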

Now continue to the next step of subtracting the mean from each data set:

x_adj = x - x.mean()
y_adj = y - y.mean()
plt.scatter(x_adj, y_adj, c='r', s=10,label='mean adjusted')

We can add vertical and horizontal lines through the origin to make it clear that the means have been subtracted:

plt.axhline(0, color='k')
plt.axvline(0, color='k')

Now that we have a nice example of normalized data, we can give the figure a title and save it:

plt.legend()   # show the labels we attached to each scatter call
plt.title('Raw and Normalized Data Sets')
plt.savefig('raw_and_normalized.png')   # any file name and format you like

Just for practice: write a function that takes two lists and calculates the covariance between them. Try using combinations of x, y, x_adj and y_adj and compare your answers to np.cov(). Then look up the NumPy help for eigenvectors, and plot the eigenvectors for your random data.
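One possible version of that practice function, with the caveat that np.cov defaults to the sample (n-1) covariance, so we divide the same way:

```python
import numpy as np

def covariance(a, b):
    """Sample covariance of two equal-length sequences
    (n-1 divisor, matching np.cov's default)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)

x = np.arange(0, 100)
y = .5*x - 10 + 5 * np.random.randn(len(x))

# np.cov returns the full 2x2 matrix; our function should match entry [0,1]
print(covariance(x, y), np.cov(x, y)[0, 1])
```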

cov_m = np.cov(x_adj, y_adj)
w, v = np.linalg.eigh(cov_m)   # eigenvalues ascending, eigenvectors as columns of v
egv1 = v[1,0]/v[0,0]           # slope (rise over run) of the eigenvector with the smaller eigenvalue
egv2 = v[1,1]/v[0,1]           # slope of the eigenvector with the larger eigenvalue
plt.scatter(x_adj, y_adj, c='r', s=10, label='mean adjusted')
xlim = plt.xlim()
egv_x = np.arange(xlim[0], xlim[1], 0.01)
plt.plot(egv_x, egv_x*egv1, linestyle='--', c='r', label='PC2')
plt.plot(egv_x, egv_x*egv2, linestyle='--', c='r', label='PC1')
plt.title('Normalized Data with Eigenvectors')
Hopefully this makes the concept of PCA a bit clearer.

More Plots!!!

Moving on! Our next goal is to replicate Figure 1A from the HIV paper. For this we are going to need to take finer control of the plotting procedure. First, though, we need the eigenvalues of the random matrices, which means restoring some commented-out code:

#print "Finding Cutoff"
#eigs = find_cutoff(x)
#lambda_cutoff = max(eigs)
print "Finding Cutoff"
eigs = find_cutoff(x)
lambda_cutoff = max(eigs)

Get a quick look at the data, and note the return value of plt.hist():

n, bins, patches = plt.hist(eigs)

We can control the number of bins or the positions of the bins by setting hist(x, bins=...) to an int or an array, respectively. hist() returns a tuple (n, bins, patches), where n holds the bin heights, bins the bin edges, and patches a list of graphics objects. Looking closely at Fig 1A, we want to plot frequencies instead of counts. Read the documentation to find the parameter that will do this.

n,bins,patches = plt.hist(eigs,bins=np.arange(0,11,0.05),normed=True)
# change the color without remaking the histogram
for p in patches:
    p.set_color('k')
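As an aside: recent matplotlib releases renamed the normed parameter, so if the call above fails, the frequency option is spelled density. A small check of the return tuple, assuming only NumPy-generated data:

```python
import matplotlib
matplotlib.use('Agg')             # headless backend
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)
# density=True is the modern spelling of normed=True
n, bins, patches = plt.hist(data, bins=50, density=True)

print(len(n), len(bins))          # 50 heights, 51 bin edges
print((n * np.diff(bins)).sum())  # the density integrates to 1 (up to float error)
```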

Now that the plot is getting close to what we want, let's look at the first part. First get the eigenvalues of the real data and plot them as a histogram.

eigvals, vecs = np.linalg.eigh(corr_matrix)
plt.hist(eigvals, bins=bins)

Looks nice, but the number of values in each bin doesn't match the figure, so redo it with smaller bins.

A bin width of 0.01 gives bins with 6 eigenvalues, 0.005 gives bins with 4 eigenvalues, and 0.002 finally gives bins with at most 2 eigenvalues per bin.
n,bins,patches = plt.hist(eigvals,bins=np.arange(0,11,0.002))

Now that we know how to make each plot, we need to take control of the figure. So far, we have been using a whole lot of default values and letting matplotlib deal with it. Now we want to make a long figure with two subplots, one on top of the other. We can do this in a very straightforward way:

fig = plt.figure(figsize=(10,5))
subp = fig.add_subplot(2,1,1)
n,bins,patches = subp.hist(eigvals,bins=np.arange(0,11,0.002),color='k')
subp2 = fig.add_subplot(2,1,2)
n2, bins2, patches2 = subp2.hist(eigs,bins=bins,normed=True,color='k')

Now we can change the axis limits on each subplot with the subp.set_xlim() method.
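For example (the limits here are placeholders; pick whatever window frames your eigenvalue histograms best):

```python
import matplotlib
matplotlib.use('Agg')             # headless backend
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 5))
subp = fig.add_subplot(2, 1, 1)
subp2 = fig.add_subplot(2, 1, 2)

# placeholder limits -- match them to your eigenvalue range
subp.set_xlim(0, 11)
subp2.set_xlim(0, 11)
```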

We can also set the tick locations and labels for each plot:
subp2.set_yticks([0.0, 0.5, 1.0])
subp.set_yticks([0.0, 1.0, 2.0])

And the last thing is to set the labels and title:
fig.suptitle('eigenvalues (Gag alignment)')


1. Create two paired data sets from a Poisson distribution with lambda=5 and n=1000: one pair that is correlated and one that is not. Then make a plot that looks like the following for each paired data set:
2. Calculate a z-score for the values in each data set and color the points red if the absolute value of the z-score is less than 0.05. Then use the YlOrRd cmap to make the points with the largest z-scores Yellow and the smallest z-scores Red. Z-scores can be calculated using scipy.

3. Plot the polymorphism rate by position for HIV-GAG using a bar chart and color the bars with the Blues cmap.

4. Plot the expression of eve along the A-P axis from this image
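For exercise 2, scipy.stats.zscore is the intended tool; as a sketch of what it computes, here is a NumPy-only equivalent (using scipy's default ddof=0):

```python
import numpy as np

def zscore(a):
    """NumPy equivalent of scipy.stats.zscore with its default ddof=0."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean()) / a.std()

data = np.random.poisson(5, 1000)
z = zscore(data)

# standardized data has mean ~0 and standard deviation ~1
print(z.mean(), z.std())
```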


import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
def plotScatterHist(x,y,fig_name):
    nullfmt = NullFormatter()         # no labels
    # definitions for the axes
    left, width = 0.1, 0.65
    bottom, height = 0.1, 0.65
    bottom_h = left_h = left+width+0.02
    rect_scatter = [left, bottom, width, height]
    rect_histx = [left, bottom_h, width, 0.2]
    rect_histy = [left_h, bottom, 0.2, height]
    # start with a rectangular Figure
    plt.figure(1, figsize=(8,8))
    axScatter = plt.axes(rect_scatter)
    axHistx = plt.axes(rect_histx)
    axHisty = plt.axes(rect_histy)
    # no labels on the marginal histograms' shared axes
    axHistx.xaxis.set_major_formatter(nullfmt)
    axHisty.yaxis.set_major_formatter(nullfmt)
    # the scatter plot:
    axScatter.scatter(x, y)
    # now determine nice limits by hand:
    binwidth = 1
    xymax = np.max([np.max(np.fabs(x)), np.max(np.fabs(y))]) + 1
    xymin = np.min([np.min(np.fabs(x)), np.min(np.fabs(y))]) - 1
    axScatter.set_xlim((xymin, xymax))
    axScatter.set_ylim((xymin, xymax))
    bins = np.arange(xymin, xymax, binwidth)
    axHistx.hist(x, bins=bins)
    axHisty.hist(y, bins=bins, orientation='horizontal')
    axHistx.set_xlim(axScatter.get_xlim())
    axHisty.set_ylim(axScatter.get_ylim())
    plt.savefig(fig_name)             # so the fig_name argument gets used
def main():
    # the random data
    x = np.random.poisson(5,1000)
    y = np.random.poisson(5,1000)
    m = np.random.poisson(5,1000)
    n = m + 5*np.random.randn(len(m))
    # file names here are arbitrary examples
    plotScatterHist(x, y, 'uncorrelated.png')
    plotScatterHist(m, n, 'correlated.png')

if __name__ == "__main__":
    main()

    # get z-scores -- this fragment belongs inside main(), and needs
    # from scipy.stats import zscore
    x_z = np.abs(zscore(x))
    y_z = np.abs(zscore(y))
    p_z = [np.max(i) for i in zip(x_z, y_z)]
    zipped = zip(x, y, p_z)
    smaller_z = np.array([(i,j) for (i,j,z) in zipped if z < 0.05])
    larger_z = np.array([(i,j) for (i,j,z) in zipped if z >= 0.05])
    # the scatter plot:
    # for just the small z-scores
    #axScatter.scatter(smaller_z[:,0], smaller_z[:,1], c='r')
    #axScatter.scatter(larger_z[:,0], larger_z[:,1], c='b')
    # for all z-scores

import numpy as np
import matplotlib.pyplot as plt
import as cm
import matplotlib.colors as colors
import correlation_matrix
from sequenceTools import read_fasta
def main():
    gag_seq_file = '../data/hiv/HIV1_ALL_2009_GAG_PRO.fasta'
    gag_data_full = read_fasta(gag_seq_file)
    # We don't actually need the names of the sequences, and a list is more
    # convenient for what we're doing than a dictionary
    gag_data = [gag_data_full[name] for name in gag_data_full
                if 'B' in name.split('.')[0]]
    print "First Pass Filtering"
    x = correlation_matrix.filter_and_generate_binary(gag_data)
    distinct_strains = correlation_matrix.find_distinct_evo(x)
    gag_data2 = [strain for idx, strain in enumerate(gag_data) if idx in distinct_strains]
    rows, cols = np.shape(x)
    print "Found %d locations in %d strains" % (rows, cols)
    print x, 'should be a binary matrix'
    polymorphism = np.array([sum(x[i,:])/float(cols) for i in range(rows)])
    print polymorphism, polymorphism.shape
    rectangles =, rows+1), polymorphism, width=1)
    # we need to normalize the data to 0..1 for the full
    # range of the colormap
    fracs = polymorphism/polymorphism.max()
    norm = colors.Normalize(fracs.min(), fracs.max())
    for thisfrac, thispatch in zip(fracs, rectangles):
        color = cm.Blues(norm(thisfrac))
        thispatch.set_facecolor(color)
if __name__ == "__main__":
    main()

from matplotlib import pyplot as plt
from matplotlib import image as mpimg
import numpy as np
strain = mpimg.imread('eve.png')
#gray = strain.mean(axis=2)
gray = strain[:,:,2]          # just the blue channel
gray = np.abs(gray - 1)       # invert so stain is bright on a dark background
# assuming the A-P axis runs left to right in the image, average each
# column to get an expression profile along the axis
profile = gray.mean(axis=0)
plt.plot(profile)