All By Ourselves: Mapping Short Reads to a Reference Genome


For the rest of this week, we'll be covering various applications of high-throughput sequencing. All of these different approaches (SNP finding, RNA-seq, ChIP-Seq, and many more) start with the same basic step: taking individual reads and figuring out where in a reference genome they might have come from. As we'll see, this task demands both CPU power and memory, and the most obvious solution is intractable on our puny little laptops. However, we can think about ways to store the data more intelligently, process only the necessary pieces of the data, and thereby increase our computational leverage.

Sometimes, the sequences of interest are not the reads that map precisely, but the ones that have mismatches. These are the reads that reveal the polymorphisms in a genome. This afternoon, we will use an optimized algorithm implemented in a program called BWA, which not only maps exactly-matching reads far more efficiently, but also extends the approach to identify reads representing polymorphisms.

For today, we'll be looking at data from yeast, trying to identify the specific sequence change responsible for a mutant phenotype. We've discovered some minor issues with the file already in the PythonCourse folder, so let's all download this slightly modified version:



First attempt



Getting Started

As usual, we'll start by assembling our toolbox of functions:

first_attempt.py

#!/usr/bin/python
from os import chdir
from sys import exit
from file_tools import open_file_by_mimetype
from sequence_tools import read_fasta, rev_comp
 
def map_to_reference_slow():
    pass

Then we'll want to set up our variables for managing data and files.

But first, let's consider what each file's contents should look like when we get them open. Recall that a FASTA file should have the names of sequences, followed in each case by the actual sequence.
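
For example, a minimal FASTA file (made-up names and sequences, just to show the layout) might look like this:

>chrI
ACATGCTAGCGATCGATCGGATCGATGCTAGCTAGCTAGG
>chrII
TTAGCATCGATCGATCGATTACGATCCATGGCATCGATCA

Our read_fasta function returns this as a dictionary keyed by sequence name ('chrI' and 'chrII' here), with each full sequence string as the corresponding value.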

About the FASTQ format, q-scores, and how we will use them


According to the wikipedia page for the FASTQ format, we can expect it to look like this:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
The first line starts with an '@' and then identifies the sequence on the second line.
The second line is the sequence read in 5' -> 3' orientation.
The third line begins with a '+' and is an optional description of the sequence, in this case omitted.
The fourth line is an encoded line representing the quality score for each base in the read.

FASTQ quality scores reflect the confidence the Illumina GA II sequencer has that the base reported at a given position is the correct base. Generally, the q-scores decrease toward the end of each read as the dynamics of the sequencing reaction become less stable. The q-scores are encoded by the single-character symbols on the fourth line in the example above. As we discussed in the first week, characters are stored as ASCII values on a scale from 0-255, so one character per base is a convenient way of storing the q-scores (imagine a tab-separated list of integers: ick!).
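
To make the encoding concrete, here's a quick sketch of decoding the quality string from the example above (the offset of 33 is what Sanger-style FASTQ files use; older Illumina pipelines used 64. Our filtering code below will simply work with the raw ASCII values):

quals = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
 
raw = [ord(c) for c in quals]      # raw ASCII values, one per base
phred = [q - 33 for q in raw]      # subtract the offset to get true Phred scores
 
print raw[:5]      # [33, 39, 39, 42, 40]
print phred[:5]    # [0, 6, 6, 9, 7]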

The q-scores are used to discard the reads in the dataset that do not meet certain criteria. In our case, we're going to eliminate reads with any of these traits:

  1. If the q-score is effectively zero, the base at that position is simply labeled 'N.' If the read contains any 'N's, we're tossing it.
  2. If the q-score at any point is below a given threshold (this will be variable), we toss the read.
  3. If the average q-score for the entire read is below a given threshold, we toss the read.

Typical Illumina HiSeq data are 50-bp reads, but depending on the machine and run settings it's possible to get shorter reads from older machines (like the GA II), or longer reads from either the HiSeq or the GA II.

Since we have a genome with roughly 12M bases, and a given random 12-mer occurs by chance only about once in every ~16.8M bases (i.e. 4^12 = 16,777,216), maybe we could get away with using only the first 12bp of the read (also usually the highest-quality 12bp)? But then again, genomes do not contain random distributions of nucleotides, nor of 12-mers, so it's probably safer to use something longer.

How long? Some of the data are 36bp reads, which should be fantastically unique, barring conserved duplications. We can set this as a variable, but 20bp is probably a good starting guess.
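
As a quick back-of-the-envelope check (a sketch using the ~12M base genome size from above), we can compare the expected number of chance occurrences of a single random k-mer for a few seed lengths:

genome_size = 12e6   # approximate yeast genome size, in bases
 
# expected number of chance matches for one random k-mer
for k in (12, 20, 36):
    print k, genome_size / 4 ** k
 
# 12  ~0.7    (marginal: random collisions are plausible)
# 20  ~1e-05  (essentially unique)
# 36  ~3e-15  (absurdly unique)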

Handling the Data

For the purposes of testing our code, we'll use a truncated version of a FASTQ file, with the first 40,000 lines (10,000 reads), which should be in your data folder:

So, let's add the code to open these files:

if __name__ == "__main__":
 
    # open the FASTQ file for processing
    fastqfn = '../data/SRR064545_sample_rev_1.fastq.gz'
    fastqfh = open_file_by_mimetype(fastqfn, 'r')
 
    # open the reference FASTA file for processing
    refgenomefn = '../data/yeastgenome.fa'
    refgenome = read_fasta(refgenomefn)
 
    plusstrand = refgenome
 
    # and we can use our rev_comp function to get the minus strand
    # Here I'm using a "dictionary comprehension". It's actually a fairly new
    # feature in python, but it works more or less like you'd expect a list
    # comprehension to work, except we use the colon to separate the key and value
    minusstrand = {key: rev_comp(plusstrand[key]) for key in plusstrand}
 

In the case of the genome file we just opened, we are only accessing a few megabytes of data (the whole yeast genome is roughly 12M bases), which is trivial for us to load into memory as two strands of DNA. However, the full FASTQ file in question is a roaring 5GB of data once it's opened, and even on high-memory servers (which our laptops are not), one must be careful about loading almost 5GB of data at a time.

Thus, it behooves us to implement a strategy that handles less-than-all of the data in the FASTQ file at one time. Since the fundamental unit of information in the FASTQ file is the sequence read, let's approach the problem by reading one read's worth of data at a time. And since we're hoping to count all of the reads and how many times they fall on each location of the genome, we'll need a data structure to keep track of things.

if __name__ == "__main__":
    # File reading code here
 
    # for each read worth of data, we'll keep track of each of the four lines of
    # data associated with the read as a list
    seq = [ '', '', '', '' ]
 
    # to store the location of each read, we'll have a dictionary keyed by read with
    # the values of each index it maps to.
    readmap = {}
    numreads = 0
 
    # we start with a while loop that never evaluates to false
    while True:
 
        # then we add a for loop to take four lines at a time
        for i in range(4):
            seq[i] = fastqfh.readline().strip()
        if '' == seq[0]:
            # if we run out of lines in the filehandle, it will keep returning us
            # blank lines; we can use this to exit the loop we're otherwise stuck
            # in.
            print "EXITING READFILE AT CHARACTER NUMBER:", fastqfh.tell()
            break
        else:
            numreads+=1
            if numreads % 100 == 0:
                print ".",
 
        # now that we have the data for a read, let's call the function to check the
        # read by qscores; we'll code this to return False if the read is discarded,
        # allowing the if statement to pass the read on to be mapped if it's not
        # discarded
        checkedseq = check_seq_by_qscore(seq)
        if checkedseq:
            # when we code the next function, we'll want it to consider the sequence
            # of the read, both strands of reference and we'll want the function to
            # add the read mapping to our dictionary of readmaps
            map_to_reference_slow(checkedseq[1][:20], plusstrand, minusstrand, readmap)
 
    # and we're now done with our multi-gigabyte data file so let's close the filehandle
    fastqfh.close()
 
 

At this point, we've got a basic control structure set up to handle our data; we haven't yet thought about how we'd like the data output, and more importantly, we haven't figured out how to actually check the reads or map them. So, let's take care of the two processing functions first.

Processing Data: Checking and Mapping the Reads

Our strategy to conserve memory usage is to check and map the reads one at a time, so let's begin with the checking (no need to map it if it's trash!).

We'll need a function that uses the read data to decide whether it is of reasonable quality or not, and then return either "yes, this is a good read," or, "no, toss it."

from numpy import mean
 
def check_seq_by_qscore(seq, meancutoff = 40, mincutoff = 25, minseqlength = 20):
    '''Examine a sequence from a FASTQ file and trim away poor quality bases, or
    discard the entire read.
 
    Accepts a sequence, which is a list of four lines from a FASTQ file:
    @SEQ_ID
    READ SEQUENCE
    +SEQUENCE DESCRIPTION
    QSCORES
 
    see:  http://en.wikipedia.org/wiki/FASTQ_format for formatting details.
 
    meancutoff is a threshold for the average q-score in the read.
    mincutoff is a threshold for the minimum q-score any given NT is allowed.
    (Both cutoffs are applied to the raw ASCII values of the encoded quality
    characters.)
    minseqlength is the shortest the read is allowed to become during trimming.
 
    Returns the checked (and possibly trimmed) sequence, or False for a
    rejected read.
    '''
 
    seqID = seq[0]
    read = seq[1]
    seqDesc = seq[2]
    qscores = seq[3]
 
    # if there's an "N", toss the read
    if 'N' in read:
        return False
 
    # evaluate the q-score values; start by converting the q-scores to a list
    # of their ASCII values.  The built-in function ord() does this conversion,
    # which we'll apply with a list comprehension:
    qscores = [ord(qscore) for qscore in qscores]
 
    # if the mean qscore is below the cutoff threshold, then toss the read
    if mean(qscores) < meancutoff:
        return False
 
    # if there are any NTs less than the mincutoff, make note of their positions
    # in the read sequence
    indices = []
    for idx, qscore in enumerate(qscores):
        if qscore < mincutoff:
            indices.append(idx)
            if idx < minseqlength:
                return False
    # if any of the bad positions fall before minseqlength, we've already
    # tossed the read in the loop above; otherwise, return the sequence with
    # the first bad position and all subsequent positions trimmed off, or, if
    # there are no bad positions, return the full sequence
 
    if indices == []:
        # this is the case where none of the bases had a bad q-score.  we
        # should convert the q-score values back into characters before we
        # send them back
        qscores = ''.join([ chr(i) for i in qscores ])
        return [ seqID, read, seqDesc, qscores ]
    else:
        # Otherwise, we trim the read down to keep only the good positions
        end = indices[0]
        qscores = ''.join([ chr(i) for i in qscores ])
        return [ seqID, read[:end], seqDesc, qscores[:end] ]
 

Okay, we can now ensure that every read will be checked, and then if it's not False, we'll pass it on to be mapped.

Deep breath, and let's consider how we're going to approach this task:

The reads should be able to map to either strand. So, in addition to accepting the read, this function needs access to the plusstrand and minusstrand strings we created earlier. In addition, since we're doing this one read at a time, it's a good idea to just pass along the readmap dictionary so that the function can modify the map as it goes.

While we're at it, we need to consider that though the read is only approximately 36bp, the sequence that is read comes from a DNA molecule of longer length. So, once we know what strand we're mapping the read to, we should extend it by the length of the average insert (the data is actually paired-end, so we could figure this out, but that's an extra step of complexity we don't really need to worry about).

def map_to_reference_slow(read, plus, minus, readmap):
    '''Accepts a read from single-end Illumina data, as well as two strings
    representing the plus and minus strands of the reference DNA, and a
    dictionary to store the readmap data.  This dictionary should have
    the following form after the first call of this function:
 
 
    readmap = { read: [  {} ,  {}  ] }
 
 
    The first-level keys are the strings of each read.  The values are two
    dictionaries, one for the plus strand, one for the minus strand.  Each
    dictionary is keyed by a tuple of the chromosome and index, with each
    having a value for the number of times that index has a read "fall" across
    it.
 
    I'm really harshing the Zen of Python here:
        Simple is better than complex.
        Complex is better than complicated.
        Flat is better than nested.
 
    Returns nothing if successful; the readmap is modified in place.
    '''
 
    from random import choice
    # choice() picks an item randomly from a list
 
 
    # we'll assume that our insert size is about 100nt, even though in reality
    # it's probably longer.  100nt will be a faster assumption than 250, which
    # is probably more accurate.
    insertlen = 100
 
 
    # for accounting purposes later, we'll want to know the length of the
    # strands
    strandlen = len(plus)
 
 
    # first is plus, second is minus.
    if not read in readmap.keys():
        readmap[read] = [ {}, {} ]
 
 
    # to find the map position, we're going to have to scan down the length of
    # the genome, recording not just the first, but every incidence of a match.
    # Once we figure out where they all could be, we'll pick one at random and
    # extend it.
 
 
    # The following code is quite sloppy, as we consider each strand separately,
    # then combine them before we choose a single index to map to.  This could
    # be done in a nested loop where we iterate over the list of the strands
    # (plus, minus), but since we've used a list of dictionaries
    # above, it's easier to hard-code the [0] for plus and [1] for minus instead
    # of using conditions inside the hypothetical loop.
 
 
    # idx will start at the first nt of the strand, and we'll use lastidx to
    # keep track of where the last match was each time through the while loop.
    # The plusidxlist is to keep track of all matches.
 
 
    # So, here's the plus strand:
 
    plusidxlist = []
 
 
    for chrom in plus:
        idx = 0
        lastidx = 0
        # the loop will not break until we're done finding matches
        while True:
            idx = plus[chrom].find(read, lastidx)
            # Find can take an argument telling you where to start looking
 
            # recall that find returns -1 if there's no match
            if idx == -1:
                idx = 0
                break
            # record the match (find() returns an absolute index, not an offset)
            plusidxlist.append((chrom, idx))
            # and now take the position we just appended and increment by one more
            lastidx = idx + 1
 
 
    # Now look at the minus strand:
    minusidxlist = []
 
    for chrom in minus:
        idx = 0
        lastidx = 0
        while True:
            idx = minus[chrom].find(read, lastidx)
            if idx == -1:
                idx = 0
                break
            # except, we're going to store the minus strand as a negative value, so
            # that we can compare to the plus strand
            minusidxlist.append( (chrom, -idx))
            lastidx = idx + 1
 
 
 
 
    # Now we have two lists of matches for each read, and we need to choose one
    # index at random for the actual mapping.  The mappings could be on either,
    # neither, or both strands.  We must consider each case:
 
 
    # check for the neither case; if there was at least one match, pick one of
    # the indices from all the possibilities at random
    if (plusidxlist + minusidxlist) != []:
        chrom, idx = choice(plusidxlist + minusidxlist)
    else:
        return
        # by returning, we don't execute any of the rest of the code in this
        # function. Do not pass go, do not collect $200.
 
 
    if idx > 0:
        for i in range(insertlen):
            if (chrom, idx + i) not in readmap[read][0].keys():
                readmap[read][0][(chrom, idx + i)] = 1
            else:
                readmap[read][0][(chrom, idx + i)] += 1
    elif idx < 0:
        for i in range(insertlen):
            if (chrom, idx + i) not in readmap[read][1].keys():
                readmap[read][1][(chrom, idx + i)] = 1
            else:
                readmap[read][1][(chrom, idx + i)] += 1
 
 
    # note there's a special case here, where idx = 0, and we can't tell which
    # end of the genome it's at.  I'm happy just tossing this one too, instead
    # of worrying about 2 NTs of data.
    else:
        return
 

Updated first_attempt.py

#!/usr/bin/python
from os import chdir
from file_tools import open_file_by_mimetype
from sequence_tools import read_fasta, rev_comp
from numpy import mean
 
def check_seq_by_qscore(seq, meancutoff = 40, mincutoff = 25, minseqlength = 20):
    '''Examine a sequence from a FASTQ file and trim away poor quality bases, or
    discard the entire read.
 
    Accepts a sequence, which is a list of four lines from a FASTQ file:
    @SEQ_ID
    READ SEQUENCE
    +SEQUENCE DESCRIPTION
    QSCORES
 
    see:  http://en.wikipedia.org/wiki/FASTQ_format for formatting details.
 
    meancutoff is a threshold for the average q-score in the read.
    mincutoff is a threshold for the minimum q-score any given NT is allowed.
    (Both cutoffs are applied to the raw ASCII values of the encoded quality
    characters.)
    minseqlength is the shortest the read is allowed to become during trimming.
 
    Returns the checked (and possibly trimmed) sequence, or False for a
    rejected read.
    '''
 
    seqID = seq[0]
    read = seq[1]
    seqDesc = seq[2]
    qscores = seq[3]
 
    # if there's an "N", toss the read
    if 'N' in read:
        return False
 
    # evaluate the q-score values; start by converting the q-scores to a list
    # of their ASCII values.  The built-in function ord() does this conversion,
    # which we'll apply with a list comprehension:
    qscores = [ord(qscore) for qscore in qscores]
 
    # if the mean qscore is below the cutoff threshold, then toss the read
    if mean(qscores) < meancutoff:
        return False
 
    # if there are any NTs less than the mincutoff, make note of their positions
    # in the read sequence
    indices = []
    for idx, qscore in enumerate(qscores):
        if qscore < mincutoff:
            indices.append(idx)
            if idx < minseqlength:
                return False
    # if any of the bad positions fall before minseqlength, we've already
    # tossed the read in the loop above; otherwise, return the sequence with
    # the first bad position and all subsequent positions trimmed off, or, if
    # there are no bad positions, return the full sequence
 
    if indices == []:
        # this is the case where none of the bases had a bad q-score.  we
        # should convert the q-score values back into characters before we
        # send them back
        qscores = ''.join([ chr(i) for i in qscores ])
        return [ seqID, read, seqDesc, qscores ]
    else:
        # Otherwise, we trim the read down to keep only the good positions
        end = indices[0]
        qscores = ''.join([ chr(i) for i in qscores ])
        return [ seqID, read[:end], seqDesc, qscores[:end] ]
 
 
def map_to_reference_slow(read, plus, minus, readmap):
    '''Accepts a read from single-end Illumina data, as well as two strings
    representing the plus and minus strands of the reference DNA, and a
    dictionary to store the readmap data.  This dictionary should have
    the following form after the first call of this function:
 
 
    readmap = { read: [  {} ,  {}  ] }
 
 
    The first-level keys are the strings of each read.  The values are two
    dictionaries, one for the plus strand, one for the minus strand.  Each
    dictionary is keyed by a tuple of the chromosome and index, with each
    having a value for the number of times that index has a read "fall" across
    it.
 
    I'm really harshing the Zen of Python here:
        Simple is better than complex.
        Complex is better than complicated.
        Flat is better than nested.
 
    Returns nothing if successful; the readmap is modified in place.
    '''
 
    from random import choice
    # choice() picks an item randomly from a list
 
 
    # we'll assume that our insert size is about 100nt, even though in reality
    # it's probably longer.  100nt will be a faster assumption than 250, which
    # is probably more accurate.
    insertlen = 100
 
 
    # for accounting purposes later, we'll want to know the length of the
    # strands
    strandlen = len(plus)
 
 
    # first is plus, second is minus.
    if not read in readmap.keys():
        readmap[read] = [ {}, {} ]
 
 
    # to find the map position, we're going to have to scan down the length of
    # the genome, recording not just the first, but every incidence of a match.
    # Once we figure out where they all could be, we'll pick one at random and
    # extend it.
 
 
    # The following code is quite sloppy, as we consider each strand separately,
    # then combine them before we choose a single index to map to.  This could
    # be done in a nested loop where we iterate over the list of the strands
    # (plus, minus), but since we've used a list of dictionaries
    # above, it's easier to hard-code the [0] for plus and [1] for minus instead
    # of using conditions inside the hypothetical loop.
 
 
    # idx will start at the first nt of the strand, and we'll use lastidx to
    # keep track of where the last match was each time through the while loop.
    # The plusidxlist is to keep track of all matches.
 
 
    # So, here's the plus strand:
 
    plusidxlist = []
 
 
    for chrom in plus:
        idx = 0
        lastidx = 0
        # the loop will not break until we're done finding matches
        while True:
            idx = plus[chrom].find(read, lastidx)
            # Find can take an argument telling you where to start looking
 
            # recall that find returns -1 if there's no match
            if idx == -1:
                idx = 0
                break
            # record the match (find() returns an absolute index, not an offset)
            plusidxlist.append((chrom, idx))
            # and now take the position we just appended and increment by one more
            lastidx = idx + 1
 
 
    # Now look at the minus strand:
    minusidxlist = []
 
    for chrom in minus:
        idx = 0
        lastidx = 0
        while True:
            idx = minus[chrom].find(read, lastidx)
            if idx == -1:
                idx = 0
                break
            # except, we're going to store the minus strand as a negative value, so
            # that we can compare to the plus strand
            minusidxlist.append( (chrom, -idx))
            lastidx = idx + 1
 
 
 
 
    # Now we have two lists of matches for each read, and we need to choose one
    # index at random for the actual mapping.  The mappings could be on either,
    # neither, or both strands.  We must consider each case:
 
 
    # check for the neither case; if there was at least one match, pick one of
    # the indices from all the possibilities at random
    if (plusidxlist + minusidxlist) != []:
        chrom, idx = choice(plusidxlist + minusidxlist)
    else:
        return
        # by returning, we don't execute any of the rest of the code in this
        # function. Do not pass go, do not collect $200.
 
 
    if idx > 0:
        for i in range(insertlen):
            if (chrom, idx + i) not in readmap[read][0].keys():
                readmap[read][0][(chrom, idx + i)] = 1
            else:
                readmap[read][0][(chrom, idx + i)] += 1
    elif idx < 0:
        for i in range(insertlen):
            if (chrom, idx + i) not in readmap[read][1].keys():
                readmap[read][1][(chrom, idx + i)] = 1
            else:
                readmap[read][1][(chrom, idx + i)] += 1
 
 
    # note there's a special case here, where idx = 0, and we can't tell which
    # end of the genome it's at.  I'm happy just tossing this one too, instead
    # of worrying about 2 NTs of data.
    else:
        return
 
 
 
 
## Begin the main program
if __name__ == "__main__":
 
    # open the FASTQ file for processing
    fastqfn = '../data/SRR064545_sample_rev_1.fastq.gz'
    fastqfh = open_file_by_mimetype(fastqfn, 'r')
 
    # open the reference FASTA file for processing
    refgenomefn = '../data/yeastgenome.fa'
    refgenome = read_fasta(refgenomefn)
 
    plusstrand = refgenome
 
    # and we can use our rev_comp function to get the minus strand
    # Here I'm using a "dictionary comprehension". It's actually a fairly new
    # feature in python, but it works more or less like you'd expect a list
    # comprehension to work, except we use the colon to separate the key and value
    minusstrand = {key: rev_comp(plusstrand[key]) for key in plusstrand}
 
    # for each read worth of data, we'll keep track of each of the four lines of
    # data associated with the read as a list
    seq = [ '', '', '', '' ]
 
    # to store the location of each read, we'll have a dictionary keyed by read with
    # the values of each index it maps to.
    readmap = {}
    numreads = 0
 
    # we start with a while loop that never evaluates to false
    while True:
 
        # then we add a for loop to take four lines at a time
        for i in range(4):
            seq[i] = fastqfh.readline().strip()
        if '' == seq[0]:
            # if we run out of lines in the filehandle, it will keep returning us
            # blank lines; we can use this to exit the loop we're otherwise stuck
            # in.
            print "EXITING READFILE AT CHARACTER NUMBER:", fastqfh.tell()
            break
        else:
            numreads+=1
            if numreads % 100 == 0:
                print ".",
 
        # now that we have the data for a read, let's call the function to check the
        # read by qscores; we'll code this to return False if the read is discarded,
        # allowing the if statement to pass the read on to be mapped if it's not
        # discarded
        checkedseq = check_seq_by_qscore(seq)
        if checkedseq:
            # when we code the next function, we'll want it to consider the sequence
            # of the read, both strands of reference and we'll want the function to
            # add the read mapping to our dictionary of readmaps
            map_to_reference_slow(checkedseq[1][:20], plusstrand, minusstrand, readmap)
 
    # and we're now done with our multi-gigabyte data file so let's close the filehandle
    fastqfh.close()
 
    # Output the data
 
    indices = [ {}, {} ]
    for strand in [0, 1]:
        for read in readmap:
            for idx in readmap[read][strand]:
                if idx in indices[strand].keys():
                    indices[strand][idx] += readmap[read][strand][idx]
                else:
                    indices[strand][idx] = readmap[read][strand][idx]
 
 
    outfh = open('outfile.map', 'w')
 
 
    for idx in sorted(indices[0].keys()):
        outfh.write('+' + str(idx) + '\t' + str(indices[0][idx]) + '\n')
    for idx in sorted(indices[1].keys()):
        outfh.write('-' + str(idx) + '\t' + str(indices[1][idx]) + '\n')
 
    outfh.close()
 


I've also put in some code that will output to a file, though our implementation is so limited we won't actually use this data.

Now, if we run this on even a small subset of the data, we can see that it's going to take entirely too long. Using our shortened file of 10,000 reads, this takes tens of minutes. I've taken the liberty of timing it on a comically small version of the data, holding just 250 reads, and with time we can see how it performs:

$ time python first_attempt.py
EXITING READFILE AT CHARACTER NUMBER: 62100

real 0m38.858s
user 0m36.980s
sys 0m0.263s

So, at 38 seconds for 250 reads, it should take something like 3 weeks to get through 11,000,000 reads. Any ideas where we went wrong (okay, maybe the function name was a hint)? If not, fear not: there's an easy way to find out: the Python profiler. Profilers exist for most languages, and in short, they help you discover where the majority of the work in your code is taking place. We can invoke the profiler like so:

$ python -m cProfile --sort=time first_attempt.py
EXITING READFILE AT CHARACTER NUMBER: 62100
         1206588 function calls (1206501 primitive calls) in 37.247 seconds
 
   Ordered by: internal time
 
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   214859   23.808    0.000   23.808    0.000 {method 'find' of 'str' objects}
        1    6.470    6.470   37.248   37.248 first_attempt.py:2(<module>)
       17    2.896    0.170    2.896    0.170 sequencetools.py:1(rev_comp)
    43058    2.136    0.000    2.136    0.000 {method 'keys' of 'dict' objects}
   202646    0.599    0.000    0.877    0.000 gzip.py:430(readline)
        1    0.350    0.350    1.416    1.416 sequencetools.py:27(read_fasta)
      239    0.139    0.001   23.887    0.100 first_attempt.py:85(map_to_reference_slow)
   202777    0.086    0.000    0.086    0.000 {method 'startswith' of 'str' objects}
     3694    0.077    0.000    0.077    0.000 {built-in method decompress}
        1    0.062    0.062    0.064    0.064 __init__.py:38(<module>)
   203890    0.059    0.000    0.059    0.000 {method 'strip' of 'str' objects}
        2    0.049    0.024    0.118    0.059 __init__.py:2(<module>)
        2    0.037    0.019    0.217    0.108 __init__.py:1(<module>)
        1    0.030    0.030    0.035    0.035 numeric.py:1(<module>)
   208480    0.027    0.000    0.027    0.000 {method 'append' of 'list' objects}
     3694    0.024    0.000    0.044    0.000 gzip.py:319(_add_read_data)
        2    0.024    0.012    0.024    0.012 {sorted}
     3695    0.023    0.000    0.178    0.000 gzip.py:232(read)
     4009    0.023    0.000    0.023    0.000 {method 'join' of 'str' objects}
    21400    0.019    0.000    0.019    0.000 {method 'write' of 'file' objects}
     3695    0.018    0.000    0.018    0.000 {zlib.crc32}
        1    0.018    0.018    0.342    0.342 __init__.py:106(<module>)
        1    0.018    0.018    0.037    0.037 index_tricks.py:1(<module>)
      250    0.018    0.000    0.048    0.000 first_attempt.py:7(check_seq_by_qscore)
     3695    0.016    0.000    0.146    0.000 gzip.py:269(_read)
        1    0.011    0.011    0.014    0.014 npyio.py:1(<module>)
        1    0.011    0.011    0.011    0.011 __init__.py:44(<module>)
     3702    0.010    0.000    0.010    0.000 {method 'read' of 'file' objects}
        1    0.010    0.010    0.022    0.022 __init__.py:15(<module>)
      251    0.010    0.000    0.010    0.000 {numpy.core.multiarray.array}

If we look at the first few lines of this mess, we can see that nearly all of the time is spent in our map_to_reference_slow() function, and in particular in the enormous number of find() calls it makes. So, let's reconsider our design, as this approach isn't going to get us to the proverbial promised land.

Second Attempt


In our first attempt, we spent a lot of time searching for each read on each strand of the genome. As an alternative to scanning the entire genome as a string so many times, let's store the data as a dictionary. What do I mean exactly? Well, it's gonna sound crazy, I admit. But, let's make a dictionary of every, say, 20-nucleotide segment of the genome. Granted, our reads are 36 bp long, but since a random 20-mer should appear once in every 4^20, or 1,099,511,627,776 nucleotides (~1 trillion), I think we'll be safe using only the first 20nt. These should be very high quality bases, once we throw out the reads with N's in them. In short, we'll see that while we use more memory, cutting out all those find() calls over millions of characters of genome sequence will really save us time. We'll just make a dictionary recording the position of every k-mer observed in the genome (the seed length will be a variable), then count how many times a read falls on each position.

Mini Informative Interlude
This works, by the way, because of how dictionaries are implemented. There's a big field of computer science that goes into analyzing how long certain operations take, as a function of how big a dataset you're working on, and the Python types have well defined performance characteristics. In particular, we see that dictionaries are implemented so that as they get bigger, on average it still takes a constant amount of time to check whether any particular element is in the dictionary. By comparison, as a string gets bigger, the amount of time it takes to find another string inside that one increases with the length of the first string. (By the way, there was a point a couple days ago where I converted things to a set, and you all (rightly) pointed out that there wouldn't be any duplicates. However, by converting from a list to a set, I was able to speed up a check for whether an item is in it, which is O(n) for lists, but O(1) for sets)
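
Here's a quick illustration of that difference (a rough sketch; the exact timings will vary from machine to machine, but the trend should be dramatic):

import timeit
from random import choice
 
# build a 100,000-character "genome" and a set of all of its 20-mers
genome = ''.join(choice('ACGT') for _ in range(100000))
kmers = set(genome[i:i + 20] for i in range(len(genome) - 20))
query = genome[50000:50020]
 
# searching the string has to scan it; checking the set is a single hash lookup
print timeit.timeit(lambda: query in genome, number=1000)
print timeit.timeit(lambda: query in kmers, number=1000)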

Aaaanyways:

Let's replace our map_to_reference_slow() function with a faster (and much shorter!) one. We'll also need a quick function to generate the dictionary to look up all the read positions with.

def make_kmer_dict(strand, readlen = 20):
    '''Record the position of every read of length readlen in the genome'''
 
    kmerdict = {}
 
    for chrom in strand:
        print "Making dict from ", chrom
        kmerdict[chrom] = {}
        for i in range(len(strand[chrom]) - readlen):
            seq = strand[chrom][i: i + readlen]
            if seq in kmerdict[chrom]:
                kmerdict[chrom][seq].append(i)
            else:
                kmerdict[chrom][seq] = [i]
 
    return kmerdict

And now for our updated map_to_reference()

def map_to_reference(read, kmerdict, indices, nomaps, insertlen=100):
    '''Map a single read to the reference using the k-mer dictionary.

    indices (a dict of per-chromosome position counts) and nomaps (a set of
    seed sequences that failed to map) are both modified in place.
    '''

    # Figure out the correct seed length from the data in kmerdict
    achromdict = kmerdict.values()[0]
    aread = achromdict.iterkeys().next()
    readlen = len(aread)
 
 
    read = read[0:readlen]
 
    pluslocs = [(chrom, kmerdict[chrom][read]) \
                for chrom in kmerdict          \
                if read in kmerdict[chrom]]
 
    rev_read = rev_comp(read)
 
    minuslocs = [(chrom, kmerdict[chrom][rev_read]) \
                 for chrom in kmerdict          \
                 if rev_read in kmerdict[chrom]]
    locs = []
 
    for chrom, sites in pluslocs:
        for site in sites:
            locs.append( (chrom, site) )
    for chrom, sites in minuslocs:
        for site in sites:
            locs.append( (chrom, -site) )
 
    if not locs:
        nomaps.add(read)
        return
 
    chrom, idx = choice(locs)
 
    if chrom not in indices:
        indices[chrom] = {}
 
    for i in range(insertlen):
        # the counts are stored per chromosome, so check indices[chrom]
        if (idx + i) in indices[chrom]:
            indices[chrom][idx + i] += 1
        else:
            indices[chrom][idx + i] = 1
    return
 
 

And our new main body:

second_attempt.py
#!/usr/bin/python
import sys
from random import choice
from file_tools import open_file_by_mimetype
from sequence_tools import read_fasta, rev_comp
from numpy import mean
 
# check_seq_by_qscore(), make_kmer_dict(), and map_to_reference() from above
# are assumed to live in this same file
 
if __name__ == "__main__":
 
    # open the FASTQ file for processing
    fastqfn = '../data/SRR064545_sample_rev_1.fastq.gz'
    fastqfh = open_file_by_mimetype(fastqfn, 'r')
 
    # open the reference FASTA file for processing
    refgenomefn = '../data/yeastgenome.fa'
    refgenome = read_fasta(refgenomefn)
 
 
    readlen = 12
 
    # for each read worth of data, we'll keep track of each of the four lines of
    # data associated with the read as a list
    seq = [ '', '', '', '' ]
 
    indices = {}
    nomaps = set()
    badreads = []
 
    kmerdict = make_kmer_dict(refgenome, readlen)
    numreads = 0
    print "k-mer dict done"
    print "Mapping reads"
 
    while True:
        for i in range(4):
            seq[i] = fastqfh.readline().strip()
 
        if '' == seq[0]:
            print "EXITING READFILE AT CHARACTER NUMBER:", fastqfh.tell()
            break
 
        # now, check the quality of the read.  If it's good, map it; if not,
        # append it to badreads.
 
        checkedseq = check_seq_by_qscore(seq)
 
        if checkedseq:
            map_to_reference(checkedseq[1], kmerdict, indices, nomaps)
        else:
            badreads.append(seq[1])
 
        numreads += 1
        if numreads % 1000 == 0:
            print '.',
            sys.stdout.flush()
 
 
 
    # and we're now done with our multi-gigabyte data file so let's close the filehandle
    fastqfh.close()
 
    # Output the data
 
    print "###########################'\nMapping Summary:"
    print len(badreads), "reads were rejected by the quality check.\n"
    print badreads[0:10], "are the first ten bad reads.\n"
 
 
    print len(nomaps), "reads passed quality check, but failed to map.\n"
    print list(nomaps)[0:10], "are the first ten no-map seeds.\n"
 
 
    # each mapped read increments the count at insertlen (100) positions, so
    # total coverage divided by 100 gives the number of mapped reads
    print sum(sum(chrommap.values()) for chrommap in indices.values())/100,
    print "reads successfully mapped.\n"
 
 
    outfh = open('outfile_fast', 'w')
 
 
    for chrom in indices:
        for i in sorted(indices[chrom].keys()):
            outfh.write(chrom + '\t' + str(i) + '\t' + str(indices[chrom][i]) +
                        '\t' + refgenome[chrom][i:i+20] + '\n')
 
    outfh.close()
 



If we run the profiler on this, we get the following results:
         2610472 function calls (2610014 primitive calls) in 52.099 seconds
 
   Ordered by: internal time
 
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   44.862   44.862   45.249   45.249 secondattempt.py:72(make_kmer_dict)   *******************
        1    2.434    2.434   52.101   52.101 secondattempt.py:1(<module>)
   875164    0.830    0.000    0.830    0.000 {method 'write' of 'file' objects}
     9547    0.801    0.000    0.962    0.000 secondattempt.py:91(map_to_reference)  *******************
    10000    0.437    0.000    1.168    0.000 secondattempt.py:6(check_seq_by_qscore)
    20257    0.420    0.000    0.420    0.000 {range}
     9648    0.274    0.000    0.274    0.000 {numpy.core.multiarray.array}
        1    0.258    0.258    0.449    0.449 sequence_tools.py:20(read_fasta)
       17    0.204    0.012    0.204    0.012 {sorted}
    40004    0.153    0.000    0.230    0.000 gzip.py:430(readline)
     9610    0.130    0.000    0.130    0.000 {method 'mean' of 'numpy.ndarray' objects}
   345172    0.127    0.000    0.127    0.000 {chr}
   242909    0.089    0.000    0.089    0.000 {method 'strip' of 'str' objects}
   202791    0.076    0.000    0.076    0.000 {method 'startswith' of 'str' objects}
        1    0.071    0.071    0.074    0.074 __init__.py:38(<module>)
     9610    0.066    0.000    0.545    0.000 fromnumeric.py:2299(mean)
     9610    0.056    0.000    0.479    0.000 fromnumeric.py:32(_wrapit)
        2    0.056    0.028    0.133    0.066 __init__.py:2(<module>)
        3    0.055    0.018    0.306    0.102 __init__.py:1(<module>)
   346166    0.048    0.000    0.048    0.000 {ord}
     9547    0.041    0.000    0.089    0.000 sequence_tools.py:3(rev_comp)
...

I've put the *'s in there to highlight those two lines, but we see that we're spending the vast majority of our time on making the dictionary. Our 10,000 reads take only a couple of seconds to map. If we scale that up to 11 million reads instead of 10 thousand, we're starting to talk about tens of minutes, but that's still way better than 3 weeks! By doing some extra work up front, we've saved ourselves a huge amount of time.

Mapping Polymorphisms

Okay, so, keep in mind: we've learned a lot about managing our data and accomplishing a non-trivial task here, but this isn't the biological goal of this project. We want to see the reads that don't map precisely, that is, the ones with polymorphisms compared to the reference genome. In order to find these particular mismatches, we need to solve a much harder problem: finding matches that are close, but not exact.

Let's stop for a minute and talk about how we might accomplish this more difficult task.
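
One idea to get the discussion started (just a sketch, and not the approach BWA actually takes): keep using an exact-match seed to find candidate locations, then compare the rest of the read against the reference and count the mismatches.

def count_mismatches(read, refseq, start):
    '''Count mismatched bases between a read and the reference sequence,
    starting at a candidate position found by the exact-match seed.'''
    segment = refseq[start:start + len(read)]
    return sum(1 for a, b in zip(read, segment) if a != b)
 
# hypothetical usage: seed with the first 20nt exactly (as before), then allow,
# say, up to 2 mismatches across the full read:
# if count_mismatches(read, plusstrand[chrom], idx) <= 2:
#     record the mapping, and note where the mismatches fall

Notice the limitation: this only finds mismatches outside the seed, so any polymorphism in the first 20nt still leaves the read unmappable. That's part of why a smarter algorithm is worth the trouble.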





Exercises


1. Parsing Genome Annotations

The standard format for dealing with genome annotations is the GFF (General Feature Format). It will be quite helpful for the next few days (and possibly the rest of your career) to be able to slice and dice these at will, so let's try some simple examples:

a. Download the GFF file for the yeast genome: Yeast Genome GFF
b. Write a program that will read this GFF file into a list of lists that contains the chromosome, start, stop, and strand information for each annotation.
c. Extend your program to write out (either to the command line or a file) a 4 column listing of all of the mRNAs.
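
For reference, GFF is a tab-separated, 9-column format (seqid, source, type, start, stop, score, strand, frame, attributes). A couple of made-up lines, just to show the layout:

chrI    SGD    gene    1000    2500    .    +    .    ID=YEXAMPLE1
chrI    SGD    mRNA    1000    2500    .    +    .    ID=YEXAMPLE1_mRNA;Parent=YEXAMPLE1

The code below pulls out columns 0, 3, 4, 6, and 2 (chromosome, start, stop, strand, and type).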

 
fh = open('saccharomyces_cerevisiae.gff')
 
annotation_list = []
 
for line in fh:
    if line.startswith("###"):  
        # This marks the end of the gff table, and sequences follow
        break
    if line[0] == "#":
        # This line is a comment
        continue
 
    line = line.strip()
    columns = line.split('\t')
 
    chr = columns[0]
    start = int(columns[3])
    stop = int(columns[4])
    strand = columns[6]
    type = columns[2]
 
    annotation_list.append([chr, start, stop, strand, type])
 
print annotation_list
 
for annotation in annotation_list:
    chr, start, stop, strand, type = annotation
    if type == "mRNA":
        print "%s\t%d\t%d\t%s" % (chr, start, stop, strand)