Using BWA for read mapping: twice the accuracy in half the time

Goal: find mutations! Like they did here

This morning we wrote a basic read mapper that could map reads to both strands of the genome, but only if the read matched exactly along its entire length. In a perfect world where machines never miscalled bases and DNA polymerases never inserted the wrong nucleotide or slipped on the template, this would be a slow but acceptable solution. However, we would like to graduate within a decade, and those little polymerase mess-ups make life so much more interesting, so let's find a way to map reads with mismatches.

There are several ways to map reads with mismatches, but the basic approaches build on something like the key:value dictionaries we used for the second read mapper. If we assume that some smaller part of the read will match the genome exactly, we can find those positions quickly and then use some other, slower method to align the rest. The two main concepts used are suffix arrays (via the Burrows-Wheeler Transform) and a seed-and-extend approach.
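To make the seed-and-extend idea concrete, here is a toy sketch (the genome string, read, and parameter values are made up for illustration): index every k-mer in the genome with a dictionary, look up an exact seed from the start of the read, and then check the remainder of the read at each candidate position while tolerating a few mismatches.

```python
def seed_and_extend(read, genome, seed_len=5, max_mismatches=1):
    # index every seed_len-mer in the genome (the key:value dictionary step)
    index = {}
    for i in range(len(genome) - seed_len + 1):
        index.setdefault(genome[i:i + seed_len], []).append(i)
    hits = []
    # look up the exact seed from the start of the read...
    for start in index.get(read[:seed_len], []):
        candidate = genome[start:start + len(read)]
        if len(candidate) < len(read):
            continue  # read would run off the end of the genome
        # ...then 'extend': count mismatches over the rest of the read
        mismatches = sum(1 for a, b in zip(read, candidate) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits

print(seed_and_extend('ATGGCT', 'AAAAATGGCAAAAA'))  # one mismatch -> [4]
```

Real mappers index the genome once and use far more clever seeds, but the dictionary lookup followed by a slower check is the same shape.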

The Burrows-Wheeler Transform

The Burrows-Wheeler Transform is a technique that was developed for compressing text files and is the core of the bzip2 utility, an alternative to gzip. It takes a long string of characters, in our case nucleotides, and generates every rotation of it: the first character is moved to the end, repeatedly, and each new string is saved in a table. The whole table is then sorted lexicographically, which puts all the rotations starting with 'A' next to each other, then all the 'C's, and so on. Because we know how the rotations were constructed and sorted, the last column of the sorted table is enough to reconstruct the original string, and together with the sort order it acts like a suffix array. When we are mapping a read 'ATGGC', it would be nice not to have to look at every position but to start matching only at the genome positions that begin with 'A'. With the sorted, transformed genome we can do exactly that, and to make it really fast all we need is a little lookup table that tells us where the 'A's start. It is like looking in a phone book for 'Combs' and knowing that the 'C' section starts on page 1192 instead of having to go through the book page by page. Here are some examples from the bowtie paper showing the table construction, finding a perfect match, and finding a position with a mismatch:


Bowtie and BWA both use this as a starting point but BWA uses a different method for finding mismatches that allows for insertions and deletions. Here is the BWT for you to play with:
def bwt(s):
    assert "\0" not in s, "Input string cannot contain the null character"
    s += "\0"  # sentinel that sorts before every other character
    # build all rotations of s and sort them lexicographically
    table = sorted(s[i:] + s[:i] for i in range(len(s)))
    first_column = [row[0] for row in table]
    print first_column  # the sorted characters of s; note how they cluster
    last_column = [row[-1] for row in table]
    return "".join(last_column)

Getting Started

Install BWA and samtools from source:
BWA and samtools are written in C and the source code needs to be compiled. Go to the ProgramFiles that were included in the download and follow these directions:

ubuntu install
$ sudo apt-get install ncurses-dev
$ tar -xvf samtools-0.1.17.tar.bz2
$ cd samtools-0.1.17
$ make
$ sudo cp samtools bcftools/bcftools /usr/local/bin/
$ cd ..
$ tar -xvf bwa-0.5.9.tar.bz2
$ cd bwa-0.5.9
$ make
$ sudo cp bwa /usr/local/bin/
$ which bwa
$ which samtools
$ which bcftools

Let's set up a script in the same manner as this morning:

from map_reads import bwaSE, bwaPE, bwaAln # again note that these functions don't exist yet

And then we'll want to set up our variables for managing data and files.

#!/usr/bin/env python
read_file = '/Users/aaron/PythonCourse/data/yeast/SRP003355/SRR064545/SRR064545_sample.fastq'
genome_index = '/Users/aaron/PythonCourse/data/yeast/S288C'
genome_reference ='/Users/aaron/PythonCourse/data/yeast/S288C.fasta'

The genome_index is the BW transformed genome which we will have to create.

$ cd PythonCourse/data/yeast
$ bwa index -p S288C S288C.fasta

Note the files that are generated: *.bwt and *.rbwt for both strands of the genome as well as *.sa and *.rsa which are the suffix arrays.

BWA is a little less straightforward to use than bowtie, so there are a few extra steps that we need to take in order to get our mapped reads. BWA first takes the reads and finds candidate positions for each read using the BWT index; if the reads are paired-end, it processes each end separately. This is the 'aln' step. Next we need to convert the position information to a mapped (SAM) format and, if using paired-end reads, refine the mapping taking the pairing information into account. This is the 'samse' or 'sampe' step for single-end and paired-end reads respectively.

We are going to write code that will first map single end reads and then paired-end reads. For practical purposes, we are going to run the code on the same single end sample file that we used this morning and then download the paired-end mapped reads for further use.

from subprocess import Popen,PIPE
def bwaAln(genome_index, read_file, sai_file, **aln_opts):
    pass  # to be filled in below
def bwaSE(genome_index, sai_file, read_file, sam_file, **se_opts):
    pass
def bwaPE(genome_index, sai_1, sai_2, read_file_1, read_file_2, sam_file, **pe_opts):
    pass

Variable length argument lists and keyword lists

*args and **kwargs are ways to pass a variable number of positional arguments and optional keyword arguments into a function. The function needs to be declared with these parameters, and when calling it, the arguments need the '*' (or '**' for keyword arguments) to mark them for unpacking. For example:
def m(i, **others):
    print i
    for k in others:
        print k, others[k]
extra_kwargs = {'this': 'is', 'a': 'test'}
m(1, **extra_kwargs)
m(**extra_kwargs) # not going to work: the required argument 'i' is missing

Our driver script is going to run the separate steps of the BWA alignment:

#!/usr/bin/env python
import sys
import map_reads
def main(argv):
    read_file = '/Users/aaron/PythonCourse/data/yeast/SRP003355/SRR064545/sample/SRR064545_sample_rev_1.fastq.gz'
    genome_index = '/Users/aaron/PythonCourse/data/yeast/S288C'
    read_sai = 'SRR064545_sample_rev_1.sai'
    read_sam = 'SRR064545_sample_rev_1.sam'
    read_group = '@RG\tID:SRR064545\tSM:vac6wt\tPL:ILLUMINA\tPU:vac6wt'
    map_reads.bwaAln(genome_index, read_file, read_sai, log_file='aln.log',**{'-t':'4'})
    map_reads.bwaSE(genome_index, read_sai, read_file, read_sam, log_file='sam.log', **{'-r':read_group})
if __name__ == '__main__':
    main(sys.argv)

The first step for BWA is to align each fastq file, separately for each end of paired reads, with map_reads.bwaAln; its output is then fed to map_reads.bwaSE or map_reads.bwaPE:
from subprocess import Popen, PIPE

def bwaSE(genome_index, sai_file, read_file, sam_file, log_file='', **se_opts):
    se_arguments = ['bwa', 'samse']
    for opt in se_opts.items():
        se_arguments.extend(opt)  # each option is a ('-flag', 'value') pair
    se_arguments.extend([genome_index, sai_file, read_file])
    if log_file:
        try:
            log_fh = open(log_file, 'w')
        except IOError:
            print log_file, 'could not open'
            log_fh = PIPE
    else:
        log_fh = PIPE
    try:
        sam_fh = open(sam_file, 'w')
    except IOError:
        print sam_file, 'could not open'
        return
    bwa = Popen(se_arguments, stderr=log_fh, stdout=sam_fh)
    bwa.wait()  # block until the SAM file is complete

def bwaAln(genome_index, read_file, sai_file, log_file='', **aln_opts):
    aln_arguments = ['bwa', 'aln']
    for opt in aln_opts.items():
        aln_arguments.extend(opt)
    aln_arguments.extend([genome_index, read_file])
    if log_file:
        try:
            log_fh = open(log_file, 'w')
        except IOError:
            print log_file, 'could not open'
            log_fh = PIPE
    else:
        log_fh = PIPE
    try:
        sai_fh = open(sai_file, 'w')
    except IOError:
        print sai_file, 'could not open'
        return
    bwa = Popen(aln_arguments, stderr=log_fh, stdout=sai_fh)
    bwa.wait()  # the .sai file must be complete before running samse/sampe

Now when we run this we can use the time utility and compare it to our homebrew read mapper:

$ time python
real    0m0.463s
user    0m0.865s
sys    0m0.129s

So it is pretty fast and it gives us the non-exact matches, represented in the SAM format. For our project, I ran BWA with the reverse complement of the SRA read file in order to make BWA think these are paired-end instead of mate-pair reads. The main difference is the orientation of the reads: paired-end reads face each other and mate-pairs face away from each other. We can look at the distribution of insert sizes by getting information from the SAM file. Download the mapped and sorted files here.

We can access the reads in the compressed BAM file using the samtools view program with subprocess. samtools is a separate program for doing useful things with read files like sorting, adding information, getting reads that mapped to a particular location and so on. Let's go through the BAM file and get the fragment length as defined by the end points of paired reads. In the SAM format this is the TLEN value in the 9th column (index 8 when counting fields from 0), and it is 0 if there is no information (for example when the pairs are not on the same chromosome).
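As a reminder of the column layout, here is how the fragment length comes out of a single SAM line (the line itself is made up for illustration):

```python
# SAM columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
sam_line = 'read1\t99\tchrIV\t100\t60\t36M\t=\t400\t336\tACGT\tIIII'
fields = sam_line.split('\t')
tlen = int(fields[8])  # 9th column, index 8: observed template (fragment) length
print(tlen)  # -> 336
```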

import subprocess as sp
from matplotlib import pyplot as plt
samtools = sp.Popen(['samtools', 'view', 'SRR064545.sorted.bam', 'chrIV'], stdout=sp.PIPE)
inserts_wt = [int(read.split()[8]) for read in samtools.stdout if 0 < int(read.split()[8]) < 4000]
samtools = sp.Popen(['samtools', 'view', '../SRR064546/SRR064546.sorted.bam', 'chrIV'], stdout=sp.PIPE)
inserts_mut = [int(read.split()[8]) for read in samtools.stdout if 0 < int(read.split()[8]) < 4000]
(n, bins, patches) = plt.hist(inserts_wt, bins=100)
plt.hist(inserts_mut, bins=bins)  # reuse the same bins so the histograms are comparable

SNP calling

The goal of SNP calling is to find variations in our data and separate the true polymorphisms from the errors created by the experimental process. Each sequencing method has pluses and minuses: 454 creates runs of homopolymers where none exist, Illumina has elevated error rates after certain sequence motifs, PacBio has a high per-base error rate, and all these methods have position-dependent error rates. There are several ways we could imagine filtering our SNPs to reduce our false positive rate: we could filter out regions with too little or too much coverage, SNPs that only occur in reads from one strand, SNPs that we only see at the beginning or end of reads, SNPs that occur after a certain k-mer, and so on. All of these are going to improve the specificity of our SNP set.
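As a sketch of the simplest of these filters (the data structure here is invented: a dict mapping (chromosome, position) to total read depth), a coverage filter might look like:

```python
def filter_by_coverage(candidates, min_cov=5, max_cov=100):
    """Keep candidate SNPs whose read depth is plausible: too little
    coverage means weak evidence, too much suggests a repeat region."""
    return dict((pos, cov) for pos, cov in candidates.items()
                if min_cov <= cov <= max_cov)

candidates = {('chrI', 10): 3, ('chrI', 20): 50, ('chrI', 30): 400}
print(filter_by_coverage(candidates))  # -> {('chrI', 20): 50}
```

The cutoffs are arbitrary here; in practice they would come from the coverage distribution of the actual run.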

One of the first things we should always do, even for computational biologists, is to look at the data. There are many many genome browsers you could use to look at mapped reads. As an example we are going to use IGV. Download IGV, import the S288C.fasta genome file, and open the bam file to look at your reads.

Quality of reads by position is one of the most obvious biases in SNP calling. To see for yourself, take the read sample file and plot the mean and stddev of the quality score by position.

import numpy as np
from matplotlib import pyplot as plt
reads = [l.strip() for l in open('data/yeast/SRP003355/SRR064545/sample/SRR064545_sample_rev_1.fastq')]
quality_lines = reads[3::4]  # every 4th line, starting at line 4, is a quality string
# raw ASCII codes; subtract the encoding offset (33 or 64) for Phred scores
quality_scores = np.array([map(ord, list(i)) for i in quality_lines])
x = np.arange(len(quality_scores[0]))
y = [quality_scores[:,i].mean() for i in x]
y_std = [np.std(quality_scores[:,i]) for i in x]
# these reads were reverse complemented, so flip back to sequencing order
y = np.array(list(reversed(y)))
y_std = np.array(list(reversed(y_std)))
plt.errorbar(x, y, yerr=y_std)

There are other biases in base composition which we will plot as an exercise. An Illumina-specific bias is that GpC pairs can lead to false SNP calls. Since this should only occur on one strand and we have information from both, we can use a statistical test to see how likely it is for a real SNP to only be seen on reads from one strand or the other. First we need to find all the positions that might be SNPs and then use Fisher's exact test from scipy. samtools mpileup provides a way to look at coverage and read information for each position in the mapped genome (note that its positions are 1-based, while Python indexes from 0), and then bcftools will take this and save information about potential variants in a compressed VCF format. According to the header information in the VCF file there is a field called DP4:

##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">

This is exactly what we want for our Fisher's exact test and also to estimate the mutation frequency in the pooled samples. Running samtools and bcftools on each read set will take about 30 minutes, but we can test it by just running it on chrI and downloading the full snp calls here.
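Pulling DP4 out of an INFO string is just string splitting; the INFO string below is invented for illustration:

```python
info = 'DP=25;DP4=8,2,1,5;MQ=40'  # hypothetical INFO field from one VCF line
fields = dict(f.split('=') for f in info.split(';'))
dp4 = [int(x) for x in fields['DP4'].split(',')]
ref_forward, ref_reverse, alt_forward, alt_reverse = dp4
print(dp4)  # -> [8, 2, 1, 5]
```

This is essentially what the parse_info function below does, generalized to handle flags and missing values.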

$ time samtools mpileup -r chrI -uf ../data/yeast/S288C.fasta ../data/yeast/SRR064545.sorted.bam|bcftools view -bvcg - > chrI.raw.bcf
$ bcftools view chrI.raw.bcf | less

Now we need a way to parse and store VCF information:

#!/usr/bin/env python
def parse_line(l, vcf_fields):
    '''splits a line into a dictionary by the VCF4.1 mandatory columns and assumes
    that there is also a FORMAT and at least one SAMPLE_ID'''
    parts = l.rstrip().split('\t')
    vcf = dict(zip(vcf_fields, parts[:7]))
    vcf['POS'] = int(vcf['POS'])
    vcf['INFO'] = parse_info(parts[7])
    vcf['GENOTYPES'] = parts[9:]
    return vcf
def parse_info(i):
    '''VCF4.1 spec: INFO fields are encoded as a semicolon-separated series of short
    keys with optional values in the format: <key>=<data>[,data].
    splits the info into a key:data dict with data split into lists;
    if a key is a flag, sets its data to ['True']'''
    if i == '.':
        return {}
    pairs = [f.split('=') for f in i.split(';')]
    for j in range(len(pairs)):
        if len(pairs[j]) == 1:
            # flag, not a key=value pair, so make it one
            pairs[j].append('True')
    d = dict([(k, v.split(',')) for k, v in pairs])
    return d
def get_variants(fh):
    '''parses a VCF file handle
    returns: variants, a dictionary keyed on (chr, pos)'''
    variants = {}
    header = ''
    for line in fh:
        if line.startswith('#CHROM'):
            header = line
            break
    if not header:
        print 'no header line, not a properly formed VCF'
        return variants
    vcf_fields = header.strip()[1:].split()
    for line in fh:
        position = parse_line(line, vcf_fields)
        variants[(position['CHROM'], position['POS'])] = position
    return variants

Our table for testing read strand bias is going to be structured as:
          Forward  Reverse
ref        8        2
alt        1        5
oddsratio, pvalue = stats.fisher_exact([[8, 2], [1, 5]])
import vcfTools
from scipy import stats
import subprocess as sp
wt  = sp.Popen(['bcftools', 'view', 'wt.raw.bcf'],stdout=sp.PIPE)
wt_snps = vcfTools.get_variants(wt.stdout)
read_bias = []
for pos in wt_snps:
    dp4 = map(int, wt_snps[pos]['INFO']['DP4'])
    o, p = stats.fisher_exact([dp4[:2], dp4[2:]])
    read_bias.append(p)
from matplotlib import pyplot as plt
plt.hist(read_bias, bins=100)

Thanks to the experimental design, errors created by the sequencing method are shared between the samples; they should look like heterozygous SNPs in each pool and can therefore be excluded from consideration. As an example, use samtools mpileup to look at position chrI 25480, which according to TableS1 is a SNP, and see for yourself.

Next let's recreate Fig3 by binning SNPs according to the fraction of reads carrying the alternate allele, using a Counter, and plotting the counts.
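A sketch of that binning, using made-up DP4 tuples (ref-forward, ref-reverse, alt-forward, alt-reverse) in place of real calls:

```python
from collections import Counter

# hypothetical DP4 counts for a handful of SNPs
dp4_values = [(8, 2, 1, 5), (5, 5, 5, 5), (0, 0, 3, 1), (9, 9, 1, 1)]
fractions = Counter()
for rf, rr, af, ar in dp4_values:
    alt_fraction = (af + ar) / float(rf + rr + af + ar)
    fractions[round(alt_fraction, 1)] += 1  # bin to one decimal place
print(fractions)
```

Plotting is then just a bar chart of the Counter's keys against its values.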

Add an annotation field to each SNP, marking it synonymous or non-synonymous, by combining the information from the SGD_feature file from S1.1 with the codon translation code from exercise 2 of S6.1. We want to look at the position of each SNP, see if it is in a verified ORF, and if so, whether the mutation changed the encoded amino acid.

The last calculation we need is the log odds for each SNP found in the vac6 mutant pool. Many of the SNPs called in the vac6 pool are not going to be found in the wt pool, but some will be. To get this information we can use samtools mpileup on a single position with, e.g., -r chrVII:956111-956111. In the output, the bases at that position are in the 5th column: '.' or ',' if matching the reference, and a base letter such as 'N' or 'n' if a non-reference base (upper and lower case represent forward- and reverse-mapped reads). Strings like '^]' mark the beginning of a read (the character after '^' encodes a mapping quality) and '$' marks the end of a read; these can be ignored. The paper states that they calculate the log odds, but it seems that what they are really reporting is the log10 of the p-value of Fisher's exact test.
           wt      vac6
ref        8        2
alt        1        5
oddsratio, pvalue = stats.fisher_exact([[8, 2], [1, 5]])
lod = np.log10(pvalue)
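Counting reference versus non-reference bases in that 5th mpileup column takes a little string cleanup first. The pileup string below is invented, and indel markers ('+'/'-' runs) are ignored for simplicity:

```python
import re

def count_bases(pileup_bases):
    # '^' starts a read and is followed by a mapping-quality character; '$' ends one
    s = re.sub(r'\^.', '', pileup_bases)
    s = s.replace('$', '')
    ref = s.count('.') + s.count(',')
    alt = sum(s.count(b) for b in 'ACGTacgt')
    return ref, alt

print(count_bases('^].,,..a$,T'))  # -> (6, 2)
```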

Now let's put it all together and see if we get the same SNPs!

import numpy as np
import sequenceTools
import vcfTools
import subprocess as sp
from scipy import stats
def get_coding_regions(file, genome):
    '''returns a per-chromosome boolean mask of coding positions and the list of
    verified ORFs with chromosome keys and 0-based coordinates filled in'''
    roman = dict(zip(['%s' % x for x in range(1, 17)],
                     ['I','II','III','IV','V','VI','VII','VIII',
                      'IX','X','XI','XII','XIII','XIV','XV','XVI']))
    lines = [l.strip().split('\t') for l in open(file)]
    orfs = [l for l in lines if l[1] == 'ORF' and l[2] == 'Verified']
    coding = {}
    good_orfs = []
    for chr in genome:
        coding[chr] = np.zeros(genome[chr], dtype=bool)
    for orf in orfs:
        if orf[6] == '2-micron': continue
        p = orf[6].split(' ')
        if p[1] not in roman:
            continue
        chr_key = 'chr' + roman[p[1]]
        start = int(orf[9])
        end = int(orf[10])
        if start > end:
            # Crick-strand ORFs list their coordinates high-to-low
            start, end = end, start
        start -= 1  # SGD coordinates are 1-based
        coding[chr_key][start:end] = True
        orf[9] = start
        orf[10] = end
        orf[6] = chr_key
        good_orfs.append(orf)
    return coding, good_orfs
def parse_nuctable(filename):
    fh = open(filename)
    codon_table = {}
    for line in fh:
        line = line.strip()
        if line:  # skips blank lines
            data = line.split()
            aa = data[0]   # something like 'Ala/A'
            oneletter = aa.split('/')[1]
            codons = data[1:]
            for codon in codons:
                codon = codon.strip().replace(',', '')
                # turns 'GCU, ' into 'GCU'
                codon_table[codon] = oneletter
    return codon_table
def seq_to_codons(sequence):
    codons = []
    i = 0
    while i < len(sequence):
        codons.append(sequence[i: i+3])
        i += 3
    return codons
def translate_sequence(nt_sequence,codon_table):
    aa_sequence = ""
    nt_sequence = nt_sequence.replace('T', 'U')
    # nuctable uses the RNA sequence, not the DNA
    for codon in seq_to_codons(nt_sequence):
        aa_sequence += codon_table[codon]
    return aa_sequence
def get_wt_reads(bam, chr, pos):
    '''returns the total coverage and the number of non-reference bases
    at a single position in a BAM file (pos is 1-based, as mpileup expects)'''
    region = '%s:%i-%i' % (chr, pos, pos)
    samtools = sp.Popen(['samtools', 'mpileup', '-r', region, bam], stdout=sp.PIPE, stderr=sp.PIPE)
    pile = samtools.communicate()[0]
    total = 0
    total_alt = 0
    if pile:
        coverage = pile.split('\t')[4]
        # approximate: read start/end markers are counted in the total
        total = len(coverage)
        total_alt = sum(coverage.count(n) for n in 'acgtACGT')
    return total, total_alt
def main():
    ref_file = '../data/yeast/S288C.fasta'
    ref = sequenceTools.read_fasta(ref_file)
    ref = dict([(c.split('\t')[0],v) for c,v in ref.items()])
    genome_lens = dict([(c,len(ref[c])) for c in ref])
    feature_file = '../data/yeast/'
    wt_bam = 'SRR064545.sorted.bam'
    mut_table = open('','w')
    codon_table = parse_nuctable('codon_table')
    coding,orfs = get_coding_regions(feature_file,genome_lens)
    mut_fh = sp.Popen(['bcftools', 'view', 'mut.raw.bcf'],stdout=sp.PIPE)
    mut_snps = vcfTools.get_variants(mut_fh.stdout)
    for k, vcf in mut_snps.items():
        chr, pos = k
        pos = int(pos)  # VCF positions are 1-based
        nonsyn = 'No'
        if coding[chr][pos - 1]:
            # in a coding region: check whether the change alters the protein
            if len(vcf['ALT']) != 1:
                nonsyn = 'Yes'  # indel or multiple alternate alleles
            for orf in [orf for orf in orfs if orf[6] == chr]:
                start = orf[9]
                end = orf[10]
                strand = orf[11]
                if start <= pos - 1 < end:
                    newseq = ref[chr][start:pos - 1] + vcf['ALT'] + ref[chr][pos:end]
                    oldseq = ref[chr][start:end]
                    if len(newseq) % 3 != 0 or len(oldseq) % 3 != 0:
                        nonsyn = 'Yes'  # frameshift; cannot translate cleanly
                        continue
                    if strand == 'C':
                        # minus strand
                        newseq = sequenceTools.revcomp(newseq)
                        oldseq = sequenceTools.revcomp(oldseq)
                    if translate_sequence(newseq, codon_table) != translate_sequence(oldseq, codon_table):
                        nonsyn = 'Yes'
        wt_reads, wt_alt = get_wt_reads(wt_bam, chr, pos)
        dp4 = map(int, vcf['INFO']['DP4'])
        mut_reads = sum(dp4)
        mut_alt = sum(dp4[2:])
        try:
            # Fisher's exact test comparing ref/alt counts in the wt and mutant pools
            pvalue = stats.fisher_exact([[wt_reads - wt_alt, wt_alt],
                                         [mut_reads - mut_alt, mut_alt]])[1]
            lod = np.log10(pvalue)
        except ValueError:
            lod = 0
        ref_base = vcf['REF']
        alt_base = vcf['ALT']
        mut_table.write('\t'.join([chr, '%i' % pos, ref_base, alt_base,
                                   '%i' % wt_alt, '%i' % wt_reads,
                                   '%i' % mut_alt, '%i' % mut_reads,
                                   '%f' % lod, nonsyn]) + '\n')
if __name__ == '__main__':
    main()


1. Plot base composition as a function of read position using the sample reads.

import numpy as np
from matplotlib import pyplot as plt
reads = [l.strip() for l in open('../data/yeast/SRP003355/SRR064545/sample/SRR064545_sample_rev_1.fastq')]
read_lines = reads[1::4]  # every 4th line, starting at line 2, is a sequence
read_len = len(read_lines[0])
bases = 'ACGT'
read_comp = {}
for n in bases:
    read_comp[n] = np.zeros(read_len)
for read in read_lines:
    for n in bases:
        read_comp[n] += np.array([1 if x == n else 0 for x in read])
for n in bases:
    read_comp[n] /= len(read_lines)
    # reverse so the plot runs in sequencing order (reads were reverse complemented)
    plt.plot(range(read_len), read_comp[n][::-1], label=n)
plt.xlabel('Read position')
plt.ylabel('Base composition')
plt.legend()

2. Replicate figure S1.b (without the LYW10741 sample)

from matplotlib import pyplot as plt
import subprocess as sp
def plot(inserts_wt, inserts_mut):
    fig = plt.figure(figsize=(10,5))
    p1 = fig.add_subplot(121)
    (n, bins, patches) = p1.hist(inserts_wt, bins=100)
    p1.set_xlabel('Size (kb)')
    p1.set_ylabel(r'Count ($\times 10^5$)')
    p1.set_title('VAC6 wild-type')
    p2 = fig.add_subplot(122)
    p2.hist(inserts_mut, bins=bins)
    p2.set_xlabel('Size (kb)')
    p2.set_title('VAC6 mutant')
def main():
    samtools = sp.Popen(['samtools', 'view', '../data/yeast/SRP003355/SRR064545/SRR064545.sorted.bam'], stdout=sp.PIPE)
    inserts_wt = [int(read.split()[8]) for read in samtools.stdout if int(read.split()[8]) >0 and int(read.split()[8]) < 4000]
    samtools = sp.Popen(['samtools', 'view', '../data/yeast/SRP003355/SRR064546/SRR064546.sorted.bam'], stdout=sp.PIPE)
    inserts_mut = [int(read.split()[8]) for read in samtools.stdout if int(read.split()[8]) >0 and int(read.split()[8]) < 4000]
    plot(inserts_wt, inserts_mut)
if __name__ == '__main__': main()

3. Fill in the bwaPE stub to map paired-end reads.

from subprocess import Popen, PIPE
def bwaPE(genome_index, sai_1, sai_2, read_file_1, read_file_2, sam_file, log_file='', **pe_opts):
    pe_arguments = ['bwa', 'sampe']
    for opt in pe_opts.items():
        pe_arguments.extend(opt)
    pe_arguments.extend([genome_index, sai_1, sai_2, read_file_1, read_file_2])
    if log_file:
        try:
            log_fh = open(log_file, 'w')
        except IOError:
            print log_file, 'could not open'
            log_fh = PIPE
    else:
        log_fh = PIPE
    try:
        sam_fh = open(sam_file, 'w')
    except IOError:
        print sam_file, 'could not open'
        return
    bwa = Popen(pe_arguments, stderr=log_fh, stdout=sam_fh)
    bwa.wait()