RNA-Seq and the Tuxedo Suite


Introduction

Yesterday, we talked about BWA, a short read aligner, and how it can be used in a SNP-finding pipeline. Aligning reads to a genome is the basis for almost all high-throughput sequencing applications, and several different packages do the same job, with different pros and cons. Another program that uses the Burrows-Wheeler transform is bowtie, which comes with a couple of extra tools for doing RNA-seq, named tophat and cufflinks. Because of the somewhat whimsical names (which is not uncommon in software), the whole collection of tools is sometimes called the Tuxedo suite. Bowtie is slightly faster than bwa and is aware of the quality scores of the bases, but doesn't do gapped alignment. If you're working with a strain fairly close to the reference strain for your species, you will likely prefer bowtie, but there is no hard and fast rule.

In our efforts to learn these tools, we're going to go over a small portion of the data from the modENCODE project on Drosophila. These operations typically take tens of minutes to hours on reasonably sized data sets, so instead I'm going to give you a small fraction of the data so things go quickly. In your PythonCourse/data/RNAseq directory, you should have two FASTQ files of reads, one from male and one from female adult flies 5 days after hatching (dmel_fem_05d_sample.fastq is the female sample), each restricted to reads that map to chromosome 4 (the smallest autosome in flies). The precise numbers will differ from what you'd get running the full dataset against the whole genome, but the general conclusions should still be roughly the same. I also forgot to include a set of annotations for just chromosome 4 (as opposed to the full fly genome), so we're going to write a quick Python script to pull out only the chromosome 4 annotations:

outfh = open('dmel-4-transcripts-r5.38.gtf', 'w')
for line in open('dmel-all-transcripts-r5.38.gtf'):
    # The first tab-separated field of a GTF line is the chromosome name;
    # matching "4" plus the tab avoids picking up any other chromosome
    if line.startswith("4\t"):
        outfh.write(line)
outfh.close()


We'll take a few moments here to install the programs we'll need for the day:

$ cd ~/PythonCourse/programFiles/
$ unzip bowtie-0.12.7-macos-10.5-x86_64.zip
$ mv bowtie-0.12.7 ~/bowtie-0.12.7

The default bowtie download comes with documentation, tutorials, etc., but cufflinks and tophat just have the executables and a few READMEs, so it's probably simplest to put those in with the bowtie files:

$ tar -xvzf cufflinks-1.0.3.OSX_x86_64.tar.gz
$ mv cufflinks-1.0.3.OSX_x86_64/* ~/bowtie-0.12.7/
$ tar -xvzf tophat-1.3.1.OSX_x86_64.tar.gz
$ mv tophat-1.3.1.OSX_x86_64/* ~/bowtie-0.12.7

Note that this does overwrite the README, LICENSE, and AUTHORS files, but it's likely more convenient to just use the relevant websites (bowtie, tophat, and cufflinks) anyway. We'll also want to add a line to our .bash_profile file, similar to when we installed blast:

PATH=$PATH:~/bowtie-0.12.7

And now we're off to the races!

Bowtie


Much like with BLAST and BWA, in order to use Bowtie (and therefore any of the rest of the Tuxedo suite), we'll need to build an index of our genome. We do this with the bowtie-build command. The first argument is the reference genome fasta file; the second is the base name we want to use for the index files, which I've chosen to be the same as the fasta file without the extension. Today, I'm going to work with just the chromosome 4 data (the sampled reads included are only those that mapped to chromosome 4), purely to make things run faster. The process is exactly the same for the whole genome; it just takes longer.

$ cd ~/PythonCourse/data/
$ bowtie-build dmel-4-chromosome-r5.38.fa dmel-4-chromosome-r5.38

While this is running, I'll show you the syntax for running Bowtie, although I won't run it here, because we're going to use tophat to map the RNA-seq reads.

$ bowtie dmel-4-chromosome-r5.38 sequence.fastq

That's it. Easy, huh? By default, it writes the alignments to the console, so you will likely want to do the following:

$ bowtie dmel-4-chromosome-r5.38 sequence.fastq --refout

which writes the output to a file, ref00000.map. The number in that filename increases if your genome has multiple reference sequences (it makes one file per chromosome).
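Each line of a .map file is one tab-delimited alignment record. As a sketch of pulling out the interesting fields (the alignment line below is made up, and the field order follows bowtie's default output: read name, strand, reference name, 0-based offset, sequence, qualities, and so on):

```python
# A made-up line in bowtie's default output format (not real data)
line = "read_001\t+\t4\t10287\tGATTACA\tIIIIIII\t0\t3:A>G"

fields = line.rstrip("\n").split("\t")
name = fields[0]         # read name
strand = fields[1]       # + or -
refname = fields[2]      # reference sequence the read aligned to
offset = int(fields[3])  # bowtie reports 0-based positions

print(name, strand, refname, offset)
```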


Tophat

Let's say you're mapping things to the genome, and you find that suddenly, in the middle of a read, you jump several kb in the genome. If your read was of genomic DNA, this would probably tell you that the genome is pretty messed up. On the other hand, if you're working with RNA-seq data, probably all it means is that you've got a eukaryote and you hit a splice junction. Tophat is a read mapper that is aware of splice junctions.

A typical tophat run looks something like this:

$ tophat -o reads12 dmel-4-chromosome-r5.38 reads_1_1,reads_2_1 reads_1_2,reads_2_2

Let's break this down. We want tophat to:

  • put output in the folder called reads12, which it will make if it doesn't already exist. This is optional, and defaults to tophat_out
  • use the index files named dmel-4-chromosome-r5.38 (the name we gave bowtie-build)
  • map four data files: reads_1_1 and reads_1_2 are the mate-pair libraries from our first lane of sequencing, and reads_2_1 and reads_2_2 are mate-pairs from the second lane. The newest generation sequencing technologies put out so much data that one lane of sequencing is often enough, but in case you do need multiple, you can just use comma-separated lists.
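Those comma-separated lists are easy to build programmatically rather than by hand. A minimal sketch (the .fastq file names here are hypothetical placeholders):

```python
# Build the comma-separated read lists that tophat expects on its
# command line from ordinary Python lists of file names.
first_mates = ['reads_1_1.fastq', 'reads_2_1.fastq']    # one entry per lane
second_mates = ['reads_1_2.fastq', 'reads_2_2.fastq']

argv = ['tophat', '-o', 'reads12', 'dmel-4-chromosome-r5.38',
        ','.join(first_mates), ','.join(second_mates)]
print(' '.join(argv))
```

Passing a list like argv straight to subprocess.Popen, as we'll do shortly, also sidesteps any shell quoting headaches.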

For those of you clever enough to be working with well-studied model organisms, there's another option you probably want to keep in mind:

$ tophat --no-novel-juncs -G ~/PythonCourse/data/dmel-4-transcripts-r5.38.gtf -o reads12 dmel-4-chromosome-r5.38 reads_1_1,reads_2_1 reads_1_2,reads_2_2

This tells tophat that you only want to use already-annotated gene models. Restricting the search to known splice junctions generally makes the run faster and keeps spurious novel junctions out of your results; the tradeoff is that you won't see any genuinely unannotated splicing.

Tophat produces a few different files of output. For our purposes, the most important of these is accepted_hits.bam, which is a list of read alignments. It's in BAM, the binary version of the SAM format, which is not really readable by hand. You can use samtools to convert it to the human-readable format if you want, although that does take up quite a bit more space.
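For example, the conversion might look like this from Python (a sketch: it assumes samtools is on your PATH, and the file paths are placeholders):

```python
import shutil
import subprocess

bam = 'reads12/accepted_hits.bam'   # placeholder paths
sam = 'reads12/accepted_hits.sam'

# `samtools view -h` prints the alignments as plain text, keeping the
# header (-h); -o names the output file.
cmd = ['samtools', 'view', '-h', '-o', sam, bam]
print(' '.join(cmd))

# Only run the conversion if samtools is actually installed
if shutil.which('samtools'):
    subprocess.check_call(cmd)
```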

Scripting tophat


Let's start off by writing a script to run Tophat once, with parameters, and then we'll move on to running it for each of the fly samples we have.

import subprocess as sp

program = 'tophat'
refgenomename = '../data/dmel-4-chromosome-r5.38'
refGTFname = '../data/dmel-4-transcripts-r5.38.gtf'
# One of the sample files from ../data/RNAseq
readfilename = '../data/RNAseq/dmel_fem_05d_sample.fastq'

proc = sp.Popen([program, '--no-novel-juncs', '-G', refGTFname,
                 refgenomename, readfilename])
proc.wait()   # Block until tophat finishes

Of course, the modENCODE dataset has several different independent conditions, so let's see if we can't automate this:

run_tophat.py
import subprocess as sp
import os
from os import path
 
program = 'tophat'
refgenomename = '../data/dmel-4-chromosome-r5.38'
refGTFname = '../data/dmel-4-transcripts-r5.38.gtf'
 
datadir = '../data/RNAseq'
# os.listdir returns everything in the directory; keep only the FASTQ files
readfilenames = [fn for fn in os.listdir(datadir) if fn.endswith('.fastq')]
 
#make a new directory for our output data
outdir = '../data/tophat_out/'
 
try:
    os.makedirs(outdir)
except OSError as error:
    print "Warning: Output directory already exists. Data might be overwritten"
    print error
 
procs = []
for readfn in readfilenames:
    outname = path.join(outdir, path.basename(path.splitext(readfn)[0]))
    readfn = path.join(datadir, readfn)
    print "reading from", readfn
    print "writing to", outname
    proc = sp.Popen([program, '--no-novel-juncs', '-G', refGTFname, '-o',
                     outname, refgenomename, readfn] )
 
    procs.append(proc)
 
[proc.wait() for proc in procs]


Cufflinks


It's all great to have the reads aligned to the genome, but there are still a few things we need to be aware of. First off, there are thousands of genes in the genome, and we would like to have a single number for each one. That's still a lot of data points, but it becomes tractable to deal with that number. The community has settled on RPKM (Reads Per Kilobase of transcript per Million mapped reads) and FPKM (the paired-end equivalent, where one pair of reads is one fragment). These correct for a couple of the most obvious biases in RNA-seq: dividing by the length of the gene corrects for a longer transcript producing more sequencing reads than a shorter one, and dividing by the number of mapped reads corrects for different sequencing runs yielding different numbers of reads, even when derived from the same library.
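To make the formula concrete, here's the arithmetic as a small function (the numbers in the example are made up):

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    kilobases = gene_length_bp / 1000.0
    millions = total_mapped_reads / 1.0e6
    return read_count / (kilobases * millions)

# Made-up example: 500 reads on a 2 kb gene, out of 10 million mapped reads
print(rpkm(500, 2000, 10000000))   # 500 / (2.0 * 10.0) = 25.0
```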

Let's furthermore say that you are interested in whether differential splicing is happening in your sample. This is an even harder problem, since you can't just compute an RPKM for each isoform independently and compare those: if two isoforms share exons, a read from a shared exon could have come from either one.

Add in the fact that there are a number of biases introduced in the library preparation process, and you almost start to wonder why you thought RNA-seq was a good idea in the first place. At least, you do until someone points out cufflinks, the last major tool in the Tuxedo suite. It builds on the last decade of bioinformatics research to produce a fast, easy-to-use tool that pretty much just works to estimate FPKM values for your data.

If we run cufflinks without any arguments it gives us a long list of options:

$ cufflinks

but the most important line is near the beginning:

Usage: cufflinks [options] <hits.sam>

For our purposes, a couple of the options we'll care most about are:
-o/--output-dir write all output files to this directory [ default: ./ ]
-p/--num-threads number of threads used during analysis [ default: 1 ]
-G/--GTF quantitate against reference transcript annotations
 


Cufflinks will then generate three files:
  • transcripts.gtf: the assembled (or reference) transcript models
  • isoforms.fpkm_tracking: estimated FPKM values for each transcript (isoform)
  • genes.fpkm_tracking: estimated FPKM values for each gene

Each of these is human readable and easily parsable. The cufflinks manual has more details on what each of the columns means.
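As a quick sketch of parsing one of these, here's how you might pull gene-level FPKMs out of genes.fpkm_tracking. The two-line excerpt below is invented (real files have more columns), which is why the code finds the FPKM column by its header name instead of hard-coding a position:

```python
import io

# A hypothetical excerpt standing in for genes.fpkm_tracking; a real
# file would be opened with open(...) instead
example = io.StringIO(
    "tracking_id\tgene_short_name\tlocus\tFPKM\tFPKM_status\n"
    "XLOC_000001\tsome_gene\t4:100-200\t12.5\tOK\n"
)

# Locate the FPKM column from the header line, since the exact column
# order can differ between cufflinks versions
header = example.readline().rstrip("\n").split("\t")
fpkm_col = header.index("FPKM")

fpkms = {}
for line in example:
    fields = line.rstrip("\n").split("\t")
    fpkms[fields[0]] = float(fields[fpkm_col])

print(fpkms)
```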

Like with tophat, we'll want to run cufflinks over all of the runs:

run_cufflinks.py
import subprocess as sp
import os
from os import path
 
program = 'cufflinks'
refGTFname = '../data/dmel-4-transcripts-r5.38.gtf'
 
datadir = '../data/tophat_out/'
readfilenames = os.listdir(datadir)
 
#make a new directory for our output data
outdir = '../data/cufflinks_out/'
 
try:
    os.makedirs(outdir)
except OSError as error:
    print "Warning: Output directory already exists. Data might be overwritten"
    print error
 
procs = []
 
for readfn in readfilenames:
    outname = path.join(outdir, path.basename(path.splitext(readfn)[0]))
    readfn = path.join(datadir, readfn, 'accepted_hits.bam')
    print "reading from", readfn
    print "writing to", outname
    proc = sp.Popen([program, '-G', refGTFname, '-o', outname, readfn])
    procs.append(proc)
 
# Note that we don't actually care about the return values, we just
# want to make sure they finish.
[proc.wait() for proc in procs]
 
 


Auxiliary Cufflinks Programs


In addition to quantifying the expression level of each gene and transcript (also called an isoform), cufflinks comes with some software to easily tell you what's different between samples. The program cuffcompare generates statistics on sensitivity (the fraction of real features that are detected) and specificity (the fraction of non-features that are correctly rejected), as well as merging the GTF files that cufflinks generates. It's especially useful if you don't already have a good reference annotation.
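In case the jargon is rusty, here's what those two statistics mean, with made-up counts:

```python
def sensitivity(true_pos, false_neg):
    # Fraction of the real features that were successfully detected
    return true_pos / float(true_pos + false_neg)

def specificity(true_neg, false_pos):
    # Fraction of the non-features that were correctly rejected
    return true_neg / float(true_neg + false_pos)

# Made-up counts: 90 of 100 real transcripts found,
# 95 of 100 non-transcripts correctly rejected
print(sensitivity(90, 10))   # 0.9
print(specificity(95, 5))    # 0.95
```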

$ cuffcompare

This one we could just as easily do at the command line, but then when your advisor asks what options you gave, it would be nice to be able to look at your code and just know, rather than guessing. Sure, you could write it in your lab notebook, but where's the fun in that?

run_cufftools.py
import subprocess as sp
import os
from os import path
 
cuffcompare = 'cuffcompare'
# I know this seems silly, but one of the servers we have in lab would need
# this to be cufflinks.cuffcompare. Easiest to have it in one obvious place
# at the top so it can be changed as necessary.
datadir = '../data/cufflinks_out'
 
outprefix = path.join(datadir, 'cuffcmp')
 
 
# Only the per-sample directories; skip any cuffcmp.* files from earlier runs
runnames = [name for name in os.listdir(datadir)
            if path.isdir(path.join(datadir, name))]
gtfs = [path.join(datadir, name, 'transcripts.gtf') for name in runnames]
 
proc = sp.Popen([cuffcompare, '-o', outprefix] + gtfs)
proc.wait()   #Wait until cuffcompare finishes before running cuffdiff
 
 

Perhaps the most important tool you'll want to know about is cuffdiff, which does some sophisticated statistical tests to tell you about significant changes in expression, splicing, and promoter usage. It will give you a list of yes/no "is this difference statistically significant" calls, one for each gene, transcript, splicing event, and promoter.

Again, we can look at the run-time options by simply running it without any arguments:

$ cuffdiff
Usage:   cuffdiff [options] <transcripts.gtf> <sample1_hits.sam> <sample2_hits.sam> [... sampleN_hits.sam]
   Supply replicate SAMs as comma separated lists for each condition: sample1_rep1.sam,sample1_rep2.sam,...sample1_repM.sam
General Options:
  -o/--output-dir              write all output files to this directory              [ default:     ./ ]
  -T/--time-series             treat samples as a time-series                        [ default:  FALSE ]
  -c/--min-alignment-count     minimum number of alignments in a locus for testing   [ default:   10 ]
  --FDR                        False discovery rate used in testing                  [ default:   0.05 ]
  -M/--mask-file               ignore all alignment within transcripts in this file  [ default:   NULL ]
  -b/--frag-bias-correct       use bias correction - reference fasta required        [ default:   NULL ]
  -u/--multi-read-correct      use 'rescue method' for multi-reads (more accurate)   [ default:  FALSE ]
  -N/--upper-quartile-norm     use upper-quartile normalization                      [ default:  FALSE ]
  -L/--labels                  comma-separated list of condition labels
  -p/--num-threads             number of threads used during quantification          [ default:      1 ]


Now we modify our program:

run_cufftools.py
import subprocess as sp
import os
from os import path
 
cuffcompare = 'cuffcompare'
# I know this seems silly, but one of the servers we have in lab would need
# this to be cufflinks.cuffcompare. Easiest to have it in one obvious place
# at the top so it can be changed as necessary.
cuffdiff = 'cuffdiff'
cuffdatadir = '../data/cufflinks_out'
tophdatadir = '../data/tophat_out'
 
reference = '../data/dmel-4-transcripts-r5.38.gtf'
 
outprefix = path.join(cuffdatadir, 'cuffcmp')
 
 
# Only the per-sample directories; skip the cuffcmp.* files that
# cuffcompare leaves in this directory
runnames = [name for name in os.listdir(cuffdatadir)
            if path.isdir(path.join(cuffdatadir, name))]
gtfs = [path.join(cuffdatadir, name, 'transcripts.gtf') for name in runnames]
 
proc = sp.Popen([cuffcompare, '-o', outprefix, '-r', reference] + gtfs)
proc.wait()   #Wait until cuffcompare finishes before running cuffdiff
 
 
runnames = os.listdir(tophdatadir)
bamfiles = [path.join(tophdatadir, name, 'accepted_hits.bam') for name in runnames]
proc2 = sp.Popen([cuffdiff, '-o', cuffdatadir, outprefix+'.combined.gtf'] + bamfiles)
proc2.wait()



Exercises


1) Simple Plotting

Generate a scatterplot of the expression (in FPKM) between the two samples, with the X coordinate being the expression level in females, and the Y coordinate being expression level in males. Because the values span such a wide range, you will probably want to use loglog(), which takes all the same arguments as plot(), but plots on a log-log scale.

2) Separate out female and male biased genes

Write a program that goes through the output from cuffdiff and writes the gene names of the female biased genes to one file, and the names of the male biased genes to another file.

3) Convert names to something usable

Cufflinks has reported the expression levels for these genes, but they are all listed as things like XLOC_00001, which is just a name that cufflinks chose for them. Using cuffcmp.combined.gtf and the "FBgn <==> FBtr <==> FBpp IDs" table on flybase, convert the XLOC names to their corresponding FBgn IDs. Then, feed the male and female biased genes through the GO-Term finder to see what kinds of things (on chromosome 4) are enriched in males and females.

4) Bonus: Write your own cufflinks

Use samtools to convert accepted_hits.bam into a plaintext SAM file. Then, write a program that uses the SAM file and either the GTF generated by cufflinks or the reference gene annotation to calculate the FPKM for a given gene. If two genes overlap, randomly assign the read to one or the other. How closely does this agree with the value that cufflinks reports?

Once you have that working, do it for all of the genes at once. Ideally, this should only loop through the accepted_hits once.

Excerpts from the SAMtools documentation:

Col  Field  Description
1    QNAME  Query template/pair NAME
3    RNAME  Reference sequence NAME
4    POS    1-based leftmost POSition/coordinate of clipped sequence
8    MPOS   1-based Mate POSition
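Putting those columns to use, here's a sketch of pulling the relevant fields out of one alignment line (the SAM record below is invented):

```python
# A made-up SAM record (the 11 mandatory tab-separated columns)
samline = ("read_001\t99\t4\t10000\t50\t36M\t=\t10100\t136\t"
           "ACGTACGTACGTACGTACGTACGTACGTACGTACGT\t"
           "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII")

cols = samline.split("\t")
qname = cols[0]      # col 1: query (read) name
rname = cols[2]      # col 3: reference sequence name
pos = int(cols[3])   # col 4: 1-based leftmost position
mpos = int(cols[7])  # col 8: mate's 1-based position

print(qname, rname, pos, mpos)
```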


Solutions


Simple Plotting

This has a bug in it somewhere that I haven't tracked down yet, but the basic idea is the same:

fh = open('../data/cufflinks_out/gene_exp.diff')
 
 
# Get rid of the header line
fh.readline()
 
fpkms1 = []
fpkms2 = []
 
for line in fh:
    data = line.split()
    # If the values are zero, then a loglog plot will fail.
    fpkm1 = float(data[7])
    fpkm2 = float(data[8])
    if fpkm1 == 0 or fpkm2 ==0:
        continue
    fpkms1.append(fpkm1)
    fpkms2.append(fpkm2)
 
from matplotlib import pyplot as mpl
from numpy import array
 
mpl.loglog(array(fpkms1), array(fpkms2), ',')
 
# Plot a diagonal line corresponding to equal expression
mpl.loglog([min(fpkms1), max(fpkms1)], 
           [min(fpkms1), max(fpkms1)], 'r:')
mpl.show()

Gives us this:
FPKMS.png

Male and Female Biased


malebiased = open('male', 'w')
femalebiased = open('female', 'w')
 
fh = open('../data/cufflinks_out/gene_exp.diff')
 
 
# Get rid of the header line
fh.readline()
 
 
for line in fh:
    data = line.split()
    if data[-1] == 'yes':
        print data[0], data[7], data[8]
        female = float(data[7])
        male = float(data[8])
        if female > male:
            femalebiased.write(data[0] + '\n')
        else:
            malebiased.write(data[0] + '\n')
 
malebiased.close()
femalebiased.close()
 

What's your name, little gene?


import sys
 
xlocdict = {}
fbtrdict = {}
 
for line in open('../data/cufflinks_out/cuffcmp.combined.gtf'):
    XLOC = ""
    FBtr = ""
    # GTF attributes look like: gene_id "XLOC_000001"; -- slicing off the
    # first and last two characters strips the quotes and the semicolon
    for word in line.split():
        if 'XLOC' in word:
            XLOC = word[1:-2]
        elif 'FBtr' in word:
            FBtr = word[1:-2]
    if XLOC == "" or FBtr == "":
        continue
    xlocdict[XLOC] = FBtr
 
for line in open('fbgn_fbtr_fbpp_fb_2011_07.tsv.tsv'):
    line = line.strip()
    if len(line) == 0 or line[0] == "#":
        continue
    line = line.split('\t')
    fbtrdict[line[1]] = line[0]
 
infile = open(sys.argv[1])
 
for line in infile:
    try:
        fbtr = xlocdict[line.strip()]
        print fbtrdict[fbtr]
    except KeyError:
        pass
        #print "Can't find a match for ", line.strip()