Last lecture!

Next-generation sequencing has started to change the way we approche many different problems in biology, changing qualitative and analog results into quantitative digital counts of reads. We looked at RNA fragment counting this morning and now we are going to look at fragments of DNA that were bound to a protein and an antibody. For ChIP data, we used to hybridize DNA fragments to other fragments and compare florescence of the fragments. Now with next-gen sequencing we can count the fragments directly. Using modENCODE data, we are going to look at the position of RNA polII and H3K4Me1 from a collection of melanogaster embryos that are between 4-8 hours old.

Short Read Archive
Depending on the finances of the USA, the NIH might continue to host a short read archive for the raw data and for the moment we can find the information we want here. Another place to look for short read data is the European Nucleotide Archive . Today we are going to browse the SRA for the modENCODE data referenced in the paper. Once you find some data that you want to use, you need to unpackage it from the .sra format into fastq files using a utility from the NIH, sra-toolkit with a command similar to:

$ fastq-dump -A SRR030288 SRR030288.sra

Which will dump the uncompressed fastq file into the current folder.
Let's to the SRA and find the ChIP input data for stage E4-8.

Since this is a time and space consuming process and I'm sure that you could do this on your own, get the bwa mapped bam files from here. While you are there, get the MACS peak calling files here.

MACS and other peak finding software use the properties of the experiment to separate the real peak from the random input. One important feature is that there should be two piles of fragments on opposite strands around a real binding site. Use IGV to look at the distribution of reads around eve, a gene that should be expressed or getting ready to be expressed depending on how well they staged the collections.

Let's look at the peaks around all genes! Get the dmel gene gff using wget or curl -O

$ wget
$ curl -O


1. Classify each gene into transcribing, paused, inactive catagories by finding peaks of polII near the start of the gene and taking the ratio of promoter to gene polII reads.

2. Make a heatmap (make two dimensional array) for the 1kb up and downstream of each gene for each ChIP data set. Then sort all the heatmaps by the highest polII peak 1kb upsteam of the gene start.

3. Go to RedFly and get a list of annotated cis-regulatory modules. Test if either histone modification is enriched in CRMs. A common way to do this is to compare the amount of signal in a region of interest to a similar distribution of regions randomly selected from the genome.

4. Similarly, see if polII signal is associated with either histone modification.