System Calls and Scripting External Programs

By Matt Davis, Updated 7/25/11 by Mike Lawson


  • Manipulating the file system and UNIX environment from Python
  • Executing other external user space programs with os and sys modules
  • Executing external processes with the subprocess module
  • Running BLAST from a Python script


This morning we will learn to use the os and sys modules to interface with the UNIX environment, manipulate the filesystem, and run external programs. Then we will move on to the more sophisticated subprocess module.

Useful tools from the os module

The UNIX environment that you have access to when you type in the terminal is also available to Python. That means that you don't have to open a separate terminal window just to change directories, create a new file, or see which operating system your program is running on. This also means that you have the capacity to script what would otherwise be exceedingly mundane tasks, such as renaming lots of files or creating various subdirectories for your output from different data sets.
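As a quick sketch of that kind of scripting (the 'results' directory and its filenames here are made up for illustration), renaming a batch of files takes only a few lines:

```python
import os

# make a hypothetical 'results' directory with a few files in it
os.mkdir('results')
for i in range(3):
    open(os.path.join('results', 'data%d' % i), 'w').close()

# now rename every file in it at once, adding a '.txt' extension
for name in os.listdir('results'):
    os.rename(os.path.join('results', name),
              os.path.join('results', name + '.txt'))

print(sorted(os.listdir('results')))  # → ['data0.txt', 'data1.txt', 'data2.txt']
```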

You've already seen one key filesystem tool, the built-in function used to open filehandles, open(). Now let's import the os module and look at some of the tools it provides.
import os
# Let's make sure we know which working directory we are in
cwd = os.getcwd()
print cwd, 'is the current working directory.'
# And which files are present in the cwd? Note that the '.' represents the current
# working directory
cwdfiles = os.listdir('.')
# The output of listdir is stored as a list, so let's print it item by item.
for cwdfile in cwdfiles:
    print cwdfile
# And we can look in other directories too. Note that you'll have to change
# the dir below to one that exists on your filesystem.
homefiles = os.listdir('/Users/yournamehere')
for file in homefiles:
    print file
# We could also first change our cwd to a different directory, and then get the file list.
os.chdir('..')
print os.getcwd(), 'is now the current working directory'
cwdfiles = os.listdir('.')
for file in cwdfiles:
    print file

We can also use the os module to manipulate the filesystem.

import os
# Let's make some new directories and files
numbers = [1, 2, 3]
letters = ['a', 'b', 'c']
for num in numbers:
    for let in letters:
        os.mkdir(str(num) + let)
# Let's get all our new directories in a list
diritems = os.listdir('.')
# And let's make some files in them
teachers = ['Mike', 'Peter', 'Aaron']
for diritem in diritems:
    # using what we've learned about exceptions and testing
    try:
        os.chdir(diritem)
        for teach in teachers:
            for num in numbers:
                try:
                    open(teach + '_' + str(num), 'w')
                except IOError:
                    print "caught an error attempting to open the new file", \
                        teach + '_' + str(num)
                except:
                    print "Caught a general exception attempting to open the new file", \
                        teach + '_' + str(num)
        os.chdir('..')
    except OSError:
        print "caught an error from module OS, possible that", diritem, \
            "isn't a directory after all."
    except:
        print "caught a general exception trying to change to directory", diritem

Whew, what a mess. Let's clean up all those unimportant files, leaving only the important ones.
import os
diritems = os.listdir('.')
print diritems
for diritem in diritems:
    #print "diritem is " + str(diritem)
    #print "type is " + str(type(diritem))
    try:
        os.chdir(diritem)
        subdiritems = os.listdir('.')
        for subitem in subdiritems:
            if ( subitem.startswith('Mike') ):
                continue  # keep the important files
            os.remove(subitem)
        os.chdir('..')
    except OSError:
        print "caught an error from module OS, possible that", diritem, "isn't a directory after all."
    except:
        print "caught a general exception trying to change to directory", diritem

Okay, enough with files for now. Let's look at how we can execute UNIX commands from python. We'll start with the simplest, dullest, most boringest way: using os.system()

import os
os.system('ls -l')

$ ./
total 16
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 1a
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 1b
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 1c
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 2a
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 2b
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 2c
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 3a
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 3b
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 3c
-rw-r--r-- 1 lawson staff 731 Jul 28 22:52
-rw-r--r-- 1 lawson staff 1126 Jul 28 22:49

Okay, so we can see the output of the UNIX command ls -l from the script, and compare it to the regular terminal:

$ ls -l
total 16
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 1a/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 1b/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 1c/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 2a/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 2b/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 2c/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 3a/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 3b/
drwxr-xr-x 5 lawson staff 170 Jul 28 22:52 3c/
-rw-r--r-- 1 lawson staff 731 Jul 28 22:52
-rw-r--r-- 1 lawson staff 1126 Jul 28 22:49

In my case, trailing slashes are the only difference. But this is not the same as using os.listdir('.'). The two differences are that os.listdir('.') does not have access to the detailed information that the regular command line ls -l provides, and that os.listdir('.') returns a list of the directory items, not merely a string of the command line output. But this was a truly trivial example, so let's move on to, well, another truly trivial example.
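If you do want the detail that ls -l shows, the os module can recover it too. Here's a minimal sketch using os.stat(), a function not covered above, which reports size, modification time, and more for each item:

```python
import os
import time

# os.listdir() only gives names; os.stat() fills in the ls -l style
# details: size in bytes, last-modification time, and more
for name in os.listdir('.'):
    info = os.stat(name)
    print(name + '\t' + str(info.st_size) + '\t' + time.ctime(info.st_mtime))
```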

import os
#remember our script from the first of the week?
os.system('./')  # substitute the name of your own hello-world script here

$ ./
hello "world", if that is your real name.
That's World, to you

Yep, it's still there. And we can use another program to run it. We could run it hundreds of times this way, if we were truly trivial people. Hrmmm...

Just like opening files, it's a good idea to look for exceptions in the process of executing system calls. This is implemented very simply with os.system(), as the function will return a non-zero value if the call fails to execute.

import os
dirs = ['.', 'asdjfaklsdfjlksadjf']
for d in dirs:
    if ( os.system('ls ' + d) != 0 ):
        print "Whoa, that didn't work out, which ls was happy to tell us about."
    else:
        print "Well, okay, great. Thanks ls."

$ ./
1a 1b 1c 2a 2b 2c 3a 3b 3c Matt_2 S5.2 test
Well, okay, great. Thanks ls.
ls: asdjfaklsdfjlksadjf: No such file or directory
Whoa, that didn't work out, which ls was happy to tell us about.

But this really is the simple way. Often when we are using python to run other programs, the programs we are executing may finish at different times. When that's the case, we may want to control the order in which they are executed, or merely the number of subprocesses running at once. These days, even laptops have multiple CPUs in them, but we still might not want to dump hundreds of subprocesses on our computer at once. The subprocess module allows us to be a bit more intelligent about how we execute system calls by using a method called subprocess.Popen(). This is somewhat different, however, in that the subprocess we want to execute must be created as an object, which we can then monitor with the methods of this object.

# let's chop the module name down to sp for the purposes of typing fewer characters
import subprocess as sp
# let's create a subprocess without assigning it to a variable, and see what happens
sp.Popen( ['ls', '-l'] )

We're going to use a little UNIX tool called time, which will tell you how long it took to run your script.

$ time python
total 24
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1c
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2c
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3c
-rw-r--r--@ 1 matt matt 3586 Jun 3 20:07 S5.2
-rwxr-xr-x 1 matt matt 219 Jun 3 23:01
-rw-r--r-- 1 matt matt 529 Jun 3 22:57 test

real 0m0.149s
user 0m0.114s
sys 0m0.028s

And we see that the script outputs the same as os.system() did, but with the added "time" info on the last three lines.

import subprocess as sp
for each in range(0, 100):
    a = sp.Popen( ['ls', '-l'] )

When we run this code, we see that it takes about half a second of real-life clock time to run, and about the same for the "system" clock. The number in the middle, the "user" time, is about half that, because my laptop has two CPU cores, over which the work has been roughly divided.

$ time python

real 0m0.510s
user 0m0.269s
sys 0m0.490s

import subprocess as sp
for each in range(0, 100):
    a = sp.Popen( ['ls', '-l'] )
    a.wait()

The wait() method of the subprocess object tells the system to wait until this subprocess has completed before doing anything else in the program. When we look at the time output for this code, we see that it takes much more time on the clock (almost three quarters of a second), but the same amount of user and system time. This is because while the program waits for the process, the second CPU core cannot be crunching away on another process. While this may seem inefficient (it certainly is), it is also courteous to the other processes on your computer. In other words, if you're using your laptop for anything other than running your program, you may want to not grind everything to a halt for the duration of your program's runtime (which for me is often hours and hours). And if you're using a server to run your program on, your coworkers may appreciate it if you do not make use of every CPU on the server while they're trying to get their work done. In truth, the operating system will make its own attempt at balancing the needs of processes, but especially in the case where you're trying to use your laptop or PC as a workstation, this can save you some frustration.

$ time python

real 0m0.735s
user 0m0.272s
sys 0m0.459s
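One deliberate way to put wait() to work is to launch jobs in fixed-size batches, so that no more than a couple of subprocesses ever run at once. This is only a sketch of the idea, with ls standing in for a real long-running program:

```python
import subprocess as sp

# ten hypothetical jobs; each entry is a command-line argument list
jobs = [['ls', '-l']] * 10
batch_size = 2  # never run more than two at once

returncodes = []
for i in range(0, len(jobs), batch_size):
    # start one batch of subprocesses...
    procs = [sp.Popen(cmd, stdout=sp.PIPE) for cmd in jobs[i:i + batch_size]]
    # ...and wait for all of them before starting the next batch
    for p in procs:
        p.communicate()  # also drains the PIPE so the process can't block on it
        returncodes.append(p.returncode)

print(returncodes)
```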

And just to show that it really is the CPU sharing that's driving the effect:

import subprocess as sp
# cut the number of subprocesses in half, then run two sets of them in parallel
for each in range(0, 50):
    a = sp.Popen( ['ls', '-l'] )
    b = sp.Popen( ['ls', '-l'] )
    a.wait()
    b.wait()

And we see that our runtime is almost exactly what it was in the first case.

$ time python

real 0m0.506s
user 0m0.270s
sys 0m0.491s

Okay, now that we know how to run the subprocess, and we know that we have to create a subprocess object if we want to do things like wait(), let's look at other advantages of using this type of object.

import subprocess as sp
# let's say we want to store our output to a file instead of seeing it on the screen
fh = open('outfile', 'a')
proc = sp.Popen(['ls', '-l'], stdout=fh)

And if we run the script and then look at the contents of the file:

$ python
$ cat outfile
total 16
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1c
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2c
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3c
-rw-r--r--@ 1 matt matt 3586 Jun 3 20:07 S5.2
-rw-r--r-- 1 matt matt 0 Jun 3 23:45 outfile
-rwxr-xr-x 1 matt matt 250 Jun 3 23:45

There's all our output, which can be much more convenient than having it scroll down the screen.
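Error output can be collected the same way: the stderr argument also accepts a filehandle, and the special value sp.STDOUT merges the two streams into one file. A small sketch:

```python
import subprocess as sp

# send both stdout and stderr of the subprocess to a single log file
fh = open('outfile', 'a')
proc = sp.Popen(['ls', '-l'], stdout=fh, stderr=sp.STDOUT)
proc.wait()
fh.close()
```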

However, what can be even *more* convenient is having it stored in a data structure in memory.

import subprocess as sp
proc = sp.Popen(['ls', '-l'], stdout=sp.PIPE)
# The sp.PIPE is a mechanism that allows the output to be redirected to
# wherever you point it. In this case, to our subprocess object named proc.
# The next step is to get the data out of the object and into a data structure
# that we can organize however we see fit.
# we use the method below to get a tuple of two things: the stdout and the stderr
proc_output = proc.communicate()
print "------------------------"
print "We got a tuple\nWith the first part being:\n", proc_output[0], \
"\nand the second being:\n", proc_output[1]
print "------------------------"
# Once we use the .communicate() method, the stdout and stderr are gone from the proc object.
print "------------------------"
print "and if we don't format anything, we see the tuple like this:\n", proc_output
print "------------------------"
# But we could easily convert this into a friendlier structure
stdoutlist = proc_output[0].split('\n')
print "------------------------"
print "And now we see the first half of the tuple as a list of lines" \
+ "of output from our subprocess:\n", stdoutlist
print "------------------------"

I've intentionally kept this ls -l example set very simple, in terms of what command we are actually executing, because it's important to see how this object type works. However, typically this tool would be used to run things that are going to generate very large sets of data. For you, that might be a systematic search of a database like the PDB, or large-scale BLAST searches, multiple sequence alignments, searches for transcription factor binding sites in genomes, or a host of other common but computationally expensive bioinformatic problems. Or, for the foreseeable future, it may just be the most reliable way to organize small sets of output data into a data structure or log file.


Good ol' Basic Local Alignment Search Tool, now that's bioinformatics even Grandma can get down with. In its most basic form, BLAST takes a short sequence of either amino acids or nucleotides, searches a database of reference sequences, and returns the most likely matches for the query sequence. It's brilliant, and the only trouble is that we're often interested in BLASTing lotsa query sequences, and much less interested in the mundanity of parsing the output that comes back to us. Fortunately, we can easily script the execution of BLAST such that we can run queries many times with different sequences and settings, and also parse the output back into dictionaries to manipulate and even store long-term on the disk.

The first part of this section will show us how to install a new program in our UNIX environment, and the second part will let us practice using Popen() to script the execution of BLAST.

Installing BLAST

I suspect most everyone here is accustomed to using the NCBI BLAST web interface, but BLAST is just another program, whether it's running on an NCBI web server or on our laptops. Before we get back to python, let's take a minute to install BLAST the old-fashioned way, such that we have ready access to it from anywhere on our computers.

In the PythonCourse/programFiles directory, you should find a tar.gz file containing the BLAST executable for your platform.

The .tar.gz file extension tells us that the file has been archived (the .tar part) and compressed (the .gz part). The UNIX command to uncompress and expand the file is:

$ tar zxvf [filename]

For example:

$ tar zxvf ncbi-blast-2.2.25+-universal-macosx.tar.gz
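Incidentally, Python can do this step itself with the standard-library tarfile module, which is handy if you ever need to script the unpacking. A sketch, wrapped in a function so the archive name can vary (call it with the name of the archive you downloaded):

```python
import tarfile

def untar(archive, dest='.'):
    """The equivalent of `tar zxvf archive`: uncompress and expand
    a .tar.gz file into the destination directory."""
    tf = tarfile.open(archive, 'r:gz')
    tf.extractall(dest)
    tf.close()
```

For example, untar('ncbi-blast-2.2.25+-universal-macosx.tar.gz') would reproduce the command above.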

We now have a new directory with all those uncompressed files in it, so let's change to it:

$ cd ncbi-blast-2.2.25+
$ ls
ChangeLog LICENSE README bin/ doc/ ncbi_package_info

We've got a ChangeLog file that will tell us what version of BLAST we have, a doc directory with documentation, and a bin directory that contains the actual binary executables for the programs. If we change to the bin directory and take a look, we'll see several files:

$ cd bin
$ ls
blast_formatter*    blastdbcmd*   blastx*             psiblast*     segmasker*
blastdb_aliastool*  blastn*       convert2blastmask*  makeblastdb*  rpsblast*
blastdbcheck*       blastp*       dustmasker*         makembindex*  rpstblastn*
tblastn*            tblastx*      windowmasker*

These are the executable programs that BLAST uses to do its dirty work. You might recognize some of these names from the NCBI web site options (e.g. blastn vs. tblastx, etc.). We want to set up a local directory to install these program files into, which we'll do in a new subdirectory of our home directory.

First, let's change back to our home directory:

$ cd
$ pwd

Now, we can simply move the un-tarred BLAST directory into our home directory:

$ mv PythonCourse/programFiles/ncbi-blast-2.2.25+/ .
$ ls ncbi-blast-2.2.25+
ChangeLog LICENSE README bin doc ncbi_package_info

There we go; now the files are in a directory directly beneath our home directory.

Now, in order to be able to use these programs from anywhere in our directory structure, we need to add them to our executable PATH. We did something very similar to this during our installation of Aquamacs and Chimera, so this may look familiar. Copy these lines one at a time to the terminal (make sure to exclude the $, it's just there to symbolize your prompt).

$ echo 'PATH=$PATH:~/ncbi-blast-2.2.25+/bin' >> ~/.bash_profile
$ echo 'export PATH' >> ~/.bash_profile

Two more steps until we're ready to roll. First, check your bash profile:
$ less .bash_profile
Your bash profile should look something like this:
alias aquamacs='/Applications/'
alias chimera='/Applications/'
export EDITOR=aquamacs
PATH=$PATH:~/ncbi-blast-2.2.25+/bin
export PATH

Finally, reload your profile by entering the following (or just close the terminal window you were working in and open a new one):
$ source ~/.bash_profile

Now, if all went well, we should be able to execute one of our BLAST programs and get some feedback:

$ blastn
BLAST query/options error: Either a BLAST database or subject sequence(s) must be specified

If you see this message, then you've properly installed BLAST. Good job.

Okay, now that we have these programs available, let's start using them.

Since BLAST is going to compare a query sequence to a database of sequences, collected from, say, a genome or a set of genomes, we first need to create the reference database to BLAST against. Many of these are available for download, but we're going to create our own from the S. cerevisiae genome. Go ahead and navigate to the FTP server for BLAST, which you can connect to through your favorite web browser; log in as a guest and copy yeast.aa.gz and yeast.nt.gz from the FASTA folder into ~/PythonCourse/data/yeastFASTA.
Since these files came compressed with gzip, we're going to uncompress them and move them into our working directory, which we'll return to now (i.e. our S6.1 directory).

$ cd ~/PythonCourse/S6.1

And now to uncompress the files into our working directory.
$ gunzip ~/PythonCourse/data/yeastFASTA/yeast.nt.gz
$ mv ~/PythonCourse/data/yeastFASTA/yeast.nt .
$ gunzip ~/PythonCourse/data/yeastFASTA/yeast.aa.gz
$ mv ~/PythonCourse/data/yeastFASTA/yeast.aa .

And one more step before we can use our new BLAST installation: we have to format the sequences we just moved into a database that BLAST can use, with the makeblastdb command. The -dbtype argument specifies the type of sequence we're formatting (the .nt extension stands for nucleotide), and 'nucl' tells the program to format the database as nucleic acid sequence. The -in argument specifies the name of the input file, in this case yeast.nt.

$ makeblastdb -dbtype nucl -in yeast.nt

Now we have a fully operational BLAST installation and a yeast nucleotide database to search against. We can see that BLAST has created some additional files for our yeast database in our working directory:

$ ls
1a 1c 2b 3a 3c outfile yeast.nt yeast.nt.nin
1b 2a 2c 3b yeast.aa yeast.nt.nhr yeast.nt.nsq
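Formatting a database is itself just another system call, so it can be scripted with the same machinery we've been using. Here's a sketch that only builds the argument list; the Popen() line is commented out, since you've already formatted the database by hand:

```python
import subprocess as SP

def makeblastdb_cmd(infile, dbtype='nucl'):
    # build the argument list for a makeblastdb call
    return ['makeblastdb', '-dbtype', dbtype, '-in', infile]

cmd = makeblastdb_cmd('yeast.nt')
print(' '.join(cmd))  # → makeblastdb -dbtype nucl -in yeast.nt
# proc = SP.Popen(cmd)  # uncomment to actually format the database
# proc.wait()
```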

Now let's make a file to query the database with by opening a new text file and pasting in these lines:

>test seq

And then save the file as query.fasta

And.... here we go a scriptin'.

import subprocess as SP
# some arguments for running BLAST
program = 'blastn'
queryseq = 'query.fasta'
database = 'yeast.nt'
# When we create our subprocess below, we're going to supply a list as the first argument.
# The list will contain the name of our program, in this case, blastn
# We'll also use the list to supply the argument flags and their
# arguments for the program.
# This is equivalent to running the following command on the command line:
# blastn -query query.fasta -db yeast.nt
proc = SP.Popen([program, '-query', queryseq, '-db', database], stdout=SP.PIPE)
# Since we used the PIPE to store the blastn output, we can retrieve it with .communicate()
output = proc.communicate()
# note that this is returned to us a tuple, even though we only requested
# to communicate with the stdout when we called Popen().
# To see the output in a reasonable format, try:
outlist = output[0].split('\n')
for line in outlist: print line

You can imagine how you would set a list of query sequences, or a series of databases to query against, or even vary some of the BLAST settings (for example, the minimum E-value to report), and run this again. You also might care which species your hit came from (if you were searching against a larger database), or which chromosome hits were on (to determine synteny, e.g.)
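As a sketch of that kind of batching (the query filenames and E-value cutoffs here are made up), the argument list can simply be rebuilt inside a loop; only the Popen() line, commented out here, would actually launch each search:

```python
# hypothetical query files and E-value cutoffs to sweep over
queries = ['query1.fasta', 'query2.fasta']
evalues = ['1e-04', '1e-08']

commands = []
for q in queries:
    for e in evalues:
        cmd = ['blastn', '-query', q, '-db', 'yeast.nt', '-evalue', e]
        commands.append(cmd)
        # proc = SP.Popen(cmd, stdout=SP.PIPE)  # would launch this search

print(len(commands))  # → 4
```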

In this case, we've used the PIPE to capture the output, which might be helpful when we want to systematically sort through the results in each BLAST hit, only storing the results in certain circumstances. However, we can also write the results to an output file instead of the PIPE. Let's try that:

import subprocess as SP
# some arguments for running BLAST
program = 'blastn'
queryseq = 'query.fasta'
database = 'yeast.nt'
# open a filehandle for our output file, we'll use the append flag in case
# we want to collect multiple queries worth of output
try:
    outfh = open('blastn_output', 'a')
except IOError:
    print "Error opening output file: blastn_output"
proc = SP.Popen([program, '-query', queryseq, '-db', database], stdout=outfh)
proc.wait()
Now you can cat the blastn_output file and see the results.


Exercises

1) Modify your Popen() call to set a minimum E-value of 1e-08. Parse the output of your BLAST hits into a dictionary keyed by chromosome, containing the length of the hit, the E-value, and the percent identity of the hit.

HINT: check out the BLAST options by issuing the command blastn -help. There are output options that will greatly facilitate your parsing efforts.

2) Write a script using the genetic code (provided here in parse-able text format) to translate your nucleotide sequence into proteins. Then use blastp instead of blastn to search against the amino acid sequence file we downloaded earlier. Remember to make the blast database first! (protein flag = "prot")



Solutions

import subprocess as SP
# some arguments for running BLAST
program = 'blastn'
queryseq = 'query.fasta'
database = 'yeast.nt'
# Note the addition of arguments to the blastn call 
proc = SP.Popen([program, '-query', queryseq, '-db', database,
                 '-evalue', '1e-08', '-outfmt', '6'],
                stdout=SP.PIPE)
output = proc.communicate()
outlist = output[0].split('\n')
chrs = {}  # set up our dictionary of chromosomes
for line in outlist:
    line = line.strip().split()
    if line:  # only process non-blank lines
        chr = line[1]
        pident = float(line[2])
        matchlen = int(line[3])
        evalue = float(line[10])
        if chr in chrs:
            chrs[chr].append((pident, matchlen, evalue))
        else:
            chrs[chr] = [(pident, matchlen, evalue)]
print chrs

#!/usr/bin/env python
import subprocess as SP
from fastaParser import fastaParser
def parse_nuctable(filename):
    fh = open(filename)
    codon_table = {}
    for line in fh:
        line = line.strip()
        if line:  # skips blank lines
            data = line.split()
            aa = data[0]   # something like 'Ala/A'
            oneletter = aa.split('/')[1]
            codons = data[1:]
            for codon in codons:
                codon = codon.strip().replace(',', '')
                # turns 'GCU, ' into 'GCU'
                codon_table[codon] = oneletter
    return codon_table
def seq_to_codons(sequence):
    codons = []
    i = 0
    # step through three bases at a time, ignoring any trailing partial codon
    while i + 3 <= len(sequence):
        codons.append(sequence[i: i+3])
        i += 3
    return codons
def translate_sequence(nt_sequence):
    aa_sequence = ""
    codon_table = parse_nuctable('nuctable.txt')
    nt_sequence = nt_sequence.replace('T', 'U')
    # nuctable uses the RNA sequence, not the DNA
    for codon in seq_to_codons(nt_sequence):
        aa_sequence += codon_table[codon]
    return aa_sequence
def translate_file(filename):
    """ Translates the given file from a fasta with DNA genes to the
    corresponding aa sequence, outputs it to a file, then outputs the name of
    the file you output to."""
    gene_dict = fastaParser(filename)
    # note that we don't preserve order of genes in the original file
    outname = filename + '.aa.fasta'
    outfile = open(outname, 'w')
    for gene in gene_dict:
        outfile.write('>%s\n' % gene)
        outfile.write('%s\n' % translate_sequence(gene_dict[gene]))
        # doesn't break up the lines nicely... oh well
    return outname
program = 'blastp'
queryseq = 'query.fasta'
database = 'yeast.aa'
queryseq = translate_file(queryseq)
# Note: this uses the new BLAST+ blastp command we installed earlier; if you
# have the old blastall-style BLAST installed instead, adjust the program
# name and arguments as appropriate.
proc2 = SP.Popen([program, '-query', queryseq, '-db', database],stdout=SP.PIPE)
output = proc2.communicate()
print output[0]