Manipulating Files and Processing Text


  • Basic text processing with split, join, and partition
  • Text testing with endswith(), startswith(), find()
  • Text conversion with swapcase(), replace(), upper(), and lower()
  • Opening and closing filehandles
  • Reading from the filehandle with read(), readline(), and readlines()
  • Reading from the filehandle iterable
  • Writing or appending to a file with write() and writelines()
  • Writing to a file with a loop


We've learned so far how to write programs that make many, many decisions with an ordered logic to process information. What we've lacked thus far is a way to input and output large volumes of data. In addition to manipulating large amounts of data with functions that open, read, write, and close files, we'll also benefit from learning about Python's marvelously powerful abilities to process text. Not to malign the now-dead king of text-processing languages, Perl (The King is Dead! Long Live the King!), but Python really cleans house with its unparalleled text-processing abilities with respect to both speed and ease of use.

Basic Text Processing

Systematically manipulating large text files is one of the most common tasks you will encounter. The most basic tools for this task are the built-in Python string methods. These allow us to convert between strings and lists, test the properties of strings, and modify strings.

Informative Interlude: getting ahead of ourselves with methods vs. functions

Later today, we're going to learn all about writing our own functions to process information. These will be sets of logic that consider variables and manipulate them according to the logic that we assign. In a sense, the functions are formally encapsulated manifestations of the sorts of things we've been writing with our scripts all week.

But, as we're going to see with strings, many types of objects have special built-in functions. We call these endemic functions methods, and in a broader discussion of object-oriented programming practice and theory, we would have much, much more to say about them. However, we're not getting into the object-oriented universe or philosophy here, so you'll have to take as explanation simply that some objects are so routinely manipulated with the same sorts of operations that it pays to have functions dedicated to their processing. In the case of strings and files today, we'll see the methods that routinely operate on these types.

Whereas a function is written to accept variables and arguments to manipulate those variables with, a method already exists for the object under manipulation and is called differently. Whereas a function such as print is called by typing print(string_variable), etc., a method is called by typing a period and the name of the method at the end of the object. For example, if print were a method, it would be called like this: string_variable.print(). Notice that there are still () at the end of the name of the method, and methods can accept arguments just like functions. If all this seems eerily familiar, it may be because we've already seen the list methods append() and extend() earlier in the week. All apologies if this seems out of order and confusing, but we'll see how these concepts interoperate in more detail as the week progresses. This is why these paragraphs are in an I.I. after all...
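To make the distinction concrete, here is a minimal sketch that applies one built-in function and two string methods to the same string. The variable names are made up for illustration, and print is written with parentheses around a single value so that the snippet behaves the same under Python 2 and Python 3.

```python
# 'greeting' is a made-up example string
greeting = "hello, world"

# a function takes the object as an argument
length = len(greeting)

# a method is called on the object itself, after a period
shouted = greeting.upper()

# methods can accept arguments, just like functions
pieces = greeting.split(", ")

print(length)
print(shouted)
print(pieces)
```

Note that len() would work on a list just as well, whereas upper() and split() only exist on strings: that is the sense in which methods "belong" to a type.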


Let's consider the task of converting a character string of a sentence into a list of words separated by spaces and punctuation marks:

delimiter = ","
string_to_split = "I am a well-written sentence, and so I \
 dependably have punctuation. "
list_from_string = string_to_split.split(delimiter)
print "clause one %s" % list_from_string[0]
print "clause two %s" % list_from_string[1]

Note that as we've split with a comma, the comma doesn't appear in our list. We can try out what happens with different arguments to split().

# we don't need to specify the delimiter in a different variable
list_from_string = string_to_split.split(' ')
for word in list_from_string:
     print word
list_from_string = string_to_split.split('a')
for vowel_handicapped_lump in list_from_string:
     print vowel_handicapped_lump

You might also want to take a string and turn it letter-by-letter into a list. Although this isn't done by split(), it fits nicely here:

list_from_string = list(string_to_split)
for letter in list_from_string:
     print letter

split() can also take a second argument (see, as always, the string methods documentation): you can specify how many times you want to split.

list_from_string = string_to_split.split(' ', 3)
for item in list_from_string:
     print item

Now let's see what happens when two delimiters are next to each other:

list_from_string = string_to_split.split('t')
for consonant_crippled_lump in list_from_string:
     print consonant_crippled_lump

We can see that we have an empty string in our list: "written," in particular, was split into three parts: ["...wri","","en..."]. If delimiters are adjacent to each other, split() will find the empty string between them and give it to you at the appropriate spot. It's a very one-hand-clapping-in-a-forest sort of thing.
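Here is the same effect in isolation, as a small sketch on the single word "written" (the variable names are made up for this illustration). If the empty strings are unwanted, one option is to filter them out with a loop afterwards.

```python
word = "written"

# the two adjacent t's leave an empty string between them
parts = word.split("t")
print(parts)

# keep only the non-empty pieces
non_empty = []
for piece in parts:
    if piece != "":
        non_empty.append(piece)
print(non_empty)
```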

However, there is an exception to this. If you glanced at the split() documentation, you might have noticed that all of its arguments are, in fact, in brackets. That means that it doesn't need arguments to run: it has a default behavior.

# this should look the same as splitting by spaces
list_from_string = string_to_split.split()
for item in list_from_string:
     print item
# this is not the same as splitting by spaces -- no empty items!
string_to_split = "   this      is    a   different                         string"
list_from_string = string_to_split.split()
for item in list_from_string:
     print item
string_to_split = '''   complete
\t\t whitespace                      chaos
             !!!!!!!!!!!         '''
list_from_string = string_to_split.split()
for item in list_from_string:
     print item

We see that the default behavior of split() is to:
  1. Remove all kinds of whitespace from the beginning and end of the string.
  2. Condense all adjacent whitespaces to single space characters.
  3. Split on those spaces.

This turns out to be really handy. For instance, if you're using someone else's table, and, as happens more often than you might want to think, they've done a poor job delimiting their fields systematically with whitespace, this cleans things up quickly and easily in just one line.
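As a rough illustration of that cleanup, the sketch below splits one badly delimited table row both ways. The row itself is invented for the example; it mixes spaces, a tab, and a trailing newline.

```python
# a made-up table row with sloppy whitespace delimiting
messy_row = "  gene1 \t 42    0.05\n"

# splitting on a single space keeps empty strings, the tab, and the newline
by_space = messy_row.split(' ')
print(by_space)

# the default split() treats any run of whitespace as one delimiter
by_default = messy_row.split()
print(by_default)
```

Only the default form gives back exactly the three fields a reader of the table would expect.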

You'll learn to extend this power of whitespace to other characters, sets of characters, and all sorts of exotic delimiters.

The split() method being popular, it has a few hangers-on:

toes = '''went to the market
stayed home
had roast beef
had none
cried wee wee wee all the way home'''
# splitlines splits on linebreaks
list_from_string = toes.splitlines()
for toe in list_from_string:
     print "this little piggy %s" % toe
# from the end of the string
last_toe = "and _this_ little piggy went wee wee wee all the way home"
# when given a second argument, reverse split counts
list_from_string = last_toe.rsplit(' ',7)
for item in list_from_string:
     print item

Though the partition() method isn't named after split(), it's a very similar method. partition() works a lot like split(delimiter,1), taking a delimiter and splitting at the first instance. However, while split(delimiter,1) will return either a list of length two (if it split successfully) or a list of length one (if it didn't), partition() will always return a tuple of length three: the text before the delimiter, the delimiter itself, and the text after it. Let's look at the output.

rhyme = '''There was a crooked man
Who walked a crooked mile.
He found a crooked sixpence
Against a crooked stile.
He bought a crooked cat
Which caught a crooked mouse,
And they all lived together
In a crooked little house.'''
# you can split on words as well as single letters and symbols
split_list = rhyme.split('crooked',1)
print "List output:"
for item in split_list:
     print item
partition_list = rhyme.partition('crooked')
print "Partition output:"
for item in partition_list:
     print item

What if the delimiter doesn't occur within the string?

split_list = rhyme.split('happiness',1)
# I mean, this is like the nursery-rhyme
# equivalent of hangin' under the BART tracks in
# west Oakland.
print "List output:"
for item in split_list:
     print item
partition_list = rhyme.partition('happiness')
print "Partition output:"
for item in partition_list:
     print item

This can be useful if you are looking for that second item, but you're not sure if it's going to be there. The string could be user generated or read in from a file, and you want to gracefully do one thing if it's there and another if it's not. split() can be less than graceful about this:

if rhyme.split('happiness')[1]:
     # if it's there you're all good
     # if it isn't your program will crash with an IndexError
     pass
# vs
if rhyme.partition('happiness')[2]:
     # parse the wanted information out of it
     pass
else:
     # wait until the next line
     pass
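One way to lean on this guarantee is to unpack the three pieces into three names at once, then test the middle one to learn whether the delimiter was found. The sketch below assumes made-up "key=value" strings; none of these variable names come from the lesson.

```python
# made-up example strings, one with the delimiter and one without
with_delim = "chromosome=I"
without_delim = "chromosome I"

# partition() always hands back three pieces, so unpacking is safe
key, found, value = with_delim.partition("=")
print(key)
print(value)

key2, found2, value2 = without_delim.partition("=")
if found2:
    print("value is " + value2)
else:
    # the middle piece is empty when the delimiter is missing
    print("no '=' in this one")

# split() gives back a one-item list in the same situation
split_result = without_delim.split("=", 1)
print(len(split_result))
```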


So now we're pretty good at splitting things up, but how do we put things together again? join() takes care of that: it turns lists into strings. Surprisingly enough, it's not a method of lists. It's a string method, and it relies on the delimiter to know how to put lists together. This little surprise renders the syntax of join() to be among the most unintuitive of all syntactic trifles, but we will persevere if we concentrate on the fact that just like split(), join() is a method of strings.

broken = ['hu','m','pty',' du','mpty']
all_the_kings_horses = 'n~n*^'
all_the_kings_men = '>+O'
first_try = all_the_kings_horses.join(broken)
second_try = all_the_kings_men.join(broken)
if (first_try == 'humpty dumpty') or (second_try =='humpty dumpty'):
     print 'hooray!'
else:
     print '''All the king's horses and all the king's men
  couldn't put Humpty together again'''

Unlike split(), which raises an error if you hand it an empty delimiter, join() can usefully use the empty string: it glues the components of the list directly together.

third_try = ''.join(broken)
print third_try
# 'nothing' can put poor Humpty together again

This is in fact the usual way to use join() -- you don't need to declare a separate variable to act as the glue.

fairy_tale_characters = ['witch','rapunzel','prince']
plot = 'hair'.join(fairy_tale_characters)
print plot
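Since split() and join() are opposites, a common pattern is to split on one delimiter and join on another. The sketch below does that with the fairy tale characters, and also shows one catch worth knowing: join() only accepts strings, so numbers have to be converted with str() first. The variable names are invented for the example.

```python
# split on commas, then glue back together with tabs
comma_line = "witch,rapunzel,prince"
fields = comma_line.split(",")
tab_line = "\t".join(fields)
print(tab_line)

# join() needs strings, so convert numbers before joining
counts = [3, 1, 4]
count_strings = []
for count in counts:
    count_strings.append(str(count))
joined_counts = ",".join(count_strings)
print(joined_counts)
```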

Testing Text: startswith(), endswith(), and find()

We just saw how you can use an if statement to test for the presence of a delimiter with partition(). There are other tests you will often be interested in, for example asking if a string begins with, ends with, or contains a substring of interest.

#!/usr/bin/env python
id_number = '1131431a'
# let's see if the id_number string starts with the number one
if (id_number[0] == '1'):
    print "this id starts with a 1!"
# now let's use the string method startswith()
if ( id_number.startswith('1') ):
    print "this id starts with a 1!"
# and here's the endswith() method
if ( id_number.endswith('1') ):
    print "This id number ends with a 1!"
else:
    print "This id number doesn't end with a 1 at all!"
# and these methods can get a little fancier by having multiple things to
# test for if you provide a tuple of characters
if ( id_number.endswith( ('1', 'a') ) ):
    print "this id number ended with either an 'a' or a '1' "

Or maybe we don't care what the string starts or ends with as long as it contains a substring of interest. For this, we can use the find() method, which returns the index at which the substring starts. But be careful when you write if tests using find(): when the substring is missing, find() returns the integer -1, which is not zero and thus passes the if test as True. Worse, when the substring sits at the very beginning of the string, find() returns 0, which the if test treats as False.

beatles = "johnpaulgeorgeandringo"
# the wrong way
if ( beatles.find('paul')):
    print "At least we've got a bassist."
else:
    print "Anyone here play bass?"
# let's do a comparison for -1 instead
if not (beatles.find('paul') == -1):
    print "At least we've got a bassist"
else:
    print "Well, I guess we're a three piece."
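To see exactly which values find() hands back, here is a short sketch covering the three cases: found at the start, found in the middle, and not found. It also uses Python's in operator, which this lesson doesn't otherwise cover, as an alternative when you only need a yes-or-no answer. "pete" is just a made-up substring that isn't in the string.

```python
beatles = "johnpaulgeorgeandringo"

at_start = beatles.find("john")     # found at index 0
in_middle = beatles.find("paul")    # found at index 4
missing = beatles.find("pete")      # not found, so -1
print(at_start)
print(in_middle)
print(missing)

# used directly in an if test, 0 counts as False and -1 counts as True
print(bool(at_start))
print(bool(missing))

# the 'in' operator answers the yes/no question directly
has_john = "john" in beatles
has_pete = "pete" in beatles
print(has_john)
print(has_pete)
```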

Text Conversions

Systematically replacing the instances of a substring with a replacement substring may be a familiar task of tedium. Python has several methods for systematically converting characters in strings. The most general is the method replace().

beatles = 'johnpaulgeorgeandringo'
beatles = beatles.replace('george', 'MATT')
print beatles
# YES! I'm in!
beatles = beatles + "MOREMATT!"
print beatles.replace("MATT", "ADAM!")
print beatles
# and we can tell replace how many replacements to make, starting at the beginning
print beatles.replace("MATT", "ADAM!", 1)
print beatles
# but notice that replace() does not change the string in place; you have
# to reassign the variable to "save" the change
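The point about reassigning deserves a tiny demonstration of its own. In this sketch (the sequence is made up), the original string is untouched after replace() runs, and the result only survives because it is stored under a new name.

```python
# a made-up DNA sequence
sequence = "ATGCATGC"

# replace() returns a new string; 'sequence' itself is unchanged
rna = sequence.replace("T", "U")
print(sequence)
print(rna)

# the optional third argument limits how many replacements are made
first_only = sequence.replace("T", "U", 1)
print(first_only)
```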

Since Python is case sensitive, as are most UNIX-based bioinformatics programs you'll be interested in using, you may also find yourself wishing that all the text in your data was the same case. There are methods for both testing and converting cases.

# why not use something a touch relevant for a change
blast_hit = 'gattaca'   # placeholder value; any sequence string will do
if not ( blast_hit.isupper() ):
    blast_hit = blast_hit.upper()
# or if you prefer lower case
blast_hit = blast_hit.lower()
# or if you are (or the program you're writing is) indecisive
blast_hit = blast_hit.swapcase()
# and we might also be interested in these methods
if ( blast_hit.isalpha() ):
    print "we got all letters here"
else:
    print "whoa, something doesn't look like nucleotides!"
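A typical reason to convert case is to compare two strings without caring how they were capitalized. The sketch below, using two invented sequence strings, shows the comparison failing on the raw text and succeeding once both sides are lowercased.

```python
# two made-up hits that differ only in capitalization
hit_a = "ATGcgT"
hit_b = "atgCGt"

same_raw = (hit_a == hit_b)
same_folded = (hit_a.lower() == hit_b.lower())
print(same_raw)
print(same_folded)

# isalpha() is False as soon as one character is not a letter
clean = "ATGC".isalpha()
suspicious = "ATG-C".isalpha()
print(clean)
print(suspicious)
```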

Files and Filehandles

Now that we can process text, all we need is... more text. And odds are, that text is going to come in the form of a file, so it's high time that we start using them.

Opening filehandles

A filehandle is an object that controls the stream of information between your program and a file stored somewhere on the computer. Filehandles are not filenames, and they are not the files themselves. They are a tool that your program uses to interact with files, nothing more (for instance, deleting a filehandle in your script using the del command does nothing to the file that handle refers to).

We create filehandles in the simplest sense with the open() command:

fh = open('some_file')

where some_file is the path to a file (i.e. the filename) on your filesystem. In general, it is good practice to use absolute path nomenclature (e.g. /Users/aaron/some_file or /home/aaron/some_file), but you can be lazy if you know the file you want is going to be in the same directory as your program.

$ touch
#!/usr/bin/env python
fh = open('')
contents = fh.read()
print contents

$ ./
#!/usr/bin/env python

fh = open('')
contents = fh.read()
print contents

As you can see, the read() method of the filehandle just sucks in the whole file in a single string, newlines and all! This is quick and easy, for sure, but it's not necessarily the most orderly way to deal with the contents of a file.
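Here is a self-contained sketch of the same idea that doesn't depend on any particular file already existing. It is built on assumptions that are not part of the lesson: it uses the standard tempfile and os modules to make a scratch file, and it borrows the write() method from later in this section to fill it.

```python
import os
import tempfile

# make a scratch file in a temporary directory (an assumption of this
# example, so that it does not touch your working directory)
path = os.path.join(tempfile.mkdtemp(), "two_lines.txt")
fh = open(path, "w")
fh.write("first line\nsecond line\n")
fh.close()

# read() returns the whole file as one string, newlines included
fh = open(path)
contents = fh.read()
fh.close()

# repr() makes the newline characters visible
print(repr(contents))
```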

readline(), readlines(), and strip()

Copy the contents of the following snippet to a text file in your directory for this session, and save the file as pdb_head.


Then try the following:

#!/usr/bin/env python
filename = 'pdb_head'
fh = open(filename, 'r')
# the 'r' is for 'read-only', which will keep us from being able to alter
# this file with the filehandle we just created
print fh.readline()
print fh.readline()
lines = fh.readlines()
print lines

$ ./



While this is a bit of a mess, a few things should become apparent:
  1. fh.readline() takes in one line (and since print also supplies a newline, we've got an extra linebreak after each of the first two print statements).
  2. fh.readlines() (plural!) takes the entire file, from the current read position all the way to the end, giving back a list of lines (again, with newlines intact).
  3. This file has a bunch of whitespace cluttering things up at the end of each line.

All of these complications are easily resolved with the use of the strip() method whenever we actually make use of the lines we read:

#!/usr/bin/env python
filename = 'pdb_head'
fh = open(filename, 'r')
print fh.readline().strip()
print fh.readline().strip()
lines = fh.readlines()
lines[0] = lines[0].strip()
print lines

$ ./

Now the spaces and newlines are gone from the first two, and from the 0th element of the list I printed in the last print statement (since I only bothered to strip() and put back the 0th element).
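strip() has two one-sided siblings, lstrip() and rstrip(), which are handy when leading whitespace is meaningful and only the trailing clutter should go. A minimal sketch on a made-up padded line:

```python
# a made-up line with leading spaces, trailing spaces, and a newline
raw_line = "   HEADER    EXAMPLE      \n"

both_sides = raw_line.strip()
right_only = raw_line.rstrip()
left_only = raw_line.lstrip()

# repr() makes the remaining whitespace visible
print(repr(both_sides))
print(repr(right_only))
print(repr(left_only))
```

None of the three touch the whitespace in the middle of the string, between HEADER and EXAMPLE.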

One crucially important concept of file input in Python is that each time you read something by any of the three methods I've described, you advance the position of the filehandle in the file, which means that you never get the same character or characters twice (unless of course they're in the file twice!)

This is why reading from the filehandle with fh.readline() twice in a row gave two different values; as soon as the line is read, the filehandle has moved to the next line, awaiting another read request. This is an example of an iterable type, meaning that the filehandle is a type of object that knows how to advance itself in anticipation of the next request. That means that to get back to the beginning of the file, you must either close the file with the close() method and reopen it, or use the seek() method of the filehandle (which we don't have time to go into -- google is your friend!)
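For the curious, here is a brief sketch of the read position in action, including the one use of seek() you are most likely to want: seek(0), which jumps back to the start of the file. The scratch file, made with the standard tempfile and os modules, is an assumption of this example and not part of the lesson.

```python
import os
import tempfile

# build a two-line scratch file to read from
path = os.path.join(tempfile.mkdtemp(), "position_demo.txt")
fh = open(path, "w")
fh.write("line one\nline two\n")
fh.close()

fh = open(path)
first = fh.readline()     # advances past line one
second = fh.readline()    # advances past line two
leftover = fh.read()      # nothing left, so this is an empty string

fh.seek(0)                # jump back to the beginning of the file
first_again = fh.readline()
fh.close()

print(repr(first))
print(repr(second))
print(repr(leftover))
print(repr(first_again))
```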

While potentially a bit odd now, this behavior will be essential when we discuss reading file contents with loops.... oh, speaking of...

Reading files in a loop

Certainly one of the most common contexts in which you'll encounter for loops is in working your way through a file. You can just put together two things we've already seen to get to where we need to be:

#!/usr/bin/env python
fh = open('pdb_head')
lines = fh.readlines()
for line in lines:
    fields = []
    print '0th field: %s, 1st field: %s' % (fields[0],fields[1])

$ ./
0th field: HEADER, 1st field: OXI
0th field: TITLE, 1st field: SULF
0th field: COMPND, 1st field: MOL
0th field: COMPND, 1st field: 2 M
0th field: COMPND, 1st field: 3 C

This is starting to get a little fancier, but we're only doing things you've seen before: read all the lines in a file into a list, then iterate over the list, looking for a couple of different parts of the line, stripping off leading and trailing whitespace, then printing the first and second elements of the resulting list.
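The "parts of the line" idea can be sketched with string slices on fixed-width text. Everything below is an assumption made for illustration: the two lines are invented rather than taken from pdb_head, and the column boundaries (0 to 10 and 10 to 20) are chosen for this example only. The ljust() string method is used just to pad the made-up fields to a fixed width.

```python
# build two made-up fixed-width lines: each field is padded to 10 characters
lines = [
    "NAME".ljust(10) + "alpha".ljust(10) + "\n",
    "SCORE".ljust(10) + "42".ljust(10) + "\n",
]

parsed = []
for line in lines:
    # slice out each column, then strip the padding around it
    fields = [line[0:10].strip(), line[10:20].strip()]
    parsed.append(fields)
    print('0th field: %s, 1st field: %s' % (fields[0], fields[1]))
```

Slicing is the better choice over split() when a field may itself contain spaces, since the column positions, not the whitespace, define where fields begin and end.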

We can simplify this one more step using the fact that filehandles are iterable, and know what's being asked of them. So we can replace this:

lines = fh.readlines()
for line in lines:

with this:

for line in fh:

to exactly the same end.
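A small sketch of the direct form, looping over the filehandle to count lines and remember the longest one. The scratch file and its three lines are made up for the example, and the tempfile and os modules are used only to create it.

```python
import os
import tempfile

# build a three-line scratch file to loop over
path = os.path.join(tempfile.mkdtemp(), "loop_demo.txt")
fh = open(path, "w")
fh.write("HEADER one\nTITLE two words\nEND\n")
fh.close()

line_count = 0
longest = ""
fh = open(path)
for line in fh:            # no readlines() needed
    line = line.strip()
    line_count += 1
    if len(line) > len(longest):
        longest = line
fh.close()

print(line_count)
print(longest)
```

One practical difference: readlines() holds the whole file in memory as a list, while looping over the filehandle deals with one line at a time, which matters for very large files.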

Writing to Files

Writing output is sorta like doing the dishes. You just did all this work to cook up a fancy program and analyze some data, and the last thing you want to do is put all your answers away into clean little output files. Fortunately, we'll learn about pickle files later, but for now, we'd best make sure you know how to write output to a file.

The default behavior of the filehandle is to open the file supplied in read mode. However, by giving an additional argument, you can either add lines to the bottom of the specified file, or overwrite it entirely:

#!/usr/bin/env python
filename = 'test_out'
fh = open(filename, 'w')
# 'w' flag means "writeable"
fh.write('Historically, this lesson was used as a medium to hurl insults between')
fh.write(' Matt and our former labmate Brant.\n')
# note that we have to add the '\n' if we want it at the end of the line;
# this is in contrast to the print command's behavior.
fh.close()
filename = 'test_out2'
fh = open(filename, 'a')
# 'a' flag means "append"
fh.write("Unfortunately, I have no beef with Peter, so this section is a bit mundane.\n")
fh.close()

While this script doesn't print anything to the screen, if you run it a few times and look at the contents of test_out vs test_out2, the distinction between the 'w' and 'a' arguments to open() should become clear.
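If you'd rather see the difference without rerunning a script by hand, this sketch repeats the two kinds of open() three times in a loop and then counts the lines in each file. The file names and the scratch directory (made with the standard tempfile module) are assumptions of the example.

```python
import os
import tempfile

scratch = tempfile.mkdtemp()
w_path = os.path.join(scratch, "overwritten.txt")
a_path = os.path.join(scratch, "appended.txt")

# simulate running the same script three times
for run in range(3):
    fh = open(w_path, "w")    # 'w' empties the file every time it is opened
    fh.write("written\n")
    fh.close()
    fh = open(a_path, "a")    # 'a' keeps what is there and adds to the end
    fh.write("appended\n")
    fh.close()

fh = open(w_path)
w_lines = fh.readlines()
fh.close()
fh = open(a_path)
a_lines = fh.readlines()
fh.close()

print(len(w_lines))
print(len(a_lines))
```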

When reading files, the close() method is a good thing to keep in mind, but if you forget it, Python will close the file at the end of the program's execution. With writing files, however, Python may not make the changes you stipulate right away, so if you plan to evaluate the contents of the file you're writing in the same script (or for instance use that file for something else during the run of that script), it is wise to close the filehandle to ensure that all the write operations you've requested are performed.
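The sketch below writes a file, closes it, and only then reads it back. It also shows Python's with statement, which this lesson does not cover: it closes the filehandle automatically when the indented block ends. The file name and contents are made up, and tempfile is used only to get a scratch location.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "results.txt")

# write, then close so the text is guaranteed to be in the file
fh = open(path, "w")
fh.write("result: 42\n")
fh.close()

fh = open(path)
saved = fh.read()
fh.close()
print(repr(saved))

# 'with' closes the filehandle for us at the end of the block
with open(path, "a") as out:
    out.write("result: 43\n")
print(out.closed)

fh = open(path)
saved_again = fh.read()
fh.close()
print(repr(saved_again))
```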

While Python has no writeline() method, the other two read methods are mirrored for writing to files. The first, write(), you've already seen. It takes a string and puts it in a file. The only difference between this and writelines() is that writelines() takes a list of strings and writes them all. (But beware! If you want those strings to appear on separate lines, they had best all end with a \n!)

#!/usr/bin/env python
filename = 'test_out'
fh = open(filename, 'w')  # 'w' flag means "writeable"
lines = ["Adam is a friendly dude.\n", "You'd better be one too.\n"]
lines.extend(["Or next year, he might use this space\n", "to write a phish song about you.\n"])
# write every string in the list, one after another
fh.writelines(lines)
fh.close()

And check out the contents of test_out to see your many-line-writing machine in action!
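To see the newline warning in practice, this sketch writes the same list of names twice: once straight through writelines(), and once glued together with join() so that each name lands on its own line. The names, file names, and scratch directory are all just for illustration.

```python
import os
import tempfile

names = ["Nate", "Matt", "Aaron"]    # note: no '\n' on any of these
scratch = tempfile.mkdtemp()

# writelines() adds nothing between the strings
run_together_path = os.path.join(scratch, "run_together.txt")
fh = open(run_together_path, "w")
fh.writelines(names)
fh.close()

# joining on '\n' first puts each name on its own line
one_per_line_path = os.path.join(scratch, "one_per_line.txt")
fh = open(one_per_line_path, "w")
fh.write("\n".join(names) + "\n")
fh.close()

fh = open(run_together_path)
run_together = fh.read()
fh.close()
fh = open(one_per_line_path)
one_per_line = fh.read()
fh.close()

print(repr(run_together))
print(repr(one_per_line))
```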


Exercises

1. Pile of basic split drills:

  • Turn 'Humpty Dumpty sat on a wall' into ['Humpty','Dumpty','sat','on','a', 'wall']
  • Turn 'Humpty Dumpty had a great fall' into ['Humpty Dumpty had a ', ' fall']
  • Turn "All the King's horses" into ["All the King's hor",'e',''] (note: there is still an "s" at the end of "King's")
  • Turn "and all the King's men" into ['and a',''," the King's men"] (note: there is a space at the beginning of " the King's men")
  • Turn "couldn't put Humpty together again" into 'again' (using one line)

2. Pile of basic split, join, and replacement drills:

  • Turn ' Terry RichPrice Matt\n' into 'Chris\tAdamRoberts\tPeter'
  • Turn 'Matt,Nate,Aaron' into 'MATT\tNATE\tAARON\t'

3. Using the names of all seven instructors and TA's (Nate, Matt, Aaron, Peter, Adam, Aisha, Chris), write each possible pair of names to a file, separated by a line of hyphens (i.e. '-----------------')

4. Reopen the last output file, and read in the file, then write the lines back out (to a new file) in reverse order, in all capital letters.

5. Parse a FASTA file

Copy the text below into a text file and save it as seq.FASTA

Write a script called that will open this file, read the lines, and store the data as a dictionary keyed by gene with values of the sequence. Make sure the sequences are contiguous (i.e. contain no endline characters), and make sure to remove the > from the names of the genes.


Solutions

1. Pile of basic split drills:

#!/usr/bin/env python

2. Pile of basic split, join, and replacement drills:

#!/usr/bin/env python

3. Using the names of all seven instructors and TA's (Nate, Matt, Aaron, Peter, Adam, Aisha, Chris), write each possible pair of names to a file, separated by a line of hyphens (i.e. '-----------------')

#!/usr/bin/env python
teachers = "Andrew Nate Phil Peter Matt Terry"
splitTeach = teachers.split()
fh = open('TeachersPairwise', 'a') # 'a' flag means "append"
for teach1Ind, teach1 in enumerate(splitTeach):
    for teach2Ind, teach2 in enumerate(splitTeach):
        if not teach1Ind == teach2Ind:
            text = teach2 + '-----------------' + teach1 + '\n'
            # actually put the pair in the file
            fh.write(text)
fh.close()

4. Reopen the last output file, and read in the file, then write the lines back out (to a new file) in reverse order, in all capital letters.

#!/usr/bin/env python
fh = open('TeachersPairwise', 'r') # 'r' flag means "read only"
fileTextAsList = fh.readlines()
fh.close()
# flip the list so the last line comes first
fileTextAsList.reverse()
fh_output = open('TeachersPairwiseReversed', 'w')
for line in fileTextAsList:
    lineUPPER = line.upper()
    fh_output.write(lineUPPER)
fh_output.close()

5. Parse a FASTA file

#!/usr/bin/env python
current_gene = ""   # Start with an empty string, just in case
genes = {}          # Make an empty dictionary of genes
fh = open('seq.FASTA', 'r')
for line in fh.readlines():
    line = line.strip()  # Clear out leading/trailing whitespace
    if len(line) > 0 and line[0] == ">":   # This one is a new gene
        current_gene = line[1:]
        genes[current_gene] = ""
    else:                # Add onto the current gene
        genes[current_gene] += line
print genes

6. Change the headers in a fasta file.

Note that the fasta headers for the yeast genome look something like this: '>ref|NC_001133| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=I]\n'. While this technically conforms to convention, the first part of the header is not very readable and conflicts with several other programs we will be using later. Your goal is to read the yeast genome file and change each header line to look like this: '>chrI [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [ref=NC_001133]\n'

infile = open('yeastgenome.fa', 'r')
outfile = open('yeastgenome_clean.fa', 'w')
for line in infile:
    if line.startswith('>'):
        chrom = line.partition('chromosome=')[2]
        # '>ref|NC_001133| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=I]\n'
        # becomes
        # 'I]\n'
        chrom = chrom.split(']')[0]
        # 'I]\n'
        # becomes
        # 'I'
        ref = line.split()[0]
        # ref is now '>ref|NC_001133|'
        ref = ref.split('|')[1]
        # ref is now 'NC_001133'
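        # one way to keep the fields between the reference and the chromosome
        middle = line.partition('| ')[2].rpartition(' [chromosome=')[0]
        # middle is now '[org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic]'
        outfile.write('>chr%s %s [ref=%s]\n' % (chrom, middle, ref))
    else:
        # sequence lines are copied over unchanged
        outfile.write(line)
infile.close()
outfile.close()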