Basic Data Structures


Topics:

  • Lists
  • Tuples
  • Dictionaries

Introduction:

In order to process many data, we have to keep track of many data. Just like books are stored on shelves in libraries, we have to store our data in an orderly fashion such that it can be access and manipulated systematically. Instead of shelves, we have many structures to choose from. This morning we will introduce the two you will use the most. The list is a simple list of items, dictionaries are more like look-up tables with words and corresponding definitions.

Lists


A list provides a way of storing an ordered series of values in a structure referenced by a single variable.
#!/usr/bin/env python
 
#initialize a couple of lists
li = ['duck', 'season']
li2 = ['wabbit', 'season']
li3 = ['...Fire']
 
#print lists
print li
print li2
print li3
print
 
#access single item in list
print li[0], li2[0], li3[0]
print li[:]
print

$ ./data_structures_lists.py
['duck', 'season']
['wabbit', 'season']
['...Fire']

duck wabbit ...Fire

Here, we have put something in a variable, we ask for it back and we get what we put in. Because we enclosed a series of values separated by commas in a pair of square brackets, we actually have what is called a list.

Lists are special and, as such, have lots of really useful features. You can access individual items in a list or entire sections. You can also manipulate your list using built-in methods (more on what this means later this week). For example, we can add to the list by using the append() method.

Adding to a list

#add single item to list
li.append(li2[0])
print li
li.append(li2[1])
print li
print
 
#combine two lists by extension
li.extend(li2)
print li2
print li
print
 
#combine two list by concatenation
li4 = li + li3
print li
print li3
print li4
print
 
#print out slices of list
print li4
print li4[2:]
print li4[:4]
print li4[1:3]
print li4[-1:]
print li4[-2:]
print li4[:-1]
print li4[1:4:2]
print li4[::-1]
 
print

$ ./data_structures_lists.py
['duck', 'season', 'wabbit']
['duck', 'season', 'wabbit', 'season']

['wabbit', 'season']
['duck', 'season', 'wabbit', 'season', 'wabbit', 'season']

['duck', 'season', 'wabbit', 'season', 'wabbit', 'season']
['...Fire']
['duck', 'season', 'wabbit', 'season', 'wabbit', 'season', '...Fire']

['duck', 'season', 'wabbit', 'season', 'wabbit', 'season', '...Fire']
['wabbit', 'season', 'wabbit', 'season', '...Fire']
['duck', 'season', 'wabbit', 'season']
['season', 'wabbit']
...Fire

This script illustrates several of the most common ways to grow lists:

1) The list method append(), which adds a single item to the end of a list

2) The list method extend(), which adds a whole list to the end of the list you ask to extend itself

3) The list concatenation operator, which stitches two things together to make a new whole, without changing either original list.

4) In class, we also talked about ways to add to a list, using the insert() method, which takes two arguments (in order): an index to insert at, and the object to insert. You can also insert by slicing, something like this: li[2:2] = [var_to_insert]. The first one is somewhat clearer, so might be preferred unless you have very particular reasons for doing the other one.

Shrinking a list

#sick of old quote
#initialize new list
li = ['Why', 'do', 'superheroes', 'wear', 'spandex?']
print li
print
 
#remove a particular item from the list
del li[3]
print li
print
 
#remove the last item from the list and save it
word = li.pop()
print li
print word
print
 
# remove the last item from a list by slicing
li = li[:-1]
$ ./data_structures_lists.py
['Why', 'do', 'superheroes', 'wear', 'spandex?']

['Why', 'do', 'superheroes', 'spandex?']

['Why', 'do', 'superheroes']
spandex?

Here, we are removing things from lists in two ways:

1) The built-in function del removes a particular item from the list.

2) The list method pop() removes the last item from the list and returns the variable.

Changing lists in place


In addition to adding things to lists and taking them away again, we can also change lists in place.
#and now for some numbers
#create list of zeros
noLi = 4*[0]
# This is equivalent to:
#  noLi = [0] + [0] + [0] + [0]
# Once you understand how the addition operator works, the multiplication operator
#  works in an analogous way.
#  3*4 is the same as adding 3 to itself 4 times, so
#  [3] * 4 is equivalent to adding the list [3] to itself 4 times.
print noLi
print
 
#modify items in list
mice_brain = 10
rat_brain = 20
human_brain = 500
noLi[2] = mice_brain
noLi[1] = rat_brain
noLi[3] = human_brain
print noLi
print
 
#sort list
noLi.sort()
print noLi
 
#reverse order
noLi.reverse()
print noLi
print
 
 
# sorting into a new list
another_noLi = sorted(noLi)
print another_noLi
 
 
#sort string list
print li
li.sort()
print li
li.reverse()
print li

$ ./data_structures_lists.py
[0, 0, 0, 0]

[0, 500, 10, 20]
[0, 10, 20, 500]
[500, 20, 10, 0]

['Why', 'do', 'superheroes']
['superheroes', 'do', 'Why']

Now, we've modified the list in a couple of important ways:

1) Overwritten items in the list using slices.

2) Sorted the list using the method sort() and the function sorted(). sort() works on the list in place, while sorted() returns a new list.

3) Reversed the order of the list using the method reverse().

These are, by the way, all demonstrations of the mutability of lists: unlike when you change a string or a number, you don't have to perform an operation and store the result because these operations change the list in place.

Characterizing lists

#figure out how long a list is
print li
print len(li)
print noLi
print len(noLi)
print 'Max =', max(noLi)
print 'Min =', min(noLi)
print
 
#iterate over list
for x in li: print x
print
 
#find where something is stored
idx = li.index('Why')
print idx

$ ./data_structures_lists.py
['superheroes', 'do', 'Why']
3
[500, 20, 10, 0]
4
Max = 500
Min = 0

superheroes
do
Why

2

Here, we have started to characterize our lists.

1) The built-in functions len(), max() and min() tell us how many items are in the list and the maximum and minimum values in the list.

2) The list method index() tells us where an item is in the list.

3) We can iterate over each item in the list and print it using the syntax for x in [ ]:


Tuples


Tuples were used in yesterday's exercise and now that we are familiar with lists, things will make more sense. A tuple is essentially a list that you can not change. You can index, slice them and add them together to make new tuples but not use sort(), reverse(), delete or remove items from them. If you ever have a tuple that you want to change, you have to turn it into a list

trpA = ('protein', 'TIM Barrel')
type(trpA)
 
for i in trpA: print i
 
trpA[0] = 'RNA'
 
trpA = list(trpA)
trpA[0] = 'RNA'


Dictionaries


You can imagine a dictionary as just that -- a dictionary. To retrieve information out of it, you look up a word, called a key, and you find information associated with that word, called the key's value.
#!/usr/bin/env python
 
#initialize a couple of dictionaries
names = {'aaron':'hardin', 'nate':'krefman', 'peter':'combs'}
drinks = {'nate':'coffee', 'matt':'caffeine', 'peter':'soda'}
wines = {'red':'cabernet','white':'pinot grigio', \
'sparkling':'blanc de noirs', 'sticky':'muscato'}
 
#print dictionaries
print names
print drinks
print wines
print
 
#find a particular value
print names['aaron']
print drinks['peter']
print wines['sparkling']
print

$ ./data_structures_dictionaries.py
{'aaron': 'hardin', 'nate': 'krefman', 'peter': 'combs'}
{'matt': 'caffeine', 'peter': 'soda', 'nate': 'coffee'}
{'white': 'pinot grigio', 'sparkling': 'blanc de noirs', 'red': 'cabernet', 'sticky': 'muscato'}

hardin
soda
blanc de noirs

To create a dictionary, you write each key-value pair as key:value, divide the pairs with commas, and surround the entire structure with curly braces.

Just like lists, dictionaries are special and have their own useful features.

Adding to a dictionary


#add single item to dictionary
drinks['chris'] = 'water'
print drinks
print
 
#combine two dictionaries
print names
print drinks
names.update(drinks)
print names
print

$ ./data_structures_dictionaries.py
{'chris': 'water', 'matt': 'caffeine', 'peter': 'soda', 'nate': 'coffee'}

{'aaron': 'hardin', 'peter': 'combs', 'nate': 'krefman'}
{'chris': 'water', 'matt': 'caffeine', 'peter': 'soda', 'nate': 'coffee'}
{'aaron': 'hardin', 'chris': 'water', 'peter': 'soda', 'matt': 'caffeine', 'nate': 'coffee'}


We are adding to the dictionaries in two ways:

1) Adding a single new item to the dictionary by defining a new key:value pair

2) Merging two dictionaries together using the update() method. How is this different from the list method append()?

Shrinking a dictionary


#remove a particular item from a dictionary
del names['matt']
print names
print

$ ./data_structures_dictionaries.py
{'aaron': 'hardin', 'chris': 'water', 'peter': 'soda', 'nate': 'coffee'}

We have now deleted a key:value pair from the dictionary. Why can't we use something like the list method pop()?

Changing dictionaries in place


#modify items in dictionary
names['nate'] = 'krefman'
print names

$ ./data_structures_dictionaries.py
{'aaron': 'hardin', 'chris': 'water', 'peter': 'soda', 'nate': 'krefman'}

We have changed a values in place by modifying the key:value pair. Why won't the list method sort() work?

Characterizing dictionaries


#identify components of the list
keys = wines.keys()
values = wines.values()
print keys
print values
print
 
#iterate over keys
for x in keys:
    print 'The category is', x, 'and the varietal is', wines[x]
print
 
#sort keys and iterate over keys
keys.sort()
for x in keys:
    print 'The category is', x, 'and the varietal is', wines[x]
print
 
#find if something is stored
if 'red' in wines: print wines['red']
print
 

$ ./data_structures_dictionaries.py
['white', 'sparkling', 'red', 'sticky']
['pinot grigio', 'blanc de noirs', 'cabernet', 'muscato']

The category is white and the varietal is pinot grigio
The category is sparkling and the varietal is blanc de noirs
The category is red and the varietal is cabernet
The category is sticky and the varietal is muscato

The category is red and the varietal is cabernet
The category is sparkling and the varietal is blanc de noirs
The category is sticky and the varietal is muscato
The category is white and the varietal is pinot grigio

cabernet

Here, we have started to characterize our dictionaries.

1) The dictionary methods keys() and values() return lists contain keys or values. These lists can be stored and acted on as lists.

2) We can iterate over the keys and print the values using the syntax for x in [ ]:

3) We can quickly find out if a particular key already exists. Note that each key must be unique, but multiple keys can have the same value.

Summary So Far...


Lists are:

1) ordered collections of arbitrary variables.
2) accessible by slicing.
3) can be grown or shrunk in place.
4) mutable (can be changed in place).
5) defined with list = [X,Y]

Dictionaries are:

1) unordered collection of arbitrary variables.
2) accessible by keys.
3) can be grown or shrunk in place.
4) mutable.
5) defined with dict = {X:Y}

You will use dictionaries and lists almost exclusively in your coding. However, there are two remaining data structures that you should know about to make your life a little easier:

Sets are unordered and unique bags of variables

Peter will be talking about these data structures this afternoon. He will also teach you how to use combinations of data structures to create infinitely fancy ways of storing data :O)

Questions?



Exercises


1) Get comfortable with new data structures (adapted from Learning Python)
Create a new file. Type in the commands below. Execute the script. Add comments to your script describing what is happening in each line.

L = [1,2,3] + [4,5,6]
print L, L[:], L[:0], L[-2], L[-2:]
L.reverse()
print L
L.sort()
print L
idx = L.index(4)
print idx

print {'a':1, 'b':2}['b']
D = {'x':1, 'y':2, 'z':3}
D['w'] = 0
print D['x'] + D['w']
print D.keys(), D.values(), D.has_key('z')

2) Getting to know your TAs (adapted from Learning Python)
Introduce yourselves to your TAs and find out their names. Create a new file. Define a new list containing the names of the TAs and the names of your teachers. Add appropriate print statements to the file to answer the following:
a. What happens when you try to index out of bounds (eg. L[10])?
b. What about slicing out of bounds (eg. L[-100:100])?
c. What happens when you try to extract a sequence in reverse--with the lower bound greater than the hight bound (e.g. L[3:1])? Hint: try assigning to this slice (L[3:1] = ['?']) and see where the value is put. Do you think this may be the same phenomenon you saw when slicing out of bounds?
d. Add your own name to the middle of the list of TAs by indexing, then add your neighbor's name to the list of TAs using the insert() method that we discussed in class. See what happens if you use indexing to add the string using the index [3:5] or [5:2]. (HINT: The syntax for the insert() method is "listname.insert(index, objectname)")

3) Getting to know your neighbors (adapted from Learning Python)
Introduce yourself to four of your neighbors and find out what lab they work in. Create a new file. Define a new dictionary containing the names of your neighbors as keys and their labs as values. Add appropriate print statements to the file to answer the following:
a. What happens if you try to index a non-existent key (eg print D['Terry'])?
b. What happens if you try to assign to a non-existent key (eg D['Terry'] = 'Alber')?
c. How does this compare to out-of-bound assignments for lists?

4) Pulling it all together
Your boss has asked you to do a small bioinformatics project on LeuT (pdb code 2Q6H), which is the neurotransmitter responsible for transporting antidepressants. To help you out, I am providing a script that will read in a file and save the protein sequence to a list called protSeq. I have commented the code, but there are many things in here you haven't learned yet.
HINT: Protein Databank (PDB) structure files are stored at http://www.pdb.org/. Use the pdb code to find the structure. The structure file can be downloaded from Download Files >> PDB File (text). The structure file must be in the same directory as the script for the script to run properly.

So, copy the code below into a file, and then you can access the list of all amino acids stored as the list variable protSeq. You can start by printing this variable and proving to yourself that the list contains the information you think it does. Then add to the code to answer the following questions:

a. How many total amino acids are in the protein?
b. Print out total count of each amino acid in alphabetical order.


#!/usr/bin/env python
 
#initialize list to store sequence
protSeq = []
#open pdb file
f1 = open('2Q6H.pdb', 'r')
#loop over lines in file
for next in f1:
    #identify lines that contain sequences
    if next[:6] == 'SEQRES':
        #strip away white space and
        #convert line into list
        line = next.strip().split()
        #delete descriptor information
        #at beginning of each line
        del line[:4]
        #loop over amino acids in line
        for aa in line:
            #add to sequence list
            protSeq.append(aa)
#close file
f1.close()
 


ANSWER:

There are 519 amino acids in the protein
ALA = 54
ARG = 21
ASN = 14
ASP = 12
GLN = 6
GLU = 24
GLY = 45
HIS = 6
ILE = 54
LEU = 61
LYS = 19
MET = 13
PHE = 50
PRO = 25
SER = 18
THR = 27
TRP = 16
TYR = 17
VAL = 37


EXTRA CREDIT: Look up the documentation for set(). We'll be learning more about using set() this afternoon but see if you can rewrite part (b) using the new command based on your readings.
#!/usr/bin/env python
 
 
 

Solutions


1) Get comfortable with new data structures

#!/usr/bin/env python
 
L = [1,2,3] + [4,5,6] # store two lists as one list using concatenation
print L, L[:], L[:0], L[-2], L[-2:] # print the whole list, then the whole list, then the whole list up to the first digit (an empty list), then print the 2nd one in the list counting backwards from the last, then print starting from the 2nd one in the list counting backwards until the last one counting forwards (whew!)
L.reverse() # reverse the order of the list
print L
L.sort() # sort the list (function will sort it from low to high)
print L
print L.index(1) # tell me where the number 4 is in the list
 
print {'a':1, 'b':2}['b'] # print the value of b in the dictionary containing the key 'a' with value 1 and 'b' with value 2; note that this dictionary will not be stored or accessible later, but this could be done by defining a variable and setting it equal to {'a':1, 'b':2}, for example see below!
D = {'x':1, 'y':2, 'z':3} # initialize a dictionary containing 3 keys, each with one value
D['w'] = 0 # add the key 'w' to the dictionary with the value 0
print D['x'] + D['w'] # print the sum of the values for the keys 'x' and 'w' (which are 1 and 0, respectively --> the result should be 1!)
print D.keys(), D.values(), D.has_key('z') # print as a list all of the keys in the dictionary called D, then all the values, then tell me (True or False) whether the dictionary D includes a key called 'z'
 

2) Getting to know your TAs

#!/usr/bin/env python
 
print # Sometimes it's more aesthetically pleasing to start with a print statement to separate the program output from the Terminal prompt
listTAs = ['Terry', 'Matt', 'Phil', 'Andrew', 'Peter', 'Nate']
print listTAs
print
 
# What happens if you index out of bounds?
print listTAs[3] # this would be in bounds
# If you index out of bounds, it stops the script immediately and returns an error ("IndexError: list index out of range")
# print listTAs[10] # this print statement would be out of bounds if left uncommented
print
 
# What happens if you slice a list out of bounds?
print listTAs[3:5]  # This would be in bounds
print listTAs[3:14] # If you slice part way out of bounds, it will return a result for the part that is in bounds, as results with the print statement below
print listTAs[31:34] # If you slice a list completely out of bounds, it will return an empty list
 
# What happens if you slice a list in reverse?
print listTAs[5:3] # If you try to slice a list in reverse, it returns an empty list
print
 
# Add your own name to the middle of the list of TAs by indexing, then add your neighbor's name to the list of TAs using the insert() method that we discussed in class. See what happens if you use indexing to add the string using the index [3:5] or [5:2]. (HINT: The syntax for the insert() method is "listname.insert(index, objectname)")
# This part of the exercise goes back to the example we discussed in class about how to add to a list by indexing versus using the insert() method (refer to "Adding to a list" above).
print listTAs
listTAs[2:2] = ['Yourname'] # Add your name at index 2
print listTAs
listTAs.insert(4, 'Yourneighborsname') # Add your neighbor at index 4
print listTAs
print
listTAs[3:5] = ['what does this do?'] # We lost Phil and Yourneighborsname, because this replaces indices 3 and 4 (i.e. 3 through 5, not including 5) with 'what does this do?'
listTAs[5:2] = ['uh...?'] # This just adds 'uh...?' to index 5 since you can only index forward
print listTAs
 
print # Sometimes it's more aesthetically pleasing to end with a print statement to separate the program output from the next Terminal prompt
 

3) Getting to know your neighbors

#!/usr/bin/env python
 
print
#Initialize a dictionary
thirdyrs = {'Brock' :'Nicole King', 'Robyn':'Andy Martin', 'Margaux':'Lin He', 'Nate':'David Drubin'}
print thirdyrs
print
 
# Now lets print out dictionary
for PhDcandidate in thirdyrs:
    print "%s is indentured to %s." % (PhDcandidate, thirdyrs[PhDcandidate])
print
 
 
# What happens if you try to index a non-existent key?
print thirdyrs['Brock'] # This key exists, so it will return a legit value
# print thirdyrs['Terry'] # 'Terry' is a non-existent key, so if left uncommented, the Python interpreter would return a key error and stop the program
print
 
# What happens if you try to assign to a non-existent key?
thirdyrs['Nate'] = 'Georjana Barnes'
print thirdyrs # If you assign a value to a key that already exists, the value is changed in place
thirdyrs['Terry'] = 'Tom Alber'
print thirdyrs # If you assign a value to a non-existent key, Python writes the key and the value you assign it into the dictionary
print
# How does this compare to out-of-bound assignments for lists?
'''This is different from out-of-bound assignments for lists, because it doesn\'t return an error. Instead, it updates the dictionary.'''
 

4) Pulling it all together


#!/usr/bin/env python
 
# To begin with, you should have a file called '2Q6H.pdb.gz' in your PWD that you downloaded from the PDB. This is a zipped file, which you must unpack to use.  On a Mac, you probably just have to double click the file to unpack it.
 
### The following block of text is all from http://intro-prog-bioinfo-2009.wikispaces.com/Session2.1
# Initialize list to store sequence:
protSeq = [ ] # This could just have easily not had a space in the brackets (like '[]').  It's up to you!
f1 = open('2Q6H.pdb', 'r') # Open the pdb file. This opens the file as read-only and allows it to be parsed by manipulating the variable 'f1'.
for next in f1: # # Loop over each line in the file. Note that the variable 'next' could have been anything ('Johnny', 'x', 'lineX', etc) -- the use of the word 'next' was arbitrary.
    if next[:6] == 'SEQRES': # Identify lines that contain sequences, exploiting the fact that they start with the phrase 'SEQRES'.
        line = next.strip().split() # Strip away white space and convert the line into a list.
        del line[:4] # Delete the descriptor information at the beginning of each line.
        for aa in line: # Loop over the amino acids in the line.
            protSeq.append(aa) # Add each amino acid to the list called protSeq.
f1.close( ) # Close the file. Again the space is optional (i.e. 'f1.close()' works too)
 
print
print '''We\'ve now read and dissected the file so it\'s in a convenient, digestable form.  The sequence follows:'''  # Occasional print statements can help you figure out where in your script something went wrong, if something goes wrong. If you read this statement, you'll know you made it through the script above.
print
print protSeq
print
 
print "The length of the protein is", len(protSeq)
 
aaDict = {}
 
for x in protSeq: aaDict[x] = 0
for x in protSeq: aaDict[x] = aaDict[x] + 1
keys = aaDict.keys()
keys.sort()
for x in keys: print x, '=', aaDict[x]
 

Extra credit with set (slightly more efficient):

#!/usr/bin/env python
 
#!/usr/bin/env python
 
# To begin with, you should have a file called '2Q6H.pdb.gz' in your PWD that you downloaded from the PDB. This is a zipped file, which you must unpack to use.  On a Mac, you probably just have to double click the file to unpack it.
 
# Initialize list to store sequence:
protSeq = [ ] # This could just have easily not had a space in the brackets (like '[]').  It's up to you!
f1 = open('2Q6H.pdb', 'r') # Open the pdb file. This opens the file as read-only and allows it to be parsed by manipulating the variable 'f1'.
for next in f1: # # Loop over each line in the file. Note that the variable 'next' could have been anything ('Johnny', 'x', 'lineX', etc) -- the use of the word 'next' was arbitrary.
    if next[:6] == 'SEQRES': # Identify lines that contain sequences, exploiting the fact that they start with the phrase 'SEQRES'.
        line = next.strip().split() # Strip away white space and convert the line into a list.
        del line[:4] # Delete the descriptor information at the beginning of each line.
        for aa in line: # Loop over the amino acids in the line.
            protSeq.append(aa) # Add each amino acid to the list called protSeq.
f1.close( ) # Close the file. Again the space is optional (i.e. 'f1.close()' works too)
 
print
print '''We\'ve now read and dissected the file so it\'s in a convenient, digestable form.  The sequence follows:'''  # Occasional print statements can help you figure out where in your script something went wrong, if something goes wrong. If you read this statement, you'll know you made it through the script above.
print
print protSeq
print
 
# How many total amino acids are in the protein?
print 'There are', len(protSeq), 'amino acids in the protein.'  # This part is almost too easy!
print
 
# Print out the total count of each amino acid in alphabetical order.
# Use Set( ) to turn the list of amino acids in the protein into a set of amino acids. Since a set cannot contain duplicate entries, this just produces a nonredundant set of amino acids. Note that this set will only be complete if all the amino acids are represented in the protein!
setOfAAs = Set(protSeq)
print setOfAAs # So we have a set containing amino acids represented in the protein
print
print 'The protein contains %s unique amino acids.' % (len(setOfAAs)) # And it turns out one is missing! Can you figure out which one?
print
 
Dict = { } # Initialize a dictionary
for uniqueAminoAcid in setOfAAs: Dict[uniqueAminoAcid] = 0 # Write each amino acid in setOfAAs into this dictionary as a key and set its initial value to 0.
for aminoAcid in protSeq: Dict[aminoAcid] = Dict[aminoAcid]+1 # Go through the protein's amino acid sequence and overwrite the value assigned to that amino acid in the dictionary (remember each amino acid is stored as a key) by adding one to its value. This effective stores a count of each time the amino acid appears in the protein.
 
listOfAAs = Dict.keys() # Make a list using the keys in the dictionary.
listOfAAs.sort() # Sort the list alphabetically.
print "\t", 'aa  #' # Print headers for columns that you will print below. The tab isn't necessary, but the result looks prettier.
for alphabeticalAminoAcid in listOfAAs: print "\t", alphabeticalAminoAcid, Dict[alphabeticalAminoAcid] # Print the amino acids along with their values in the dictionary. I included the tab for consistency.
print