Fancy Data Structures


Topics:

  • Tuples
  • Sets
  • Lists of lists
  • Dictionaries of dictionaries
  • Lists of dictionaries, vice-versa
  • Even more complicated nested structures

Introduction


Let's say that you have a dictionary. In fact, let's say that you are a collector of English dictionaries, and you have all of them, from Cawdrey's 1604 A Table Alphabeticall all the way to a hard copy of the ever-contemporary Wiktionary project of Wikipedia. Finally, let us also suppose that you have some interest in the evolution of words over time. After all, why did you go out and buy all those dictionaries in the first place?

Given the number of dictionaries and the number of words, it seems only natural that you would need some knowledge of programming to tackle this project. However, it is here that we run into a problem: how should we best store this immense and complex amount of data? We have many dictionaries, each with many words; could we use what we've learned of lists and strings to store it? We certainly could have a list of strings, but it's not clear how we would associate them with their definitions.... until now!

dictionaries.py
# define dictionaries
wiktionary = {}
 
 
aTableAlphabeticall = {}
 
# insert words, starting with, naturally, 'ventricle'
wiktionary['ventricle'] = '''1. (anatomy) One of two lower chambers of the heart.\
2. (anatomy) One of four cavities in the brain.'''
 
 
aTableAlphabeticall['ventricle'] = 'the stomacke which receiues the meate'

The trouble is, you've got thousands of dictionaries, and you're trying to make sense of them in relation to each other. This way, you'd have thousands upon thousands of lines of code just devoted to defining dictionaries, and afterwards, you've still got to refer to each of them individually in order to do any kind of analysis. This doesn't sound right. Wouldn't it be great to put all of those dictionaries into some kind of big list, perhaps chronologically arranged, so that you could tackle the data entry and analysis systematically?

That's right, it would be great.

Similarly, we very often run into non-trivial data in biology. For instance, you might want a list of binding site positions for each of ten different transcription factors, or we might want to describe the single nucleotide polymorphisms in each of a thousand people. We often want to store several kinds of information about each position in a protein sequence. In order to accomplish these tasks and many others, we need fancier data structures: lists of lists, dictionaries of dictionaries, lists of dictionaries, even lists of dictionaries of dictionaries of lists of dictionaries of lists. Although we try to stay away from the latter.

So, this afternoon will be devoted to teaching you how to create such data structures.

A Bit More on Tuples


As we saw this morning, a tuple is like a frozen list: once you've made a tuple, it is immutable, and you can't change it (although you can make a new one with similar elements. Since the keys of a dictionary need to be immutable, this is useful if you want to key a dictionary based on two things at once. The syntax for a tuple is much like a list, except instead of [square brackets], it's conventional to use (parentheses).

mytuple = (1,2)
# Actually, the parentheses are optional (but helpful for clarity)
mytuple = 1,2
 
mylist = [1,2]
 
mydict = {}
mydict[mytuple] = 'hello'
print mydict
mydict[mylist] = 'world'
print mydict
 

Also, if your tuple only has one item, you need to use a comma to make it clear that the tuple is a tuple and not just a value in parentheses:

tuple_A = ("Is this a tuple?")
tuple_B = ("What about this?", )
 
print tuple_A, type(tuple_A)
print tuple_B, type(tuple_B)

Tuples actually underlie the syntax for the in-place swap:
a = 1
b = 2
a,b = b,a
print a, b
 
#Is equivalent to saying:
a = 1
b = 2
mytuple = (b,a)
b,a = mytuple
 
We won't be using tuples extremely heavily, but be aware that if you ever have a function with multiple simultaneous return values, under the hood, it's actually combining them into a tuple.

Informative Interlude: Mutability, lists, and references


Let's take a moment to discuss in more detail a topic referenced in the morning, mutability. When we say that something is mutable, we mean that we can change the value without destroying the original data structure. Maybe it's more helpful to first discuss things that are immutable, such as strings and integers. In the case of an integer variable, if we want to change the value, it doesn't matter if we destroy the original value, as there's only one datum to change. That is, if x used to equal 5, and now we want it to equal 10, who cares what it used to equal? Similarly with a string that used to be assigned the value "nate", but is now assigned "peter".

On the other hand, lists and dictionaries can get pretty big, and it would be really inefficient in most cases to recreate the whole thing if you only want to change part of it. As it turns out, Python does not delete the original list and recreate another list with the update value, it merely changes the updated value in place. We'll soon see the true power of this difference when we use various functions to manipulate complicated data structures like those described in the rest of this lecture. Instead of having to make resource-consuming copies of our data, we can change the mutable variables directly. As we've already seen, dictionaries are also mutable, but as described below, tuples and strings are not.

Let's look some short code examples to compare the mutability of strings and lists.

#!/usr/bin/python
# mutables.py
 
# Let's make a list
a = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
a[0] = 'X'
print a
 
b = a
print b
 
b[1] = 'Y'
print b, "is the value of b"
print a, "is the value of a"

$ mutables.py
['X', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
['X', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
['X', 'Y', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] is the value of b
['X', 'Y', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] is the value of a


At this point, we're going to dive a little bit into how Python works internally. Imagine that you have a giant warehouse full of stuff, and in the front office of that warehouse is a catalog telling the names of all the stuff, and where each thing is. When you make a new variable, Python does two things: it puts that new thing into the warehouse, and it updates the catalog with a new entry.

When you say "b = a", Python is a bit lazy, and since the thing that corresponds to a already exists, it just puts a new entry into the catalog with the name b and the location of the object. In a very important sense, the thing in the catalog named b is the thing in the catalog named a (and there exists, in fact, a keyword is that will confirm this, although we won't use it at all in this course). It's also possible to have two things in the catalog that look and act the same, but are different.

When a and b are mutable things (like lists, or dictionaries), then changing "a" also changes "b" because they are the same thing. If you were working with an immutable object, like a string, then the only way to "change" a would be to make a new object and then update the catalog to point to that new object.

a = 'abcdefg'
# Makes a string object in the warehouse and an entry called "a" in the catalog that tells you where to find that object
 
b = a   # Makes a duplicate catalog entry
 
a = 'gfedcba'
# Makes a new string object, and updates the entry labelled "a" in the catalog
 
print a
print b
# The entry "b" in the catalog still tells you where to find the original string

As a general rule, whenever you have what Python calls a "container" (like a list, or a dictionary), it doesn't actually contain the object itself, it just contains another mini-catalog that tells you where to find the relevant object.

For reading on this subject, feel free to check out the Python documentation on the Data model and the Execution model. It can be a little hard to parse, though, so if you don't get it, don't worry too much. I've read it at least a half dozen times now and I'm not sure I get all of it. There is a reason this is in an Informative Interlude.


Sets


A set is another nifty data structure. Like a set in mathematics, it has a bunch of elements with no repeats. To build a set, you pass in a list, and it will automatically remove duplicates

myset = set([1,1,2,3,5,8])
# You can also use the new syntax:
myset = {1,1,2,3,5,8}
 
print myset
 
# or you can build it up as you go along:
myset = set()
myset.add(1)
myset.add(1)
myset.add(2)
myset.add(3)
myset.add(5)
myset.add(8)

Sets make doing Venn diagrams of your data really easy:
UCs = set(["UC Berkeley", "UCLA", "UCSF", "UCSB", "UC Riverside", "UC Davis", "UCSD", "UC Santa Cruz", "UC Irvine", "UC Merced"])
bay_area_schools = set(["UC Berkeley", "Stanford", "UCSF", "San Jose State", "Santa Clara University", "University of San Francisco" ])
 
bay_area_UCs = UCs.intersection(bay_area_schools)
print "Bay Area UCs", bay_area_UCs
non_bay_area_UCs = UCs.difference(bay_area_schools)
print "UCs in less awesome places", non_bay_area_UCs
bay_area_non_UCs = bay_area_schools.difference(UCs)
print "Bay area schools that aren't UCs", bay_area_non_UCs
 
california_schools = UCs.union(bay_area_schools)
 

There are also ways to remove things from a set. Let's say that due to rampant grade inflation and falling academic standards, that school south of here loses it's accreditation:
bay_area_schools.remove("Stanford")
# or
bay_area_schools.discard("Stanford")
The difference between remove and discard is that discard doesn't care whether or not the item is already in the set or not, whereas if the item isn't in the set, remove will give an error.

There are other **set** functions, but this covers most of what I use them for, and should put you in good stead as well.

Lists of Lists


Let's create a few lists. Being self-centered, let's make the list of our lives:

#!/usr/bin/env python
 
# make a couple lists
coffee = ['yalis','strada','brewedawakening']
lab = ['stanley','barker','LSA','Koshland','VLSB']

As it turns out, you can pretty much put anything into a list that you like, including other lists. Accessing and changing individual elements has a fairly straightforward syntax:

# make a list of lists!
importantPlaces = [coffee,lab]
 
# another way to write it
importantPlaces = [['yalis','strada','brewedawakening'],
                   ['stanley','barker','LSA', 'Koshland','VLSB']]
# Note that you're allowed to split long lines if it's inside of a parenthesis, bracket, or curly brace
 
aList = importantPlaces[0]
print aList
aPlace = aList[0]
print aPlace
# or
aPlace = importantPlaces[0][0]  # aPlace is 'yalis'
importantPlaces[0][0] = 'peets'
 
anotherPlace = importantPlaces[0][0]

All of the list operations from this morning still apply:
# make a list of lists!
home = ['rockridge','the mission','lake merritt','gourmet ghetto']
importantPlaces.append(home)  # important places is now [coffee,lab,home]
importantPlaces[2].append('bushrod park')  # home now includes 'bushrod park'
 
print importantPlaces[2]
print home

Take note of that very last step: although you made a change while naming importantPlaces, you also changed home. That's because they're one and the same: importantPlaces[2] is actually a location in the warehouse, and Python knows when it sees a location to pretend like that thing is inside the list. How would you make importantPlaces carry a copy of home instead of home itself?

# make a list of lists!
importantPlaces.append([])  # important places is now [coffee,lab,[]]
importantPlaces[2].append(home[0])
importantPlaces[2].append(home[1])
...
 
# or, using a shorthand:
importantPlaces.append(home[:])
importantPlaces[2][3] = 'alcatel' # replace 'gourmet ghetto' with name of landmark liquor store
print importantPlaces
print home

This may seem like a subtlety now, and it some ways it is, but keep it in mind when you start coding and start seeing errors. Often a copy is what you want.

Dictionaries of dictionaries


Dictionaries of dictionaries work the same way. Let's return to the example used in the introduction:
#!/usr/bin/env python
 
# we keep our definitions in a dictionary
definitions = {}
 
# and, as above, we have several dictionaries in this dictionary
definitions['wiktionary'] = {}
definitions['aTableAlphabeticall'] = {}
 
# and each dictionary is full of definitions
definitions['aTableAlphabeticall']['statue'] = 'an image of wood, or any other matter'
definitions['aTableAlphabeticall']['vapor'] = 'moisture, ayre, hote breath, or reaking'


Lists of dictionaries, dictionaries of lists


Can we mix data structures? Of course! The main thing to keep in mind (besides what your data structure actually looks like) is that if you're building something up from scratch, you need to match the square brackets and curly braces:
a = [{},{}, ]  # a is a list of empty dictionaries
b = {'l1': [], 'l2': []} # b is a dictionary of empty lists
c = [{]}  # c is a Syntax Error!
d = {[}] # so is D!
 

Finally, everything that we've learned so far can be elaborated to even more complex data structures. After, all, there are often multiple definitions for a word...
# let's make the ventricle entry a little more sensible
listOfDictionaries[1]['ventricle'] = []
listOfDictionaries[1]['ventricle'].append('One of two lower chambers of the heart')
listOfDictionaries[1]['ventricle'].append('One of four cavities in the brain')


Also, you can put different types of things into the same level of a list or dictionary, but I don't recommend it:
list_of_stuff = [[], 1, "Hello", {'name': 'Peter'}]
 
#This is totally legal, but just try writing a for loop that handles each one correctly!
 
for thing in list_of_stuff:
    print thing
    # This is about the only thing I can think of that would work on all of the things in that list...
 



Exercises


1. a) A dictionary of dictionaries. You want to store sequence information from human, mouse, and rat. Each species has a set of gene names corresponding to a set of sequences-- make a dictionary that has keys 'human', 'mouse', and 'rat', with each key having as a value a dictionary with gene names as keys and sequences as values.

The data are:
Human genes:
'TallnessGene' has sequence 'AATAGCAG'
'SmartnessGene' has sequence 'TGACGGA'

Mouse genes:
'FuzzynessGene' has sequence 'CCCCCCA'
'BeadyLittleEyesGene' has sequence 'ATAGCGC'

Rat genes:
'FuzzynessGene' has sequence 'CCCTCCA'
'BiggerThanMouseGene' has sequence 'GGACAATT'

b) Using the 'keys' method, print the names of the human genes.
c) Print out the sequence of FuzzynessGene in rat and mouse.
d) Print the third nucleotide of the human tallness gene.

2. A list of lists. You want to store the results of three different time series experiments, each with four data points. You should do this by creating a list with three elements. Each of these elements is a four element list. There are a few ways to do this, which are outlined in parts (a), (b), and (c) below.

run1: 2,3,5,5
run2: 2,2,4,5
run3: 3,3,4,6

a) Create an empty list. Use the append method three times to make this a list of three empty lists. For each of these empty lists, use the append method four times to populate the list of lists with the data above.

b) Create a list of three empty lists without using either 'append' or '*' (you will run into a complication of using '*' in this context in the last exercise today). For each of these empty lists, use the append method four times to populate the list of lists with the data above-- feel free to copy and paste this part from your solution to part (a).

c) Create your list of lists, complete with data, all in one line.

3. A dictionary of lists. Here, you have some structural data for several different species-- you want to store the B-factor of each position in several orthologous proteins. I picture this as a dictionary of lists: the dictionary is keyed with the species name, which has as a value a list that relates the position to the B-factor.

Human: 5,4,6,7
Mouse: 8,12,11,14
Rat: 10,11,13,15

a) Create an empty dictionary. Create three key-value pairs with the names as keys and empty lists as values. Use the append method to populate this dictionary of lists with data.

b) Create the populated dictionary all in one line, as in problem 2c.

4. Return to the time point data from problem two. Make three lists called 'run1', 'run2', and 'run3', and from them make a list of lists:
...
run1 = [2,3,5,5]
run2 = [2,2,4,5]
run3 = [3,3,4,6]
listOfRuns = [run1,run2,run3]


a) Alter the first element of run1. Has anything happened to listOfRuns?
b) Specify listOfRuns similarly, but using copies of run1, run2, and run3 instead of the lists themselves. Now, alter the first element of run1 and see what has happened to listOfRuns.

5. Use '*' (multiplication) to create a list of three empty lists. Use the append method to add the element "i belong first!" to the first list of the list of lists. Print out the list of lists. What has happened? This is hard-- don't worry about getting a TA.

Solutions


1) Dictionary of Dictionaries


#!/usr/bin/env python
 
# set out top-level dictionary
species = {}
species['human'] = {}
species['mouse'] = {}
species['rat'] = {}
 
# add human data
species['human']['Tallness'] = 'AATAGCAG'
species['human']['Smartness'] = 'TGACGGA'
 
# add mouse data
species['mouse']['Fuzzyness'] = 'CCCCCCA'
species['mouse']['BeadyLittleEyes'] = 'ATAGCGC'
 
# add rat data
species['rat']['Fuzzyness'] = 'CCCTCCA'
species['rat']['BiggerThanMouse'] = 'GGACAATT'
 
print species['human'].keys()
 
print species['mouse']['Fuzzyness']
print species['rat']['Fuzzyness']
 
print species['human']['tallness'][2]
 


2) List of Lists


#!/usr/bin/env python
 
List = []
 
List.append([])
List.append([])
List.append([])
 
List[0].append(2)
List[0].append(3)
List[0].append(5)
List[0].append(5)
 
List[1].append(2)
List[1].append(2)
List[1].append(4)
List[1].append(5)
 
List[2].append(3)
List[2].append(3)
List[2].append(4)
List[2].append(6)
 
print List
 
 
# part b
 
#!/usr/bin/env python
 
List = [  [], [], [] ]
 
List[0].append(2)
List[0].append(3)
List[0].append(5)
List[0].append(5)
 
List[1].append(2)
List[1].append(2)
List[1].append(4)
List[1].append(5)
 
List[2].append(3)
List[2].append(3)
List[2].append(4)
List[2].append(6)
 
print List
 
# part c
 
#!/usr/bin/env python
 
List = [[2,3,5,5], [2,2,4,5], [3,3,4,6]]
 
print List
 

3) Dictionary of Lists


#!/usr/bin/env python
 
# part A
dict = {}
dict['human'] = []
dict['mouse'] = []
dict['rat'] = []
 
dict['human'].append(5)
dict['human'].append(4)
dict['human'].append(6)
dict['human'].append(7)
 
dict['mouse'].append(8)
dict['mouse'].append(12)
dict['mouse'].append(11)
dict['mouse'].append(14)
 
dict['rat'].append(10)
dict['rat'].append(11)
dict['rat'].append(13)
dict['rat'].append(15)
 
# part B
dict = {'human':[5,4,6,7],'mouse':[8,12,11,14],'rat':[10,11,13,15]}
 

4) Bottom-up List of Lists


#!/usr/bin/env python
 
# part A
run1 = [2,3,5,5]
run2 = [2,2,4,5]
run3 = [3,3,4,6]
listOfRuns = [run1,run2,run3]
 
run1[0] = 500
print listOfRuns
# the first element of the first element of listOfRuns is now 500 ins
tead of two
 
# part B
run1 = [2,3,5,5]
run2 = [2,2,4,5]
run3 = [3,3,4,6]
listOfRuns = [run1[:],run2[:],run3[:]]
run1[0] = 500
print listOfRuns
# list of runs remains the same
 


5) Lists and References


#!/usr/bin/env python
 
listOfLists = [[]] * 3
print listOfLists
listOfLists[0].append('i belong first!')
print listOfLists
 
# the multiplication operation has taken a single empty list and
# duplicated it-- it's now three copies of the same empty list that are all
# equivalent to each other.  By changing one, we change all of them.