We use text all the time in science and computing to store information like:
In Python we store this kind of data in strings.
In Python 2, strings can have one of two major types: str and unicode.
We'll work with str here, but everything is basically the same using unicode.
Strings are created using either single or double quotes. It doesn't typically matter which kind of quotes you use, but they do need to match.
genus = 'Dipodomys'
species = "spectabilis"
print(genus)
print(species)
Dipodomys
spectabilis
If we want to create a string that has multiple lines we can do this using triple quotes.
ds_description = """Dipodomys spectabilis is the
scientific name for the
Banner-tailed Kangaroo Rat."""
print(ds_description)
Dipodomys spectabilis is the
scientific name for the
Banner-tailed Kangaroo Rat.
Python uses a single function to determine the length of most things, including strings: the len() function.
latin_binomial = "Dipodomys ordii"
len(latin_binomial)
15
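As a small sketch of the "most things" point, the same len() call can be applied to a string, a list, and an empty string. The `species_list` name and its contents are made up for this illustration.

```python
latin_binomial = "Dipodomys ordii"
species_list = ["spectabilis", "ordii", "merriami"]  # hypothetical example list

binomial_length = len(latin_binomial)  # counts characters, including the space
list_length = len(species_list)        # counts items in the list
empty_length = len("")                 # an empty string has length 0

print(binomial_length)
print(list_length)
print(empty_length)
```

Note that the space between the two words counts toward the length of the string.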
We can combine strings using the + operator.
genus + species + 'weighs about 125 grams.'
'Dipodomysspectabilisweighs about 125 grams.'
If we want spaces between words we need to add them explicitly.
genus + ' ' + species + ' weighs about 125 grams.'
'Dipodomys spectabilis weighs about 125 grams.'
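One thing to watch for when combining strings this way is that + only joins strings to other strings. The sketch below assumes the mass is stored as a number in a hypothetical variable called `mass_in_grams`, and converts it with str() before joining.

```python
genus = 'Dipodomys'
species = 'spectabilis'
mass_in_grams = 125  # hypothetical variable, not defined in the text above

# The + operator only joins strings to strings, so the number is converted with str().
sentence = genus + ' ' + species + ' weighs about ' + str(mass_in_grams) + ' grams.'
print(sentence)
```

Leaving out the str() call raises a TypeError, because Python will not guess how to turn the number into text.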
A better way to produce this kind of output in Python is to use formatted strings. Wherever we want to place a variable or a value in the string, we put a % followed by a letter that says how the value should be formatted (s for a string, d for an integer, f for a float, and so on). After the string we add a % and then, in parentheses, a comma-separated list of the values or variables to insert.
output = "%s %s weighs about %d grams." % (genus, species, 125)
print(output)
Dipodomys spectabilis weighs about 125 grams.
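To illustrate how the letter after the % changes the result, here is a minimal sketch that formats the same number as an integer and as a float with one decimal place. The `mass` value is a made-up measurement, not one from the text.

```python
genus = 'Dipodomys'
species = 'spectabilis'
mass = 124.56  # hypothetical measurement in grams

# %d drops the fractional part; %.1f rounds to one decimal place.
as_integer = "%s %s weighs about %d grams." % (genus, species, mass)
as_float = "%s %s weighs about %.1f grams." % (genus, species, mass)

print(as_integer)
print(as_float)
```

The values in the parentheses are matched to the % placeholders in order, so there must be one value per placeholder.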
Sometimes in programming we need to change the way a character works, or add a special character to a string. To do this we use escape characters. For example, what if we want to include an apostrophe in a string? If we just add it then things go wrong:
print('The individual's mass is 122 grams.')
  File "<ipython-input-7-cd1ab404344a>", line 1
    print('The individual's mass is 122 grams.')
                          ^
SyntaxError: invalid syntax
This happens because when Python encounters the apostrophe it thinks we're telling it to end the string, and it doesn't understand what all of the stuff coming after the string is.
To tell Python that we actually want an apostrophe we use an escape character, the \ in this case, so instead of typing ' we type \'.
print('The individual\'s mass is 122 grams.')
The individual's mass is 122 grams.
Other escape characters include \n (new line), \t (tab), and \\ (a literal backslash).
Doubling up the escape character to get the character itself is the standard approach to handling that character.
In fact, if we look at our multi-line string from above, we'll see that it is actually just a regular string, with some new lines inserted using \n.
ds_description
'Dipodomys spectabilis is the\nscientific name for the\nBanner-tailed Kangaroo Rat.'
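To see a few escape sequences in action, the sketch below builds strings containing a tab, a new line, and literal backslashes. The file path is a hypothetical example chosen only because it contains backslashes.

```python
tabbed = "genus\tspecies"              # \t inserts a tab character
two_lines = "Dipodomys\nspectabilis"   # \n inserts a new line
windows_path = "C:\\data\\rodents"     # \\ produces one backslash (hypothetical path)

print(tabbed)
print(two_lines)
print(windows_path)

# Each escape sequence is stored as a single character.
print(len(tabbed))
print(windows_path.count("\\"))
```

Although each escape sequence takes two keystrokes to type, it counts as one character in the resulting string.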
Because Python allows both single quotes and double quotes, there is also an easy way to avoid escaping characters in some cases. For example:
print("The individual's mass is 122 grams.")
print('The original paper states that "The mass of Dipodomys spectabilis is approximately 125 grams."')
The individual's mass is 122 grams.
The original paper states that "The mass of Dipodomys spectabilis is approximately 125 grams."
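As a quick check that the two approaches are interchangeable, this sketch compares an escaped single-quoted string with the same text written in double quotes. The variable names are made up for the illustration.

```python
escaped = 'The individual\'s mass is 122 grams.'      # apostrophe escaped with \
double_quoted = "The individual's mass is 122 grams."  # no escape needed

# Both spellings produce exactly the same string.
same_text = (escaped == double_quoted)
print(same_text)
```

The choice of quotes affects only how the string is typed, not the value that is stored.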
In Python 2, the string module has a lot of useful functions for working with strings. Functions in this module can change capitalization (upper, lower, capitalize), remove excess whitespace (strip), find the location of substrings (find), split a string into pieces (split), and count the number of occurrences of particular characters (count).
import string
genus = 'Dipodomys'
species = ' spectabilis'
latin_binomial = 'Dipodomys ordii'
dna_seq = 'atgcagatcctgtgtgtctagctaag'
print("The lower case version of genus is: %s" % string.lower(genus))
print("The upper case version of species is: %s" % string.upper(species))
print("The value of species without the leading whitespace is: %s" % string.strip(species))
print("The location of the start of the first 'tcct' in dna_seq is: %s" % string.find(dna_seq, 'tcct'))
print("The number of a's in dna_seq is: %s" % string.count(dna_seq, 'a'))
genus, species = string.split(latin_binomial)
print("The genus in latin_binomial is: %s" % genus)
print("The species in latin_binomial is: %s" % species)
The lower case version of genus is: dipodomys
The upper case version of species is:  SPECTABILIS
The value of species without the leading whitespace is: spectabilis
The location of the start of the first 'tcct' in dna_seq is: 7
The number of a's in dna_seq is: 6
The genus in latin_binomial is: Dipodomys
The species in latin_binomial is: ordii
Many kinds of objects in Python carry their own functions with them. These kinds of functions are called methods.
Instead of doing something to any object, methods do something to the object they are attached to.
To call a method for a particular object we use the object name, followed by a period (called a dot), followed by the name of the method. For example, if we want to make all of the letters in a string capitals we can use the .upper() method.
genus = "Dipodomys"
upper_cased_genus = genus.upper()
print(upper_cased_genus)
DIPODOMYS
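It may help to note that a string method returns a new string rather than changing the one it is attached to. A minimal sketch, reusing the names from the example above:

```python
genus = "Dipodomys"
upper_cased_genus = genus.upper()  # returns a new, upper-cased string

print(genus)              # the original string is unchanged
print(upper_cased_genus)
```

If we want to keep the result, we need to assign it to a variable, as done with `upper_cased_genus` here.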
All of the functions that are available in the string module are also available as methods.
genus = 'Dipodomys'
species = ' spectabilis'
latin_binomial = 'Dipodomys ordii'
dna_seq = 'atgcagatcctgtgtgtctagctaag'
print("The lower case version of genus is: %s" % genus.lower())
print("The upper case version of species is: %s" % species.upper())
print("The value of species without the leading whitespace is: %s" % species.strip())
print("The location of the start of the first 'tcct' in dna_seq is: %s" % dna_seq.find('tcct'))
print("The number of a's in dna_seq is: %s" % dna_seq.count('a'))
genus, species = latin_binomial.split()
print("The genus in latin_binomial is: %s" % genus)
print("The species in latin_binomial is: %s" % species)
The lower case version of genus is: dipodomys
The upper case version of species is:  SPECTABILIS
The value of species without the leading whitespace is: spectabilis
The location of the start of the first 'tcct' in dna_seq is: 7
The number of a's in dna_seq is: 6
The genus in latin_binomial is: Dipodomys
The species in latin_binomial is: ordii
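Because each of these methods returns a string, method calls can be chained one after another. The sketch below cleans up a messy name in a single line; `raw_name` and its contents are hypothetical input invented for this example.

```python
raw_name = '  dipodomys SPECTABILIS \n'  # hypothetical messy input

# strip() removes surrounding whitespace, lower() lower-cases everything,
# and capitalize() upper-cases only the first character.
cleaned = raw_name.strip().lower().capitalize()
genus, species = cleaned.split()

print(cleaned)
print(genus)
print(species)
```

The calls run from left to right, so each method works on the result of the one before it.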