A Closer Look at Python: Texts as Lists of Words

You’ve seen some important elements of the Python programming language. Let’s take a few moments to review them systematically.

Lists

What is a text? At one level, it is a sequence of symbols on a page such as this one. At another level, it is a sequence of chapters, made up of a sequence of sections, where each section is a sequence of paragraphs, and so on. However, for our purposes, we will think of a text as nothing more than a sequence of words and punctuation. Here’s how we represent text in Python, in this case the opening sentence of Moby Dick:

>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>>

After the prompt we’ve given a name we made up, sent1, followed by the equals sign, and then some quoted words, separated with commas, and surrounded with brackets. This bracketed material is known as a list in Python: it is how we store a text. We can inspect it by typing the name 1. We can ask for its length 2. We can even apply our own lexical_diversity() function to it 3.

>>> sent1 1
['Call', 'me', 'Ishmael', '.']
>>> len(sent1) 2
4
>>> lexical_diversity(sent1) 3
1.0
>>>

Some more lists have been defined for you, one for the opening sentence of each of our texts, sent2sent9. We inspect two of them here; you can see the rest for yourself using the Python interpreter (if you get an error saying that sent2 is not defined, you need to first type from nltk.book import *).

>>> sent2
['The', 'family', 'of', 'Dashwood', 'had', 'long',
'been', 'settled', 'in', 'Sussex', '.']
>>> sent3
['In', 'the', 'beginning', 'God', 'created', 'the',
'heaven', 'and', 'the', 'earth', '.']
>>>

Note

Your Turn: Make up a few sentences of your own, by typing a name, equals sign, and a list of words, like this: ex1 = ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']. Repeat some of the other Python operations we saw earlier in Computing with Language: Texts and Words, e.g., sorted(ex1), len(set(ex1)), ex1.count('the').

A pleasant surprise is that we can use Python’s addition operator on lists. Adding two lists 1 creates a new list with everything from the first list, followed by everything from the second list:

>>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail'] 1
['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']

Note

This special use of the addition operation is called concatenation; it combines the lists together into a single list. We can concatenate sentences to build up a text.

We don’t have to literally type the lists either; we can use short names that refer to pre-defined lists.

>>> sent4 + sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the',
'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']
>>>

What if we want to add a single item to a list? This is known as appending. When we append() to a list, the list itself is updated as a result of the operation.

>>> sent1.append("Some")
>>> sent1
['Call', 'me', 'Ishmael', '.', 'Some']
>>>

Indexing Lists

As we have seen, a text in Python is a list of words, represented using a combination of brackets and quotes. Just as with an ordinary page of text, we can count up the total number of words in text1 with len(text1), and count the occurrences in a text of a particular word—say, heaven—using text1.count('heaven').

With some patience, we can pick out the 1st, 173rd, or even 14,278th word in a printed text. Analogously, we can identify the elements of a Python list by their order of occurrence in the list. The number that represents this position is the item’s index. We instruct Python to show us the item that occurs at an index such as 173 in a text by writing the name of the text followed by the index inside square brackets:

>>> text4[173]
'awaken'
>>>

We can do the converse; given a word, find the index of when it first occurs:

>>> text4.index('awaken')
173
>>>

Indexes are a common way to access the words of a text, or, more generally, the elements of any list. Python permits us to access sublists as well, extracting manageable pieces of language from large texts, a technique known as slicing.

>>> text5[16715:16735]
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good',
'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without',
'buying', 'it']
>>> text6[1600:1625]
['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We',
'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive',
'officer', 'for', 'the', 'week']
>>>

Indexes have some subtleties, and we’ll explore these with the help of an artificial sentence:

>>> sent = ['word1', 'word2', 'word3', 'word4', 'word5',
...         'word6', 'word7', 'word8', 'word9', 'word10']
>>> sent[0]
'word1'
>>> sent[9]
'word10'
>>>

Notice that our indexes start from zero: sent element zero, written sent[0], is the first word, 'word1', whereas sent element 9 is 'word10'. The reason is simple: the moment Python accesses the content of a list from the computer’s memory, it is already at the first element; we have to tell it how many elements forward to go. Thus, zero steps forward leaves it at the first element.

Note

This practice of counting from zero is initially confusing, but typical of modern programming languages. You’ll quickly get the hang of it if you’ve mastered the system of counting centuries where 19XY is a year in the 20th century, or if you live in a country where the floors of a building are numbered from 1, and so walking up n-1 flights of stairs takes you to level n.

Now, if we accidentally use an index that is too large, we get an error:

>>> sent[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: list index out of range
>>>

This time it is not a syntax error, because the program fragment is syntactically correct. Instead, it is a runtime error, and it produces a Traceback message that shows the context of the error, followed by the name of the error, IndexError, and a brief explanation.

Let’s take a closer look at slicing, using our artificial sentence again. Here we verify that the slice 5:8 includes sent elements at indexes 5, 6, and 7:

>>> sent[5:8]
['word6', 'word7', 'word8']
>>> sent[5]
'word6'
>>> sent[6]
'word7'
>>> sent[7]
'word8'
>>>

By convention, m:n means elements mn-1. As the next example shows, we can omit the first number if the slice begins at the start of the list 1, and we can omit the second number if the slice goes to the end 2:

>>> sent[:3] 1
['word1', 'word2', 'word3']
>>> text2[141525:] 2
['among', 'the', 'merits', 'and', 'the', 'happiness', 'of', 'Elinor', 'and', 'Marianne',
',', 'let', 'it', 'not', 'be', 'ranked', 'as', 'the', 'least', 'considerable', ',',
'that', 'though', 'sisters', ',', 'and', 'living', 'almost', 'within', 'sight', 'of',
'each', 'other', ',', 'they', 'could', 'live', 'without', 'disagreement', 'between',
'themselves', ',', 'or', 'producing', 'coolness', 'between', 'their', 'husbands', '.',
'THE', 'END']
>>>

We can modify an element of a list by assigning to one of its index values. In the next example, we put sent[0] on the left of the equals sign 1. We can also replace an entire slice with new material 2. A consequence of this last change is that the list only has four elements, and accessing a later value generates an error 3.

>>> sent[0] = 'First' 1
>>> sent[9] = 'Last'
>>> len(sent)
10
>>> sent[1:9] = ['Second', 'Third'] 2
>>> sent
['First', 'Second', 'Third', 'Last']
>>> sent[9] 3
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: list index out of range
>>>

Note

Your Turn: Take a few minutes to define a sentence of your own and modify individual words and groups of words (slices) using the same methods used earlier. Check your understanding by trying the exercises on lists at the end of this chapter.

Variables

From the start of Computing with Language: Texts and Words, you have had access to texts called text1, text2, and so on. It saved a lot of typing to be able to refer to a 250,000-word book with a short name like this! In general, we can make up names for anything we care to calculate. We did this ourselves in the previous sections, e.g., defining a variable sent1, as follows:

>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>>

Such lines have the form: variable = expression. Python will evaluate the expression, and save its result to the variable. This process is called assignment. It does not generate any output; you have to type the variable on a line of its own to inspect its contents. The equals sign is slightly misleading, since information is moving from the right side to the left. It might help to think of it as a left-arrow. The name of the variable can be anything you like, e.g., my_sent, sentence, xyzzy. It must start with a letter, and can include numbers and underscores. Here are some examples of variables and assignments:

>>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
... 'forth', 'from', 'Camelot', '.']
>>> noun_phrase = my_sent[1:4]
>>> noun_phrase
['bold', 'Sir', 'Robin']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Robin', 'Sir', 'bold']
>>>

Remember that capitalized words appear before lowercase words in sorted lists.

Note

Notice in the previous example that we split the definition of my_sent over two lines. Python expressions can be split across multiple lines, so long as this happens within any kind of brackets. Python uses the ... prompt to indicate that more input is expected. It doesn’t matter how much indentation is used in these continuation lines, but some indentation usually makes them easier to read.

It is good to choose meaningful variable names to remind you—and to help anyone else who reads your Python code—what your code is meant to do. Python does not try to make sense of the names; it blindly follows your instructions, and does not object if you do something confusing, such as one = 'two' or two = 3. The only restriction is that a variable name cannot be any of Python’s reserved words, such as def, if, not, and import. If you use a reserved word, Python will produce a syntax error:

>>> not = 'Camelot'
File "<stdin>", line 1
    not = 'Camelot'
        ^
SyntaxError: invalid syntax
>>>

We will often use variables to hold intermediate steps of a computation, especially when this makes the code easier to follow. Thus len(set(text1)) could also be written:

>>> vocab = set(text1)
>>> vocab_size = len(vocab)
>>> vocab_size
19317
>>>

Caution!

Take care with your choice of names (or identifiers) for Python variables. First, you should start the name with a letter, optionally followed by digits (0 to 9) or letters. Thus, abc23 is fine, but 23abc will cause a syntax error. Names are case-sensitive, which means that myVar and myvar are distinct variables. Variable names cannot contain whitespace, but you can separate words using an underscore, e.g., my_var. Be careful not to insert a hyphen instead of an underscore: my-var is wrong, since Python interprets the - as a minus sign.

Strings

Some of the methods we used to access the elements of a list also work with individual words, or strings. For example, we can assign a string to a variable 1, index a string 2, and slice a string 3.

>>> name = 'Monty' 1
>>> name[0] 2
'M'
>>> name[:4] 3
'Mont'
>>>

We can also perform multiplication and addition with strings:

>>> name * 2
'MontyMonty'
>>> name + '!'
'Monty!'
>>>

We can join the words of a list to make a single string, or split a string into a list, as follows:

>>> ' '.join(['Monty', 'Python'])
'Monty Python'
>>> 'Monty Python'.split()
['Monty', 'Python']
>>>

We will come back to the topic of strings in Chapter 3. For the time being, we have two important building blocks—lists and strings—and are ready to get back to some language analysis.

Get Natural Language Processing with Python now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.