4.18. Counting Instances of Each Word in a Text File

Problem

You want to count the number of occurrences of each word in a text file.

Solution

Use operator>>, defined in <string>, to read contiguous chunks of text from the input file, and use a map, defined in <map>, to store each word and its frequency in the file. Example 4-27 demonstrates how to do this.

Example 4-27. Counting word frequencies

1   #include <iostream>
2   #include <fstream>
3   #include <map>
4   #include <string>
5
6   typedef std::map<std::string, int> StrIntMap;
7
8   void countWords(std::istream& in, StrIntMap& words) {
9
10     std::string s;
11
12     while (in >> s) {
13        ++words[s];
14     }
15  }       
16
17  int main(int argc, char** argv) {
18
19     if (argc < 2)
20        return(EXIT_FAILURE);
21
22     std::ifstream in(argv[1]);
23
24     if (!in)
25        exit(EXIT_FAILURE);
26
27     StrIntMap w;
28     countWords(in, w);
29
30     for (StrIntMap::iterator p = w.begin();
31          p != w.end(); ++p) {
32        std::cout << p->first << " occurred "
33                  << p->second << " times.\n";
34     }
35  }

Discussion

Example 4-27 looks simple enough, but there is more going on than it appears. Most of the subtleties have to do with maps, so let’s talk about them first.

If you’re not familiar with maps, you should be. A map is a container class template that is part of the STL. It stores key-value pairs in order according to std::less, or your custom comparison function, should you supply one. The kinds of keys and values you can store in it depend only on your imagination; in this example, we are just going to store strings and ints.

I used a typedef on line 6 to make the code cleaner:

typedef map<string, int> StrIntMap;

Thus, a StrIntMap is a map that stores string/int pairs. Each string is a unique word—which is why I’m using it as the key—that has been read in and its associated int is the number of times it occurs. All that’s left is to read in each of the words one-at-a-time, add it to the map if it’s not already there, and increment its associated count value if it is.

This is what countWords does. The essential logic is brief:

while (in >> s) {
   ++words[s];
}

operator>> reads in contiguous chunks of non-whitespace from its lefthand side operand (an istream) and places them in the righthand side operand (a string). Once I’ve read a “word,” all I have to do is update the statistics in the map, and that is done with the following line:

++words[s];

map defines operator[] for retrieving a value given a key (it actually returns a reference to the value itself), so to increment it, just increment the value indexed at the particular key. But something about this might seem a little weird. What if the key isn’t already in the map? Don’t we try to increment a nonexistent index, and crash like we would with an array? No, map does operator[] differently than other STL containers or ordinary, C-style arrays.

In a map, operator[] does two things: if the key does not already exist, it creates a value by using that value’s type’s default constructor and adds that key/value pair to the map, if the key already exists in the map, no modification is made. In both cases, a reference to the value for the specified key is returned, even if that value was just created with its default constructor. This is a handy feature (if you know it’s there), because it eliminates the need for client code to check for a key’s existence before inserting it.

Now, look at lines 32 and 33. The iterator refers to members called first and second--what are those? maps play a trick on you by using another class template to store your name value pairs: the pair class template defined in <utility> (included by <map> already). If you are iterating through the items stored in a map, you will be pointing to pair objects. Working with pairs is simple, the first item in a pair is stored in the first member, and the second is stored in, well, second.

I used operator>> in Example 4-27 to read in contiguous chunks of text from the input stream, which is different than some of the other examples. I did this to demonstrate that it can be done, but you will almost certainly need to customize the behavior based on your definition of a “word” in a text file. For example, consider an excerpt of the output produced by Example 4-27:

with occurred 5 times.
work occurred 3 times.
workers occurred 3 times.
workers. occurred 1 times.
years occurred 2 times.
years. occurred 1 times.

Notice that the periods on the end of words are included as part of each word. Most likely, you will want to change the definition of words to mean only alpha or alphanumeric characters, as I did in Recipe Recipe 4.17 by using some of the character-testing functions in <cctype> and <cwctype>.

C++ Cookbook by D. Ryan Stephens, Christopher Diggins, Jonathan Turkanis, Jeff Cogswell

4.18. Counting Instances of Each Word in a Text File

Problem

Solution

Discussion

See Also

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly