## With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

No credit card required

# 4.18. Counting Instances of Each Word in a Text File

## Problem

You want to count the number of occurrences of each word in a text file.

## Solution

Use `operator>>`, defined in `<string>`, to read contiguous chunks of text from the input file, and use a `map`, defined in `<map>`, to store each word and its frequency in the file. Example 4-27 demonstrates how to do this.

Example 4-27. Counting word frequencies

```1   #include <iostream>
2   #include <fstream>
3   #include <map>
4   #include <string>
5
6   typedef std::map<std::string, int> StrIntMap;
7
8   void countWords(std::istream& in, StrIntMap& words) {
9
10     std::string s;
11
12     while (in >> s) {
13        ++words[s];
14     }
15  }
16
17  int main(int argc, char** argv) {
18
19     if (argc < 2)
20        return(EXIT_FAILURE);
21
22     std::ifstream in(argv[1]);
23
24     if (!in)
25        exit(EXIT_FAILURE);
26
27     StrIntMap w;
28     countWords(in, w);
29
30     for (StrIntMap::iterator p = w.begin();
31          p != w.end(); ++p) {
32        std::cout << p->first << " occurred "
33                  << p->second << " times.\n";
34     }
35  }```

## Discussion

Example 4-27 looks simple enough, but there is more going on than it appears. Most of the subtleties have to do with `map`s, so let's talk about them first.

If you're not familiar with `map`s, you should be. A `map` is a container class template that is part of the STL. It stores key-value pairs in order according to `std::less`, or your custom comparison function, should you supply one. The kinds of keys and values you can store in it depend only on your imagination; in this example, we are just going to store `string`s and `int`s.

I used a `typedef` on line 6 to make the code cleaner:

`typedef map<string, int> StrIntMap;`

Thus, a `StrIntMap` is a `map` that stores `string`/`int` pairs. Each `string` is a unique word—which is why I'm using it as the key—that has been read in and its associated `int` is the number of times it occurs. All that's left is to read in each of the words one-at-a-time, add it to the `map` if it's not already there, and increment its associated count value if it is.

This is what `countWords` does. The essential logic is brief:

```while (in >> s) {
++words[s];
}```

`operator>>` reads in contiguous chunks of non-whitespace from its lefthand side operand (an `istream`) and places them in the righthand side operand (a `string`). Once I've read a "word," all I have to do is update the statistics in the map, and that is done with the following line:

`++words[s];`

`map` defines `operator[]` for retrieving a value given a key (it actually returns a reference to the value itself), so to increment it, just increment the value indexed at the particular key. But something about this might seem a little weird. What if the key isn't already in the map? Don't we try to increment a nonexistent index, and crash like we would with an array? No, `map` does `operator[]` differently than other STL containers or ordinary, C-style arrays.

In a `map`, `operator[]` does two things: if the key does not already exist, it creates a value by using that value's type's default constructor and adds that key/value pair to the map, if the key already exists in the map, no modification is made. In both cases, a reference to the value for the specified key is returned, even if that value was just created with its default constructor. This is a handy feature (if you know it's there), because it eliminates the need for client code to check for a key's existence before inserting it.

Now, look at lines 32 and 33. The iterator refers to members called `first` and `second--`what are those? `map`s play a trick on you by using another class template to store your name value pairs: the `pair` class template defined in `<utility>` (included by `<map>` already). If you are iterating through the items stored in a map, you will be pointing to `pair` objects. Working with `pair`s is simple, the first item in a pair is stored in the `first` member, and the second is stored in, well, `second`.

I used `operator>>` in Example 4-27 to read in contiguous chunks of text from the input stream, which is different than some of the other examples. I did this to demonstrate that it can be done, but you will almost certainly need to customize the behavior based on your definition of a "word" in a text file. For example, consider an excerpt of the output produced by Example 4-27:

```with occurred 5 times.
work occurred 3 times.
workers occurred 3 times.
workers. occurred 1 times.
years occurred 2 times.
years. occurred 1 times.```

Notice that the periods on the end of words are included as part of each word. Most likely, you will want to change the definition of words to mean only alpha or alphanumeric characters, as I did in Recipe Recipe 4.17 by using some of the character-testing functions in `<cctype>` and `<cwctype>`.