4.17. Counting the Number of Characters, Words, and Lines in a Text File

Problem

You have to count the number of characters, words, and lines (or some other type of text element) in a text file.

Solution

Use an input stream to read the characters in, one at a time, and increment local statistics as you encounter characters, words, and line breaks. Example 4-26 contains the function countStuff, which does exactly that.

Example 4-26. Calculating statistics about a text file

#include <iostream>
#include <fstream>
#include <cstdlib>
#include <cctype>

using namespace std;

void countStuff(istream& in,
                int& chars,
                int& words,
                int& lines) {

   char cur = '\0';
   char last = '\0';
   chars = words = lines = 0;

   while (in.get(cur)) {
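      if (cur == '\n')            // '\n' also ends a "\r\n" pair
         lines++;
      else
         chars++;
      // Cast to unsigned char: a negative char is undefined for isalnum
      if (!std::isalnum(static_cast<unsigned char>(cur)) &&  // This is the end
          std::isalnum(static_cast<unsigned char>(last)))    // of a word
         words++;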
      last = cur;
   }
   if (chars > 0) {               // Adjust word and line counts for special case
      if (std::isalnum(static_cast<unsigned char>(last)))
         words++;
      lines++;
   }
}

int main(int argc, char** argv) {

   if (argc < 2)
      return(EXIT_FAILURE);

   ifstream in(argv[1]);

   if (!in)
      exit(EXIT_FAILURE);

   int c, w, l;

   countStuff(in, c, w, l);
   cout << "chars: " << c << '\n';
   cout << "words: " << w << '\n';
   cout << "lines: " << l << '\n';
}

Discussion

The algorithm here is straightforward. Characters are easy: increment the character count each time get returns a character that is not a line break. Lines are only slightly more difficult, since the way a line ends depends on the operating system. Thankfully, it’s usually either a new-line character (\n) or a carriage return line feed sequence (\r\n). Because that sequence also ends in \n, counting each \n captures both conventions. Words are easy or hard, depending on your definition of a word.
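If you also need to handle text where a lone carriage return ends a line, tracking the current and last characters lets you count each kind of break exactly once. The following is a minimal sketch under that assumption, not part of Example 4-26; the names countLineBreaks and countLineBreaksIn are hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts line breaks, treating "\n", "\r\n", and a
// lone "\r" each as a single break.
int countLineBreaks(std::istream& in) {
   int breaks = 0;
   char cur = '\0';
   char last = '\0';

   while (in.get(cur)) {
      if (cur == '\r')
         breaks++;                       // A lone "\r" or the start of "\r\n"
      else if (cur == '\n' && last != '\r')
         breaks++;                       // A "\n" not already counted via "\r"
      last = cur;
   }
   return breaks;
}

// Hypothetical convenience wrapper for in-memory text.
int countLineBreaksIn(const std::string& text) {
   std::istringstream in(text);
   return countLineBreaks(in);
}
```

With this sketch, countLineBreaksIn("one\ntwo\n") and countLineBreaksIn("one\r\ntwo\r\n") both report two breaks. If you read from a file, open it in binary mode so the stream does not translate line endings before you see them.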

For Example 4-26, I consider a word to be a contiguous sequence of alphanumeric characters. As I look at each character in the input stream, when I encounter a nonalphanumeric character, I look at the previous character to see if it was alphanumeric. If it was, then a word has just ended and I can increment the word count. I can tell if a character is alphanumeric by using isalnum from <cctype>. But that’s not all—you can test characters for a number of different qualities with similar functions. See Table 4-3 for the functions you can use to test character qualities. For wide characters, use the functions of the same name but with a “w” after the “is,” e.g., iswspace. The wide-character versions are declared in the header <cwctype>.
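To see how much the definition of a word matters, you can make the character test a parameter. The sketch below is an illustration rather than the recipe's code: countWords, countWordsIn, isAlnumChar, and isNonSpaceChar are hypothetical names, and the second predicate assumes you want whitespace-delimited words.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical predicates: each decides whether a character is part of a word.
bool isAlnumChar(char c) {
   return std::isalnum(static_cast<unsigned char>(c)) != 0;
}

bool isNonSpaceChar(char c) {
   return std::isspace(static_cast<unsigned char>(c)) == 0;
}

// Counts maximal runs of characters for which isWordChar returns true.
int countWords(std::istream& in, bool (*isWordChar)(char)) {
   int words = 0;
   bool inWord = false;
   char cur = '\0';

   while (in.get(cur)) {
      bool wordChar = isWordChar(cur);
      if (inWord && !wordChar)
         words++;                 // A word has just ended
      inWord = wordChar;
   }
   if (inWord)
      words++;                    // The stream ended in the middle of a word
   return words;
}

// Hypothetical convenience wrapper for in-memory text.
int countWordsIn(const std::string& text, bool (*isWordChar)(char)) {
   std::istringstream in(text);
   return countWords(in, isWordChar);
}
```

For the text "don't stop", the alphanumeric definition yields three words (don, t, stop), while the whitespace-delimited definition yields two.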

Table 4-3. Character test functions from <cctype> and <cwctype>

Function               Description

isalpha, iswalpha      Alpha characters: a-z, A-Z (upper- or lowercase).
isupper, iswupper      Alpha characters in uppercase only: A-Z.
islower, iswlower      Alpha characters in lowercase only: a-z.
isdigit, iswdigit      Numeric characters: 0-9.
isxdigit, iswxdigit    Hexadecimal numeric characters: 0-9, a-f, A-F.
isspace, iswspace      Whitespace characters: ' ', \n, \t, \v, \r, \f.
iscntrl, iswcntrl      Control characters: ASCII 0-31 and 127.
ispunct, iswpunct      Printable characters that are neither alphanumeric nor whitespace.
isalnum, iswalnum      isalpha or isdigit is true.
isprint, iswprint      Printable ASCII characters, including the space character.
isgraph, iswgraph      isalpha or isdigit or ispunct is true.
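As a rough illustration of how these tests can be combined, the following sketch tallies a few character classes in a string. CharStats and classifyChars are hypothetical names, and the results assume the default "C" locale.

```cpp
#include <cctype>
#include <string>

// Hypothetical tally of a few of the character classes from Table 4-3.
struct CharStats {
   int alpha;
   int digit;
   int space;
   int punct;
   int cntrl;
};

CharStats classifyChars(const std::string& text) {
   CharStats stats = {0, 0, 0, 0, 0};

   for (std::string::size_type i = 0; i < text.size(); ++i) {
      // Convert to unsigned char first: passing a negative char value to
      // the <cctype> functions is undefined behavior.
      unsigned char c = static_cast<unsigned char>(text[i]);

      // The classes overlap, so a character may be counted more than once
      // (for example, '\n' is both whitespace and a control character).
      if (std::isalpha(c)) stats.alpha++;
      if (std::isdigit(c)) stats.digit++;
      if (std::isspace(c)) stats.space++;
      if (std::ispunct(c)) stats.punct++;
      if (std::iscntrl(c)) stats.cntrl++;
   }
   return stats;
}
```

For the string "Ab1, \n", this reports two alpha characters, one digit, two whitespace characters, one punctuation character, and one control character.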

After all characters have been read in and the end of the stream has been reached, there is a bit of adjustment to do. First, the loop only counts line breaks, and not, strictly speaking, lines. Therefore, when the last line has no trailing line break, the count will be one less than the actual number of lines. To make this problem go away I just increment the line count by one if there are more than zero characters in the file. (A file that ends with a line break is consequently reported with one extra, empty line.) Second, if the stream ends with an alphanumeric character, the test for the end of the last word will never occur because I can’t test the next character. To account for this, I check if the last character in the stream is alphanumeric (also only when there are more than zero characters in the file) and increment the word count by one.
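If you would rather not count a trailing line break as the start of another line, one possible variation is to make the end-of-stream adjustment only when the final character is not a line break. This is a sketch of that alternative, not the approach taken in Example 4-26; countLines and countLinesIn are hypothetical names.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical variation: counts lines, adding the end-of-stream adjustment
// only when the text does not already end with '\n'.
int countLines(std::istream& in) {
   int lines = 0;
   bool sawAny = false;
   char cur = '\0';
   char last = '\0';

   while (in.get(cur)) {
      sawAny = true;
      if (cur == '\n')
         lines++;
      last = cur;
   }
   if (sawAny && last != '\n')
      lines++;                    // Final line had no trailing line break
   return lines;
}

// Hypothetical convenience wrapper for in-memory text.
int countLinesIn(const std::string& text) {
   std::istringstream in(text);
   return countLines(in);
}
```

Under this definition, "a\nb" and "a\nb\n" both contain two lines, and an empty stream contains none.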

The technique in Example 4-26 of using streams is nearly identical to that described in Recipe 4.14 and Recipe 4.15, but simpler since it’s just inspecting the file and not making any changes.
