Counting Lines, Words, and Characters

We have used the word-count utility, wc, a few times before. It is probably one of the oldest, and simplest, tools in the Unix toolbox, and POSIX standardizes it. By default, wc outputs a one-line report of the number of lines, words, and bytes:

$ echo This is a test of the emergency broadcast system | wc    Report counts
      1       9      49

Request a subset of those results with the -c (bytes), -l (lines), and -w (words) options:

$ echo Testing one two three | wc -c     Count bytes
22

$ echo Testing one two three | wc -l     Count lines
1

$ echo Testing one two three | wc -w     Count words
4
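
One detail worth keeping in mind when reading these counts: wc -l counts newline characters, so input that lacks a final newline reports one line fewer than a reader might expect. The following is a minimal sketch of that behavior, using printf (which, unlike echo, adds no trailing newline). It also combines two options in a single call. The variable names are illustrative only, and the tr step is there because some wc implementations pad their output with leading spaces.

```shell
# printf emits no trailing newline, so wc -l sees zero newline characters.
no_newline=$(printf 'Testing one two three' | wc -l | tr -d ' ')

# echo appends a newline, so the same text counts as one line.
with_newline=$(echo Testing one two three | wc -l | tr -d ' ')

# Options can be combined; the counts come out in the fixed order
# lines, words, bytes, whatever order the options were given in.
# set -- splits the report into the positional parameters $1 and $2.
set -- $(echo Testing one two three | wc -lw)
lines=$1
words=$2

echo "without newline: $no_newline, with newline: $with_newline"
echo "lines=$lines words=$words"
```

Run as written, this should print "without newline: 0, with newline: 1" followed by "lines=1 words=4".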

The -c option originally stood for character count. With multibyte character-set encodings such as UTF-8 now common, however, bytes are no longer synonymous with characters, so POSIX introduced the -m option to count characters rather than bytes. For single-byte (8-bit) character data, -m gives the same result as -c.
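
To make the difference between bytes and characters concrete, here is a small sketch that feeds wc a word containing one two-byte UTF-8 character. The sample word and variable names are illustrative. The byte count is fixed, but the character count depends on the locale the shell happens to be running in, so no single value is promised for it here.

```shell
# "café" followed by a newline; \303\251 is the two-byte UTF-8 encoding of é.
bytes=$(printf 'caf\303\251\n' | wc -c | tr -d ' ')

# -m counts characters as interpreted by the current locale.
chars=$(printf 'caf\303\251\n' | wc -m | tr -d ' ')

echo "bytes=$bytes chars=$chars"
```

The byte count is always 6. In a UTF-8 locale the character count should be 5, while in the C locale it matches the byte count.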

Although wc is most commonly used with input from a pipeline, it also accepts command-line file arguments, producing a one-line report for each, followed by a summary report:

$ wc /etc/passwd /etc/group              Count data in two files
    26     68   1631 /etc/passwd
 10376  10376 160082 /etc/group
 10402  10444 161713 total
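
In scripts it is often only the number that is wanted, not the filename next to it. The sketch below shows two common ways to get it, under some assumptions: the scratch file names are hypothetical, and mktemp -d is widely available but not required by POSIX.

```shell
# Create two small scratch files (hypothetical names) in a temporary directory.
tmpdir=$(mktemp -d)
printf 'one\ntwo\n' > "$tmpdir/a.txt"
printf 'three four\n' > "$tmpdir/b.txt"

# With file arguments, wc prints one line per file, then a "total" line.
wc -l "$tmpdir/a.txt" "$tmpdir/b.txt"

# The last line of that report carries the combined count in its first field.
total=$(wc -l "$tmpdir/a.txt" "$tmpdir/b.txt" | tail -n 1 | awk '{ print $1 }')

# Redirecting input instead of naming the file suppresses the filename,
# which is convenient when only the number is wanted.
count=$(wc -l < "$tmpdir/a.txt" | tr -d ' ')

echo "total=$total count=$count"
rm -rf "$tmpdir"
```

Here the final echo should report total=3 and count=2.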

Modern versions of wc are locale-aware: set the environment variable LC_CTYPE to the desired locale to influence wc's interpretation of byte sequences as characters and word separators.
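
The effect of the locale can be sketched by running the same input through wc -m twice. Two assumptions apply: the locale name en_US.UTF-8 may not be installed on a given system (locale -a lists what is available), and LC_ALL is used instead of LC_CTYPE because it overrides LC_CTYPE when both are set in the environment.

```shell
# "naïve" plus a newline; \303\257 is the two-byte UTF-8 encoding of ï,
# so the input is 7 bytes long.

# In the C (POSIX) locale every byte is treated as one character.
c_chars=$(printf 'na\303\257ve\n' | LC_ALL=C wc -m | tr -d ' ')

# In a UTF-8 locale the two-byte sequence counts as a single character.
# en_US.UTF-8 is an assumed locale name; if it is not installed, wc
# typically falls back to the C locale and reports the byte count instead.
utf8_chars=$(printf 'na\303\257ve\n' | LC_ALL=en_US.UTF-8 wc -m 2>/dev/null | tr -d ' ')

echo "C locale: $c_chars, UTF-8 locale: $utf8_chars"
```

The C-locale count is 7. Where the UTF-8 locale is available, the second count should be 6.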

In Chapter 5, we will develop a related tool, wf, to report the frequency of occurrence of each word.
