Preface
Several kinds of tasks occur repeatedly when working with text files.
You might want to extract certain lines and discard the rest. Or you may
need to make changes wherever certain patterns appear, but leave the rest of
the file alone. Such jobs are often easy with awk. The awk utility
interprets a special-purpose programming language that makes it easy to
handle simple data-reformatting jobs.
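As a minimal sketch of the two kinds of jobs just described, the following commands first select matching lines and then change only the lines where a pattern appears. The file name people.txt and its contents are hypothetical sample data, made up purely for illustration.

```shell
# Create a small sample file (hypothetical data, not from the book).
printf 'alice 30\nbob 25\ncarol 41\n' > people.txt

# Extract the lines that contain "bob" and discard the rest.
awk '/bob/' people.txt

# Add 1 to the second field on lines containing "carol";
# every line, changed or not, is then printed.
awk '/carol/ { $2 = $2 + 1 } { print }' people.txt
```

A pattern with no action prints each matching line, and an action with no pattern runs for every line, which is why the second program passes the other lines through unchanged.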
The GNU implementation of awk is called gawk; if you invoke it with the
proper options or environment variables, it is fully compatible with the
POSIX[1] specification of the awk language and with the Unix version of
awk maintained by Brian Kernighan. This means that all properly written
awk programs should work with gawk. So most of the time, we don’t
distinguish between gawk and other awk implementations.
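The options and environment variable mentioned above can be sketched as follows. This assumes gawk is installed and on your PATH. The --posix and --traditional options and the POSIXLY_CORRECT variable are the gawk facilities for this; the one-line programs are only placeholders.

```shell
if command -v gawk >/dev/null 2>&1; then
    # Strict POSIX mode, requested with a command-line option.
    gawk --posix 'BEGIN { print "posix option" }'

    # The same mode, requested through the environment.
    POSIXLY_CORRECT=1 gawk 'BEGIN { print "posix environment" }'

    # Compatibility mode: GNU extensions to the language are disabled.
    gawk --traditional 'BEGIN { print "traditional" }'
else
    echo "gawk is not installed" >&2
fi
```

These placeholder programs behave the same in every mode; the modes matter only once a program relies on a gawk extension.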
Using awk you can:
Manage small, personal databases
Generate reports
Validate data
Produce indexes and perform other document-preparation tasks
Experiment with algorithms that you can adapt later to other computer languages
In addition, gawk provides facilities that make it easy to:
Extract bits and pieces of data for processing
Sort data
Perform simple network communications
Profile and debug awk programs
Extend the language with functions written in C or C++
This book teaches you about the awk language and how you can use it
effectively. You should already be familiar with basic system commands,
such as cat and ls,[2] as well as basic shell facilities, such as
input/output (I/O) redirection and pipes.
Implementations of the awk language are available for many different
computing environments. This book, while describing the awk language in
general, also describes the particular implementation of awk called gawk
(which stands for “GNU awk”). gawk runs on a broad range of Unix systems,
ranging from Intel-architecture PC-based computers up through large-scale
systems. gawk has also been ported to Mac OS X, Microsoft Windows (all
versions), and OpenVMS.[3]
History of awk and gawk
The name awk comes from the initials of its designers: Alfred V. Aho,
Peter J. Weinberger, and Brian W. Kernighan. The original version of awk
was written in 1977 at AT&T Bell Laboratories. In 1985, a new version made
the programming language more powerful, introducing user-defined functions,
multiple input streams, and computed regular expressions. This new version
became widely available with Unix System V Release 3.1 (1987). The version
in System V Release 4 (1989) added some new features and cleaned up the
behavior in some of the “dark corners” of the language. The specification
for awk in the POSIX Command Language and Utilities standard further
clarified the language. Both the gawk designers and the original awk
designers at Bell Laboratories provided feedback for the POSIX
specification.
Paul Rubin wrote gawk in 1986. Jay Fenlason completed it, with advice from
Richard Stallman. John Woods contributed parts of the code as well. In 1988
and 1989, David Trueman, with help from me, thoroughly reworked gawk for
compatibility with the newer awk. Circa 1994, I became the primary
maintainer. Current development focuses on bug fixes, performance
improvements, standards compliance, and, occasionally, new features.
In May 1997, Jürgen Kahrs felt the need for network access from awk, and
with a little help from me, set about adding features to do this for gawk.
At that time, he also wrote the bulk of TCP/IP Internetworking with gawk
(a separate document, available as part of the gawk distribution). His code
finally became part of the main gawk distribution with gawk version 3.1.
John Haque rewrote the gawk internals, in the process providing an
awk-level debugger. This version became available as gawk version 4.0 in
2011.
See Major Contributors to gawk for a full list of those who have made
important contributions to gawk.
A Rose by Any Other Name
The awk language has evolved over the years. Full details are provided in
Appendix A. The language described in this book is often referred to as
“new awk.” By analogy, the original version of awk is referred to as
“old awk.”
On most current systems, when you run the awk utility you get some version
of new awk.[4] If your system’s standard awk is the old one, you will see
something like this if you try the test program:
$ awk 1 /dev/null
error→ awk: syntax error near line 1
error→ awk: bailing out near line 1
In this case, you should find a version of new awk, or just install gawk!
Throughout this book, whenever we refer to a language feature that should
be available in any complete implementation of POSIX awk, we simply use the
term awk. When referring to a feature that is specific to the GNU
implementation, we use the term gawk.
Using This Book
The term awk refers to a particular program as well as to the language you
use to tell this program what to do. When we need to be careful, we call
the language “the awk language,” and the program “the awk utility.” This
book explains both how to write programs in the awk language and how to run
the awk utility. The term “awk program” refers to a program written by you
in the awk programming language.
Primarily, this book explains the features of awk as defined in the POSIX
standard. It does so in the context of the gawk implementation. While doing
so, it also attempts to describe important differences between gawk and
other awk implementations. Finally, it notes any gawk features that are not
in the POSIX standard for awk.
This book has the difficult task of being both a tutorial and a reference. If you are a novice, feel free to skip over details that seem too complex. You should also ignore the many cross-references; they are for the expert user and for the online Info and HTML versions of the book.
There are sidebars scattered throughout the book. They add a more complete explanation of points that are relevant, but not likely to be of interest on first reading.
Most of the time, the examples use complete awk programs. Some of the more
advanced sections show only the part of the awk program that illustrates
the concept being described.
Although this book is aimed principally at people who have not been exposed
to awk, there is a lot of information here that even the awk expert should
find useful. In particular, the description of POSIX awk and the example
programs in Chapter 10 and Chapter 11 should be of interest.
This book is split into several parts, as follows:
Part I describes the awk language and the gawk program in detail. It starts with the basics, and continues through all of the features of awk. It contains the following chapters:
Chapter 1, Getting Started with awk, provides the essentials you need to know to begin using awk.
Chapter 2, Running awk and gawk, describes how to run gawk, the meaning of its command-line options, and how it finds awk program source files.
Chapter 3, Regular Expressions, introduces regular expressions in general, and in particular the flavors supported by POSIX awk and gawk.
Chapter 4, Reading Input Files, describes how awk reads your data. It introduces the concepts of records and fields, as well as the getline command. I/O redirection is first described here. Network I/O is also briefly introduced here.
Chapter 5, Printing Output, describes how awk programs can produce output with print and printf.
Chapter 6, Expressions, describes expressions, which are the basic building blocks for getting most things done in a program.
Chapter 7, Patterns, Actions, and Variables, describes how to write patterns for matching records, actions for doing something when a record is matched, and the predefined variables awk and gawk use.
Chapter 8, Arrays in awk, covers awk’s one and only data structure: the associative array. Deleting array elements and whole arrays is described, as well as sorting arrays in gawk. The chapter also describes how gawk provides arrays of arrays.
Chapter 9, Functions, describes the built-in functions awk and gawk provide, as well as how to define your own functions. It also discusses how gawk lets you call functions indirectly.
Part II shows how to use awk and gawk for problem solving. There is lots of code here for you to read and learn from. This part contains the following chapters:
Chapter 10, A Library of awk Functions, provides a number of functions meant to be used from main awk programs.
Chapter 11, Practical awk Programs, provides many sample awk programs.
Reading these two chapters allows you to see awk solving real problems.
Part III focuses on features specific to gawk. It contains the following chapters:
Chapter 12, Advanced Features of gawk, describes a number of advanced features. Of particular note are the abilities to control the order of array traversal, have two-way communications with another process, perform TCP/IP networking, and profile your awk programs.
Chapter 13, Internationalization with gawk, describes special features for translating program messages into different languages at runtime.
Chapter 14, Debugging awk Programs, describes the gawk debugger.
Chapter 15, Arithmetic and Arbitrary-Precision Arithmetic with gawk, describes advanced arithmetic facilities.
Chapter 16, Writing Extensions for gawk, describes how to add new variables and functions to gawk by writing extensions in C or C++.
Part IV provides the following appendices, including the GNU General Public License:
Appendix A describes how the awk language has evolved since its first release to the present. It also describes how gawk has acquired features over time.
Appendix B describes how to get gawk, how to compile it on POSIX-compatible systems, and how to compile and use it on different non-POSIX systems. It also describes how to report bugs in gawk and where to get other freely available awk implementations.
Appendix C presents the license that covers the gawk source code.
The version of this book distributed with gawk contains additional
appendices and other end material. To save space, we have omitted them from
the printed edition. You may find them online, as follows:
The appendix on implementation notes describes how to disable gawk’s extensions, how to contribute new code to gawk, where to find information on some possible future directions for gawk development, and the design decisions behind the extension API.
The appendix on basic concepts provides some very cursory background material for those who are completely unfamiliar with computer programming.
The glossary defines most, if not all, of the significant terms used throughout the book. If you find terms that you aren’t familiar with, try looking them up here.
The GNU FDL is the license that covers this book.
Some of the chapters have exercise sections; these have also been omitted from the print edition but are available online.
Typographical Conventions
This book is written in Texinfo, the GNU documentation formatting language. A single Texinfo source file is used to produce both the printed and online versions of the documentation. Because of this, the typographical conventions are slightly different than in other books you may have read.
Examples you would type at the command line are preceded by the common
shell primary and secondary prompts, ‘$’ and ‘>’. Input that you type is
shown like this. Output from the command, usually its standard output,
appears like this. Error messages and other output on the command’s
standard error are preceded by the glyph “error→”. For example:
$ echo hi on stdout
hi on stdout
$ echo hello on stderr 1>&2
error→ hello on stderr
In the text, almost anything related to programming, such as command names,
variable and function names, and string, numeric, and regexp constants,
appears in this font. Code fragments appear in the same font and quoted,
‘like this’. Things that are replaced by the user or programmer appear in
this font. Options look like this: -f. Filenames are indicated like this:
/path/to/ourfile. The first occurrence of a new term is usually its
definition and appears in the same font as the previous occurrence of
“definition” in this sentence.
Characters that you type at the keyboard look like this. In particular,
there are special characters called “control characters.” These are
characters that you type by holding down both the CONTROL key and another
key, at the same time. For example, a Ctrl-d is typed by first pressing and
holding the CONTROL key, next pressing the d key, and finally releasing
both keys.
For the sake of brevity, throughout this book, we refer to Brian
Kernighan’s version of awk as “BWK awk.” (See Other Freely Available awk
Implementations for information on his and other versions.)
Note
Notes of interest look like this.
Caution
Cautionary or warning notes look like this.
Dark Corners
Dark corners are basically fractal—no matter how much you illuminate, there’s always a smaller but darker one.
—Brian Kernighan
Until the POSIX standard (and Effective awk Programming), many features of
awk were either poorly documented or not documented at all. Descriptions of
such features (often called “dark corners”) are noted in this book with
“(d.c.).”
But, as noted by the opening quote, any coverage of dark corners is by definition incomplete.
Extensions to the standard awk language that are supported by more than one
awk implementation are marked “(c.e.)” for “common extension.”
The GNU Project and This Book
The Free Software Foundation (FSF) is a nonprofit organization dedicated to the production and distribution of freely distributable software. It was founded by Richard M. Stallman, the author of the original Emacs editor. GNU Emacs is the most widely used version of Emacs today.
The GNU[5] Project is an ongoing effort on the part of the Free Software
Foundation to create a complete, freely distributable, POSIX-compliant
computing environment. The FSF uses the GNU General Public License (GPL)
to ensure that its software’s source code is always available to the end
user. The GPL applies to the C language source code for gawk. To find out
more about the FSF and the GNU
Project online, see the GNU Project’s home
page. This book may also be read from GNU’s
website.
The book you are reading is actually free—at least, the information
in it is free to anyone. The machine-readable source code for the book
comes with gawk.
The book itself has gone through multiple previous editions. Paul Rubin
wrote the very first draft of The GAWK Manual; it was around 40 pages long.
Diane Close and Richard Stallman improved it, yielding a version that was
around 90 pages and barely described the original, “old” version of awk.
I started working with that version in the fall of 1988. As work on it
progressed, the FSF published several preliminary versions (numbered 0.x).
In 1996, edition 1.0 was released with gawk 3.0.0. The FSF published the
first two editions under the title The GNU Awk User’s Guide. SSC published
two editions of the book under the title Effective awk Programming, and
O’Reilly published the third edition in 2001.
This edition maintains the basic structure of the previous editions.
For FSF edition 4.0, the content was thoroughly reviewed and updated. All
references to gawk versions prior to 4.0 were removed.
Of significant note for that edition was the addition of Chapter 14.
For FSF edition 4.1 (the fourth edition as published by O’Reilly), the content has been reorganized into parts, and the major new additions are Chapter 15 and Chapter 16.
This book will undoubtedly continue to evolve. If you find an error in the book, please report it! See Reporting Problems and Bugs for information on submitting problem reports electronically.
How to Stay Current
You may have a newer version of gawk than the one described here. To find
out what has changed, you should first look at the NEWS file in the gawk
distribution, which provides a high-level summary of the changes in each
release.
You can then look at the online version of this book to read about any new features.
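To see which version you are running before consulting the NEWS file, one option is to ask gawk directly. This is only a sketch and assumes gawk is on your PATH. PROCINFO["version"] is a gawk-specific array element, so the second command is not portable to other awk implementations.

```shell
if command -v gawk >/dev/null 2>&1; then
    # The first line of the --version output names the release.
    gawk --version | head -n 1

    # The same information is available inside a program (gawk only).
    gawk 'BEGIN { print PROCINFO["version"] }'
else
    echo "gawk is not installed" >&2
fi
```

Comparing the reported version with the one this book describes tells you which NEWS entries are worth reading.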
Using Code Examples
This book is here to help you get your job done.
Most of the example programs in this book come in the gawk distribution and
are marked in the files as being in the public domain. So, in general, you
may use the code in this book in your programs and documentation.
Incorporating a significant amount of prose or
example code from this book into your product’s documentation requires
compliance with the GNU FDL.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Effective awk Programming, Fourth Edition, by Arnold Robbins (O’Reilly). Copyright 2015 Free Software Foundation, 978-1-491-90461-9.”
If you feel your use of code examples falls outside fair use or the permission given here, feel free to contact us at permissions@oreilly.com.
Safari® Books Online
Note
Safari Books Online (www.safaribooksonline.com) is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/effective-awk-programming-4e.
To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
The initial draft of The GAWK Manual had the following acknowledgments:
Many people need to be thanked for their assistance in producing this manual. Jay Fenlason contributed many ideas and sample programs. Richard Mlynarik and Robert Chassell gave helpful comments on drafts of this manual. The paper A Supplemental Document for awk by John W. Pierce of the Chemistry Department at UC San Diego, pinpointed several issues relevant both to awk implementation and to this manual, that would otherwise have escaped us.
I would like to acknowledge Richard M. Stallman, for his vision of a better world and for his courage in founding the FSF and starting the GNU Project.
The previous edition of this book had the following acknowledgments:
The following people (in alphabetical order) provided helpful comments on various versions of this book: Rick Adams, Dr. Nelson H.F. Beebe, Karl Berry, Dr. Michael Brennan, Rich Burridge, Claire Cloutier, Diane Close, Scott Deifik, Christopher (“Topher”) Eliot, Jeffrey Friedl, Dr. Darrel Hankerson, Michal Jaegermann, Dr. Richard J. LeBlanc, Michael Lijewski, Pat Rankin, Miriam Robbins, Mary Sheehan, and Chuck Toporek.
Robert J. Chassell provided much valuable advice on the use of Texinfo. He also deserves special thanks for convincing me not to title this book How to Gawk Politely. Karl Berry helped significantly with the TeX part of Texinfo.
I would like to thank Marshall and Elaine Hartholz of Seattle and Dr. Bert and Rita Schreiber of Detroit for large amounts of quiet vacation time in their homes, which allowed me to make significant progress on this book and on gawk itself.
Phil Hughes of SSC contributed in a very important way by loaning me his laptop GNU/Linux system, not once, but twice, which allowed me to do a lot of work while away from home.
David Trueman deserves special credit; he has done a yeoman job of evolving gawk so that it performs well and without bugs. Although he is no longer involved with gawk, working with him on this project was a significant pleasure.
The intrepid members of the GNITS mailing list, and most notably Ulrich Drepper, provided invaluable help and feedback for the design of the internationalization features.
Chuck Toporek, Mary Sheehan, and Claire Cloutier of O’Reilly & Associates contributed significant editorial help for this book for the 3.1 release of gawk.
Dr. Nelson Beebe, Andreas Buening, Dr. Manuel Collado, Antonio Colombo,
Stephen Davies, Scott Deifik, Akim Demaille, Darrel Hankerson, Michal
Jaegermann, Jürgen Kahrs, Stepan Kasal, John Malmberg, Dave Pitts, Chet
Ramey, Pat Rankin, Andrew Schorr, Corinna Vinschen, and Eli Zaretskii (in
alphabetical order) make up the current gawk “crack portability team.”
Without their hard work and help, gawk would not be nearly the robust,
portable program it is today. It has been and continues to be a pleasure
working with this team of fine people.
Notable code and documentation contributions were made by a number of people. See Major Contributors to gawk for the full list.
Thanks to Andy Oram of O’Reilly Media for initiating the fourth edition and for his support during the work. Thanks to Jasmine Kwityn for her copyediting work.
Thanks to Michael Brennan for the Forewords.
Thanks to Patrice Dumas for the new makeinfo program. Thanks to Karl Berry,
who continues to work to keep the Texinfo markup language sane.
Robert P.J. Day, Michael Brennan, and Brian Kernighan kindly acted as reviewers for the 2015 edition of this book. Their feedback helped improve the final work.
I would also like to thank Brian Kernighan for his invaluable assistance
during the testing and debugging of gawk, and for his ongoing help and
advice in clarifying numerous points about the language. We could not have
done nearly as good a job on either gawk or its documentation without his
help.
Brian is in a class by himself as a programmer and technical author. I have to thank him (yet again) for his ongoing friendship and for being a role model to me for close to 30 years! Having him as a reviewer is an exciting privilege. It has also been extremely humbling...
I must thank my wonderful wife, Miriam, for her patience through the many versions of this project, for her proofreading, and for sharing me with the computer. I would like to thank my parents for their love, and for the grace with which they raised and educated me. Finally, I also must acknowledge my gratitude to G-d, for the many opportunities He has sent my way, as well as for the gifts He has given me with which to take advantage of those opportunities.
[2] These utilities are available on POSIX-compliant systems, as well as on traditional Unix-based systems. If you are using some other operating system, you still need to be familiar with the ideas of I/O redirection and pipes.
[3] Some other, obsolete systems to which gawk was once ported are no longer supported and the code for those systems has been removed.
[4] Only Solaris systems still use an old awk for the default awk utility. A more modern awk lives in /usr/xpg6/bin on these systems.
[5] GNU stands for “GNU’s Not Unix.”