Chapter 1. Getting Started

Like the rest of the Internet, the Common Gateway Interface, or CGI, has come a very long way in a very short time. Just a handful of years ago, CGI scripts were more of a novelty than a practical tool; they were associated with hit counters and guestbooks, and were written largely by hobbyists. Today, CGI scripts, written by professional web developers, provide the logic to power much of the vast structure the Internet has become.

History

Despite the attention it now receives, the Internet is not new. In fact, the precursor to today’s Internet began thirty years ago. The Internet began its existence as the ARPAnet, which was funded by the United States Department of Defense to study networking. The Internet grew gradually during its first 25 years, and then suddenly blossomed.

The Internet has always contained a variety of protocols for exchanging information, but when web browsers such as NCSA Mosaic and, later, Netscape Navigator appeared, they spurred an explosive growth. In the last six years, the number of web hosts alone has grown from under a thousand to more than ten million. Now, when people hear the term Internet, most think of the Web. Other protocols, such as those for email, FTP, chat, and news, certainly remain popular, but they have become secondary to the Web, as more people are using web sites as their gateway to access these other services.

The Web was by no means the first technology available for publishing and exchanging information, but there was something different about the Web that prompted its explosive growth. We’d love to tell you that CGI was the sole factor for the Web’s early growth over protocols like FTP and Gopher. But that wouldn’t be true. Probably the real reason the Web gained popularity initially was because it came with pictures. The Web was designed to present multiple forms of media: browsers supported inlined images almost from the start, and HTML supported rudimentary layout control that made information easier to present and read. This control continued to increase as Netscape added support for new extensions to HTML with each successive release of the browser.

Thus initially, the Web grew into a collection of personal home pages and assorted web sites containing a variety of miscellaneous information. However, no one really knew what to do with it, especially businesses. In 1995, a common refrain in corporations was “Sure the Internet is great, but how many people have actually made money online?” How quickly things change.

How CGI Is Used Today

Today, e-commerce has taken off and dot-com startups are appearing everywhere. Several technologies have been fundamental to this progress, and CGI is certainly one of the most important. CGI allows the Web to do things, to be more than a collection of static resources. A static resource is something that does not change from request to request, such as an HTML file or a graphic. A dynamic resource is one that contains information that may vary with each request, depending on any number of conditions including a changing data source (like a database), the identity of the user, or input from the user. By supporting dynamic content, CGI allows web servers to provide online applications that users from around the world on various platforms can all access via a standard client: a web browser.

It is difficult to enumerate all that CGI can do, because it does so much. If you perform a search on a web site, a CGI application is probably processing your information. If you fill out a registration form on the Web, a CGI application is probably processing your information. If you make an online purchase, a CGI application is probably validating your credit card and logging the transaction. If you view a chart online that dynamically displays information graphically, chances are that a CGI application created that chart. Of course, over the last few years other technologies have appeared to handle dynamic tasks like these; we’ll look at some of those in a moment. However, CGI remains the most popular way to do these tasks and more.

Introduction to CGI

CGI can do so much because it is so simple. CGI is a very lightweight interface; it is essentially the minimum that the web server needs to provide in order to allow external processes to create web pages. Typically, when a web server gets a request for a static web page, the web server finds the corresponding HTML file on its filesystem. When a web server gets a request for a CGI script, the web server executes the CGI script as another process (i.e., a separate application); the server passes this process some parameters and collects its output, which it then returns to the client just as if it had been fetched from a static file (see Figure 1.1).


Figure 1-1. How a CGI application is executed

So how does the whole interface work? We’ll spend the remainder of the book answering this question in more detail, but let’s take a basic look now.

Web browsers request dynamic resources such as CGI scripts the same way they request any other resource on the Web: they send a message formatted according to the Hypertext Transfer Protocol, or HTTP. We’ll discuss HTTP in Chapter 2. An HTTP request includes a Uniform Resource Locator, or URL, and by looking at the URL, the web server determines which resource to return. Typically, CGI scripts share a common directory, like /cgi, or a filename extension, like .cgi. If the web server recognizes that the request is for a CGI script, it executes the script.

Say you wanted to visit the URL http://www.mikesmechanics.com/cgi/welcome.cgi. Example 1.1 shows a sample HTTP request, in its most basic form, that your web browser might send.

Example 1-1. Sample HTTP Request

GET /cgi/welcome.cgi HTTP/1.1
Host: www.mikesmechanics.com

This GET request identifies the resource to retrieve as /cgi/welcome.cgi. Assuming our server recognizes all files in the /cgi directory tree as CGI scripts, it understands that it should execute the welcome.cgi script instead of returning its contents directly to the browser.

CGI programs get their input from standard input (STDIN) and environment variables. These variables contain information such as the identity of the remote host and user, the value of form elements submitted (if any), etc. They also store the server name, the communication protocol, and the name of the software running the server. We’ll look at each one of these in more detail in Chapter 3.
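As a rough sketch of what this looks like from the script's side, the following Perl reads both sources of input. The helper name read_cgi_input is ours and purely illustrative, and the variable names REQUEST_METHOD, QUERY_STRING, and CONTENT_LENGTH come from the CGI specification rather than from anything shown so far in this chapter; treat it as an assumption-laden illustration, not as the way the later chapters will handle input.

```perl
use strict;

# read_cgi_input is a hypothetical helper, not part of any module.
# $env is a hash reference of environment variables (normally \%ENV);
# $in is a filehandle to read the request body from (normally \*STDIN).
sub read_cgi_input {
    my ($env, $in) = @_;

    my $method = $env->{REQUEST_METHOD} || 'GET';
    my $query  = defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : '';
    my $length = $env->{CONTENT_LENGTH} || 0;
    my $body   = '';

    # With POST, the submitted data arrives on standard input, and the
    # server reports its size in CONTENT_LENGTH; read at most that many bytes.
    if ($method eq 'POST' and $length > 0) {
        read($in, $body, $length);
    }

    return { method => $method, query => $query, body => $body };
}

my $input = read_cgi_input(\%ENV, \*STDIN);
print "Request method: $input->{method}\n";
```

Passing the environment and the filehandle as arguments, instead of reading %ENV and STDIN directly inside the helper, is simply a convenience that makes the sketch easy to try from the command line.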

Once the CGI program starts running, it sends its output back to the web server via standard output (STDOUT). In Perl, this is easy to do because by default, anything you print goes to STDOUT. CGI scripts can either return their own output as a new document or provide a new URL to forward the request elsewhere. To tell the web server which of these it is doing, a CGI script first prints a special line formatted as an HTTP header. We’ll look at these headers in the next chapter, but here is a sample of what a CGI script returning HTML would output:

Content-type: text/html

CGI scripts actually can return extra header lines if they choose, so to indicate that it has finished sending headers, a CGI script prints a blank line. Finally, if it is outputting a document, it prints the contents of that document, too.
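The two kinds of output described above, a new document or a new URL, can be sketched in Perl as follows. The function names are hypothetical, and the Location header used for forwarding is taken from the CGI specification; it is covered properly in the next chapter, so consider this only a preview under those assumptions.

```perl
use strict;

# Both helpers are hypothetical; they only build the text that a CGI
# script would print to STDOUT.

# A new document: one header line, a blank line to end the headers,
# then the contents of the document itself.
sub document_response {
    my ($type, $body) = @_;
    return "Content-type: $type\n\n" . $body;
}

# A forwarded request: a Location header, a blank line, and no body.
sub redirect_response {
    my ($url) = @_;
    return "Location: $url\n\n";
}

print document_response('text/html', "<HTML><BODY><P>Hello</P></BODY></HTML>\n");
```

In both cases the blank line (the second newline) is what tells the web server that the headers are finished.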

The web server takes the output of the CGI script and adds its own HTTP headers before sending it back to the browser of the user who requested it. Example 1.2 shows a sample response that a web browser would receive from the web server.

Example 1-2. Sample HTTP Response

HTTP/1.1 200 OK
Date: Sat, 18 Mar 2000 20:35:35 GMT
Server: Apache/1.3.9 (Unix)
Last-Modified: Wed, 20 May 1998 14:59:42 GMT
ETag: "74916-656-3562efde"
Content-Length: 2000
Content-Type: text/html

<HTML>
<HEAD>
  <TITLE>Welcome to Mike's Mechanics Database</TITLE>
</HEAD>

<BODY BGCOLOR="#ffffff">
  <IMG SRC="/images/mike.jpg" ALT="Mike's Mechanics">
  <P>Welcome from dyn34.my-isp.net! What will you find here? You'll
    find a list of mechanics from around the country and the type of
    service to expect -- based on user input and suggestions.</P>
  <P>What are you waiting for? Click <A HREF="/cgi/list.cgi">here</A>
    to continue.</P>
  <HR>
  <P>The current time on this server is: Sat Mar 18 10:28:00 2000.</P>
  <P>If you find any problems with this site or have any suggestions,
    please email <A HREF="mailto:webmaster@mikesmechanics.com">
    webmaster@mikesmechanics.com</A>.</P>
</BODY>
</HTML>

The header contains the communication protocol, the date and time of the response, the server name and version, the last time the document was modified, an entity tag used for caching, the length of the response, and the media type of the document—in this case, a text document formatted with HTML. Headers like these are returned with all responses from web servers, and we’ll look at HTTP headers in more detail in the next chapter. However, note that nothing here indicates to the browser whether this response came from the contents of a static HTML file or whether it was generated dynamically by a CGI script. This is as it should be; the browser asked the web server for a resource, and it received a resource. It doesn’t care where the document came from or how the web server generated it.

CGI allows you to generate output that doesn’t look any different to the end user than other responses on the Web. This flexibility allows you to generate anything with a CGI script that the web server could get from a file, including HTML documents, plain text documents, PDF files, or even images like PNGs or GIFs. We’ll look at how to create dynamic images in Chapter 13.

Sample CGI

Let’s look at a sample CGI application, written in Perl, that creates the dynamic output we just saw in Example 1.2. This program, shown in Example 1.3, determines where the user is connecting from and then creates a simple HTML document containing this information, along with the current time. In the next several chapters, we’ll see how to use various CGI modules to make creating such an application even easier; for now, however, we will keep it straightforward.

Example 1-3. welcome.cgi

#!/usr/bin/perl -wT

use strict;

my $time        = localtime;
my $remote_id   = $ENV{REMOTE_HOST} || $ENV{REMOTE_ADDR};
my $admin_email = $ENV{SERVER_ADMIN};

print "Content-type: text/html\n\n";

print <<END_OF_PAGE;
<HTML>
<HEAD>
  <TITLE>Welcome to Mike's Mechanics Database</TITLE>
</HEAD>

<BODY BGCOLOR="#ffffff">
  <IMG SRC="/images/mike.jpg" ALT="Mike's Mechanics">
  <P>Welcome from $remote_id! What will you find here? You'll
    find a list of mechanics from around the country and the type of
    service to expect -- based on user input and suggestions.</P>
  <P>What are you waiting for? Click <A HREF="/cgi/list.cgi">here</A>
    to continue.</P>
  <HR>
  <P>The current time on this server is: $time.</P>
  <P>If you find any problems with this site or have any suggestions,
    please email <A HREF="mailto:$admin_email">$admin_email</A>.</P>
</BODY>
</HTML>
END_OF_PAGE

This program is quite simple. It contains only six commands, although the last one is many lines long. Let’s take a look at how it works. Because this script is our first and is short, we’ll look at it line by line; but as mentioned in the Preface, this book does assume that you are already familiar with Perl. So if you do not know Perl well or if your Perl is a little rusty, you may want to have a Perl reference available to consult as you read this book. We recommend Programming Perl, Third Edition, by Larry Wall, Tom Christiansen, and Jon Orwant (O’Reilly & Associates, Inc.); not only is it the standard Perl tome, but it also has a convenient alphabetical description of Perl’s built-in functions.

The first line of the program looks like the top of most Perl scripts. It tells the server to use the program at /usr/bin/perl to interpret and execute this script. You may not recognize the flags, however: the -wT flags tell Perl to turn on warnings and taint checking. Warnings help locate subtle problems that may not generate syntax errors; enabling this is optional, but it is a very helpful feature. Taint checking should not be considered optional: unless you like living dangerously, you should enable this feature with all of your CGI scripts. We will discuss taint checking more in Chapter 8.

The command use strict tells Perl to enable strict rules for variables, subroutines, and references. If you haven’t used this command before, you should get into the habit of using it with your CGI scripts. Like warnings, it helps locate subtle mistakes, such as typos, that might not otherwise generate a syntax error. Furthermore, the strict pragma encourages good programming practices by forcing you to declare variables and reduce the number of global variables. This produces code that is more maintainable. Finally, as we will see in Chapter 17, the strict pragma is essentially required by FastCGI and mod_perl. If you think you might migrate to either of these technologies in the future, you should begin using strict now.
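To see the kind of mistake the strict pragma catches, consider the small experiment below. It compiles a snippet at run time with a string eval so that the failure can be captured instead of stopping the program; the helper compile_error and the misspelled variable name are our own inventions for illustration only.

```perl
use strict;

# compile_error is a hypothetical helper: it compiles and runs a snippet
# with a string eval and returns the error message, or '' on success.
sub compile_error {
    my ($code) = @_;
    my $ok = eval "$code; 1";
    return $ok ? '' : $@;
}

# $admin_email is declared, but the second statement misspells it.
# The strict pragma above also applies to code compiled by the eval.
my $snippet = 'my $admin_email = "webmaster"; my $link = "mailto:$admin_emial"';

my $error = compile_error($snippet);
print "strict caught the typo:\n$error" if $error;
```

Without strict, the misspelled name would quietly be treated as a new, empty global variable, and the page would be generated with a broken link.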

Now we start the real work. First, we set three variables. The first variable, $time, is set to a string representing the current date and time. The second variable, $remote_id, is set to the identity of the remote machine requesting this page, and we get this information from the environment variables REMOTE_HOST or REMOTE_ADDR. As we mentioned earlier, CGI scripts get all of their information from the web server from environment variables and STDIN. REMOTE_HOST contains the full domain name of the remote machine, but only if reverse domain name lookups have been enabled for the web server—otherwise, it is blank. In this case, we use REMOTE_ADDR instead, which contains the IP address of the remote machine. The final variable, $admin_email, is set to SERVER_ADMIN, which contains the email address of the server’s administrator according to the server’s configuration files. These are just a few environment variables available to CGI scripts. We’ll review these three in more detail along with the rest in Chapter 3.

As we saw earlier, if a CGI script wants to return a new document, it must first output an HTTP header declaring the type of document it is returning. It does this and prints an additional blank line to indicate that it has finished sending headers. It then prints the body of the document.

Instead of using a print statement to send each line to standard output separately, we use a “here” document, which allows us to print a block of text at once. This is a standard Perl feature that’s admittedly a little esoteric; you may not be familiar with this if you have not done other forms of shell programming. This command tells Perl to print all of the following lines until it encounters the END_OF_PAGE token on its own line. It treats the text as if it were enclosed in double quotes, so the variables are evaluated, but double quotes do not need to be escaped. Not only do “here” documents save us from a lot of extra typing, but they also make the program easier to read. However, there are even better ways of outputting HTML, as we’ll see in Chapter 5 and Chapter 6.
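If “here” documents are new to you, this short sketch contrasts the interpolating form used in our script with the single-quoted form, which takes the text literally. The END_OF_TEXT token and the variable names are arbitrary choices made for this illustration.

```perl
use strict;

my $time = localtime;

# Unquoted terminator: the text behaves like a double-quoted string, so
# $time is interpolated and the literal double quotes need no escaping.
my $interpolated = <<END_OF_TEXT;
<P ALIGN="center">The current time is: $time.</P>
END_OF_TEXT

# Single-quoted terminator: the text is taken literally, so the
# characters "$time" appear in the output unchanged.
my $literal = <<'END_OF_TEXT';
<P ALIGN="center">The current time is: $time.</P>
END_OF_TEXT

print $interpolated, $literal;
```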

That’s all there is to our script, so at this point it exits; the web server adds additional HTTP headers and returns the response to the client as we saw in Example 1.2. This was just a simple example of a CGI script, and don’t worry if you have questions or are unsure about a particular detail. As our numerous references to later chapters indicate, we’ll spend the rest of the book filling in the details.

Invoking CGI Scripts

CGI scripts have their own URLs, just like HTML documents and other resources on the Web. The server is typically configured to map a particular virtual directory (a directory contained within a URL) to CGI scripts, such as /cgi-bin, /cgi, /scripts, etc. Generally, both the location for CGI scripts on the server’s filesystem and the corresponding URL path can be overridden in the server’s configuration. We will see how to do this for the Apache web server a little later in Section 1.4.1.

On Unix, the filesystem differentiates between files that are executable and those that are not. CGI scripts must be executable. Assuming you have a Perl file that you have named my_script.cgi, you would issue the following command from the shell to make the file executable:

chmod 0755 my_script.cgi

Forgetting this step is a common problem. On other operating systems, you may have to change other settings to allow scripts to run. Refer to the documentation for your web server.
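If you would rather check for this from Perl, for instance in a small installation script, something along these lines should work on Unix. The helper name ensure_executable is hypothetical; it relies only on Perl's built-in -x file test and chmod function.

```perl
use strict;

# ensure_executable is a hypothetical helper. It returns true if the file
# is already executable or was made executable, and false otherwise.
sub ensure_executable {
    my ($path) = @_;

    return 1 if -x $path;

    # chmod returns the number of files it changed successfully.
    return chmod(0755, $path) == 1 ? 1 : 0;
}

my $script = 'my_script.cgi';
if (-e $script) {
    ensure_executable($script)
        or warn "Could not make $script executable: $!\n";
}
```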

Alternative Technologies

As its title suggests, this book focuses on CGI programs written in Perl. Because Perl and CGI are so often used together, some people are unclear about the distinction. Perl is a programming language, and CGI is an interface that a program uses to handle requests from a web server. There are alternatives both to CGI and to Perl: there are new alternatives to CGI for handling dynamic requests, and CGI applications can be written in a variety of languages.

Why Perl?

Although CGI applications can be written in almost any language, Perl and CGI scripting have become synonymous to many programmers. As Hassan Schroeder, Sun’s first webmaster, said in his oft-quoted statement, “Perl is the duct tape of the Internet.” Perl is by far the most widely used language for CGI programming, and for many good reasons:

  • Perl is easy to learn because it resembles other popular languages (such as C), because it is forgiving, and because when an error occurs it provides specific and detailed error messages to help you locate the problem quickly.

  • Perl allows rapid development because it is interpreted; the source code does not need to be compiled before execution.

  • Perl is easily portable and available on many platforms.

  • Perl contains extremely powerful string manipulation operators, with regular expression matching and substitution built right into the language.

  • Perl handles and manipulates binary data just as easily as it handles text.

  • Perl does not require strict variable types; numbers, strings, and booleans are simply scalars.

  • Perl interfaces with external applications very easily and provides its own filesystem functions.

  • There are countless open source modules for Perl available on CPAN, ranging from modules for creating dynamic graphics to interfacing with Internet servers and database engines. For more information on CPAN, refer to Appendix B.

Furthermore, Perl is fast. Perl isn’t strictly an interpreted language. When Perl reads a source file, it actually compiles the source into low-level opcodes and then executes them. You do not generally see compilation and execution in Perl as separate steps because they typically occur together: Perl launches, reads a source file, compiles it, runs it, and exits. This process is repeated each time a Perl script is executed, including each time a CGI script is executed. Because Perl is so efficient, however, this process occurs fast enough to handle requests for all but the most heavily trafficked web sites. Note that this is considerably less efficient on Windows systems than on Unix systems because of the additional overhead that creating a new process on Windows entails.

Alternatives to CGI

Several alternatives to CGI have appeared in recent years. They all build upon CGI’s legacy and provide their own approaches to the same underlying goal: responding to queries and presenting dynamic content via HTTP. Most of them also attempt to avoid the main drawback to CGI scripts: creating a separate process to execute the script every time it is requested. Others also try to make less of a distinction between HTML pages and code by moving code into HTML pages. We’ll discuss the theories behind this approach in Chapter 6. Here is a list of some of the major alternatives to CGI:

ASP

Active Server Pages, or ASP, was created by Microsoft for its web server, but it is now available for many servers. The ASP engine is integrated into the web server so it does not require an additional process. It allows programmers to mix code within HTML pages instead of writing separate programs. As we’ll see in Chapter 6, there are modules available that allow us to do similar things using CGI. ASP supports multiple languages; the most popular is Visual Basic, but JavaScript is also supported, and ActiveState offers a version of Perl that can be used on Windows with ASP. There is also a Perl module, Apache::ASP, that supports ASP with mod_perl.

PHP

PHP is a programming language that is similar to Perl, and its interpreter is embedded within the web server. PHP supports embedded code within HTML pages. PHP is supported by the Apache web server.

ColdFusion

Allaire’s ColdFusion creates more of a distinction than PHP between code pages and HTML pages. HTML pages can include additional tags that call ColdFusion functions. A number of standard functions are available with ColdFusion, and developers can create their own controls as extensions. ColdFusion was originally written for Windows, but versions for various Unix platforms are now available as well. The ColdFusion interpreter is integrated into the web server.

Java servlets

Java servlets were created by Sun. Servlets are similar to CGI scripts in that they are code that creates documents. However, servlets, because they use Java, must be compiled as classes before they are run, and servlets are dynamically loaded as classes by the web server when they are run. The interface is quite different than CGI. JavaServer Pages, or JSP, is another technology that allows developers to embed Java in web pages, much like ASP.

FastCGI

FastCGI maintains one or more instances of perl that it runs continuously along with an interface that allows dynamic requests to be passed from the web server to these instances. It avoids the biggest drawback to CGI, which is creating a new process for each request, while still remaining largely compatible with CGI. FastCGI is available for a variety of web servers. We’ll discuss FastCGI further in Chapter 17.

mod_perl

mod_perl is a module for the Apache web server that also avoids creating separate instances of perl for each CGI. Instead of maintaining a separate instance of perl like FastCGI, mod_perl embeds the perl interpreter inside the web server. This gives it a performance advantage and also gives Perl code written for mod_perl access to Apache’s internals. We’ll discuss mod_perl further in Chapter 17.

Despite a proliferation of these competing technologies, CGI continues to be the most popular method for delivering dynamic pages, and, despite what the marketing literature for some of its competitors may claim, CGI will not go away any time soon. Even if you do imagine that you may begin using other technologies down the road, learning CGI is a valuable investment. Because CGI is such a thin interface, learning CGI teaches you how web transactions work at a basic level, which can only further your understanding of other technologies built upon this same foundation. Additionally, CGI is universal. Many alternative technologies require that you install a particular combination of technologies in addition to your web server in order to use them. CGI is supported by virtually every web server “right out of the box” and will continue to be that way far into the future.

Web Server Configuration

Before you can run CGI programs on your server, certain parameters in the server configuration files must be modified. Throughout this book, we will use the Apache web server on a Unix platform in our examples. Apache is by far the most popular web server available, plus it’s open source and available for free. Apache is derived from the NCSA web server, so many configuration details for it are similar to those for other web servers that are also derived from the NCSA server, such as those sold by iPlanet (formerly Netscape).

We assume that you already have access to a working web server, so we won’t cover how to install and initially configure Apache. That lengthy discussion would be well beyond the scope of this book, and that information is already available in another fine book, Apache: The Definitive Guide, by Ben and Peter Laurie (O’Reilly & Associates, Inc.).

Apache is not always installed in the same place on all systems. Throughout this book, we will use the default installation path, which places everything beneath /usr/local/apache. Apache’s subdirectories are:

$ cd /usr/local/apache
$ ls -F
bin/  cgi-bin/  conf/  htdocs/  icons/  include/  libexec/  logs/  man/  proxy/

Depending on how Apache was configured during installation, you may not have some directories, such as libexec or proxy; this is fine. With some popular Unix and Unix-compatible distributions that include Apache (e.g., some Linux distributions), the subdirectories above may be distributed across the system instead. For example, on RedHat Linux, the subdirectories are remapped, as shown in Table 1.1.

Table 1-1. Alternative Paths to Important Apache Directories

Default Installation Path      Alternative Path (RedHat Linux)
/usr/local/apache/cgi-bin      /home/httpd/cgi-bin
/usr/local/apache/htdocs       /home/httpd/html
/usr/local/apache/conf         /etc/httpd/conf
/usr/local/apache/logs         /var/log/httpd

If this is the case, you will need to translate our instructions to the paths on your system. If Apache is installed on your system, and its directories are not at either of these locations, then ask your system administrator or refer to your system documentation to locate them.

You configure Apache by modifying the configuration files found in the conf directory. These files contain directives that Apache reads when it starts. Older versions of Apache included three files: httpd.conf, srm.conf, and access.conf. However, using the latter two files was never required, and recent distributions of Apache include all of the directives in httpd.conf. This allows you to manage the full configuration in one location without bouncing between files. It also avoids situations where your configuration between files does not match, which can create security problems.

Many sites still use all three configuration files, if only because they have not bothered to combine them. Therefore, here and throughout the book, whenever we discuss Apache configuration, we will specify the alternative name of the file you need to edit if you are using all three files.

Finally, remember that Apache must be told to reread its configuration files whenever you make changes to them. You do not need to do a full server restart, although that also works. If your system has the apachectl command (part of the standard install), you can tell Apache to reread its configuration while it is running with this command:

$ apachectl graceful

This may require superuser (i.e., root) privileges.

Configuring CGI Scripts

Enabling CGI execution with Apache is very simple, although there is a good way to do it and a less good way to do it. Let’s start with the good way, which involves creating a special directory for our CGI scripts.

Configuring by directory

The ScriptAlias directive tells the web server to map a virtual path (the path in a URL) to a directory on the disk and execute any files it finds there as CGI scripts.

To enable CGI scripts for our web server, place this directive in httpd.conf:

ScriptAlias          /cgi        /usr/local/apache/cgi-bin

For example, if a user accesses the URL:

http://your_host.com/cgi/my_script.cgi

then the local program:

/usr/local/apache/cgi-bin/my_script.cgi

will be executed by the server. Note that the cgi path in the URL does not need to be the same as the name of the filesystem directory, cgi-bin. Whether you map the CGI directory to the virtual path called cgi, cgi-bin, or anything else for that matter, is strictly your own preference. You can also have multiple directories hold CGI scripts if you need that feature:

ScriptAlias          /cgi        /usr/local/apache/cgi-bin/
ScriptAlias          /cgi2       /usr/local/apache/alt-cgi-bin/
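To make the mapping concrete, here is a rough Perl illustration of how a URL path corresponds to a file on disk under the two directives above. It is emphatically not how Apache implements ScriptAlias; the table, the function name resolve_script, and the matching rule are our own simplifications for the sake of the example.

```perl
use strict;

# A hypothetical table mirroring the two ScriptAlias directives above.
my @script_aliases = (
    [ '/cgi',  '/usr/local/apache/cgi-bin'     ],
    [ '/cgi2', '/usr/local/apache/alt-cgi-bin' ],
);

# Return the filesystem path that a URL path would map to, or nothing
# (undef in scalar context) if no alias applies.
sub resolve_script {
    my ($url_path) = @_;

    for my $alias (@script_aliases) {
        my ($prefix, $dir) = @$alias;

        # Match the prefix only at a path boundary, so that /cgi does
        # not claim requests meant for /cgi2.
        if ($url_path =~ m{^\Q$prefix\E(/.*)$}) {
            return $dir . $1;
        }
    }
    return;
}

print resolve_script('/cgi/my_script.cgi'), "\n";
```

A request that matches neither prefix, such as /index.html, is not treated as a CGI script at all; the server looks for it beneath the document root instead.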

The directory that holds CGI scripts must be outside the server’s document root. In a standard Apache install, the document root maps to the htdocs directory. All files beneath this directory are browsable. By default, the cgi-bin directory is not beneath htdocs, so if we were to disable our ScriptAlias directive, for example, there would be no way to access the CGI scripts. There is a very good reason for this, and it is not simply to protect yourself from someone accidentally deleting the ScriptAlias directive.

Here is an example of why you should not place your CGI script directory within the document root. Say you do decide that you want to have multiple directories for CGI scripts throughout your web site within the document root. You might decide that it would be nice to have a directory for each of your major applications. Say that you have an online widget store that you put in /usr/local/apache/htdocs/widgets and the CGI script directory at /usr/local/apache/htdocs/widgets/cgi. You then add the following directive:

ScriptAlias     /widgets/cgi   /usr/local/apache/htdocs/widgets/cgi

If you were to do this and test it, it would work fine. However, suppose that your company later expands to sell woozles in addition to widgets, so the store needs a more general name. You rename the widgets directory to store, update the ScriptAlias directive, update all related HTML links, and create a symbolic link from widgets to store in order to support those users who bookmarked the old name. Sounds like a good plan, right?

Unfortunately, that last step, the symbolic link, just created a large security hole. The problem is that it is now possible to access your CGI scripts via two different URLs. For example, you may have a CGI script called purchase.cgi that can be accessed either of these two ways:

http://localhost/store/cgi/purchase.cgi
http://localhost/widgets/cgi/purchase.cgi

The first URL will be handled by the ScriptAlias directive; the second will not. If users attempt to access the second URL, instead of being greeted by a web page, they will be greeted with the source code of your CGI script. If you’re lucky, someone will send you an email notifying you of the problem. If you’re not, a mischievous user may start poking around your scripts to find security holes to break into your system to get at more valuable information (like database passwords or credit card numbers).

Any symbolic link above a directory containing CGI scripts allows this security hole.[1] The scenario about renaming a directory and providing a link to its old name is simply one example of a situation when this may occur innocently. If you place your CGI scripts outside of your server’s document root, you never have to worry about someone accidentally exposing your scripts this way.

You may wonder why revealing your source code is such a problem. CGI scripts have certain characteristics that make them quite different than other forms of executables from a security standpoint. They allow remote, anonymous users to run programs on your system. Thus, security should always be an important consideration, and your code must be flawless if you are willing to allow potential attackers to review your source code. Although security through obscurity is not good protection in and of itself, it certainly doesn’t hurt when combined with other forms of security. We will discuss security in much greater detail in Chapter 8.

Configuring by extension

The alternative to configuring CGI scripts via a common directory is to distribute them throughout your document tree and have your web server recognize them by their filename extension, such as .cgi. This is a very bad idea, from the standpoint of both architecture and security.

From an architectural standpoint, you should not do this because having a common directory for all of your CGI scripts helps you manage them. As web sites grow, it may be difficult to keep track of all of the CGI scripts that your site uses. Placing them under a common directory makes them easier to find and promotes creating CGI scripts that are general solutions to multiple problems instead of handfuls of single-use scripts. You can then create subdirectories beneath the main /cgi directory to organize your scripts.

There are two reasons why configuring CGI scripts by extension is insecure. First, it allows anyone who has permissions to update HTML files to create CGI scripts. As we said, CGI scripts require particular security considerations, and you should not allow novice programmers to create scripts on production web servers. Second, it increases the likelihood that someone can view the source code to your CGI scripts. Many text editors create backup files while you are editing a file; some of them create these files in the same directory where you are working. For example, if you edit a file called top_secret.cgi with emacs, the editor typically creates a backup file called top_secret.cgi~. If this second file makes it onto the production web server and someone with a lucky hunch requests that file, the web server will not recognize the extension and will simply return the raw source code.

Of course, your text editor ideally should delete these files when you finish working on them, and you really should not be editing files directly on a production web server. But files like this do get left around sometimes, and they might make it to the production web server. Files also get renamed manually sometimes. A developer may wish to make changes to a file but save a backup of it by making a copy and renaming it with a .bak extension. If a backup file is in a directory configured with ScriptAlias, it is not displayed; it is treated like any other CGI script and executed, which is a much safer alternative.
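
If scripts must stay inside the document tree for a while, one possible extra safeguard is to have Apache refuse requests for backup files altogether. This is not part of the setup described here; it is a minimal sketch that assumes Apache 2.4 access-control syntax and that your backup files end in ~ or .bak.

```apache
# Assumption: Apache 2.4 or later. Older versions express the same
# rule with "Order allow,deny" followed by "Deny from all".
# Refuse any request whose filename ends in "~" or ".bak".
<FilesMatch "(~|\.bak)$">
  Require all denied
</FilesMatch>
```

A rule like this only narrows the exposure; keeping backup files off the production server remains the better habit.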

So, if your web server happens to be configured to allow CGI scripts anywhere, here is how to fix it. The following line tells the web server to execute any file ending with a .cgi suffix:

AddHandler    cgi-script    .cgi

You can comment it out by preceding it with #, just like in Perl. Without this directive, Apache will treat .cgi files as unknown files and return them according to the default media type—typically plain text. So be sure that you move all of your CGI scripts outside the document root before you remove this directive.
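
With the directive disabled, the line would look something like this:

```apache
# Commented out: Apache no longer executes files based on the .cgi suffix.
# AddHandler    cgi-script    .cgi
```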

You may also turn off the CGI execute permissions for particular directories by disabling the ExecCGI option. The line to enable it looks like this:

<Directory "/usr/local/apache/htdocs">
  .
  .
  Options Indexes FollowSymLinks ExecCGI
  .
  .
</Directory>

There are probably many other lines above and below the Options directive, and the Options directive on your system may differ. If you remove ExecCGI, then even with the CGI handler directive above enabled, Apache will not execute CGI scripts in the location to which this Options directive applies (in this case, the document root, /usr/local/apache/htdocs). Users will instead get a 403 Forbidden error page telling them that they do not have permission to access the file.
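
Assuming the same Options line as in the example above, the section with CGI execution turned off might read as follows. The remaining options are simply carried over and may differ on your system.

```apache
<Directory "/usr/local/apache/htdocs">
  # ExecCGI has been removed, so scripts under this directory
  # are not executed even if the cgi-script handler is enabled.
  Options Indexes FollowSymLinks
</Directory>
```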

Now that we have our web server set up, and we have gotten a chance to see what CGI can do, we can investigate CGI in more detail. We start the next chapter by reviewing HTTP, the language of the Web and the foundation of CGI.



[1] It is possible to configure Apache to not follow symbolic links, which provides an alternative solution. However, symbolic links in general can be quite useful, and they are enabled by default. The problem in this situation is not with the symbolic link; it is with having the CGI scripts in a browsable location.
