Chapter 8

Six Common CGI Mistakes and How to Avoid Them




Chapter 7, "Extending HTML's Capabilities with CGI," showed you how to use CGI to increase the effectiveness of a site and gave a detailed example of how to write a CGI script. The rest of this book shows how to use CGI scripts to accomplish tasks such as setting up a chat area, providing a bulletin board, or operating an online "store." But none of these useful functions is possible if CGI isn't working.

The preceding chapter also pointed out some security risks associated with CGI. Because of these risks, some service providers have flatly refused to allow users to put their own CGI scripts on the server. But things are changing. Many Internet Service Providers (ISPs) have decided to allow CGI scripting in order to stay competitive. Some are using CGIWrap. Others hand-check scripts before allowing them on the server, and still others provide a cgi-bin directory for each user to increase accountability and maintain some semblance of control.

When Things Go Wrong

The net result of ISPs allowing CGI on their machines is that many ISPs who have heretofore had only a passing familiarity with CGI (and know only that "it's dangerous") are now enabling directories for CGI and helping programmers get their scripts set up. When problems occur, a round of finger-pointing starts, during which the programmer and the service provider blame each other for the fact that the script isn't working.

As more and more ISPs accommodate CGI scripts on their servers, the number of frustrated CGI installers increases. One script archive with thoroughly debugged, well-documented scripts and a good Frequently Asked Questions list (FAQ) still gets over 300 messages a day, most of them complaining, "I can't get your script to run."

Configuration Errors

This section describes what happens when the server is misconfigured. The next section describes how scripts fail. The final section describes the symptoms for each kind of failure and gives a fault-isolation procedure, which identifies the problem and shows how to fix it.

A Script in the Wrong Directory

When the server sees a GET request, it has no idea whether the entity requested is supposed to be a static file or a program. Suppose that an installer puts a Perl script somewhere in the tree of directories rooted at the server's document root. When the server finds the file, it recognizes the file as a text file and serves it up, as shown in Figure 8.1.

Figure 8.1: A Perl script "called" from a document directory.

The solution is to move the script from the document directory to the CGI directory. As the Webmaster, find out from the service provider the path to the CGI directory. Often it's called cgi-bin. On some machines there are cgi-bin directories set up for each virtual host.

Another way to locate the cgi-bin directory is to look in the server's srm.conf configuration file. Unless the Webmaster also happens to be the server maintainer, he or she won't be able to write to this file, but he or she can probably read it. The configuration files are located in different places depending upon the type of server and choice of the installer. On the NCSA server and its cousin, Apache, start at /usr/local/etc/httpd/conf. Remember, don't change anything in these files. If your service provider has given you write-access to them, it was probably by mistake. These files are the heart and soul of the server. Once installed, they should be changed by authorized maintainers only.

Once you are in the conf directory, enter the following line from the UNIX command prompt:

grep -i cgi *.conf

This line looks for all occurrences of the word cgi in the configuration files. The -i switch makes the search case-independent, so both CGI and cgi match. Ignore any files that end in conf-dist. Those are from the original distribution set and are not used at runtime. Here's a sample of what you might see:

access.conf:<Directory /usr/local/etc/httpd/cgi-bin>
srm.conf:ScriptAlias /cgi-bin/ /usr/local/etc/httpd/cgi-bin/
srm.conf:AddType application/x-httpd-cgi .cgi

The ScriptAlias directive in srm.conf tells you that files placed in /usr/local/etc/httpd/cgi-bin/ will appear at the URL /cgi-bin/ on your server. If you look in that directory, you might find a program called test-cgi. If so, point your browser at the URL /cgi-bin/test-cgi on your server. You should see the list of environment variables output by test-cgi.
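The URL-to-directory mapping can be pulled straight out of the configuration file. The following is a minimal sketch, assuming the NCSA-style syntax shown in the sample output; the function name list_script_aliases and the throwaway sample file are made up for illustration, and the match is case-sensitive.

```shell
# list_script_aliases FILE
# Print "URL-prefix -> directory" for every ScriptAlias line in FILE.
# Lines that are commented out with # are skipped.
list_script_aliases() {
    awk '$1 == "ScriptAlias" { print $2 " -> " $3 }' "$1"
}

# Demo on a throwaway file so the real configuration is never touched.
demo_dir=$(mktemp -d)
printf '%s\n' \
    '# sample lines only' \
    'ScriptAlias /cgi-bin/ /usr/local/etc/httpd/cgi-bin/' \
    'AddType application/x-httpd-cgi .cgi' > "$demo_dir/srm.conf"
list_script_aliases "$demo_dir/srm.conf"
rm -rf "$demo_dir"
```

On a real server you would pass the path to the live srm.conf instead of the sample file. The function only reads the file, so it is safe to run with ordinary user privileges.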

If there is more than one cgi-bin directory, you should be able to recognize one of them: The directory might have your server name or user ID in the path. Your service provider may have also set up a cgi directory for you inside the master cgi-bin directory. Change to the cgi-bin directory (for example, /usr/local/etc/httpd/cgi-bin/) and look at the contents. You might find a symbolic link xyz pointing to, say, /users/pages/xyz/cgi-bin. Make sure the directory is writeable by you. This directory is the place to put your CGI scripts.

Tip
In UNIX, you can make a tiny file that "points to" the real file. These pointers are called "symbolic links," "soft links," or "symlinks." To make a symbolic link in the current directory named myFile to a file in another directory, type:
ln -s /home/smith/aFile myFile
You can spot a symbolic link by doing an ls -l on the directory. Symbolic links have an 'l' in the initial position, and show the aliasing in the last field, like this:
lrwxrwxrwx 1 root system 29 Mar 19 19:40 wdb -> /home/mikem/wdb/wdb1.3a2/html

Directory Is Not Enabled for CGI

If you don't see a ScriptAlias directive that mentions CGI, it is possible your service provider is using access.conf or .htaccess to control where CGI scripts run. Check srm.conf for the following directive:

AddType application/x-httpd-cgi .cgi

This directive tells the server that if a request is made that ends in .cgi, interpret that request as a request to run the program. The request will be honored if the requested program is in a directory that has been enabled for CGI. It's possible that the file extension may not be .cgi. It's also possible that the service provider has added other file extensions, like .pl (for Perl scripts) and .sh (for shell scripts). In any case, this directive says that in order to run, the file name of the script must end in the prescribed file extension.

Look in access.conf for a <Directory ...> directive that mentions a directory at or above the root of your directory. For example, if your document directory is /users/pages/xyz/, the following directive includes you:

<Directory /users/pages>
.
.
 
</Directory>

Somewhere between the opening <Directory ...> and the closing </Directory>, find the Options directive and make sure it includes ExecCGI. It might say either Options ExecCGI or Options All. In either case, your document directory has been enabled for CGI.

If you can't find your directory covered by access.conf, look in your home directory for a file named .htaccess. Note that the leading dot makes the file "invisible." Use the following to see the file:

ls -a .htaccess

Tip
Most operating systems provide a way to make a file invisible or "hidden." In UNIX, a file is hidden if the first character in its name is a period. To see hidden files, request a directory listing with the -a option: ls -a. Hidden files are always visible to the root user.

See Chapter 17, "How to Keep Portions of the Site Private," for a full discussion of the .htaccess file. If you have a .htaccess file, check it to see if it contains the Options All or Options ExecCGI directive. If it does, then the directory that holds the .htaccess file is the place to put your scripts.

If you don't see the AddType directive in srm.conf, check again in your .htaccess. A server maintainer can use .htaccess to tell the server to treat the .cgi extension as "magic" only if the requested document is in the right directory.

Server configuration allows server maintainers to be very precise in expressing their wishes. With this power comes the ability to make a mistake and cause CGI scripts to fail.

See http://hoohoo.ncsa.uiuc.edu/docs/tutorials/cgi.html for a full discussion on configuring the NCSA server for CGI.

Look in access.conf for a <Directory ...> directive that mentions a directory at or above your cgi-bin directory. For example, if your cgi-bin directory is /usr/local/etc/httpd/cgi-bin/xyz/, the following directive includes you:

<Directory /usr/local/etc/httpd/cgi-bin>
.
.
.
 </Directory>

Somewhere between the opening <Directory ...> and the closing </Directory>, find the Options directive and make sure it includes ExecCGI. It may say either Options ExecCGI or Options All. In either case, your cgi-bin directory has been enabled for CGI.
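Scanning a long access.conf by eye is error-prone, so the search can be handed to awk. This is only a sketch under simple assumptions: each <Directory ...> tag sits on its own line, blocks are not nested, and the function name show_cgi_dirs is hypothetical. It reports every directory whose block has an Options line naming ExecCGI or All.

```shell
# show_cgi_dirs FILE
# Print the path of each <Directory ...> block in FILE whose Options
# line includes ExecCGI or All.
show_cgi_dirs() {
    awk '
        /^[ \t]*<Directory/ {            # opening tag: remember the path
            dir = $2
            sub(/>$/, "", dir)
            next
        }
        /^[ \t]*<\/Directory>/ {         # closing tag: leave the block
            dir = ""
            next
        }
        dir != "" && $1 == "Options" {   # Options line inside a block
            for (i = 2; i <= NF; i++)
                if ($i == "ExecCGI" || $i == "All") { print dir; break }
        }
    ' "$1"
}
```

If a directory at or above your own appears in the output, for example from show_cgi_dirs access.conf, CGI is enabled for you there. If nothing is printed, the .htaccess route described earlier is the next place to look.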

Understanding Script Errors

Once you find your cgi-bin directory, run a script from the browser to make sure everything is working. As mentioned earlier, test-cgi is commonly available and thoroughly debugged. It's a good place to start.

If test-cgi works but your scripts don't, it's time to roll up your sleeves and find out what's wrong.

Script Not Executable

If most of your experience has been on desktop computers like Macintoshes and PCs, you probably don't think about file permissions. After all, when you click a program in Windows, it runs. If you don't want someone running your programs, you don't let them on your machine.

UNIX File Permissions

UNIX is different. When UNIX was developed, no one had ever heard of a "personal computer." Computers were big and expensive, you had to share them with lots of other people, and you wished you had a computer of your own. To meet this need to share the computer's resources, the UNIX designers set up the file system to give three different kinds of access to three different groups of people, for nine permission settings in all. (Most newer versions of UNIX have more sophisticated mechanisms for access control, but they aren't relevant to most Web sites.)

The three levels of access are read, write, and execute. The three groups of people are the owner, the group, and others.

Take a look at a typical UNIX file. Enter

ls -l /etc/passwd

Despite its name, this is usually not where the encrypted passwords are stored. This name has historical significance only. A typical response to the above command is

-rw-rw-r--   1 root     security     389 Feb 16 16:25 /etc/passwd

The fields that control security are right up front. The first field (a dash) says that /etc/passwd is an ordinary file and not a directory, device or something else. The next three positions describe the permissions of the owner. The owner of /etc/passwd is root, the system superuser. The owner, in this case, has permission to read and write the file. The third position determines whether the owner can execute the file as a program. /etc/passwd isn't a program, so the execute bit is turned off.

The next three permission bits apply to the file's group. In this case, the group is "security," and members of that group can read or write but not execute. The third set of permissions is "others," sometimes called "the world." /etc/passwd is said to be world-readable, because anyone can read it, although not everyone can write or execute it.

Another way to read these permission bits is as three octal (base-8) numbers. In a given set of three bits, the one on the right has a value of one, the one in the middle has a value of two and the one on the left has a value of four. If all three bits are on, the number is 4+2+1=7. Seven is the highest number expressible in a single digit in base-8, just like nine is the highest number able to be expressed in a single digit in base-10 (the decimal system).

Now read those permission bits on /etc/passwd again. The first set is 4+2+0=6. The second set is the same. The third set is 4+0+0=4. So a UNIX expert will say that /etc/passwd has permission 664.
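The same arithmetic can be written down as a small shell function. This is an illustrative sketch only: the name perm_to_octal is made up, and it understands just the r, w, x, and - characters, not the special bits mentioned later.

```shell
# perm_to_octal STRING
# Convert nine permission characters such as "rw-rw-r--" to octal.
perm_to_octal() {
    perms=$1
    result=
    for start in 1 4 7; do
        # Pull out one group of three characters: owner, group, others.
        triple=$(printf '%s' "$perms" | cut -c"$start"-$((start + 2)))
        digit=0
        case $triple in r??) digit=$((digit + 4)) ;; esac   # read    = 4
        case $triple in ?w?) digit=$((digit + 2)) ;; esac   # write   = 2
        case $triple in ??x) digit=$((digit + 1)) ;; esac   # execute = 1
        result=$result$digit
    done
    printf '%s\n' "$result"
}

perm_to_octal rw-rw-r--     # the /etc/passwd example: prints 664
perm_to_octal rwxr-xr-x     # prints 755
```

Drop the leading file-type character from the ls -l output before passing the string in.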

Making UNIX Files Executable

Now issue the following commands:

cd    # to return to your home directory
touch foo.cgi  # to make an empty file named foo.cgi
ls -l foo.cgi

This last command shows the default permissions that you have. They are controlled by your umask, which was set up by the system administrator when your account was established. A typical value of the umask is 022. If your account is set up with a umask of 022, the ls command returns:

-rw-r--r-- .....      foo.cgi 

Note
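The UNIX umask gets its name from the fact that it inhibits or "masks" out permission bits that should be off by default. The mask is applied to the mode that the creating program asks for, which is usually 666 for ordinary files and 777 for directories and compiled programs. If you enter
umask 000
then no bits are inhibited, and everything created in the future gets exactly the mode that was asked for (666 for ordinary files, 777 for directories). If you enter
umask 777
you get the opposite effect. All bits are inhibited, and the new default permission is 000. Another value you may see is
umask 026
which gives directories and compiled programs default permission bits of 751 (the owner can do anything, the group can read and execute, and the rest of the world can just execute) and gives ordinary files 640.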
The system administrator will usually put a umask command in one of the files, like /etc/profile, that all users execute when they log in. You can set your own umask in your own .profile file (or in .cshrc if you use the C shell).
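You can watch the umask at work in a scratch directory. The sketch below assumes the usual behavior that touch asks for mode 666 and mkdir asks for mode 777, so the umask is subtracted from those starting points; the helper name mode_of is made up for illustration.

```shell
# mode_of PATH: print the ten-character mode string that ls -ld reports.
mode_of() {
    ls -ld "$1" | cut -c1-10
}

work=$(mktemp -d)
(
    # The subshell keeps the umask change from leaking into your session.
    cd "$work" || exit 1
    umask 022
    touch plain.txt      # created with 666 masked by 022
    mkdir subdir         # created with 777 masked by 022
    mode_of plain.txt
    mode_of subdir
)
rm -rf "$work"
```

With a umask of 022 the file comes out as rw-r--r-- (644). Notice that no umask value gives a file created this way an execute bit, which is why scripts need an explicit chmod.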

Now try to execute the script. Yes, it's an empty file, but that doesn't matter for now. Type

./foo.cgi

You will get a message that says Execute permission denied. That's not surprising. You saw earlier that the owner's execute permission bit was off. Now type

chmod +x foo.cgi
ls -l foo.cgi

The chmod command tells UNIX to set the execute bits.

Tip
The general syntax for chmod is
chmod new-mode file(s)
where new-mode may be expressed in "who, what, permissions" format. For example, to add execute access to the group permissions use g+x as the new mode. The full list is given in Table 8.1.

Table 8.1  chmod Symbolic Modes

Category      Character   Meaning
Who           u           User (owner) of the file
              g           Group
              o           Others
              a           All
What          -           Remove this permission
              +           Add this permission
              =           Set this permission exactly
Permissions   r           Read access
              w           Write access
              x           Execute access

There are several other permission bits which can be set with chmod, but the ones in Table 8.1 are those most commonly used on Web sites. See the man page for chmod for more details.

You can also combine symbolic mode entries, like this:

chmod a+x,g+r files

Many experienced users find it faster to specify the permission bits in octal notation. Such a user might type

chmod 751 files

to set the permission on a file to

rwxr-x--x.

Returning to foo.cgi, the ls -l command issued after chmod +x now shows:

-rwxr-xr-x ...............    foo.cgi

Now the file is executable to everyone (owner, group, and others). Just for fun, execute it:

./foo.cgi

Nothing happens (how much did you expect an empty file to do?), but there's no error message. The file is now executable.

Sometimes you will see file permissions given in instructions as octal numbers. Type

chmod 644 foo.cgi
ls -l foo.cgi

and see that the file permissions go back to 644 (owner = read (4) + write (2), group and others = read (4) only).

Now type

ps -ef | grep httpd

On UNIX systems derived from the Berkeley distribution, you will need to type ps -aux | grep httpd. The ps part of this command says to list all the running processes on the machine. The output of the ps -ef command can go on for several pages. The grep httpd part says to show only those lines that mention httpd (the name used for NCSA servers and their kin). On most machines, you'll see a half-dozen or more lines that look like this:

root   11092     1  0 17:06:17 - 0:01 /usr/local/etc/apache/src/httpd
nobody 12444 11092  0 17:09:54 - 0:00 /usr/local/etc/apache/src/httpd
nobody 14496 11092  0 17:09:54 - 0:00 /usr/local/etc/apache/src/httpd
nobody 15518 11092  0 17:09:54 - 0:00 /usr/local/etc/apache/src/httpd
nobody 16040 11092  0 17:09:54 - 0:00 /usr/local/etc/apache/src/httpd

These lines say that there are five copies of the server running. The first one (process ID 11092) was started by user root at 17:06:17. That copy started four others (its process ID appears in the Parent Process ID, which is the third column). If you had to use ps -aux the columns will be a bit different but in either case the column we're interested in is the first one. It says that the servers are running under the authority of user nobody. Not surprisingly, user nobody has almost no authority in the system. (Remember that these servers are going to be run by thousands of complete strangers. How much authority do you give a stranger?)

Note
A few service providers do not give users permission to Telnet into their account. If you are among those unfortunate few, you won't be able to run the exercise described in this section. But the file permissions discussion is still relevant to you. Make sure you are using a version of FTP that allows you to set the permission bits. Then, when you transfer the files into your cgi-bin directory, set the permissions to world-readable and world-executable (755), just as we described above.

To execute a CGI script, the server (running with the authority of nobody) must be able to execute it. With all of this background, you're ready to do just that. Change the directory to your cgi-bin directory. For example, type

cd /usr/local/etc/httpd/cgi-bin/xyz

and look at the permissions of one of your scripts:

ls -l myScript.cgi

If it is not world-executable, change it with:

chmod +x myScript.cgi

or, if you prefer

chmod 755 myScript.cgi

Verify that the script is world-readable and world-executable. If it's not, the server will tell you that you don't have permission to execute that script when you try to access it.

Although it is less frequently a problem, note that the directory that contains the scripts must also be world-readable and world-executable. To see the permissions on a directory, change the current directory to that directory (using the cd command) and type:

ls -ld .

If "others" bits on the permissions are not r-x, change them. If you do not have the authority to change them, contact your system administrator.
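Both checks, the script and its directory, come down to reading the last three permission characters. The following sketch wraps that in a function; the name world_rx is hypothetical, and the demo works on a scratch copy rather than on your real cgi-bin directory.

```shell
# world_rx PATH
# Succeed if the "others" permission bits on PATH include read and execute.
world_rx() {
    others=$(ls -ld "$1" | cut -c8-10)   # characters 8-10 are the "others" bits
    case $others in
        r?[xt]) return 0 ;;              # t is execute plus the sticky bit
        *)      return 1 ;;
    esac
}

demo=$(mktemp -d)
touch "$demo/myScript.cgi"
chmod 700 "$demo/myScript.cgi"
world_rx "$demo/myScript.cgi" || echo "needs chmod o+rx"
chmod 755 "$demo/myScript.cgi"
world_rx "$demo/myScript.cgi" && echo "world-readable and world-executable"
rm -rf "$demo"
```

In your own cgi-bin directory you would call world_rx . for the directory and world_rx myScript.cgi for the script.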

Tip
It is sometimes useful to be able to change the permissions of all, or nearly all, of the files in a directory tree. To change all of the files, use the chmod -R option (where -R stands for "recursive"). To change most of the files, build a set of tests for the find command, and use
find . tests -exec chmod new-mode {} \;
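As a concrete instance of the find pattern, the sketch below picks out regular files whose names end in .cgi and sets them to 755. The function name make_cgi_executable and the sample file names are made up; the tests (-type f and -name) are the part you would adapt.

```shell
# make_cgi_executable DIR
# Set mode 755 on every regular file under DIR whose name ends in .cgi,
# leaving other files and all directories alone.
make_cgi_executable() {
    find "$1" -type f -name '*.cgi' -exec chmod 755 {} \;
}

# Demo in a scratch tree.
tree=$(mktemp -d)
mkdir "$tree/sub"
touch "$tree/a.cgi" "$tree/sub/b.cgi" "$tree/notes.txt"
chmod 644 "$tree/a.cgi" "$tree/sub/b.cgi" "$tree/notes.txt"
make_cgi_executable "$tree"
ls -lR "$tree"
rm -rf "$tree"
```

Quoting the '*.cgi' pattern matters: without the quotes the shell would expand it before find ever saw it.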

The Script Won't Run

Other scripts run in your cgi-bin directory. Your script is world-readable and world-executable. But still when you run the script from the browser you get an error. Typically, the error informs you about a malformed header or returns the Internal Server Error error message.

Your server is not broken, but your script probably is. To obey HTTP, the first thing the script should send is "Content-type: text/html" followed by an empty line. In Perl, this is done like this:

print "Content-type: text/html\n\n";

To troubleshoot this problem, Telnet in to your account and run the script from the command line. In most cases you'll see a syntax error from Perl. Fix the Perl problem. Once the script runs, try it again from the browser. If it runs successfully from the command line but not from the browser, there is something wrong with the program logic; it is not sending the content-type line. Later in this chapter, the section "Checking by Hand" describes how to set up the environment variables and completely mimic the actions of your browser.
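When you run the script from the command line, you can pipe its output through a small filter that checks the header for you. This is a sketch, not a complete validator: the function name check_cgi_header is made up, and it checks only for the Content-type case discussed here, not for every header a script might legitimately send first.

```shell
# check_cgi_header
# Read a script's output on standard input and report whether it begins
# with a Content-type line followed by an empty line.
check_cgi_header() {
    awk '
        NR == 1 { sub(/\r$/, ""); first = tolower($0) }
        NR == 2 { sub(/\r$/, ""); second = $0 }
        NR > 2  { exit }                 # two lines are all we need
        END {
            if (first ~ /^content-type:/ && NR >= 2 && second == "") {
                print "header ok"
                exit 0
            }
            print "malformed header"
            exit 1
        }
    '
}

printf 'Content-type: text/html\n\n<HTML>\n' | check_cgi_header
```

A typical use is ./myScript.cgi | check_cgi_header. If Perl prints a syntax error instead of the header, the filter reports a malformed header, which is the same complaint the server would make.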

Remember to check the error log of the server. If the script runs but produces an error, that error is written to the file handle STDERR. The server redirects that output to the error log. You can find your error log by examining the configuration files or by asking the system administrator.

Tip
If the server has been configured with the default directories, the error log is at /usr/local/etc/httpd/logs/error_log.

The Script Can't Find Perl

Here's a mistake that's easy to make and tough to spot. To understand this problem we need to understand the first line of a Perl script.

When you say to UNIX

./foo.cgi

you are saying, "Look for the file foo.cgi in my current directory, and execute it." If foo.cgi is a compiled binary file, it is loaded into memory and run. If the file is a shell script, it is turned over to the current shell (a command interpreter) and run. But if it's a Perl script, UNIX has no way of knowing. It passes the file to the shell, which quickly responds that it can't make sense of these commands.

The solution comes from an arcane bit of UNIX lore. For a whimsical description of the story, see article 47.02 in UNIX Power Tools by Peek, O'Reilly and Loukides. For a more serious look, see the man page for execve(2). If you start the very first line of a text file with #!, most popular versions of UNIX will look on that line for the name of a program to run and, optionally, a string to pass to that program. To set a file to be run by Perl, first find out where Perl is installed by typing the following:

which perl

Expect a response like

/usr/bin/perl

or possibly

/usr/local/bin/perl

In fact, enter

ls -l /usr/bin/perl

to see how Perl has been installed. Don't be surprised if it is a symbolic link to /usr/local/bin/perl.

Now you know where Perl has been installed. On the very first line of your Perl script, starting with the very first character, type #! followed by the path to Perl. If the Perl installer took the defaults during installation, this line will be:

#!/usr/local/bin/perl

Be sure to type the line exactly as described. This line is read directly by the UNIX kernel, which is a most unforgiving reader.

A sure sign that the kernel is having a problem finding the Perl interpreter is when you run the program from the command line and it responds "not found." You can see the file in an ls listing, so you know it's there. You have specified ./myScript.cgi, so you know it's not a path problem. Look at the first line. The kernel is telling you that it tried to exec the interpreter you named on that line, but that interpreter wasn't where you said it was.
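The first-line check can be automated. The sketch below reads the #! line of a script and tests whether an executable really exists at the path it names. The function name check_shebang is hypothetical. It deliberately leaves any trailing carriage return in place, so a file with DOS line endings is reported as "not found," much as the kernel would treat it.

```shell
# check_shebang FILE
# Print the interpreter named on FILE's first line and say whether an
# executable exists at that path.
check_shebang() {
    first_line=$(head -n 1 "$1")
    case $first_line in
        '#!'*) ;;
        *) echo "no #! line"; return 1 ;;
    esac
    interp=${first_line#'#!'}       # drop the #! itself
    interp=${interp# }              # allow "#! /path" with one space
    interp=${interp%% *}            # drop any argument after the path
    if [ -x "$interp" ]; then
        echo "$interp: found"
    else
        echo "$interp: not found"
        return 1
    fi
}
```

Run it as check_shebang myScript.cgi. If the answer is "not found," compare the path on the first line with the output of which perl.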

Lines Are Terminated Incorrectly

Here's a tricky little problem that can become troublesome. Many users produce CGI scripts on their desktop machine (a Mac or PC), then use FTP to send the file to their server. Sometimes this process will work fine for weeks and then one day a script is transferred up to the server and fails in bizarre ways.

To understand why this problem occurs, it's necessary to understand how various operating systems terminate lines in a text file. In UNIX, the end-of-line character is a new line, also known as a linefeed. On a Mac, the end-of-line character is a carriage return. Under DOS and Windows, the end-of-line is denoted by a carriage return and a linefeed.

The FTP program supports several types of transfer. The two most common are ASCII (sometimes called text) and binary (also called image). In ASCII transfers, each line of the text is converted to a standard representation called NVT ASCII. NVT ASCII ends each line with a carriage return/linefeed. So if you send from a Macintosh to a UNIX machine, the sending FTP converts from the Mac standard to NVT ASCII and sends. The receiving machine reads the NVT ASCII and saves the file using the UNIX convention, linefeeds only. Similarly, if you send from a PC, the file is sent as NVT ASCII, and the UNIX box converts to its native format.

ASCII transfer is the default. But what if the Webmaster inadvertently sets the transfer type to binary? (On some versions of FTP, the program attempts to "discover" whether the file is text or binary and may guess wrong.) In binary mode, no conversions are made, so the lines end up on the UNIX machine just like they started on the desktop machine. The most immediate symptom will be that the file will "look funny" in most editors. It may appear to have blank lines between the text lines, or all of the text may be on one long line. The most serious symptom is that the program will fail to execute.

To check the end-of-line characters on a file named foo.cgi on the UNIX machine, type the following:

od -c foo.cgi | more

The first part of this command invokes a dump program named od and asks for the file to be interpreted as characters. The first few lines of typical output look like this:

0000000    #   !   /   u   s   r   /   l   o   c   a   l   /   b   i   n
0000020    /   p   e   r   l  \n  \n   #       n   a   m   e       o   f
0000040        f   i   l   e       w   h   i   c   h       c   o   n   t
0000060    a   i   n   s       t   h   e       o   r   d   e   r   e   d
0000100        l   i   s   t       o   f       p   a   g   e   s  \n   $
0000120    t   h   e   L   i   s   t   F   i   l   e       =       "   .
0000140    /   t   h   e   L   i   s   t   "   ;  \n  \n   #       n   a
0000160    m   e       o   f       s   t   r   i   n   g       w   h   i
0000200    c   h       n   a   m   e   s       t   h   e       P   r   e

Look closely at the characters at the end of each line. If the file is set up correctly for UNIX, they should be \n, which means newline in UNIX. If the lines are terminated with \r\n or just \r, the file won't run correctly.
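Reading a long od dump by eye gets tedious, so the same check can be done by counting characters. The function below is a rough heuristic, with the name eol_style made up for illustration: it counts carriage returns and linefeeds and guesses the convention from the two totals.

```shell
# eol_style FILE
# Print unix, dos, mac, or mixed according to the line terminators in FILE.
eol_style() {
    cr=$(tr -cd '\r' < "$1" | wc -c | tr -d ' ')   # number of carriage returns
    lf=$(tr -cd '\n' < "$1" | wc -c | tr -d ' ')   # number of linefeeds
    if   [ "$cr" -eq 0 ];     then echo unix
    elif [ "$lf" -eq 0 ];     then echo mac
    elif [ "$cr" -eq "$lf" ]; then echo dos
    else                           echo mixed
    fi
}

# Demo on three scratch files, one per convention.
demo=$(mktemp -d)
printf 'one\ntwo\n'     > "$demo/unix.cgi"
printf 'one\r\ntwo\r\n' > "$demo/dos.cgi"
printf 'one\rtwo\r'     > "$demo/mac.cgi"
for f in unix dos mac; do
    eol_style "$demo/$f.cgi"
done
rm -rf "$demo"
```

Anything other than unix means the file needs to be retransmitted or translated before it will run.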

The solution, of course, is to retransmit the file, this time making sure that FTP is set to ASCII transfer. A workaround is to use the UNIX command tr to translate the characters to their correct format.

If the file comes from a Macintosh (each line ends in a return) type the following:

tr "\r" "\n" < foo.cgi > out.cgi
mv out.cgi foo.cgi

The tr command translates the characters in the first string (a return) to the characters in the second string (a newline). The tr command reads from standard input and writes to standard output. Be careful not to name the output file the same as the input file or the file will be emptied. The second line moves the file from the temporary name we gave it back to its original name.

If the file comes from an MS-DOS computer then each line will end with a carriage return followed by a newline (\r\n). Because UNIX wants the newline, all you need to do is delete the return:

tr -d "\r" < foo.cgi > out.cgi
mv out.cgi foo.cgi
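One side effect of the tr-then-mv approach is worth knowing about: the output file is a brand-new file, so it gets your default permissions, and the script may lose its execute bit. The sketch below is one way around that for the MS-DOS case. It writes the translated text back into the original file, which keeps the existing permission bits. The function name to_unix_eol is hypothetical.

```shell
# to_unix_eol FILE
# Delete carriage returns from FILE in place. Copying the result back
# with cat, rather than replacing the file with mv, keeps the file's
# existing permission bits, so an executable script stays executable.
to_unix_eol() {
    tmp=$(mktemp) || return 1
    tr -d '\r' < "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return $status
}
```

If you use the mv form instead, follow it with chmod 755 foo.cgi. For a file from a Macintosh, the tr -d '\r' step would have to become tr '\r' '\n', because deleting the returns would join every line into one.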

An Explanation of the Error Codes

The error codes and messages that the server returns can be useful in identifying the cause of a problem. Experienced Webmasters learn to associate common error codes with certain problems. Remember that the error message for a given code may vary somewhat from server to server, based on the configuration set up by the local administrator.

What To Do About 400-Series Errors

Recall that the 400 series of errors mean that the server thinks the client has made a mistake.

401 Unauthorized

This message means that the file is protected (typically by a .htaccess file) and the user did not send the proper authorization. Most browsers respond to a 401 by displaying a dialog box that prompts the user for a username and password.

403 Forbidden

The most likely explanation for this error is that the file or directory permissions do not allow read- or execute-privileges by the server. If the server is running as a non-privileged user like nobody, the CGI files and directories must be set to world-readable and world-executable.

Another explanation is that the system administrator has not configured this directory for CGI. Check the earlier process for how to confirm that the server is properly configured.

404 Not Found

This message means what it sounds like. Either the script is not where you thought it was, or when the script ran it tried to access another file and it wasn't where you thought it was. If you're sure you're getting to the script, put

print "Content-type: text/html\n\n" ;

near the top of the script, load the script up with print statements so you can see how far it's getting, and find the reference to the URL that isn't there.

What To Do About 500-Series Errors

The 500-series of error codes means that the server thinks that it has made a mistake. The real culprit is almost always a script error.

500 Malformed Header

An error 500 means that the header did not start with the "Content-type" line required by HTTP. Here are some things to check:

If the error log or message mentions execve, it is almost certain that the kernel cannot find Perl. Check the first line again.

Tip
When debugging, if the script runs from the command line but fails when run from the browser, the problem is most likely in your environment variables (or in STDIN, if you are using POST). The script assumes something about the environment that isn't true, and it throws an error.
If this happens, temporarily switch the ACTION of your form to test-cgi and resubmit the form. test-cgi will report all the environment variables. Now use the output of test-cgi to compare the actual values of the environment variables with the assumptions made by the code.

An enhanced test-cgi is available from Chris Schanzle at http://speckle.ncsl.nist.gov/~chris/test-cgi. This version dumps STDIN if the method is POST.

501 Cannot POST to Non-script Area

The most frequent culprit in this instance is that the directory is not enabled for CGI (or that the script is in the wrong directory). If you try to GET a script in such a directory, you will get the source. With POST the server knows you are trying to run the script but it has no permission to run programs in that directory.

Remember that there are two ways for the system administrator to enable CGI. If your administrator has chosen to use the ExecCGI option (with the "magic" CGI type) your file names must conform to that naming convention. Usually the required extension is .cgi. If your script is named foo.pl, try renaming it to foo.cgi.

Checking by Hand

When a script fails to execute properly from the server, it is often necessary to "run it by hand," taking control from the browser (and sometimes from the server) in order to see the results of each step. This section shows three ways of doing this.

From the Command Line

When troubleshooting CGI scripts, experienced developers often tell neophytes to "run it from the command line." In saying this, the experienced developers mean they should use Telnet to log into their account on the server, change to their cgi-bin directory and type the name of the program. If their PATH variable is not set up to look for scripts in the current directory, they will need to preface the script name with "./" to tell the shell where the program is.

If the first line is set up to point to the Perl interpreter, Perl takes control and checks the syntax of the file. Because Perl checks the program at startup time, many kinds of errors are avoided at runtime (when the developer is not around, and the site visitor is alone with the script).

If Perl finds an error, it stops and prints the error. Sometimes one error will cause a cascade of others, so most programmers check the first error or two, then rerun the program.

Once the program is running, simple invocation becomes less useful. Most scripts ask early on:

if ($ENV{REQUEST_METHOD} eq "POST") or
if ($ENV{REQUEST_METHOD} eq "GET") ....

Because simple invocation from the command line does not set any environment variables, the script will fail. Depending on the program, it may just exit, crash, or politely respond with an HTML message that it was not started by the preferred method. As Chapter 7, "Extending HTML's Capabilities with CGI," showed (with formmail), it is possible to set up a script so that it handles either GET or POST requests.

To set environment variables, you have to know which shell you are running. If your prompt is a dollar sign, you are running the Bourne shell, the Korn shell, or possibly BASH. They all use the same command to set environment variables. To set environment variables in any of those shells, type:

export REQUEST_METHOD=GET

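Be sure to type the string just as it is shown here. Putting spaces around the equals sign will cause an error. (Some older Bourne shells do not accept an assignment on the export line. There, set the variable first and export it second: REQUEST_METHOD=GET; export REQUEST_METHOD.)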

If your prompt is a percent sign, you are running the C shell. To set environment variables in the C shell, type:

setenv REQUEST_METHOD GET

Look over the script and see what environment variables it requires. For GET, it almost certainly needs REQUEST_METHOD, because most well-written scripts check to see if the user is calling it by GET or POST; and QUERY_STRING, because that is how the information gets to the script. Remember to encode QUERY_STRING, and put the value in quotes so that the shell does not treat the ampersand as a command separator. If you don't need escaped characters, you can say something like this:

export QUERY_STRING="name=John+T.+Smith&address=1234+Jones+Street"

To see what your page is sending, look in the URL field at the top of the page after you have attempted to access the script.
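If the test data does contain spaces or punctuation, the value has to be encoded before it goes into QUERY_STRING. The following shell function is one minimal way to do that for plain ASCII text. The name urlencode is hypothetical, and the set of characters left unescaped is a conservative assumption, not a statement of what any particular browser sends.

```shell
#!/bin/sh
# Sketch: form-encode one value. urlencode is a hypothetical helper and
# assumes single-byte (ASCII) input. The body runs in a subshell so its
# variables do not leak into the caller.
urlencode() (
    LC_ALL=C
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [A-Za-z0-9._-]) out=$out$c ;;                        # safe as is
            ' ')            out=$out+ ;;                         # space becomes +
            *)              out=$out$(printf '%%%02X' "'$c") ;;  # %XX escape
        esac
        s=$rest
    done
    printf '%s\n' "$out"
)

# Typical use:
#   QUERY_STRING="name=$(urlencode 'John T. Smith')"
```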

If your script expects to be run by POST, set it up this way:

export REQUEST_METHOD=POST
export CONTENT_LENGTH=1024
echo "name=John+T.+Smith&address=1234+Jones+Street" | myScript.cgi

Don't worry about making CONTENT_LENGTH the exact number of characters in STDIN. Just make it large enough to handle all the characters you send it. In the same way, don't worry about sending in all the fields from a form. Send in enough to check the basic processing. If you do decide to put in all the data, save it to a file so you can save time by typing:

export REQUEST_METHOD=POST
export CONTENT_LENGTH=1024
myScript.cgi < myData

Note that you don't have to keep reentering the environment variables. Once set, they stay set until you leave that shell. If you are working in your login shell, they stay around until you explicitly change them or until you log out.
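If you would rather give the script the exact length, the shell can count the bytes in the data file for you. This is a sketch only; run_post is a made-up helper name, and the two file names in the comment are the placeholders used above.

```shell
#!/bin/sh
# Sketch: run a CGI script as a POST request with an exact CONTENT_LENGTH.
# run_post is a hypothetical helper; $1 is the script, $2 the data file.
run_post() {
    script=$1
    data=$2
    # wc -c counts bytes; the arithmetic expansion strips any padding
    # that some versions of wc put in front of the number.
    length=$(( $(wc -c < "$data") ))
    # Variables set this way apply to this one command only.
    REQUEST_METHOD=POST CONTENT_LENGTH=$length "$script" < "$data"
}

# Typical use (both names are placeholders):
#   run_post ./myScript.cgi myData
```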

In this way, the basic environment of the script is set up and you can watch it run. Put print statements in the script to make sure it's following the path you think it is. Check the results from calls to functions to make sure they are succeeding as you expect. (It's not a bad idea to leave some of those checks in the scripts to handle the response explicitly.)

You can also check the scripts in the browser by printing a "Content-type" line early on. You may want to set up a standard set of HTML-related subroutines, like these from a file named html.cgi:

# ===============================================================
# This subroutine takes a single input parameter and uses it as
# the <TITLE> and the heading at the top of the page.
# ===============================================================
sub html_header
{
  $document_title = $_[0];
  print "Content-type: text/html\n\n";
  print "<HTML>\n";
  print "<HEAD>\n";
  print "<TITLE>$document_title</TITLE>\n";
  print "</HEAD>\n";
  print "<BODY bgcolor=\"#CCCC99\" TEXT=\"#000000\" LINK=\"#DD0000\" VLINK=\"#009966\">\n";
  print "<H2>$document_title</H2>\n";
  print "<P>\n";
}
# Close the page opened by html_header.
sub html_trailer
{
  print "</BODY>\n";
  print "</HTML>\n";
}
# Report an error as a complete HTML page. Call this as &die(...);
# a bare die(...) still invokes Perl's built-in die.
sub die
{
  print "Content-type: text/html\n\n";
  print "<HTML>\n";
  print "<HEAD>\n";
  print "<TITLE>Error</TITLE>\n";
  print "</HEAD>\n";
  print "<BODY bgcolor=\"#CCCC99\" TEXT=\"#000000\" LINK=\"#DD0000\" VLINK=\"#009966\">\n";
  print "<H1>An Error Has Occurred</H1>\n";
  print "<P>\n";
  print @_;
  print "\n";
  print "</BODY>\n";
  print "</HTML>\n";
}
# A file loaded with require must end by returning a true value.
1;

Now, to get a script to print quickly, put the following lines near the top:

require "html.cgi";
&html_header("Test");

and at a point just above where the script exits, add

&html_trailer;

By Telnet

For more complex problems, consider running the script from Telnet and bypassing the browser. Suppose your server is called www.xyz.com and the server is set up to expect messages on port 80. To troubleshoot the script at /cgi-bin/foo.cgi with a query string of "This is my query", type

telnet www.xyz.com 80

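Wait for the server to respond, then type the following line and press Enter twice. The blank line tells the server that the request is complete.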

GET /cgi-bin/foo.cgi?This+is+my+query HTTP/1.0

The server runs the script, sends back the results, then closes the connection. This method has the advantage of showing the headers coming back.

To exercise a script with POST, type

telnet www.xyz.com 80

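Wait for the server to respond, then type the following. The blank line between the headers and the data is required, and Content-length should match the number of characters in the data.

POST /cgi-bin/foo.cgi HTTP/1.0
Content-type: application/x-www-form-urlencoded
Content-length: 44

name=John+T.+Smith&address=1234+Jones+Street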

The server runs the foo.cgi script, sends back the result, and closes the connection. If the script fails to send a Content-type line as the first line of its output, the server returns error 500. The first line in the server's response shows the status code. For example:

HTTP/1.0 500 Server error
Date: Mon, 12 Feb 1996 03:22:14 GMT
Server: Apache/1.0.2
Content-type: text/html

<HEAD><TITLE>Server Error</TITLE></HEAD>
<BODY><H1>Server Error</H1>
The server encountered an internal error or 
misconfiguration and was unable to complete
your request.<P>
Please contact the server administrator,
morganm@dse.com and inform them of the time the error occurred,
and anything you might have done that may have
caused the error.<P>
</BODY>
Connection closed.
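A POST exchange like this one can also be prepared ahead of time instead of being typed into Telnet. The sketch below builds the request text and pulls the status code out of a saved response. Both function names are hypothetical, and the request assumes the data is ordinary form data sent as application/x-www-form-urlencoded.

```shell
#!/bin/sh
# Sketch: build a POST request so it can be replayed without retyping.
# post_request is a hypothetical helper: $1 is the path, $2 the form data.
post_request() {
    path=$1
    body=$2
    printf 'POST %s HTTP/1.0\r\n' "$path"
    printf 'Content-type: application/x-www-form-urlencoded\r\n'
    # ${#body} is the number of characters in the body.
    printf 'Content-length: %d\r\n' "${#body}"
    printf '\r\n'                 # blank line: end of headers
    printf '%s' "$body"
}

# http_status reads a server response on standard input and prints the
# three-digit code from its first line.
http_status() {
    sed -n '1s/^HTTP\/[0-9.]* \([0-9][0-9][0-9]\).*/\1/p'
}
```

The output of post_request can be saved in a file and sent with whatever TCP client is on hand. If nc happens to be installed, for instance, piping the request to nc www.xyz.com 80 and the reply through http_status should print just the code.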

With CGItap

Running scripts from the command line or from Telnet can give insight but can be time-consuming. Various tools are emerging that simplify the process. One such utility is CGItap, available from ScendTek Internet Corporation at http://scendtek.com/cgitap/. CGItap is a small Perl script that can run on any machine. It intercepts the dialog between the client and the server and reports the following sections:

The CGI Script Output is the raw output. Although the HTTP headers are stripped off, the remaining information will show the Content-type line if it is present.

Knowing how to run a script from the command line and from Telnet is essential for a Webmaster. For day-to-day work, a program like CGItap can be invaluable.

Avoiding the Pitfalls

It is better to avoid the problems we've discussed in this chapter than to allow them to occur and then detect them. Here's a process that helps avoid most of the errors discussed in this chapter.

Setting Up the Development Machine

Configure the development machine to be as close to the live server as possible. Use the same domain names, the same configuration files and the same directory structures.

Working from the Command Line

Start building CGI scripts from templates and libraries. This book emphasizes understanding the underlying mechanisms. Once you understand them, move on to an environment that does not require retyping code and permits reuse of existing designs.

During development, work from the command line. Develop shell scripts that exercise the code. Don't work too much on getting the output HTML right until the program logic is correct.

Limiting Complexity

In 1976 Tom McCabe published a paper entitled "A Complexity Measure" in IEEE Transactions on Software Engineering (SE-2, No. 4, pp. 308-320) arguing that a program's complexity, as measured by its control flow, is a major factor in determining the quality of the program. The lower the complexity, the better, because developers can grasp the program and see their mistakes.

McCabe's Complexity Measure works like this:

  1. Measure the complexity of each subroutine separately.
  2. Start with 1 for the straight line path through the routine.
  3. Add 1 for each occurrence of if, while, for, and, and or.
  4. Add 1 for each case in a case statement. If the case statement doesn't have a default case, add 1 more.

If the routine scores five or below, it's probably simple enough. If it scores between six and ten, think about ways to simplify it. If its complexity metric is above ten, consider rewriting it. It's almost doomed to be buggy, and it will probably cost less to rewrite it than to fix it.
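The counting rules above are mechanical enough to approximate with standard UNIX tools. The sketch below reads one subroutine on standard input and applies steps 2 and 3 only; it ignores case statements and will miscount keywords that appear in comments or strings, so treat the number as an estimate. The names complexity and rate are invented for this example.

```shell
#!/bin/sh
# Sketch: crude complexity estimate for one subroutine read from stdin.
# complexity and rate are hypothetical helper names.
complexity() {
    # grep -o prints each match on its own line; -w matches whole words
    # only, so "elsif" or "format" are not counted. The +1 is for the
    # straight-line path through the routine.
    echo $(( $(grep -o -w -E 'if|while|for|and|or' | wc -l) + 1 ))
}

# rate turns a score into the advice given in the text.
rate() {
    if [ "$1" -le 5 ]; then
        echo "simple enough"
    elif [ "$1" -le 10 ]; then
        echo "consider simplifying"
    else
        echo "consider rewriting"
    fi
}

# Typical use (mySub.pl is a placeholder holding one subroutine):
#   rate "$(complexity < mySub.pl)"
```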

Tip
Numerous software utilities are available to compute metrics like McCabe's Complexity Measure. Check out http://www.swbs.idirect.com/, which describes C-DOC from Software Blacksmiths, Inc.

Performing Regression Testing on the Subroutines

Develop a set of regression tests for each routine that exercises each independent path of the program. Set up "scaffolding" scripts to test each routine separately. Put each such regression test in a shell script. Once you are satisfied with a certain level of performance, save the results in a "golden" file. From then on, always compare the output of the script with the output of the "golden" test run. For example, Listing 8.1 shows a high-level scaffolding file called test01.sh.

Note
The term "scaffolding" comes from the building construction industry. Scaffolding is used during the construction process to allow workers to reach parts of the building that would otherwise be inaccessible. Software "scaffolding" can be built as low-level routines to temporarily substitute for the real routine so that overall logic and design can be tested. Such low-level test routines are called stubs. High-level scaffolding is used to call low-level routines, in order to exercise them under controlled conditions while watching their inputs and outputs. High-level scaffolding is also known as a driver or sometimes as a test harness.
In regression testing, a golden unit is one which has been checked by hand and is known to be correct. Future versions of the software are likely to be correct if they produce the same output as the golden unit (and are known to be incorrect if they produce different output).


Listing 8.1  test01.sh: A Driver That Takes the Place of the Client and Server and Runs a CGI Script Directly

#!/bin/ksh
export REQUEST_METHOD=POST
export CONTENT_LENGTH=1024
/usr/local/etc/httpd/cgi-bin/xyz/myScript.cgi < test01.dat > test01.results
diff test01.golden test01.results

This script sets up and runs the script myScript.cgi in the xyz project directory. It uses POST to read its input from the data file test01.dat and writes its results to test01.results. Then it compares the results of this run with the results of the "golden run" and shows any differences.
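As the number of tests grows, a single driver can loop over all of them. The following sketch assumes a naming convention that is not in Listing 8.1: each input file testNN.dat sits next to a golden file testNN.golden in the same directory. The function name run_suite is hypothetical.

```shell
#!/bin/sh
# Sketch: run every test*.dat in a directory against its golden file.
# run_suite is a hypothetical helper: $1 is the CGI script, $2 the
# directory holding the .dat and .golden files.
run_suite() {
    script=$1
    dir=$2
    failures=0
    for data in "$dir"/test*.dat; do
        [ -f "$data" ] || continue          # no test files at all
        base=${data%.dat}
        REQUEST_METHOD=POST CONTENT_LENGTH=1024 "$script" \
            < "$data" > "$base.results"
        if diff "$base.golden" "$base.results" > /dev/null; then
            echo "PASS $(basename "$base")"
        else
            echo "FAIL $(basename "$base")"
            failures=$(( failures + 1 ))
        fi
    done
    # Succeed only if every test matched its golden file.
    [ "$failures" -eq 0 ]
}
```

Because the function returns failure when any test differs, it can serve as the gate before a module is checked back in.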

There should be a test for each path through the code. For example, every if statement generates two paths: one if the condition is true and one if it is false. Test at and near limits. If something special happens when a variable is exactly one, test with the variable set to zero, one and two.

To keep from having to test a huge number of cases, test each subroutine separately. Suppose the program runs in three steps:

  1. Validate input
  2. Process input
  3. Format data

Furthermore, suppose that each module has a complexity of five. If you test the subroutines one at a time, this program can be tested with fifteen tests (five for each of the three modules), or maybe a few more to cover special cases and limits. Tested as a whole, this program might have a complexity of 125 (five times five times five, because each path through one step can combine with each path through the next), and might need between 130 and 150 tests to determine whether it's still functioning correctly.

Putting Everything Under Configuration Control

Once a module is working, check it into the configuration control system, along with the test routines, golden files, and test inputs. Include a README to document the versions of any binary files, such as Perl. Make a rule that whenever a module is checked out, it is not checked back in on the main path without passing all regression tests. (Checking it back in on a branch is okay under certain circumstances.)

Once all the subroutines are working, integrate the whole program. Build regression tests for it, and put the subroutines all under configuration control.

Many of the projects described in this book require more than one script. Build regression tests for the whole system and put all of the scripts and their tests under configuration control.

Testing All Software Three Ways

The regression testing described in the preceding section is functional testing. Its purpose is to make sure that the software works the way it's supposed to. Another kind of testing is stress testing: throwing input at the software that it was never explicitly specified to handle. Here are some ideas for stress testing:

Keep written records of the tests so that if there's ever a question about what the system was able to do on a given date, you can document the tests from the archives.

Third, engage in load testing. Set up your test server with two or three times the number of servers you allow and set them all to exercise the new software. Set up all the regression tests to run in a continuous loop. If the software has any common files it must read or write, load testing will shake out concurrency issues. Watch the system performance during load testing. Use UNIX tools, like vmstat, to see where the time is going. If your UNIX is derived from System V, use sar to examine the same topics. Look for hot spots in the code and think of ways to optimize them.
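A simple way to start is to launch several copies of a regression driver at once from the shell. The sketch below does that; load_test is a hypothetical helper, and the worker and repetition counts are whatever you choose.

```shell
#!/bin/sh
# Sketch: start several copies of a test command at once and repeat each
# one a number of times. load_test is a hypothetical helper:
#   $1 = number of concurrent workers, $2 = repetitions per worker,
#   remaining arguments = the command to run (for example, a test driver).
load_test() {
    workers=$1
    rounds=$2
    shift 2
    pids=
    w=0
    while [ "$w" -lt "$workers" ]; do
        (
            # Each worker runs the command repeatedly and stops at the
            # first failure.
            r=0
            while [ "$r" -lt "$rounds" ]; do
                "$@" > /dev/null 2>&1 || exit 1
                r=$(( r + 1 ))
            done
        ) &
        pids="$pids $!"
        w=$(( w + 1 ))
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$(( failed + 1 ))
    done
    echo "$failed of $workers workers reported a failure"
    [ "$failed" -eq 0 ]
}

# Typical use (test01.sh stands for any regression driver):
#   load_test 10 50 ./test01.sh
```

While it runs, a command such as vmstat 5 in another window reports system activity every five seconds.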

Note
When a program runs, the available time (sometimes known as wall time since it is measured by the clock on the wall, as opposed to CPU time) only goes five places:
  • Active CPU cycles
  • Waiting for the CPU
  • Waiting for disk I/O (other than paging or swapping) to complete
  • Waiting for other I/O to complete
  • Waiting for paging or swapping to complete
Knowing where the time is going is the first step to speeding up a program. If most of the time is spent in the CPU, make the program more efficient, or run fewer programs, or get a faster computer. If the system is paging or swapping, consider adding real memory. If the system is bottlenecked on the disk or other I/O, consider adding more and faster resources in those areas.
Most UNIX vendors have manuals and seminars on how to optimize programs running on their operating system. For general comments, look at System Performance Tuning (O'Reilly & Associates, Inc., 1990) by Mike Loukides.

During testing, you will find "hot beds" of defects. Track the defect density by subroutine and by program. When the defect density crosses some threshold, throw out the code and rewrite it. You will find it less expensive to trash bad code than to maintain it.

Once you've made all the changes you need to so that the software performs acceptably under stress and load, go back to the beginning and run a full regression suite. Once it passes all tests, check in all the test software along with the code under test, so you can re-create the test environment at any time.

Releasing Alpha and Beta Test Versions

Once the product seems to work and passes the developer's tests, give it to a friendly in-house test team. Depending upon the software, the testers can be administrative staff, family members, or friends. Ask them to interact with the software and try to break it. (Twelve-year-olds are an excellent resource for these alpha test teams. They can break anything.)

Keep written records of the defects found during alpha testing, as well as recommendations from the testers for improvements. Fix the problems and run a full set of regression tests to make sure nothing broke in one part while you were fixing another part.

After you and your alpha testers are satisfied that the product works, offer it to one client at a discounted rate. You are now beginning beta testing. Make it absolutely clear that this software is going out for its first test. Give the client Customer Trouble Reports (CTRs) and make sure they know how to fill them out. Consider putting the CTR online, so visitors can report problems. Analyze the error log daily to see whether the software is malfunctioning.

Teams Make the Difference

Does this business of testing and retesting sound like a lot of work? It is. But it's not nearly as much work as fixing defects after the software is released.

Think of software development this way. Before the software is released, you can develop it during working hours, at your own pace, and take your time to make sure everything is right. It never seems like there's enough time. The deadline always looms large. But compare that environment to fixing fielded software. Once it breaks, the customer is hesitant to trust it again and wants it fixed now. While you're working on it, hundreds or perhaps thousands of people are using it, breaking it again, and getting frustrated. Your next project is languishing on the disk, slipping behind schedule because you're tied up fixing the last seven systems you shipped. Not a pretty picture.

Clearly, to survive in this business, a Webmaster must assemble a team of software developers who share the responsibility for specifying, designing, coding, integrating, and testing CGI-based software systems. These teams must develop, document, and improve repeatable processes that result in shipping quality software products consistently.

In his book, Code Complete (Microsoft Press, 1993), Steve McConnell reports the results of a highly disciplined coding and testing process called "cleanroom development." He reports that "productivity for a fully checked out 80,000-line cleanroom project was 740 lines of code per work-month. The industry average rate for fully checked out code is closer to 150 lines per month." He quotes cleanroom pioneer Harlan Mills as saying that "after a team has completed three or four cleanroom development projects, it should be able to reduce the density of errors in its code by a factor of 100 and simultaneously increase its productivity by a factor of 10."

The finest programs in the world are worthless if they cannot be run. This chapter addresses problems that occur in the CGI script as well as problems that occur in the server configuration. It lists the error codes that can be returned by the server, and shows what kinds of problems cause each error.

This chapter also shows how to run CGI scripts by hand, bypassing the client and even the server so that the input and the output are both visible and controllable. The final section shows a step-by-step set of procedures that can reduce the defect rate in delivered code by a factor of ten.