Chapter 3

Designing CGI Applications

by Jeffry Dwight


A CGI application is much more like a system utility than a full-blown application. In general, scripts are task-oriented rather than process-oriented. That is, a CGI application has a single job to do-it initializes, does its job, and then terminates. This makes it easy to chart data flow and program logic. Even in a GUI environment, the application doesn't have to worry much about being event-driven: The inputs and outputs are defined, and the program will probably have a top-down structure with simple subroutines.

Programming is a discipline, an art, and a science. The mechanics of the chosen language, coupled with the parameters of the operating system and the CGI environment, make up the science. The conception, the execution, and the elegance (if any) can be either art or science. But the discipline isn't subject to artistic fancy and is platform-independent. This chapter deals mostly with programming discipline, concentrating on how to apply that discipline to your CGI scripts.

Chapter 4, "Understanding Basic CGI Elements," covers script elements in detail. In particular, you'll find a complete discussion of environment variables and parsing. I'll touch on these issues briefly, but only as they relate to script structure and planning.

In this chapter, I'll cover the following topics:

  CGI script structure
  Planning your script
  Standard CGI environment variables
  CGI script portability
  CGI libraries
  CGI limitations

CGI Script Structure

When your script is invoked by the server, the server passes information to the script in one of two ways: GET or POST. These two methods are known as request methods. The request method used is passed to your script via the environment variable called-appropriately enough-REQUEST_METHOD.

URL Encoding
The HTTP 1.0 specification calls for URL data to be encoded in such a way that it can be used on almost any hardware and software platform. Information specified this way is called URL-encoded; almost everything passed to your script by the server will be URL-encoded.
Parameters passed as part of QUERY_STRING or PATH_INFO will take the form variable1=value1&variable2=value2 and so forth, for each variable defined in your form.
Variables are separated by the ampersand (&). If you want to send a real ampersand, it must be escaped-that is, encoded as a two-digit hexadecimal value representing the character. Escapes are indicated in URL-encoded strings by the percent sign (%). Thus, %25 represents the percent sign itself. (25 is the hexadecimal, or base 16, representation of the ASCII value for the percent sign.) All characters above 127 (7F hex) or below 33 (21 hex) are escaped. This includes the space character, which is escaped as %20. Also, the plus sign (+) needs to be interpreted as a space character.
Before your script can deal with the data, it must parse and decode it. Fortunately, these are fairly simple tasks in most programming languages. Your script scans through the string looking for an ampersand. When an ampersand is found, your script chops off the string up to that point and calls it a variable. The variable's name is everything up to the equal sign in the string; the variable's value is everything after the equal sign. Your script then continues parsing the original string for the next ampersand, and so on, until the original string is exhausted.
After the variables are separated, you can safely decode them, as follows:
  1. Replace all plus signs with spaces.
  2. Replace all %## (percent sign followed by two hex digits) with the corresponding ASCII character.
It's important that the script scan through the string linearly rather than recursively because the characters the script decodes may be plus signs or percent signs.
When the server passes data to your script with the POST method, the script checks the environment variable called CONTENT_TYPE. If CONTENT_TYPE is application/x-www-form-urlencoded, your data needs to be decoded before use.
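In C, the decode step might look something like the following sketch. The function name url_decode is mine, and the string is assumed to have already been split into individual values on the ampersands and equal signs:

```c
#include <ctype.h>
#include <stdlib.h>

/* Decode one URL-encoded value in place: '+' becomes a space, and
   %XX becomes the character with that hexadecimal value. Note the
   single linear pass -- decoded characters are never re-examined,
   so a decoded '%' or '+' can't be mistaken for an escape. */
void url_decode(char *s)
{
    char *out = s;
    while (*s) {
        if (*s == '+') {
            *out++ = ' ';
            s++;
        } else if (*s == '%' && isxdigit((unsigned char)s[1])
                             && isxdigit((unsigned char)s[2])) {
            char hex[3] = { s[1], s[2], '\0' };
            *out++ = (char)strtol(hex, NULL, 16);
            s += 3;
        } else {
            *out++ = *s++;
        }
    }
    *out = '\0';
}
```

Running "Bob%20%26%20Carol+Ltd." through this yields "Bob & Carol Ltd.".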

The basic structure of a CGI application is simple and straightforward: initialization, processing, output, and termination. Because this chapter deals with concepts, flow, and programming discipline, I'll use pseudocode rather than a specific language for the examples.

Ideally, a script has the following form (with appropriate subroutines for do-initialize, do-process, and do-output):

  1. Program begins
  2. Call do-initialize
  3. Call do-process
  4. Call do-output
  5. Program ends

Real life is rarely this simple, but I'll give the nod to proper form while acknowledging that you'll seldom see it.

Initialization

The first thing your script must do when it starts is determine its input, environment, and state. Basic operating-system environment information can be obtained the usual way: from the system registry in NT, from standard environment variables in UNIX, from INI files in Windows, and so forth.

State information will come from the input rather than the operating environment or static variables. Remember: Each time CGI scripts are invoked, it's as if they've never been invoked before. The scripts don't stay running between calls. Everything must be initialized from scratch, as follows:

  1. Determine how the script was invoked. Typically, this involves reading the environment variable REQUEST_METHOD and parsing it for the word GET or the word POST.

NOTE
Although GET and POST are the only currently defined operations that apply to CGI, you may encounter PUT or HEAD from time to time if your server supports them and the user's browser uses them. PUT was offered as an alternative to POST, but never received approved RFC status and isn't in general use. HEAD is used by some browsers to retrieve just the headers of an HTML document and isn't applicable to CGI programming; other oddball request methods may be out there, too. Your code should check explicitly for GET and POST and refuse anything else. Don't assume that if the request method isn't GET then it must be POST, or vice versa.

  2. Retrieve the input data. If the method was GET, you must obtain, parse, and decode the QUERY_STRING environment variable. If the method was POST, you must check QUERY_STRING and also parse STDIN. If the CONTENT_TYPE environment variable is set to application/x-www-form-urlencoded, the stream from STDIN needs to be decoded too.

The following is the initialization phase in pseudocode:

retrieve any operating system environment values desired
allocate temporary storage for variables
if environment variable REQUEST_METHOD equals "GET" then
        retrieve contents of environment variable QUERY_STRING;
        if QUERY_STRING is not null, parse it and decode it;
else if REQUEST_METHOD equals "POST" then
        retrieve contents of environment variable QUERY_STRING;
        if QUERY_STRING is not null, parse it and decode it;
        retrieve value of environment variable CONTENT_LENGTH;
        if CONTENT_LENGTH is greater than zero, read CONTENT_LENGTH bytes from STDIN;
        parse STDIN data into separate variables;
        retrieve contents of environment variable CONTENT_TYPE;
        if CONTENT_TYPE equals application/x-www-form-urlencoded, then decode parsed variables;
else if REQUEST_METHOD is neither "GET" nor "POST" then
        report an error;
        deallocate temporary storage;
        terminate
end if

Processing

After initializing its environment by reading and parsing its input, the script is ready to get to work. What happens in this section is much less rigidly defined than during initialization. During initialization, the parameters are known (or can be discovered), and the tasks are more or less the same for every script you'll write. The processing phase, however, is the heart of your script, and what you do here will depend almost entirely on the script's objectives.

  1. Process the input data. What you do here will depend on your script. For instance, you may ignore all the input and just output the date; you may spit back the input in neatly formatted HTML; you may hunt up information in a database and display it; or you may do something never thought of before. Processing the data means, generally, transforming it somehow. In classical data processing terminology, this is called the transform step because in batch-oriented processing, the program reads a record, applies some rule to it (transforming it), and then writes it back out. CGI programs rarely, if ever, qualify as classical data processing, but the idea is the same. This is the stage of your program that differentiates it from all other CGI programs-where you take the inputs and make something new from them.
  2. Output the results. In a simple CGI script, the output is usually just a header and some HTML. More complex scripts might output graphics, graphics mixed with text, or all the information necessary to call the script again with some additional information. A common and rather elegant technique is to call a script once by using GET, which can be done from a standard <a href> tag. The script senses that it was called with GET and creates an HTML form on the fly-complete with hidden variables and code necessary to call the script again, this time with POST.

Row, Row, Row Your Script…
In the UNIX world, a character stream is a special kind of file. STDIN and STDOUT are character streams by default. The operating system helpfully parses streams for you, making sure that everything going through is proper 7-bit ASCII or an approved control code.

Seven-bit? Yes. For HTML, this doesn't matter. However, if your script sends graphical data, using a character-oriented stream means instant death. The solution is to switch the stream over to binary mode. In C, you do this with the setmode() function: setmode(fileno(stdout), O_BINARY). You can change horses in midstream with the complementary setmode(fileno(stdout), O_TEXT). A typical graphics script will output the headers in character mode, and then switch to binary mode for the graphical data.

In the NT world, streams behave the same way for compatibility reasons. A nice simple \n in your output gets converted to \r\n for you when you write to STDOUT. This doesn't happen with regular NT system calls, such as WriteFile(); you must specify \r\n explicitly if you want CRLF.

Alternate words for character mode and binary mode are cooked and raw, respectively-those in the know will use these terms instead of the more common ones.

Whatever words you use and on whatever platform, there's another problem with streams: by default, they're buffered, which means that the operating system hangs onto the data until a line-terminating character is seen, the buffer fills up, or the stream is closed. This means that if you mix buffered printf() statements with unbuffered fwrite() or fprintf() statements, things will probably come out jumbled, even though they may all write to STDOUT. Printf() writes buffered to the stream; file-oriented routines output directly. The result is an out-of-order mess.

You may lay the blame for this straight at the feet of the god known as Backward Compatibility. Beyond the existence of many old programs, streams have no reason to default to buffered and cooked. These should be options that you turn on when you want them-not turn off when you don't. Fortunately, you can propitiate this god of Backward Compatibility with the simple incantation setvbuf(stdout, NULL, _IONBF, 0), which turns off all buffering for the STDOUT stream.

Another solution is to avoid mixing types of output statements; even so, that won't make your cooked output raw, so it's a good idea to turn off buffering anyway. Many servers and browsers are cranky and dislike receiving input in drabs and twaddles.
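The incantation, plus the Windows-only setmode() call (spelled _setmode in newer compilers), might be wrapped in a helper like this sketch (the function name make_raw is mine):

```c
#include <stdio.h>
#ifdef _WIN32
#include <io.h>
#include <fcntl.h>
#endif

/* Switch a stream to "raw": no buffering, and (on Windows, where
   text streams cook \n into \r\n) binary mode. Returns 0 on success.
   A graphics script would call make_raw(stdout) after printing its
   headers and before writing any image data. */
int make_raw(FILE *f)
{
    if (setvbuf(f, NULL, _IONBF, 0) != 0)
        return -1;                        /* buffering off */
#ifdef _WIN32
    if (_setmode(_fileno(f), _O_BINARY) == -1)
        return -1;                        /* no \n -> \r\n translation */
#endif
    return 0;
}
```

On UNIX there's no text/binary distinction, so only the buffering needs turning off there.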

NOTE
Those who speak mainly UNIX will frown at the term CRLF, while those who program on other platforms might not recognize \n or \r\n. CRLF, meet \r\n. \r is how C programmers specify a carriage return (CR) character; \n is how C programmers specify a line feed (LF) character. (That's Chr$(10) for LF and Chr$(13) for CR to you Basic programmers.)

The following is a pseudocode representation of a simple processing phase whose objective is to recapitulate all the environment variables gathered in the initialization phase:

output header "content-type: text/html\n"
output required blank line to terminate header "\n"
output "<HTML>"
output "<H1>Variable Report</H1>"
output "<UL>"
for each variable known
        output "<LI>"
        output variable-name
        output "="
        output variable-value
loop until all variables printed
output "</UL>"
output "</HTML>"

This has the effect of creating a simple HTML document containing a bulleted list. Each item in the list is a variable, expressed as name=value.
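In C, the report might look like this sketch (variable_report is an illustrative name; the names and values arrays are assumed to come from the initialization phase):

```c
#include <stdio.h>

/* Write the variable report from the pseudocode above to 'out'.
   'names' and 'values' are parallel arrays of 'count' strings. */
void variable_report(FILE *out, char **names, char **values, int count)
{
    int i;
    fputs("Content-type: text/html\n", out);
    fputs("\n", out);               /* required blank line ends the header */
    fputs("<HTML>\n<H1>Variable Report</H1>\n<UL>\n", out);
    for (i = 0; i < count; i++)
        fprintf(out, "<LI>%s=%s\n", names[i], values[i]);
    fputs("</UL>\n</HTML>\n", out);
}
```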

Termination

Termination is nothing more than cleaning up after yourself and quitting. If you've locked any files, you must release them before letting the program end. If you've allocated memory, semaphores, or other objects, you must free them. Failure to do so may result in a "one-shot wonder" of a script-one that works only the first time, but breaks on every subsequent call. Worse yet, your script may hinder, or even break, other scripts or even the server itself by failing to free up resources and release locks.

On some platforms-most notably Windows NT and, to a lesser extent, UNIX-your file handles and memory objects are closed and reclaimed when your process terminates. Even so, it's unwise to rely on the operating system to clean up your mess. For instance, under NT, the behavior of the file system is undefined when a program locks all or part of a file and then terminates without releasing the locks.

Make sure that your error-exit routine-if you have one (and you should)-knows about your script's resources and cleans up just as thoroughly as the main exit routine does.

Planning Your Script

Now that you've seen a script's basic structure, you're ready to learn how to plan a script from the ground up.

In the old days (circa 1950), planning a program meant reams of paper, protractors, rulers, chalkboards, punch tape, stacks of cards, and endless cups of coffee during endless meetings with other white-coated technicians, each of whom could parse your machine code and compare cycles and pre-fetch queue efficiency in his head. Programs emerging from this method of planning tended to be brutal, short, and ugly-but amazingly efficient.

Later on, in the 1970s, planning a program meant reading dozens of weighty tomes, each of which went on at great length and obscurity about discipline, flow charts, data flow versus program logic, the foolishness (or wisdom) of top-down design, inheritablity, reusability, encapsulation, data integrity, and so forth. One then drank endless cups of coffee while attending endless meetings and shouting quotations from the books at the other participants.

Finally, someone would notice that the customer was about to sign with another company, and nip off to write the program while everyone else was still arguing. The resultant program was almost always put into use immediately and got the job done. The only thing with a longer life cycle than a program written this way is the ongoing, raging discussion about its inadequacies and lack of adherence to proper rules of structure.

In still more recent times, circa 1980, programs got designed by folks wearing blue jeans, sandals, glasses, and long hair. They seldom attended meetings, and if they did, they never paid attention. They doodled, talked to themselves, kept odd hours, and drank either Mountain Dew or mineral water, depending on whether they worked in California or not. They were known to consume mass quantities of almost raw red meat, and to call for pizza at least once a day-often, first thing in the morning.

Sometimes they appeared to be working, but it was hard to tell because the lights were always off in their offices. Eventually they emerged with a program-arrived at by mystical processes unknown to common man-and when asked for the documentation, would tap their foreheads and smile knowingly, and then go home to sleep. These programs provided the foundation of the modern software industry. Scary, huh?

In the 1990s, program design most often happens after a short meeting or two to discuss requirements and milestones in a non-smoking, caffeine-free, hypoallergenic environment. Then someone gets assigned to chart the data flow and program logic, usually using Visio or some such program to drag perfect little boxes and lines around on-screen until the project manager is happy. Another person or team usually codes the project, translating each of the little boxes into a single, simple subroutine that performs a single, simple task. Then the documentation and user-training team moves in, and the programmers can go home until the next project.

Although each approach has its advantages and drawbacks, CGI scripts benefit most from the 1980s model with the discipline of the 1970s, the documentation of the 1990s, and a dash of efficiency from the old days. Follow these steps:

  1. Take your time defining the program's task. Think it through thoroughly. Write it down and trace the program logic. (Doodling is fine; Visio is overkill.) When you're satisfied that you understand the input and output and the transform process you'll have to do, proceed.
  2. Order a pizza and a good supply of your favorite beverage, lock yourself in for the night, and come out the next day with a finished program. The glasses, jeans, and long hair are optional accessories. Don't forget to document your code while writing it.
  3. Test, test, test. Use every browser known to mankind and every sort of input you can think of. Especially test for the situations in which users enter 32K of data in a 10-byte field or they enter control codes where you're expecting plain text.
  4. Document the program as a whole, too--not just the individual steps within it--so that others who have to maintain or adapt your code will understand what you were trying to do.

Step 1, of course, is this section's topic, so let's look at that process in more depth:

NOTE
Programmers use semaphores to coordinate among multiple programs, multiple instances of the same program, or even among routines within a single program. Some operating systems have support for semaphores built-in; others require the programmers to develop a semaphore strategy.
In the simplest sense, a semaphore is like a toggle switch whose state can be checked: Is the switch on? If so, do this; if not, do that. Often, files are used as semaphores (does the file exist? If so, do this; if not, do that). A more sophisticated method is to try to lock a file for exclusive access (if you can get the lock, do this; if not, wait a bit and try again).
In CGI programming, semaphores are used most often to coordinate among multiple instances of the same CGI script. If, for instance, your script must update a file, it can't assume that the file is available at all times. What if another instance of the same script is in the middle of updating the file right then? The second process must wait until the first one is finished, or else the file will become hopelessly corrupted. The solution is to use a semaphore. Your script checks to make sure that the semaphore is clear. If not, it goes into a short loop, checking the semaphore periodically. After the semaphore is clear, it sets the semaphore so that no other program will interfere. It then performs its critical section--in this case, writing to a file--and clears the semaphore again. Other instances can then each take a turn. The semaphore thus provides a way to manage concurrency safely.
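Here's a minimal POSIX sketch of the file-as-semaphore approach described above, using O_CREAT|O_EXCL so that creating the lock file is an atomic test-and-set (the function names and the crude one-second retry policy are mine):

```c
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* A file used as a semaphore: O_CREAT|O_EXCL guarantees that only one
   process can create the file, so creating it is "set" and unlinking
   it is "clear". Returns 0 once the semaphore is acquired. */
int acquire_semaphore(const char *path, int max_tries)
{
    int tries;
    for (tries = 0; tries < max_tries; tries++) {
        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd != -1) {
            close(fd);
            return 0;          /* semaphore was clear; now it's set */
        }
        if (errno != EEXIST)
            return -1;         /* real error, not a held semaphore */
        sleep(1);              /* semaphore set; wait a bit and retry */
    }
    return -1;                 /* gave up */
}

void release_semaphore(const char *path)
{
    unlink(path);
}
```

The critical section-writing to the shared file-goes between the acquire and release calls.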

NOTE
An early-out algorithm is one that tests for the exception, or least-significant case, and exits with a predefined answer rather than exercising the full algorithm to determine the answer. For example, division algorithms usually test for a divide-by-two operation and do a shift instead of a divide.
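A tiny illustration in C (extending the divide-by-two case to any power-of-two divisor is my generalization; the divisor is assumed to be nonzero):

```c
/* Early-out division: handle the trivial and power-of-two cases with
   an early exit, falling through to a real divide only when needed.
   The divisor d is assumed to be nonzero. */
unsigned int div_early_out(unsigned int n, unsigned int d)
{
    if (d == 1)
        return n;                       /* earliest out: nothing to do */
    if ((d & (d - 1)) == 0) {           /* power of two? */
        unsigned int shift = 0;
        while ((1u << shift) < d)
            shift++;
        return n >> shift;              /* shift instead of divide */
    }
    return n / d;                       /* general case */
}
```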

Standard CGI Environment Variables

Here's a brief overview of the standard environment variables you're likely to encounter. Each server implements the majority of them consistently, but there are variations, exceptions, and additions. In general, you're more likely to find a new, otherwise undocumented variable than to find a documented variable omitted. The only way to be sure, though, is to check your server's documentation.

Chapter 4, "Understanding Basic CGI Elements," deals with each variable in some depth. This section is taken from the NCSA specifications and is the closest thing to a "standard" that you'll find. In case you've misplaced the URL for the NCSA CGI specification, here it is again:

http://www.w3.org/hypertext/WWW/CGI/

The following environment variables are set each time the server launches an instance of your script, and are private and specific to that instance:

NOTE
AUTH_TYPE and REMOTE_USER are set only after a user successfully authenticates (usually via a user name and password) his identity to the server. Hence, these variables are useful only when restricted areas are established, and then only in those areas.

CGI Script Portability

CGI programmers face two portability issues: platform independence and server independence. By platform independence, I mean the capability of the code to run without modification on a hardware platform or operating system different from the one for which it was written. Server independence is the capability of the code to run without modification on another server using the same operating system.

Platform Independence

The best way to keep your CGI script portable is to use a commonly available language and avoid platform-specific code. It sounds simple, right? In practice, this means using either C or Perl and not doing anything much beyond formatting text and outputting graphics.

Does this leave Visual Basic, AppleScript, and UNIX shell scripts out in the cold? Yes, I'm afraid so-for now. However, platform independence isn't the only criterion to consider when selecting a CGI platform. There's also speed of coding, ease of maintenance, and ability to perform the chosen task.

Certain types of operations simply aren't portable. If you develop for 16-bit Windows, for instance, you'll have great difficulty finding equivalents on other platforms for the VBX and DLL functions you use. If you develop for 32-bit Windows NT, you'll find that all your asynchronous Winsock calls are meaningless in a UNIX environment. If your shell script does a system() call to launch grep and pipe the output back to your program, you'll find nothing remotely similar in the NT environment. And AppleScript is good only on Macs-period!

If one of your mandates is the capability to move code among platforms with a minimum of modification, you'll probably have the best success with C. Write your code using the standard functions from the ANSI C libraries, and avoid making other operating system calls. Unfortunately, following this rule will limit your scripts to very basic functionality. If you wrap your platform-dependent code in self-contained routines, however, you minimize the work needed to port from one platform to the next. As you saw earlier in the section "Planning Your Script," when talking about encapsulation, a properly designed program can have any module replaced in its entirety without affecting the rest of the program. Using these guidelines, you may have to replace a subroutine or two, and you'll certainly have to recompile; however, your program will be portable.

Perl scripts are certainly easier to maintain than C programs, mainly because there's no compile step. You can change the program quickly when you figure out what needs to be changed. And there's the rub: Perl is annoyingly obtuse, and the libraries tend to be much less uniform-even between versions on the same platform-than do C libraries. Also, Perl for NT is fairly new and still quirky (as if anything related to Perl can be called more quirky than another part).

If, however, you dream of bit masks, think two-letter code words are more descriptive than named functions, and believe in your heart that programming syntax should be as convoluted and chock full of punctuation as possible, then you and Perl are soul mates. You won't have much trouble porting your application among platforms once you identify the platform-dependencies and find (or write) libraries for the standard functions.

Server Independence

Far more important than platform independence (unless you're writing scripts only for your own pleasure) is server independence. Server independence is fairly easy to achieve, but for some reason seems to be a stumbling block to beginning script writers. To be server independent, your script must run without modification on any server using the same operating system. Only server-independent programs can be useful as shareware or freeware, and without a doubt, server independence is a requirement for commercial software.

Most programmers think of obvious issues, such as not assuming that the server has a static IP address. The following are some other rules of server independence that, although obvious once stated, nevertheless get overlooked time and time again:

CGI Libraries

When you talk about CGI libraries, there are two possibilities: libraries of code you develop and want to reuse in other projects, and publicly available libraries of programs, routines, and information.

Personal Libraries

If you follow the advice given earlier in the "Planning Your Script" section about writing your code in a black-box fashion, you'll soon discover that you're building a library of routines that you'll use over and over. For instance, after you puzzle out how to parse out URL-encoded data, you don't need to do it again. And when you have a basic main() function written, it will probably serve for every CGI program you ever write. This is also true for generic routines, such as querying a database, parsing input, and reporting runtime errors.

How you manage your personal library depends on the programming language you use. With C and assembler, you can precompile code into actual .lib files, with which you can then link your programs. Although possible, this likely is overkill for CGI and doesn't work for interpreted languages, such as Perl and Visual Basic. (Although Perl and VB can call compiled libraries, you can't link with them in a static fashion the way you can with C.) The advantage of using compiled libraries is that you don't have to recompile all your programs when you change code in the library. If the library is loaded at runtime (a DLL), you don't need to change anything. If the library is linked statically, all you need to do is relink.

Another solution is to maintain separate source files and simply include them with each project. You might have a single, fairly large file that contains the most common routines while putting seldom-used routines in files of their own. Keeping the files in source format adds a little overhead at compile time, but not enough to worry about-especially when compared to the time savings you gain by writing the code only once. The disadvantage of this approach is that when you change your library code, you must recompile all your programs to take advantage of the change.

Nothing can keep you from incorporating public-domain routines into your personal library either. As long as you make sure that the copyright and license allow you to use and modify the source code without royalties or other stipulations, you should strip out the interesting bits and toss them into your library.

Well-designed and well-documented programs provide the basis for new programs. If you're careful to isolate the program-specific parts into subroutines, there's no reason not to cannibalize an entire program's structure for your next project.

You can also develop platform-specific versions of certain subroutines and, if your compiler will allow it, automatically include the correct ones for each type of build. At the worst, you'll have to manually specify which subroutines you want.

The key to making your code reusable this way is to make it as generic as possible. Not so generic that, for instance, a currency printing routine needs to handle both yen and dollars, but generic enough that any program that needs to print out dollar amounts can call that subroutine. As you upgrade, swat bugs, and add capabilities, keep each function's inputs and outputs the same, even when you change what happens inside the subroutine. This is the black-box approach in action. By keeping the calling convention and the parameters the same, you're free to upgrade any piece of code without fear of breaking older programs that call your function.

Another technique to consider is using function stubs. Say that you decide eventually that a single routine to print both yen and dollars is actually the most efficient way to go. But you already have separate subroutines, and your old programs wouldn't know to pass the additional parameter to the new routine. Rather than go back and modify each program that calls the old routines, just "stub out" the routines in your library so that the only thing they do is call the new, combined routine with the correct parameters. In some languages, you can do this by redefining the routine declarations; in others, you actually need to code a call and pay the price of some additional overhead. But even so, the price is far less than that of breaking all your old programs.
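A sketch of the stub technique in C, using hypothetical currency routines (the function names are mine): the old dollar and yen formatters keep their original calling conventions and simply forward to the new combined routine, so no caller has to change.

```c
#include <stdio.h>

/* The new, combined routine: one formatter for any currency. */
int format_currency(char *buf, size_t size,
                    const char *symbol, double amount)
{
    return snprintf(buf, size, "%s%.2f", symbol, amount);
}

/* Stubs preserving the old calling conventions: old programs call
   these exactly as before and never know the implementation moved. */
int format_dollars(char *buf, size_t size, double amount)
{
    return format_currency(buf, size, "$", amount);
}

int format_yen(char *buf, size_t size, double amount)
{
    return format_currency(buf, size, "JPY ", amount);
}
```

The stubs cost one extra call apiece-a far smaller price than breaking every old program.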

Public Libraries

The Internet is rich with public-domain sample code, libraries, and precompiled programs. Although most of what you'll find is UNIX-oriented (because it has been around longer), there's nevertheless no shortage of routines for Windows NT.

Here's a list of some of the best sites on the Internet with a brief description of what you'll find at each site. This list is far from exhaustive. Hundreds of sites are dedicated to, or contain information about, CGI programming. Hop onto your Web browser and visit your favorite search engine. Tell it to search for "CGI" or "CGI libraries" and you'll see what I mean. To save you the tedium of wading through all the hits, I've explored them for you. The following are the ones that struck me as most useful:

I could go on listing sites forever, it seems, but that's enough to get you started.

CGI Limitations

By far, the biggest limitation of CGI is its statelessness. As you learned in Chapter 1, "Introducing CGI," an HTTP Web server doesn't remember callers between requests. In fact, what appears to the user as a single page may actually be made up of dozens of independent requests-either all to the same server or to many different servers. In each case, the server fulfills the request, then hangs up and forgets the user ever dropped by.

The capability to remember what a caller was doing the last time through is called remembering the user's state. HTTP, and therefore CGI, doesn't maintain state information automatically. The closest things to state information in a Web transaction are the user's browser cache and a CGI program's cleverness. For example, if a user leaves a required field empty when filling out a form, the CGI program can't pop up a warning box and refuse to accept the input. The program's only choices are to output a warning message and ask the user to click the browser's back button, or to output the entire form again, filling in the values of the fields that were supplied and letting the user try again, either correcting mistakes or supplying the missing information.
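Re-emitting a form field with the user's earlier input preserved might look like this C sketch (the function name is mine, and the value is assumed to be HTML-escaped already):

```c
#include <stdio.h>

/* Emit a text field pre-filled with the value the user already
   supplied, so the re-output form comes back populated rather than
   empty. The value must already be HTML-escaped by the caller. */
int emit_filled_field(char *buf, size_t size,
                      const char *name, const char *value)
{
    return snprintf(buf, size,
                    "<INPUT TYPE=\"text\" NAME=\"%s\" VALUE=\"%s\">",
                    name, value ? value : "");
}
```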

There are several workarounds for this problem, none of them terribly satisfactory. One idea is to maintain a file containing the most recent information from all users. When a new request comes through, hunt up the user in the file and assume the correct program state based on what the user did the last time. The problems with this idea are that it's very hard to identify a Web user, and a user may not complete the action, yet visit again tomorrow for some other purpose. An incredible amount of effort has gone into algorithms to maintain state only for a limited time period-a period that's long enough to be useful, but short enough not to cause errors. However, these solutions are terribly inefficient and ignore the other problem-identifying the user in the first place.

You can't rely on the user to provide his identity. Not only do some want to remain anonymous, but even those who want you to know their names can misspell it from time to time. Okay, then, what about using the IP address as the identifier? Not good. Everyone going through a proxy uses the same IP address. Which particular employee of Large Company, Ltd., is calling at the moment? You can't tell. Not only that, but many people these days get their IP addresses assigned dynamically each time they dial in. You certainly don't want to give Joe Blow privileges to Jane Doe's data just because Joe got Jane's old IP address this time.

The only reliable form of identity mapping is that provided by the server, using a name-and-password scheme. Even so, users simply won't put up with entering a name and password for each request, so the server caches the data and uses one of those algorithms mentioned earlier to determine when the cache has gone invalid.

Assuming that the CEO of your company hasn't used his first name or something equally guessable as his password, and that no one has rifled through his secretary's drawer or looked at the yellow sticky note on his monitor, you can be reasonably sure that when the server tells you it's the CEO, then it's the CEO. So then what? Your CGI program still has to go through hoops to keep your CEO from answering the same questions repeatedly as he queries your database. Each response from your CGI program must contain all the information necessary to go backward or forward from that point. It's ugly and tiresome, but necessary.

The second main limitation inherent in CGI programs is related to the way the HTTP spec is designed around delivery of documents. HTTP was never intended for long exchanges or interactivity. This means that when your CGI program wants to do something, such as generate a server-pushed graphic, it must keep the connection open. It does this by pretending that multiple images are really part of the same image.

The poor user's browser keeps displaying its "connection active" signal, thinking it's still in the middle of retrieving a single document. From the browser's point of view, the document just happens to be extraordinarily long. From your script's point of view, the document is actually made up of dozens-perhaps hundreds-of separate images, each one funneled through the pipe in sequence and marked as the next part of a gigantic file that doesn't really exist anywhere.

Perhaps when the next iteration of the HTTP specification is released, and when browsers and servers are updated to take advantage of a keep-alive protocol, we'll see some real innovation. In the meantime, CGI is what it is, warts and all. Although CGI is occasionally inelegant, it's nevertheless still very useful-and a lot of fun.