by Jeffry Dwight
A CGI application is much more like a system utility than a full-blown application. In general, scripts are task-oriented rather than process-oriented. That is, a CGI application has a single job to do: it initializes, does its job, and then terminates. This makes it easy to chart data flow and program logic. Even in a GUI environment, the application doesn't have to worry much about being event-driven: The inputs and outputs are defined, and the program will probably have a top-down structure with simple subroutines.
Programming is a discipline, an art, and a science. The mechanics of the chosen language, coupled with the parameters of the operating system and the CGI environment, make up the science. The conception, the execution, and the elegance (if any) can be either art or science. But the discipline isn't subject to artistic fancy and is platform-independent. This chapter deals mostly with programming discipline, concentrating on how to apply that discipline to your CGI scripts.
Chapter 4, "Understanding Basic CGI Elements," covers script elements in detail. In particular, you'll find a complete discussion of environment variables and parsing. I'll touch on these issues briefly, but only as they relate to script structure and planning.
In this chapter, I'll cover
When your script is invoked by the server, the server passes information to the script in one of two ways: GET or POST. These two methods are known as request methods. The request method used is passed to your script via the environment variable called, appropriately enough, REQUEST_METHOD.
URL Encoding
The HTTP 1.0 specification calls for URL data to be encoded in such a way that it can be used on almost any hardware and software platform. Information specified this way is called URL-encoded; almost everything passed to your script by the server will be URL-encoded.
Parameters passed as part of QUERY_STRING or PATH_INFO will take the form variable1=value1&variable2=value2 and so forth, for each variable defined in your form. Variables are separated by the ampersand (&). If you want to send a real ampersand, it must be escaped; that is, encoded as a two-digit hexadecimal value representing the character. Escapes are indicated in URL-encoded strings by the percent sign (%). Thus, %25 represents the percent sign itself. (25 is the hexadecimal, or base 16, representation of the ASCII value for the percent sign.) All characters above 127 (7F hex) or below 33 (21 hex) are escaped. This includes the space character, which is escaped as %20. Also, the plus sign (+) needs to be interpreted as a space character.
Before your script can deal with the data, it must parse and decode it. Fortunately, these are fairly simple tasks in most programming languages. Your script scans through the string looking for an ampersand. When an ampersand is found, your script chops off the string up to that point and calls it a variable. The variable's name is everything up to the equal sign in the string; the variable's value is everything after the equal sign. Your script then continues parsing the original string for the next ampersand, and so on, until the original string is exhausted.
After the variables are separated, you can safely decode them: turn each plus sign into a space, and replace each percent-sign escape with the character it represents.
When the server passes data to your script with the POST method, check the environment variable called CONTENT_TYPE. If CONTENT_TYPE is application/x-www-form-urlencoded, your data needs to be decoded before use.
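As a rough illustration, the split-then-decode order described in the sidebar might look like this in Python. The function name parse_urlencoded is hypothetical, and keeping only the last value for a repeated name is an assumption made here for brevity, not something the specification requires.

```python
from urllib.parse import unquote


def parse_urlencoded(raw):
    """Split a URL-encoded string into a dict of decoded name/value pairs."""
    variables = {}
    if not raw:
        return variables
    for pair in raw.split("&"):              # variables are separated by "&"
        if not pair:
            continue
        name, _, value = pair.partition("=")  # name is left of the first "="
        # Decode only after splitting, so an escaped "&" or "=" inside the
        # data cannot be mistaken for a separator.
        name = unquote(name.replace("+", " "))
        value = unquote(value.replace("+", " "))
        variables[name] = value               # assumption: last value wins
    return variables
```

Given the string name=Jane+Doe&note=50%25+off%26more, this sketch yields the two variables name ("Jane Doe") and note ("50% off&more"); the escaped ampersand survives because decoding happens after the split.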
The basic structure of a CGI application is simple and straightforward: initialization, processing, output, and termination. Because this chapter deals with concepts, flow, and programming discipline, I'll use pseudocode rather than a specific language for the examples.
Ideally, a script has the following form (with appropriate subroutines for do-initialize, do-process, and do-output):
Real life is rarely this simple, but I'll give the nod to proper form while acknowledging that you'll seldom see it.
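For readers who prefer something concrete, here is a minimal Python sketch of that ideal form. The three routine names mirror the do-initialize, do-process, and do-output subroutines mentioned above; their bodies are placeholders invented for illustration, not part of any real script.

```python
import sys


def do_initialize():
    """Gather input and environment; return whatever state the script needs."""
    return {"greeting": "Hello from a CGI sketch"}   # placeholder state


def do_process(state):
    """Do the script's one job and return the lines to send back."""
    return ["<HTML>", "<H1>%s</H1>" % state["greeting"], "</HTML>"]


def do_output(lines, out):
    """Write the header, the blank line that ends it, and then the body."""
    out.write("Content-type: text/html\n\n")
    for line in lines:
        out.write(line + "\n")


def main(out=None):
    out = out if out is not None else sys.stdout
    state = do_initialize()
    lines = do_process(state)
    do_output(lines, out)
    return 0          # termination: this sketch holds nothing to clean up


if __name__ == "__main__":
    sys.exit(main())
```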
The first thing your script must do when it starts is determine its input, environment, and state. Basic operating-system environment information can be obtained the usual way: from the system registry in NT, from standard environment variables in UNIX, from INI files in Windows, and so forth.
State information will come from the input rather than the operating environment or static variables. Remember: Each time CGI scripts are invoked, it's as if they've never been invoked before. The scripts don't stay running between calls. Everything must be initialized from scratch, as follows:
NOTE
Although GET and POST are the only currently defined operations that apply to CGI, you may encounter PUT or HEAD from time to time if your server supports them and the user's browser uses them. PUT was offered as an alternative to POST, but never received approved RFC status and isn't in general use. HEAD is used by some browsers to retrieve just the headers of an HTML document and isn't applicable to CGI programming; other oddball request methods may be out there, too. Your code should check explicitly for GET and POST and refuse anything else. Don't assume that if the request method isn't GET then it must be POST, or vice versa.
The following is the initialization phase in pseudocode:
retrieve any operating system environment values desired
allocate temporary storage for variables
if environment variable REQUEST_METHOD equals "GET" then
    retrieve contents of environment variable QUERY_STRING;
    if QUERY_STRING is not null, parse it and decode it;
else if REQUEST_METHOD equals "POST" then
    retrieve contents of environment variable QUERY_STRING;
    if QUERY_STRING is not null, parse it and decode it;
    retrieve value of environment variable CONTENT_LENGTH;
    retrieve contents of environment variable CONTENT_TYPE;
    if CONTENT_LENGTH is greater than zero then
        read CONTENT_LENGTH bytes from STDIN;
        parse STDIN data into separate variables;
        if CONTENT_TYPE equals application/x-www-form-urlencoded, then decode parsed variables;
    end if
else
    report an error;
    deallocate temporary storage;
    terminate
end if
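The same initialization phase can be sketched in Python. This is one possible rendering under stated assumptions: the function read_cgi_input and the exception BadRequestMethod are hypothetical names, the environment and input stream are passed in as arguments so the routine can be exercised without a server, and the standard library's parse_qsl does the splitting and decoding in one step.

```python
from urllib.parse import parse_qsl


class BadRequestMethod(Exception):
    """Raised when REQUEST_METHOD is neither GET nor POST."""


def read_cgi_input(environ, stdin):
    """Return decoded (name, value) pairs from the query string and POST body."""
    method = environ.get("REQUEST_METHOD", "")
    if method not in ("GET", "POST"):
        # Check explicitly; never assume that "not GET" means POST.
        raise BadRequestMethod(method)

    pairs = []
    query = environ.get("QUERY_STRING", "")
    if query:
        pairs.extend(parse_qsl(query, keep_blank_values=True))

    if method == "POST":
        try:
            length = int(environ.get("CONTENT_LENGTH") or 0)
        except ValueError:
            length = 0
        content_type = environ.get("CONTENT_TYPE", "")
        if length > 0:
            body = stdin.read(length)      # read exactly CONTENT_LENGTH bytes
            if content_type.startswith("application/x-www-form-urlencoded"):
                text = body.decode("ascii", "replace")
                pairs.extend(parse_qsl(text, keep_blank_values=True))
    return pairs
```

In a real script you would most likely call it as read_cgi_input(os.environ, sys.stdin.buffer), catch BadRequestMethod, and report the error before terminating.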
After initializing its environment by reading and parsing its input, the script is ready to get to work. What happens in this section is much less rigidly defined than during initialization. During initialization, the parameters are known (or can be discovered), and the tasks are more or less the same for every script you'll write. The processing phase, however, is the heart of your script, and what you do here will depend almost entirely on the script's objectives.
Row, Row, Row Your Script
In the UNIX world, a character stream is a special kind of file. STDIN and STDOUT are character streams by default. The operating system helpfully parses streams for you, making sure that everything going through is proper 7-bit ASCII or an approved control code.
NOTE
Those who speak mainly UNIX will frown at the term CRLF, while those who program on other platforms might not recognize \n or \r\n. CRLF, meet \r\n. \r is how C programmers specify a carriage return (CR) character; \n is how C programmers specify a line feed (LF) character. (That's Chr$(10) for LF and Chr$(13) for CR to you Basic programmers.)
The following is a pseudocode representation of a simple processing phase whose objective is to recapitulate all the environment variables gathered in the initialization phase:
output header "content-type: text/html\n"
output required blank line to terminate header "\n"
output "<HTML>"
output "<H1>Variable Report</H1>"
output "<UL>"
for each variable known
    output "<LI>"
    output variable-name
    output "="
    output variable-value
loop until all variables printed
output "</UL>"
output "</HTML>"
This has the effect of creating a simple HTML document containing a bulleted list. Each item in the list is a variable, expressed as name=value.
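A Python version of this report might look like the following sketch. The function name variable_report is hypothetical, and the call to html.escape is an addition that the pseudocode does not include: it keeps a "<" or "&" inside a value from being read as markup.

```python
import html


def variable_report(variables):
    """Build the full CGI response (header plus HTML body) as one string."""
    lines = [
        "Content-type: text/html",
        "",                          # the blank line that terminates the header
        "<HTML>",
        "<H1>Variable Report</H1>",
        "<UL>",
    ]
    for name, value in variables:
        # Escaping is an addition to the pseudocode, not part of it.
        lines.append("<LI>%s=%s" % (html.escape(name), html.escape(value)))
    lines.append("</UL>")
    lines.append("</HTML>")
    return "\n".join(lines) + "\n"
```

The function takes a list of (name, value) pairs and returns the text to write to STDOUT, which keeps the formatting logic separate from the actual output call.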
Termination is nothing more than cleaning up after yourself and quitting. If you've locked any files, you must release them before letting the program end. If you've allocated memory, semaphores, or other objects, you must free them. Failure to do so may result in a "one-shot wonder" of a script: one that works the first time but breaks on every subsequent call. Worse yet, your script may hinder, or even break, other scripts or even the server itself by failing to free up resources and release locks.
On some platforms (most notably Windows NT and, to a lesser extent, UNIX), your file handles and memory objects are closed and reclaimed when your process terminates. Even so, it's unwise to rely on the operating system to clean up your mess. For instance, under NT, the behavior of the file system is undefined when a program locks all or part of a file and then terminates without releasing the locks.
Make sure that your error-exit routine, if you have one (and you should), knows about your script's resources and cleans up just as thoroughly as the main exit routine does.
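One way to get the same cleanup on both the normal exit and the error exit is to put it in a single place that runs either way. The sketch below uses Python's try/finally for that purpose; run_with_cleanup is a hypothetical helper, and the scratch file stands in for whatever resource your script actually holds.

```python
import os
import tempfile


def run_with_cleanup(job):
    """Run job(path) against a scratch file and always remove the file."""
    fd, path = tempfile.mkstemp(prefix="cgi-scratch-")
    os.close(fd)
    try:
        return job(path)
    finally:
        # Runs on the normal path and on the error path alike.
        if os.path.exists(path):
            os.remove(path)
```

Because the finally clause runs whether job returns or raises, there is no second copy of the cleanup code to fall out of date.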
Now that you've seen a script's basic structure, you're ready to learn how to plan a script from the ground up.
In the old days (circa 1950), planning a program meant reams of paper, protractors, rulers, chalkboards, punch tape, stacks of cards, and endless cups of coffee during endless meetings with other white-coated technicians, each of whom could parse your machine code and compare cycles and pre-fetch queue efficiency in his head. Programs emerging from this method of planning tended to be brutal, short, and ugly, but amazingly efficient.
Later on, in the 1970s, planning a program meant reading dozens of weighty tomes, each of which went on at great length and obscurity about discipline, flow charts, data flow versus program logic, the foolishness (or wisdom) of top-down design, inheritability, reusability, encapsulation, data integrity, and so forth. One then drank endless cups of coffee while attending endless meetings and shouting quotations from the books at the other participants.
Finally, someone would notice that the customer was about to sign with another company, and nip off to write the program while everyone else was still arguing. The resultant program was almost always put into use immediately and got the job done. The only thing with a longer life cycle than a program written this way is the ongoing, raging discussion about its inadequacies and lack of adherence to proper rules of structure.
In still more recent times, circa 1980, programs got designed by folks wearing blue jeans, sandals, glasses, and long hair. They seldom attended meetings, and if they did, they never paid attention. They doodled, talked to themselves, kept odd hours, and drank either Mountain Dew or mineral water, depending on whether they worked in California or not. They were known to consume mass quantities of almost raw red meat, and to call for pizza at least once a day, often first thing in the morning.
Sometimes they appeared to be working, but it was hard to tell because the lights were always off in their offices. Eventually they emerged with a program, arrived at by mystical processes unknown to common man, and when asked for the documentation, would tap their foreheads and smile knowingly, and then go home to sleep. These programs provided the foundation of the modern software industry. Scary, huh?
In the 1990s, program design most often happens after a short meeting or two to discuss requirements and milestones in a non-smoking, caffeine-free, hypoallergenic environment. Then someone gets assigned to chart the data flow and program logic, usually using Visio or some such program to drag perfect little boxes and lines around on-screen until the project manager is happy. Another person or team usually codes the project, translating each of the little boxes into a single, simple subroutine that performs a single, simple task. Then the documentation and user-training team moves in, and the programmers can go home until the next project.
Although each approach has its advantages and drawbacks, CGI scripts benefit most from the 1980s model with the discipline of the 1970s, the documentation of the 1990s, and a dash of efficiency from the old days. Follow these steps:
Step 1, of course, is this section's topic, so let's look at that process in more depth:
NOTE
Programmers use semaphores to coordinate among multiple programs, multiple instances of the same program, or even among routines within a single program. Some operating systems have support for semaphores built-in; others require the programmers to develop a semaphore strategy. In the simplest sense, a semaphore is like a toggle switch whose state can be checked: Is the switch on? If so, do this; if not, do that. Often, files are used as semaphores (does the file exist? If so, do this; if not, do that). A more sophisticated method is to try to lock a file for exclusive access (if you can get the lock, do this; if not, wait a bit and try again).
In CGI programming, semaphores are used most often to coordinate among multiple instances of the same CGI script. If, for instance, your script must update a file, it can't assume that the file is available at all times. What if another instance of the same script is in the middle of updating the file right then? The second process must wait until the first one is finished, or else the file will become hopelessly corrupted. The solution is to use a semaphore. Your script checks to make sure that the semaphore is clear. If not, it goes into a short loop, checking the semaphore periodically. After the semaphore is clear, it sets the semaphore so that no other program will interfere. It then performs its critical section (in this case, writing to a file) and clears the semaphore again. Other instances can then each take a turn. The semaphore thus provides a way to manage concurrency safely.
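The file-as-semaphore idea can be sketched in Python as follows. The names acquire, release, and append_line are hypothetical, as are the ".lock" suffix and the retry counts. One deliberate difference from the description above: the sketch folds "check the semaphore" and "set the semaphore" into a single exclusive-create call, because checking first and setting afterward would leave a window in which two instances both see the semaphore as clear.

```python
import os
import time


def acquire(lock_path, retries=50, delay=0.1):
    """Try to create the lock file; return True once this process owns it."""
    for _ in range(retries):
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so the
            # check and the set happen as one step.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            time.sleep(delay)        # another instance holds it; wait, retry
        else:
            os.close(fd)
            return True
    return False


def release(lock_path):
    """Clear the semaphore so the next instance can take its turn."""
    os.remove(lock_path)


def append_line(data_path, text, lock_path=None):
    """Append one line to data_path inside the critical section."""
    lock_path = lock_path or data_path + ".lock"
    if not acquire(lock_path):
        raise TimeoutError("could not get " + lock_path)
    try:
        with open(data_path, "a") as handle:
            handle.write(text + "\n")
    finally:
        release(lock_path)           # clear the semaphore even on error
```

A lock file left behind by a crashed instance would block every later caller until someone removes it, which is one more reason to take the cleanup advice in this chapter seriously.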
NOTE
An early-out algorithm is one that tests for the exception, or least-significant case, and exits with a predefined answer rather than exercise the algorithm to determine the answer. For example, division algorithms usually test for a divide-by-two operation, and do a shift instead of a divide.
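The divide-by-two case from the note can be written out as a tiny Python sketch. The function divide is hypothetical and assumes a non-negative integer dividend and a positive integer divisor; it is meant only to show the shape of an early out, not to suggest that Python code needs this optimization.

```python
def divide(n, d):
    """Integer division of non-negative n by positive d, with an early out."""
    if d == 2:
        return n >> 1      # early out: a one-bit shift replaces the divide
    return n // d          # general case: run the full algorithm
```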
Here's a brief overview of the standard environment variables you're likely to encounter. Each server implements the majority of them consistently, but there are variations, exceptions, and additions. In general, you're more likely to find a new, otherwise undocumented variable than to find a documented variable omitted. The only way to be sure, though, is to check your server's documentation.
Chapter 4, "Understanding Basic CGI Elements," deals with each variable in some depth. This section is taken from the NCSA specifications and is the closest thing to a "standard" that you'll find. In case you've misplaced the URL for the NCSA CGI specification, here it is again:
http://www.w3.org/hypertext/WWW/CGI/
The following environment variables are set each time the server launches an instance of your script, and are private and specific to that instance:
NOTE
AUTH_TYPE and REMOTE_USER are set only after a user successfully authenticates (usually via a user name and password) his identity to the server. Hence, these variables are useful only when restricted areas are established, and then only in those areas.
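Because these two variables may simply be absent, a script should treat a missing value as "not authenticated" rather than as an error. A minimal Python sketch follows; authenticated_user is a hypothetical helper, and requiring both variables to be present is a conservative assumption made for this example.

```python
def authenticated_user(environ):
    """Return the authenticated user name, or None outside restricted areas."""
    auth_type = environ.get("AUTH_TYPE")
    user = environ.get("REMOTE_USER")
    if auth_type and user:       # assumption: trust the name only if both are set
        return user
    return None
```

In a running script you would pass os.environ as the argument.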
CGI programmers face two portability issues: platform independence and server independence. By platform independence, I mean the capability of the code to run without modification on a hardware platform or operating system different from the one for which it was written. Server independence is the capability of the code to run without modification on another server using the same operating system.
The best way to keep your CGI script portable is to use a commonly available language and avoid platform-specific code. It sounds simple, right? In practice, this means using either C or Perl and not doing anything much beyond formatting text and outputting graphics.
Does this leave Visual Basic, AppleScript, and UNIX shell scripts out in the cold? Yes, I'm afraid so, for now. However, platform independence isn't the only criterion to consider when selecting a CGI platform. There's also speed of coding, ease of maintenance, and ability to perform the chosen task.
Certain types of operations simply aren't portable. If you develop for 16-bit Windows, for instance, you'll have great difficulty finding equivalents on other platforms for the VBX and DLL functions you use. If you develop for 32-bit Windows NT, you'll find that all your asynchronous Winsock calls are meaningless in a UNIX environment. If your shell script does a system() call to launch grep and pipe the output back to your program, you'll find nothing remotely similar in the NT environment. And AppleScript is good only on Macs, period!
If one of your mandates is the capability to move code among platforms with a minimum of modification, you'll probably have the best success with C. Write your code using the standard functions from the ANSI C libraries, and avoid making other operating system calls. Unfortunately, following this rule will limit your scripts to very basic functionality. If you wrap your platform-dependent code in self-contained routines, however, you minimize the work needed to port from one platform to the next. As you saw earlier in the section "Planning Your Script," when talking about encapsulation, a properly designed program can have any module replaced in its entirety without affecting the rest of the program. Using these guidelines, you may have to replace a subroutine or two, and you'll certainly have to recompile; however, your program will be portable.
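The idea of wrapping platform-dependent code in self-contained routines is not specific to C. As an illustration only, the Python sketch below isolates file locking, which differs between Windows and UNIX, behind two small functions; everything else calls only lock_file and unlock_file. The names lock_file, unlock_file, and bump_counter, and the ".lock" guard file, are hypothetical.

```python
import os

if os.name == "nt":
    import msvcrt

    def lock_file(handle):
        """Lock the first byte of an open file (Windows)."""
        handle.seek(0)
        msvcrt.locking(handle.fileno(), msvcrt.LK_LOCK, 1)

    def unlock_file(handle):
        handle.seek(0)
        msvcrt.locking(handle.fileno(), msvcrt.LK_UNLCK, 1)
else:
    import fcntl

    def lock_file(handle):
        """Take an exclusive advisory lock on an open file (UNIX)."""
        fcntl.flock(handle.fileno(), fcntl.LOCK_EX)

    def unlock_file(handle):
        fcntl.flock(handle.fileno(), fcntl.LOCK_UN)


def bump_counter(path):
    """Portable part: read, increment, and rewrite a hit counter."""
    with open(path + ".lock", "w") as guard:
        lock_file(guard)                 # the only platform-dependent calls
        try:
            try:
                with open(path) as handle:
                    count = int(handle.read() or 0)
            except FileNotFoundError:
                count = 0                # first hit: no counter file yet
            count += 1
            with open(path, "w") as handle:
                handle.write(str(count))
        finally:
            unlock_file(guard)
    return count
```

Porting bump_counter to a third platform would mean supplying one more pair of lock routines; the counter logic itself would not change.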
Perl scripts are certainly easier to maintain than C programs, mainly because there's no compile step. You can change the program quickly when you figure out what needs to be changed. And there's the rub: Perl is annoyingly obtuse, and the libraries tend to be much less uniform (even between versions on the same platform) than do C libraries. Also, Perl for NT is fairly new and still quirky (as if anything related to Perl can be called more quirky than another part).
If, however, you dream of bit masks, think two-letter code words are more descriptive than named functions, and believe in your heart that programming syntax should be as convoluted and chock full of punctuation as possible, then you and Perl are soul mates. You won't have much trouble porting your application among platforms once you identify the platform-dependencies and find (or write) libraries for the standard functions.
Far more important than platform independence (unless you're writing scripts only for your own pleasure) is server independence. Server independence is fairly easy to achieve, but for some reason seems to be a stumbling block to beginning script writers. To be server independent, your script must run without modification on any server using the same operating system. Only server-independent programs can be useful as shareware or freeware, and without a doubt, server independence is a requirement for commercial software.
Most programmers think of obvious issues, such as not assuming that the server has a static IP address. The following are some other rules of server independence that, although obvious once stated, nevertheless get overlooked time and time again:
When you talk about CGI libraries, there are two possibilities: libraries of code you develop and want to reuse in other projects, and publicly available libraries of programs, routines, and information.
If you follow the advice given earlier in the "Planning Your Script" section about writing your code in a black-box fashion, you'll soon discover that you're building a library of routines that you'll use over and over. For instance, after you puzzle out how to parse out URL-encoded data, you don't need to do it again. And when you have a basic main() function written, it will probably serve for every CGI program you ever write. This is also true for generic routines, such as querying a database, parsing input, and reporting runtime errors.
How you manage your personal library depends on the programming language you use. With C and assembler, you can precompile code into actual .lib files, with which you can then link your programs. Although possible, this likely is overkill for CGI and doesn't work for interpreted languages, such as Perl and Visual Basic. (Although Perl and VB can call compiled libraries, you can't link with them in a static fashion the way you can with C.) The advantage of using compiled libraries is that you don't have to recompile all your programs when you change code in the library. If the library is loaded at runtime (a DLL), you don't need to change anything. If the library is linked statically, all you need to do is relink.
Another solution is to maintain separate source files and simply include them with each project. You might have a single, fairly large file that contains the most common routines while putting seldom-used routines in files of their own. Keeping the files in source format adds a little overhead at compile time, but not enough to worry about, especially when compared to the time savings you gain by writing the code only once. The disadvantage of this approach is that when you change your library code, you must recompile all your programs to take advantage of the change.
Nothing can keep you from incorporating public-domain routines into your personal library either. As long as you make sure that the copyright and license allow you to use and modify the source code without royalties or other stipulations, you should strip out the interesting bits and toss them into your library.
Well-designed and well-documented programs provide the basis for new programs. If you're careful to isolate the program-specific parts into subroutines, there's no reason not to cannibalize an entire program's structure for your next project.
You can also develop platform-specific versions of certain subroutines and, if your compiler will allow it, automatically include the correct ones for each type of build. At the worst, you'll have to manually specify which subroutines you want.
The key to making your code reusable this way is to make it as generic as possible. Not so generic that, for instance, a currency printing routine needs to handle both yen and dollars, but generic enough that any program that needs to print out dollar amounts can call that subroutine. As you upgrade, swat bugs, and add capabilities, keep each function's inputs and outputs the same, even when you change what happens inside the subroutine. This is the black-box approach in action. By keeping the calling convention and the parameters the same, you're free to upgrade any piece of code without fear of breaking older programs that call your function.
Another technique to consider is using function stubs. Say that you decide eventually that a single routine to print both yen and dollars is actually the most efficient way to go. But you already have separate subroutines, and your old programs wouldn't know to pass the additional parameter to the new routine. Rather than go back and modify each program that calls the old routines, just "stub out" the routines in your library so that the only thing they do is call the new, combined routine with the correct parameters. In some languages, you can do this by redefining the routine declarations; in others, you actually need to code a call and pay the price of some additional overhead. But even so, the price is far less than that of breaking all your old programs.
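Using the yen-and-dollars example, a stubbed-out library might look like this Python sketch. All three function names are hypothetical. The two old routines keep their original inputs and outputs, so older programs that call them keep working, while the real work moves into the combined routine.

```python
def format_currency(amount, symbol, decimals):
    """New combined routine: one formatter for any currency."""
    return "%s%s" % (symbol, format(amount, ",.%df" % decimals))


def format_dollars(amount):
    """Old entry point, now a stub; callers see the same inputs and outputs."""
    return format_currency(amount, "$", 2)


def format_yen(amount):
    """Old entry point, now a stub; yen has no fractional unit."""
    return format_currency(amount, "\u00a5", 0)
```

Each stub costs one extra function call, which is the "additional overhead" mentioned above.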
The Internet is rich with public-domain sample code, libraries, and precompiled programs. Although most of what you'll find is UNIX-oriented (because it has been around longer), there's nevertheless no shortage of routines for Windows NT.
Here's a list of some of the best sites on the Internet with a brief description of what you'll find at each site. This list is far from exhaustive. Hundreds of sites are dedicated to, or contain information about, CGI programming. Hop onto your Web browser and visit your favorite search engine. Tell it to search for "CGI" or "CGI libraries" and you'll see what I mean. To save you the tedium of wading through all the hits, I've explored them for you. The following are the ones that struck me as most useful:
I could go on listing sites forever, it seems, but that's enough to get you started.
By far, the biggest limitation of CGI is its statelessness. As you learned in Chapter 1, "Introducing CGI," an HTTP Web server doesn't remember callers between requests. In fact, what appears to the user as a single page may actually be made up of dozens of independent requests, either all to the same server or to many different servers. In each case, the server fulfills the request, then hangs up and forgets the user ever dropped by.
The capability to remember what a caller was doing the last time through is called remembering the user's state. HTTP, and therefore CGI, doesn't maintain state information automatically. The closest things to state information in a Web transaction are the user's browser cache and a CGI program's cleverness. For example, if a user leaves a required field empty when filling out a form, the CGI program can't pop up a warning box and refuse to accept the input. The program's only choices are to output a warning message and ask the user to click the browser's back button; or output the entire form again, filling in the value of the fields that were supplied and letting the user try again, either correcting mistakes or supplying the missing information.
There are several workarounds for this problem, none of them terribly satisfactory. One idea is to maintain a file containing the most recent information from all users. When a new request comes through, hunt up the user in the file and assume the correct program state based on what the user did the last time. The problems with this idea are that it's very hard to identify a Web user, and a user may not complete the action, yet visit again tomorrow for some other purpose. An incredible amount of effort has gone into algorithms to maintain state only for a limited time period: a period that's long enough to be useful, but short enough not to cause errors. However, these solutions are terribly inefficient and ignore the other problem: identifying the user in the first place.
You can't rely on the user to provide his identity. Not only do some want to remain anonymous, but even those who want you to know their names can misspell them from time to time. Okay, then, what about using the IP address as the identifier? Not good. Everyone going through a proxy uses the same IP address. Which particular employee of Large Company, Ltd., is calling at the moment? You can't tell. Not only that, but many people these days get their IP addresses assigned dynamically each time they dial in. You certainly don't want to give Joe Blow privileges to Jane Doe's data just because Joe got Jane's old IP address this time.
The only reliable form of identity mapping is that provided by the server, using a name-and-password scheme. Even so, users simply won't put up with entering a name and password for each request, so the server caches the data and uses one of those algorithms mentioned earlier to determine when the cache has gone invalid.
Assuming that the CEO of your company hasn't used his first name or something equally guessable as his password, and that no one has rifled through his secretary's drawer or looked at the yellow sticky note on his monitor, you can be reasonably sure that when the server tells you it's the CEO, then it's the CEO. So then what? Your CGI program still has to go through hoops to keep your CEO from answering the same questions repeatedly as he queries your database. Each response from your CGI program must contain all the information necessary to go backward or forward from that point. It's ugly and tiresome, but necessary.
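One common way to make each response carry its own state is to write the answers gathered so far back into the next form as hidden fields. The text above does not prescribe this technique; it is offered here as an assumption about how the requirement might be met. The function names and the /cgi-bin/survey action URL in this Python sketch are hypothetical.

```python
import html


def hidden_fields(state):
    """Render each saved answer as a hidden form field."""
    return "\n".join(
        '<INPUT TYPE="hidden" NAME="%s" VALUE="%s">'
        % (html.escape(name, quote=True), html.escape(value, quote=True))
        for name, value in state.items()
    )


def next_page(state, question, action="/cgi-bin/survey"):
    """Build a form that carries all earlier answers along with the new one."""
    return "\n".join([
        '<FORM METHOD="POST" ACTION="%s">' % html.escape(action, quote=True),
        hidden_fields(state),
        '%s <INPUT TYPE="text" NAME="answer">' % html.escape(question),
        '<INPUT TYPE="submit">',
        "</FORM>",
    ])
```

When the user submits the form, the script receives the old answers together with the new one, so it can rebuild its state from the input alone. Remember that hidden fields travel through the user's browser and can be altered there, so treat them like any other input.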
The second main limitation inherent in CGI programs is related to the way the HTTP spec is designed around delivery of documents. HTTP was never intended for long exchanges or interactivity. This means that when your CGI program wants to do something, such as generate a server-pushed graphic, it must keep the connection open. It does this by pretending that multiple images are really part of the same image.
The poor user's browser keeps displaying its "connection active" signal, thinking it's still in the middle of retrieving a single document. From the browser's point of view, the document just happens to be extraordinarily long. From your script's point of view, the document is actually made up of dozens, perhaps hundreds, of separate images, each one funneled through the pipe in sequence and marked as the next part of a gigantic file that doesn't really exist anywhere.
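To make the "gigantic file" concrete, here is a Python sketch of what such a response stream might look like. It assumes the multipart/x-mixed-replace content type used by Netscape-style server push, which the text above does not name; the function server_push, the boundary string, and the image/gif default are all hypothetical choices for illustration.

```python
def server_push(frames, boundary="FrameBoundary", content_type="image/gif"):
    """Yield the byte chunks of one multipart/x-mixed-replace response."""
    # One header announces a single, multi-part document.
    yield ("Content-type: multipart/x-mixed-replace;boundary=%s\n\n"
           % boundary).encode("ascii")
    for frame in frames:
        # Each image is marked as the next part of that same document.
        yield ("--%s\nContent-type: %s\n\n"
               % (boundary, content_type)).encode("ascii")
        yield frame
        yield b"\n"
    # The closing boundary finally tells the browser the document is done.
    yield ("--%s--\n" % boundary).encode("ascii")
```

A script would write each chunk to its binary output stream and flush after every frame, pausing between frames as needed; the connection stays open until the closing boundary is sent.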
Perhaps when the next iteration of the HTTP specification is released, and when browsers and servers are updated to take advantage of a keep-alive protocol, we'll see some real innovation. In the meantime, CGI is what it is, warts and all. Although CGI is occasionally inelegant, it's nevertheless still very useful, and a lot of fun.