Chapter 11

Perl and the Internet

by Bob Breedlove


What Is CGI and What Can It Do?

The Common Gateway Interface (CGI) is a standard for interfacing external applications with information servers, such as Web servers. A plain HTML document that the Web server retrieves is static: a text file that doesn't change. A CGI program, on the other hand, executes in real time each time it is called, so it can output dynamic information.

CGI started with the idea of hooking a UNIX database to the Internet. Figure 11.1 shows the simple concept of the CGI interface.

Figure 11.1: The concept of CGI programs.

CGI began as a way to interface databases to the Web, but CGI programs can provide whatever access you care to program, within the limits of the server and browser.

The Web browser communicates with the host server (daemon) using HTTP. When the Web browser requests the Uniform Resource Locator (URL), or address, of a CGI program, the server starts the CGI program with the CGI program's standard output attached to the daemon's standard input. The CGI program services the request and communicates with the database engine using its application program interface (API), which usually uses some form of interprocess communication, shared memory, or sockets.

The database engine retrieves the requested data and returns it to the CGI program. The CGI program formats a Web page using hypertext markup language (HTML) and returns it to the Web daemon via its standard output. The daemon returns the HTML to the browser on the client machine, which formats the page for display.

All CGI programs use this basic concept. A CGI program can output more than an HTML page, perform all the processing itself, or access any type of server. As Figure 11.1 shows, the basic process remains the same.

Terminology

In this chapter, many of the terms are the same as those used in the hypertext markup language (HTML) standards. Some terms are unique to CGI and might not match common usage.

Environment variable: A named parameter that carries information from the server to the script. It is not necessarily a variable in the operating system's environment, although that is the most common implementation.
Script: The software that is invoked by the server via this interface. It need not be a stand-alone program, but could be a dynamically loaded or shared library or even a subroutine in the server.
Server: The application program that invokes the script in order to service requests. Generally, this application runs as an independent process on the host computer. In UNIX terms, this program is referred to as a daemon.

What Are the Benefits of Using CGI?

HTML pages are static. That is, once you create and place them on the server, they are transmitted in the same form each time they are requested. As Figure 11.2 shows, CGI programs run on the server in real time. With CGI programs, a user can retrieve real-time data and receive dynamically formatted pages, so the pages can be different with each transmission.

Figure 11.2: CGI programs access data, equipment, and other processes and return a myriad of document types.

CGI programs can return a myriad of document types, not just HTML pages, to the server. They can send back an image, a plain-text document, or even an audio clip, and they can redirect the user to other documents or CGI programs.

CGI programs can also store and retrieve information about the browser in records called cookies on the client machine. These cookies store information about the user during one session, and the browser transmits it when the user next requests your pages. Cookies are typically used to validate a user or to allow the user to set up a custom page for his or her own use on your system. For example, a cookie can record what the user has accessed during visits to the site, and that record can then be used to customize the information (such as advertising) that the user sees on the next visit.

Many different compiled languages or scripting languages can be used to write CGI programs. Typical choices include Perl, C/C++, and the UNIX shells.

Many people prefer to write CGI in a scripting language such as Perl or a shell rather than a compiled language. This is because scripts don't require compiling, which results in easier incremental development and debugging. The programs to interpret the scripts and their environments usually take up less room on the server.

Tools for producing CGI programs that do not require knowledge of a particular programming language are coming to the market. Choose a language or tool with which you are familiar or which supports the application programming interface (API) of your database engine.

There are hundreds of modules, sample code, and complete applications available for any of the more common languages, especially Perl and C/C++. There are also libraries of routines for retrieving information stored by the daemon and returning information to the Web daemon. The CD-ROM accompanying this book has several examples and URLs to sites with additional code and information.

Unlike Java and other evolving technologies, CGI is a relatively mature protocol. Most Web servers use it (some with variations). Because of the explosion of the Internet, many people have experience with it on all types of platforms and servers.

What Are the Negatives of Using CGI?

One of the biggest negatives of using CGI for Web programming is security.

Running a CGI program is like inviting the world to run a program on your system; therefore, there are security considerations when using CGI programs. Because of the possibility of abuse, most HTTP daemons limit the directories in which CGI programs can reside. Most HTTP daemons also run CGI programs under a user ID that limits the program's access to other parts of the system.

Hackers can take advantage of poorly protected host systems, and they have even exploited security holes in some mail systems. CGI programs, therefore, should check data before passing it on.

Because CGI programs run entirely on the host computer, they can be a drain on host and network resources. If the user receives a form on his browser and enters information, the information, in whatever state, is transmitted to the CGI program through the HTTP host, and the CGI program edits (validates) it. If errors are found, the information returns over the network to the user for correction. Because all editing takes place on the host computer, incorrect information must cross the network before it is discovered.

New technologies, such as Java, promise the capability of editing on the client before the data is transmitted to the host. This should minimize the amount of invalid data that is transmitted over the network.

The Protocols

The basics of CGI are relatively straightforward. As seen in Figure 11.1, the server communicates with the CGI program primarily through environment variables and under some conditions through the command line.

NOTE
Although most CGI implementations use environment variables to transmit information from the server to the CGI program, some implementations may pass information through specific files. An example is MS-DOS, where environment space can sometimes be at a premium.

The CGI program communicates with the server through standard output (stdout). In most instances, the programmer writes as if the CGI program communicates directly with the client browser. The server passes most information sent by the CGI program to the browser unaltered.

The CGI program activates when the server receives a request for the program's URL. The server places information in environment variables and activates the program, piping the program's stdout to the server's standard input (stdin).

The information passes from the server to the program in a standard format. It is the CGI program's job to read and decode this information, perform specific processing, and return information to the browser.

There is no connection maintenance between send/receive exchanges in CGI interactions. If an application requires multiple send/receive exchanges to complete its work, the CGI program is responsible for maintaining information about the conversation between exchanges.

The CGI program can write the information to files or databases on the host, or it can pass the information through cookies (which are supported by many browsers), in which case the information is stored on the client system. This book does not include a discussion of this part of the CGI application function because it is not covered by the CGI specification.

Environment Variables

The server must pass information about the request from the browser to the CGI program. In order to do this, it uses a combination of the command line and environment variables. The following variables are passed to the CGI program for every request:

Environment variable Purpose/Comment
GATEWAY_INTERFACE The revision of the CGI specification to which this server complies. The format is CGI/revision.
SERVER_NAME The server's host name, DNS alias, or IP address as it appears in URLs.
SERVER_SOFTWARE The name and version of the information server software answering the request (and running the gateway). The format is name/version.

The next set of environment variables is specific to the request serviced by the CGI program.

Environment variable Purpose/Comment
SERVER_PROTOCOL The name and revision of the information protocol with which this request came in. The format is protocol/revision.
SERVER_PORT The port number to which the request was sent.
REQUEST_METHOD The method with which the request was made. For HTTP, this is GET, HEAD, POST, and so on.
PATH_INFO The extra path information, as given by the client. Scripts can be accessed by their virtual path names followed by extra information at the end of this path. The extra information is sent as PATH_INFO, and if it comes from a URL, the server should decode it before passing it to the CGI script.
PATH_TRANSLATED The server provides a translated version of PATH_INFO, which takes the path and does any virtual-to-physical mapping.
SCRIPT_NAME A virtual path to the script being executed, used for self-referencing URLs.
QUERY_STRING The information that follows the ? in the URL referencing this script. This is the query information. The server does not decode it in any way. This variable is always set when there is query information, regardless of command-line decoding.
REMOTE_HOST The host name making the request. If the server does not have this information, it should set REMOTE_ADDR and leave this unset.
REMOTE_ADDR The IP address of the remote host making the request.
AUTH_TYPE If the server supports user authentication and the script is protected, this is the protocol-specific authentication method used to validate the user.
REMOTE_USER This is the authenticated user name if the server supports user authentication and the script is protected.
REMOTE_IDENT If the HTTP server supports RFC 931 identification, this variable is set to the remote user name retrieved from the server. Usage of this variable should be limited to logging only.
CONTENT_TYPE For queries with attached information, such as HTTP POST and PUT, this is the content type of the data.
CONTENT_LENGTH The length of the content as given by the client.
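
As a rough illustration of how a script might pick up the request-specific variables listed above, here is a minimal sketch in Python (the same lookups can be written in Perl or any other CGI language). The function name request_summary and the sample values are hypothetical; only the variable names come from the table.

```python
import os


def request_summary(environ=None):
    """Collect a few request-specific CGI variables into a dictionary.

    Any variable the server did not set is reported as an empty string.
    """
    if environ is None:
        environ = os.environ
    return {
        "method": environ.get("REQUEST_METHOD", ""),
        "script": environ.get("SCRIPT_NAME", ""),
        "path_info": environ.get("PATH_INFO", ""),
        "query": environ.get("QUERY_STRING", ""),
        # REMOTE_HOST may be unset; fall back to REMOTE_ADDR in that case.
        "client": environ.get("REMOTE_HOST") or environ.get("REMOTE_ADDR", ""),
    }


# Hand-built environment for illustration; a real script would call
# request_summary() with no argument to read the process environment.
sample = {
    "REQUEST_METHOD": "GET",
    "SCRIPT_NAME": "/cgi-bin/myprog.cgi",
    "QUERY_STRING": "mydata+is+here",
    "REMOTE_ADDR": "192.0.2.10",
}
print(request_summary(sample))
```

Passing the environment in as a parameter is a design choice made here so the function can be exercised without a Web server.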

In addition to these variables, any header lines received from the client are placed into the environment with the prefix HTTP_ followed by the header name. Any dash characters in the header name are changed to underscore characters. The server can exclude any headers that it has already processed, such as Authorization, Content-type, and Content-length. The server can also exclude any or all of these headers if including them would exceed system environment limits.

An example of these header lines is the HTTP_ACCEPT variable defined in CGI/1.0. Another example is the header User-Agent.

Environment variable Purpose/Comment
HTTP_ACCEPT The MIME types that the client accepts, as given by HTTP headers. Other protocols might need to obtain this information elsewhere. Each item in this list is separated by commas according to the HTTP spec. The format is type/subtype, type/subtype.
HTTP_USER_AGENT The browser the client uses to send the request. The general format is software/version library/version.
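
The naming rule for client headers can be sketched in a few lines of Python. The helper names header_to_env_name and accepted_types are hypothetical, and the uppercase conversion is inferred from the HTTP_ACCEPT and HTTP_USER_AGENT examples above rather than stated in the text.

```python
def header_to_env_name(header_name):
    """Map an HTTP header name to its CGI environment variable name.

    Dashes become underscores and the HTTP_ prefix is added. Uppercasing
    is assumed from the HTTP_USER_AGENT example.
    """
    return "HTTP_" + header_name.upper().replace("-", "_")


def accepted_types(environ):
    """Split HTTP_ACCEPT into a list of type/subtype items."""
    raw = environ.get("HTTP_ACCEPT", "")
    # Items are comma-separated; surrounding whitespace is not significant.
    return [item.strip() for item in raw.split(",") if item.strip()]


print(header_to_env_name("User-Agent"))
print(accepted_types({"HTTP_ACCEPT": "text/html, image/gif"}))
```

This sketch leaves any parameters attached to an item (for example, a quality value) in place; a script that cares about them would need to parse each item further.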

Getting Information from the Server

Every time a server receives a request for the URL of a CGI program, it executes the program in real time. Most of the program's output goes directly to the client. Because CGI uses the command line for other purposes, CGI programs do not typically take command-line arguments.

CGI uses environment variables to send parameters to the program. The two major environment variables for this purpose are QUERY_STRING and PATH_INFO.

QUERY_STRING is anything that follows the first question mark (?) in the URL. For example, the URL http://www.myhost.com/cgi-bin/myprog.cgi activates the program myprog.cgi in the /cgi-bin directory under the document root on host www.myhost.com. To pass additional information to the program, the URL expands to


http://www.myhost.com/cgi-bin/myprog.cgi?mydata+is+here

The server places the information in QUERY_STRING as


QUERY_STRING=mydata+is+here

This string is encoded in the standard URL format, which changes spaces to plus signs (+) and represents special characters as a percent sign followed by a two-digit hexadecimal number (%xx). The CGI program must decode the string in order to use it.
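
The decoding step can be sketched in Python with the standard library function urllib.parse.unquote_plus, which turns plus signs into spaces and then expands %xx sequences. The wrapper name decode_query_string is hypothetical.

```python
from urllib.parse import unquote_plus


def decode_query_string(query_string):
    """Decode a URL-encoded string: '+' becomes a space, %xx becomes a character."""
    return unquote_plus(query_string)


print(decode_query_string("mydata+is+here"))  # mydata is here
print(decode_query_string("50%25+off"))       # 50% off
```

The order matters: plus signs are converted before the %xx sequences are expanded, so an encoded plus sign (%2B) survives as a literal plus.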

You can add the QUERY_STRING information using either an ISINDEX document or an HTML form (with the GET method). Another way is to manually embed it in the HTML anchor that references your gateway. This string usually is an information query: for example, what the user wants to search for in the databases or the encoded results of your feedback GET form.

If the Web daemon is not decoding results from a form, it also decodes the query string onto the command line. This means that each word of the query string is in a different element of ARGV. The CGI program receives, for example, the query string forms rule as


argv[1]="forms"

argv[2]="rule"

No decoding or other processing is necessary in order to use the data.

CGI allows extra information to be embedded in the URL, which carries extra context-specific information to the scripts. The PATH_INFO information is appended to the end of the URL and passes through without the server encoding any of it.

A typical use for PATH_INFO information is to provide directory or file information for processing. Suppose that the CGI program


http://www.myhost.com/cgi-bin/myprog.cgi

needs to process information in directory /mydir. This information passes as an addition to the URL:


http://www.myhost.com/cgi-bin/myprog.cgi/mydir

TIP
One use of PATH_INFO is to pass configuration filenames to a CGI program. The same base CGI program can then handle multiple configurations by including the configuration file in the URL for the application.

Myprog.cgi knows the location of the document relative to the DocumentRoot via the PATH_INFO environment variable or the actual path to the document via the PATH_TRANSLATED environment variable, which the server generates. Because the first slash / passes with the PATH_INFO variable, it must be stripped if it is not needed.
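
A minimal Python sketch of stripping the leading slash from PATH_INFO follows. The function name extra_path is hypothetical, and the sample value /mydir is taken from the example URL above.

```python
def extra_path(environ):
    """Return PATH_INFO without its leading slash ('' if PATH_INFO is unset)."""
    path_info = environ.get("PATH_INFO", "")
    # Only the first slash is removed; the rest of the path is left as given.
    if path_info.startswith("/"):
        path_info = path_info[1:]
    return path_info


print(extra_path({"PATH_INFO": "/mydir"}))  # mydir
```

Because PATH_INFO is supplied by the client, a program that uses the result as a directory or configuration filename should check it before opening anything.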

Getting Form Data

Forms send their information to the CGI program with either the GET or the POST method. Each method delivers the form information in a different manner.

<FORM ACTION=CGI URL METHOD=GET>

If the form tag includes the GET method, the CGI program receives the encoded form data in the QUERY_STRING environment variable. This can be a method of maintaining information about a set of request/send transactions between the client and CGI program. For example, the CGI program might store information about the client, indexed by a serial number key, and encode this key in the URL specified in the ACTION= option of the form tag. The URL with the additional key information returns to the program with a click of the Submit button. This allows the program to retrieve the stored information and restore its working environment.

<FORM ACTION=CGI URL METHOD=POST>

If the form tag includes the POST method, the CGI program receives the encoded form data through stdin. Note that the server does not send an indication of end of file (EOF) at the end of data. The program must read the CONTENT_LENGTH environment variable in order to determine the length of the input to read.
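
Reading exactly CONTENT_LENGTH bytes might look like the following Python sketch. The function name read_post_body is hypothetical, and the in-memory stream stands in for the program's real standard input.

```python
import io


def read_post_body(environ, stream):
    """Read exactly CONTENT_LENGTH bytes from stream; do not wait for EOF."""
    try:
        length = int(environ.get("CONTENT_LENGTH", "") or 0)
    except ValueError:
        # A missing or malformed length is treated as "no data" in this sketch.
        length = 0
    if length <= 0:
        return b""
    return stream.read(length)


# A real CGI program would pass os.environ and sys.stdin.buffer instead
# of these stand-ins.
fake_stdin = io.BytesIO(b"A=A+B+C&B=12345&C=DE")
body = read_post_body({"CONTENT_LENGTH": "20"}, fake_stdin)
print(body.decode("ascii"))
```

Reading a fixed number of bytes, rather than reading until end of file, is what keeps the program from blocking when the server leaves the input open.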

Decoding Form Information

Both the GET and POST methods send URL-encoded TAG=data pairs separated by ampersands (&). Plus signs (+) replace spaces, and certain characters are encoded as %xx hexadecimal characters. A NAME tag identifies each FORM variable, and this NAME is placed in the TAG part of the data pair. For example, given the following form:


<FORM METHOD=POST>

<INPUT NAME="A" SIZE=5>  (Input "A B C")

<INPUT NAME="B" SIZE=5>  (Input "12345")

<INPUT NAME="C" SIZE=2>  (Input "DE")

</FORM>

The CGI program receives the following:


CONTENT_LENGTH=20

stdin: A=A+B+C&B=12345&C=DE

Luckily, several library routines are available for various languages to decode URL-encoded data. This makes life easier when creating CGI programs.
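
As one example of such a library routine, Python's standard urllib.parse.parse_qsl splits the TAG=data pairs and decodes them in one step. The wrapper name decode_form is hypothetical; the input string is the one from the form example above.

```python
from urllib.parse import parse_qsl


def decode_form(encoded):
    """Turn URL-encoded TAG=data pairs into a dictionary of decoded strings."""
    # keep_blank_values=True retains fields the user left empty.
    return dict(parse_qsl(encoded, keep_blank_values=True))


fields = decode_form("A=A+B+C&B=12345&C=DE")
print(fields)  # {'A': 'A B C', 'B': '12345', 'C': 'DE'}
```

A dictionary keeps only the last value when a NAME appears more than once; a form with repeated names would need urllib.parse.parse_qs, which returns a list of values for each name.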

Returning Information to the Client

CGI programs can return many document types, such as HTML pages, images, plain-text documents, and audio clips. Other types are defined by MIME type. CGI programs also can return references to other documents. The client must know what kind of document the program is sending it so it can display it accordingly. For the client to know this, the CGI program must tell the server what type of document it is returning.

To tell the server what kind of document the program is sending back, whether it is a full document or a reference to one, CGI requires the CGI program to place a short header on the output. This header is ASCII text, consisting of lines separated by either line feeds or carriage returns (or both) followed by a single blank line. The output body then follows in its native format.

A Document with MIME Type

For a full document, the CGI program must tell the server what kind of document it is delivering via a MIME type. Examples of common MIME types are text/html for HTML and text/plain for ASCII text.

Here is an example of an HTML document:


Content-type: text/html

<HTML>

<HEAD>

<TITLE>Title Goes Here</TITLE>

</HEAD>

<BODY>

<H1>Heading Goes Here</H1>

Body of the HTML document.

</BODY>

</HTML>
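
A program that produces output of this shape could be sketched in Python as follows. The function name build_html_response is hypothetical, and escaping the text with html.escape is a precaution added here, not something the example above requires.

```python
import html
import sys


def build_html_response(title, body_text):
    """Return a CGI response: Content-type header, blank line, HTML body."""
    safe_title = html.escape(title)
    safe_body = html.escape(body_text)
    lines = [
        "Content-type: text/html",
        "",  # the blank line that ends the header
        "<HTML>",
        "<HEAD>",
        f"<TITLE>{safe_title}</TITLE>",
        "</HEAD>",
        "<BODY>",
        f"<H1>{safe_title}</H1>",
        safe_body,
        "</BODY>",
        "</HTML>",
    ]
    return "\n".join(lines) + "\n"


# The response goes to standard output, where the server picks it up.
sys.stdout.write(build_html_response("Title Goes Here", "Body of the HTML document."))
```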

A Reference to a Document

Instead of sending the document, the CGI program can direct the browser to a particular predefined document or have the server automatically send the new one.

An example is an application that sends existing published white papers based on information requests. In this case, the program should know the full URL of the files to reference and send something like the following:


Location: http://www.myserver.com/document_location<lf><lf>

The two line-feed characters form a blank line after the Location: line. The server acts as if the client's request was for the returned URL instead of the CGI program. It takes care of looking up the file type and sending the appropriate headers.
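
The Location response can be sketched in Python as follows. The function name redirect_response is hypothetical, and the check for embedded line breaks is a precaution added in this sketch so that a URL cannot smuggle in extra header lines.

```python
import sys


def redirect_response(url):
    """Return a CGI response that refers the server to another document."""
    if "\n" in url or "\r" in url:
        raise ValueError("URL must not contain line breaks")
    # One line feed ends the Location line; the second forms the blank line.
    return f"Location: {url}\n\n"


sys.stdout.write(redirect_response("http://www.myserver.com/document_location"))
```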

If you want to reference a document that is protected by access authentication, the program must have a full URL in the Location: line. This is because the client and the server must retransact to set up access to the referenced document.

NOTE
If your application needs to send headers such as Content-encoding, your server must be compatible with CGI/1.1. Send the extra headers along with the Location or Content-type header, and the server passes them back to the client.