Chapter 26 SGML

by Edward Hooban

CONTENTS

The Problem
The Solution
Standards
Portability
Form and Structure
Authoring Systems
Instance Components
Advantages of Structured Markup
Document Type Declaration
Coming Together-The SGML Authoring System
The Future of SGML

The Problem

One of the great things about technology over the last couple of decades is its increasing functionality and role in our everyday lives. Technology now plays a key role in one of the most fundamental office activities: writing text. Word processors are among the most widely used applications in the history of the microcomputer. A major problem with computers and text over the last several decades, however, is that vendors have invented numerous proprietary formats for storing text. Each format has features that merit consideration, but most limit portability because each is one vendor's idea of the perfect text storage format, which does not necessarily correspond with another vendor's format.

Recently, we have witnessed an explosion in information dissemination with the advent of the World Wide Web, creating an acute need for a standard document interchange format. This explosion is directly attributable to the widespread adoption of the HyperText Markup Language (HTML) to mark up text. Microsoft Word, WordPerfect, and other word processors never achieved this level of ubiquity because they stored their text in proprietary formats. They did not conform to a standard; therefore, documents created with a particular word processor were of little use to someone not equipped with that word processor. Because the rules of HTML markup are created according to an international standard (Standard Generalized Markup Language, or SGML), anyone can publish a document that can be viewed by the world as long as they mark up their document according to the standard. HTML is as close as we come to a universal interchange standard for text, but it really only taps a small portion of the expressive power of SGML.

Without an agreed-upon standard for storing textual information, vendors devise their own means for storage and management of text. Each vendor may feel that its system affords users numerous advantages that they cannot get anywhere else. Users are bound to a particular vendor's application and its feature set. In the short term, they get great features, but the long-term portability of their data suffers because it is stored according to a particular vendor's closed and proprietary scheme. When that vendor's product is obsolete and those original features are outdated, migrating that text to another vendor's product or platform becomes a huge issue. To solve the issue of information being stranded in a particular environment with a particular vendor, the International Organization for Standardization (ISO) defined a standard, Standard Generalized Markup Language (SGML), for representing and storing textual information to meet the portability and reuse requirements of an increasingly digitized world.

The Solution

Standard Generalized Markup Language (SGML) defines a standard for creating a markup language to describe the structure of a document. It provides a method for generically marking up an infinite variety of documents based strictly on that document's unique structure. For example, a sales report on fourth-quarter earnings might be structurally different from an engineering report on the specifications for maintaining an aircraft part. Each of these particular document types would have unique structural rules. Table structures might be more rigidly defined in a quarterly report, and explanation of parts might be more richly structured in a parts document for engineers. In an SGML authoring environment, the structure of the documents is more important than stylistic formatting issues such as font sizes (where structural meaning may have to be inferred). This enables documents to be produced in a variety of formats that require different markup codes.

Stylistic formatting is one of the biggest problems that plague the interchange of electronic information. After you have authored a document for a particular vendor's application environment, you are tied to that vendor's proprietary method for defining the structure and style of documents. This is typically not compatible with another vendor's implementation of the same structural and stylistic markup. Even within the same editing package, authors might attempt to convey widely varying structural meaning with the exact same stylistic element. Each vendor and author has developed what they consider to be an efficient and effective means for marking up documents.

For example, suppose you and a co-worker want to share a document. If you use Microsoft Word and your co-worker uses WordStar, you might have difficulty exchanging documents between the two formats. You and your co-worker might have different ideas of what 12-point bold means. If you want to exchange document drafts with a co-worker, you must find some neutral method for saving and interchanging work. Given the proprietary nature of storage for each word processing product, this can be a difficult issue. Another scenario is if you want to send out an electronic document company wide via e-mail but you are not sure if everyone has the word processor you used to create the document (actually, the World Wide Web solves this problem for simple documents with the prevalence of the intranet for broadcasting information).

SGML attempts to solve the problem of information becoming stranded in a particular application or on a particular platform. It is a generic, international standard. Markup rules are not controlled by any particular vendor, only by an international standards committee. The standard is vendor neutral, so adherence to it facilitates the long-term use and re-use of information. In addition, documents are validated against a specific structural rule set, so processing programs can rely on the exact structure of what they receive.

Listing 26.1 shows an example of marking up text. Instead of indicating that Section One is 12-point, bold Helvetica and embedding style commands that are understood only by a particular application, we indicate that Section One is Heading Level 2 and surround the text with the appropriate tags, thus conveying its structural meaning and not its stylistic presentation.

Listing 26.1. An example of marking up text.


.point 12 .attribute bold .font Helvetica Section One  [Word Processor 1]

!pointsize=12!attribute=bold!font=Helvetica! Section One [Word Processor 2]

<H2>Section One</H2>

This increases the document's flexibility when you want to move it to another editing application or process it for a new medium of distribution. Stylistic details are left to a particular processing program for a specific output medium. The output processing engine uses whatever stylistic elements are at its disposal for output on its unique medium.

To let a document be processed easily for various outputs, you need to send two things: a set of rule descriptions (known as a document type definition or DTD) and an instance (the markup and data). For example, one processor for CD-ROM might make Section One 18-point italic, another processor such as a browser for the World Wide Web might make it 14-point bold, and yet another processor for a book might make it Arial, 16-point bold. The point is that it doesn't matter how each application renders the information. What matters is that they all know the exact structure of a document and can make their own formatting judgments. Additionally, they should verify that the document conforms to the agreed-upon structure as defined in the DTD.
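The idea that each output processor applies its own style to the same structural tag can be sketched in a few lines of Python. This is only an illustration: the STYLE_SHEETS table and the render function are hypothetical names, and the style values are simply the three examples from the preceding paragraph.

```python
# Hypothetical per-medium style sheets, keyed by structural tag name.
# The style values repeat the examples given in the text above.
STYLE_SHEETS = {
    "cdrom": {"H2": "18-point italic"},
    "web": {"H2": "14-point bold"},
    "book": {"H2": "Arial, 16-point bold"},
}


def render(tag, text, medium):
    """Describe how one structural element would be styled on one medium."""
    style = STYLE_SHEETS[medium][tag]
    return f"{text} [{style}]"


# The source markup is the same in every case: <H2>Section One</H2>
for medium in STYLE_SHEETS:
    print(medium, "->", render("H2", "Section One", medium))
```

The source document never changes; only the table consulted by the processor does, which is what makes the same instance reusable across media.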

One of the most commonly held misconceptions is that HTML is a subset of SGML. In fact, HTML is a particular application of SGML: a markup language defined according to its rules. SGML just sets the ground rules by which you may create a markup language. The World Wide Web Consortium is responsible for creating and maintaining the HTML document type definition, or rules file, for Web-distributed documents. Browsers are free to render documents marked up according to the HTML DTD in any way they see fit. For example, one browser might interpret emphasis tags as bold and another might interpret them as italic. Both browsers know that, within the structure of this particular document, this text should be emphasized. The style of emphasis is completely up to the browser.

The key components of an SGML document include a declaration, a document type definition (DTD), and an instance. The declaration defines what character set is used and what the delimiters are. Unless you are a serious SGML professional, you are generally not concerned with the declaration. The DTD defines the rules by which you author documents, including markup tags and the order and relationship among them (these are pretty inscrutable, too). The instance contains the marked-up text according to the rules as defined by the DTD and the constraints and character sets as defined by the declaration.

Standards

Standard Generalized Markup Language (SGML) is an official international standard for the exchange of digital information. Its full designation is ISO 8879:1986, "Information processing - Text and office systems - Standard Generalized Markup Language (SGML)," and it is available from the International Organization for Standardization (ISO). SGML is a meta-language (a language for creating languages), or grammar, for defining specific markup schemes. It is a flexible system for defining a rich set of markup languages that precisely describe the structure of an unlimited set of documents. The standard was primarily driven by Dr. Charles Goldfarb, who worked extensively with markup languages for many years at IBM on GML (Generalized Markup Language), which eventually became SGML.

Portability

An illustration will help to clarify some of the benefits of SGML. In Figure 26.1, you can see that a document is authored according to a particular set of markup rules. From this source document, which is explicitly defined and rigidly structured, your processing programs have a known textual quantity. This isolates any filtering issues to a particular program and not to the data itself. Conformance to international standards ensures that documents created according to SGML are truly portable across hardware systems and application software vendor systems.

Figure 26.1 : Document with a particular set of markup rules.

Form and Structure

One of the key tenets of SGML is that the format of a document is separated from the structure of a document. A traditional word processor, such as Microsoft Word, allows you to intermingle your formatting codes with your structural codes. Each word processor or desktop publishing application has its own proprietary markup schemes. This makes the interchange of documents very cumbersome. It also limits their capabilities for being rendered by different word processing, desktop publishing, and electronic publishing engines.

These are issues that SGML attempts to address. An author can generically mark up a document according to a document type definition (DTD). The author is not concerned with exactly how the document looks but with how the document is structured. SGML markup works with the logical structure of the document, not the document's appearance. There are no stylistic concerns at this stage, so documents are more flexible for porting to various platforms and publishing environments. One set of SGML-marked-up source files can be used to generate numerous outputs on various media such as the World Wide Web, CD-ROM (any number of electronic publishing engines), desktop publishing, and even hard copy (see Figure 26.2).

Figure 26.2 : A generic document.

A widely held misconception is that authoring and using SGML is overly complex and unforgiving. Although it is true that some SGML applications have been known as user-unfriendly, SGML authoring systems have become richer and more robust, including WYSIWYG (What You See Is What You Get) graphical authoring environments with on-the-fly validation and graphical tree representations of document models. In the long run, the increased effort required to use an SGML authoring system makes document management issues easier.

Authoring Systems

SGML authoring systems have varying levels of integration. There are several essential components to any SGML system. The following list describes these components in detail.

The editing application consists of a text editor (usually graphical) and a validating parser for managing the creation and editing of SGML documents. If you have used any editor for marking up HTML, then you used an SGML editor. Some of these HTML editors are more rigid than others. Some do not interactively parse your document; they do not check while you are authoring whether you are conforming to the DTD. HTML editors usually enforce one particular version of an HTML DTD and probably do not accommodate another DTD. This behavior is very limiting because such editors can validate and verify the integrity of only one document type. For a truly powerful SGML authoring environment, you need an editor with the capability to parse any DTD that is fed into the system.
The document type definition (DTD) is a file containing the valid markup language for a particular document type. For example, documents from the engineering department might have their own DTD that is different from documents from the accounting department. They both conform to the SGML standards and may even be authored with the same package. A DTD expert creates and modifies the library of document types. A number of specialized tools for editing DTDs are available, and some tools include tree-like graphical representations of the information.
The instance is the document that you create, which includes content and markup. Instances consist of individual textual elements and entities, the individual components of a document. This end-user document is validated against the DTD that is associated with it. The instance is the end product of an authoring session. After the content is entered and marked up, a parser processes the content and tags to validate its structural integrity against the DTD.
Processing software: After you create an SGML document, you want conversion programs to take it to various streams of output. These output media might include hard-cover books, CD-ROM, or the World Wide Web. The processing software can be written in any programming language. One processing language with numerous features is OmniMark. OmniMark includes a validating parser in its processing environment and an English-like, event-driven programming language designed to process SGML documents. Other languages for processing output include Perl and C (with freeware parsers available for both environments).

Instance Components

The instance must contain a reference to a particular DTD. This must come at the beginning of the document in the <!DOCTYPE> declaration (strictly a declaration rather than a tag), which indicates the type and location of the DTD. This is a critical piece of information to the parser because it needs to know what context it is in. The other components of an instance are the actual text and the markup that surrounds the text (tags as defined by the DTD referenced).

Elements

With SGML, consider your document a collection of objects. The object name, behavior, and characteristics are defined in the DTD. The objects are instantiated when you write the content and surround it with the appropriate markup. For example, a tag can have certain attributes that enable you to define information about that piece of text, such as a unique ID, object author, object topic, and object creation date, as shown in the following line:


<H1 ID="145981" TOPIC="MATH" AUTHOR="JOHN SMITH"> Functions and Graphs </H1>

The preceding line is a self-contained unit of a document called an element; it is a standard SGML textual unit. Think of it as an object with certain characteristics (known in SGML parlance as attributes). Elements can also contain other elements without any text content of their own; such an element acts purely as a container and is said to have element content. The rule in the DTD that spells out what an element may contain is called its content model. Listing 26.2 shows a very simple example of a book to illustrate markup and elements.

Listing 26.2. Illustration of markup and elements.


<!-- SGML Example -->

<BOOK>

<PART><TITLE>Databases</TITLE>

<CHAPTER><TITLE>Object-Oriented Databases</TITLE>

<HEADING>Free OODBMS's</HEADING>

<PARA>There are a number of free Object Oriented Databases:</PARA>

<LIST>

<ITEM>Ingres</ITEM>

<ITEM>Hyper-G</ITEM>

</LIST>

<PARA> And there are also a variety of commercial databases:</PARA>

<LIST>

<ITEM>Gemstone</ITEM>

<ITEM>O2</ITEM>

</LIST>

</CHAPTER>

</PART>

</BOOK>

<!-- End SGML Sample -->

From this example, you can see a number of characteristics for an SGML document. First of all, notice that all the content is surrounded by tags. These tags are very important for conveying the structure of the document. A begin tag is created by using a <TAGNAME> and denotes the beginning of some structural element in the document. An end tag is similar, with the exception of a slash inserted, </TAGNAME>.

The first and last lines are comments and do not have a begin and end tag set; they are delimited with <!-- and -->. The next tag is the <BOOK> tag and indicates that everything that follows until the closing </BOOK> tag is part of the book element construct. The BOOK element is a pure container. (<HTML> </HTML> are very similar tags structurally.) It does not directly contain text; it is merely a container for other tags and their text. The next tag, <PART>, is also a container. It merely contains other tags and their content. These two tags represent an important concept in SGML: rich structural markup for flexible processing.

The next tag, <TITLE>, actually contains some text. It is delimited by an opening tag, <TITLE>, and a closing tag, </TITLE>. In between is the actual text. Next in the example, you see another container element, <CHAPTER>. It doesn't directly contain text but merely delimits the structural boundary of a book chapter. Within the <CHAPTER> tag is a set of <TITLE> </TITLE> tags to indicate the actual name of the chapter.

Chapters have <HEADING> tags to delimit subsections. The <HEADING> is designed to have content. The rest of the chapter is structured in a similar manner.

The structure of the preceding book must be formally defined in a document type definition. Look at the table of contents for this book. You will notice how the book is broken down into sections and chapters. If you look at the individual chapters you will notice that they are further broken down into section headings within chapters and paragraphs within headings. Going back to the preceding simple example, you can infer some rules about the content markup:

A BOOK element always contains one or more PART elements.
A PART can have at most one TITLE, which must precede its CHAPTER elements. The TITLE is optional, but a PART must contain one or more CHAPTER elements.
A CHAPTER can have one or more HEADING, PARA, or LIST elements, in any order, that must follow the optional TITLE.
A TITLE must contain parsed character data (#PCDATA, that is, ordinary text).
A HEADING must contain parsed character data (ordinary text).
A PARA must contain parsed character data (ordinary text).
A LIST must contain one or more ITEMs.
An ITEM must contain parsed character data (ordinary text).

Note that I have not made any figure references. Figures are usually handled with an external entity reference, indicating where on the file system they are located. Figure contents are not parsed, just the reference to them.

Minimization

Noting the rules that were set forth for authoring a book, you can see that this markup is a bit verbose. That is, every begin tag has an end tag that explicitly ends the textual unit. Many times, you can determine from the context whether or not an end tag is explicitly required. For example, in the following code, an <ITEM> tag is implicitly ended by the beginning of a new <ITEM> tag; therefore, you do not need to explicitly end it, as shown in the following:


<LIST>

<ITEM>Ingres

<ITEM>Hyper-G

</LIST>

The capability to omit tags, known as minimization, is dictated by the document type definition (rules file). The DTD indicates what tags can and cannot be minimized.
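To see what a parser has to do with minimized markup, here is a rough Python sketch that restores the omitted </ITEM> end tags in a list. It is not how a real SGML parser works: a real parser consults the DTD to decide which tags may be omitted, whereas this hypothetical close_omitted_items function hard-codes the single rule from the example above and recognizes only the exact uppercase tag names shown there.

```python
import re


def close_omitted_items(markup):
    """Insert the </ITEM> end tags that the minimized markup left out."""
    out = []
    item_open = False
    # Split into tags and text runs; the capturing group keeps the tags.
    for token in re.split(r"(<[^>]+>)", markup):
        if not token:
            continue
        if token in ("<ITEM>", "</LIST>") and item_open:
            # A new ITEM or the end of the LIST implies the end of the open ITEM.
            # Put the inferred end tag before any trailing whitespace.
            text = out.pop() if out and not out[-1].startswith("<") else ""
            stripped = text.rstrip()
            out.extend([stripped, "</ITEM>", text[len(stripped):]])
            item_open = False
        if token == "<ITEM>":
            item_open = True
        elif token == "</ITEM>":
            item_open = False
        out.append(token)
    return "".join(out)


minimized = "<LIST>\n<ITEM>Ingres\n<ITEM>Hyper-G\n</LIST>"
print(close_omitted_items(minimized))
```

Markup that already carries explicit end tags passes through unchanged, so the function can be applied to fully tagged and minimized lists alike.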

Attributes

Every SGML element can have one or more attributes (or none). Attributes are information about that particular element; you can think of them as adjectives to the element's noun. In object-oriented terms, attributes would be the instantiated object's instance variables. For example, every <PARA> tag might have an ID attribute as a unique identification code for that textual element that looks something like the following:


<PARA ID="1111-2222-3333-4444">This is a sample paragraph with 

an attribute</PARA>

This PARA container has a unique label, so you can distinguish and catalog each of your elements in a fine-grained manner. This identification could be useful if you have a distributed electronic authoring system with constant additions and deletions from a document. If the document is sufficiently large, you wouldn't want to lock the whole thing and check it out to one author. Instead, you'd like to control editing in a more sophisticated manner. The capability to lock certain discrete sections of a document facilitates the multiple-author editing process. Authors could work simultaneously on different sections of a document with no conflict. Within the attributes for each of the elements, an AUTHOR attribute could track who was responsible for each textual unit in a document.

Additionally, you can define access control to certain parts of a document. When an author checks out elements from the payroll section of a budget with the attribute GROUP="ACCOUNTING", you can verify that person as someone with access to this information. In contrast, a group of elements comprising the mission statement might have an attribute of GROUP="ALL", thus allowing everyone access to this information. The element, with its attributes, is a self-contained unit of information that can be processed in any number of ways. Attributes can be very powerful features. Hypertext systems (such as the World Wide Web) are highly dependent on the use of attributes. Tags such as <IMG> and <A> indicate particular object types, but attributes indicate the specific resources for those object types.
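The GROUP-based access check described above might look something like the following Python sketch. The BUDGET and SECTION element names and the readable_sections function are invented for illustration, and Python's XML parser stands in for an SGML parser, which works here only because the sample is fully tagged with quoted attribute values.

```python
import xml.etree.ElementTree as ET

# Hypothetical budget document; only the GROUP attribute values come from the text.
BUDGET = (
    '<BUDGET>'
    '<SECTION GROUP="ALL"><TITLE>Mission Statement</TITLE></SECTION>'
    '<SECTION GROUP="ACCOUNTING"><TITLE>Payroll</TITLE></SECTION>'
    '</BUDGET>'
)


def readable_sections(root, user_groups):
    """Return the titles of the sections that a user in user_groups may check out."""
    allowed = set(user_groups) | {"ALL"}
    # A SECTION with no GROUP attribute is treated as restricted (an assumption).
    return [
        section.findtext("TITLE")
        for section in root.iter("SECTION")
        if section.get("GROUP") in allowed
    ]


root = ET.fromstring(BUDGET)
print(readable_sections(root, ["ENGINEERING"]))
print(readable_sections(root, ["ACCOUNTING"]))
```

Because the access rule lives in an attribute on each element, the check needs no knowledge of the text inside the section.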

Advantages of Structured Markup

SGML provides facilities for defining the rules that govern the authoring process and provides for rigidly structured document creation. What are the benefits of marking up data in such a manner? In the case of a technical book, processing programs could automatically generate the table of contents as part of their conversion to an output format. This means that the author or editors do not have to worry about generating a table of contents; they let the processing program take care of that. The document is validly structured, making this a trivial task. Additionally, the author or editor can make major structural changes until the last minute without affecting the table of contents. The table of contents is generated by a processing program after you have completed a valid document.
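As a rough illustration of how little work the table of contents takes once the structure is known, the following Python sketch walks a shortened version of the book from Listing 26.2. The table_of_contents function is a hypothetical name, and Python's XML parser is used as a stand-in for an SGML parser, which is possible only because the sample has every end tag written out.

```python
import xml.etree.ElementTree as ET

# A shortened, fully tagged version of the book in Listing 26.2.
BOOK = """<BOOK>
<PART><TITLE>Databases</TITLE>
<CHAPTER><TITLE>Object-Oriented Databases</TITLE>
<HEADING>Free OODBMS's</HEADING>
<PARA>There are a number of free Object Oriented Databases:</PARA>
</CHAPTER>
</PART>
</BOOK>"""


def table_of_contents(root):
    """Collect (depth, title) pairs for every PART, CHAPTER, and HEADING."""
    toc = []
    for part in root.findall("PART"):
        toc.append((0, part.findtext("TITLE", default="(untitled part)")))
        for chapter in part.findall("CHAPTER"):
            toc.append((1, chapter.findtext("TITLE", default="(untitled chapter)")))
            for heading in chapter.findall("HEADING"):
                toc.append((2, heading.text))
    return toc


for depth, title in table_of_contents(ET.fromstring(BOOK)):
    print("  " * depth + title)
```

Rerunning the function after any structural edit regenerates the listing, which is why late changes to the document cost nothing.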

NOTE

Another interesting thing that you can do with a well-structured document is contextual analysis. A processing program familiar with the document type could automatically generate weighted indexes for search engines based on the placement of words. (For example, words within the <HEADING> </HEADING> tags would have a higher weight than words within the PARA tags.) This technique is used by some World Wide Web crawlers in examining HTML documents. Words within the <H1> </H1> tags have a higher weight than words within the <H5> </H5> tags. With a highly structured and rigidly defined document, you can utilize some powerful processing capabilities for manipulating the data.
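The weighting idea in this note can be sketched as follows in Python. The WEIGHTS values are arbitrary numbers chosen for illustration, not figures used by any actual crawler, and weighted_index is a hypothetical function name; as before, an XML parser stands in for an SGML parser on a fully tagged sample.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed weights: words in titles and headings count for more than body text.
WEIGHTS = {"TITLE": 5, "HEADING": 3, "PARA": 1, "ITEM": 1}

SAMPLE = (
    "<CHAPTER><TITLE>Object-Oriented Databases</TITLE>"
    "<HEADING>Free Databases</HEADING>"
    "<PARA>There are a number of free databases.</PARA>"
    "</CHAPTER>"
)


def weighted_index(root):
    """Score each word by the weight of the element it appears in."""
    scores = Counter()
    for element in root.iter():
        weight = WEIGHTS.get(element.tag)
        if weight is None or not element.text:
            continue  # container elements carry no text of their own
        for word in re.findall(r"[a-z0-9]+", element.text.lower()):
            scores[word] += weight
    return scores


index = weighted_index(ET.fromstring(SAMPLE))
print(index.most_common(3))
```

In this sample, "databases" scores highest because it appears in the title, the heading, and the paragraph.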

Document Type Declaration

A document type definition is written according to the rules as set forth in the SGML standard. SGML lays the ground rules for how to define markup, and a DTD is a particular implementation of markup rules. Specifically, a DTD defines the names of elements, how often such elements may appear, the order in which the elements must appear (in the document and relative to one another), and whether elements may be safely omitted.

Usually, a DTD is created for a particular class of documents. An example might be a corporate division business plan DTD with a unique structure different from that of an engineering document. What is common to both DTDs is that they define the valid range of tags allowed to mark up a particular type of document. A DTD can be as loosely defined or as rigidly defined as a particular document type requires. This is where the real power (and danger) of SGML is manifested. A DTD can be written as the most complex and rigidly defined of documents, or it can be used to give loose and informal structure to a document. Good design is imperative.

NOTE

Poorly designed DTDs can cause complications in a parsing engine. A DTD can be defined so loosely that a processing program responsible for interpreting an instance would have to be extremely complex and attempt to cover a lot of ambiguities in the markup.

Additionally, the DTD defines where tags could go in relation to other tags for a particular document type. Both the structure and syntax of the DTD and any documents created in accordance with it are verified as syntactically correct by a validating parser.

Listing 26.2 is written according to the following DTD, which lays out all the available tag names and how they relate to each other. Don't worry about the syntax. (I have already explained the rules for this particular document type.) Just realize that Listing 26.3 is part of what a DTD looks like (without the attributes).

Listing 26.3. Sample document type definition (DTD).


<!ELEMENT BOOK     - -  (PART+)>

<!ELEMENT PART     - -  (TITLE?, CHAPTER+)>

<!ELEMENT TITLE    - O  (#PCDATA)>

<!ELEMENT CHAPTER  - -  (TITLE?,(HEADING | LIST | PARA)+)>

<!ELEMENT HEADING  - -  (#PCDATA)>

<!ELEMENT PARA     - O  (#PCDATA)>

<!ELEMENT LIST     - -  (ITEM+)>

<!ELEMENT ITEM     - O  (#PCDATA)>

Do not worry if the markup makes no sense to you. Typically, a DTD is written and maintained by a skilled SGML analyst. These rule sets are too vital and complex to be left to an author who cares only about creating content. If you have a poorly written DTD, the repercussions can be disastrous. Processing systems will break and output will be unreliable (especially across various media).

Many of the authoring and processing environments provide access to a validating parser. The parser's job is to make sure that the DTD and the rules it sets forth are rigidly enforced against every document that claims to abide by that particular DTD. For example, according to Listing 26.3, the <BOOK> tag cannot contain a <PARA> tag. A <PARA> tag is only allowed within a <CHAPTER> tag. If you tried to put a <PARA> tag before the <PART> tag in the document from Listing 26.2, as in the following fragment, the parser would complain:


<BOOK><PARA>I am going to talk about databases</PARA>

<PART><TITLE>Databases</TITLE>

Such a construct is possible only if the rules permit it. (If you need such flexibility, you must change your DTD accordingly.) In this particular DTD, placing the <PARA> tag as in the preceding example is an invalid construct.

Without these assurances of the integrity of the authoring process, programs would have an unreliable input stream. The obvious consequence is an unreliable output stream. The parser's job of validating documents is vital to the success of an SGML system.
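To give a feel for what a validating parser checks, here is a much-simplified Python sketch. It is not a real SGML parser: the content models from Listing 26.3 are translated by hand into regular expressions over child tag names, tag omission is ignored, and Python's XML parser reads the fully tagged sample. The names CONTENT_MODELS, TEXT_ONLY, and validate are hypothetical, and the invalid fragment is completed with a CHAPTER so that only the misplaced PARA is at fault.

```python
import re
import xml.etree.ElementTree as ET

# Content models from Listing 26.3, hand-translated to regular expressions.
# Each child tag name is followed by one space in the string being matched.
CONTENT_MODELS = {
    "BOOK": r"(PART )+",
    "PART": r"(TITLE )?(CHAPTER )+",
    "CHAPTER": r"(TITLE )?((HEADING|LIST|PARA) )+",
    "LIST": r"(ITEM )+",
}
TEXT_ONLY = {"TITLE", "HEADING", "PARA", "ITEM"}  # declared as (#PCDATA)


def validate(element, errors=None):
    """Return a list of structural errors found in element and its descendants."""
    if errors is None:
        errors = []
    children = "".join(child.tag + " " for child in element)
    if element.tag in TEXT_ONLY:
        if children:
            errors.append(f"{element.tag} may contain only text")
    elif element.tag in CONTENT_MODELS:
        if not re.fullmatch(CONTENT_MODELS[element.tag], children):
            found = children.strip() or "(nothing)"
            errors.append(f"{element.tag} may not contain: {found}")
    else:
        errors.append(f"{element.tag} is not declared")
    for child in element:
        validate(child, errors)
    return errors


invalid = ET.fromstring(
    "<BOOK><PARA>I am going to talk about databases</PARA>"
    "<PART><TITLE>Databases</TITLE>"
    "<CHAPTER><PARA>Text</PARA></CHAPTER></PART></BOOK>"
)
print(validate(invalid))
```

Removing the stray PARA from the BOOK element makes the error list empty, which is the assurance that downstream processing programs depend on.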

Coming Together-The SGML Authoring System

The varieties of systems for authoring SGML documents range from GUI-based, on-the-fly validation environments to a crude UNIX-based vi text editor with a freeware validating parser. The choice depends on the publishing environment. Paradoxically, SGML's vendor independence and implementation flexibility can make it a tremendously complex environment to set up.

Depending on the type of publishing operation you intend to run, the training and software costs can be high. Each of the authors will likely need a WYSIWYG SGML authoring system with a fully compliant validating parser. Authors need training for such a system. It often takes a bit of adjustment, psychologically, to get attuned to a structured authoring environment. It can be frustrating for the authors because they prefer to concentrate on writing rather than conforming to a particular DTD (or even worrying about a DTD).

A document analyst must do a significant amount of work to assess your various document structures. After assessing your structural requirements, you must author and debug the DTDs. Then, the DTDs must be constantly maintained. Once you have a body of data that you'd like converted to other formats, you need to write programs to perform the conversion. These conversion programs must also change to reflect the changing structure of your DTD.

In short, developing and maintaining an SGML authoring system is no small task. It requires significant up-front investment of both time and money, but the long-term rewards can be invaluable.

The Future of SGML

Richer SGML systems will have increasing prominence in information technology shops of the future. The World Wide Web is a proven SGML application. HTML is a very simple DTD. HTML 3.0 proposes numerous features, including enhanced table markup, mathematical symbols, and greater attribute control. The groundwork and protocols for interchange of information have been established (TCP/IP). HTML has effectively addressed a short-term need to structurally define bits of information distributed throughout the world. As the demand for distribution of information via the Internet increases, the need for a more sophisticated and robust method of marking up a wider variety of information will become paramount.

Already, certain browser manufacturers are adding parsing functionality to their products to support extensions to the HTML DTD. This is because the market needs a product that allows it to represent a richer set of information. Demand is present. As a greater body of increasingly complex information requires a platform- and vendor-independent form of distribution, SGML will become a more widely used standard. Presently, the most successful implementation of SGML, HTML, has bumped up against severe limitations in the range of data that it can represent. Browsers such as Arena and Panorama parse a wider variety of DTDs on-the-fly and render them in a hypertext environment.

The dream of a uniform and complete hypertext-linked environment is getting closer to realization with the advent of SGML standards.