Chapter 2 Reducing Site Maintenance Costs Through Testing and Validation

What Does "Maintenance" Mean?
What to Do with Validation Results
Validator-Like Tools

Arunning joke in the software development field is that software engineering is the only branch of engineering in which adding a new wing to a building is considered "maintenance." By some estimates, software maintenance-those changes made to the product after it is released for use-accounts for almost ninety percent of the lifetime cost of the product.

Software maintenance occurs for several reasons. Some maintenance actions occur to fix latent defects-bugs. Sometimes maintenance must be performed to keep software up-to-date with new standards or with changes in other components of the system. Of course, sometimes a product is changed to add a new feature requested by the users.

Regardless of why it occurs, changing a product that already exists in the field is expensive. Thorough testing can reduce the number of defects, eliminating some maintenance costs. Validating HTML to make sure it meets the standard makes it less likely that a change in some browser will force the developer to recode the page. Reducing these costs allows the developer to offer site development at a lower cost, and the site owner can spend more on content maintenance, which builds traffic.

What Does "Maintenance" Mean?

Software has no moving parts. There is nothing physical that can age, wear, or break down. So why do seemingly intelligent people talk about software "maintenance"?

First, software rots. All the reasons for this phenomenon aren't clear, but any experienced software engineer treats this as a given-all software rots.

A key reason that software needs maintenance is that the rest of the computer world does not stand still. A piece of software that works on today's computers, with today's operating systems, will inevitably be less compatible tomorrow. By next year, it might be totally unusable.

Another reason that software needs maintenance is to adapt to changing requirements. If the software is being used, then people probably are finding new ways to use it, and might want new features and capabilities. In many ways, a request for an enhancement is a sign of success-someone is running her business with your program and wants the program to become even more useful than it already is!

Yet another reason to maintain software is that, on very rare occasions, a user discovers a defect. The industry amuses itself by calling these defects bugs. Make no mistake about it-there's nothing amusing about a bug.

Keeping Traffic High with Changes

The best kind of maintenance is the kind that improves the site-by adding new content and features that attract new visitors and encourage them to come back again and again. This kind of maintenance usually takes a lower priority compared to the tasks of defect removal and keeping the site up-to-date with the browsers. One key to building an effective site is to keep the maintenance costs low so plenty of resources are available to improve the site, which in turn builds traffic.

On the Web, severe software defects are rare. One reason for this is that HTML is not a programming language, so many opportunities a programmer might have to introduce defects are eliminated. Another reason is that browsers are forgiving by design. If you write bad C++ and feed it to a C++ compiler, chances are high that the compiler will issue a warning or even an error. If you write bad HTML, on the other hand, a browser will try its best to put something meaningful on-screen. This behavior is commonly referred to as the Internet robustness principle: "Be liberal about what you accept and conservative about what you produce."

The Internet robustness principle can be a good thing. If you write poor HTML and don't want your clients to know, this principle can hide many of your errors. In general, though, living at the mercy of your browser's error-handling routines is bad for the following reasons:

Not everyone is using the same browser. HTML that survives in your browser might break in another.
You want your code to last. As time goes on, browsers must become stricter. Code that works in Level 1 might break in Level 2.
Browser authors don't always do error handling correctly. You might make an error that confuses the browser, with unpredictable results.
Browsers are not the only pieces of software that read the HTML on your site. Increasingly, robots are trying to understand your HTML. Robots are online programs that explore the Web looking for new sites. When they visit a site, they typically categorize each page (and sometimes the site as a whole) and include their findings in online databases. See Webcrawler (http://www.webcrawler.com/) for a good example of what a robot can find.
The standards change. Next year, features will be supported in HTML that have scarcely been thought of today. As you add new features to your site, problems that browsers ignored before might start to confuse the browsers.

If you could write each page once and leave it alone forever, then maybe you could take the time to perfect each line of HTML. If your site is being actively used, however, then it is being changed-or should be.

The most effective Web sites are those that invite two-way communication with the visitor. Remember the principle content is king. Web visitors crave information from your site. One way to draw them back to the site is to offer new, fresh information regularly. If a client posts new information every few weeks, people will return to the site. If the client posts new information daily, people will stampede back to the site. The expert Webmaster must deal with all the new content.

First, we'll discuss how to make the HTML as perfect as possible when the site is initially developed. Then we'll describe a maintenance program to keep the site working effectively.

Testing the Site

Although computerized validators are useful, a key step in building an effective site is to have the site reviewed by human evaluators. In any medium-the Web is no different than print-the principal author gets so close to the copy that he or she fails to see errors and might miss obvious ways to make the copy more effective.

Polishing the Copy

Too many Web sites are put up by technically skilled people who are not experienced in the effective use of words. Even people who can write well often get too close to their material to be able to spot or fix problems. Most Web sites benefit from having a professional copywriter participate in the design process.

Most copywriters get their experience in environments, such as the print media, where the client is paying by the word or the column inch. They are trained to pack the most impact into the smallest number of words. While Web sites don't usually have tight page limits, the Web developer is competing for the attention of the readers-and short, high-impact messages are useful to draw visitors into the site.

The very best copywriters have made it their business to know something about a branch of psychology called "human factors." They know, for example, that a line that can be read with little or no eye movement has more impact than a long sentence in which the eye has to scan across the page. HTML lends itself to wide lines-a skilled copywriter will break these lines up into shorter segments to increase readability and retention.

Just as the copywriter can add impact to the words, a graphic artist can evaluate the site for overall balance and color scheme. A larger development team may have a full-time art director. A smaller firm may have one or more freelancers on retainer. Either way, it is worthwhile to get a professional evaluation of the aesthetic aspects of the site.

Using a Red Team

A copywriter is not the only person who can improve the site. A good Web developer assembles a team of independent evaluators who can test the effectiveness of the site. These people should be representatives of the target audience. They can be friends and relatives, but only if those people can be counted on to give hard feedback when it's needed. Getting someone outside the development team to look over the site will lead to new ideas, plus they'll spot problems or weaknesses the developer has become blind to.

A group of independent evaluators who review the product from the point of view of the target audience is called a Red Team. The following is one sequence for using Red Team members:

Only one person at a time reviews the site. Each starts at the home page and goes as far into the site as he or she cares to. No one is required to read every page on the site.
Once a reviewer has turned in comments, he or she works with the developer to make the suggested changes (if the changes are accepted by the developer).
After one reviewer has looked at the site, made his or her comments, and had his or her comments incorporated, the next reviewer looks at the site.

When the Red Team members review the site, the following occurs:

The developer instructs half of the reviewers to look at the site on paper (hard copy) and the other half to look at the site online.
The developer asks the hard-copy reviewers to put a mark on the page whenever they put the paper down for any reason, even if they're only taking a break.
The developer uses log files to follow the path each online evaluator took. The developer looks to see which links were followed first, which links the visitor skipped entirely, and how long the visitor spent on each page.
At least one of the hard-copy reviewers is asked by the developer to circle anything on the site that he or she finds off-color, offensive, or in poor taste. He or she isn't asked to write an explanation-just to circle the item.

During the Red Team process, the site developer looks for trends. If most hard-copy evaluators took a break at the end of page 2, there might be something about page 2 that's tedious. Or there might be very little to draw them into page 3.

The developer uses the log analysis for the same reason. If most evaluators skip page 3, it's worth reconsidering how that page is introduced. If evaluators take a long time reading page 6, perhaps that page is tedious-or perhaps it was particularly interesting. The developer should talk to the evaluators to get their specific impressions.

Validating the Site

Copywriters, art directors, and Red Teams are all ways to improve the quality of the contents. It is just as important for the site developer to validate the quality of the code. Strictly speaking, "validation" refers to ensuring that the HTML code complies with approved standards. Generally, validator-like tools are available to check for consistency and good practice as well as compliance with the standards.

What Is an Open Standard?

HTML is part of an open standard. To understand open standards, it's important to understand the alternative: proprietary standards.

During the first few decades of the computer era, it was common for each computer manufacturer to come up with its own language, its own interfaces, and its own cable and signaling standards. It did this to make sure that if a customer ever considered changing vendors, he or she would have to throw out everything. Many of those companies became quite successful at keeping customers tied to specific architectures for years.

In the late seventies, personal computers from Apple and IBM were introduced. By the mid-eighties, the IBM PC had been cloned, and customers were delighted to have a choice of vendors for their computers, peripherals, and software. By the nineties, UNIX had been ported to almost every computer on the market, and customers had unprecedented freedom of choice. They could buy their hardware from one vendor, their operating system from another, and application programs from others. If they became dissatisfied with a vendor, they could change without having to throw out the rest of their system.

Then along came the Internet. In many ways, the Internet is the culmination of the rise of the open standard. From a desktop computer, a user can access software running on thousands of different computers from hundreds of different vendors. On the Internet, FTP works about the same for an IBM mainframe as it does for a PC running Linux. E-mail can be exchanged between VAXs and Macs, and Web servers and browsers exist for all popular platforms. In theory, a Web page can be written on one machine, served by a different machine, and read by yet another machine.

Unfortunately, "in theory" often means "not really." The next section describes what is happening to the open standard of HTML.

Document Type Definitions and Why You Care About Them

The HyperText Markup Language (HTML) is not a programming language or a desktop publishing language. It is a language for describing the structure of a document. Using HTML, users can identify headlines, paragraphs, and major divisions of a work.

HTML is the result of many hours of work by members of various working groups of the Internet Engineering Task Force (IETF), with support from the World Wide Web Consortium (W3C). Participation in these working groups is open to anyone who wishes to volunteer. Any output of the working groups is submitted to international standards organizations as a proposed standard. Once enough time has passed for public comment, the proposed standard becomes a draft and eventually might be published as a standard. HTML Level 2 has been approved by the Internet Engineering Steering Group (IESG) to be released as Proposed Standard RFC 1866. (As if the open review process weren't clear enough, RFC in proposed standard names stands for Request for Comments.)

The developers of HTML used the principles of a meta-language, the Standard Generalized Markup Language (SGML). SGML may be thought of as a toolkit for markup languages. One feature of SGML is the capability to identify within the document which of many languages and variants was used to build the document.

Each SGML language has a formal description designed to be read by computer. These descriptions are called Document Type Definitions (DTDs). An HTML document can declare which level of HTML it was written for by using a DOCTYPE tag as its first line. For example, an HTML 3.0 document starts with the following:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN">_

The DOCTYPE tag is read by validators and other software. It's available for use by browsers and SGML-aware editors, although it's not generally used by those kinds of software. If the DOCTYPE tag is missing, the software reading the document assumes that the document is HTML 2.0.

Table 2.1 lists the most common DOCTYPE lines and their corresponding HTML levels.

Table 2.1 DOCTYPETags Cue Document Readers About What Type of Markup Language Is Used

DOCTYPE	Level
`<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">`	2.0
`<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN">`	3.0
`<!DOCTYPE HTML PUBLIC "-//Netscape Comm. Corp.//DTD HTML//EN">`	Netscape

HALsoft, the Original Validator

The HALsoft validator was the first formal validator widely available on the Web. In January 1996, the HALsoft validator moved to WebTech and is now available at http://www.webtechs.com/html-val-svc/.

At the original Web validator, the WebTech validator is the standard by which other validators are judged. Unfortunately, the output of the WebTech program is not always clear. It reports errors in terms of the SGML standard-not a particularly useful reference for most Web designers.

The following gives an example of a piece of flawed HTML and the corresponding error messages from the WebTech validator:

<!DOCTYPE HTML PUBLIC	"-//IETF//DTD HTML 3.0//EN"> 
<HEAD>
<TITLE>Test</TITLE>
<BODY BACKGROUND="Graphics/white.gif>
<H1>This is header one</H1>
<P>
This document is about nothing at all.
<P>
But the HTML is not much good!
</BODY>
</HTML>

produces

Errors
sgmls: SGML error at -, line 4 at "B":
      Possible attributes treated as data because none were defined

The Netscape attribute (BACKGROUND) on the previous page will be flagged by the validator as nonstandard. The missing closing tag for the HEAD doesn't help much, either, but it's not an error (because the standard states that the HEAD is implicitly closed by the beginning of the BODY). Even though it's not a violation of the standard, it's certainly poor practice-this kind of problem will be flagged by Weblint, described later in this chapter.

The WebTech validator gives you the option of validating against any of several standards:

HTML Level 2
HTML Level 2 Strict
HTML Level 3
HTML Level 3 Strict
HTML with Netscape extensions
HTML with HotJava (a browser released by Sun Microsystems to demonstrate its Java language) extensions

HTML Level 2 is "plain vanilla" HTML. There were once HTML Level 0 and Level 1 standards, but the current base for all popular browsers is HTML Level 2 (also known as RFC 1866).

Each level of HTML tries to maintain backwards-compatibility with its predecessors, but using older features is rarely wise. The HTML working groups regularly deprecate features of previous levels. The notation Strict on a language level says that deprecated features are not allowed.

HTML Level 3 represents a bit of a problem. Shortly after HTML Level 2 stabilized, developers put together a list of good ideas that didn't make it into Level 2. This list became known as HTML+. The HTML Working Group used HTML+ as the starting point for developing HTML Level 3. A written description and a DTD were prepared for HTML Level 3, but it quickly became apparent that there were more good ideas than there was time or volunteers to implement them. In March 1995, the HTML Level 3 draft was allowed to expire, and the components of HTML Level 3 were divided among several working groups. Some of these groups, like the one on tables, released recommendations quickly. The tables portion of the standard has been adopted by several popular browsers. Other groups, such as the one on style sheets, have been slower to release a stable recommendation. As of this writing, only the Arena browser implements style sheets, and its implementation (known as cascading style sheets) won't necessarily become the standard.

The DTDs for Netscape and HotJava are even more troublesome. Neither Netscape Communications nor Sun Microsystems has released a DTD for its extension to HTML. The patient people at HALsoft reverse-engineered a DTD for validation purposes, but as new browser versions are released, there's no guarantee that the DTDs will be updated.

Gerald Oskoboiny's Kinder, Gentler Validator

During the brightest days of the HALsoft validator's reign, the two most commonly heard cries among Web developers were "We have to validate" and "Can anybody tell me what this error code means?"

Gerald Oskoboiny, at the University of Alberta, was a champion of HTML Level 3 validation and was acutely aware that the HALsoft validator did not make validation a pleasant experience. He developed his Kinder, Gentler Validator (KGV) to meet the validation needs of the developer community while also providing more intelligible error messages.

KGV is available at http://ugweb.cs.ualberta.ca/~gerald/validate/. To run it, just enter the URL of the page to be validated. KGV examines the page and displays any lines that have failed, with convenient arrows pointing to the approximate point of failure. The error codes are in real English, not SGML-ese.

Notice that each message contains an explanation link. The additional information in these explanations is useful.

Given the fact that KGV uses the same underlying validation engine as WebTech's program, there's no reason not to use KGV as your primary validation tool.

What to Do with Validation Results

Programmers learned years ago to be suspicious of code that gets through the compiler's error-checker on the first try. Humans at their best are not precise enough to satisfy the exacting requirements of the input parsers that try to make sense of our programs. For the most part, HTML pages are no different. Most of the time, pages of HTML fail somewhere in the validation process.

Six Common Problems That Keep Sites from Validating

There are many reasons that pages won't validate, and you can do something to resolve each of them. The following sections cover the problems in detail.

Netscapeisms

Netscape Communications Corporation has elected to introduce new, HTML-like tags and attributes to enhance the appearance of pages when viewed through its browser. The strategy appears to be working because in February 1996, BrowserWatch reported that over 90 percent of the visitors to its site used some form of Netscape.

There is much to be said for enhancing a site with Netscape tags, but unless the site is validated against the Netscape DTD (which has its own set of problems), the Netscape tags will cause the site to fail validation.

Table 2.2 is a list of some popular Netscape-specific tags. Later we describe a strategy for dealing with these tags. Chapter 3, "Deciding What to Do About Netscape," describes how to get the best of both worlds-putting up pages that take advantage of Netscape, while displaying acceptable quality to other browsers that follow the standard more closely.

Table 2.2 Common Netscape Tags and Attributes That Can Be Mistaken for Standard HTML

Tag	Attribute
`<BODY>`	`BGCOLOR TEXT LINK ALINK VLINK`
Multiple <`BODY>` tags `<CENTER>`Table caption with embedded headers (for example, `<TABLE><CAPTION><H2>...</H2></CAPTION>...`) `<TABLE WIDTH=400> <UL TYPE=Square> <HR SIZE=3 NOSHADE WIDTH=75% ALIGN=Center> <FONT...> <BLINK> <NOBR> <FRAME>`, `<FRAMESET>`, `<NOFRAME> <SCRIPT> <EMBED>`	No longer supported by Netscape

Using Quotation Marks

A generic HTML tag consists of three parts:

<TAG ATTRIBUTE=value>

You might have no attribute, one attribute, or more than one attribute.

The value of the attribute must be enclosed in quotation marks if the text of the attribute contains any characters except A through Z, a through z, 0 through 9, or a few more such as the period. When in doubt, quote.

Thus, format a hypertext link something like the following:

<A HREF="http://www.whitehouse.gov"

It is an error to leave off the quotation marks because a forward slash is not permitted unless it is within quotation marks.

It is also a common mistake to forget the final quotation mark:

<A HREF="http://www.whitehouse.gov

The syntax in this example is accepted by Netscape 1.1, but in Netscape 2.0 the text after the link doesn't display. Therefore, a developer who doesn't validate-and who instead checks the code with a browser-would have seen no problem in 1995 putting up this code and checking it with then-current Netscape 1.1. By 1996, though, when Netscape 2.0 began shipping, that developer's pages would break.

Keeping Tags Balanced

Most HTML tags come in pairs. For every <H1> there should be an </H1>. For every <EM> there must be an </EM>. It's easy to forget the trailing tag and even easier to forget the slash in the trailing tag, leaving something like the following:

<EM>This text is emphasized.<EM>

Occasionally, one also sees mismatched headers like the following:

<H1>This is the headline.</H2>

Validators catch these problems.

Typos

Spelling checkers catch many typographical errors, but desktop spelling checkers don't know about HTML tags, so it's difficult to use them on Web pages. It's possible to save a page as text and then check it. It's also possible to check the copy online using a spelling checker, such as WebSter, located at http://www.eece.ksu.edu/~spectre/WebSter/spell.html.

What can be done, however, about spelling errors inside the HTML itself? Here's an example:

<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINKS="#0000FF" ALINKS="#FF0000" VLINKS="#FF00FF">

The human eye does a pretty good job of reading right over the errors. The above tag is wrong-the LINK, ALINK, and VLINK attributes are typed incorrectly. A good browser just ignores anything it doesn't understand, so the browser acts as though it sees the following:

<BODY BGCOLOR="#FFFFFF" TEXT="#000000">

Validators report incorrect tags such as these so that the developer can correct them.

Incorrect Nesting

Every tag has a permitted context. The structure of an HTML document is shown in Listing 2.1.

Listing 2.1 General Structure of an HTML Document

<HTML>
 <HEAD>
  Various head tags, such as TITLE, BASE, and META
 </HEAD>
 <BODY>
  Various body tags, such as <H1>...</H1>,
      and paragraphs <P>...</P>
 </BODY>
</HTML>

While most developers don't make the mistake of putting paragraphs in the header, some inadvertently do something like the following.

<P><STRONG>Here is a key point.</STRONG>
<P>This text explains the key point.
<P><EM>Here is another point</EM>

The above is valid HTML. As the site is developed, the author decides to change the emphasized paragraphs to headings. The developer's intent is that the strongly emphasized paragraph will become an H1; the emphasized paragraph will become an H2. Here is the result:

<H1>Here is a key point.
<P>This text explains the key point.
<H2>Here is another point.</H1>
</H2>

Even the best browser would become confused by this code, but fortunately, a validator catches this error so the developer can clarify the intent.

Forgotten Tags

Developers frequently omit "unnecessary" tags. For example, the following code is legal HTML 2.0:

<P>Here is a paragraph.
<P>Here is another.
<P>And here is a third.

Under the now-obsolete HTML 1.0, <P> was a paragraph separator. It was an unpaired tag that typically was interpreted by browsers as a request for a bit of white space. Many pages still are written this way:

Here is a paragraph.<P>
Here is another.<P>
And here is a third.<P>

But starting with HTML 2.0, <P> became a paired tag, with strict usage calling for the formatting shown in Listing 2.2.

Listing 2.2 Strict Usage Calls for Pairs of <P> Tags Around Each Paragraph

<P>
Here is a paragraph.
</P>
<P>
Here is another.
</P>
<P>
And here is a third.
</P>

While the new style calls for a bit more typing and is not required, it serves to mark clearly where paragraphs begin and end. This style helps some coders and serves to clarify things for browsers. Thus, it often is useful to write pages using strict HTML and validate them with strict DTDs.

What About Netscape Tags?

Validation is intended to give some assurance that the code will display correctly in any browser. By definition, browser-specific extensions will display correctly only in one browser. Netscape draws the most attention, of course, because that browser has such a large market share. Netscape Communications has announced that when HTML 3.0 is standardized, Netscape will support the standard.

Note

Many other browsers, such as Microsoft's Internet Explorer, currently support some or all of the Netscape extensions.

Thus, you may decide it's reasonable to validate against HTML Level 2 Strict and then add enough HTML Level 3 features to give your page the desired appearance. The resulting page should validate successfully against the HTML Level 3 (expired) standard.

Finally, if the client wants a particular effect (such as a change in font size) that can be accomplished only using Netscape, you have to use the Netscape tags and do three things:

Validate against the HTML 3.0 standard. Any failures should be attributable to the Netscape-specific tags and attributes.
Validate against the Netscape DTD, such as it is. If the page fails validation, make sure that it's because the page uses a new feature that's not yet in the DTD. Test the page against multiple versions of Netscape, even if it passes validation. Test the page with other popular browsers to see how they handle the Netscape tags. Most browsers ignore tags that they don't understand, but that doesn't mean the result will look good.
Make sure that the client understands how the page will look in browsers other than Netscape. Both the client and developer should be satisfied with the results.

If the desired page (as enhanced for Netscape) doesn't look acceptable in other browsers, don't just mark the page "Enhanced for Netscape!" For many reasons, at least ten percent of the market does not use Netscape. Various estimates place the size of the Web audience at around 30,000,000 people. Putting "Enhanced for Netscape!" on a site turns away 3,000,000 potential customers. A better solution is to redesign the page so that it takes advantage of Netscape-specific features but still looks good in other browsers. Failing that, you might need to prepare more than one version of the page, and use META REFRESH or another technique to serve up browser-specific versions of the page. This is a lot of extra work but is better than turning away ten percent of the potential customers or having them see shoddy work.

The good news is, most pages can be made to validate under HTML 3.0 and then can be enhanced for Netscape without detracting from their appearance in other browsers. Chapter 3, "Deciding What to Do About Netscape," discusses techniques for preparing such pages.

Keeping Track of Validation Results

Validation results are to HTML as compile-time warnings and errors are to conventional programmers. Since about 1990, conventional programmers have been learning about the importance of software process and about the role of process improvement in quality management.

HTML development can benefit from the same lessons. Watts S. Humphrey of the Software Engineering Institute writes in his 1995 book, A Discipline for Software Engineering (Addison-Wesley), "By analyzing your defect data, you can generate a host of valuable analyses and reports…the following are some examples:

A table of the numbers of defects injected and removed by phase. While defects are easy to count for small programs, they are much more difficult for large programs. With a defect database, counting them is a simple matter.
Data on the numbers and types of defects found in a specific phase.
Data on the numbers of defects that were found in the product at phase entry but not found during that phase. An example would be the number of defects missed in a code review…
The time required to fix a defect as a function of the phase in which it was removed…

You can obtain a great deal of useful information from a defect database. Because the amount of defect data can become very large, you will likely find it helpful to enter these data promptly after you complete developing each program. It is even a good idea to do this as part of the postmortem phase."

All validators and validator-like tools provide an HTML page as a result of their analysis. It's a good idea to print this out for each page in the site and save it with the printout of the site pages. Be sure to note the date and time on the printout. Use the defects reported on these documents to fill in the defect log.

It also is a good idea to keep a multisection notebook beside the computer. Many sophisticated personal organizers can be readily adapted to this task, but simple three-ring binders are adequate.

Put the following tabs on the major sections: Time Log, Notes, Validation, Comments, Defects, and Summary.

The first section contains preprinted time logs. Record the time in minutes as accurately as possible. At the end of each day, move a blank log to the front of this section so that you never have to hunt through the book to find out where to record your time.
The second section of the notebook consists of blank paper. Number each page, or buy a notebook with prenumbered pages. Resist the temptation to tear out a page. Use this section to record notes on the development of the site as a whole, and on each page of the site. Remind yourself why you built each page the way you did. Maybe you ran an experiment to see if it would validate. Maybe you set up a page to see how it would look in various browsers. Print these results and add them to the notebook. (Keep a three-hole punch near your printer, or better yet, buy paper that is predrilled for a three-ring binder.)
After you run a validator, put the printed results into this notebook. Be sure to initial and date each error to show that you changed each page to make it pass the validation tests.
Include any written comments from evaluators, as well as abstracts from logs established while online evaluators were working. Put the date and time on everything you include.
The next section of the notebook contains the defect log. Each time a defect is identified (through testing, validation, or a problem report), record the phase where the defect was found and removed, and the phase in which the defect was injected. Leave the "After Development" line empty until after the site is released.
When the project begins, fill out the top section and the planning data on the Project Plan Summary. The estimated size should be based on the number of lines of HTML and copy to be prepared. Over time, you'll develop the correlation between estimated size and actual size and between time and size. By building a database of this information, you'll be able to more precisely estimate the resources necessary to build a Web site, and you'll be able to offer more competitive quotes (if you're a contractor) or estimates to management (if you're an employee).
When the site is finally released, identify (by phase) where defects are being injected. Think about how you work in your shop and what tools are available to you. Devise ways to stop defects from happening. If you can't prevent defects, at least find ways to detect them as early as possible. These steps increase the quality of the site, reduce production and maintenance costs, and make you more competitive.

Bear in mind that, at first, changing the way you work will have a negative effect. All this recording and analysis of data takes time. Moreover, improving processes can be temporarily unsettling to staff members. Figure 2.1 illustrates the fact that, when you begin process improvement, performance dips before it rises.

Figure 2.1: Process change causes a decline in throughput before it builds improvement.

Do not begin a major process improvement effort during a particularly busy time. Treat the effort as you would any major project. In the long run, improvements in your team's effectiveness will more than make up for any time lost.

Validator-Like Tools

WebTech and KGV are formal validators-they report places where a document does not conform to the DTD. A document can be valid HTML, though, and still be poor HTML.

What They Don't Teach You in Validator School

Part of what validators don't catch is content-related. Content problems are caught by copywriters, graphic artists, and human evaluators, as well as review by the client and developer. There are some other problems that can be caught by software, even though they are perfectly legal HTML.

Lack of `ALT` Tags

The following is an example of code that passes validation but is nonetheless broken:

<IMG SRC="Graphics/someGraphic.gif" HEIGHT=50 WIDTH=100>

The problem here is a missing ALT tag. When users visit this site with Lynx or with a graphical browser with image loading turned off, they see a browser-specific placeholder. In Netscape, they see a broken graphic. In Lynx, they see [IMAGE].

By adding the ALT attribute, browsers that cannot display the graphic instead display the ALT text.

<IMG SRC="Graphics/someGraphic.gif" ALT="[Some Graphic]"
HEIGHT=50 WIDTH=100>

Out-of-Sequence Headings

It's not an error to skip heading levels, but it's a poor idea. Some search engines look
for <H1>, then <H2>, and so on to prepare an outline of the document. Yet the code in Listing 2.3 is perfectly valid.

Listing 2.3 Using Headings Out of Sequence

<H2>This is not the top level heading</H2>
<P>
Here is some text that is not the top-level heading.
</P>
<H1>This text should be the top level heading, 
but it is buried inside the document</H1>
<P>
Here is some more text.
</P>

Some designers skip levels, going from H1 to H3. This technique is a bad idea, too. First, the reason people do this is often to get a specific visual effect, but no two browsers render headers in quite the same way, so this technique is not reliable for that purpose. Second, automated tools (like some robots) that attempt to build meaningful outlines may become confused by missing levels.

There are several software tools available online that can help locate problems like these.

Doctor HTML

One of the best online tools is Doctor HTML, located at http://imagiware.com/RxHTML.cgi. Written by Thomas Tongue and Imagiware, Doctor HTML can run eight different tests on a page. The following list explains the tests in detail:

Document Structure-This test looks at pairs of opening and closing tags. It highlights unpaired tags in a table, by tag type. The Document Structure test does not look at forms or tables-those are handled separately. The results of the Document Structure test are shown in Figure 2.2.

Figure 2.2: Doctor HTML's Document Structure Report provides a quick look at possible tag mismatches.

Table Structure-This test looks for matching pairs of table tags and for stray table tags that appear outside any valid table.
Form Structure-The Form Structure test looks at the syntax of INPUT tags inside forms. The current version of Doctor HTML ignores the SELECT and TEXTAREA elements.
Image Analysis-One of the most useful tests is the Image Analysis test performed against IMG tags. Doctor HTML loads every image on the page, measures its size, determines its dimensions, and gives an estimate of the time it will take to download the image over a 14.4-Kbps modem. The program also reports the dimensions (HEIGHT and WIDTH) and the number of colors in each graphic-the factors that determine overall size and download time. Figure 2.3 shows Doctor HTML's image analysis test.

Figure 2.3: Doctor HTML's Image Analysis Test tells the developer which graphics contribute most to download time, and how to fix them.

Image Syntax-If ALT, HEIGHT, or WIDTH attributes are missing in an IMG tag, the Image Syntax test notes the problem.
Spelling Check-Unlike a spelling checker on the development machine (which sees the HTML tags), Doctor HTML checks the words that you (and your site's visitors) will see on-screen.
Hyperlink Analysis-Another useful test, Hyperlink Analysis, exercises all links that leave the page. The results of this test are shown in Figure 2.4. Links that take more than ten seconds to return are reported as "timed out." Links that lead to a server error are listed as "failed."

Figure 2.4: Doctor HTML's Hyperlink Analysis Test shows which links are suspect.

Caution

This test has a difficult time with on-page named anchors such as

<A HREF="#more">.

Sometimes a link returns an unusually small message, such as This site has moved. Doctor HTML shows the size of the returned page, so that such small messages can be tested manually.
Command Hierarchy-Doctor HTML shows an outline of the document based on the HTML tags. This command hierarchy is used to determine whether the document has an unusual structure, such as a missing HEAD or out-of-order headers.
Summary-The Doctor HTML summary appears at the bottom and is the default page following the request for testing. Figure 2.5 shows a typical summary report. If anything happens so that the Doctor cannot return all the data, the summary does not appear. From the summary, you can link directly to the relevant portions of the report.

Figure 2.5: Doctor HTML's summary report contains a wealth of information about the page.

Weblint

Another online tool is the Perl script Weblint, written by Neil Bowers of Khoral Research. Weblint is distinctive in that it's available online at http://www.unipress.com/weblint/and can also be copied from the Net to a developer's local machine. The gzipped tar file of Weblint is available from ftp://ftp.khoral.com/pub/weblint/weblint-1.014.tar.gz. A ZIPped version is available at ftp://ftp.khoral.com/pub/weblint/weblint.zip. The Weblint home page is http://www.khoral. com/staff/neilb/weblint.html.

Tip

KGV (described earlier in this chapter) offers an integrated Weblint with a particularly rigorous mode called the pedantic option. You'll find it worthwhile to use this service.

What Is a Lint?

The original C compilers on UNIX let programmers get away with many poor practices. The language developers decided not to try to enforce good style in the compilers. Instead, compiler vendors wrote a lint, a program designed to "pick bits of fluff" from the program under inspection.

Weblint Warning Messages

Weblint is capable of performing 24 separate checks of an HTML document. The following list is adapted from the README file of Weblint 1.014, by Neil Bowers.

Weblint can check the document for the following:

Basic structure
Unknown elements and element attributes
Context checks (where a tag must appear within a certain element)
Overlapped elements
A TITLE in the HEAD element
An ALT attribute in each IMG tag
Illegally nested elements
Mismatched tags (for example, <H1>...</H2>)
Unclosed elements (for example, <HEAD>...)
Multiple occurrences of elements that should only appear once
Presence of obsolete elements
Odd number of quotation marks in tag
Proper order of headings
Potentially unclosed tags (for example, <EM>...)
Markup embedded in comments (because this can confuse some browsers)
Use of here as anchor text
Use of tags where attributes are expected
Existence of local anchor targets
Case of tags
A <LINK REV=MADE HREF="mailto:...> in HEAD element
HTML 3 elements such as TABLE, MATH, and FIG
Leading and trailing whitespace in certain container elements (for example, <A...>)
Optional support for the Java APPLET and PARAM elements
Optional support for Netscape tags
When you run Weblint from the command line, the following combination of checks gives a document the most thorough workout:

weblint -pedantic -e upper-case, bad-link,
require-doctype [filename]

The -pedantic switch turns on all warnings except case, bad-link, and require-doctype.

Note

The documentation says that -pedantic turns on all warnings except case, but that's incorrect.

The -e upper-case switch enables a warning about tags that aren't completely in uppercase. While there's nothing wrong with using lowercase, it's useful to be consistent. If you know that every occurrence of the BODY tag is <BODY> and never <body>, <Body>, or <Body>, then you can build automated tools that look at your document without worrying about tags that are in nonstandard format.

The -e..., bad-link switch enables a warning about missing links in the local directory. Consider the following example:

<A HREF="http://www.whitehouse.gov/"The White House</A>
<A HREF="theBrownHouse.html">The Brown House</A>
<A HREF="#myHouse">My House</A>

If you write this, Weblint (with the bad-link warning enabled) checks for the existence of the local file theBrownHouse.html. Links that begin with http:, news:, or mailto: are not checked. Neither are named anchors such as #myHouse.

The -e..., require-doctype switch enables a warning about a missing <!DOCTYPE...> tag.

Notice that the -x netscape switch is not included. Leave it off to show exactly which lines hold Netscape-specific tags. Never consider a page done until you're satisfied that you've eliminated as much Netscape-specific code as possible, and that you (and your client) can live with the rest. See Chapter 3, "Deciding What to Do About Netscape," for more specific recommendations.

If we use the Weblint settings in this section and the sample code we tested earlier in the chapter with the WebTech validator and KGV, Weblint gives us the warning messages shown in Listing 2.4.

Listing 2.4 Numerous Warnings of Weblint

       line 2: <HEAD> must immediately follow <HTML>
       line 2: outer tags should be <HTML> .. </HTML>.
       line 4: odd number of quotes in element 
       <BODY BACKGROUND="Graphics/white.gif>.
       line 4: <BODY> must immediately follow </HEAD>
       line 4: <BODY> cannot appear in the HEAD element.
       line 5: <H1> cannot appear in the HEAD element.
       line 6: <P> cannot appear in the HEAD element.
       line 8: <P> cannot appear in the HEAD element.
       line 11: unmatched </HTML> (no matching <HTML> seen).
       line 0: no closing </HEAD> seen for <HEAD> on line 2.
HTML source listing:
     1.<!-- select doctype above... -->
     2.<HEAD>
     3.<TITLE>Test</TITLE>
     4.<BODY BACKGROUND="Graphics/white.gif>
     5.<H1>This is header one</H1>
     6.<P>
     7.This document is about nothing at all.
     8.<P>
     9.But the HTML is not much good!
    10.</BODY>
    11

Because Weblint is a Perl script and is available for download, you should pull it down onto the development machine. Here is an efficient process for delivering high-quality validated pages using a remote server:

Check out all pages from the Configuration Control System and test them against Weblint on the local development machine. Use the -pedantic and -e upper-case, bad-link, require-doctype switches.
Once all the pages in a site are clean according to Weblint, make a final pass at the directory level:
weblint -pedantic -e upper-case, bad-link, require-doctype -x netscape [site-directory-name]

Weblint runs recursively through the directory. This check ensures that all subdirectories have a file named index.html (so that no one can browse the directories from outside the site) and serves as a double-check that all files have been linted.

Note

For this step, the -x netscape option is turned on. This option allows Weblint to read Netscape-specific tags without issuing a warning.
Copy the files from the development machine to the online server.
Test each page of the site online with KGV and the integrated Weblint. Make sure that each page is error-free. Figure 2.6 shows the online version of Weblint in action.

Figure 2.6: Weblint is aggressive and picky-just what you want in a lint.
Test each page of the site with Doctor HTML. Doctor HTML evaluates a different set of criteria for each page and can show things that neither Weblint nor KGV has caught. Change pages as required so that they pass inspection by Doctor HTML. Return to Step 1 or Step 2 as required after making the changes.
Once all pages in a site pass all three checks (local Weblint, KGV with integrated Weblint, and Doctor HTML) check them back into the Configuration Control System. Annotate them with the fact that they have fully passed these tests.

The HTML Source Listing

With some online tools, such as KGV, any problematic source line is printed by the tool. With others, such as Weblint, it isn't. The forms interface for Weblint, available through http://www.ccs.org/validate/ turns on the source listing by default. It's best if you leave it at that setting.

Integrating Test Results

As each of the previous tests is run, remember to print the resulting pages. Do not just print the final test; capture the defect data. Look for patterns: Do you tend to forget trailing quotes? Do you leave off closing anchor tags? Notice which defects are most common in your work, and try to catch yourself when making them in the future. It's faster to avoid the mistake in the first place than to validate, remove the defect, and retest.

Keep track of your results in the defect log. The more sites you develop, the better your code will become and the faster your development will be. Faster development (with no loss in quality) leads to lower costs.

Organizing for Maintenance

Once the site has passed all human-based tests (copywriter, graphic artist, and Red Team) and automated tests (KGV, Weblint, and Doctor HTML), print two sets of the pages for the entire site and go over them with the client. Have the client initial each page as "Released for distribution." Leave one set with the client, along with a set of change request forms.

As explained earlier, it's always a good idea to encourage the client to develop new material for the site. While the business arrangement may vary, the principle of regular updates does not. To increase the likelihood of success with the site, encourage the client to change something about the site at least once a month. The client should mark prospective changes on his or her copy of the pages and should submit the marked-up pages with a change request form.

If a client is on monthly maintenance, for example, assign her a particular day each month when her changes will be made. For example, Susan might schedule Bob's Homes for maintenance on the second Tuesday of each month. On the morning of that date, she goes through all the change requests that have come in from Bob since the previous month, and follows this process:

Check out from the Configuration Control System the page or pages needing to be changed.
Apply the changes.
If the desired changes warrant, staff the pages past one or more members of the Red Team, the copywriter, or graphic artist-or any combination of these people.
Run each page through the local copy of Weblint. Correct any errors found. Remember to retain copies of error reports for the project notebook.
Put the pages in a private part of the online server and run them through KGV (with integrated Weblint) and Doctor HTML. Again, print and save the resulting data. Change the pages as required to correct any problems; then return to Step 3 as needed.
Print the tested pages; FAX or carry them to the client for review and sign-off. Be sure to keep the approved changes in a safe place, and add a copy to the project notebook.
Check the approved pages back into the Configuration Control System. Be sure to note the number of the Change Request so that each change can be traced back to a particular request from the client.
Move the changed pages from the private portion of the server to the actual site, replacing the original pages.

Be sure to keep track of the time spent performing maintenance in the project notebook. Over time, develop an estimate of how much effort it takes to implement common changes. Use this figure as the basis for each site's maintenance budget.

Handling Defects

You also should supply the client with a set of customer trouble reports. These documents serve as high-priority forms the client can use to notify you that something is not working on the site.

Somewhere on the site, include an online customer trouble report. You might want to put it with the information about contacting the Webmaster. We'll talk about online forms in Chapter 7, "Extending HTML's Capabilities with CGI." For now, just use the following:

To report problems or offer suggestions about this site,
please contact the webmaster at
<A HREF="mailto:webmaster@xyz.com">webmaster@xyz.com</A>.

The above format presents the e-mail address to users whose browsers do not support mailto or who have printed the page and don't have immediate access to the online
version.

Caution

If you have followed these recommendations about human tests and online verification, the major source of trouble reports will come from people using less-than-perfect browsers or from people new to the Web who have missed something basic. Do not ignore these reports. If a user reports that he or she cannot see something on the site or that part of the site is gibberish, try to duplicate the problem using that browser or another. Determine whether the user's browser is broken or an actual problem exists on the site.

If you determine that the HTML is valid but the browser is broken, look for easy ways to work within the browser's limitations. For example, Netscape now supports the HTML 3.0 standard ALIGN attribute, so you can center a paragraph with the following:

<P ALIGN=Center>

Some browsers have copied Netscape's original <CENTER> tag, however, and these browsers usually ignore ALIGN. To support those browsers, use a redundant construct like this:

<CENTER>
<P ALIGN=Center>

Check BrowserCaps occasionally to see which browsers support which features. Monitor BrowserWatch and the mailing lists of the HTML Writers' Guild to see which browsers are becoming popular and which browsers are reported broken.

Be sure to report broken browsers to the manufacturer. Sometimes they have a newer version out, and you can give that information to the person who reported the problem.

Storing Pages Together

In order to reduce confusion (and the associated maintenance costs), it is important to keep the site together, both in print and on disk.

Building a Print Archive

Once the site is released, place the project notebook aside. Pull it down once a month to do maintenance on the site, and continue to update the defect log and timesheets. During the monthly maintenance, look over the site for trends concerning time or defects, and use that information as part of your effort toward continuous process improvement.

Building an Archive on Disk

Configuration management systems like SCCS and RCS do not store every version of a file. They instead store deltas, the differences between one file and the next. Nevertheless, over time the current files and the configuration archives can begin to take up a lot of disk space. Once the site is released, you can archive these files using the UNIX tar command, and then compress them with something like the gzip utility. This utility, widely available for all platforms on the Net, compresses a typical HTML tar archive 80 percent or more. The UNIX compress utility, often shipped with the operating system, usually doesn't do as well, topping out between 50 and 60 percent.

To make a compressed tar archive for a site, start in the development directory and enter:

cd ..

to go one directory level above the development directory. If the development directory is named, say, XYZ, enter

tar cvf - ./XYZ | gzip - > XYZ.tar.gz

cd changes the working directory to the parent directory of XYZ. This way, using the name ./XYZ in the tar command forces tar to remember the relative path to the files. If the tar archive later needs to be restored elsewhere in the directory tree, tar won't try to re-create the exact path.

The use of a hyphen (-) in the tar and gzip commands says that tar and gzip are communicating via standard input and output. tar writes its results out the stdout pipe, and writes the verbose listing of the files it's archiving to stderr.

Finally, be sure that the file system with the Web sites is backed up periodically, and at least some of the backups are stored off-site. Everybody gets burned once-the smart ones get burned only once.

Given enough time and money most anyone can build an effective Web site. The trick is to do the job on a budget. Every minute that is spent fixing a bug or every dollar spent changing HTML to accommodate a new browser is a resource wasted and serves to hurt the effectiveness of the site.

Once a site is up, people are using it, visitors rely on it being relatively stable, and the Web community expects the URLs to not change very much. The opportunity for the developer to make wholesale changes is lost. During development, by contrast, there is much less pressure. Under tight deadlines the difference is not always apparent, but it is nonetheless a fact. Everything that can be done during development to reduce maintenance costs pays high dividends throughout the life of the site-and frees resources so they may be used to enhance content and improve the site.

Chapter 2

Reducing Site Maintenance Costs Through Testing and Validation

CONTENTS