Chapter 10

Integrating Forms with Mailing Lists



Instead of directing e-mail to a specific user or a static user list, it is possible to direct the output of a form to a wider audience. This chapter shows how the Webmaster can use CGI to facilitate access to mailing lists and their archives.

E-mail is the most common application on the Net; far more people use e-mail than have ever visited a Web site. The Web, of course, is the fastest-growing service on the Net. This chapter shows how Web sites can harness the power of e-mail.

Why E-Mail Is So Powerful

Ask most people about the Internet, and they think about the World Wide Web. But electronic mail, and not the Web, is the most heavily used application on the Internet. There was a time when most e-mail was not compatible. A user on CompuServe could not send e-mail to someone on GEnie. America Online did not pass traffic to the Internet. Today, all popular networks intercommunicate. A user on the Internet can correspond as easily with an America Online subscriber as with someone using the same Internet server. This means that e-mail can reach many more people than the Web.

While its name makes it sound like e-mail is similar to conventional mail, in reality the functionality of e-mail is closer to that of the telephone. E-mail has been called "voice mail done right." Many people check their e-mail several times a day; they are accustomed to sending a message and getting a reply the same day, often within a few hours. But unlike the telephone, e-mail can be sent to hundreds or even thousands of people as easily as it can be sent to one. The technology that enables this feat is the list server.

To understand why list-server software is useful, this section contrasts running a mailing list by hand with using an automated list-server package. The two list-server packages discussed are Majordomo and LISTSERV. The focus is on how the list appears to the user, rather than the mechanics of how to operate the list-server software.

Running a Mailing List by Hand

Suppose the real estate broker from the example site introduced in Chapter 1, "How to Make a Good Site Look Great," wants to keep the Realtors informed about new listings and new developments in the company and the industry. Before e-mail, the broker might have sent out a company newsletter, flyers, or data packets describing new homes on the market. With e-mail, the broker can do the same thing faster, keeping agents even more up-to-date.

The broker might begin with a dozen or so agents, putting their names on a list in the broker's mail client. With many mail clients, you can set up a mail alias, so the list owner sends a message to the name agents@gsh.com, and the client software sends the message to everyone listed under the agents alias.

As the list grows, the list owner finds that the act of maintaining the list consumes more and more time. As the firm grows, agents have to be added to the list. From time to time, agents leave. Occasionally, people outside the firm (such as builders or mortgage bankers) may join or leave the list. At some point, the broker may even decide to operate a list that is available to the public at large. If any of these addressees change e-mail accounts, their mail will bounce; that is, the mail is returned to the sender, undelivered. At some point, the members of the list decide they want to communicate among themselves, so they send mail to the list owner asking that it be forwarded to the other members. They mix administrative messages to the list owner with messages intended for the list itself, leading to a few embarrassing moments. The load on the system, not to mention the list owner, goes up. One day, the list owner takes a few days off, the list grinds to a halt, and everyone is unhappy.

The solution, of course, is to turn the mechanics of maintaining the list over to a computer. A program that does basic list management can be written fairly quickly, but even that work is not necessary. Several packages are available to automate the task. These packages support several variants of the basic mailing list:

The list owner can also maintain archives of the messages that have passed through the list. The second half of this chapter shows how to make these archives accessible over the Web.

Majordomo

Majordomo is a collection of Perl scripts developed by Brent Chapman and John Rouillard to automate some of the tasks of the list manager. The latest copy of Majordomo is available at ftp://ftp.greatcircle.com/pub/majordomo/. To learn more about Majordomo, subscribe to the majordomo-users mailing list by sending a request to majordomo@GreatCircle.com.

Majordomo's Commands

Users send commands to Majordomo by e-mail, and they get their answers the same way. For example, to subscribe to the list that discusses Majordomo itself, a user named Jones with an e-mail address of jones@xyz.com would send the message

subscribe majordomo-users jones@xyz.com
end
to majordomo@GreatCircle.com.
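A CGI script that wants to spare the user from typing this command can assemble the same message itself. The following is a minimal sketch in Python, not part of Majordomo; the function name majordomo_request is made up for illustration, and the sketch only builds the message (handing it to a mail transport such as the standard smtplib module is left out).

```python
from email.message import EmailMessage


def majordomo_request(command, list_name, subscriber, server_address):
    """Build a plain-text command message for a Majordomo server.

    Hypothetical helper for illustration; it only constructs the message.
    """
    msg = EmailMessage()
    msg["From"] = subscriber
    msg["To"] = server_address
    # Majordomo reads commands from the message body, one per line.
    # The "end" line stops processing, so a trailing mail signature
    # is not mistaken for a command.
    msg.set_content(f"{command} {list_name} {subscriber}\nend\n")
    return msg


request = majordomo_request(
    "subscribe", "majordomo-users", "jones@xyz.com", "majordomo@GreatCircle.com"
)
print(request.get_content())
```

The same helper covers unsubscribe requests by passing a different command word.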

From the user's point of view, Majordomo affords the following commands:

Some commands, such as Which and Who, send potentially sensitive information, and are frequently restricted or disabled.

Interacting with Majordomo

Suppose GSH Real Estate decides to operate a mailing list for the general public, describing local investment real estate. The company might call the list investments@gsh.com.

Tip
When you install Majordomo on the system for the first time, edit majordomo.cf as necessary so that each path to a directory or a file is correct. Don't remove the
1;
in the last line. The Perl programs that make up Majordomo load this configuration file with require, and Perl needs the file to end with a true value so that the require statement succeeds.

If the list is operated using Majordomo, the list would be set up with four mail aliases:

Tip
In UNIX, set up aliases in the system's sendmail alias file, usually found at /usr/lib/aliases or /etc/aliases. Each line in the file shows an alias and the usernames that are associated with that alias. Thus
investments: bob, susan, todd@anothersite.net
sets up an alias called "investments" with three users: two local subscribers and one from another site. Any mail addressed to investments on this server will be sent to those three users.
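To make the structure of such an alias line concrete, here is a small Python sketch that splits one line into its alias name and member list. It is an illustration only: parse_alias is a hypothetical helper, and it ignores features of real alias files such as comments, continuation lines, and entries that pipe mail to a program.

```python
def parse_alias(line):
    """Split 'name: member, member, ...' into (name, [members])."""
    name, _, members = line.partition(":")
    return name.strip(), [m.strip() for m in members.split(",") if m.strip()]


alias, members = parse_alias("investments: bob, susan, todd@anothersite.net")
# Members without an "@" are local users on this server.
local = [m for m in members if "@" not in m]
print(alias, members, local)
```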


Caution
During the installation of Majordomo you run make to compile the wrapper program. Before running make wrapper, be sure to check W_BIN and W_MAJORDOMO_CF in the makefile. W_BIN should point to Majordomo's home directory on the machine; W_MAJORDOMO_CF should point to the location of the configuration file majordomo.cf.
While you're in the makefile, be sure to comment out the non-POSIX section and uncomment the POSIX lines if your version of UNIX is POSIX-compliant.

Note that, in general, the term bounced message refers to a message returned as undeliverable. Majordomo's documentation uses the term returned message for such mail, and it reserves the term bounced message for mail that requires special attention before being sent out. The terms as used in this section are consistent with Majordomo's usage.

Once the aliases are set up, the responsibility for handling the mailing list can be allocated to people. On a low-volume mailing list, one person may receive the mail from both administrative aliases (-approval and owner-). On a high-volume mailing list, several people may divide the work.

Most common actions (such as approving subscription requests and moderated messages) are handled with scripts supplied with Majordomo. For example, on a moderated list, Majordomo sends a request for approval for each message received at the list address to the owner- address. The list owner pipes the message to the Perl script approve and enters his or her password to release the message to the list membership.

Messages that fail certain administrative requests are also sent to the list owner. For example, subscribers sometimes get confused and send requests for subscription or unsubscription to the list as a whole. Majordomo can be configured to search for such requests and bounce them to the list owner.
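The idea of screening list traffic for misdirected requests can be sketched as follows. This Python fragment is an assumption-laden illustration, not Majordomo's own check: the pattern ADMIN_REQUEST and the function looks_administrative are invented here, and a real list owner would tune the pattern to the mistakes seen on that list.

```python
import re

# Hypothetical pattern (an assumption for this sketch, not Majordomo's rules):
# a line that starts with "subscribe"/"unsubscribe", or a "delete me" or
# "remove me" plea anywhere in the line.
ADMIN_REQUEST = re.compile(
    r"^\s*(please\s+)?(un)?subscribe\b|\b(delete|remove)\s+me\b",
    re.IGNORECASE,
)


def looks_administrative(subject, body, max_lines=5):
    """Return True if the subject or the first few body lines look like a
    request that should have gone to the administrative address."""
    first_lines = body.splitlines()[:max_lines]
    return any(ADMIN_REQUEST.search(text) for text in [subject, *first_lines])


print(looks_administrative("unsubscribe", ""))
print(looks_administrative("New listing on Elm Street", "Three bedrooms, two baths."))
```

Only the first few lines are examined so that a long discussion that merely mentions subscribing is less likely to be held back.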

Many list owners configure Majordomo with a mail alias that points to the script archive2.pl. On a regular basis (daily, monthly, or yearly, as set by the list owner), this script saves all messages into an archive file. For example, the administrator at GSH Real Estate might set the archive to save all messages sent to the investments list once a month. Then the messages that came through the list in January 1996 would be saved in the archive under the file name investments.9601.
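The naming scheme is easy to reproduce when a script needs to locate an archive file. The Python sketch below derives the monthly name used in the example (list name, a dot, two-digit year, two-digit month). The daily and yearly suffixes are assumptions made by analogy with the monthly one, and archive_name is a hypothetical helper rather than part of archive2.pl.

```python
from datetime import date


def archive_name(list_name, when, period="monthly"):
    """Return the archive file name for the period containing `when`."""
    # Only the monthly form is taken from the text; the daily and yearly
    # forms are assumed for illustration.
    suffix = {"daily": "%y%m%d", "monthly": "%y%m", "yearly": "%y"}[period]
    return f"{list_name}.{when.strftime(suffix)}"


print(archive_name("investments", date(1996, 1, 15)))
```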

LISTSERV

LISTSERV offers the same functionality as Majordomo, but with some important differences (not to mention a different command set). LISTSERV reflects its BITNET heritage. Back when the ancestor of the Internet (called ARPANET) was being connected, some universities started their own network. This alternative to ARPANET, called BITNET, was hosted mainly on IBM mainframes and DEC VAXen. Back in those days no one had heard much about open systems, so while the ARPANET/Internet/UNIX community was standardizing on ASCII and the mail standard RFC 822, BITNET was building on top of the Extended Binary Coded Decimal Interchange Code (EBCDIC) and the 80-column Hollerith card mind-set. The upshot of all this is that BITNET and, consequently, LISTSERV, are, shall we say, a little different.

Note
During the early years of computing, the marketplace was dominated by IBM. IBM had so much market share that it could afford to set its own standards. The academic and research communities, with smaller budgets and different needs, developed a different set of standards. When the dust settled, there were two very different ways of meeting similar requirements. The business community (typified by IBM) submitted jobs in batch (as described in Chapter 12, "Forms for Batching Processes") on 80-column punch cards called Hollerith cards.
Initially most input was numeric, and was encoded in Binary Coded Decimal, or BCD. When the standard was extended to include more characters, it became known as the Extended Binary Coded Decimal Interchange Code. Many programmers consider EBCDIC to be inferior to the more common American Standard Code for Information Interchange, or ASCII, because ASCII supports contiguous letter collating sequences. (If "A" should collate two characters before "C", the difference between the ASCII codes for A and C is 2. This fact allows for a great deal of simplification in many programs.)
The fact that EBCDIC and punch cards are so different from ASCII and simple terminals has led many programmers to speak of the "80-column mind." A humorous tour of these and other terms is given in The New Hacker's Dictionary by Eric Raymond (The MIT Press, 1991). That book is an adaptation of the online "jargon file" that was maintained by hackers on the ARPANET and, later, the Internet, for over 15 years.
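The collating point can be checked directly. The Python sketch below compares character codes in ASCII and in cp037, one common EBCDIC code page (choosing cp037 is an assumption; other EBCDIC code pages differ in detail). The helper code_gap is written for this illustration.

```python
def code_gap(first, second, encoding):
    """Difference between the single-byte codes of two characters."""
    return second.encode(encoding)[0] - first.encode(encoding)[0]


# A and C are two apart in both character sets...
print(code_gap("A", "C", "ascii"), code_gap("A", "C", "cp037"))
# ...but the EBCDIC alphabet has gaps, for example between I and J.
print(code_gap("I", "J", "ascii"), code_gap("I", "J", "cp037"))
```

Because of gaps like the one between I and J, a test such as "is this code between A and Z" is not enough to recognize a letter in EBCDIC, although it is in ASCII.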

The original LISTSERV had centralized management. A human administrator was required to approve all subscription and unsubscription requests. When LISTSERV was revised, the major changes were to allow more automation and less centralization. Nevertheless, some of the nicer features of centralization were retained. For example, you can still send a command that says "Sign me off of all LISTSERVs, everywhere," and that command will get propagated around the world.

LISTSERV's Commands

Like Majordomo, LISTSERV runs on individual machines around the Net (though in the case of LISTSERV, the machines are connected to BITNET). Unlike Majordomo, each copy of LISTSERV talks to other copies on BITNET. Thus, if a user wanted to subscribe to the LISTSERV mailing list POWER-L (which discusses the IBM RISC System/6000 family of computers) but didn't know which machine (BITNET calls them nodes) hosted that list, that user could send the Subscribe request to

LISTSERV@LISTSERV.NET

If the user's Domain Name Server couldn't find LISTSERV.NET, he or she could send the message to the Internet/BITNET gateway at

LISTSERV%LISTSERV.BITNET@CUNYVM.CUNY.EDU

In either case, BITNET would find the correct node and forward the request (in this case, to LISTSERV@VM1.NODAK.EDU).

LISTSERV has four functional areas of commands for the user:

The online user manual, at http://www.earn.net/lug/notice.html, contains a chapter for each functional area. Although the LISTSERV command set is much richer than Majordomo's, it is well documented, and the typical user will need to know only a handful of these commands. As you will see, a Webmaster can use CGI to allow a user to interact with LISTSERV without knowing the commands at all.

Here is a summary of the most frequently used LISTSERV commands. The capital letters in the command show the approved abbreviation.

The following commands are used to review and change the user's personal profile. The profile contains information about whether you want the list in digest mode (if available), index mode (in which only summary information about each message is sent), or mail mode (in which all messages are sent to the user as they come in). On some lists, you can select which topics you want sent. Again, there are many options, all fully described in the online user's guide.

The following commands interact with the LISTSERV file server:

LISTSERV offers a set of commands that interact with the LISTSERV database server. LISTSERV nodes maintain several databases, each of which is documented in the online user's guide. The mailing list archives are stored in a "notebook" database, which has field names like Subject, Sender, Header, and Body.

To access the database, start with a template like the following.

// JOB
DATABASE SEARCH DD=RULES
//RULES DD *
command1
command2
...
/*
// EOJ

In this template, the first line starts the database job. Any line before it will be ignored. The next line specifies that this is a database job and gives the name of the section that holds the database commands (this comes after the DD= keyword). That section name can be called anything; in this template, it is called RULES. To start the command section, LISTSERV needs to see '//', followed by the name, followed by 'DD *'. From here until the line with '/*', LISTSERV interprets each line as a database command. The '// EOJ' terminates the database job. Any lines that appear after the // EOJ are interpreted as non-database commands.

Note that if LISTSERV sees a line that it cannot understand, it ignores it. If the number of such lines exceeds a threshold, the job is abandoned. The most common cause for this behavior is a user leaving a mail signature on. Remember to turn off any mail signature at the end of the message.

Some of LISTSERV's database commands can lead to long lines. If the line is inconveniently long, enter a - (a dash) at the end of the line and continue to the next line. The last line of the command should not have a dash.
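A form handler can wrap the user's commands in this template automatically. The Python sketch below is one way to do it under stated assumptions: build_database_job is a hypothetical helper, the caller decides where a long command is split, and the continuation dash is written with a space in front of it.

```python
def build_database_job(commands, section="RULES"):
    """Return the text of a LISTSERV database job.

    Each command is either a string or a list of pieces. Pieces are written
    on separate lines, and every piece except the last ends in a dash so
    that the command continues on the next line.
    """
    lines = ["// JOB", f"DATABASE SEARCH DD={section}", f"//{section} DD *"]
    for command in commands:
        pieces = [command] if isinstance(command, str) else list(command)
        lines.extend(piece + " -" for piece in pieces[:-1])
        lines.append(pieces[-1])
    lines += ["/*", "// EOJ"]
    return "\n".join(lines) + "\n"


job = build_database_job([
    ["SEARCH 'PC Virus' OR 'Virus Warning' IN BUGS SINCE 01-96 WHERE",
     "SENDER SOUNDS LIKE 'Smith' BUT NOT 'John'"],
    "INDEX",
])
print(job)
```

Letting the caller choose the split points avoids breaking a line in the middle of a quoted search string.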

The common database commands are

The SEARCH command allows a full range of Boolean operators. By default, keywords are ANDed together, but other operators can be given explicitly. For example, LISTSERV interprets the command

SEARCH 'PC Virus' OR 'Virus Warning'

as asking for documents containing either 'PC Virus' or 'Virus Warning'. Single quotes denote a case-insensitive search. If the search string is double-quoted, it must match the case of the text. If the search string contains a quote, it must be escaped by doubling it.
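When search terms come from a Web form, the quoting rules have to be applied by the script. The Python sketch below shows one reading of those rules; it assumes that only the quote character used as the delimiter needs doubling, and the helpers quote_term and search_any are hypothetical names.

```python
def quote_term(text, case_sensitive=False):
    """Quote a search string: single quotes for a case-insensitive match,
    double quotes for a case-sensitive one. An embedded delimiter quote is
    doubled (assumption: other quote characters can be left alone)."""
    quote = '"' if case_sensitive else "'"
    return quote + text.replace(quote, quote * 2) + quote


def search_any(*terms):
    """Build a SEARCH command that matches any of the given terms."""
    return "SEARCH " + " OR ".join(quote_term(term) for term in terms)


print(search_any("PC Virus", "Virus Warning"))
print(quote_term("Don't panic"))
```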

The SEARCH command allows several optional rules. To specify a database to search, add 'IN database' to the search string. Once one or more files have been selected, subsequent invocations of SEARCH will search those files, unless a database is specified. Suppose

SEARCH 'PC Virus' OR 'Virus Warning' IN BUGS

yielded 10 documents. The next call,

SEARCH 'MS-Word' OR 'Microsoft Word'

would search those 10 documents for any mention of Microsoft's word-processing product.

Date rules restrict the search by time. LISTSERV allows SINCE, UNTIL, and FROM rules. So,

SEARCH 'PC Virus' OR 'Virus Warning' IN BUGS SINCE 01-96

returns the hits in the BUGS database since January 1996. Many date-time formats are supported and are described in the online user's guide.

The WHERE or WITH clause supports 12 operators, as shown below:

IS  value
=  value
IS NOT  value
^=  value
>  value
>=  value
<  value
<=  value
CONTAINS  value
DOES NOT CONTAIN  value
SOUNDS LIKE  value
DOES NOT SOUND LIKE  value

These tests can be connected with Boolean operators:

NOT  or  ^
AND  or  BUT  or  &
OR  or  |  or  /

Recall that notebook databases (mailing-list archives) contain the fields Sender, Subject, Header, and Body. So you can say

SEARCH 'PC Virus' or 'Virus Warning' in BUGS SINCE 01-96 WHERE -
SENDER SOUNDS LIKE 'Smith' BUT NOT 'John'

Often, the last command of the first database inquiry will be INDEX. This command causes LISTSERV to return the list of each of the documents found in the search. The user may then issue a subsequent call to PRINT those documents that look most relevant. Thus, the user might submit a job like the one in Listing 10.1.


Listing 10.1  Listing.101: Send This Mail to a LISTSERV List Manager to Run a Query Against the Database

// JOB
DATABASE SEARCH DD=RULES
//RULES DD *
SEARCH 'PC Virus' or 'Virus Warning' in BUGS SINCE 01-96 WHERE-
SENDER SOUNDS LIKE 'Smith' BUT NOT 'John'
SEARCH 'MS-Word' or 'Microsoft Word'
INDEX
/*
// EOJ

You get back a response like the following:

> SEARCH 'PC Virus' or 'Virus Warning' in BUGS SINCE 01-96 WHERE-
SENDER SOUNDS LIKE 'Smith' BUT NOT 'John'
SEARCH 'MS-Word' or 'Microsoft Word'
--> Database BUGS, 4 hits.

> INDEX
Item #   Date   Time  Recs   Subject
------   ----   ----  ----   -------
000001 96/02/15 16:50   42   MS-Word Virus Warning
000002 96/02/16 04:02   89   Microsoft Word Virus Warning
000003 96/02/16 10:59 1239   PC Virus in Microsoft Word
000004 96/03/05 17:48   14   Another Virus Warning re MS-Word
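
A gateway script that receives such a reply can turn the INDEX rows into records before presenting them to the user. The following Python sketch assumes the column layout shown in the sample above (six-digit item number, date, time, record count, subject); actual replies may vary, and IndexEntry and parse_index are names invented for this illustration.

```python
import re
from typing import NamedTuple


class IndexEntry(NamedTuple):
    item: int
    date: str
    time: str
    recs: int
    subject: str


# Assumed row layout, based only on the sample reply above.
ROW = re.compile(
    r"^(\d{6})\s+(\d\d/\d\d/\d\d)\s+(\d\d:\d\d)\s+(\d+)\s+(.*\S)\s*$"
)


def parse_index(reply):
    """Return an IndexEntry for every line of the reply that looks like a row."""
    entries = []
    for line in reply.splitlines():
        match = ROW.match(line)
        if match:
            item, day, clock, recs, subject = match.groups()
            entries.append(IndexEntry(int(item), day, clock, int(recs), subject))
    return entries


sample = """\
> INDEX
Item #   Date   Time  Recs   Subject
------   ----   ----  ----   -------
000001 96/02/15 16:50   42   MS-Word Virus Warning
000003 96/02/16 10:59 1239   PC Virus in Microsoft Word
"""
for entry in parse_index(sample):
    print(entry.item, entry.recs, entry.subject)
```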

The user might then send another message, like the one in Listing 10.2.


Listing 10.2  Listing.102: Send This Mail to a LISTSERV List Manager to Retrieve Specific Database Entries

// JOB
DATABASE SEARCH DD=RULES
//RULES DD *
SEARCH 'PC Virus' or 'Virus Warning' in BUGS SINCE 01-96 WHERE-
SENDER SOUNDS LIKE 'Smith' BUT NOT 'John'
SEARCH 'MS-Word' or 'Microsoft Word'
INDEX
PRINT SENDER, SUBJECT, BODY OF 1, 3-
/*
// EOJ

This message says to send back the Sender, Subject, and Body of document 1 and all documents 3 and above.

The Importance of the Request Address

One of the most common social gaffes is to confuse the administrative address with the address of the list. On an unmoderated list, it is all too common to see messages saying "subscribe," "unsubscribe," or even "Please delete me from this list." On a list managed by Majordomo, such requests should go to the Majordomo address. By convention, listname-request often points to Majordomo. On a list managed by LISTSERV, administrative requests go to LISTSERV.

Making a mistake in this area only serves to tell several thousand people that you didn't take the time and effort to learn how to do it right. Most of those people will politely ignore your mistake. A few will try to help you out. Some will get angry. Take care.

The Front End: Subscribe, Unsubscribe, Posting, and Queries

Majordomo and LISTSERV are just two of the list managers available. List owners have many choices of server software, each of which offers somewhat different commands and somewhat different capabilities. Many list owners would like to decrease the workload on users who want to subscribe or unsubscribe to their lists.

Note that not all list owners want to make it easier for end users. List owners watch their "signal-to-noise ratio" carefully. They are mindful of the fact that if many users post messages with little content, everyone's mailbox fills up, and some longtime members will drop the list. Some of these list owners use the subscription process as a rite of passage, with the logic that anyone bright enough to figure out how to type subscribe thisList might have something worthwhile to say. (Whether this correlation actually exists is a question I'll leave to you.)

Assuming that a Webmaster does want to make it easier for users to interact with mailing lists, there are several techniques one might use.

Subscribe and Unsubscribe Requests

There is a risk associated with making it easy for someone to subscribe: That person may not know how to unsubscribe! Users whose mailbox is filling up every day with postings to a high-volume list that they cannot stop have been known to take desperate measures. The best practice is, therefore, to follow three rules:

Sneaking People onto a Mailing List

Not that anyone here would sneak, mind you. But some users have little understanding of the technology and are more than a bit afraid of it. When they start getting e-mail from people they've never met and have never heard of, they get overwhelmed. (Even those of us who do understand the technology sometimes get overwhelmed by our mailing lists.) If the user can find the person responsible, or someone who seems to be responsible, that user may lash out viciously. Therefore, in setting up a form to put someone on a mailing list, make sure that person knows exactly what he or she is getting into, give the person an estimate of the volume of the list, and tell him or her how to unsubscribe. Remember that terms like moderate volume have no standard meaning. Tell potential subscribers how many messages they may expect per day. Tell them where to find the archive so they can decide if the content justifies the noise. And tell them how to unsubscribe if they ever want to leave the list.

Respecting Mailing List Owners

The decision about how to put people on a mailing list is best left up to the list owner. As mentioned, some list owners do not want to attract large numbers of people who join "at the click of a button." The best practice is to provide Web forms as front ends to your own mailing lists, and maybe for mailing lists in which the owner asks for help. Do not feel free to "help out" a list owner by signing up new users from your form. No matter how thoroughly the page explains that the visitor is about to subscribe to a mailing list, some users will click a link thinking they are going to a new page or signing up for a monthly flyer. Imagine their dismay when they find 40 pieces of e-mail documenting a flame war about some fine point of the topic, and they don't know how to stop it. They've forgotten your URL, didn't bookmark your form, don't ever remember signing up for this list, and now feel they are at the mercy of this crowd of drooling, foul-mouthed heathen who daily invade their in-box. (Did I mention that you should tell them how to unsubscribe?)

Tell Them How to Unsubscribe

The best lists tell people how to unsubscribe in at least three different ways. First, in the initial information packet on the list, they're told exactly how to unsubscribe. Second, the new subscriber gets a welcome message that contains the same information and tells them to keep this message in case they ever want to get off the list. Finally, at the bottom of each message, there's a line added like this one from the APOLOGIA list:

To unsubscribe, send UNSUBSCRIBE APOLOGIA-LIST to MAJORDOMO@ESKIMO.COM

Despite all these precautions, the lists are peppered daily with messages that say "Subscribe," "Unsubscribe," "Please delete me from this list," and "Hi, I'd like to join the discussion…" So if it isn't clear by now, tell them…Well, you get the idea.

Custom Forms, Mailto, and Engine Mail

One off-the-Net script that implements the standard commands is LWgate at http://www.netspace.org/users/dwb/lwgate.html. The system administrator configures it by supplying a list of lists that users may access through LWgate. As of version 1.16, LWgate supports Majordomo, LISTSERV, ListProc, and SmartList command sets.

Figure 10.1 shows an example of LWgate serving as the entry point to the BIG-LINUX mailing list at http://www.netspace.org/cgi-bin/lwgate/BIG-LINUX/.

Figure 10.1: LWgate interface.

Another front-end is MailServ, at http://iquest.com/~fitz/www/mailserv/. It supports at least some commands from each of the following mailing-list managers:

MailServ can also accommodate subscribe, unsubscribe, and comment requests to manually managed lists. For an example of the MailServ user interface, see Figure 10.2, from http://iquest.com/fitzbin/listserv.

Figure 10.2: MailServ allows a Web user to send commands to LISTSERV.

Engine Mail 2.1

As Chapter 7 ("Extending HTML's Capabilities with CGI") showed, there are advantages to a forms interface between the Web and e-mail. Some, but by no means all, Web browsers support mailto: URLs. An interesting problem that comes up when people start integrating e-mail with the Web is this: A large number of people (say, all the people on a campus or at a company) want to receive e-mail through forms. Yet the cost of developing hundreds of essentially identical forms is nontrivial.

One elegant solution is Engine Mail 2.1, available at http://pharmdec.wustl.edu/juju/E.M./engine_mail.html. engine_mail accomplishes two tasks: First, it puts up either generic or custom forms for any users named on a list. (The authors provide a script, do_mail, to facilitate transforming a UNIX list of users, /etc/passwd, into an Engine Mail list.) Second, the script offers a searchable Query/Email gateway so visitors can search for the e-mail address of the person they are trying to reach.

In a nice touch, Engine Mail 2.1 is polylingual. By plugging in language libraries, the system administrator can offer pages in French, Spanish, and Swedish. More language libraries are under consideration. Translators are welcome.

The demo installation of Engine Mail is shown in Figure 10.3. The script is called in one of three ways. When called by GET, the query string holds the name of the e-mail recipient. A link to mail for user morganm would be specified as <A HREF="/cgi-bin/engine_mail?morganm">E-mail to Mike Morgan</A>. When called by POST, the script expects to have been called from a form; it processes fields named "name," "reply-to," "subject," "message," "user," and "url." When called with an empty query string, the script puts up a query form, allowing the user to search for an e-mail address that matches a user's name.
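The three calling conventions amount to a small dispatch on the request method and the query string. The Python sketch below illustrates that pattern only; it is not Engine Mail's code, and the names choose_action and mail_link, as well as the action labels, are assumptions made for this example.

```python
from html import escape
from urllib.parse import quote


def choose_action(method, query_string=""):
    """Decide what a script with these three entry points should do."""
    if method.upper() == "POST":
        return "send"      # process the submitted form fields
    if query_string:
        return "form"      # show a mail form addressed to the named user
    return "search"        # show the address-lookup query form


def mail_link(user, label, script="/cgi-bin/engine_mail"):
    """Build the anchor that calls the script by GET for one recipient."""
    return f'<A HREF="{script}?{quote(user)}">{escape(label)}</A>'


print(choose_action("GET", "morganm"))
print(mail_link("morganm", "E-mail to Mike Morgan"))
```

In a real CGI program, the method and query string would come from the REQUEST_METHOD and QUERY_STRING environment variables.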

Figure 10.3: Engine Mail gives Web visitors e-mail access to a list of people.

The Back End: Integrating Mail Archives with the Web

Once a user has found a mailing list, that user may well want to look back through the archives to find an answer to a question. Indeed, this behavior is encouraged. Most list owners would rather not load up their lists with messages about topics that have already been discussed. They encourage users to visit the archives as well as Frequently Asked Questions (FAQ) lists so that messages are likely to break new ground and make good use of the time and talent represented by their subscribers.

Hypermail

Hypermail is the "grand old man" of archive searchers. It is typically set up to run in cron, the UNIX time-based background processor. During off-peak periods such as the middle of the night, the system administrator schedules large jobs to run so they won't interfere with day-to-day applications. Hypermail is usually set to read all the mail in a mailbox and update an archive file.

Hypermail works, and works well. However, it suffers from two shortcomings. First, it keeps two copies of each message. One is the original message, still in the mailbox. The other is the HTML file. While you can delete the file in the mailbox, that step is irrevocable. No one can later come back and use that file as the basis for, say, an FTP archive.

Second, Hypermail breaks the archive into time slices. The user selects a relevant quarter and then searches by subject or author within the quarter. While this level of search is welcome, it is less desirable than a search over the whole archive in one level.

WAIS and Its Kin

The Wide Area Information Server (WAIS) allows users to search large, distributed databases. The protocol that describes how users ask for these searches is given in ANSI standard Z39.50. The latest version of Z39.50 describes mechanisms for searching for binary files such as images as well as text, making WAIS a natural candidate for searching mailing list and UseNet archives.

WAIS began life running on massively parallel computers made by Thinking Machines, Inc. For many applications, searches can be completed in a reasonable time using conventional hardware. As is shown with other pieces of software in this section, the key to succeeding with large databases is to prepare very complete indexes ahead of time. WAIS's indexers are among the very best.

WAIS now comes in various flavors, from freeWAIS-sf, at http://ls6www.informatik.uni-dortmund.de/freeWAIS-sf/README-sf; to SWISH, at http://www.eit.com/software/swish/swish.html; to GLIMPSE, at http://glimpse.cs.arizona.edu:1994/glimpse.html. GLIMPSE is used as the basis for Jason Tibbitts's archiver, which is described later in this chapter. It is also closely related to agrep, the powerful runtime search engine used in HURL.

Chapter 16, "How to Index and Search the Information on Your Site," contains a more detailed description of WAIS in the context of indexing and searching a Web site.

Indexing UseNet and Mailing List Archives with HURL

Mailing lists and network news (known as UseNet) are generating new material at the rate of one full set of the Encyclopedia Britannica every day. The bad news is that it's as ephemeral as the TV news. For the most part, it is unindexed, unmoderated, and is not saved in any way that makes it readily available. Earlier, you saw that Majordomo archives are strictly time-based. If you know you are looking for a message that came through in March of 1994, you might find it in the LIST.9403 file. But if you are looking for the migration habits of green sea turtles, the archives don't do much good. Hypermail allows for larger "chunks," but it still requires that the user start by choosing a quarter in which to search.

More and more list owners and newsgroup moderators are realizing the long-term value of these articles and messages and are storing them away, hoping that someday, someone may find a way to tame all that information. A first cut at such an attempt has been made by Cameron Laird. Laird maintains a comprehensive list of all UseNet news archives at http://starbase.neosoft.com/~claird/news.lists/newsgroup_archives.html.

The Hypertext UseNet Reader and Linker (HURL) is the product of Gerald Oskoboiny and is a response to the need to make archives from UseNet as well as mailing lists available to a broader audience. HURL was originally designed to work with UseNet articles (which are defined by RFC 1036) but has since been extended to read Internet mail articles stored in the format defined by RFC 822. Central to HURL's design philosophy is the decision to keep the articles and messages in their original format. This decision means that the archives are still available by FTP and other means and are converted to HTML by CGI scripts on demand.

The Query Page

Unlike Hypermail, HURL is entirely query-driven. The user begins with a set of keywords, not a time frame. Figure 10.4 shows the HURL query screen from the HTML Writers Guild mail archives.

Figure 10.4: The HTML Writers Guild mailing list archive is based on HURL.

The Message List Browser

After the user submits a query, the search engine returns a list of messages that match the specified search criteria. The Message List browser splits this list into separate pages with links at the top and bottom of each page to scroll through the list.

For each message in the list, a single line is displayed listing the Date, Author, and Subject of the article, with a link from the Subject to retrieve the article itself. The current version of HURL uses <PRE></PRE> tags to align the contents of the page. A future version will use HTML 3.0-compliant tables.
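The paging and the fixed-width rows can be sketched briefly. The Python fragment below is an illustration of the approach described, not HURL's code; the column widths, the page size, and the helper names are assumptions.

```python
from html import escape


def paginate(messages, page_size=20):
    """Split the hit list into pages of at most page_size messages."""
    return [messages[i:i + page_size] for i in range(0, len(messages), page_size)]


def format_row(date, author, subject, url):
    """One line of the list: padded Date and Author columns, then the
    Subject as a link to the article. Meant to be shown inside <PRE>."""
    columns = escape(f"{date:<10} {author[:20]:<20} ")
    return f'{columns}<A HREF="{escape(url)}">{escape(subject)}</A>'


def render_page(rows):
    """Wrap the formatted rows in PRE tags so the columns stay aligned."""
    return "<PRE>\n" + "\n".join(rows) + "\n</PRE>"


pages = paginate(list(range(45)))
print([len(page) for page in pages])
print(render_page([format_row("96/02/15", "Pat Smith", "MS-Word Virus Warning",
                              "/cgi-bin/article?42")]))
```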

Figure 10.5 shows an example of the Message List browser.

Figure 10.5: The HURL Message List browser displays messages that match the search criteria.

The Article Page

Selecting an article from a message list produces an Article page for that article. The Article page (see Fig. 10.6) contains icons that link to other articles in the thread.

Figure 10.6: The HURL Article page.

Note that the message's headers have been handled intelligently. The To and CC lines are, of course, shown. The article's subject gets a link to a query for articles having the same subject, and the From line gets a link to an Author page for that author, which contains lists of that author's articles. The script also checks the In-Reply-To header for the article being answered; if that article is in the archive, the header is linked to it.
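The header handling described here might look something like the following sketch. It is not taken from HURL: the article.cgi URL, the archive index (a hash from message ID to file name), and the function names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Parse RFC 822 style headers (everything up to the first blank line).
# Continuation lines, which start with whitespace, are folded into the
# previous header. Header names are lowercased for lookup.
sub parse_headers {
    my ($message) = @_;
    my ($head) = split /\n\n/, $message, 2;
    my ( %header, $last );
    for my $line ( split /\n/, $head ) {
        if ( $line =~ /^\s+(.*)/ && defined $last ) {
            $header{$last} .= " $1";
        }
        elsif ( $line =~ /^([^\s:]+):\s*(.*)/ ) {
            $last = lc $1;
            $header{$last} = $2;
        }
    }
    return \%header;
}

# Return the In-Reply-To header as HTML, linked only if the message ID
# it names is present in the archive index.
sub link_in_reply_to {
    my ( $header, $index ) = @_;
    my $value = $header->{'in-reply-to'};
    return '' unless defined $value;
    my $shown = escape_html($value);
    if ( $value =~ /<([^>]+)>/ && exists $index->{$1} ) {
        return sprintf( '<A HREF="article.cgi?file=%s">%s</A>', $index->{$1}, $shown );
    }
    return $shown;
}

# Hypothetical article and archive index.
my $article = <<'END';
From: A. Writer <writer@example.org>
Subject: Re: Green sea turtles
In-Reply-To: <a1@example.org>

The body of the article starts here.
END

my %archive = ( 'a1@example.org' => 'msg.0001' );    # message ID -> file name
my $headers = parse_headers($article);
print "Subject: $headers->{subject}\n";
print 'In-Reply-To: ', link_in_reply_to( $headers, \%archive ), "\n";
```

The same pattern extends to the Subject and From lines: look the value up, and emit a link only when the archive has something to link to.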

Note that references in the article to e-mail addresses or message IDs are linked to the associated author (if he or she has a page in the archive). This feature is a nice touch in an already comprehensive package.

HURL is an example of dividing the workload between runtime (when the user is waiting for the result) and batch (typically, late at night when the system has excess capacity). During the late-night processing, HURL reviews the new messages that have come in during the day and builds an index and database of key message information. At runtime, HURL uses these data structures to select the messages that meet the search criteria, formats the Message List page, and then serves up the Article page upon request.
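The split between batch and runtime work can be sketched as two small routines sharing one DBM file. This is an illustration under stated assumptions, not HURL's own code: it uses the SDBM_File module bundled with Perl, a made-up record layout ("date|author|subject" keyed by message ID), and hypothetical names build_index and lookup.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;                  # DBM implementation bundled with Perl
use File::Temp qw(tempdir);

# Batch step (the nightly job): store "date|author|subject" for each
# new message, keyed by message ID.
sub build_index {
    my ( $dbname, $messages ) = @_;
    my %index;
    tie( %index, 'SDBM_File', $dbname, O_RDWR | O_CREAT, 0600 )
        or die "Cannot open $dbname: $!";
    for my $msg (@$messages) {
        $index{ $msg->{id} } = join( '|', @{$msg}{qw(date author subject)} );
    }
    untie %index;
    return;
}

# Runtime step (the CGI script): one keyed lookup, with no scan of the
# message files themselves.
sub lookup {
    my ( $dbname, $id ) = @_;
    my %index;
    tie( %index, 'SDBM_File', $dbname, O_RDONLY, 0600 )
        or die "Cannot open $dbname: $!";
    my $record = $index{$id};
    untie %index;
    return undef unless defined $record;
    my %info;
    @info{qw(date author subject)} = split /\|/, $record, 3;
    return \%info;
}

# Demonstration with hypothetical messages in a temporary directory.
my $dir = tempdir( CLEANUP => 1 );
my $db  = "$dir/hurl-demo";
build_index( $db, [
    { id => 'a1@example.org', date => '1996-03-01', author => 'A. Writer', subject => 'Green sea turtles' },
    { id => 'b2@example.org', date => '1996-03-02', author => 'B. Reader', subject => 'Re: Green sea turtles' },
] );
my $hit = lookup( $db, 'b2@example.org' );
print "$hit->{author}: $hit->{subject}\n";
```

All of the expensive work (reading and parsing each message) happens in build_index, when nobody is waiting. The runtime path touches only the index.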

Implementation Details

HURL databases are stored in DBM files using Perl. DBM files are a natural data type in Perl-they can be bound directly to associative arrays. This technique allowed Oskoboiny to write extremely readable and extremely fast code, like the following:

# load the database during the nightly build process
dbmopen( %DBFILE, "dbfile", 0600 ) || die "Cannot open dbfile: $!";
$DBFILE{'Subject'} = $subject;
$DBFILE{'Author'} = $author;
dbmclose( %DBFILE );
.
.
.
# read the values back at runtime, using the same keys
dbmopen( %DBFILE, "dbfile", 0600 ) || die "Cannot open dbfile: $!";
$subject = $DBFILE{'Subject'};
$author = $DBFILE{'Author'};
dbmclose( %DBFILE );

Computer scientists describe the cost of an operation using Big-O notation. The Big-O measure of time for accessing a data structure says how the time to look something up grows as a function of the number of items in the database. DBM files mapped to associative arrays use a data structure called a hash table for implementation. Hash tables are among the fastest known lookup mechanisms. On average, they have O(1), or order 1, lookup time; that means that it takes about the same amount of time to look something up in a database of 100,000,000 entries as it does to look things up in a database of 10 entries. The decision to concentrate on the efficiency of the most-used page in the system represents a good CGI design approach.
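To make the difference concrete, the following sketch counts the comparisons needed by a simple linear scan and contrasts that with a hash lookup. It uses an ordinary in-memory Perl hash rather than a DBM file, and the message IDs are invented for the example.

```perl
use strict;
use warnings;

# Linear search: returns (position, comparisons made).
# The work grows in step with the size of the list, which is O(n).
sub linear_find {
    my ( $list, $wanted ) = @_;
    my $comparisons = 0;
    for my $i ( 0 .. $#$list ) {
        $comparisons++;
        return ( $i, $comparisons ) if $list->[$i] eq $wanted;
    }
    return ( -1, $comparisons );
}

for my $size ( 10, 100_000 ) {
    my @ids = map { "msg$_" } 1 .. $size;

    # Hash from message ID to position in the list.
    my %position;
    @position{@ids} = 0 .. $#ids;

    # Worst case for the scan: the entry we want is the last one.
    my $wanted = "msg$size";
    my ( $found, $comparisons ) = linear_find( \@ids, $wanted );
    printf "%7d entries: scan made %d comparisons to reach position %d; the hash gave position %d in one step\n",
        $size, $comparisons, $found, $position{$wanted};
}
```

The scan's cost rises from 10 comparisons to 100,000 as the list grows, while the hash lookup remains a single keyed access in both cases.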

Oskoboiny also took special pains to get the queries right. It would have been tempting to build a form that built a query string out of fields and check boxes (see Fig. 10.7). Instead, Oskoboiny accepts a general query string and parses out the Boolean operators.

Figure 10.7: Queries done wrong.

For HURL's query system, Oskoboiny needed a fast utility to search text files (the articles and messages). Instead of building one from scratch, he turned to an off-the-Net utility called agrep. This utility, patterned on the UNIX standard tool grep, was written by Sun Wu and Udi Manber of the University of Arizona. It is one of the faster members of the grep family and is unique in its ability to conduct "approximate" searches. You can say

agrep -3 security messages

and agrep will find matches in the file messages to the word security, as well as securities, securaty, and secuity. In fact, it will find any word that matches the original word with no more than, in this case, three errors, where an error is the insertion, deletion, or substitution of a single character.
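The notion of "number of errors" can be illustrated with the classic edit-distance (Levenshtein) calculation. The sketch below is only an illustration of the measure; it is not agrep's algorithm, which is considerably faster and matches approximate substrings within each record rather than comparing whole words.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance: the minimum number of single-character
# insertions, deletions, and substitutions needed to turn $s into $t.
sub edit_distance {
    my ( $s, $t ) = @_;
    my @prev = ( 0 .. length($t) );    # distances from the empty prefix of $s
    for my $i ( 1 .. length($s) ) {
        my @cur = ($i);
        for my $j ( 1 .. length($t) ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,              # deletion
                $cur[ $j - 1 ] + 1,         # insertion
                $prev[ $j - 1 ] + $cost,    # substitution (or match)
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Each of these words is within three errors of "security".
for my $word (qw(security securities securaty secuity)) {
    printf "%-10s %d\n", $word, edit_distance( 'security', $word );
}
```

A word is an approximate match at level 3 when its distance from the pattern is 3 or less.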

In addition, agrep is record-oriented rather than line-oriented. The record delimiter defaults to a newline; to search a multiline message file, just define a new record delimiter. For example, the command

agrep -d '^From ' 'Win96' mbox 

searches the file mbox for occurrences of the string "Win96". When it finds one, it outputs the entire message (as delimited by the string "From " at the beginning of a line).
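The effect of a record delimiter can be imitated in Perl to show what "record-oriented" means. This sketch is an assumption-laden stand-in for agrep, not a replacement: it does exact substring matching only, the function name grep_records is made up, and it ignores the mbox convention of escaping body lines that begin with "From ".

```perl
use strict;
use warnings;

# Split mbox-style text into records, each starting at a line that
# begins with "From ", and return the records that contain $pattern.
sub grep_records {
    my ( $text, $pattern ) = @_;
    my @records = split /^(?=From )/m, $text;
    return grep { index( $_, $pattern ) >= 0 } @records;
}

# Hypothetical two-message mailbox.
my $mbox = <<'END';
From alice@example.org Fri Mar  1 10:00:00 1996
Subject: Upgrade plans

We are waiting for Win96.
From bob@example.org Sat Mar  2 11:00:00 1996
Subject: Lunch

Nothing about operating systems here.
END

# Print each matching message in full, not just the matching line.
print "---\n$_" for grep_records( $mbox, 'Win96' );
```

A line-oriented tool would report only the line "We are waiting for Win96."; treating the message as the record returns the headers along with it, which is what an archive browser needs.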

agrep already has built-in Boolean operators. The string "Win95,Win96" matches records with either "Win95" or "Win96" in them. The string "Win95;Win96" matches only those records with both "Win95" and "Win96" in the record.

By passing the query string to agrep, Oskoboiny was able to build a powerful pattern-matcher into HURL, without reinventing all the complexity of agrep.
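One way to hand a query to agrep is sketched below. The text does not say what syntax HURL accepts from the user, so the AND/OR keywords here are purely an assumption, as are the function names; only the ";" and "," operators and the -d option come from the agrep description above. The sketch runs agrep without a shell so that characters in the user's query cannot be interpreted as shell commands.

```perl
use strict;
use warnings;

# Translate a query such as "Win95 AND Win96" into agrep's pattern
# syntax, where ";" means AND and "," means OR. Mixing AND and OR in
# one query is rejected to keep the sketch simple.
sub to_agrep_pattern {
    my ($query) = @_;
    $query =~ s/^\s+|\s+$//g;
    my @parts = split /\s+(AND|OR)\s+/i, $query;

    # split returns terms at even positions, operators at odd ones.
    my ( @terms, %ops );
    for my $i ( 0 .. $#parts ) {
        if ( $i % 2 ) { $ops{ uc $parts[$i] } = 1 }
        else          { push @terms, $parts[$i] }
    }
    die "Mixed AND/OR queries are not supported in this sketch\n"
        if keys(%ops) > 1;
    die "Search terms may not contain ';' or ','\n"
        if grep { /[;,]/ } @terms;

    my $joiner = $ops{AND} ? ';' : ',';
    return join( $joiner, @terms );
}

# Run agrep on one file and return everything it prints. Assumes an
# agrep binary is installed and on the PATH. The list form of open
# bypasses the shell entirely.
sub search_archive {
    my ( $pattern, $file ) = @_;
    open( my $out, '-|', 'agrep', '-d', '^From ', $pattern, $file )
        or die "Cannot run agrep: $!";
    local $/;    # read all of the output at once
    my $result = <$out>;
    close $out;
    return $result;
}

for my $query ( 'Win95 AND Win96', 'Win95 OR Win96' ) {
    printf "%-16s => %s\n", $query, to_agrep_pattern($query);
}
```

The demonstration prints only the translated patterns; calling search_archive requires agrep and a message file. A production script would also need to handle a pattern that begins with a hyphen, which agrep would otherwise read as an option.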

agrep is available from the authors at http://glimpse.cs.arizona.edu:1994/.

Recall that Chapter 9 ("Making a User's Life Simpler with Multipart Forms") describes how to pass state between the pages of a multipart form. HURL is a different kind of multipart CGI script, but it still needs to preserve state. Visit the HTML Writers Guild archives at http://www.hwg.org/lists/archives.html and watch the URL. You will see characters like ?jiagvyfcn&pos=101 being passed along. That is the state information, passed in the GET query string.

The query processor generates a random string of characters (in this case, jiagvyfcn) and uses this string to name the file in which it writes its query results. The Message List browser starts at the top of this file (pos=0) and walks through the file, a page at a time. At any time, the user can select a line of the file, and the Message List browser pulls the message ID from the file, uses it to index the associative array, and fetches the file name and link information of the selected message.
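The results-file scheme can be sketched as follows. The details are assumptions made for the example rather than HURL's own: a nine-letter token, a page length of 100 lines, one result per line, and the names new_token, save_results, and read_page.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $PAGE_SIZE = 100;    # assumed number of lines per page

# Make a random lowercase name for the results file. Note that rand()
# is not cryptographically strong, so tokens should not guard secrets.
sub new_token {
    my @letters = ( 'a' .. 'z' );
    return join( '', map { $letters[ int rand @letters ] } 1 .. 9 );
}

# Write one line per matching message and return the token that names
# the file.
sub save_results {
    my ( $dir, $lines ) = @_;
    my $token = new_token();
    open( my $fh, '>', "$dir/$token" ) or die "Cannot write results: $!";
    print {$fh} "$_\n" for @$lines;
    close $fh or die "Cannot close results: $!";
    return $token;
}

# Return the page of results that starts at line $pos (counting from 0).
sub read_page {
    my ( $dir, $token, $pos ) = @_;

    # Never trust values from the URL: the token becomes a file name.
    die "Bad token\n"    unless $token =~ /^[a-z]{9}\z/;
    die "Bad position\n" unless $pos =~ /^\d+\z/;

    open( my $fh, '<', "$dir/$token" ) or die "No such query: $!";
    my @lines = <$fh>;
    close $fh;
    chomp @lines;

    return () if $pos > $#lines;
    my $last = $pos + $PAGE_SIZE - 1;
    $last = $#lines if $last > $#lines;
    return @lines[ $pos .. $last ];
}

# Demonstration: 250 hypothetical results, then fetch the third page.
my $dir = tempdir( CLEANUP => 1 );
my @results;
push @results, "<msg$_\@example.org>\tSubject $_" for 1 .. 250;
my $token = save_results( $dir, \@results );
my @page  = read_page( $dir, $token, 200 );
printf "?%s&pos=200 shows %d lines\n", $token, scalar @page;
```

Because the token arrives in the URL, the check in read_page is essential: without it, a visitor could substitute a path of his or her own choosing for the token.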

The preceding design also allows on-the-fly query construction from other pages. The query processor handles both POST and GET requests. If the request is a POST, it looks to STDIN to read the query from the form. If the request is sent by GET, it looks to the query string for something like

?Subject=something.interesting

Whatever it finds there is massaged into the multiple variable form used with POST. From there, the script proceeds just as it would have if the query had come in from the form.
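A minimal sketch of this dual handling follows. It is not HURL's code: the function names are invented, and the environment and input handle are passed in as arguments so that the example can run outside a Web server. In a real CGI script, the call would be read_request( \%ENV, \*STDIN ).

```perl
use strict;
use warnings;

# Decode "name=value&name2=value2" into a hash reference.
sub parse_query {
    my ($raw) = @_;
    my %field;
    for my $pair ( split /[&;]/, $raw ) {
        my ( $name, $value ) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value = '' unless defined $value;
        for ( $name, $value ) {
            tr/+/ /;                                  # "+" encodes a space
            s/%([0-9A-Fa-f]{2})/chr( hex $1 )/ge;     # %XX encodes one byte
        }
        $field{$name} = $value;
    }
    return \%field;
}

# Fetch the raw query from wherever the request method put it, then
# decode it the same way in both cases.
sub read_request {
    my ( $env, $stdin ) = @_;
    my $method = uc( $env->{REQUEST_METHOD} || 'GET' );
    if ( $method eq 'POST' ) {
        my $length = $env->{CONTENT_LENGTH} || 0;
        my $body   = '';
        read( $stdin, $body, $length ) if $length > 0;
        return parse_query($body);
    }
    my $query = defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : '';
    return parse_query($query);
}

# A GET request, as from a link on another page.
my $from_get = read_request(
    { REQUEST_METHOD => 'GET', QUERY_STRING => 'Subject=something.interesting' } );
print "GET:  $from_get->{Subject}\n";

# A POST request, simulated with an in-memory file handle.
my $form = 'Subject=green+sea+turtles&Author=A.%20Writer';
open( my $fake_stdin, '<', \$form ) or die "Cannot open in-memory handle: $!";
my $from_post = read_request(
    { REQUEST_METHOD => 'POST', CONTENT_LENGTH => length($form) }, $fake_stdin );
print "POST: $from_post->{Subject} / $from_post->{Author}\n";
```

Once both paths produce the same hash, the rest of the script never needs to know which method delivered the query.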

Handling Multiple Browsers: A Real-World Solution

Chapter 3, "Deciding What to Do About Netscape," described how to build pages that look good with any browser. HURL makes some concessions to the variety of browsers. For example, a message line can easily grow beyond 80 characters, which is not a problem for graphical browsers but is ugly when the browser wraps long lines (as Lynx does). Oskoboiny's solution was to check the HTTP_USER_AGENT CGI variable. If the browser is Lynx, HURL tightens the message line somewhat and truncates the subject line.
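The browser check might be sketched like this. The browser's User-Agent header reaches a CGI script in the HTTP_USER_AGENT environment variable; the widths of 40 and 70 characters and the function names are illustrative assumptions, not HURL's actual values.

```perl
use strict;
use warnings;

# Pick a subject width from the browser's User-Agent string.
sub subject_width {
    my ($user_agent) = @_;
    return 40 if defined($user_agent) && $user_agent =~ /^Lynx/i;
    return 70;
}

# Shorten a subject to fit the width, marking the cut with "...".
sub fit_subject {
    my ( $subject, $width ) = @_;
    return $subject if length($subject) <= $width;
    return substr( $subject, 0, $width - 3 ) . '...';
}

# In a CGI script the Web server sets HTTP_USER_AGENT; when this sketch
# is run from the command line it is normally undefined, so the wider
# layout is used.
my $agent   = $ENV{HTTP_USER_AGENT};
my $subject = 'Re: A very long subject line about the migration habits of green sea turtles';
print fit_subject( $subject, subject_width($agent) ), "\n";
```

Treating an unknown or missing User-Agent as a graphical browser keeps the default output unchanged; only browsers that are known to need it get the tightened layout.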

Other Back Ends

While Hypermail, WAIS, and HURL are among the best archivers available, they are not alone.

UseNet-Web

UseNet-Web is an interface to UseNet articles. (Version 1.0.3 will also support mailing lists.) There is no real search capability-the archives are organized by month and day. For more information, see the demo and description at http://www.netimages.com/~snowhare/utilities/usenet-web/.

MHonArc

MHonArc is similar to Hypermail, but MHonArc handles MIME attachments. Attached pictures show up in the HTML as images. MHonArc is available at http://www.oac.uci.edu/indiv/ehood/mhonarc.html. The demo page, http://www.oac.uci.edu/indiv/ehood/mhaeg/maillist.html, is shown in Figure 10.8.

Figure 10.8: MHonArc archive of comp.infosystems.www.authoring.cgi.

MHonArc takes the opposite approach of Hypermail. Recall that Hypermail does all of its processing in batch mode. HURL preprocesses the files to build a database but completes the query processing at runtime. MHonArc does all processing at runtime. This approach is acceptable on small archives. As the files grow, so does the time required to access them. At some point, most mailing-list archives will outgrow MHonArc.

The Tibbitts Archive Manager

Jason L. Tibbitts III <tibbs@hpc.uh.edu> reports that he is developing a list archive manager. It has full GLIMPSE indexing; eventually Tibbitts intends to add a link to MailServ. His work-in-process is at http://www.hpc.uh.edu/type-o/, and is shown in Figure 10.9.

Figure 10.9: The Tibbitts archive manager user interface provides a variety of options.

For more information on ListProc 7.0 (the commercial version of ListProc), visit http://www.cren.net/. The revised LISTSERV is available from LSoft; for more information visit http://www.lfsoft.com/. Although LISTSERV's roots are on IBM mainframes and DEC VAXen, LSoft ships UNIX, NT, and Win 95 versions of the product, which are reported to be quite solid.

The mailing lists LSTSRV-L and LSTOWN-L both cover aspects of LISTSERV. LSTSRV-L is hosted on UGA.CC.UGA.EDU. LSTOWN-L is hosted on SEARN.SUNET.SE. Majordomo is discussed on the majordomo-users mailing list; send a subscription request to majordomo@GreatCircle.com. For general list-management discussion, join the List-Managers list, which is also served by majordomo@GreatCircle.com.