Chapter 24

User Profiles and Tracking



Many sites are collecting information about users who visit their site. Although the nature of the Web is such that the user's name or personal information is not transferred to the Web site, there are several ways to collect, store, and use such information.

The most subtle way to collect user data is with the cookie mechanism introduced by Netscape Communications. This mechanism has sparked a wave of controversy from some users who perceive that their hard disk is being used to store data about them without their consent.

Why Maintain a User Profile?

Keeping information about people who visit a site is controversial. This chapter describes how to do it ethically and what pitfalls to watch for so as not to give even the appearance of misusing the information.

Improved Customer Service

Imagine going to a doctor and having him or her take a complete medical history. That sounds reasonable; we get better medical care when the doctor knows something about us. Now imagine going back for the next visit and having to fill out all those forms again.

Whether we're taking the children to the doctor, the dog to the veterinarian, or the car to the mechanic, we expect service providers to remember who we are. For many Web sites, the same expectation applies.

Improved Marketing

Most Web sites are essentially one-way affairs. A user visits, looks, reads, and leaves. Many users like it that way. Some users are willing to leave information as well. Some sites report that about two percent of all users take the time to fill out a form asking for more product information if the form is associated with a bonus or incentive of some kind.

One use of the information from user profiles is to build one or more focused mailing lists. Mailing lists do not always enjoy a good reputation because they are used to distribute junk mail.

A good use of user profiles is to make sure that information goes only to recipients who have expressed an interest in the subject, and then to show these recipients how to get off the list quickly and easily if they wish.

Renting the Mailing List

Most magazine subscribers understand that their name may be put on a mailing list that is rented out by the publisher or sold outright to a list manager. Many publishers give subscribers a place on the invoice to check if they do not want to allow their name on the list. With these safeguards in place, most people seem to be fairly comfortable subscribing.

The same safeguards and understanding have not found their way onto the Web. If a site owner collects personal information about people who visit the site and uses it in a manner unrelated to the site itself, they can expect to get flamed. If the use borders on spamming (widespread e-mail unrelated to the initial topic), the intensity of the flames may be so overwhelming that the e-mail server gets saturated, forcing the service provider to terminate the site owner's account.

Note
Some Internet services, such as UseNet and mailing lists, are public forums in which members post messages generally based on a single topic, such as running a Web site. Off-topic postings are frowned upon, and some members respond to them with angry public criticism (a practice known as flaming).
While people are often flamed for violating the rules of Netiquette, many people consider flaming itself to be impolite. Nevertheless, it is not uncommon for long "flame wars" to rage between two or more opposing sides in a discussion group.

How to Maintain a User Profile

This section describes the technical aspects of setting up a user profile. The next section, "How Not to Maintain a User Profile," addresses the ethical and social aspects of user profiles.

User Registration

One simple way to collect user information is to ask. Figure 24.1 shows a form that can be used for this purpose. The script that processes this form, shown in Listing 24.1, looks at the user's name and the user's REMOTE_HOST.

Figure 24.1: The easiest way to collect user information is to ask.


Listing 24.1  lookup.cgi: Using a "Unique" User ID

#!/usr/local/bin/perl
# lookup.cgi

require "html.cgi";

# make sure arguments are passed using the POST method
if ($ENV{'REQUEST_METHOD'} eq 'POST' )
{
  # using POST; look to STDIN for fields
  read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});

  # split the name/value pairs on '&'
  @pairs = split(/&/, $buffer);
  foreach $pair (@pairs)
  {
    ($name, $value) = split(/=/, $pair);
    $value =~ tr/+/ /;
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    $FORM{$name} = $value;
  }

  if ($FORM{email} !~ /^[a-zA-Z0-9_\-+ \t\/@%.]+$/ && $FORM{email} !~/^$/)
  {
    &html_header("Illegal Email Address");
    print "<HR><P>\n";
    print "The Email address you entered ($FORM{email}) contains illegal ";
    print "characters. Please back up, correct, then resubmit.\n";
    &html_trailer;
    exit;
  }
  
  # Now the real work begins
  # First, open the database
  # dbmopen and dbmclose take the hash itself, so the name needs its %
  dbmopen (%visitors, "visitors", 0666) || &die("Cannot open visitors file\n");

  $key = $ENV{'REMOTE_HOST'} ."\t". $FORM{'lastName'} ."\t". $FORM{'firstName'};
  $remainder = $visitors{$key};
  # compare as strings; == would compare the values numerically
  if (!defined($remainder) || $remainder eq "")
  {
    # reserve the key for this visitor
    $visitors{$key} = "";

    # newVisitor is responsible for collecting additional
    # information such as the address and phone number
    # and writing it into the DBM file.
    # The Location header must be printed on a single line.
    print "Location: /cgi-bin/dse/Chap24/newVisitor.cgi?$FORM{firstName}+$FORM{lastName}\n\n";
  }
  else
  {
    print "Location: /cgi-bin/dse/Chap24/oldVisitor.cgi?$FORM{firstName}+$FORM{lastName}\n\n";
  }
  dbmclose(%visitors);
  exit;  
}
else
{
&die("Not started via POST\n");
}

The first time users visit the site, the script collects their contact information. If a user has been to the site before, the script pulls up the information and proceeds.

The one problem with this script is that there is no easy way to lock a database management (DBM) file. This means there is a small chance of corruption if two copies of the script update the visitors file at the same moment.

If the site is very busy, the Webmaster may want to use a full database such as those described in Chapter 18, "How to Query Databases." You could also use the locking mechanisms described in Chapter 27, "Multipage Shopping Environment."
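One common way to reduce that risk is to hold an advisory lock on a separate lock file while the DBM file is open. The following is a minimal sketch, not the mechanism from Chapter 27: the helper name with_lock and the file name visitors.lock are assumptions made for this example, and the lock protects the data only if every script that touches the file uses the same lock file.

```perl
#!/usr/bin/perl
use Fcntl qw(:flock);
use File::Spec;

# Sketch: run a piece of code while holding an exclusive advisory lock.
# with_lock and "visitors.lock" are hypothetical names for illustration.
sub with_lock
{
  my ($lockfile, $action) = @_;
  open(my $lock, ">>", $lockfile) || die "Cannot open $lockfile: $!";
  # flock blocks until no other process holds the lock
  flock($lock, LOCK_EX) || die "Cannot lock $lockfile: $!";
  my $result = $action->();
  close($lock);    # closing the handle releases the lock
  return $result;
}

my $lockfile = File::Spec->catfile(File::Spec->tmpdir(), "visitors.lock");
my $result = with_lock($lockfile, sub {
  # The dbmopen ... dbmclose sequence from Listing 24.1 would go here.
  return "done";
});
print "Work finished under lock: $result\n";
```

Because the lock is taken on a separate file, the approach does not depend on how the DBM library lays out its own files.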

Cookies

Netscape and several other browsers make it possible for the Web server to store information on the client's machine. The technology is called cookies.

Note
The term cookie is sometimes used to mean generic methods of state preservation. To keep this distinction clear, this book calls Netscape's method Netscape cookies. Netscape cookies are now supported in over a dozen browsers, not just in products from Netscape.

Cookies-Plain Vanilla

Netscape cookies are introduced in Chapter 9, "Making a User's Life Simpler with Multipart Forms," and described more fully in Chapter 20, "Preserving Data." This section shows how to use Netscape cookies to implement a user profile.

The Set-Cookie: response header includes five attributes: the cookie's NAME=VALUE pair, expires, path, domain, and secure.

To use a Netscape cookie as the basis for a user profile, set the expires field. The expires field requires that the date be specified in a precisely defined format, based on RFCs 822, 850, and 1036:

Wdy, DD-Mon-YY HH:MM:SS GMT

The only legal time zone is Greenwich Mean Time (GMT). For example, if the server is on the U.S. East Coast, then local time is GMT minus 5 hours during standard time and GMT minus 4 hours during daylight-saving time. Thus, a typical entry for an East Coast Monday afternoon in April (4:43 p.m. EDT) might be

Monday, 29-Apr-96 20:43:34 GMT

Tip
For a quick check of how your time zone compares with GMT, look at the out box on your e-mail. (See Fig. 24.2.) The last field of the date and time shows how many hours' difference there is between local time and GMT.
Remember to take daylight-saving time into account, if applicable. For example, in April, the U.S. East Coast is on Eastern Daylight Time (EDT). 12:58 EDT is equivalent to 11:58 eastern standard time (EST), which is 16:58 GMT.

Figure 24.2: The e-mail out box gives a quick check of GMT.

It's often useful to be able to get the time, in GMT, inside a script. Listing 24.2 shows a short program that prints out the GMT, regardless of the local time. Use it as the basis for providing GMT to your own programs.

Caution
Be sure the server's clock is set correctly. With most versions of UNIX, you can set the time zone to one that "knows about" daylight-saving time so that the local time switches from standard time to daylight-saving time as appropriate. Figures 24.3 and 24.4 show this process for Advanced Interactive Executive (AIX), IBM's version of UNIX.

Figure 24.3: IBM's version of UNIX asks the administrator if the site goes on daylight-saving time.

Figure 24.4: If the administrator selects daylight-saving time, a list of all the time zones that observe DST appears.


Listing 24.2  gmtime.pl: A Quick Perl Script to Report GMT

#!/usr/bin/perl

# gmtime returns the broken-down time in GMT, whatever the local time zone
($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) =
   gmtime(time);

printf ("GMT is %02d:%02d:%02d\n", $hour,$min,$sec);
exit;
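The same gmtime call can produce a complete expires date. The following sketch assumes the two-digit-year format shown earlier in this section; cookie_date is a hypothetical helper name, not part of the chapter's html.cgi.

```perl
#!/usr/bin/perl

# Sketch: format a time value as a cookie expires date in GMT.
# cookie_date is a hypothetical helper name used for illustration.
sub cookie_date
{
  my ($when) = @_;
  my @days   = qw(Sunday Monday Tuesday Wednesday Thursday Friday Saturday);
  my @months = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
  my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($when);
  # gmtime counts years from 1900, so keep only the last two digits
  return sprintf("%s, %02d-%s-%02d %02d:%02d:%02d GMT",
                 $days[$wday], $mday, $months[$mon], $year % 100,
                 $hour, $min, $sec);
}

# 830810614 seconds after the epoch is the Monday afternoon used
# as the example earlier in this section.
print cookie_date(830810614), "\n";

# A date one month (31 days) from now, suitable for a new cookie.
print cookie_date(time() + 31 * 24 * 3600), "\n";
```

The first line printed is Monday, 29-Apr-96 20:43:34 GMT; the second depends on when the script runs.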

Caution
A bug in Netscape Navigator version 1.1 and earlier causes cookies with an expires attribute to be stored incorrectly if they have a path attribute that is not set explicitly to /. As long as users are still using Netscape Navigator version 1.1 or earlier, set path=/ whenever expires is set.
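One way to follow this advice consistently is to build every Set-Cookie header in one place. The sketch below is an illustration only: set_cookie_header is a hypothetical helper, and it assumes the caller supplies an expires date that is already in the GMT format described earlier.

```perl
#!/usr/bin/perl

# Sketch: assemble a Set-Cookie header line.
# set_cookie_header is a hypothetical helper name.
sub set_cookie_header
{
  my ($name, $value, $expires, $domain) = @_;
  my $header = "Set-Cookie: $name=$value";
  if (defined $expires && $expires ne "")
  {
    # path=/ always accompanies expires, as the Caution advises
    $header .= "; expires=$expires; path=/";
  }
  if (defined $domain && $domain ne "")
  {
    $header .= "; domain=$domain";
  }
  return $header;
}

# A persistent cookie for the whole dse.com domain
print set_cookie_header("userID", 1234,
                        "Monday, 29-Apr-96 20:43:34 GMT", ".dse.com"), "\n";

# A session cookie: no expires, so no path is forced
print set_cookie_header("userID", 1234), "\n";
```

In a CGI script the header is printed before the blank line that ends the response headers, and the whole header must stay on one line.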

Short Cookies

One downside of Netscape cookies is that sending all of a user's information back and forth can use significant bandwidth and disk space, and the user is paying for both of those resources. Netscape's specification suggests that each client support a minimum of 300 total cookies, with 20 cookies per server or domain, and 4K per cookie. Depending on how much information is in the user profile, and whether it is all put in one cookie or is split across several, the program could run up against one or more of these limits.

A different approach is to store a short ID number, sometimes called a "short cookie," in a user's cookie file and use that ID to index a record on the server's hard disk.

Listing 24.3 shows how to implement short cookies. The code in Listing 24.3 relies upon the version of html.cgi shown in Listing 24.4.


Listing 24.3  step1.cgi: Using Short Cookies to Index a DBM File

#!/usr/bin/perl

require "html.cgi";

open (PROFILE, "users.txt") || &die ("Cannot open user database\n");
  
$foundIt = 0;

# Let the user enter the site. If the user has not been here before, issue him a cookie with the user ID.
# Whether he's been here before or not, he leaves this subroutine with a valid $userID.
&html_header("Login");

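while (<PROFILE>)
{
  chomp;
  # users.txt is colon-delimited; the field order matches Listing 24.5
  ($storeduserID, $storeduserFirstName, $storeduserLastName, $storeduserAddress,
   $storeduserHomePhone, $storeduserWorkPhone, $storeduserLive,
   $storeduserDownPaymentandClosingCost, $storeduserInitialRate,
   $storeduserTypeofLoan, $storeduserMonthlyIncome, $storeduserMonthlyObligations,
   $storeduserStay, $storeduserCondoFees, $storedusereveryyear,
   $storeduserborrow) = split (':', $_);
  if ($storeduserID eq $userID)
  {
    print "<FORM METHOD=\"POST\" ACTION=\"http://www.dse.com/FirstJefferson/\">\n";
    print "userID is $storeduserID<BR>\n";
    print "Would you like your monthly payment to adjust periodically (possibly higher, possibly lower)?<BR>\n";
    # quote the attribute values so that stored answers containing spaces survive
    print "<INPUT TYPE=Text NAME=everyyear VALUE=\"$storedusereveryyear\"><BR>\n";
    print "How long do you expect to stay in this home?<BR>\n";
    print "<INPUT TYPE=Text NAME=Stay VALUE=\"$storeduserStay\"><BR>\n";
    print "How much do you plan to borrow?<BR>\n";
    print "<INPUT TYPE=Text NAME=borrow VALUE=\"$storeduserborrow\">\n";
    print "<P>\n";
    print "<INPUT TYPE=Submit VALUE=\"Next Step\">\n";
    print "<INPUT TYPE=Reset VALUE=\"Clear Form\">\n";
    print "</FORM>\n";
    $foundIt = 1;
    last;
  }
}
close(PROFILE);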

# if we did not find him in the user file...
if (!$foundIt)
  { 
    # step2.cgi and its successors are responsible for writing
    # profile information to the users.txt file in colon-delimited
    # format

    print "<FORM METHOD=\"POST\" ACTION=\"http://www.dse.com/cgi-bin/dse/FirstJefferson/step2.cgi\">\n";
    print "Last Name<BR><INPUT TYPE=Text name=LastName size=44><BR>\n";
    print "First Name<BR><INPUT TYPE=Text name=FirstName size=44><BR>\n";
    print "<INPUT TYPE=Submit VALUE=\"Log in\">\n"; 
    print "<INPUT TYPE=Reset VALUE=Clear><BR>\n";
    print "</FORM>\n";
  } 

&html_trailer; 
1;
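The lookup at the heart of Listing 24.3 can also be isolated in a subroutine. The following is a minimal sketch that assumes the same colon-delimited layout with the user ID in the first field; find_profile is a hypothetical name, and the two sample records are invented purely for the demonstration.

```perl
#!/usr/bin/perl
use File::Temp qw(tempfile);

# Sketch: return the fields of the record whose first field matches
# the short cookie, or an empty list if there is no such record.
# find_profile is a hypothetical helper name.
sub find_profile
{
  my ($file, $userID) = @_;
  open(my $fh, "<", $file) || return;
  while (my $line = <$fh>)
  {
    chomp $line;
    # a limit of -1 keeps trailing empty fields
    my @fields = split(/:/, $line, -1);
    if (@fields && $fields[0] eq $userID)
    {
      close($fh);
      return @fields;
    }
  }
  close($fh);
  return;
}

# Build a small sample file (invented data) and look up user 18.
my ($out, $path) = tempfile(UNLINK => 1);
print $out "17:Pat:Example:1 Main St:555-0100:555-0101\n";
print $out "18:Lee:Sample:2 Oak Ave:555-0102:555-0103\n";
close($out);

my @record = find_profile($path, "18");
print "Found $record[1] $record[2]\n" if @record;
```

A sequential scan like this is adequate for a small user file; a busy site would likely move the records into a DBM file or a full database keyed on the ID.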


Listing 24.4  html.cgi: An Adaptation That Stores User IDs in Netscape Cookies

# ===============================================================
# This subroutine takes a single input parameter and uses it as
# the <TITLE> and the first-level header.
# ===============================================================

sub html_header
{
  $document_title =$_[0];

  require "counter.cgi";

  $theCookie = $ENV{'HTTP_COOKIE'};
  if ($theCookie =~ /userID/)
  {
    @cookies = split (/; /, $theCookie);
    foreach (@cookies)
    {
     ($name, $value) = split(/=/, $_);
     last if ($name eq 'userID');
    }
    $userID = $value;
    print "Content-type: text/html\n\n";
  }
  else
  {
    $theDomainName = ".dse.com";
    $userID = &counter;
    print "Content-type: text/html\n";
    $aMonthFromNow = time() + 3600 * 24 * 31;
    ($sec, $min, $hour, $mday, $mon, $year, 
           $wday, $yday, $isdst) =
           gmtime($aMonthFromNow);
    $month = ('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct',
           'Nov','Dec')[$mon];
    $weekday = ('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday')[$wday];
    # The whole header must be printed on one line. The date needs a comma
    # after the weekday and the GMT suffix. $year counts from 1900, so keep
    # only two digits. path=/ follows the Caution about Navigator 1.1.
    printf "Set-Cookie: userID=%s; expires=%s, %02d-%s-%02d %02d:%02d:%02d GMT; path=/; domain=%s\n\n",
      $userID, $weekday, $mday, $month, $year % 100,
      $hour, $min, $sec, $theDomainName;
  }
  print "<HTML>\n";
  print "<HEAD>\n";
  print "<TITLE>$document_title</TITLE>\n";
  print "</HEAD>\n";
  print "<BODY BGCOLOR=\"#FFFFFF\" TEXT=\"#000000\" LINK=\"#CC0000\" VLINK=\"#663366\" ALINK=\"#333366\">\n";
  print "<H2>$document_title</H2>\n";
  print "<P>\n";

}

sub html_trailer
{
  print "</BODY>\n";
  print "</HTML>\n";
}
 
sub die
{
  print "Content-type: text/html\n\n";
  print "<HTML>\n";
  print "<HEAD>\n";
  print "<TITLE>Error</TITLE>\n";
  print "</HEAD>\n";
  print "<BODY BGCOLOR=\"#FFFFFF\" TEXT=\"#000000\" LINK=\"#CC0000\" VLINK=\"#663366\" ALINK=\"#333366\">\n";
  print "<H2>An Error has occurred</H2>\n";
  print @_;
  print "\n";
  print "</BODY>\n";
  print "</HTML>\n";
  # the built-in die logs to STDERR and stops the script
  die;
} 

# a file loaded with require must end with a true value
1;
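Listing 24.4 relies on a counter subroutine from counter.cgi, which is not shown in this chapter. The following is only a sketch of how such a routine might hand out unique IDs: the name next_id, the one-number-per-file layout, and the use of an exclusive lock are all assumptions made for this example.

```perl
#!/usr/bin/perl
use Fcntl qw(:DEFAULT :flock);
use File::Temp qw(tempfile);

# Sketch: read a number from a file, add one, write it back, and
# return it. The lock keeps two visitors from receiving the same ID.
# next_id is a hypothetical stand-in for the chapter's counter routine.
sub next_id
{
  my ($file) = @_;
  # open for reading and writing, creating the file if necessary
  sysopen(my $fh, $file, O_RDWR | O_CREAT, 0644) || die "Cannot open $file: $!";
  flock($fh, LOCK_EX) || die "Cannot lock $file: $!";
  my $id = <$fh>;
  chomp($id) if defined $id;
  $id = 0 unless defined $id && $id =~ /^\d+$/;
  $id++;
  # rewind and discard the old contents before writing the new value
  seek($fh, 0, 0);
  truncate($fh, 0);
  print $fh "$id\n";
  close($fh);    # closing the handle also releases the lock
  return $id;
}

my ($tmp, $counterFile) = tempfile(UNLINK => 1);
close($tmp);
print "First ID: ", next_id($counterFile), "\n";
print "Second ID: ", next_id($counterFile), "\n";
```

Sequential IDs are easy to guess, so a site that stores sensitive answers under a short cookie may prefer an ID that is harder to predict.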

Hybrid Methods

Often a site needs to collect user information before collecting a user's name, address, and other contact information. For example, at the First Jefferson Mortgage site (http://www.dse.com/FirstJefferson/) the Webmasters wanted to allow users to access the content of the site (including the Mortgage Advisor) before they provide information about who they are. Figures 24.5 through 24.9 show how the Mortgage Advisor works.

Figure 24.5: A first-time user enters the Mortgage Advisor.

Figure 24.6: Based on the user's answers, the Mortgage Advisor recommends a set of products.

Figure 24.7: A first-time user enters more information about financial goals and means.

Figure 24.8: The site makes it easy for the user to start the application process.

Figure 24.9: A first-time user completes the Mortgage Advisor.

Some people would consider the questions asked so far to be personal, yet they are willing to answer them anonymously. The fact that we have stored this information on the server's hard disk and tagged the user with a short cookie does not violate the user's anonymity since we still don't know who this person is.

To tell users more about monthly payments, the script needs to know where users would like to live and how much money they have available. This information is routinely collected by loan officers as part of the application process (users may or may not consider themselves to be "applying" at this point). The script is storing this information on the server's hard disk but is prepared to discard it if the user asks to have it discarded.

If users elect not to have their information forwarded to the loan officer, they are given a blank prequalification form to fill out and to fax to the loan officer. They are also given the option of retaining this information (to simplify their next visit) or of deleting it.

If they allow the information from the Mortgage Advisor to be kept on file, it is used as the basis for their application. With just a few more questions, the application is on its way. This loan originator specializes in giving prompt answers, even in difficult cases, so the user can typically expect to get a loan approval certificate by e-mail quickly.

If users allow their information to be retained, the next time they visit, the first page recognizes their cookie and puts up their personal profile. Figure 24.10 shows a user profile in action.

Figure 24.10: A return user enters the Mortgage Advisor.
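Recognizing a returning user starts with reading the cookie that the browser sends back in the HTTP_COOKIE environment variable. The sketch below is one way to do that; parse_cookies is a hypothetical helper, and the sample cookie string is invented. Matching on the exact cookie name avoids confusing userID with a cookie whose name merely contains that text.

```perl
#!/usr/bin/perl

# Sketch: split a Cookie header value into a hash of name/value pairs.
# parse_cookies is a hypothetical helper name.
sub parse_cookies
{
  my ($header) = @_;
  my %cookie;
  return %cookie unless defined $header;
  foreach my $pair (split(/;\s*/, $header))
  {
    # split on the first '=' only, so values may contain '='
    my ($name, $value) = split(/=/, $pair, 2);
    next unless defined $name && $name ne "";
    $cookie{$name} = defined $value ? $value : "";
  }
  return %cookie;
}

# In a CGI script the argument would be $ENV{'HTTP_COOKIE'}.
my %cookie = parse_cookies("visits=3; userID=42; xuserID=99");
if (exists $cookie{userID})
{
  print "Welcome back, user $cookie{userID}\n";
}
else
{
  print "First visit\n";
}
```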

Providing Custom Services with User Profiles

Using Netscape cookies (or better still, short cookies), a site can maintain a full account folder on each user. Figures 24.11 through 24.14 show how various sites have configured a custom front page.

Figure 24.11: A user configures the LiveWire subscription.

Figure 24.12: The user enjoys the customized subscription.

Figure 24.13: A Bank of America user "builds his or her own bank."

Figure 24.14: The customized Bank of America does not use Netscape cookies.

How Not to Maintain a User Profile

On February 13, 1996, the San Jose Mercury News ran a story that began, "Attention, Web surfers: You'll probably be surprised to hear this, but the Web sites you're visiting may be spying on you and using your own computer's hard disk drive to keep detailed notes about what they see."

The article went on to say that Netscape cookies seem to violate two "nearly universal assumptions" in the Internet community:

Cookies and the San Jose Mercury News

The San Jose Mercury News article was circulating on the Net less than 24 hours after the paper hit the newsstands. Not surprisingly, the Web community was polarized. See the full thread in the HWG Mailing List Archives with the subject, "Cookies violate your privacy." The two sides of the discussion ran like this:

Some disgruntled users suggested passing their cookie files around so that Webmasters wouldn't be able to tell who was who. They called this approach the "cookie oven." Others argued that cookies would be acceptable if the user had a way of turning them off.

On the other side, readers pointed out that most sites use cookies to provide user profiles or similar services. They noted that a Webmaster wouldn't know who the user was unless the user chose to share that information.

They also acknowledged that in the hands of an aggressive Webmaster, cookies could indeed be used to track which pages a user visited and how long he or she spent on each one. It seems that this "big brother" potential is what many Web users fear most.

Netscape and the Wall Street Journal

On February 14, 1996, the Wall Street Journal printed an article that said cookies are a feature that "allows merchants to track what customers do in their online storefronts and how much time they spend there" and that cookies allow merchants to track customers' movements "over long periods of time." The Journal quoted Netscape Communications as saying that Netscape would modify the browser so that in future versions, customers could disable cookies.

Based upon beta releases, Netscape Navigator 3.0 will have a menu item under Options|Network Preferences|Protocols called "Show and Alert Before Accepting a Cookie."

How to Get Everyone Mad at You

Whether you believe that cookies are a useful tool for preserving client state or that they are the tentacles of Big Brother, tracking each Netizen's movements from his or her own hard disk, Webmasters have a responsibility to respect both points of view.

Most users do not seem to object to the use of cookies to store a few bytes of state information during a single session, as described in Chapter 9, "Making a User's Life Simpler with Multipart Forms." When the cookie has an expiration date (so that it lives past the end of this session), it is a good idea to follow three rules:

Some sites have two objectives: provide a service to the user (possibly anonymously) and collect user information for use by the site owner. When the site has these two objectives, the hybrid approach (described earlier under "Hybrid Methods") offers three advantages over the pure registration or pure cookie method:

The downside of the use of cookies, of course, is that they only work with certain browsers. Although that number is growing steadily, not every visitor to the site has a cookie-aware browser.

Implementing a "Polite" Cookie Site

Figure 24.15 shows the page of the First Jefferson Mortgage site that collects user information. Before coming to this page, users have answered questions about their financial goals and means. This information is considered quite sensitive.

Figure 24.15: Get permission before associating the cookie information with contact information.

When users submit this form, the information is sent to a loan officer who uses it to qualify the user for a loan. Access to the user's previous answers helps the loan officer design a mortgage product that is best for the user, but the site does not assume that the user is willing to share that information.

Instead, users are given a control to determine whether or not to share that information. If they elect not to share it, the cookie is expired so that it is not available after this browser session. The code to implement this control is given in Listing 24.5, and uses the code from Listing 24.4.


Listing 24.5  apply.cgi: The Site Only Uses the Cookie Information if the User Grants Permission

#!/usr/bin/perl

require "html.cgi";

# if started by POST
if ($ENV{'REQUEST_METHOD'} eq 'POST' )
{
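  # using POST; look to STDIN for fields
  read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
  # split the name/value pairs and store them in the FORM array
  foreach $pair (split(/&/, $buffer))
  {
    ($name, $value) = split(/=/, $pair);
    $value =~ tr/+/ /;
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    $FORM{$name} = $value;
  }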

  open (PROFILE, "users.txt") || &die ("Cannot open user database\n");
  
  $foundIt = 0;

  # The domain must match the one used when the cookie was set
  # (see Listing 24.4).
  $theDomainName = ".dse.com";

  $userID = "";
  $theCookie = $ENV{'HTTP_COOKIE'};
  if ($theCookie =~ /userID/)
  {
    @cookies = split (/; /, $theCookie);
    foreach (@cookies)
    {
     ($name, $value) = split(/=/, $_);
     if ($name eq 'userID')
     {
       $userID = $value;
       last;
     }
    }
  }
  if ($userID eq "")
  {
    # if the user doesn't have our cookie, start him at the 
    # beginning of the mortgage advisor.
    print "Location: http://www.dse.com/FirstJefferson/MortgageAdv.html\n\n";
    exit;
  }

  # No response header has been printed yet; each branch below
  # prints exactly one set of headers.
  while (<PROFILE>)
  {
    chomp;
    ($storeduserID, $storeduserFirstName, $storeduserLastName, $storeduserAddress,
     $storeduserHomePhone, $storeduserWorkPhone, $storeduserLive,
     $storeduserDownPaymentandClosingCost, $storeduserInitialRate,
     $storeduserTypeofLoan, $storeduserMonthlyIncome, $storeduserMonthlyObligations,
     $storeduserStay, $storeduserCondoFees, $storedusereveryyear,
     $storeduserborrow, $storeduserRealEstateTax, $storeduserLoanAmount,
     $storeduserHomePrice, $storeduserClosingCost,
     $storeduserDownPayment, $storeduserMonthlyPayment) = split (':', $_);
    next unless ($storeduserID eq $userID);

    $foundIt = 1;
    # $usersName, $usersEmail, $siteOwner, and $ownersEmail are not set
    # anywhere in this listing; define them for your site before use.
    open (MAIL, "| sendmail -t") || &die("Cannot open sendmail.");
    print MAIL "To: $usersName <$usersEmail>\n";
    print MAIL "From: $siteOwner <$ownersEmail>\n";
    # a blank line separates the mail headers from the body
    print MAIL "Subject: Loan Application\n\n";

    # Contact information is sent in either case
    print MAIL "First Name: $storeduserFirstName\n";
    print MAIL "Last Name: $storeduserLastName\n";
    print MAIL "Address: $storeduserAddress\n";
    print MAIL "HomePhone: $storeduserHomePhone\n";
    print MAIL "WorkPhone: $storeduserWorkPhone\n";

    if ($FORM{forward} eq "Yes")
    { 
      # The user agreed to forward the Mortgage Advisor answers
      print MAIL "The stored user ID in the profile is $storeduserID\n";
      print MAIL "You Plan to Live In: $storeduserLive\n";
      print MAIL "Condo Fees: $storeduserCondoFees\n";
      print MAIL "Down Payment & Closing Cost: $storeduserDownPaymentandClosingCost\n";
      print MAIL "Initial Rate: $storeduserInitialRate\n";
      print MAIL "Type of Loan: $storeduserTypeofLoan\n";
      print MAIL "Monthly Income: $storeduserMonthlyIncome\n";
      print MAIL "Monthly Obligations: $storeduserMonthlyObligations\n";
      print MAIL "How long do you expect to stay in this home: $storeduserStay\n";
      print MAIL "Would you like your monthly payment to adjust periodically (possibly higher, possibly lower): $storedusereveryyear\n";
      print MAIL "How much do you plan to borrow: $storeduserborrow\n";
      print MAIL "Real Estate tax: $storeduserRealEstateTax\n";
      print MAIL "Loan Amount: $storeduserLoanAmount\n";
      print MAIL "Home price: $storeduserHomePrice\n";
      print MAIL "Closing Cost: $storeduserClosingCost\n";
      print MAIL "Down Payment: $storeduserDownPayment\n";
      print MAIL "Monthly Payment: $storeduserMonthlyPayment\n";
      close (MAIL);
      print "Location: ..... \n\n";
    } # if forward is YES 
    else
    {
      # Contact info only; expire the cookie by setting a date in the
      # past. The header must be on one line, and its path and domain
      # must match those used when the cookie was set.
      close (MAIL);
      print "Content-type: text/html\n";
      print "Set-Cookie: userID=$userID; expires=Monday, 01-Jan-96 00:00:00 GMT; path=/; domain=$theDomainName\n\n";
    } # end of 'Forward is No'
    last;   # this was the user we were looking for
  } # keep on looping
  close (PROFILE);

  # if we did not find him in the user file...
  if (!$foundIt)
  { 
    # step2.cgi and its successors are responsible for writing
    # profile information to the users.txt file in colon-delimited
    # format
    &html_header("Login");
    print "<FORM METHOD=\"POST\" ACTION=\"http://www.dse.com/cgi-bin/dse/FirstJefferson/step2.cgi\">\n";
    print "Last Name<BR><INPUT TYPE=Text name=LastName size=44><BR>\n";
    print "First Name<BR><INPUT TYPE=Text name=FirstName size=44><BR>\n";
    print "<INPUT TYPE=Submit VALUE=\"Log in\">\n"; 
    print "<INPUT TYPE=Reset VALUE=Clear><BR>\n";
    print "</FORM>\n";
    &html_trailer;
  } # get his personal data if he's not on file 
} # end of check for METHOD=POST
else
{
  &html_header('Error');
  print "Not started via POST\n";
  &html_trailer;
}
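
Both lookup.cgi and apply.cgi need to decode the fields of a POSTed form. The decoding loop from Listing 24.1 could be factored into a subroutine and kept in a shared file. The following is a sketch under that assumption; parse_form is a hypothetical name and is not part of the html.cgi shown in Listing 24.4.

```perl
#!/usr/bin/perl

# Sketch: decode a URL-encoded form submission into a hash.
# parse_form is a hypothetical helper name.
sub parse_form
{
  my ($buffer) = @_;
  my %form;
  foreach my $pair (split(/&/, $buffer))
  {
    next unless length $pair;
    # split on the first '=' only, so values may contain '='
    my ($name, $value) = split(/=/, $pair, 2);
    $value = "" unless defined $value;
    foreach ($name, $value)
    {
      tr/+/ /;                                            # '+' encodes a space
      s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;  # %XX escapes
    }
    $form{$name} = $value;
  }
  return %form;
}

# In a CGI script the argument would be the buffer read from STDIN.
my %FORM = parse_form("firstName=Mary+Ann&lastName=O%27Hara&forward=Yes");
print "$FORM{firstName} $FORM{lastName} (forward: $FORM{forward})\n";
```

Decoding the plus signs before the %XX escapes matters: a literal plus sign arrives as %2B and must not be turned into a space.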