Chapter 34

How to Add Sound


Whether it's Oscar the Grouch singing "I love trash," a custom beep, or the President's weekly radio address, we love to hear sound from our computers. This chapter describes how to use sound on a Web site.

Short bits of sound can be used for entertainment, or to draw a visitor into a site. Longer sound files have content value. Some of the best content comes from real-time audio such as RealAudio from Progressive Networks, which allows the visitor to start playing the sound file while it continues to download in the background.

Unfortunately, sound files are big. Without very sophisticated compression, real-time audio and even most off-line sound would take so long to download that few users would tolerate it. Much of this chapter, therefore, is focused on audio compression.

Understanding Sound

Getting sound to play from a Web site is far more complicated than just recording it, saving it to a file, and sending the file to a user. Sound, of course, is made by compressions in the air. These compressions are translated into voltages by a microphone and possibly an amplifier. After that, what happens to the sound depends on how it will be used and on what the Webmaster hopes to accomplish with sound.

From Sound to Numbers

Figure 34.1 shows that the incoming analog voltage from a microphone varies by frequency as well as by level. The frequency corresponds to pitch or tone, and the level corresponds to volume. Since computers are much more comfortable with numbers than they are with varying voltages, the signal is sampled to get a file of numbers.

Figure 34.1 : Sampling an analog signal.


The first decision when obtaining a sound for use on the Web is the sampling rate. To understand sampling rates, we must first understand frequency.

Note from Figure 34.1 that the alternate compressions and decompressions of the air form cycles. The number of cycles per second is the frequency. Frequency is measured in Hertz, where one cycle per second is 1 Hertz (Hz).

The human range of hearing runs the gamut from around 20 Hz to more than 20,000 Hz (20 kHz). Most of the information content in the human voice is concentrated in the band between about 2,000 Hz and 4,000 Hz. This range is called the voice band.

Sampling Rates and the Nyquist Theorem

There is a principle in physics called the Nyquist theorem that says that to reproduce a sound you must sample that sound at a rate at least twice the highest frequency in the signal. Note that the theorem applies to the highest frequency present, not just the highest frequency of interest.

Most audio systems use a low-pass filter to remove the part of the signal that human beings can't hear, so only sound at about 20 kHz or below gets through to be sampled. To capture a sound with 20 kHz present, the sampler must collect at least 40,000 samples per second.

Most engineers prefer to leave a bit of margin to allow for filter inaccuracies, so sample rates just above 40 kHz are common. Some professional audio equipment samples at 44,056 samples per second. The CD sampling rate is 44,100 samples per second. In the U.S., digital audio tape is sampled at 48,000 samples per second.
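As a rough illustration of the sampling rule described above, the following Python sketch computes the minimum sampling rate for a given highest frequency and turns a pure tone into a list of numbers. The function names are illustrative only and do not come from any audio library.

```python
import math

def min_sample_rate(highest_freq_hz):
    """Nyquist rule: sample at least twice the highest frequency present."""
    return 2 * highest_freq_hz

def sample_sine(freq_hz, sample_rate, n_samples):
    """Return n_samples of a unit-amplitude sine wave taken at sample_rate."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

print(min_sample_rate(20_000))  # 40000 samples per second for the full audible range
print(min_sample_rate(4_000))   # 8000 samples per second for the voice band

# Five samples of a 1,000-Hz tone taken at the CD rate.
samples = sample_sine(1_000, 44_100, 5)
print(samples)
```

Real recording hardware also applies the low-pass filter before sampling; the sketch skips that step because the test tone contains only one frequency.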


The next decision to be made when collecting sound is how many bits to use to represent each sample. This decision is like the choice of the number of bits used to represent full color in an image. (See Chapter 33, "How to Add High-End Graphics.")

Chapter 33 points out that for many purposes, human beings either could not tell the difference or did not find the difference annoying when the number of bits per color was reduced from 8 to, say, 6 or 7. The same principle holds with sound.

For best results, sound should be recorded at 16 bits per sample. Doing the math, a 60-second sound file sampled at 44,100 samples per second and 16 bits per sample would take up over 5M, and would take at least an hour to download over a 14,400-bps connection.

Of course, for true audiophiles only stereo will do. Doubling the above numbers to come up to two channels means that data is coming out of the system at the rate of 1,411,200 bps. Clearly, some compression is called for.
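The arithmetic behind these figures can be checked with a short Python sketch. The helper names are hypothetical, and the download time assumes the modem delivers its full rated speed with no protocol overhead, so real transfers take longer.

```python
def pcm_bits_per_second(sample_rate, bits_per_sample, channels=1):
    """Raw data rate of uncompressed sampled sound."""
    return sample_rate * bits_per_sample * channels

def pcm_bytes(seconds, sample_rate, bits_per_sample, channels=1):
    """Size in bytes of an uncompressed recording."""
    return seconds * pcm_bits_per_second(sample_rate, bits_per_sample, channels) // 8

def download_minutes(n_bytes, line_bps):
    """Best-case transfer time, ignoring all overhead."""
    return n_bytes * 8 / line_bps / 60

mono_bytes = pcm_bytes(60, 44_100, 16)
print(mono_bytes)                                   # 5292000 bytes, a little over 5M
print(download_minutes(mono_bytes, 14_400))         # 49.0 minutes at the raw line rate
print(pcm_bits_per_second(44_100, 16, channels=2))  # 1411200 bps for stereo
```

The 49-minute figure is a floor. Framing and protocol overhead on a real dial-up link push the transfer toward the hour mark.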

Encoding and Compression (Especially MPEG Audio)

A simple approach to storing sound might map the input signal into 65,536 levels and represent each level with one 16-bit word. This approach leads to the huge file sizes described above. A better approach is to take advantage of the fact that each sample tends to be fairly close to its predecessor.

You can get significant compression by just storing the differences between successive samples. Companies with a major stake in the CD industry like Philips and SONY have spent millions of dollars to develop better compression schemes.
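A minimal sketch of the difference-storing idea follows. It is plain delta encoding written for illustration, not the scheme used by any particular vendor, and the function names are hypothetical.

```python
def delta_encode(samples):
    """Store the first sample, then only the change from each sample to the next."""
    previous = 0
    deltas = []
    for sample in samples:
        deltas.append(sample - previous)
        previous = sample
    return deltas

def delta_decode(deltas):
    """Rebuild the original samples by keeping a running total."""
    total = 0
    samples = []
    for delta in deltas:
        total += delta
        samples.append(total)
    return samples

samples = [1000, 1004, 1009, 1011, 1008]
deltas = delta_encode(samples)
print(deltas)                # [1000, 4, 5, 2, -3]
print(delta_decode(deltas))  # [1000, 1004, 1009, 1011, 1008]
```

After the first value, the differences are small numbers that fit in far fewer bits than the 16-bit samples themselves. That is where the saving comes from.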

Another way to quickly reduce the size of the file is to throw away some of the information. Recall that the voice band only goes up to about 4 kHz. When you set a low-pass filter to reject sound above that level, the sampling rate needed by the Nyquist theorem drops to around 8,000 samples per second, about 1/5 of the rate needed for the full 0-to-20-kHz range. Confining the samples to 8 bits rather than 16 also reduces the file size, though at some cost to quality.

Audio engineers get very interested in something called the signal-to-noise ratio (S/N ratio). The higher the S/N ratio, the more the sound that you want to hear stands out from the background. Each additional bit of quantization improves the S/N ratio by 6 decibels (dB). To the ear, 6 dB sounds like a doubling of the sound level. CD audio with 16 bits of quantization gives about 90 dB S/N ratio.

Moving from 16 bits to 8 bits of quantization drops the S/N ratio from 90 dB down to around 40 dB. The sound is perceived as being only about 1/16 as loud, so the amplifiers must be turned up to get the same volume. On an 8-bit system, during the quiet moments between words or songs, there is a perceptible hiss. This hiss is quantization noise. It is the price paid for giving up those extra 8 bits.
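The 6-dB-per-bit rule of thumb is easy to tabulate. The sketch below applies it literally, so it gives 96 dB and 48 dB for 16 and 8 bits. Treat those as idealized ceilings that sit close to the rounder figures quoted above; the function name is illustrative.

```python
def snr_db(bits, db_per_bit=6):
    """Idealized signal-to-noise ratio for a given number of quantization bits."""
    return bits * db_per_bit

for bits in (16, 8, 4):
    print(bits, "bits ->", snr_db(bits), "dB")

# Each bit given up costs about 6 dB, so dropping 8 bits costs about 48 dB.
print(snr_db(16) - snr_db(8))  # 48
```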

A 60-second sound sampled at 8,000 samples per second and quantized into 8 bits takes up just under 469K, about 1/10 of the size of the file sampled at 44,100 samples per second and 16 bits per sample.

A-Law, µ-Law, and Companding

Chapter 33, "How to Add High-End Graphics," points out that since the human eye is more sensitive to some colors than to others, image compression schemes can get better quality by using precious bandwidth for the colors we see best.

Sound is similar. More information is in the low-level components of the signal than in the higher levels, so schemes have been developed that compress the high-level signals, and enhance or expand the low-level signals.

In general, these techniques are called companding. They are heavily used in telephony since telephone engineers have decades of experience improving the quality of voice-band signals.

Figure 34.2 illustrates how a typical compander works.

Figure 34.2 : Companding techniques expand the parts of the signal that carry the most information, to give those parts extra bandwidth.

The companding standard in the U.S. and Japan is µ-law (also written mu-law or u-law). The European standard is A-law. Both standards do the same thing but differ in the specifics of how they do it.

Software to implement µ-law is available online. The A-law standard itself is also online, but you must be a member of the International Telecommunications Union (ITU) to access it.

Using µ-law companding, a 13-to-16-bit signal from the voice band can be encoded into just 8 bits. After it is transmitted, the signal is reconstructed to restore the original quality. Files with the extension .AU are usually µ-law files.
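To make the idea of companding concrete, here is a sketch of the textbook continuous µ-law curve with µ = 255. It is not the bit-exact telephony encoder defined by the standard, and the 8-bit quantization step and all function names are simplifications chosen for illustration.

```python
import math

MU = 255  # companding constant used by mu-law telephony

def mulaw_compress(x, mu=MU):
    """Map a sample in [-1, 1] to [-1, 1], stretching low-level signals."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mulaw_expand(y, mu=MU):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def encode_8bit(x):
    """Compress, then quantize to one of 256 codes (0..255)."""
    return round((mulaw_compress(x) + 1) / 2 * 255)

def decode_8bit(code):
    """Turn an 8-bit code back into an approximate sample."""
    return mulaw_expand(code / 255 * 2 - 1)

quiet = 0.01  # a low-level sample, 1 percent of full scale
print(mulaw_compress(quiet))            # roughly 0.23: the quiet sample gets far more of the code range
print(decode_8bit(encode_8bit(quiet)))  # close to 0.01 again after the round trip
```

The quiet sample occupies about a quarter of the output range instead of one percent of it. That stretch is what lets 8 bits carry a signal that would otherwise need 13 or more.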

µ-law and A-law are examples of encoding a signal before storage and transmission, and decoding it during playback to recover the quality of the original signal. Software that implements such a coding and decoding scheme is generally called a codec. µ-law and A-law represent good codecs based on an old technology. The newest research is in codecs that take into account the way people actually hear.

A Psychoacoustic View of Sound

Since µ-law and A-law were developed, researchers have uncovered important information about the way human hearing works. They have found that under certain conditions, sounds that are present are never heard: They are masked by other sounds.

Newer codecs take this information into account and don't waste bandwidth sending sounds that will never be heard. The three forms of psychoacoustic masking are concurrent masking, premasking, and postmasking.

Figure 34.3 shows signals at 1000 Hz, 1100 Hz, and 2000 Hz. The 1100-Hz tone is 18 dB down from the 1000-Hz tone. The 2000-Hz tone is 45 dB down. If all three tones are present at these levels in the signal at the same time, the listener hears only the 1000-Hz tone. The other two tones are masked. This phenomenon is called concurrent masking.

Figure 34.3 : The listener hears only the 1000-Hz tone, a phenomenon known as concurrent masking.

Suppose there is an abrupt shift in sound levels. The signal drops by, say, 30 to 40 dB. For the next 100 milliseconds (ms) or so, the listener does not hear the lower signal. It has been masked out by postmasking.

Premasking occurs before such an abrupt shift. Suppose the signal jumps up by 30 or 40 dB. The brain begins to process the new, high-level signal and discards the last 2 to 5 ms of processed data. The listener never hears that sound.

Perceptual codecs take advantage of this masking effect by not storing sounds the ear will never hear. For example, a 6500-Hz tone at 60 dB masks nearby signals (within about 750 Hz) that fall below a threshold of about 35 dB. The encoder can allow quantization noise to climb as high as that 35-dB threshold, since everything below 35 dB will be masked. The S/N ratio it must maintain is therefore only 25 dB (60 dB minus 35 dB). Recall that each bit contributes 6 dB, so an S/N ratio of about 25 dB can be achieved in just 4 bits.
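The bit-allocation arithmetic in this example can be sketched in a few lines of Python. The helper names are hypothetical, and rounding to the nearest whole bit is an assumption made to match the "about 25 dB" wording; a real encoder uses a more careful allocation.

```python
def required_snr_db(tone_level_db, masking_threshold_db):
    """S/N ratio the encoder must maintain so that noise stays masked."""
    return tone_level_db - masking_threshold_db

def approx_bits(snr_db, db_per_bit=6):
    """Nearest whole number of bits for a target S/N ratio (about 6 dB per bit)."""
    return max(1, round(snr_db / db_per_bit))

snr = required_snr_db(60, 35)
print(snr)               # 25
print(approx_bits(snr))  # 4
print(approx_bits(96))   # 16, the full CD word length, for comparison
```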

In real-world signals, there are sounds at many different frequencies, each of which adds masking effects. The encoder is continually recalculating the noise floor in each band and using up just enough bits to maintain the necessary S/N ratio.

MPEG Audio: Layers 1, 2, and 3

The Moving Picture Experts Group (MPEG) is a group of international experts who set standards for digital video and audio compression. (Except for the fact that both groups meet under the auspices of the International Organization for Standardization (ISO), MPEG has nothing to do with the still-image standards body, JPEG.)

MPEG I is the original standard for compressed audio and video. It is optimized to fit into a 1.5-Mbit/sec bit stream. Recall that this figure is approximately the data rate of uncompressed, CD-quality stereo audio. MPEG II extends the standard to bit streams between 3 and 10 Mbits/sec.

The MPEG standard is formally called ISO CD 11172. The third part of the MPEG I standard (11172-3) addresses audio compression. Under MPEG I, the 1.5 Mbit/sec CD-quality stereo audio can be compressed down to 256 kbits/sec, leaving about 1.25 Mbits/sec for the video. MPEG video compression is discussed in Chapter 35, "How to Add Video."

MPEG I audio compression uses a perceptual codec based on psychoacoustic models such as those discussed in the previous section. Specifically, MPEG audio compression can be done by any of three codecs, called layer 1, layer 2, and layer 3.

Each layer offers higher performance (and increased complexity) than the one before it. Each layer is also compatible with the one before it. A layer-2 decoder can decode a bit stream encoded in either layer 1 or layer 2. A layer-3 decoder can decode any MPEG I audio bit stream.

All MPEG audio begins by passing the signal (20 to 20000 Hz) through a bank of filters, separating it into 32 subbands. Layer 3 does additional processing to increase the frequency resolution. All three layers send header information such as the sampling rate.

In MPEG layer 2, when the sampling rate is 48,000 samples per second, each subband is 750 Hz wide. This division isn't ideal (the human ear is more sensitive to the lower frequencies than the higher frequencies), but it simplifies the computation. Layer 3 uses a modified discrete cosine transform to effectively divide the frequency spectrum into 576 subbands, giving it much better performance at low bit rates.
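The 750-Hz figure follows from splitting the usable bandwidth (half the sampling rate) evenly across the subbands. A small sketch, with an illustrative function name:

```python
def subband_width_hz(sample_rate, n_subbands=32):
    """Width of each equal subband when the band from 0 to sample_rate / 2 is split evenly."""
    return (sample_rate / 2) / n_subbands

print(subband_width_hz(48_000))       # 750.0 Hz for the 32-band filter bank
print(subband_width_hz(44_100))       # 689.0625 Hz at the CD sampling rate
print(subband_width_hz(48_000, 576))  # about 41.7 Hz for layer 3's finer division
```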

Layers 2 and 3 look at a 24-ms window (when the sampling rate is 48,000 samples per second) for pre- and postmasking effects. This figure also represents a compromise between computational reality and fidelity to the psychoacoustic model. Layer-3 encoders have additional code to detect certain kinds of transient signals and drop the window size to 4 ms for detailed analysis.

To choose an appropriate MPEG audio layer, the Webmaster must take into account the available bit rate, the desired quality, the codec delay, and the hardware and software resources available. MPEG decoders are relatively simple compared to the encoders, so the encoding hardware and software is most often the limiting factor.

Audio quality is best determined by listening tests. Indeed, for perceptual codecs, audio quality can only be evaluated by human listeners. In formal listening tests, the listener is presented with three signals: the original (called signal A), followed by pairs of the original and the encoded signal (called signal B and signal C) in random sequence.

The listener is asked to evaluate B and C on a scale of 1 to 5, as shown in Table 34.1.

Table 34.1  Human Listeners Evaluate Sound Sources on a Scale of 1 to 5

Score  Meaning
5      Transparent (indistinguishable from the original signal)
4      Perceptible difference but not annoying
3      Slightly annoying
2      Annoying
1      Very annoying

Table 34.2 summarizes the quality, target bit rate, and codec delay for the three MPEG audio layers.

Table 34.2  For Moderate Bitrates: Layer 3 Scores Appreciably Better Than Layer 2

Layer     Target Bitrate   Codec Delay
Layer 1   192 Kbps         19 ms (< 50 ms)
Layer 2   128 Kbps         35 ms (100 ms)
Layer 3   64 Kbps          59 ms (150 ms)

The quality figures in this table are for bit rates of around 60 to 64 Kbps. At bit rates of 120 Kbps, layer 2 and layer 3 performed about the same: Listeners found it difficult to distinguish them from the original signal.

The codec delays shown in the table represent the theoretical minimums (from the standard) and the practical values (in parentheses) given in the MPEG frequently asked questions list.

For most applications, even the longest delays do not represent a problem. They do serve as an indicator of processing complexity, however. Real-time encoders are based on special hardware such as digital signal processing chips (DSPs). Stereo layer 3 real-time encoders that meet ISO reference quality need two DSP32C or two DSP56002 chips.

Most desktop computers have no DSPs; a few have just one. Webmasters typically choose to outsource MPEG audio encoding rather than invest in additional hardware.

MPEG I accommodates two audio channels that can be used to deliver stereo. Layers 2 and 3 accommodate intensity stereo. Layer 3 accommodates m/s stereo. In intensity stereo, the high-frequency portions of both signals (above 2 kHz) are combined. In m/s stereo, one channel carries the sum of the two signals (left + right) whereas the other carries the difference (left - right).
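The sum-and-difference idea behind m/s stereo can be shown with integer samples. This sketch covers only the arithmetic described above; the scaling and bit allocation a real layer-3 encoder applies are not modeled, and the function names are hypothetical.

```python
def ms_encode(left, right):
    """Return (mid, side): mid carries left + right, side carries left - right."""
    mid = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Recover left and right from the sum and difference channels."""
    left = [(m + s) // 2 for m, s in zip(mid, side)]
    right = [(m - s) // 2 for m, s in zip(mid, side)]
    return left, right

left = [100, 102, 105, 103]
right = [98, 101, 105, 104]
mid, side = ms_encode(left, right)
print(side)                  # [2, 1, 0, -1]
print(ms_decode(mid, side))  # ([100, 102, 105, 103], [98, 101, 105, 104])
```

When the two channels are similar, the difference channel stays near zero and needs very few bits.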

Transmission and Decoding

Note that even the most aggressive MPEG audio (layer 3) only brings the bit rate down to 64 Kbps. That's 8,000 bytes per second, about three times as fast as the practical throughput of a 28,800-bps modem connection and six times as fast as the throughput of a 14,400-bps connection.
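A quick calculation shows where the "three times" and "six times" comparisons come from. The 75 percent efficiency factor below is an assumption picked to represent practical modem throughput. It is not a measured value, and the function names are illustrative.

```python
def bytes_per_second(bits_per_second):
    """Convert a bit rate to bytes per second."""
    return bits_per_second / 8

def times_faster(stream_bps, modem_bps, efficiency=0.75):
    """How many times the stream rate exceeds the modem's practical throughput.

    efficiency is an assumed fraction of the rated speed that carries payload.
    """
    return stream_bps / (modem_bps * efficiency)

print(bytes_per_second(64_000))      # 8000.0
print(times_faster(64_000, 28_800))  # just under 3
print(times_faster(64_000, 14_400))  # just under 6
```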

Whereas ISDN and other fast connections may make it possible to play MPEG audio in real time, for the immediate future Webmasters have to make the files downloadable and let users play them from their local hard disk.

The good news is that real-time decoders are approaching the computational capacity of many user machines (though layer-2 and layer-3 decoders still need at least one DSP or a dedicated MPEG audio decoder chip to keep up with real-time). Many desktop MPEG players are actually converters. They convert the MPEG audio into a native format such as Audio IFF (AIFF) and then play the native format in real time.

Things to Do with Sound

Just as high-end graphics can enhance a site if used carefully and ruin a site if misused, sound also can add a certain sparkle to a site or destroy it. Some sites set up their home page to play a sound file when the user downloads it. This effect is novel the first time it is used. It is tiresome the 10th time the sound is heard. It is quite annoying the 100th time the same sound is played.

Sound has been an area of active research in recent years. This section describes various sound formats and protocols available to the Webmaster.

Real-Time Audio

For technical reasons described earlier in the chapter, high-quality sound such as you might hear from a CD needs much more bandwidth than, say, human voice. When modem speeds topped out at 2,400 bps, the only way to serve sound was to send the entire file to the user and let them play it from the desktop. While that method is still used for high-quality sounds, several companies have introduced real-time audio for the Web.

Real-Time Audio in an HTTP Environment

RealAudio, a set of software from Progressive Networks, offers voice-grade, real-time audio. This company's latest product, RealAudio 2.0, does a good job of delivering music as well as speech. To perform these feats, Progressive Networks has developed a great deal of behind-the-scenes technology.

Recall from Chapter 4, "Designing Faster Sites," that HTTP, the protocol of the Web, is meant to accommodate requests for files. When a client sends a GET, the server locates the requested entity, sends it back, and closes the connection.

This protocol is not well-suited for the way people listen to audio. They fast-forward, they rewind, they look for a 4-minute snippet out of a 30-minute file. HTTP is based on TCP, one of the two major ways packets can be sent over transmission control protocol/Internet protocol (TCP/IP) networks.

TCP emphasizes reliable delivery. As described in Chapter 4, TCP relies on a three-way handshake and packet numbers to make sure that the receiver gets every packet. If a packet is not acknowledged, the sender sends it again. If the connection quality is poor, the sender keeps trying to resend packets to make sure the receiver doesn't miss any data.

For real-time audio, this guaranteed delivery is neither necessary nor useful. A 2- to 3-percent retransmission rate can bring a 14.4 Kbps modem connection to a standstill. Figure 34.4 shows a typical client statistics screen with about a 2-percent error rate.

Figure 34.4 : Retransmission rates, as measured by the client, typically range from 2 to 3 percent.

One TCP/IP protocol that does not guarantee delivery is user datagram protocol (UDP). Using UDP, the sender sends out packets as fast as it can without waiting for acknowledgments. UDP is often used in TCP/IP applications for status reporting. If one packet gets dropped, it doesn't matter since a new status will be along momentarily.

With TCP, each retransmitted packet is delayed by a few milliseconds compared to where it should have appeared in the data stream. With audio, these delays begin to become noticeable when just 2 or 3 packets out of 100 are retransmitted.

Another need specific to a real-time audio server is for a large number of connections. A typical Web server may have anywhere from 6 to 100 copies of the HTTP daemon running. A site serving a live audio event may have thousands or even hundreds of thousands of simultaneous connections.

Progressive Networks decided not to try to force this kind of behavior onto Web servers. Instead, they built their own server (which is available commercially) and their own client. The client is downloadable from their Web site at http:/

The server can use either TCP or UDP, although the best results come from using UDP. The RealAudio server gives good performance under modest retransmission levels (2 to 5 percent) and degrades smoothly as retransmission levels approach 10 percent.

To deal with packet loss, the RealAudio client does not request retransmission of any lost packets. Instead it makes an approximation of the lost packet based on the packets around it. For modest loss rates, the effect is not noticeable by most listeners.
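The text does not say how the RealAudio client builds its approximation, so the following sketch substitutes the simplest possible stand-in: a lost packet is replaced by the sample-by-sample average of its nearest received neighbors. The function name and the packet representation (a list of samples, with None marking a loss) are assumptions made for illustration.

```python
def conceal_lost_packets(packets):
    """Fill each lost packet (None) from the nearest received packets around it."""
    repaired = list(packets)
    for i, packet in enumerate(packets):
        if packet is not None:
            continue
        # Nearest received packet before and after the gap, if any.
        before = next((p for p in reversed(packets[:i]) if p is not None), None)
        after = next((p for p in packets[i + 1:] if p is not None), None)
        if before is not None and after is not None:
            repaired[i] = [(a + b) / 2 for a, b in zip(before, after)]
        else:
            # Only one side available (or neither): copy what we have.
            source = before if before is not None else after
            repaired[i] = list(source) if source is not None else None
    return repaired

stream = [[0, 10], None, [4, 20]]
print(conceal_lost_packets(stream))  # [[0, 10], [2.0, 15.0], [4, 20]]
```

Because nothing is requested again from the server, playback never waits on a retransmission. The cost is a brief approximation in place of the missing sound.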

RealAudio and TrueSpeech

RealAudio makes it easy for users to listen to RealAudio files. The client software is available for all major platforms. The server is commercially available and is easy to configure.

RealAudio 1 provides quality similar to a good AM radio station. On a fast processor, the quality is good enough for speech. RealAudio 2's quality is comparable to a nonstereo FM station.

TrueSpeech is a family of speech compression and decompression algorithms developed by the DSP Group and adopted by Microsoft for use in Windows 95. There are two major algorithms in the family: TrueSpeech 8.5 and TrueSpeech 6.3/5.3/4.8.

The numbers in the TrueSpeech algorithm names refer to the bit rates supported. TrueSpeech 6.3/5.3/4.8 supports 6.3, 5.3, and 4.8 Kbps and can be switched on the fly. This algorithm is the basis for the ITU voice compression standard G.723.

TrueSpeech 8.5 supports only 15:1 compression (compared to 20:1 and 24:1 for TrueSpeech 6.3 and 5.3, respectively) but it needs only about half the computing power to encode and decode. DSP Group provides players online that can play TrueSpeech 8.5 on most major platforms. Encoders are available in Windows 95 and Windows NT.
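The compression ratios quoted here are consistent with a source signal of 8,000 samples per second at 16 bits per sample (128 Kbps). That source rate is an assumption inferred from the ratios, not something stated in the text. Under that assumption, the ratios can be reproduced as follows:

```python
# Assumed uncompressed source: 8,000 samples/s x 16 bits = 128 Kbps.
SOURCE_KBPS = 8_000 * 16 / 1000

def compression_ratio(encoded_kbps, source_kbps=SOURCE_KBPS):
    """Ratio of the uncompressed bit rate to the encoded bit rate."""
    return source_kbps / encoded_kbps

for kbps in (8.5, 6.3, 5.3, 4.8):
    print(f"TrueSpeech {kbps}: about {compression_ratio(kbps):.1f}:1")
```

The first three lines of output round to the 15:1, 20:1, and 24:1 figures given above.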

Even though TrueSpeech 8.5 needs less computing power than TrueSpeech 6.3/5.3, it can still challenge older computers. If the player stutters, allocate a larger buffer to the application. The program fills the buffer before starting to play the sound so it will take a bit longer to start playing. You can also wait until the file has loaded and then play it from the cache.
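The trade-off between buffer size and start-up delay can be estimated with simple arithmetic. The sketch below is a rough model, not a description of the actual player: it assumes data arrives and plays at constant rates, and the function names and sample numbers are hypothetical.

```python
def startup_delay_s(buffer_bytes, arrival_bytes_per_s):
    """Seconds spent filling the buffer before playback can begin."""
    return buffer_bytes / arrival_bytes_per_s

def seconds_until_stall(buffer_bytes, arrival_bytes_per_s, playback_bytes_per_s):
    """How long a full buffer lasts when data plays faster than it arrives."""
    if arrival_bytes_per_s >= playback_bytes_per_s:
        return float("inf")  # the connection keeps up, so playback never stalls
    return buffer_bytes / (playback_bytes_per_s - arrival_bytes_per_s)

playback = 8_500 / 8  # TrueSpeech 8.5 at 8.5 Kbps is 1062.5 bytes per second
arrival = 1_000       # assumed delivery rate in bytes per second

for buffer_bytes in (8_000, 16_000):
    print(buffer_bytes,
          startup_delay_s(buffer_bytes, arrival),
          seconds_until_stall(buffer_bytes, arrival, playback))
```

Doubling the buffer doubles both the wait before the sound starts and the length of audio that plays before the first stutter.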

How to Design a Site for Audio-on-Demand

Real-time audio includes both audio-on-demand and live audio. Live audio is oriented toward special events, and needs equipment and software that can handle a large number of simultaneous connections. This section focuses on ways to integrate audio-on-demand into a Web site.

First, make sure the site is pleasing and consistent without audio. No matter how easy it is to load and install the player software, some users will not play the audio. Others will print the page or save it to a disk file. The site must work without the audio. The recommendations in the other chapters of this book help make a site effective in this way.

Second, have a purpose for each audio clip. Let's face it, adding sound is fun. It is tempting to serve up a 30-minute speech. It is better to break that speech into topics the way a news broadcast is broken into segments (such as sports, business, and weather) and clips.

Next, make sure the audio quality is first-rate. Audio has enough problems getting from the source to the listener. Don't handicap yourself by trying to work with poor-quality sound.

Use the icons supplied by the server vendor to identify the audio clips and make it easy to download the player software.

Formats compared: 22 kHz, 16 bit; 8 kHz, 8 bit; RealAudio 2.0; RealAudio 1.0.

To achieve this degree of compression (44:1), much of the information in the original sound must be thrown away, just as color and other information is thrown away from a graphic image to make a JPEG or GIF. (For details on still-image compression, see Chapter 33, "How to Add High-End Graphics.")

RealAudio attempts to extract the portion of the audio signal where the most important information is stored. "Understanding Sound," earlier in this chapter, describes the principles on which this work is based. For now, it is enough to say that the higher the input quality, the more information RealAudio has to work with and the better the finished product will be.

For the best results, invest in a professional-quality microphone. Cheap microphones allow hiss and distortion to enter the signal that can never be completely removed. Progressive Networks lists the equipment in their studio on their Web site. It makes for useful reading.

Interactive Sound: Telephony over the Net

Internet telephony is an emerging technology. While it does not play a significant role on Web sites yet, the time will come when making an online, interactive, voice connection to a tech support staff member or a salesperson will be as commonplace as making a phone call is today.

To be ready for that day, Webmasters should be sure that their site has enough bandwidth and computing power to handle multiple, simultaneous, voice encodes and decodes. If it doesn't have such ability today, make plans to get it as the demand for telephony increases.

Sound Files

Even more sound file formats are in use on the Net than graphics formats. For many years, each computer vendor has had their own format, so the Net has a proliferation of files in many different formats. This section compares and contrasts the uses of the more popular of these formats.

Some audio file formats are self-describing. They include a header that says how they are formatted. Many self-describing audio files allow for variations in the format: The details of the encoding of a particular file are in the file's header.
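WAV, covered in this chapter, is one such self-describing format. As an illustration, the sketch below uses Python's standard wave module to write a small WAV file in memory and then read the encoding details back from its header. The helper names are hypothetical and the one-second silent clip is test data only.

```python
import io
import wave

def make_wav(seconds=1, sample_rate=8_000):
    """Build a mono, 8-bit WAV file in memory containing silence."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(1)  # 1 byte per sample = 8 bits
        w.setframerate(sample_rate)
        # 8-bit WAV samples are unsigned, so 128 is the silent midpoint.
        w.writeframes(bytes([128]) * (sample_rate * seconds))
    return buf.getvalue()

def describe_wav(data):
    """Read the format details that the file's own header declares."""
    with wave.open(io.BytesIO(data), "rb") as r:
        return {
            "channels": r.getnchannels(),
            "bits_per_sample": r.getsampwidth() * 8,
            "sample_rate": r.getframerate(),
            "seconds": r.getnframes() / r.getframerate(),
        }

print(describe_wav(make_wav()))
```

A player for a headerless format has no equivalent of describe_wav. It must be told the sampling rate and sample size in advance.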

Other files are without headers. They rely on the user to know what kind of encoding they contain. The user then selects the proper application to play the sound.

Most of the formats are designed so that the sound is downloaded, then decoded and played. Some simple sounds, such as beeps, take very little time to download, so users can hear them while they are displaying the site.


The AIFF was developed by Apple Computer for storing high-quality sampled sound. It can be read on most UNIX machines using native players, on PCs using wham, and on Macintoshes using soundapp.

The full spec is available by FTP. A version of the format that supports compression (called AIFC or AIFF-C) is also documented online. The large number of format variants makes it hard to find applications that can play every AIFF file.

wham 1.33 for Windows is available online. SoundApp 1.5.1 for the Mac is stored as soundapp1.51.cpt.hqx.gz.


UNIX machines can play AU files with showaudio or with a native player. On Windows, wham or wplany plays AU files. SoundApp for the Macintosh can also handle this format.

wham is available online, as is wplny. SoundApp for the Mac can be downloaded from
computing/systems/mac/Collections/umich/sound/soundutil/ soundapp1.51.cpt.hqx.gz

WAV format

The RIFF WAVE format, commonly called WAV, was developed by Microsoft and IBM. It is comparable to but not compatible with AIFF. WAV became popular when it was adopted as the native sound format for Windows 3.1. The latest version of WAV supports TrueSpeech, which is integrated into Windows 95.

The WAV spec is archived online.

Various native UNIX utilities are available for playing WAV files. PC users can use wplany or wham, mentioned previously. Macintosh users can use SoundApp, also mentioned previously.

The Ubiquitous SND

The file extension .snd is used to describe sound formats from a number of vendors. Apple uses it to describe a headerless, single-channel, 8-bit sound sampled at various rates.

Tandy uses .snd to denote a music file with a header and optional compression. Tandy's .snd sounds are typically sampled at 5,500, 11,000, or 22,000 samples per second.

Using Tandy's Sound.pdm software (part of the DeskMate environment), you can make instrument .snd files (which provide information about attack, sustain, and decay and up to 16 notes).

Using the two different kinds of .snd file and the Tandy program Music.pdm, you can produce music modules (.sng files). Conversion programs such as Conv2snd and Snd2wav by Kenneth Udut are available to convert between RIFF WAVE format and Tandy .snd.

MPEG Audio

One of the richest audio formats is MPEG audio. MPEG audio typically needs special players. For UNIX machines, check out maplay. Source is available; so are binaries for Indigo, Next, Solaris, and SunOS.

Windows users can download mpgaudio. Mac users should look at mpeg-audio or MPEG/CD from Kauai Media.

MPEG/CD is a commercial program. Information on the product and a demo version of the software are available online.

Note that the demo of MPEG/CD plays only 5 seconds of the sound track. Information on obtaining the full version is available on the site.


MIDI (Musical Instrument Digital Interface) is not an audio format per se. It is a music format. As Eric Lipscomb, vice president of the International Electronic Musicians User's Group, explains on his excellent Web site, "MIDI is a communications protocol that allows electronic musical instruments to interact with each other."

Computers can be used to drive musical instruments using MIDI. Since the MIDI data rate (31.25 Kbps) is different from typical modem rates, the computer needs a special adapter to be able to "speak MIDI" to the instruments. Unless your audience consists of musicians who are likely to have these adapters and instruments, you may prefer to serve MIDI through a renderer like MIDI Renderer from DiAcoustics. This renderer contains the software equivalent of over 128 instruments, and can play 65,000 notes simultaneously. If you provide the MIDI sequence as input to the renderer, the output is a WAV file that can be served on your site and played on most desktop computers.

There are reports that some sound cards (such as the Roland Soundcanvas card) are being introduced that handle MIDI directly. As of Netscape Navigator 2.0, MIDI is definitely outside the Web mainstream. Navigator 3.0, however, will include a plug-in called LiveAudio, which will handle WAV, AIFF, AU, and MIDI files. The syntax for embedding sound in that system is expected to be:

<EMBED SRC=url autostart=[true|false] loop=[true|false] ...>

When the time comes that MIDI is a viable choice for producing sound on your users' machines, MIDI is likely to be the format of choice. MIDI files are far smaller than the equivalent WAV files (because they carry instructions for playing notes rather than the sampled sound itself, and the client machine turns those instructions into sound).

"Plug-in" technology was introduced with Netscape Navigator 2.0. A programmer builds a special program that runs on the client and handles specific MIME types. LiveAudio is a plug-in that plays downloaded audio. Plug-ins represent a next-generation approach to helper applications.

How to Serve Sound Files

As mentioned throughout this chapter, it is hard to present sound with the Web page. Unless the sound is short, the user must download it and then start a player. The principal exceptions are the real-time audio formats such as RealAudio and TrueSpeech.

Sound Files

Real-time audio is not usually downloaded like other kinds of files. Instead, the link on the Web page points to a placeholder file, which in turn tells the desktop computer to launch the player application. Unlike helper applications for formats such as graphics, player applications for RealAudio and TrueSpeech actually talk directly to the server to bring down the sound file.

MIME Types

Table 34.4 shows a list of MIME types for various sound formats.

Table 34.4  This Is Information Used by Visitors to Configure Their Web Browsers

Sound Format   MIME Type     MIME Subtype   Extensions
AIFF           audio         x-aiff         .aiff, .aif, .aifc
AU (µ-law)     audio
MPEG Audio     audio         x-mpeg         .mp2
RealAudio      application   dsptype        .ram
TrueSpeech     application   dsptype        .tsp
WAV            audio         x-wav          .wav

Helper Applications

To help users keep their helper applications current, provide a link to a test page. If users attempt to download a sound file and their browser doesn't recognize it, they are only a click away from current information about what software to get for their browser and how to configure it.

RealAudio and TrueSpeech

TrueSpeech works by delivering the entire file to the client machine, though play can start as soon as the buffer is full and continue as long as the connection stays ahead of the ever-filling buffer. Figure 34.5 shows how the play point compares to the buffer.

Figure 34.5 : TrueSpeech shows the listener how much of the file has been downloaded and how much has already been played.

RealAudio is best served from a RealAudio server. Progressive Networks makes several versions of the software available.

Setting Up a RealAudio Server

Progressive Networks offers the server at several connection levels. For a busy site, you may want to license 100 or more simultaneous connections. A low-traffic site can be well-served by about 10 connections.

The RealAudio server is well-supported by Progressive Networks, both from their Web site and by their technical support staff. For the best results, set up the server to use a UDP rather than a TCP port. Once the server is up, go to a client machine that accesses the server through the Net.

Testing the Connection

A dial-up connection makes a good test since that is how most users still access the Net. Connect to your RealAudio server and play a sound clip. On most machines (80486-class or higher), the sound quality should be comparable to a strong AM radio station.

If the sound skips or stutters, switch to another server, such as the one at If the quality is poor on all servers and the desktop machine is fast enough, the problem is in the connection. Either the packets are being delivered slowly or they are being lost.

Check the Statistics window in the RealAudio client. It should show packet loss of 10 percent or less. If the packet loss is higher than that, the network is too busy. Try again later. If the network is consistently losing more than 10 percent of the packets, consider accessing it through a different service provider.
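The 10 percent rule of thumb is simple arithmetic. The following minimal sketch assumes you can read a count of packets sent and packets received from the client; the function names and the idea of passing raw counts are illustrative assumptions, not part of the RealAudio software.

```python
def packet_loss_percent(packets_sent: int, packets_received: int) -> float:
    """Return the percentage of packets that never arrived."""
    if packets_sent <= 0:
        raise ValueError("packets_sent must be positive")
    lost = packets_sent - packets_received
    return 100.0 * lost / packets_sent


def connection_is_usable(packets_sent: int, packets_received: int,
                         max_loss_percent: float = 10.0) -> bool:
    """Apply the 10 percent threshold described in the text."""
    return packet_loss_percent(packets_sent, packets_received) <= max_loss_percent
```

With 200 packets sent and 170 received, the loss is 15 percent, which is above the threshold and suggests trying again later or testing through another provider.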

If the packet loss is minimal but the audio quality is still poor, increase the speed of the serial port. It is possible to communicate between two computers at a speed faster than the speed of the modems by taking advantage of advanced protocols built into most modern modems.

Setting Up the Modems

Most modems rated at 14,400 bps or faster include error control, either CCITT V.42 (the Link Access Procedure for Modems, or LAP-M) or the Microcom Networking Protocol (MNP), which detects and corrects transmission errors on the link between the two modems. Data is sent from the service provider's modem to the modem at the desktop computer in packets. (These are not the same packets that TCP/IP uses.)

When a packet is sent, the modem computes a check value from the packet's contents (a cyclic redundancy check) and attaches the result to the packet. When the packet is received, the modem on the receiving end repeats the calculation and compares its result with the attached error-control value. If the two numbers don't match, the modem requests that the packet be sent again.
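The check-and-resend idea can be sketched in a few lines. This is an illustration only: it uses a generic 16-bit CRC from the Python standard library, not the actual V.42 or MNP frame formats, and the `channel` argument is a hypothetical stand-in for the telephone line.

```python
import binascii


def frame_with_check(payload: bytes) -> bytes:
    """Sender side: append a 16-bit CRC to the payload."""
    crc = binascii.crc_hqx(payload, 0)
    return payload + crc.to_bytes(2, "big")


def check_frame(frame: bytes) -> bool:
    """Receiver side: recompute the CRC and compare it with the attached one."""
    payload, received = frame[:-2], frame[-2:]
    return binascii.crc_hqx(payload, 0).to_bytes(2, "big") == received


def send_with_retries(payload: bytes, channel, max_tries: int = 5) -> bytes:
    """Resend until the check passes.

    `channel` is a hypothetical callable that takes the frame as sent and
    returns the frame as received (possibly corrupted by line noise).
    """
    for _ in range(max_tries):
        received = channel(frame_with_check(payload))
        if check_frame(received):
            return received[:-2]
    raise IOError("too many retries")
```

Every retry costs time on the line, which is why a noisy connection lowers the effective throughput.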

Most modern modems also include CCITT V.42bis or MNP class 5, which are data-compression algorithms. Both V.42bis and MNP class 5 need error control. That is, you can have error control without compression but you can't have compression without error control. V.42bis needs a LAP-M connection. MNP class 5 can only be made on an MNP class-2, -3, or -4 connection.

When manufacturers quote a speed for a modem, they are quoting the raw data rate on the telephone line between the two modems. V.42bis and MNP class 5 compress the data before sending it, so the effective throughput (from computer to computer) can be higher than the actual data rate on the telephone lines (modem to modem).

The effective throughput is a function of the number of retries the error-control protocol layer has to make. On a noisy line, the throughput may fall well below the modem's rated speed. Under ideal conditions, the connection may run several times higher than the rated modem speed.

MNP class 5 has a theoretical compression ratio of 2:1. V.42bis has a theoretical maximum of 4:1. Of course, if the data being transferred has already been compressed (for example, by GIF, JPEG, MPEG, or RealAudio), there is little opportunity for the modem to perform further compression. In fact, MNP class 5 is not recommended for use with compressed data. (V.42bis is smart enough to sense compressed data and does not attempt to compress it even further.)

For maximum throughput, set the speed of the computer's serial port to four times the speed of the modem if the modem supports V.42bis or twice the speed of the modem if the modem only supports MNP class 5.
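The rule above amounts to multiplying the modem speed by the best-case compression ratio. A minimal sketch follows; the dictionary keys and function name are illustrative assumptions.

```python
# Best-case compression ratios quoted in the text.
COMPRESSION_MULTIPLIER = {"V.42bis": 4, "MNP5": 2, None: 1}


def recommended_port_speed(modem_bps: int, compression) -> int:
    """Return the serial port speed (bps) suggested for a given modem.

    `compression` is "V.42bis", "MNP5", or None for no compression.
    """
    return modem_bps * COMPRESSION_MULTIPLIER[compression]
```

For example, a 14,400 bps modem with V.42bis calls for a port speed of 57,600 bps, and a 28,800 bps modem with V.42bis calls for 115,200 bps.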

Proper Use of Flow Control

When the speed of the serial port is faster than the speed of the modem, it is possible for the serial port to send faster than the modem can transmit. The modem may be recovering from a bout with line noise, for example, when a large file is sent from the computer. Most modems contain buffers to deal with this speed difference, but under some circumstances the buffer can become full.

The modem tells the computer to stop sending using a mechanism called local flow control. Modems and computers support two different kinds of flow control: hardware-based, also called request-to-send/clear-to-send (RTS/CTS) flow control, and software flow control, also called XON/XOFF. For high-speed modems, always use hardware flow control.

Hardware flow control needs extra connections between the modem and the computer. On a standard RS-232C 25-pin cable, these connections are made on pins 4 and 5. Some cables only hook up pins 2, 3, and 7. Other cables cross-connect the pins (pin 4 on one end is connected to pin 5 on the other end and vice versa). If you have selected hardware flow control but it doesn't seem to be working, "buzz" the cable to be sure pins 4 and 5 are connected straight through.

Macintosh computers have traditionally used software flow control, so many serial cables for Macintoshes do not hook up the RTS/CTS lines. If hardware flow control does not seem to be working with your Mac, check the cable's documentation, buzz the cable, or replace it with one known to be wired for RTS/CTS. See Figure 34.6 for the proper Macintosh cable connection.

Figure 34.6 : Macintosh hardware handshake cable.

Figure 34.6 shows the necessary connections to allow a Macintosh to exercise hardware flow control over the modem. Pin 1 on the Macintosh DIN 8 connector is called HSKo and is connected to pin 4 (RTS) and pin 20, data terminal ready (DTR), at the modem. Pin 2 on the Macintosh DIN 8 connector is called HSKi. It is connected to pin 5 (CTS) at the modem.

To check the cable, make sure the serial port speed is set to several times the modem speed. Then attach a break-out box to the modem end of the cable and send a large file from the Mac through the modem. Watch the LEDs on the break-out box for pins 4 and 5. If the LED next to pin 4 does not come on, the cable is probably not right.

When using hardware flow control with a Macintosh cable, you must tell the modem to ignore DTR. Otherwise, the modem will hang up when the Mac drops RTS. Some modems have this function available through a command; others need the user to change DIP switches. Check the modem documentation to find out how to disable DTR hangup.

Do not confuse local flow control, which regulates communications between the computer and the modem, with end-to-end flow control, which regulates the flow of data between the two modems. Modern modems handle end-to-end flow control as part of their built-in protocols. The installer should not try to adjust end-to-end flow control.

Chapter 37, "Evaluating the Server Environment," shows that Web sites are often limited by the size of their communications links, and seldom by the speed of their CPU. Those rules work best for servers serving Web pages. There are different rules of thumb for pages serving sound. If the sound skips or is choppy, there may not be enough CPU cycles to go around. This condition is more likely to occur on the client, where the computer may be a PC, possibly with slow serial ports. Recall that a 14.4 Kbps modem can get throughput as high as 57,600 bps, and a 28.8 Kbps modem can hit 115,200 bps.

If the RealAudio statistics screen reports an unexpectedly high number of errors, disable any terminate-and-stay-resident (TSR) programs such as screen savers. They may be stealing CPU cycles away from the communications software.

Some older PCs have a communications chip (called a UART) that is too slow to support the faster communications rates. If your PC has an 8250 UART, do not set the serial port to a speed higher than 19,200 bps. The 16450 UART can support 38,400 bps and the 16550 UART can support 57,600 bps or, under some circumstances, 115,200 bps. Check the serial card's documentation before using the higher rate.
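The UART limits can be treated as a ceiling on whatever port speed you would otherwise choose. The sketch below encodes the limits quoted in the text; the function name and the `allow_16550_fast` flag are illustrative assumptions, and the flag should be set only if the serial card's documentation supports the higher rate.

```python
# Conservative per-UART limits in bits per second, as quoted in the text.
UART_MAX_BPS = {"8250": 19200, "16450": 38400, "16550": 57600}


def capped_port_speed(desired_bps: int, uart: str,
                      allow_16550_fast: bool = False) -> int:
    """Return the desired port speed, lowered to what the UART can handle."""
    limit = UART_MAX_BPS[uart]
    if uart == "16550" and allow_16550_fast:
        limit = 115200  # only "under some circumstances", per the text
    return min(desired_bps, limit)
```

A PC with an 8250 UART therefore stays at 19,200 bps even when the modem's compression would justify 57,600 bps.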

Getting the Best from the Sound File

After setting up the server, the Webmaster will want to encode the audio for the site. Remember to use the best-quality audio available. The original analog signal is the best starting point. High-bandwidth sources, such as CDs and DAT, also give good results. For recording, use professional-grade equipment. The quality of the microphone is particularly important.

During recording and later during digitization, be sure the input levels are set so that the signal comes up to but does not exceed the maximum level of the recording equipment. Most audio equipment shows a red light or shows a needle going into a red area of the display when the signal levels are too high. Setting the input level correctly makes sure the signal fills the full amplitude range of the recording and digitizing equipment.

Before encoding the sound into RealAudio format, preprocess it to make the quality even higher. RealAudio recommends four different kinds of preprocessing: noise gating, compression, equalization, and normalization.

Recall from the discussion about sound in the previous section that 8-bit quantization leads to quantization noise, a perceptible hiss when the speaker pauses. One fix to this problem is called noise gating, also called downward expansion, and is illustrated in Figure 34.7.

Figure 34.7 : Noise gating cuts out sound below a given threshold.

If your hardware or software offers noise gating, set it to around 5 to 10 decibels (dB). If the equipment doesn't support numeric settings, set the threshold control so that gating occurs when there is no audio. Then back off until the beginnings of words are not clipped.
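For readers processing audio in software, a noise gate can be sketched as follows. This is a deliberately simple version under stated assumptions: the audio is a list of integer samples, the threshold is given in raw sample units rather than decibels, and the gate works on fixed-size blocks instead of the smooth attack and release that real equipment provides.

```python
def noise_gate(samples, threshold, block_size=256):
    """Silence every block whose peak stays below `threshold`.

    Working on whole blocks, rather than single samples, keeps the quiet
    moments inside a word from being chopped out.
    """
    gated = []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        peak = max(abs(s) for s in block)
        if peak < threshold:
            gated.extend([0] * len(block))  # pause: remove the hiss
        else:
            gated.extend(block)             # speech: pass through untouched
    return gated
```

If the beginnings of words are clipped, lower the threshold or shorten the block size, in the same spirit as backing off the threshold control on hardware.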

"A Psychoacoustic View of Sound," earlier in this chapter, describes concurrent masking. RealAudio's encoding process can introduce a low-level rumbling noise into the signal. Make sure this signal isn't heard by feeding the encoder as loud a signal as possible. Use concurrent masking to make the distortion inaudible.

During the recording process, the levels are set so that the highest peaks do not exceed the maximum level of the equipment. For many recording sessions, such peaks are rare, and the average sound level is far below the top of the amplitude range. Use audio compression (not related to file compression) to turn down the peaks so that the overall level of the signal can be increased. Figure 34.8 illustrates audio compression.

Figure 34.8 : Audio compression "turns down" the peaks so that more of the amplitude range is available for signal.

For RealAudio 1.0, Progressive Networks recommends using 4:1 to 10:1 compression. RealAudio 2.0 has a much greater dynamic range so artifacts are much less noticeable. Compression of 2:1 to 4:1 is more than enough for speech; higher levels of compression may be desirable for some pieces of music.
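The effect of a compression ratio can be sketched in code. The version below is a static compressor only, with no attack or release times, and it assumes samples scaled to the range -1.0 to 1.0; the threshold and makeup gain values are illustrative assumptions, not Progressive Networks recommendations.

```python
def compress_sample(sample: float, threshold: float, ratio: float) -> float:
    """Reduce the level above `threshold` by `ratio`, measured in decibels."""
    magnitude = abs(sample)
    if magnitude <= threshold:
        return sample
    # Dividing the overshoot in dB by the ratio is the same as raising the
    # linear overshoot to the power 1/ratio.
    compressed = threshold * (magnitude / threshold) ** (1.0 / ratio)
    return compressed if sample > 0 else -compressed


def compress(samples, threshold=0.25, ratio=2.0, makeup_gain=1.0):
    """Turn down the peaks, then raise the overall level by `makeup_gain`."""
    return [compress_sample(s, threshold, ratio) * makeup_gain for s in samples]
```

With a threshold of 0.25 and a ratio of 2:1, a full-scale peak of 1.0 comes out at 0.5, which leaves room to double the overall level without exceeding full scale.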

Recall from Chapter 33, "How to Add High-End Graphics," that the human eye is more sensitive to some colors than others and that high-end graphics systems compensate for this fact by boosting certain colors. Sound is no different. The ear is particularly sensitive to sounds between 2000 and 4000 Hz. Equalization (EQ) boosts the midrange frequencies that carry the desired information and cuts higher frequencies.

If your equipment allows you to choose how you equalize, boost the signal around 2.5 kHz. If the equipment does not allow equalization in that way, sometimes it's possible to get a similar effect by cutting the bass and treble, and increasing the overall level. Equalization is illustrated in Figure 34.9.

Figure 34.9 : Equalization boosts the signal where the ear is most sensitive.

Keep boosting the mids until the voice sounds too harsh. Then encode a portion of it and listen to it through a RealAudio player. What sounds too harsh before encoding sometimes sounds about right after encoding.

After you boost the mids, a woman's voice sometimes sounds as if it has a second, lower voice shadowing the first. Try cutting the bass frequencies to eliminate this shadow. Back off the bass slowly or the voice will sound thin or brittle.

RealAudio 2 has a much greater dynamic range than RealAudio 1. Boost the mids (around 2.5 kHz) a bit but don't overdo it or the voice will sound thin.

The final step in preprocessing is normalization, illustrated in Figure 34.10. During normalization, the computer brings the volume up to the highest level possible without introducing distortion. It is important that normalization be done after the other preprocessing steps since each of the other steps changes the signal level.

Figure 34.10 : Normalization should be the last processing the audio gets before it is sent to the encoder.

RealAudio recommends normalizing to 95 percent of maximum capacity, because the RealAudio encoder is designed to work with signals that are at least 5 percent down from the maximum. If your system does not allow you to specify normalization in percentages, just allow it to normalize, then turn down the volume a bit before sending the signal to the encoder.
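Peak normalization to 95 percent is a single scaling step. The sketch below assumes floating-point samples where `full_scale` (1.0 by default) is the largest value the format can hold; the function name and parameters are illustrative.

```python
def normalize(samples, target_fraction=0.95, full_scale=1.0):
    """Scale the signal so its loudest peak sits at target_fraction of full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = target_fraction * full_scale / peak
    return [s * gain for s in samples]
```

Because every sample is multiplied by the same gain, the shape of the signal is unchanged; only its level moves, which is why this step belongs after gating, compression, and equalization.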

Getting the Sound to the Client

The RealAudio encoder outputs RealAudio files (with a file extension of .ra). Once the files are set up on the server, the Webmaster needs to connect those RealAudio files to Web documents.

Recall that it is the RealAudio client and not the Web browser that is responsible for talking to the RealAudio server. Progressive Networks recommends the use of a metafile with a .ram extension to bridge the gap between the two clients and the two servers. Figure 34.11 shows how this works.

Figure 34.11 : RealAudio served with metafiles.

The metafiles contain a special URL with a service identifier of pnm:. To play the RealAudio welcome message from their server, a .ram file would contain:


To connect the metafile to HTML, the Web author can say,

<A HREF="/path/to/metafile.ram"><IMG SRC="graphics/rafile.gif">Welcome</A>

This bit of HTML puts up the RealAudio file icon.
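A site with many clips might generate its metafiles and links with a small script. The following is a sketch under assumptions: the host name `audio.example.com` and the file name `welcome.ra` are hypothetical placeholders, and the metafile is assumed to hold one pnm: URL per line.

```python
from html import escape


def make_metafile(host: str, path: str) -> str:
    """Return the text of a .ram metafile pointing at one RealAudio file."""
    return "pnm://" + host + "/" + path.lstrip("/") + "\n"


def make_link(metafile_url: str, label: str, icon_url: str) -> str:
    """Return the HTML anchor that leads the browser to the metafile."""
    return '<A HREF="%s"><IMG SRC="%s">%s</A>' % (
        escape(metafile_url, quote=True),
        escape(icon_url, quote=True),
        escape(label),
    )
```

Calling `make_metafile("audio.example.com", "/welcome.ra")` yields the single line `pnm://audio.example.com/welcome.ra`, which would be saved with a .ram extension on the Web server.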

Playing RealAudio Without a Server

Progressive Networks' commercial servers offer 10 or more simultaneous connections. They also offer a "personal server" with a 2-connection capacity, available for download from their site. For some applications, however, no server is needed.

Sometimes Webmasters want to reference a sound file on another machine and can point a hyperlink to it just as they would a graphic or a Web page. Other times the Webmaster wants just to download the entire sound file to the user's machine, without allowing the user to pick and choose which parts of the sound they want to play. This style is most appropriate when the sound file is a clip of perhaps 4 minutes or less.

To link directly to the sound file, put something like this in the HTML:

<A HREF="/path/to/audio.ra">Sound file</A>

When users follow this link, the entire sound file downloads to their machine. Be sure to set up the server so that file extension .ra is served as audio/x-pn-realaudio. On NCSA servers, the lines in mime.types are


The same changes can be made in srm.conf by using the AddType directive.
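As a rough illustration of the two styles of entry, the sketch below builds configuration lines from a mapping of MIME types to extensions. It assumes the conventional layouts (a mime.types line is the type followed by bare extensions; an AddType line is the directive, the type, and a dotted extension) and uses only the .ra mapping given above. Check your own server's documentation before copying the output.

```python
# MIME type -> extensions (without the leading dot); only .ra comes from the text.
REALAUDIO_TYPES = {"audio/x-pn-realaudio": ["ra"]}


def mime_types_lines(types):
    """Lines in mime.types style: the type followed by its extensions."""
    return [f"{mime} {' '.join(exts)}" for mime, exts in types.items()]


def addtype_lines(types):
    """Lines in srm.conf style: one AddType directive per extension."""
    return [f"AddType {mime} .{ext}"
            for mime, exts in types.items()
            for ext in exts]
```

For the mapping shown, `addtype_lines(REALAUDIO_TYPES)` returns `["AddType audio/x-pn-realaudio .ra"]`.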

Indexing a Database to Sound

One advanced technique possible with RealAudio is to index a word-for-word database to the sound. This technique is illustrated at, where the President's weekly radio addresses are indexed to the text of the speech.

Follow the Library link to the audio files and search for a keyword or phrase like "Bosnia." The system finds several speeches in which that word appears. Now open one of the speeches and follow one of the links. (The URL will look something like An example of the page is shown in Figure 34.12.)

Figure 34.12 : Each of President Clinton's weekly radio addresses is encoded in RealAudio and available online.

Note that the URL contains a time tag. There is an index on the site that ties each word to the time in the .ra file associated with the beginning of that sentence so that RealAudio can start playing at the beginning of the sentence in which the word appears.
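An index of this kind can be sketched as a mapping from each word to the start times of the sentences that contain it. The sketch assumes the transcript is already available as (start time in seconds, sentence text) pairs; how the site in question builds its index, and how the time tag is encoded in the URL, is not shown here.

```python
import re
from collections import defaultdict


def build_index(sentences):
    """Map each word to the start times of the sentences containing it.

    `sentences` is a list of (start_seconds, text) pairs.
    """
    index = defaultdict(list)
    for start, text in sentences:
        # Use a set so a word repeated in one sentence is listed only once.
        for word in set(re.findall(r"[a-z']+", text.lower())):
            index[word].append(start)
    return {word: sorted(times) for word, times in index.items()}


def lookup(index, word):
    """Return the sentence start times at which playback could begin."""
    return index.get(word.lower(), [])
```

Given an invented transcript such as `[(0.0, "Good morning."), (3.5, "Today I want to talk about Bosnia.")]`, looking up "Bosnia" returns `[3.5]`, the point at which the player would be told to start.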

Digital Telephony and the Web

Internet-based telephony from such companies as NetSpeak (WebPhone), Quarterdeck (WebTalk), VocalTec (Internet Phone), and Third Planet (DigiPhone) is gradually beginning to offer enough quality to be credible.

In one survey, Internet Phone users reported that 20 percent of their calls were as good as a regular phone call and 62 percent had acceptable quality. The makers of DigiPhone allow the user to trade bandwidth and speed for quality.

At the best quality levels, Internet-based telephony often exceeds the quality of conventional telephony. If the user opts for lower quality to get more speed, the quality degrades to about the level of a call on a cellular phone.

VocalTec has also announced InternetWave, a competing technology to RealAudio and TrueSpeech.

Setting Up a Computer-based Phone Link on the Web

The various Internet telephony vendors are in fierce competition. So far there are no signs of interoperability. Eventually, that will change. When the quality of a call through the Internet rivals conventional technology and there is a consensus on standards (or one vendor emerges as the clear winner), it may make sense to offer person-to-person technical support over the Net.

If these connections have enough bandwidth (such as an ISDN line), a technical support specialist could actually walk users through a solution, showing them on the screen what to do or watching their progress through one of the screen-mirroring utilities such as Timbuktu.

The Future of Digital Telephony

At present, Internet telephony is a novelty. Its next niche is groups of related individuals (such as families) who are willing to sacrifice quality for low cost. The big boom in this technology will come when the business market integrates Internet telephony into their conferencing, technical support, and other remote business processes.

At present, the cutting edge of this technology is full-duplex operation: the ability to hear and speak at the same time, as with conventional telephones. All the leading companies are beginning to offer full-duplex versions of their products.

All Internet-based, full-duplex telephony products have to deal with feedback. If the sound is played from speakers and picked up by microphones, opportunities for feedback abound. Serious users of this technology should invest in a good-quality headset to isolate the two sides of the telephone circuit.

Some service providers do not permit digital telephony over their lines since it can consume large amounts of bandwidth. Check with your service provider before investing in the technology. Also use the checks described earlier for setting up a RealAudio server to make sure the service provider has enough bandwidth to adequately support digital telephony.

Sound can enhance a site in many ways. This chapter shows how sound is turned into computer files, and provides many tips and techniques for compressing sound files so they can be downloaded quickly.

This chapter also describes specific uses of sound such as real-time audio and digital telephony, and explains how to serve various sound files such as WAV, AIFF, and SND. The next chapter continues the discussion on multimedia, describing methods for serving video.