© 2000 by CRC Press LLC

of any "dry" (e.g., mono, low reverberation) source with the stored Hl(jω, φ, δ)s and corresponding Hr(jω, φ, δ)s. On the right side in the figure, the resulting binaural signals are reproduced via equalized headphones. The equalization ensures that a sound source with a flat spectrum (e.g., white noise) does not suffer any perceivable coloration for any direction (φ, δ). Implemented in a real-time "binaural mixing console," the above scheme can be used to create "virtual" sound sources. When combined with an appropriate scheme for interpolating head-related transfer functions, moving sound sources can be mimicked realistically. Furthermore, it is possible to superimpose early reflections of a hypothetical recording room, each filtered by the appropriate head-related transfer function. Such inclusion of a room in the simulation makes the spatial reproduction more robust against individual differences between "recording" and "listening" ears, in particular, if the listener's head movements are fed back to the binaural mixing console. (Head movements are useful for disambiguating spatial cues.) Finally, such a system can be used to create "virtual acoustic displays," for example, for pilots and astronauts [Wenzel, 1992]. Other research issues are, for example, the required accuracy of the head-related transfer functions, intersubject variability, and psychoacoustic aspects of room simulations.

Audio Coding

Audio coding is concerned with compressing audio signals, that is, reducing their bit rate. The uncompressed digital audio of compact disks (CDs) is recorded at a rate of 705.6 kbit/s for each of the two channels of a stereo signal (i.e., 16 bit/sample, 44.1-kHz sampling rate; 1411.2 kbit/s total). This is too high a bit rate for digital audio broadcasting (DAB) or for transmission via end-to-end digital telephone connections (integrated services digital network, ISDN).
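As a quick check, the CD rates quoted above follow directly from the sample format (a minimal Python sketch):

```python
# Sanity check of the CD audio rates quoted above.
bits_per_sample = 16
sampling_rate_hz = 44_100
channels = 2

per_channel_bps = bits_per_sample * sampling_rate_hz   # bit/s per channel
total_bps = per_channel_bps * channels                 # bit/s for stereo

print(per_channel_bps / 1000)   # 705.6 (kbit/s)
print(total_bps / 1000)         # 1411.2 (kbit/s)
```

Against the 128 kbit/s stereo rates discussed next, this amounts to a compression factor of about eleven (1411.2/128 ≈ 11.0).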
Current audio coding algorithms provide at least "better than FM" quality at a combined rate of 128 kbit/s for the two stereo channels (2 ISDN B channels!), "transparent coding" at rates of 96 to 128 kbit/s per mono channel, and "studio quality" at rates between 128 and 196 kbit/s per mono channel. (While a large number of people will be able to detect distortions in the first class of coders, even so-called "golden ears" should not be able to detect any differences between original and coded versions of known "critical" test signals; the highest quality category adds a safety margin for editing, filtering, and/or recoding.)

To compress audio signals by a factor as large as eleven while maintaining a quality exceeding that of a local FM radio station requires sophisticated algorithms for reducing the irrelevance and redundancy in a given signal. A large portion (but usually less than 50%) of the bit-rate reduction in an audio coder is due to the first of the two mechanisms. Eliminating irrelevant portions of an input signal is done with the help of psychoacoustic models. It is obvious that a coder can eliminate portions of the input signal that—when played back—will be below the threshold of hearing. More complicated is the case when we have multiple signal components that tend to cover each other, that is, when weaker components cannot be heard due to the presence of stronger components. This effect is called masking.

To let a coder take advantage of masking effects, we need to use good masking models. Masking can be modeled in the time domain, where we distinguish so-called simultaneous masking (masker and maskee occur at the same time), forward masking (masker occurs before maskee), and backward masking (masker occurs after maskee). Simultaneous masking usually is modeled in the frequency domain. This latter case is illustrated in Fig. 19.5.
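A minimal sketch of how a coder might exploit a masked threshold: for each frequency band, the largest uniform quantizer step is chosen whose noise power (approximately step²/12 for a uniform quantizer) stays below that band's threshold. The band splits, threshold values, and function names here are hypothetical illustrations, not from the text:

```python
import math

def step_size_for_band(masked_threshold_power):
    """Largest uniform-quantizer step whose noise power (~ step^2 / 12)
    stays at or below the masked threshold for this band."""
    return math.sqrt(12.0 * masked_threshold_power)

def quantize_band(samples, step):
    """Uniform midtread quantization of one subband's samples."""
    return [round(s / step) for s in samples]

def dequantize_band(indices, step):
    return [i * step for i in indices]

# Hypothetical masked thresholds (noise power) per subband, as a
# psychoacoustic model might supply; higher thresholds permit
# coarser quantization and hence fewer bits.
thresholds = [1e-6, 4e-4, 2.5e-3]
subbands = [[0.01, -0.02, 0.015], [0.3, -0.1, 0.2], [0.5, 0.4, -0.6]]

for band, thr in zip(subbands, thresholds):
    step = step_size_for_band(thr)
    idx = quantize_band(band, step)
    rec = dequantize_band(idx, step)
    err = max(abs(a - b) for a, b in zip(band, rec))
    assert err <= step / 2   # round-to-nearest error is at most half a step
```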
Audio coders that employ common frequency-domain models of masking start out by splitting and subsampling the input signal into different frequency bands (using filterbanks such as subband filterbanks or time-frequency transforms). Then, the masking threshold (i.e., predicted masked threshold) is determined, followed by quantization of the spectral information and (optional) noiseless compression using variable-length coding. The encoding process is completed by multiplexing the spectral information with side information, adding error protection, etc.

The first stage, the filter bank, has the following requirements. First, decomposing and then simply reconstructing the signal should not lead to distortions ("perfect reconstruction filterbank"). This results in the advantage that all distortions are due to the quantization of the spectral data. Since each quantizer works on band-limited data, the distortion (also band-limited due to refiltering) is controllable by using the masking models described above. Second, the bandwidths of the filters should be narrow to provide sufficient coding gain. On the other hand, the length of the impulse responses of the filters should be short enough (time resolution of the coder!) to avoid so-called pre-echoes, that is, backward spreading of distortion components
that result from sudden onsets (e.g., castanets). These two contradictory requirements, obviously, have to be worked out by a compromise. "Critical band" filters have the shortest impulse responses needed for coding of transient signals. On the other hand, the optimum frequency resolution (i.e., the one resulting in the highest coding gain) for a typical signal can be achieved by using, for example, a 2048-point modified discrete cosine transform (MDCT).

In the second stage, the (time-varying) masking threshold as determined by the psychoacoustic model usually controls an iterative analysis-by-synthesis quantization and coding loop. It can incorporate rules for masking of tones by noise and of noise by tones, though little is known in the psychoacoustic literature for more general signals. Quantizer step sizes can be set and bits can be allocated according to the known spectral estimate, by block companding with transmission of the scale factors as side information or iteratively in a variable-length coding loop (Huffman coding). In the latter case, one can low-pass filter the signal if the total required bit rate is too high.

The decoder has to invert the processing steps of the encoder, that is, do the error correction, perform Huffman decoding, and reconstruct the filter signals or the inverse-transformed time-domain signal. Since the decoder is significantly less complex than the encoder, it is usually implemented on a single DSP chip, while the encoder uses several DSP chips.

Current research topics encompass tonality measures and time-frequency representations of signals. More information can be found in Johnston and Brandenburg [1991].

Echo Cancellation

Echo cancellers were first deployed in the U.S. telephone network in 1979.
Today, they are virtually ubiquitous in long-distance telephone circuits where they cancel so-called line echoes (i.e., electrical echoes) resulting from nonperfect hybrids (the devices that couple local two-wire to long-distance four-wire circuits). In satellite circuits, echoes bouncing back from the far end of a telephone connection with a round-trip delay of about 600 ms are very annoying and disruptive. Acoustic echo cancellation—where the echo path is characterized by the transfer function H(z) between a loudspeaker and a microphone in a room (e.g., in a speakerphone)—is crucial for teleconferencing where two or more parties are connected via full-duplex links. Here, echo cancellation can also alleviate acoustic feedback ("howling").

The principle of acoustic echo cancellation is depicted in Fig. 19.6(a). The echo path H(z) is cancelled by modeling H(z) with an adaptive filter and subtracting the filter's output ŷ(t) from the microphone signal y(t). The adaptability of the filter is necessary since H(z) changes appreciably with movement of people or objects in the room and because periodic measurements of the room would be impractical.

FIGURE 19.5 Masked threshold in the frequency domain for a hypothetical input signal. In the vicinity of high-level spectral components, signal components below the current masked threshold cannot be heard.

Acoustic echo cancellation is more challenging than cancelling line echoes for several reasons. First, room impulse responses h(t) are longer than 200 ms compared to less than 20 ms for line echo cancellers. Second, the echo path of a room h(t) is likely to change constantly (note that even small changes in temperature can cause significant changes of h). Third,
teleconferencing eventually will demand larger audio bandwidths (e.g., 7 kHz) compared to standard telephone connections (about 3.2 kHz). Finally, we note that echo cancellation in a stereo setup (two microphones and two loudspeakers at each end) is an even harder problem on which very little work has been done so far.

It is obvious that the initially unknown echo path H(z) has to be "learned" by the canceller. It is also clear that for adaptation to work there needs to be a nonzero input signal x(t) that excites all the eigenmodes of the system (resonances, or "peaks" of the system magnitude response |H(jω)|). Another important problem is how to handle double-talk (speakers at both ends are talking simultaneously). In such a case, the canceller could easily get confused by the speech from the near end that acts as an uncorrelated noise in the adaptation. Finally, the convergence rate, that is, how fast the canceller adapts to a change in the echo path, is an important measure to compare different algorithms.

Adaptive filter theory suggests several algorithms for use in echo cancellation. The most popular one is the so-called least-mean-square (LMS) algorithm that models the echo path by an FIR filter with an impulse response ĥ(t). Using vector notation h for the true echo path impulse response, ĥ for its estimate, and x for the excitation time signal, an estimate of the echo is obtained by ŷ(t) = ĥ′x, where the prime denotes vector transpose. A reasonable objective for a canceller is to minimize the instantaneous squared error e²(t), where e(t) = y(t) − ŷ(t). The time derivative of ĥ can be set to

    dĥ/dt = −μ ∇_ĥ e²(t) = −2μ e(t) ∇_ĥ e(t) = 2μ e(t) x    (19.4)

resulting in the simple update equation ĥ_{k+1} = ĥ_k + α e_k x_k, where α (or μ) controls the rate of change. In practice, whenever the far-end signal x(t) is low in power, it is a good idea to freeze the canceller by setting α = 0. Sophisticated logic is needed to detect double talk. When it occurs, then also set α = 0.
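The discrete-time update ĥ_{k+1} = ĥ_k + α e_k x_k can be sketched as follows; the two-tap "echo path," the step size, and the white excitation are hypothetical toy choices (a real canceller needs thousands of taps to cover a room response longer than 200 ms):

```python
import random

def lms_step(h_hat, x_vec, y, alpha):
    """One LMS update: e = y - h_hat' x ; h_hat <- h_hat + alpha * e * x."""
    y_hat = sum(hk * xk for hk, xk in zip(h_hat, x_vec))
    e = y - y_hat                        # instantaneous error e(t)
    return [hk + alpha * e * xk for hk, xk in zip(h_hat, x_vec)], e

random.seed(1)
h = [0.5, -0.3]                          # hypothetical two-tap echo path
h_hat = [0.0, 0.0]                       # canceller starts with no knowledge
x_vec = [0.0, 0.0]                       # most recent far-end samples

for _ in range(2000):
    x_vec = [random.uniform(-1, 1)] + x_vec[:-1]   # shift in a new sample
    y = sum(hk * xk for hk, xk in zip(h, x_vec))   # echo at the microphone
    h_hat, e = lms_step(h_hat, x_vec, y, alpha=0.1)

# With persistent white excitation, the estimate converges to the echo path.
assert all(abs(a - b) < 1e-3 for a, b in zip(h, h_hat))
```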
It can be shown that the spread of the eigenvalues of the autocorrelation matrix of x(t) determines the convergence rate, where the slowest-converging eigenmode corresponds to the smallest eigenvalue. Since the eigenvalues themselves scale with the power of the predominant spectral components in x(t), setting α = 2μ/(x′x) will make the convergence rate independent of the far-end power. This is the normalized LMS method. Even then, however, all eigenmodes will converge at the same rate only if x(t) is white noise. Therefore, pre-whitening the far-end signal will help in speeding up convergence.

The LMS method is an iterative approach to echo cancellation. An example of a noniterative, block-oriented approach is the least-squares (LS) algorithm. Solving a system of equations to get ĥ, however, is computationally more costly. This cost can be reduced considerably by running the LS method on a sample-by-sample basis and by taking advantage of the fact that the new signal vectors are the old vectors with the oldest sample dropped and one new sample added. This is the recursive least-squares (RLS) algorithm. It also has the advantage

FIGURE 19.6 (a) Principle of using an echo canceller in teleconferencing. (b) Realization of the echo canceller in subbands. (After M. M. Sondhi and W. Kellermann, "Adaptive echo cancellation for speech signals," in Advances in Speech Signal Processing, S. Furui and M. M. Sondhi, Eds., New York: Marcel Dekker, 1991. By courtesy of Marcel Dekker, Inc.)
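The normalization α = 2μ/(x′x) can likewise be sketched; again the echo path, the step size μ, and the signal levels are hypothetical. With the step divided by the input power x′x, the canceller converges equally well for a quiet and a loud far end:

```python
import random

def nlms_step(h_hat, x_vec, y, mu=0.5, eps=1e-8):
    """Normalized LMS: the step is scaled by the input power x'x,
    making the convergence rate independent of the far-end level."""
    y_hat = sum(hk * xk for hk, xk in zip(h_hat, x_vec))
    e = y - y_hat
    power = sum(xk * xk for xk in x_vec) + eps     # eps guards against x = 0
    return [hk + (mu / power) * e * xk for hk, xk in zip(h_hat, x_vec)], e

def residual_misalignment(scale, steps=500):
    """Identify a hypothetical two-tap echo path from input of a given level."""
    random.seed(0)
    h = [0.5, -0.3]
    h_hat = [0.0, 0.0]
    x_vec = [0.0, 0.0]
    for _ in range(steps):
        x_vec = [scale * random.uniform(-1, 1)] + x_vec[:-1]
        y = sum(hk * xk for hk, xk in zip(h, x_vec))
        h_hat, _ = nlms_step(h_hat, x_vec, y)
    return max(abs(a - b) for a, b in zip(h, h_hat))

# Quiet or loud far end: the normalized update converges either way.
assert residual_misalignment(scale=0.01) < 1e-3
assert residual_misalignment(scale=100.0) < 1e-3
```

An unnormalized LMS canceller with a fixed α would adapt far more slowly on the quiet signal and could even diverge on the loud one; the normalization removes that dependence.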