McClellan, S., Gibson, J.D., Ephraim, Y., Fussell, J.W., Wilcox, L.D., Bush, M.A., Gao, Y., Ramabhadran, B., and Picheny, M. “Speech Signal Processing.” The Electrical Engineering Handbook. Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
© 2000 by CRC Press LLC

15 Speech Signal Processing

15.1 Coding, Transmission, and Storage
General Approaches • Model Adaptation • Analysis-by-Synthesis • Particular Implementations • Speech Quality and Intelligibility • Standardization • Variable Rate Coding • Summary and Conclusions

15.2 Speech Enhancement and Noise Reduction
Models and Performance Measures • Signal Estimation • Source Coding • Signal Classification • Comments

15.3 Analysis and Synthesis
Analysis of Excitation • Fourier Analysis • Linear Predictive Analysis • Homomorphic (Cepstral) Analysis • Speech Synthesis

15.4 Speech Recognition
Speech Recognition System Architecture • Signal Pre-Processing • Dynamic Time Warping • Hidden Markov Models • State-of-the-Art Recognition Systems

15.5 Large Vocabulary Continuous Speech Recognition
Overview of a Speech Recognition System • Hidden Markov Models As Acoustic Models for Speech Recognition • Speaker Adaptation • Modeling Context in Continuous Speech • Language Modeling • Hypothesis Search • State-of-the-Art Systems • Challenges in Speech Recognition • Applications

15.1 Coding, Transmission, and Storage

Stan McClellan and Jerry D. Gibson

Interest in speech coding is motivated by a wide range of applications, including commercial telephony, digital cellular mobile radio, military communications, voice mail, speech storage, and future personal communications networks. The goal of speech coding is to represent speech in digital form with as few bits as possible while maintaining the intelligibility and quality required for the particular application. At higher bit rates, such as 64 and 32 kbits/s, achieving good quality and intelligibility is not too difficult, but as the desired bit rate is lowered to 16 kbits/s and below, the problem becomes increasingly challenging. Depending on the application, many difficult constraints must be considered, including the issue of complexity.
For example, for the 32-kbits/s speech coding standard, the ITU-T¹ not only required highly intelligible, high-quality speech, but the coder also had to have low delay, withstand independent bit error rates up to 10⁻², have acceptable performance degradation for several synchronous or asynchronous tandem connections, and pass some voiceband modem signals. Other applications may have different criteria. Digital cellular mobile radio in the U.S. has no low delay or voiceband modem signal requirements, but the speech data rates required are under 8 kbits/s and the transmission medium (or channel) can be very noisy and have relatively long fades. These considerations affect the speech coder chosen for a particular application. As speech coder data rates drop to 16 kbits/s and below, perceptual criteria taking into account human auditory response begin to play a prominent role. For time domain coders, the perceptual effects are incorporated using a frequency-weighted error criterion. The frequency-domain coders include perceptual effects by allocating

¹International Telecommunications Union, Telecommunications Standardization Sector, formerly the CCITT.

Stan McClellan, University of Alabama at Birmingham
Jerry D. Gibson, Texas A&M University
Yariv Ephraim, AT&T Bell Laboratories / George Mason University
Jesse W. Fussell, Department of Defense
Lynn D. Wilcox, FX Palo Alto Lab
Marcia A. Bush, Xerox Palo Alto Research Center
Yuqing Gao, IBM T.J. Watson Research Center
Bhuvana Ramabhadran, IBM T.J. Watson Research Center
Michael Picheny, IBM T.J. Watson Research Center
The focus of this article is the contrast among the three most important classes of speech coders that have representative implementations in several international standards: time-domain coders, frequency-domain coders, and hybrid coders. In the following, we define these classifications, look specifically at the important characteristics of representative, general implementations of each class, and briefly discuss the rapidly changing national and international standardization efforts related to speech coding.

General Approaches

Time Domain Coders and Linear Prediction

Linear Predictive Coding (LPC) is a modeling technique that has seen widespread application among time-domain speech coders, largely because it is computationally simple and applicable to the mechanisms involved in speech production. In LPC, general spectral characteristics are described by a parametric model based on estimates of autocorrelations or autocovariances. The model of choice for speech is the all-pole or autoregressive (AR) model. This model is particularly suited for voiced speech because the vocal tract can be well modeled by an all-pole transfer function. In this case, the estimated LPC model parameters correspond to an AR process which can produce waveforms very similar to the original speech segment. Differential Pulse Code Modulation (DPCM) coders (e.g., ITU-T G.721 ADPCM [CCITT, 1984]) and LPC vocoders (e.g., U.S. Federal Standard 1015 [National Communications System, 1984]) are examples of this class of time-domain predictive architecture. Code Excited Coders (e.g., ITU-T G.728 [Chen, 1990] and U.S. Federal Standard 1016 [National Communications System, 1991]) also utilize LPC spectral modeling techniques.¹

Based on the general spectral model, a predictive coder formulates an estimate of a future sample of speech based on a weighted combination of the immediately preceding samples.
The error in this estimate (the prediction residual) typically comprises a significant portion of the data stream of the encoded speech. The residual contains information that is important in speech perception and cannot be modeled in a straightforward fashion. The most familiar form of predictive coder is the classical Differential Pulse Code Modulation (DPCM) system shown in Fig. 15.1.

FIGURE 15.1 Differential encoder transmitter with a pole-zero predictor.

¹However, codebook excitation is generally described as a hybrid coding technique.

In DPCM, the predicted value at time instant k, ŝ(k|k − 1), is subtracted from the input signal at time k, s(k), to produce the prediction error signal e(k). The prediction error is then approximated (quantized) and the quantized prediction error, e_q(k), is coded (represented as a binary number) for transmission to the receiver. Simultaneously with the coding, e_q(k) is summed with ŝ(k|k − 1) to yield a reconstructed version of the input sample, ŝ(k). Assuming no channel errors, an identical reconstruction, distorted only by the effects of quantization, is accomplished at the receiver. At both the transmitter and receiver, the predicted value at time instant k + 1 is derived using reconstructed values up through time k, and the procedure is repeated. The first DPCM systems had B̂(z) = 0 and Â(z) = Σ_{i=1}^{N} a_i z^{−i}, where {a_i, i = 1…N} are the LPC coefficients and z^{−1} represents unit delay, so that the predicted value was a weighted linear combination of previous reconstructed values, or
ŝ(k|k − 1) = Σ_{i=1}^{N} a_i ŝ(k − i)    (15.1)

Later work showed that letting B̂(z) = Σ_{j=1}^{M} b_j z^{−j} improves the perceived quality of the reconstructed speech¹ by shaping the spectrum of the quantization noise to match the speech spectrum, as well as improving noisy-channel performance [Gibson, 1984]. To produce high-quality, highly intelligible speech, it is necessary that the quantizer and predictor parameters be adaptive to compensate for nonstationarities in the speech waveform.

Frequency Domain Coders

Coders that rely on spectral decomposition often use the usual set of sinusoidal basis functions from signal theory to represent the specific short-time spectral content of a segment of speech. In this case, the approximated signal consists of a linear combination of sinusoids with specified amplitudes and arguments (frequency, phase). For compactness, a countable subset of harmonically related sinusoids may be used. The two most prominent types of frequency domain coders are subband coders and multi-band coders. Subband coders digitally filter the speech into nonoverlapping (as nearly as possible) frequency bands. After filtering, each band is decimated (effectively sampled at a lower rate) and coded separately using PCM, DPCM, or some other method. At the receiver, the bands are decoded, upsampled, and summed to reconstruct the speech. By allocating a different number of bits per sample to the subbands, the perceptually more important frequency bands can be coded with greater accuracy. The design and implementation of subband coders and the speech quality produced have been greatly improved by the development of digital filters called quadrature mirror filters (QMFs) [Johnston, 1980] and polyphase filters. These filters allow subband overlap at the encoder, which causes aliasing, but the reconstruction filters at the receiver can be chosen to eliminate the aliasing if quantization errors are small.
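As a minimal illustration of the QMF idea, the sketch below uses the two-tap Haar pair, the simplest quadrature mirror filters, to split a signal into half-rate low and high bands and then recombine them. The helper names are hypothetical, and practical subband coders use longer filters such as Johnston's; the point is only that the aliasing introduced by decimation cancels exactly at the synthesis stage when nothing is quantized.

```python
def qmf_analysis(x):
    """Split x into low/high bands with the 2-tap Haar QMF pair,
    decimating each band by 2 (len(x) assumed even)."""
    low = [(x[2 * n] + x[2 * n + 1]) / 2 for n in range(len(x) // 2)]
    high = [(x[2 * n] - x[2 * n + 1]) / 2 for n in range(len(x) // 2)]
    return low, high

def qmf_synthesis(low, high):
    """Upsample and recombine the bands; with unquantized bands the
    reconstruction is exact, so the analysis aliasing has cancelled."""
    x = []
    for l, h in zip(low, high):
        x.extend([l + h, l - h])
    return x
```

In a real subband coder, `low` and `high` would each be quantized with a different number of bits per sample before synthesis, which is where the perceptual bit allocation described above takes effect.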
Multi-band coders perform a similar function by characterizing the contributions of individual sinusoidal components to the short-term speech spectrum. These parameters are then quantized, coded, transmitted, and used to configure a bank of tuned oscillators at the receiver. Outputs of the oscillators are mixed in proportion to the distribution of spectral energy present in the original waveform. An important requirement of multi-band coders is a capability to precisely determine perceptually significant spectral components and track the evolution of their energy and phase. Recent developments related to multi-band coding emphasize the use of harmonically related components with carefully intermixed spectral regions of bandlimited white noise. Sinusoidal Transform Coders (STC) and Multi-Band Excitation coders (MBE) are examples of this type of frequency domain coder.

Model Adaptation

Adaptation algorithms for coder predictor or quantizer parameters can be loosely grouped based on the signals that are used as the basis for adaptation. Generally, forward adaptive coder elements analyze the input speech (or a filtered version of it) to characterize predictor coefficients, spectral components, or quantizer parameters in a blockwise fashion. Backward adaptive coder elements analyze a reconstructed signal, which contains quantization noise, to adjust coder parameters in a sequential fashion. Forward adaptive coder elements can produce a more efficient model of speech signal characteristics, but introduce delay into the coder’s operation due to buffering of the signal. Backward adaptive coder elements do not introduce delay, but produce signal models that have lower fidelity with respect to the original speech due to the dependence on the noisy reconstructed signal. Most low-rate coders rely on some form of forward adaptation.
This requires moderate to high delay in processing for accuracy of parameter estimation (autocorrelations/autocovariances for LPC-based coders, sinusoidal resolution for frequency-domain coders). The allowance of significant delay for many coder architectures has enabled a spectrally matched pre- or post-processing step to reduce apparent quantization noise and provide significant perceptual improvements. Perceptual enhancements combined with analysis-by-synthesis optimization, and enabled by recent advances in high-power computing architectures such as digital signal processors, have tremendously improved speech coding results at medium and low rates.

¹In this case, the predicted value is ŝ(k|k − 1) = Σ_{i=1}^{N} a_i ŝ(k − i) + Σ_{j=1}^{M} b_j e_q(k − j).
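Putting together Fig. 15.1, Eq. (15.1), and the pole-zero predicted value in the footnote above, a DPCM transmitter/receiver pair might be sketched as follows. This is an illustrative sketch, not any standardized coder: all names are hypothetical, and the fixed uniform scalar quantizer stands in for the adaptive quantizer a real coder would use.

```python
def quantize(x, step=0.1):
    # stand-in uniform scalar quantizer; real DPCM coders adapt the step size
    return step * round(x / step)

def predict(recon, errors, a, b):
    # pole-zero prediction: sum_i a_i*s_hat(k-i) + sum_j b_j*e_q(k-j)
    k = len(recon)
    p = sum(a[i] * recon[k - 1 - i] for i in range(len(a)) if k - 1 - i >= 0)
    p += sum(b[j] * errors[k - 1 - j] for j in range(len(b)) if k - 1 - j >= 0)
    return p

def dpcm_encode(signal, a, b, step=0.1):
    """Transmitter: quantize the prediction error e(k) and track the same
    reconstruction s_hat(k) that the receiver will form."""
    errors, recon = [], []
    for s in signal:
        p = predict(recon, errors, a, b)
        eq = quantize(s - p, step)   # e_q(k), the transmitted quantity
        errors.append(eq)
        recon.append(p + eq)         # s_hat(k)
    return errors, recon

def dpcm_decode(errors, a, b):
    """Receiver: identical predictor, driven only by the received e_q(k)."""
    recon, errs = [], []
    for eq in errors:
        p = predict(recon, errs, a, b)
        errs.append(eq)
        recon.append(p + eq)
    return recon
```

Because both ends run the same predictor on the same quantized errors, the decoder's output matches the transmitter's local reconstruction exactly (absent channel errors), which is the property the text describes.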
Analysis-by-Synthesis

A significant drawback to traditional “instantaneous” coding approaches such as DPCM lies in the perceptual or subjective relevance of the distortion measure and the signals to which it is applied. Thus, the advent of analysis-by-synthesis coding techniques marks an important milestone in the evolution of medium- to low-rate speech coding. An analysis-by-synthesis coder chooses the coder excitation by minimizing distortion between the original signal and the set of synthetic signals produced by every possible codebook excitation sequence. In contrast, time-domain predictive coders must produce an estimated prediction residual (innovations sequence) to drive the spectral shaping filter(s) of the LPC model, and the classical DPCM approach is to quantize the residual sequence directly using scalar or vector quantizers. The incorporation of frequency-weighted distortion in the optimization of analysis-by-synthesis coders is significant in that it de-emphasizes (increases the tolerance for) quantization noise surrounding spectral peaks. This effect is perceptually transparent since the ear is less sensitive to error around frequencies having higher energy [Atal and Schroeder, 1979]. This approach has resulted in significant improvements in low-rate coder performance, and recent increases in processor speed and power are crucial enabling techniques for these applications.

Analysis-by-synthesis coders based on linear prediction are generally described as hybrid coders since they fall between waveform coders and vocoders.

Particular Implementations

Currently, three coder architectures dominate the fields of medium and low-rate speech coding:

• Code-Excited Linear Prediction (CELP): an LPC-based technique which optimizes a vector of excitation samples (and/or pitch filter and lag parameters) using analysis-by-synthesis.
• Multi-Band Excitation (MBE): a direct spectral estimation technique which optimizes the spectral reconstruction error over a set of subbands using analysis-by-synthesis.

• Mixed-Excitation Linear Prediction (MELP): an optimized version of the traditional LPC vocoder which includes an explicit multiband model of the excitation signal.

Several realizations of these approaches have been adopted nationally and internationally as standard speech coding architectures at rates below 16 kbits/s (e.g., G.728, IMBE, U.S. Federal Standard 1016). The success of these implementations is due to LPC-based analysis-by-synthesis with a perceptual distortion criterion or short-time frequency-domain modeling of a speech waveform or LPC residual. Additionally, the coders that operate at lower rates all benefit from forward adaptation methods which produce efficient, accurate parameter estimates.

CELP

The general CELP architecture is described as a blockwise analysis-by-synthesis selection of an LPC excitation sequence. In low-rate CELP coders, a forward-adaptive linear predictive analysis is performed at 20 to 30 msec intervals. The gross spectral characterization is used to reconstruct, via linear prediction, candidate speech segments derived from a constrained set of plausible filter excitations (the “codebook”). The excitation vector that produces the synthetic speech segment with smallest perceptually weighted distortion (with respect to the original speech) is chosen for transmission. Typically, the excitation vector is optimized more often than the LPC spectral model. The use of vectors rather than scalars for the excitation is significant in bit-rate reduction. The use of perceptual weighting in the CELP reconstruction stage and analysis-by-synthesis optimization of the dominant low-frequency (pitch) component are key concepts in maintaining good quality encoded speech at lower rates.
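The codebook search just described can be sketched in a few lines. This is a toy illustration with hypothetical names: `a` stands for the forward-adaptive LPC coefficients of the current frame, the codebook is tiny, and a plain squared error replaces the perceptually weighted distortion (and gain optimization) of a real CELP coder.

```python
def synthesize(excitation, a):
    # all-pole LPC synthesis: s[k] = e[k] + sum_i a[i] * s[k - 1 - i]
    s = []
    for k, e in enumerate(excitation):
        pred = sum(a[i] * s[k - 1 - i]
                   for i in range(len(a)) if k - 1 - i >= 0)
        s.append(e + pred)
    return s

def celp_search(target, codebook, a):
    """Analysis-by-synthesis codebook search: synthesize every candidate
    excitation through the LPC filter and keep the index giving the
    smallest (here unweighted) squared error against the target segment."""
    best_idx, best_err = None, float("inf")
    for idx, excitation in enumerate(codebook):
        synth = synthesize(excitation, a)
        err = sum((t - s) ** 2 for t, s in zip(target, synth))
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx, best_err
```

Only `best_idx` (plus the frame's LPC and gain parameters) needs to be transmitted, which is why vector excitation is so effective for bit-rate reduction.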
CELP-based speech coders are the predominant coding methodologies for rates between 4 kbits/s and 16 kbits/s due to their excellent subjective performance. Some of the most notable are detailed below. • ITU-T Recommendation G.728 (LD-CELP) [Chen, 1990] is a low delay, backward adaptive CELP coder. In G.728, a low algorithmic delay (less than 2.5 msec) is achieved by using 1024 candidate excitation sequences, each only 5 samples long. A 50th-order LPC spectral model is used, and the coefficients are backward-adapted based on the transmitted excitation. • The speech coder standardized by the CTIA for use in the U.S. (time-division multiple-access) 8 kbits/s digital cellular radio systems is called vector sum excited linear prediction (VSELP) [Gerson and Jasiuk