AIM:
1) To load, display and manipulate the sample speech signal
2) To estimate the pitch of a speech signal by the autocorrelation method
3) To estimate the pitch of a speech signal by the cepstrum method
4) To estimate the pitch of a female speech signal by the above methods
INTRODUCTION:
Speech signals can be classified into voiced, unvoiced and silence regions. The near-periodic vibration of the vocal folds is the excitation for the production of voiced speech. A random noise-like excitation is present for unvoiced speech, and there is no excitation during silence regions. The majority of speech regions are voiced in nature; these include vowels, semivowels and other voiced components. The voiced regions look like a near-periodic signal in the time domain representation. Over short intervals, we may treat the voiced speech segments as periodic for all practical analysis and processing. The periodicity associated with such segments is defined as the 'pitch period T0' in the time domain and the 'pitch frequency or fundamental frequency F0' in the frequency domain. Unless specified otherwise, the term 'pitch' refers to the fundamental frequency F0. Pitch is an important attribute of voiced speech: it carries speaker-specific information and is also needed for speech coding tasks. Thus estimation of pitch is one of the important issues in speech processing. A large set of methods has been developed in the speech processing area for the estimation of pitch. Among them, the three most widely used methods are autocorrelation of speech, cepstrum pitch determination and simplified inverse filtering technique (SIFT) pitch estimation. The success of these methods is largely due to the simple steps involved in the estimation of pitch. Even though the autocorrelation method is mainly of theoretical interest, it provides a framework for the SIFT method.
2. PROJECT DESCRIPTION:
2.1 Autocorrelation:
The term autocorrelation refers to the similarity between observations as a function of the time lag between them. Autocorrelation is often used in signal processing for analysing functions or series of values, such as time domain signals. It is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or for identifying the missing fundamental frequency implied by the harmonic frequencies in a signal. Initially we should have a basic understanding of identifying the voiced/unvoiced/silence regions of speech from their time domain and frequency domain representations. For this we need to plot the speech signal in the time and frequency domains. The time domain representation is termed the waveform and the frequency domain representation is termed the spectrum. We consider speech signals over short ranges, typically 10-30 msec, for plotting their waveforms and spectra. The time domain and frequency domain characteristics are distinct for the three cases. A voiced segment exhibits periodicity in the time domain and a harmonic structure in the frequency domain. An unvoiced segment is random noise-like in the time domain, with a spectrum lacking harmonic structure in the frequency domain. A silence region has negligible energy in either domain.
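As an illustration, the following Python sketch (using numpy, scipy and matplotlib) loads a sample file, extracts a 30 msec segment and plots its waveform and log magnitude spectrum. The file name 'speech.wav', the assumption of a mono recording and the 0.5 sec segment offset are placeholders for whatever sample signal is used.

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

# Load the sample speech signal (assumed mono) and normalise its amplitude.
fs, x = wavfile.read('speech.wav')
x = x.astype(float) / np.max(np.abs(x))

# Extract a 30 msec segment from an assumed offset inside a voiced region.
start = int(0.5 * fs)
seg = x[start:start + int(0.03 * fs)]

# Time axis in msec and magnitude spectrum of the segment.
t = np.arange(len(seg)) / fs * 1000
X = np.abs(np.fft.rfft(seg))
f = np.fft.rfftfreq(len(seg), d=1.0 / fs)

plt.subplot(2, 1, 1); plt.plot(t, seg)
plt.xlabel('Time (msec)'); plt.title('Waveform')
plt.subplot(2, 1, 2); plt.plot(f, 20 * np.log10(X + 1e-12))
plt.xlabel('Frequency (Hz)'); plt.title('Log magnitude spectrum')
plt.tight_layout(); plt.show()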
Analysis of voiced speech:
We should be able to identify whether a given segment of speech, typically 20 - 50 msec, is voiced or not. A voiced speech segment is characterized by its periodic nature, relatively high energy, fewer zero crossings and stronger correlation among successive samples. Voiced speech can be identified by observing the waveform in the time domain, owing to its periodic nature. In the frequency domain, the presence of harmonic structure is the evidence that the segment is voiced. Further, the spectrum will have more energy, typically, in the low frequency region; it will also show a downward trend as frequency increases from zero. The autocorrelation of a segment of voiced speech will have a strong peak at the pitch period. The high energy can be observed in terms of high amplitude values for the voiced segment. However, energy alone cannot decide the voicing information; periodicity is crucial along with energy to identify the voiced segment unambiguously. Similarly, the relatively low number of zero crossings can be indirectly observed as smooth variations among the sequence of sample values. Figure 2 below shows the code to generate the waveform, spectrum and autocorrelation sequence for a given segment of voiced speech.
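As a minimal sketch of this idea, the Python fragment below computes the autocorrelation of the 30 msec segment 'seg' from the earlier sketch and estimates the pitch from the strongest peak in the expected pitch range. The 50-400 Hz search range is an assumed choice, not a requirement of the method.

import numpy as np

# Autocorrelation sequence of the voiced segment, keeping non-negative lags only.
r = np.correlate(seg, seg, mode='full')
r = r[len(seg) - 1:]

# Restrict the peak search to lags corresponding to an assumed 50-400 Hz pitch range.
lag_min = int(fs / 400)
lag_max = int(fs / 50)
peak_lag = lag_min + np.argmax(r[lag_min:lag_max])

T0 = peak_lag / fs          # pitch period in seconds
F0 = fs / peak_lag          # pitch (fundamental) frequency in Hz
print('Pitch period = %.2f msec, F0 = %.1f Hz' % (T0 * 1000, F0))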
Analysis of unvoiced speech:
We should be able to identify whether a given segment of speech, typically 20 - 50 msec, is unvoiced or not. An unvoiced speech segment is characterized by its non-periodic nature, relatively low energy compared to voiced speech, a larger number of zero crossings and relatively weak correlation among successive samples. Unvoiced speech can be identified by observing the waveform in the time domain, owing to its non-periodic nature. In the frequency domain, the absence of harmonic structure is the evidence that the segment is unvoiced. Further, the spectrum will have more energy, typically, in the high frequency region; it will also show an upward trend as frequency increases from zero. The autocorrelation of a segment of unvoiced speech typically matches that of random noise. The low energy can be observed in terms of low amplitude values for the unvoiced segment. However, energy alone cannot decide the voicing information; the number of zero crossings is crucial along with energy to identify the unvoiced segment unambiguously. The relatively high number of zero crossings can be indirectly observed as rapid variations among the sequence of sample values. Here we find no prominent peak in the autocorrelation, unlike the case of voiced speech. This is the fundamental distinction between voiced and unvoiced speech.
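The short sketch below computes the short-time energy and zero-crossing rate discussed above for a given segment; the threshold values used to label the segment are illustrative assumptions only.

import numpy as np

def short_time_energy(seg):
    # Average squared amplitude of the segment.
    return np.sum(seg ** 2) / len(seg)

def zero_crossing_rate(seg):
    # Fraction of successive sample pairs whose signs differ.
    return np.sum(np.abs(np.diff(np.sign(seg)))) / (2.0 * len(seg))

# A voiced segment typically shows high energy and low ZCR; an unvoiced
# segment shows low energy and high ZCR. Thresholds here are assumed values.
e, z = short_time_energy(seg), zero_crossing_rate(seg)
label = 'voiced-like' if (e > 0.01 and z < 0.1) else 'unvoiced/silence-like'
print('Energy = %.4f, ZCR = %.3f -> %s' % (e, z, label))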
2.2 Estimation of pitch by Cepstrum method:
A cepstrum is the result of taking the inverse Fourier transform (IFT) of the logarithm of the estimated spectrum of a signal. The name "cepstrum" was derived by reversing the first four letters of "spectrum". Using the cepstrum method, we can separate the vocal tract and excitation source related information in the speech signal. The main limitation encountered during the estimation of pitch by the autocorrelation method is that there may be peaks larger than the peak which corresponds to the pitch period, and hence a chance of wrong estimation of pitch. The approach to minimize such errors is to separate the vocal tract and excitation source related information in the speech signal and then use the source information for pitch estimation. The cepstral analysis of speech provides such an approach. The cepstrum of speech is defined as the inverse Fourier transform of the log magnitude spectrum. In the cepstrum, the slowly varying components of the log magnitude spectrum map to the low quefrency region and the fast varying components to the high quefrency region. In the log magnitude spectrum, the slowly varying components represent the envelope corresponding to the vocal tract and the fast varying components represent the excitation source. As a result, the vocal tract and excitation source components get separated naturally in the cepstrum of speech.
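A minimal sketch of the cepstrum computation, following the definition above, is given below; it windows the 30 msec segment 'seg' with a Hamming window before taking the transform, which is a common but assumed choice.

import numpy as np

# Real cepstrum: inverse FFT of the log magnitude spectrum of the windowed segment.
spectrum = np.fft.fft(seg * np.hamming(len(seg)))
log_mag = np.log(np.abs(spectrum) + 1e-12)
cepstrum = np.real(np.fft.ifft(log_mag))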
The graph depicts a 30 msec segment of voiced speech, its log magnitude spectrum and cepstrum. The initial few values in the cepstrum, typically the first 13-15 cepstral values, represent the vocal tract information. The large peak present after these initial values represents the excitation information; its location, measured in number of samples from the zeroth value, gives the pitch period T0. As a result, the spurious peaks that may occur in autocorrelation analysis get naturally eliminated in cepstrum pitch determination. By comparing the voiced speech graph and the unvoiced speech graph, we can observe that there is no prominent peak in the cepstrum of unvoiced speech after the 13-15 initial cepstral values. This is the main distinction between the cepstrum of voiced and unvoiced speech. For the estimation of pitch, the location of the target peak in the cepstral sequence after the initial 2 msec (about 16 cepstral values at an 8 kHz sampling rate) gives the estimate of the pitch period in the case of voiced speech. This illustrates the procedure for computing the pitch period by high-time liftering of the cepstrum of the voiced speech.
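A sketch of this high-time liftering step, under the same assumptions as the earlier fragments, searches for the dominant cepstral peak beyond 2 msec of quefrency; the 20 msec upper limit (a 50 Hz lower bound on F0) is an assumed search range.

import numpy as np

# Skip the first 2 msec of quefrency (vocal tract part) and search for the
# excitation peak up to an assumed maximum pitch period of 20 msec.
q_min = int(0.002 * fs)
q_max = int(0.020 * fs)
peak_q = q_min + np.argmax(cepstrum[q_min:q_max])

T0 = peak_q / fs            # pitch period in seconds
F0 = 1.0 / T0               # fundamental frequency in Hz
print('Cepstral pitch period = %.2f msec, F0 = %.1f Hz' % (T0 * 1000, F0))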