Speaker Verification System

Feature Extraction

Feature extraction is done by calculating mel-cepstrum coefficients for each frame of the speech signal.

Figure 3: Mel-Cepstrum Block Diagram

The original vector of sampled values is framed into overlapping blocks. Each block contains 256 samples, and adjacent blocks start 128 samples apart. This yields 50% overlap, so every sampled value (apart from those at the very start and end of the signal) falls within two blocks. Speech signals are quasi-stationary over intervals of roughly 5 msec to 100 msec, so a block length of 256 samples was chosen to give 16 msec blocks (which corresponds to a 16 kHz sampling rate). Because 256 is also a power of 2, the Fast Fourier Transform can be applied efficiently in subsequent stages.
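The framing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `frame_signal`, the 16 kHz sampling rate (inferred from 256 samples lasting 16 msec), and the choice to drop a trailing partial block are not taken from the original system.

```python
def frame_signal(samples, frame_len=256, hop=128):
    """Split a sequence into overlapping frames of frame_len samples.

    Consecutive frames start `hop` samples apart. Trailing samples that do
    not fill a complete frame are dropped (an assumption; the text does not
    say how the tail of the signal is handled).
    """
    if len(samples) < frame_len:
        return []
    n_frames = 1 + (len(samples) - frame_len) // hop
    return [list(samples[i * hop : i * hop + frame_len]) for i in range(n_frames)]


SAMPLE_RATE_HZ = 16000  # assumed: 256 samples per 16 msec block

signal = list(range(1024))  # stand-in for a vector of sampled speech values
frames = frame_signal(signal)
frame_ms = 1000 * 256 / SAMPLE_RATE_HZ  # duration of one block in msec
```

With a hop of half the block length, the second half of each frame is identical to the first half of the next one, which is the 50% overlap described above.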

Each block is multiplied by a Hamming window, which tapers the block toward its edges to minimize the spectral distortion caused by discontinuities at the block boundaries.
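As a rough sketch of the windowing step, the code below builds a 256-point Hamming window from its usual definition, w[k] = 0.54 - 0.46 cos(2πk/(N-1)), and multiplies a block by it. The function names are hypothetical, and the symmetric form of the window is an assumption, since the text does not say which variant was used.

```python
import math


def hamming(n):
    """Symmetric Hamming window: w[k] = 0.54 - 0.46*cos(2*pi*k/(n-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def apply_window(frame, window):
    """Multiply a frame by the window, sample by sample."""
    return [s * w for s, w in zip(frame, window)]


window = hamming(256)
# A constant block makes the taper visible: the edges are scaled to about
# 0.08 while the middle of the block is left almost unchanged.
windowed = apply_window([1.0] * 256, window)
```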

The Fast Fourier Transform is then applied to each windowed block as the first stage of the Mel-Cepstral Transform. This stage produces the spectral coefficients of each block.
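To illustrate why a power-of-2 block length is convenient, here is a minimal recursive radix-2 FFT written with only the standard library, followed by a power spectrum over the non-negative frequency bins. It is a teaching sketch, not the routine used in the original system; in particular, taking the squared magnitude as the "spectral coefficients" is an assumption.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor times odd term
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectrum(frame):
    """Squared magnitude |X[k]|^2 for bins k = 0 .. N/2."""
    spectrum = fft(frame)
    return [abs(c) ** 2 for c in spectrum[: len(frame) // 2 + 1]]


N = 256
# A cosine with exactly 8 cycles per block should put its energy in bin 8.
tone = [math.cos(2 * math.pi * 8 * n / N) for n in range(N)]
power = power_spectrum(tone)
peak_bin = max(range(len(power)), key=power.__getitem__)
```

For a 256-sample block this gives 129 bins. If the sampling rate is 16 kHz (an inference from the 16 msec block length), bin k corresponds to k × 62.5 Hz.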

The Mel Frequency Transform is then applied to the spectrum of each block to convert the frequency scale to the mel scale. The mel scale is approximately linear at low frequencies and logarithmic at higher frequencies, similar to the way the human ear perceives pitch. A bank of 29 filters captures frequency bands spaced according to the mel scale. See the figure below for the filterbank plot. The output power of each filter is then put through a discrete cosine transform to arrive at the mel-frequency cepstral coefficients. Each frame yields 12 coefficients.
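The last two stages can be sketched as follows. Several details here are assumptions rather than facts about the original system: the mel formula 2595·log10(1 + f/700), the triangular filter shape, the 16 kHz sampling rate, taking the logarithm of each filter output before the DCT (the conventional cepstrum step), and keeping the first 12 DCT outputs. All function names are hypothetical.

```python
import math


def hz_to_mel(f):
    """A common mel-scale formula (assumed; other variants exist)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters=29, n_fft=256, sample_rate=16000):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2)
    # n_filters + 2 edge frequencies, expressed as fractional FFT bin positions
    edges = [mel_to_hz(top * i / (n_filters + 1)) * n_fft / sample_rate
             for i in range(n_filters + 2)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_bins):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))  # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mel_cepstrum(power, bank, n_coeffs=12):
    """Filter output powers -> log -> DCT-II, keeping n_coeffs outputs."""
    energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
    log_e = [math.log(e + 1e-10) for e in energies]  # small offset avoids log(0)
    m = len(log_e)
    return [sum(log_e[j] * math.cos(math.pi * k * (j + 0.5) / m) for j in range(m))
            for k in range(n_coeffs)]


bank = mel_filterbank()
flat_power = [1.0] * 129  # stand-in for one block's power spectrum
coeffs = mel_cepstrum(flat_power, bank)
```

Each row of `bank` is one filter over the 129 spectrum bins, so a block's 129 spectral values are reduced to 29 filter outputs and then to 12 coefficients.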