Human speech consists mainly of three components: a glottal signal, a vocal tract response, and a harmonic shift. These correlate, respectively, with the intonation (pitch), the formants (timbre), and the speech resolution (depth). Incorporating the intonation of the Fundamental Frequency (FF) into Automatic Speech Recognition (ASR) systems is necessary for three reasons. First, the intonation conveys a primitive para-language. Second, tuning to the speaker's pitch suppresses background noise and clarifies the acoustic observations. Third, feature extraction is more efficient when the pitch and the spectral features are computed jointly.
This work introduces a frequency-modulation model; a novel quefrency-based speech feature extraction named the Speech Quefrency Transform (SQT); and its corresponding quefrency scaling and transformation function. Cepstra, which are spectra of spectra, are expressed here in units of time acceleration, whereby the discrete variable, the quefrency, is measured in Hertz per microsecond. The extracted features are comparable to Mel-Frequency Cepstral Coefficients (MFCCs), with an integrated quefrency-based pitch tracker. The SQT directly expands the time samples of stationary signals (i.e., speech) into a higher-dimensional space, which can help generative Artificial Neural Networks (ANNs) in unsupervised Machine Learning and Natural Language Processing (NLP) tasks. The proposed methodologies form a scalable solution, compatible with dynamic and parallel programming, for refined speech and cepstral analysis; they robustly estimate the features by a matrix multiplication over fewer than a hundred sub-bands, preserving precious computational resources.
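As a point of reference for the quefrency domain the abstract builds on, the classical real cepstrum (the "spectrum of a spectrum") and its use as a pitch tracker can be sketched in a few lines of NumPy. This is the textbook cepstral method, not the SQT itself; the function names, the 1024-sample window, and the 150-400 Hz search band are illustrative choices, not taken from the paper.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: the inverse FFT of the log magnitude spectrum,
    i.e. a 'spectrum of a spectrum' over the quefrency variable."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small offset avoids log(0)
    return np.fft.irfft(log_mag)

def cepstral_pitch(frame, fs, fmin=150.0, fmax=400.0):
    """Pick the dominant cepstral peak inside a plausible quefrency band.
    Keeping the band narrow avoids rahmonic (sub-harmonic) confusions."""
    ceps = real_cepstrum(frame)
    qmin = int(fs / fmax)          # shortest period of interest (samples)
    qmax = int(fs / fmin)          # longest period of interest (samples)
    peak = qmin + np.argmax(ceps[qmin:qmax])
    return fs / peak               # period in samples -> F0 in Hertz

# Synthetic voiced frame: 250 Hz fundamental plus two weaker harmonics,
# chosen so an integer number of periods fits the 1024-sample window.
fs = 16000
t = np.arange(1024) / fs
frame = (np.sin(2 * np.pi * 250 * t)
         + 0.5 * np.sin(2 * np.pi * 500 * t)
         + 0.25 * np.sin(2 * np.pi * 750 * t))
print(cepstral_pitch(frame, fs))  # close to 250.0
```

Because the voiced frame is periodic, its log spectrum has peaks spaced at the fundamental, and the second FFT folds that spacing into a single cepstral peak at the pitch period; the SQT's quefrency scaling described above plays in this same domain.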