A Isolated Word Recognizer for the Swahili Dialect Using Sphinx4-Neural Network Hybrid

  • Shadrack K. Kimutai, Kipkabus Technical and Vocational College
  • David Gichoya, Kenya Methodist University & Moi University
  • Edna Milgo, Moi University
Keywords: Sphinx4, Speech Recognition, Artificial Neural Networks, Natural Language Processing


Speech is one of the most effective means of communication. Efforts to develop an automatic speech recognizer for the Swahili dialect have faced numerous challenges because there is no standardized acoustic model for the dialect. This study aimed to develop an isolated word recognizer for automatic transcription of the Swahili dialect using a Sphinx4-neural network hybrid. The objectives of the study were to design and develop a neural network within the front-end module of Sphinx4, and to compare the performance of Sphinx4 with that of the Sphinx4-neural network hybrid. The study was guided by statistical and probabilistic theory and by the theory of linguistics. It followed a positivist philosophical paradigm and adopted an experimental research design. To this end, the researcher developed a small corpus of the dialect containing twenty words, and a total of two hundred speech recordings were made from ten volunteers. The words were selected by purposive sampling and the volunteers by convenience sampling. Results obtained from Sphinx4 and from the Sphinx4-neural network hybrid were evaluated using descriptive analysis techniques. The study established that a neural network can be integrated with Sphinx4. It also established that the Sphinx4-neural network hybrid, with its neural network trained to an error below 0.0175, performed statistically on par with Sphinx4. The study concluded that, for isolated word recognition in Swahili, the Sphinx4-neural network hybrid can be used in place of the Sphinx4 HMM recognizer. Recommended areas for further study include using the tool for continuous speech recognition, evaluating it on a larger speech sample, comparing it with other tools such as NICO, and finding ways to improve its accuracy.
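The abstract states that a neural network was placed in the Sphinx4 front end and trained until its error fell below 0.0175, but it does not give the network's architecture or the error measure. The following is therefore only a minimal Python sketch under stated assumptions: a single-hidden-layer feed-forward network with sigmoid units, trained by online backpropagation, with "error" taken to mean mean squared error, and small hand-made vectors standing in for the per-word features a speech front end would produce. The class `MLP`, its parameters, and the word labels are hypothetical; this is not the authors' implementation and does not call any Sphinx4 (Java) API.

```python
import math
import random

# Stopping threshold quoted in the abstract. Treating "error" as mean squared
# error over all training outputs is an assumption of this sketch.
ERROR_TARGET = 0.0175


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class MLP:
    """Hypothetical single-hidden-layer feed-forward network with sigmoid units."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each weight row has one extra trailing entry that acts as the bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]  # input plus bias term
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        hb = h + [1.0]  # hidden activations plus bias term
        o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w2]
        return xb, hb, o

    def train(self, data, lr=0.5, target_error=ERROR_TARGET, max_epochs=20000):
        """Online backpropagation until the epoch's mean squared error < target_error."""
        n_out = len(self.w2)
        mse = float("inf")
        for epoch in range(1, max_epochs + 1):
            total = 0.0
            for x, t in data:
                xb, hb, o = self.forward(x)
                total += sum((ti - oi) ** 2 for ti, oi in zip(t, o))
                # Output deltas for squared error with sigmoid units.
                d_out = [(ti - oi) * oi * (1.0 - oi) for ti, oi in zip(t, o)]
                # Hidden deltas, computed before the output weights are changed.
                d_hid = [
                    hb[j] * (1.0 - hb[j]) * sum(d_out[k] * self.w2[k][j] for k in range(n_out))
                    for j in range(len(self.w1))
                ]
                for k, row in enumerate(self.w2):
                    for j in range(len(row)):
                        row[j] += lr * d_out[k] * hb[j]
                for j, row in enumerate(self.w1):
                    for i in range(len(row)):
                        row[i] += lr * d_hid[j] * xb[i]
            mse = total / (len(data) * n_out)  # error accumulated over this epoch
            if mse < target_error:
                return epoch, mse
        return max_epochs, mse

    def classify(self, x):
        """Index of the most active output unit, read as the recognized word."""
        _, _, o = self.forward(x)
        return max(range(len(o)), key=o.__getitem__)


# Toy feature vectors (NOT real MFCCs or Swahili recordings), one-hot targets,
# one output unit per word.
words = ["word_a", "word_b", "word_c"]  # hypothetical labels
data = [
    ([1.0, 0.0, 0.0, 0.2], [1.0, 0.0, 0.0]),
    ([0.0, 1.0, 0.0, 0.2], [0.0, 1.0, 0.0]),
    ([0.0, 0.0, 1.0, 0.2], [0.0, 0.0, 1.0]),
]
net = MLP(n_in=4, n_hidden=6, n_out=3)
epochs, err = net.train(data)
print(f"stopped after {epochs} epochs, MSE={err:.4f}")
print(words[net.classify([0.0, 1.0, 0.0, 0.2])])
```

Stopping on an error threshold rather than a fixed number of epochs mirrors the 0.0175 criterion in the abstract; `max_epochs` is only a safety cap added for the sketch, and a real recognizer would feed it features extracted from audio rather than hand-written vectors.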



How to Cite
Kimutai, S., Gichoya, D., & Milgo, E. (2023). A Isolated Word Recognizer for the Swahili Dialect Using Sphinx4-Neural Network Hybrid. Africa Journal of Technical and Vocational Education and Training, 8(1), 216-224. Retrieved from https://afritvet.org/index.php/Afritvet/article/view/172