Peer Reviewed Journal via three different mandatory reviewing processes, since 2006, and, from September 2020, a fourth mandatory peer-editing has been added.
In this paper, we explore speech features to identify Hindi dialects and emotions. A dialect is any distinguishable variety of a language spoken by a group of people. Emotions lend naturalness to speech. Five prominent dialects of Hindi are considered for the identification task: Chhattisgarhi (spoken in central India), Bengali (Bengali-accented Hindi spoken in the Eastern region), Marathi (Marathi-accented Hindi spoken in the Western region), General (Hindi spoken in the Northern region) and Telugu (Telugu-accented Hindi spoken in the Southern region). Along with dialect identification, we also carry out emotion recognition. The speech database used for the dialect identification task consists of spontaneous speech from male and female speakers. The Indian Institute of Technology Kharagpur Simulated Emotion Hindi Speech Corpus (IITKGP-SEHSC) is used for the emotion recognition studies. The emotions considered in this study are anger, disgust, fear, happy, neutral and sad. Prosodic and spectral features extracted from speech are used to discriminate the dialects and emotions. Spectral features are represented by Mel frequency cepstral coefficients (MFCC), and prosodic features are represented by syllable durations and by pitch and energy contours. Auto-associative neural network (AANN) models and Support Vector Machines (SVM) are explored for capturing the dialect-specific and emotion-specific information from these features. AANN models are expected to capture the nonlinear relations specific to dialects or emotions through the distributions of the feature vectors. SVMs perform dialect or emotion classification based on the discriminative characteristics present among the dialects or emotions. Separate classification systems are developed for dialect classification and emotion classification.
The recognition performance of the dialect identification and emotion recognition systems is found to be 81% and 78%, respectively.
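As an illustration of one of the prosodic features named above, the energy contour, the following minimal sketch computes short-time energy over successive frames of a sampled signal. It is not the procedure used in this work: the function names `energy_contour` and `contour_stats`, the default frame length of 400 samples and the hop of 160 samples (25 ms and 10 ms at an assumed 16 kHz sampling rate) are all illustrative assumptions.

```python
def energy_contour(samples, frame_len=400, hop=160):
    """Return the short-time energy (sum of squared samples) of each full frame.

    frame_len and hop are hypothetical defaults (25 ms / 10 ms at 16 kHz);
    the source does not specify the framing that was actually used.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    contour = []
    # Slide a window over the signal; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        contour.append(sum(x * x for x in frame))
    return contour


def contour_stats(contour):
    """Summarize a non-empty contour as (minimum, maximum, mean)."""
    if not contour:
        raise ValueError("contour must not be empty")
    return min(contour), max(contour), sum(contour) / len(contour)
```

Summary statistics such as those returned by `contour_stats` are one plausible way to turn a variable-length contour into a fixed-length feature vector for a classifier; whether the systems described here use such statistics is not stated in the abstract.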