doi:10.31799/1684-8853-2019-4-45-53
Encoder-decoder models for recognition of Russian speech
N. M. Markovnikov^a, Programmer, orcid.org/0000-0002-2352-4195, niklemark@gmail.com
I. S. Kipyatkova^{a,b}, PhD, Tech., Senior Researcher, orcid.org/0000-0002-1264-4458
^a Saint-Petersburg Institute for Informatics and Automation of the RAS, 39, 14 Line, V. O., 199178, Saint-Petersburg, Russian Federation
^b Saint-Petersburg State University of Aerospace Instrumentation, 67, B. Morskaia St., 190000, Saint-Petersburg, Russian Federation
Problem: Classical automatic speech recognition systems are traditionally built from an acoustic model based on hidden Markov models and a statistical language model. Such systems achieve high recognition accuracy, but consist of several independent, complex parts, which can cause problems when building the models. Recently, end-to-end recognition methods based on deep artificial neural networks have become widespread. This approach makes it possible to implement the entire model as a single neural network. End-to-end models often outperform hybrid ones in both speed and accuracy of speech recognition. Purpose: Implementation of end-to-end models for the recognition of continuous Russian speech, their tuning, and their comparison with hybrid baseline models in terms of recognition accuracy and computational characteristics such as training and decoding speed. Methods: Building an encoder-decoder speech recognition model with an attention mechanism; applying neural network stabilization and regularization techniques; augmenting the training data; using subword units as the output of the neural network. Results: We obtained an encoder-decoder model with an attention mechanism that recognizes continuous Russian speech without feature extraction or a language model. As elements of the output sequence, we used parts of words from the training set. The resulting model could not surpass the baseline hybrid models, but it surpassed the other baseline end-to-end models in both recognition accuracy and decoding/training speed. The word error rate was 24.17%, and the decoding speed was 0.3 of real time, which is 6% faster than the baseline end-to-end model and 46% faster than the baseline hybrid model. We showed that end-to-end models can work without language models for the Russian language while demonstrating a higher decoding speed than hybrid models. The resulting model was trained on raw data without extracting any features. We found that for the Russian language a hybrid attention mechanism gives the best result compared to purely location-based or content-based attention mechanisms. Practical relevance: The resulting models require less memory and less speech decoding time than traditional hybrid models, which can allow them to be used locally on mobile devices without computations on remote servers.
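The abstract's "hybrid" attention combines content-based scoring (comparing the decoder state with each encoder state) with location-based scoring (features convolved from the previous step's attention weights). A minimal NumPy sketch of one such scoring step follows; all dimensions and randomly initialized parameters are toy illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only)
T_enc, d_h, d_s, d_att, n_filt = 6, 8, 8, 10, 3

H = rng.normal(size=(T_enc, d_h))      # encoder states h_1..h_T
s = rng.normal(size=d_s)               # current decoder state s_t
a_prev = np.full(T_enc, 1.0 / T_enc)   # previous attention weights

# Trainable parameters (random here; learned in a real model)
W = rng.normal(size=(d_att, d_s))      # projects decoder state
V = rng.normal(size=(d_att, d_h))      # projects encoder states
U = rng.normal(size=(d_att, n_filt))   # projects location features
w = rng.normal(size=d_att)             # scoring vector
F = rng.normal(size=(n_filt, 3))       # 1-D conv filters over a_prev

# Location features: convolve the previous attention weights
pad = np.pad(a_prev, 1)                # zero-pad for width-3 filters
f = np.array([[F[c] @ pad[j:j + 3] for c in range(n_filt)]
              for j in range(T_enc)])  # shape (T_enc, n_filt)

# Hybrid score per frame: content term (W s, V h_j) + location term (U f_j)
e = np.array([w @ np.tanh(W @ s + V @ H[j] + U @ f[j])
              for j in range(T_enc)])

# Softmax over encoder frames, then the context vector for the decoder
a = np.exp(e - e.max())
a /= a.sum()
context = a @ H                        # shape (d_h,)
```

Dropping the `U @ f[j]` term gives purely content-based attention; dropping `W @ s + V @ H[j]` gives purely location-based attention, which is the comparison the abstract reports.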