For citation: Markovnikov N. M., Kipyatkova I. S. Encoder-decoder models for recognition of Russian speech. Informatsionno-upravliaiushchie sistemy [Information and Control Systems], 2019, no. 4, pp. 45–53 (In Russian). doi:10.31799/1684-8853-2019-4-45-53
References
1. Bahdanau D., Chorowski J., Serdyuk D., Brakel P., Bengio Y. End-to-end attention-based large vocabulary speech recognition. Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4945–4949. doi:10.1109/ICASSP.2016.7472618
2. Allauzen C., Riley M., Schalkwyk J., Skut W., Mohri M. OpenFst: A general and efficient weighted finite-state transducer library. Implementation and Application of Automata, 2007, pp. 11–23. doi:10.1007/978-3-540-76336-9_3
3. Chan W., Jaitly N., Le Q., Vinyals O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964. doi:10.1109/ICASSP.2016.7472621
4. Graves A., Jaitly N., Mohamed A.-r. Hybrid speech recognition with deep bidirectional LSTM. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013, pp. 273–278. doi:10.1109/ASRU.2013.6707742
5. Hochreiter S., Schmidhuber J. Long short-term memory. Neural Computation, 1997, vol. 9, no. 8, pp. 1735–1780. doi:10.1162/neco.1997.9.8.1735
6. Vaswani A., et al. Attention is all you need. arXiv, 2017. Available at: http://arxiv.org/abs/1706.03762 (accessed 27 February 2019).
7. Besacier L., Barnard E., Karpov A., Schultz T. Automatic speech recognition for under-resourced languages: A survey. Speech Communication, 2014, vol. 56, pp. 85–100. doi:10.1016/j.specom.2013.07.008
8. Markovnikov N., Kipyatkova I. A survey of end-to-end speech recognition systems. Trudy SPIIRAN [SPIIRAS Proceedings], 2018, vol. 58, pp. 77–110 (In Russian). doi:10.15622/sp.58.4
9. Sutskever I., Vinyals O., Le Q. V. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
10. Robinson T., Hochberg M., Renals S. The use of recurrent neural networks in continuous speech recognition. Automatic Speech and Speaker Recognition, Springer, 1996, pp. 233–258.
11. Chorowski J. K., Bahdanau D., Serdyuk D., Cho K., Bengio Y. Attention-based models for speech recognition. Advances in Neural Information Processing Systems, 2015, pp. 577–585.
12. Bahdanau D., Cho K., Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv, 2014. Available at: http://arxiv.org/abs/1409.0473 (accessed 27 February 2019).
13. Ganchev T., Fakotakis N., Kokkinakis G. Comparative evaluation of various MFCC implementations on the speaker verification task. Proc. of the SPECOM, 2005, pp. 191–194.
14. Kingma D. P., Ba J. Adam: A method for stochastic optimization. arXiv, 2014. Available at: http://arxiv.org/abs/1412.6980 (accessed 27 February 2019).
15. Zeyer A., Doetsch P., Voigtlaender P., Schluter R., Ney H. A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition. Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2462–2466. doi:10.1109/ICASSP.2017.7952599
16. Sennrich R., Haddow B., Birch A. Neural machine translation of rare words with subword units. ACL, 2016, pp. 1715–1725. doi:10.18653/v1/P16-1162
17. Wiesler S., Richard A., Schlüter R., Ney H. Mean-normalized stochastic gradient for large-scale deep learning. IEEE Intern. Conf. on Acoustics, Speech, and Signal Processing, 2014, pp. 180–184. doi:10.1109/ICASSP.2014.6853582
18. He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition. Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90
19. Szegedy C., Vanhoucke V., Ioffe S., Shlens J., Wojna Z. Rethinking the inception architecture for computer vision. Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826. doi:10.1109/CVPR.2016.308
20. Chiu C. C., et al. State-of-the-art speech recognition with sequence-to-sequence models. IEEE Intern. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4774–4778. doi:10.1109/ICASSP.2018.8462105
21. Kipyatkova I., Karpov A. DNN-based acoustic modeling for Russian speech recognition using Kaldi. Intern. Conf. on Speech and Computer (SPECOM), 2016, pp. 246–253. doi:10.1007/978-3-319-43958-7_29
22. Verkhodanova V., Ronzhin A., Kipyatkova I. HAVRUS corpus: high-speed recordings of audio-visual Russian speech.