Speech recognition based on spectrograms by using deep learning

2018

Abstract

Speech recognition is widely used and has become part of everyday life. Several large, popular applications have taken its use to another level. Most existing systems use machine learning techniques such as artificial neural networks or fuzzy logic, whereas others rely simply on a comparative analysis of the sound signal against large lookup tables that contain possible realizations of voice commands. These models base their speech recognition algorithms on the analysis or comparison of the analog acoustic signal itself. Sound, however, has particular characteristics that cannot be seen in the time-domain representation of its propagation wave. This project proposes speech recognition through an innovative model that analyzes the graphic representation of the acoustic signal, its spectrogram. The model therefore classifies speech not through its acoustic signal but through its graphical representation. This leads the research to an approximation of the problem ...
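To make concrete what "the graphic representation of the acoustic signal" refers to, the following minimal sketch turns a one-dimensional waveform into a two-dimensional spectrogram array using a short-time Fourier transform. The abstract does not say how the spectrograms were produced, so everything here is an assumption made for illustration: the function name `spectrogram`, the frame length, the hop size, the Hann window, and the synthetic test tone are all hypothetical choices, not the author's method.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram in dB from a short-time Fourier transform.

    frame_len and hop are illustrative defaults, not values from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One real FFT per frame gives the spectrum of that time slice.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Log scale; transpose so rows are frequency bins and columns are time.
    return 20.0 * np.log10(magnitude + 1e-10).T


# Synthetic stand-in for a recording: one second of a 1 kHz tone at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 256
print(spec.shape, peak_hz)
```

The resulting array can be treated as a grayscale image, with frequency on one axis and time on the other, which is the kind of input an image classifier would consume. The tone appears as a single horizontal band at its frequency, a structure that is not visible in the raw waveform.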
