Detecting head movements in video-recorded dyadic conversations
Proceedings of the 20th International Conference on Multimodal Interaction: Adjunct
https://doi.org/10.1145/3281151.3281152

Abstract
This paper addresses the automatic recognition of head movements in videos of face-to-face dyadic conversations. We present an approach in which recognition of head movements is cast as a multimodal frame classification problem based on visual and acoustic features. The visual features include velocity, acceleration, and jerk values associated with head movements, while the acoustic ones are pitch and intensity measurements from the co-occurring speech. We present the results obtained by training and testing a number of classifiers on manually annotated data from two conversations. The best performing classifier, a Multilayer Perceptron trained on all the features, achieves an accuracy of 0.75 and outperforms the mono-modal baseline classifier.
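The paper itself reports results rather than code, but the approach described above can be sketched as per-frame classification over concatenated visual and acoustic features. The following Python sketch is purely illustrative: the synthetic data, the toy labelling rule, and the MLP hyperparameters are assumptions made for this example, not the authors' implementation.

```python
# Minimal sketch of multimodal per-frame head-movement classification,
# assuming features have already been extracted for each video frame.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_frames = 5000

# Visual features per frame: head velocity, acceleration, jerk.
visual = rng.normal(size=(n_frames, 3))
# Acoustic features per frame: pitch and intensity of co-occurring speech.
acoustic = rng.normal(size=(n_frames, 2))
X = np.hstack([visual, acoustic])

# Toy labelling rule standing in for manual annotation:
# frames with large velocity magnitude count as head-movement frames.
y = (np.abs(visual[:, 0]) > 0.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Multilayer Perceptron frame classifier; hyperparameters are illustrative.
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
clf.fit(X_train, y_train)
print("frame accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

A mono-modal baseline in this setup would simply train the same classifier on only the visual (or only the acoustic) columns of X.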