As we move closer to real-world social AI systems, AI agents must be able to deal with multiparty (group) conversations. Recognizing and interpreting multiparty behaviors is challenging, as the system must recognize individual behavioral cues, deal with the complexity of multiple streams of data from multiple people, and recognize the subtle contingent social exchanges that take place amongst group members. To tackle this challenge, we propose the Multiparty-Transformer (MultiPar-T), a transformer model for multiparty behavior modeling. The core component of our proposed approach is Crossperson Attention, which is specifically designed to detect contingent behavior between pairs of people. We verify the effectiveness of MultiPar-T on a publicly available video-based group engagement detection benchmark, where it outperforms state-of-the-art approaches in average F-1 scores by 5.2% and individual class F-1 scores by up to 10.0%. Through qualitative analysis, we show that our Crossperson Attention module is able to discover contingent behaviors.
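For readers who want a concrete picture of the pairwise attention idea described above, here is a minimal PyTorch sketch of cross-attention from one person's behavior stream to another's. The module name, tensor shapes, and the residual/normalization choices are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of cross-person attention between two people's behavior streams.
# Shapes and module names are illustrative assumptions, not the paper's exact code.
import torch
import torch.nn as nn

class CrossPersonAttention(nn.Module):
    """Attend from person A's behavior sequence to person B's (queries from A, keys/values from B)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, person_a: torch.Tensor, person_b: torch.Tensor) -> torch.Tensor:
        # person_a, person_b: (batch, time, dim) per-person behavior embeddings
        attended, _ = self.attn(query=person_a, key=person_b, value=person_b)
        # Residual connection keeps person A's own features while mixing in B's contingent cues.
        return self.norm(person_a + attended)

# Example: windows of 30 frames with 128-d behavior features per person.
a = torch.randn(8, 30, 128)
b = torch.randn(8, 30, 128)
out = CrossPersonAttention(dim=128)(a, b)   # (8, 30, 128)
```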
There has been an emerging use of touchscreen-based smart devices, such as the iPad, for assisting in education and communication interventions for children with Autism Spectrum Disorder (ASD). There has also been growing evidence of the utilization of robots to foster social interaction in children with ASD. Unfortunately, although interventions using the tablet have been successfully implemented in the home environment, the robotic platforms have not. One reason is that these robotic platforms are typically not autonomous, i.e., they are controlled directly by the clinician or through pre-scripted behavior. This makes it difficult to immerse such platforms in environments outside of the clinical setting. As such, to capitalize on the widespread ease-of-use of tablet devices and the emerging success found in the field of social robotics, we present efforts that focus on designing an autonomous interactive robot that socially interacts with a child using the tablet as a shared medium. The purpose is to foster social interaction through play that is directed by the child, thus moving toward behavior that can be translated outside of the clinical setting.
The storytelling lens in human-computer interaction has primarily focused on personas, design fiction, and other stories crafted by designers, yet informal personal narratives from everyday people, such as storytelling from older adults, have not been considered meaningful data. Storytelling may provide a clear path to conceptualize how technologies such as social robots can support the lives of older or disabled individuals. To explore this, we engaged 28 older adults in a year-long co-design process, examining informal stories told by older adults as a means of generating and expressing technology ideas and needs. This paper presents an analysis of participants' stories around their prior experience with technology, stories shaped by social context, and speculative scenarios for the future of social robots. From this analysis, we present suggestions for social robot design, considerations of older adults' values around technology design, and promotion of participant stories as sources of design knowledge that can shift perspectives on older adults and technology.
Pedagogical agent research has yielded fruitful results in both academic skill learning and meta-cognitive skill acquisition, often studied in instructional or peer-to-peer paradigms. In the past decades, child-centric pedagogical research, which emphasizes the learner's active participation in learning with self-motivation, curiosity, and exploration, has attracted scholarly attention. Studies show that combining child-driven pedagogy with appropriate adult guidance leads to efficient learning and a strengthened feeling of self-efficacy. However, research on using social robots for guidance in child-driven learning remains open and under-explored. In our study, we focus on children's exploration as the vehicle in literacy learning and develop a social robot companion that provides guidance to encourage and motivate children to explore during a storybook reading interaction. To investigate the effect of the robot's explorative guidance, we compare it against a control condition in which children have full autonomy to explore and read the storybooks. We conduct a between-subjects study with 31 children aged 4 to 6, and the results show that children who receive explorative guidance from the social robot exhibit a growing trend of self-exploration. Further, children's self-exploration in the explorative guidance condition is found to be correlated with their learning outcomes. We conclude the study with recommendations for designing social agents to guide children's exploration and future research directions in child-centric AI-assisted pedagogy. • Human-centered computing → Empirical studies in HCI; Empirical studies in interaction design.
Affect understanding capability is essential for social robots to autonomously interact with a group of users in an intuitive and reciprocal way. However, the challenge of multi-person affect understanding comes from not only the accurate perception of each user's affective state (e.g., engagement) but also the recognition of the affect interplay between the members (e.g., joint engagement), which presents as complex but subtle nonverbal exchanges between them. Here we present a novel hybrid framework for identifying a parent-child dyad's joint engagement by combining a deep learning framework with various video augmentation techniques. Using a dataset of parent-child dyads reading storybooks together with a social robot at home, we first train RGB frame- and skeleton-based joint engagement recognition models on datasets augmented with four video augmentation techniques (General Aug, DeepFake, CutOut, and Mixed) to improve joint engagement classification performance. Second, we demonstrate experimental results on the use of the trained models in the robot-parent-child interaction context. Third, we introduce a behavior-based metric for evaluating the learned representations of the models to investigate model interpretability when recognizing joint engagement. This work serves as the first step toward fully unlocking the potential of end-to-end video understanding models pre-trained on large public datasets and augmented with data augmentation and visualization techniques for affect recognition in multi-person human-robot interaction in the wild.
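As a rough illustration of the frame-level augmentation step described above, the sketch below applies mild appearance jitter and a CutOut-style rectangular occlusion to an RGB clip. The specific transforms and parameters are assumptions for illustration, not the paper's exact augmentation pipeline (the DeepFake and Mixed variants are not shown).

```python
# Illustrative sketch of frame-level augmentation for an RGB video clip, in the spirit of
# CutOut-style occlusion; transform choices are assumptions, not the paper's exact pipeline.
import torch
from torchvision import transforms

# clip: (time, channels, height, width) float tensor in [0, 1]
clip = torch.rand(16, 3, 224, 224)

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # mild appearance jitter ("General Aug"-like)
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.2)),     # CutOut-like rectangular occlusion
])

# Apply the augmentation frame by frame and restack into a clip.
augmented = torch.stack([augment(frame) for frame in clip])
```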
2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), Dec 15, 2021
Parent-child nonverbal communication plays a crucial role in understanding their relationships and assessing their interaction styles. However, prior works have seldom studied the exchange of these nonverbal cues between the dyad, focusing instead on isolated cues from one person at a time. In contrast, this work analyzes both parents' and children's individual and dyadic nonverbal behaviors in relation to four relationship characteristics, i.e., child temperament, parenting style, parenting stress, and home literacy environment. We utilize a state-of-the-art feature selection framework on a dataset of 31 parent-child interactions to automatically extract and select a set of temporal nonverbal behaviors as key indicators of the dyad's relationship characteristics. The results show that relationship characteristics were associated with both individuals' and dyads' nonverbal behaviors. This finding highlights the importance of accounting for both individual- and dyad-scale nonverbal behaviors when predicting dyadic relationship characteristics, as well as the potential limitations of utilizing single persons' nonverbal data in isolation. It therefore motivates future work on this topic to take a holistic and relational approach. The dataset and extracted nonverbal data are made public to aid the development of automated detection tools for parent-child relationship characteristics that train on visual recordings of dyadic interactions.
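To make the feature-selection step more concrete, here is a minimal sketch of choosing the temporal nonverbal features most informative of a relationship label. The selector (mutual-information-based SelectKBest), the feature count, and the toy data are assumptions, not the specific framework used in the paper.

```python
# Minimal sketch of selecting temporal nonverbal features that best predict a relationship
# characteristic; the concrete selector, label, and feature counts are illustrative assumptions.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(31, 40))        # 31 dyads x 40 temporal nonverbal behavior features
y = rng.integers(0, 2, size=31)      # e.g., a binarized parenting-stress label

selector = SelectKBest(score_func=mutual_info_classif, k=8)
X_selected = selector.fit_transform(X, y)                 # keep the 8 most informative behaviors
top_features = np.argsort(selector.scores_)[::-1][:8]     # indices of those behaviors
```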
In this paper, we discuss a methodology to extract play primitives, defined as a sequence of low-level motion behaviors identified during a playing action, such as stacking or inserting a toy. Our premise is that if a robot could interpret the basic movements of a human's play, it would be able to interact with many different kinds of toys, in conjunction with its human playmate. As such, we present a method that combines motion behavior analysis and behavior sequencing, which capitalizes on the inherent characteristics found in the dynamics of play, such as the limited domain of the objects and manipulation skills required. In this paper, we give details on the approach and present results from applying the methodology to a number of play scenarios.
Users can provide valuable insights for designing new technologies like social robots, given the right tools and methodologies. Challenges in inviting users as co-designers of social robots stem from the lack of guidelines or methodologies to (1) organize co-design processes and/or (2) engage with people long-term to develop technologies together. The main contribution of this work is a set of guidelines for how other researchers can adopt long-term co-design, informed by a 12-month co-design with older adults designing a social robot. We leveraged human-centered, tactile, and experiential design activities, including participatory design, based upon the following design principles: scenario-specific exploration, long-term lived experiences, supporting multiple design activities, cultivating relationships, and employing divergent and convergent processes. We present seven different sessions across three stages as examples of this methodology that build on each other to engage users as co-designers, successfully deployed in a co-design project of home social robots with 28 older adults. Lastly, we detail 10 long-term divergent-convergent co-design guidelines for designing social robots. We demonstrate the value of leveraging people's lived technology experiences and co-design activities to generate actionable social robot design guidelines, advocating for more applications of the methodology in broader contexts as well.
Human-robot interaction can be regarded as a flow between users and robots. Designing good interaction flows takes a lot of effort and needs to be field tested. Unfortunately, the interaction flow design process is often very disjointed, with users experiencing prototypes, designers forming those prototypes, and developers implementing them as independent processes. In this paper, we present the Interaction Flow Editor (IFE), a new human-robot interaction prototyping tool that enables everyday users to create and modify their own interactions, while still providing a full suite of features that is powerful enough for developers and designers to create complex interactions. We also discuss the Flow Engine, a flexible and adaptable framework for executing robot interaction flows authored through the IFE. Finally, we present case study results that demonstrate how older adults, aged 70 and above, can design and iterate interactions in real time on a robot using the IFE.
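To give a sense of what an authored interaction flow could look like when executed by an engine, here is a small illustrative sketch. The flow schema, node fields, and the tiny run loop are assumptions for illustration, not the actual IFE or Flow Engine format.

```python
# Illustrative sketch of a declarative interaction "flow" and a tiny engine loop that executes it;
# the schema and node names are assumptions, not the IFE/Flow Engine format.
flow = {
    "start": {"say": "Hi! Want to hear a joke?", "next": "answer"},
    "answer": {"listen": ["yes", "no"], "branches": {"yes": "joke", "no": "goodbye"}},
    "joke": {"say": "Why did the robot cross the road? It was programmed to!", "next": "goodbye"},
    "goodbye": {"say": "See you later!", "next": None},
}

def run(flow, get_user_reply):
    node_name = "start"
    while node_name is not None:
        node = flow[node_name]
        if "say" in node:
            print(f"ROBOT: {node['say']}")
        if "listen" in node:
            reply = get_user_reply()                      # e.g., a speech recognition result
            node_name = node["branches"].get(reply, "goodbye")
        else:
            node_name = node.get("next")

run(flow, get_user_reply=lambda: "yes")
```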
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence
Intelligent tutoring systems (ITS) provide educational benefits through one-on-one tutoring by assessing children's existing knowledge and providing tailored educational content. In the domain of language acquisition, several studies have shown that children often learn new words by forming semantic relationships with words they already know. In this paper, we present a model that uses word semantics (semantics-based model) to make inferences about a child's vocabulary from partial information about their existing vocabulary knowledge. We show that the proposed semantics-based model outperforms models that do not use word semantics (semantics-free models) on average. A subject-level analysis of results reveals that different models perform well for different children, thus motivating the need to combine predictions. To this end, we use two methods to combine predictions from semantics-based and semantics-free models and show that these methods yield better predictions of a child's vocabulary.
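As a toy illustration of semantics-based inference, the sketch below scores an unassessed word by its embedding similarity to words the child is already known to know. The embeddings, the scoring rule, and the word list are assumptions for illustration, not the paper's model.

```python
# Illustrative sketch of semantics-based vocabulary inference: an unseen word is scored by its
# similarity to words the child already knows. Toy embeddings and scoring rule are assumptions.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-d "word embeddings"; a real system would load pretrained vectors.
embeddings = {
    "dog":    np.array([0.90, 0.10, 0.00, 0.20]),
    "cat":    np.array([0.80, 0.20, 0.10, 0.10]),
    "puppy":  np.array([0.85, 0.15, 0.05, 0.20]),
    "galaxy": np.array([0.00, 0.10, 0.90, 0.70]),
}

known_words = ["dog", "cat"]          # partial knowledge observed from an assessment

def knows_score(word: str) -> float:
    # Semantic neighbors of known words get higher predicted scores of being known.
    return max(cosine(embeddings[word], embeddings[k]) for k in known_words)

print(knows_score("puppy"))    # high: semantically close to known words
print(knows_score("galaxy"))   # low: semantically distant
```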
2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), 2021
Self-disclosure is an important part of the mental health treatment process. As interactive technologies become more widely available, many AI agents for mental health prompt their users to self-disclose as part of the intervention activities. However, most existing works focus on linguistic features to classify self-disclosure behavior, and do not utilize other multi-modal behavioral cues. We present analyses of people's non-verbal cues (vocal acoustic features, head orientation, and body gestures/movements) exhibited during self-disclosure tasks, based on the human-robot interaction data collected in our previous work. Results from the classification experiments suggest that prosody, head pose, and body postures can be independently used to detect self-disclosure behavior with high accuracy (up to 81%). Moreover, behavioral cues indicating positive emotions, high engagement, self-soothing, and positive attitudes were found to be positively correlated with self-disclosure. Insights from our work can help build a self-disclosure detection model that can be used in real time during multi-modal interactions between humans and AI agents.
• This paper explores the broadest array of high-level temporal behavioral features to date associated with self-disclosure, spanning speech prosody, head pose, and body gestures.
• This paper is also the first to undertake a detailed and comprehensive approach to provide interpretation and insights into these behaviors to support modeling transparency.
• To the best of our knowledge, we are the first to evaluate this approach in a human-robot interaction context.
2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 2020
Conversational AI agents are proliferating, embodying a range of devices such as smart speakers, smart displays, robots, cars, and more. We can envision a future where a personal conversational agent could migrate across different form factors and environments to always accompany and assist its user, supporting a far more continuous, personalized, and collaborative experience. This opens the question of which properties of a conversational AI agent migrate across forms, and how they would impact user perception. To explore this, we developed a Migratable AI system in which a user's information and/or the agent's identity can be preserved as it migrates across form factors to help its user with a task. We validated the system by designing a 2x2 between-subjects study to explore the effects of information migration and identity migration on user perceptions of trust, competence, likeability, and social presence. Our results suggest that identity migration had a positive effect on trust, competence, and social presence, while information migration had a positive effect on trust, competence, and likeability. Overall, users reported the highest trust, competence, likeability, and social presence towards the conversational agent when both identity and information were migrated across embodiments.
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021
Recent state-of-the-art approaches in open-domain dialogue include training end-to-end deep-learning models to learn various conversational features, such as the emotional content of a response, symbolic transitions of dialogue contexts in a knowledge graph, and the personas of the agent and the user, among others. While neural models have shown reasonable results, modelling the cognitive processes that humans use when conversing with each other may improve the agent's quality of responses. A key element of natural conversation is to tailor one's response such that it accounts for concepts that the speaker and listener may or may not know, and the contextual relevance of all prior concepts used in the conversation. We show that a rich representation and explicit modeling of these psychological processes can improve predictions made by existing neural network models. In this work, we propose a novel probabilistic approach using Markov Random Fields (MRF) to augment existing deep-learning methods for improved next utterance prediction. Using human and automatic evaluations, we show that our augmentation approach significantly improves the performance of existing state-of-the-art retrieval models for open-domain conversational agents.
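The sketch below illustrates the general flavor of augmenting a neural retrieval model's scores with a concept-grounded potential before reranking candidate responses. The overlap potential, blending weight, and toy candidates are assumptions; the paper's MRF formulation is richer than this simple blend.

```python
# Minimal sketch of augmenting retrieval scores with a pairwise "concept relevance" potential,
# in the spirit of combining neural scores with an MRF-style term; weights are assumptions.
def concept_overlap(context_concepts: set, candidate_concepts: set) -> float:
    if not candidate_concepts:
        return 0.0
    return len(context_concepts & candidate_concepts) / len(candidate_concepts)

def rerank(candidates, context_concepts, neural_scores, alpha=0.7):
    """Blend each neural retrieval score with a concept-grounded potential, then sort."""
    scored = []
    for cand, s in zip(candidates, neural_scores):
        potential = concept_overlap(context_concepts, cand["concepts"])
        scored.append((alpha * s + (1 - alpha) * potential, cand["text"]))
    return sorted(scored, reverse=True)

candidates = [
    {"text": "I love hiking in the mountains.", "concepts": {"hiking", "mountains"}},
    {"text": "The weather is nice today.", "concepts": {"weather"}},
]
print(rerank(candidates, {"hiking", "trail", "mountains"}, neural_scores=[0.55, 0.60]))
```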
Proceedings of the 2020 International Conference on Multimodal Interaction, 2020
Automatic speech-based affect recognition of individuals in dyadic conversation is a challenging task, in part because of its heavy reliance on manual pre-processing. Traditional approaches frequently require hand-crafted speech features and segmentation of speaker turns. In this work, we design end-to-end deep learning methods to recognize each person's affective expression in an audio stream with two speakers, automatically discovering features and time regions relevant to the target speaker's affect. We integrate a local attention mechanism into the end-to-end architecture and compare the performance of three attention implementations: one mean pooling and two weighted pooling methods. Our results show that the proposed weighted-pooling attention solutions are able to learn to focus on the regions containing the target speaker's affective information and successfully extract the individual's valence and arousal intensity. Here we introduce and use a "dyadic affect in multimodal interaction - parent to child" (DAMI-P2C) dataset collected in a study of 34 families, where a parent and a child (3-7 years old) engage in reading storybooks together. In contrast to existing public datasets for affect recognition, each instance for both speakers in the DAMI-P2C dataset is annotated for perceived affect by three labelers. To encourage more research on the challenging task of multi-speaker affect sensing, we make the annotated DAMI-P2C dataset publicly available, including acoustic features of the dyads' raw audios, affect annotations, and a diverse set of developmental, social, and demographic profiles of each dyad.
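For intuition on the weighted-pooling attention variants mentioned above, here is a minimal PyTorch sketch that learns per-frame weights over acoustic features and pools them into a clip-level vector for valence/arousal regression. Dimensions and layer choices are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of weighted-pooling attention over frame-level audio features for predicting
# a target speaker's valence/arousal; dimensions and layers are illustrative assumptions.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Learn per-frame weights and pool a feature sequence into one clip-level vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) frame-level acoustic features
        weights = torch.softmax(self.scorer(frames), dim=1)   # (batch, time, 1)
        return (weights * frames).sum(dim=1)                  # (batch, dim)

features = torch.randn(4, 500, 64)                # ~5 s of frame features per clip
pooled = AttentivePooling(64)(features)
valence_arousal = nn.Linear(64, 2)(pooled)        # regress valence and arousal intensity
```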
A preliminary implementation of a robotic interface for the administration of language, literacy, and speech pathology assessments for children is presented. This robot assessment protocol will be used for several ongoing studies to improve the performance of educational robots for children. The robot used is JIBO, a personal assistant-style robot capable of expressing itself with its poseable body. JIBO's implementation is intended for children as young as 4 years old. JIBO is designed to have friendly interactions with young children while administering assessments such as the evaluation of pronunciation, alphabetic knowledge, and explanatory discourse. Additionally, this implementation is currently being used to collect a speech database of such assessments being administered to children.
Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction, 2021
Intent recognition models, which match a written or spoken input's class in order to guide an interaction, are an essential part of modern voice user interfaces, chatbots, and social robots. However, getting enough data to train these models can be very expensive and challenging, especially when designing novel applications such as real-world human-robot interactions. In this work, we first investigate how much training data is needed for high performance in an intent classification task. We train and evaluate BiLSTM and BERT models on various subsets of the ATIS and Snips datasets. We find that only 25 training examples per intent are required for our BERT model to achieve 94% intent accuracy compared to 98% with the entire datasets, challenging the belief that large amounts of labeled data are required for high performance in intent recognition. We apply this knowledge to train models for a real-world HRI application, character strength recognition during a positive psychology interaction with a social robot, and evaluate against the Character Strength dataset collected in our previous HRI study. Our real-world HRI application results also confirm that our model can produce 76% intent accuracy with 25 examples per intent compared to 80% with 100 examples. In a real-world scenario, the difference is only one additional error per 25 classifications. Finally, we investigate the limitations of our minimal data models and offer suggestions on developing high-quality datasets. We conclude with practical guidelines for training BERT intent recognition models with minimal training data and make our code and evaluation framework available for others to replicate our results and easily develop models for their own applications. • Computing methodologies → Natural language processing; • Human-centered computing → Systems and tools for interaction design.
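As a rough sketch of fine-tuning a BERT intent classifier with a small number of examples per intent (in the spirit of the 25-example setting above), the snippet below uses the Hugging Face transformers API with toy data. The model choice, hyperparameters, and data handling are assumptions, not the paper's released code.

```python
# Minimal sketch of fine-tuning a BERT classifier for intent recognition with few examples per
# intent; the toy data, model choice, and hyperparameters are assumptions for illustration.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

intents = ["book_flight", "play_music"]
examples = [("book me a flight to boston", 0), ("play some jazz", 1)]  # ~25 per intent in practice

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(intents))
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

loader = DataLoader(examples, batch_size=2, shuffle=True)
model.train()
for epoch in range(3):
    for texts, labels in loader:
        batch = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
        loss = model(**batch, labels=labels).loss   # cross-entropy over intent classes
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```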
Proceedings of the 17th ACM Conference on Interaction Design and Children, 2018
Intelligent toys and smart devices are becoming ubiquitous in children's homes. As such, it is imperative to understand how these computational objects impact children's development. Children's attribution of intelligence relates to how they perceive the behavior of these agents. However, their underlying reasoning is not well understood. To explore this, we invited 30 pairs of children (4-10 years old) and their parents to assess the intelligence of mice, robots, and themselves in a maze-solving activity. Participants watched videos of mice and robots solving a maze. Then, they solved the maze by remotely navigating a robot. Solving the maze enabled participants to gain insight into the agent's mind by referencing their own experience. Children and their parents gave similar answers for whether the mouse or the robot was more intelligent and used a wide variety of explanations. We also observed developmental differences in children's references to agents' social-emotional attributes, strategies, and performance.