Papers by Kazunori Komatani

Lecture Notes in Computer Science, 2019
Tensor factorization has become an increasingly popular approach to knowledge graph completion (KGC), the task of automatically predicting missing facts in a knowledge graph. However, even with a simple model such as CANDECOMP/PARAFAC (CP) tensor decomposition, KGC on existing knowledge graphs is impractical in resource-limited environments, as a large amount of memory is required to store parameters represented as 32-bit or 64-bit floating-point numbers. This limitation is expected to become more severe as existing knowledge graphs, which are already huge, keep steadily growing in scale. To reduce the memory requirement, we present a method for binarizing the parameters of the CP tensor decomposition by introducing a quantization function into the optimization problem. This method replaces floating-point-valued parameters with binary ones after training, which drastically reduces the model size at run time. We investigate the trade-off between the quality and size of tensor factorization models on several KGC benchmark datasets. In our experiments, the proposed method successfully reduced the model size by more than an order of magnitude while maintaining task performance. Moreover, a fast score computation technique can be developed with bitwise operations.
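
As a rough illustration of how binarized CP parameters admit bitwise scoring, the sketch below assumes sign binarization into {-1, +1} vectors (the paper's actual quantization function may differ). Packing the signs into integers turns the CP triple product into one XOR plus a popcount; all names here are hypothetical.

    import numpy as np

    def binarize(v):
        """Pack a real-valued embedding into an integer bit mask (bit k set
        iff v[k] < 0). Sign binarization is an assumption; the paper
        introduces a quantization function into the training objective."""
        bits = 0
        for k, x in enumerate(v):
            if x < 0:
                bits |= 1 << k
        return bits

    def packed_cp_score(s_bits, r_bits, o_bits, dim):
        """CP score sum_k s_k * r_k * o_k over {-1, +1}^dim: the product of
        three signs is -1 iff an odd number of them are negative, so one XOR
        and a popcount suffice."""
        return dim - 2 * (s_bits ^ r_bits ^ o_bits).bit_count()

    # Sanity check against the floating-point computation
    # (int.bit_count() requires Python 3.10+).
    rng = np.random.default_rng(0)
    s, r, o = (np.sign(rng.standard_normal(64)) for _ in range(3))
    assert packed_cp_score(binarize(s), binarize(r), binarize(o), 64) == int(np.sum(s * r * o))
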
This paper presents voice-awareness control related to a robot's head movements. Our control is based on a new model of spectral envelope modification for vertical head motions and left-right balance modulation for horizontal head motions. The spectral envelope modification model is based on an analysis of human vocalizations. The left-right balance model is established by measuring impulse responses with a pair of microphones. Experimental results show that the voice-awareness is perceivable in a robot-to-robot dialogue when the robots stand 50 cm apart. We also confirmed that the perceived voice-awareness declines as the distance increases, up to 150 cm.
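
A minimal sketch of the horizontal component of such control, assuming a constant-power pan law as a stand-in for the paper's measured impulse-response-based left-right balance model (function name and constants are illustrative):

    import numpy as np

    def pan_by_head_yaw(mono, yaw_deg):
        """Attenuate the channel opposite to the head direction; yaw_deg < 0
        means the head is turned to the left. A constant-power pan law stands
        in for the measured impulse responses."""
        p = (np.radians(np.clip(yaw_deg, -90.0, 90.0)) + np.pi / 2) / 2
        return np.stack([mono * np.cos(p), mono * np.sin(p)], axis=-1)
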

This paper presents a quantitative modeling of referential coherence by which conversation systems measure the smoothness of discourse. Investigations of corpora show that referential coherence depends on the language and genre of the discourse. Our goal is to establish a quantitative model that can be statistically adapted to various languages. Centering theory explains referential coherence by using heuristic rules. Since these heuristics must be devised manually for each language, we need a quantitative/statistical model that can be obtained from a corpus. The meaning-game-based centering model (MGCM) (Shiramatsu et al., 2005) quantitatively reformulates centering theory by exploiting quantitative aspects of game theory. It quantifies referential coherence using two parameters related to salience and pronominalization: "reference probability" and "perceptual utility". However, MGCM still has two problems: perceptual utility cannot be statistically adapted to various languages, and MGCM has only been verified for Japanese. We have enhanced the model by statistically defining perceptual utility, specifically by using the occurrence frequency of the referential expression in a corpus. Experimental results on English and Japanese corpora showed that our statistical definitions enabled the parameters to be adapted to both corpora. Furthermore, statistical tests of the enhanced MGCM showed its validity for both corpora.

Table 1: Transition types

                              Cb(Ui+1) = Cb(Ui)    Cb(Ui+1) ≠ Cb(Ui)
    Cb(Ui+1) = Cp(Ui+1)       CONTINUE             SMOOTH-SHIFT
    Cb(Ui+1) ≠ Cp(Ui+1)       RETAIN               ROUGH-SHIFT

Here, Cp(Ui), the preferred center of utterance Ui, is the highest-ranked element of Cf(Ui).

2. Problems with Conventional Studies

This section describes the issues with conventional CT and MGCM.

2.1. Centering Theory

CT handles a discourse as a sequence of utterance units [U1, U2, ..., Un]. CT explains the referential expressions between Ui+1 and Ui by using heuristics: a salience ranking of grammatical roles (Cf-ranking) (Walker et al., 1994), three constraints on the "center", and two rules about referential coherence (Poesio et al., 2004).

English Cf-ranking: subject > object > indirect object > complement > adjunct
Japanese Cf-ranking: topic (zero or grammatical) > subject > indirect object > object > others

Cb(Ui): the backward-looking center of Ui.
Cf(Ui): the forward-looking centers of Ui.
Constraint 1: All utterances of a segment except the first have at most one Cb.
Constraint 2: Every element of Cf(Ui) must be realized in Ui.
Constraint 3: Cb(Ui) is the highest-ranked element of Cf(Ui-1) that is realized in Ui.
Rule 1 (Pronominalization rule): If any element of Cf(Ui) is pronominalized, Cb(Ui) is also pronominalized.
Rule 2 (Transition rule): Transition types are ordered CONTINUE > RETAIN > SMOOTH-SHIFT > ROUGH-SHIFT (see also Table 1).
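
Table 1's classification is mechanical enough to state directly as code. The following sketch (Python; names are hypothetical) classifies a transition from the backward-looking and preferred centers:

    def transition_type(cb_prev, cb_next, cp_next):
        """Classify the centering transition between Ui and Ui+1 per Table 1.
        cb_prev = Cb(Ui), cb_next = Cb(Ui+1), cp_next = Cp(Ui+1)."""
        if cb_next == cb_prev:
            return "CONTINUE" if cb_next == cp_next else "RETAIN"
        return "SMOOTH-SHIFT" if cb_next == cp_next else "ROUGH-SHIFT"

    # Illustrative call: the center shifts to a new entity that is also the
    # preferred center of the new utterance.
    print(transition_type("Mary", "John", "John"))  # -> SMOOTH-SHIFT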

Proceedings of the AAAI Conference on Artificial Intelligence
Our goal is to develop an interactive music robot, i.e., a robot that presents a musical expression together with humans. Music interaction requires two important functions: synchronization with the music and musical expression, such as singing and dancing. Many instrument-performing robots are only capable of the latter function, so they may have difficulty playing live with human performers; the synchronization function is critical for the interaction. We classify synchronization and musical expression into two levels: (1) the rhythm level and (2) the melody level. Two issues in achieving two-level synchronization and musical expression are: (1) simultaneous estimation of the rhythm structure and the current part of the music and (2) derivation of the estimation confidence to switch behavior between the rhythm level and the melody level. This paper presents a score following algorithm, incremental audio-to-score alignment, that conforms to the two-level synchronization design us...
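
A toy sketch of the confidence-based switching described above, with an illustrative threshold (the paper derives the confidence within the alignment algorithm itself):

    def select_level(alignment_confidence, threshold=0.7):
        """Behave at the melody level (follow the estimated current part of
        the score) when the score-following confidence is high; otherwise
        fall back to the rhythm level (e.g., beat-synchronized behavior).
        The threshold value is illustrative."""
        return "melody" if alignment_confidence >= threshold else "rhythm"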

This paper describes a method that recognizes musical chords from real-world audio signals in compact-disc recordings. The automatic recognition of musical chords is necessary for music information retrieval (MIR) systems, since the chord sequences of musical pieces capture the characteristics of their accompaniments. None of the previous methods can accurately recognize musical chords from complex audio signals that contain vocal and drum sounds. The main problem is that the chord-boundary-detection and chord-symbol-identification processes are inseparable because of their mutual dependency. To solve this mutual dependency problem, our method generates hypotheses about tuples of chord symbols and chord boundaries, and outputs the most plausible one as the recognition result. The certainty of a hypothesis is evaluated based on three cues: acoustic features, chord progression patterns, and bass sounds. Experimental results show that our method successfully recognized chords in seven popular-music songs; the average accuracy was around 77%.
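
One way to read the hypothesis evaluation is as a weighted combination of per-cue log-probabilities; the log-linear form and the weights below are assumptions, since the abstract only names the three cues:

    from dataclasses import dataclass

    @dataclass
    class Hypothesis:
        chords: list        # chord symbols, e.g. ["C", "Am", "F", "G"]
        boundaries: list    # frame indices of the chord boundaries

    def hypothesis_certainty(acoustic_lp, progression_lp, bass_lp,
                             w=(1.0, 1.0, 1.0)):
        """Combine the three cues (acoustic features, chord progression
        patterns, bass sounds) as a weighted sum of log-probabilities."""
        return w[0] * acoustic_lp + w[1] * progression_lp + w[2] * bass_lp
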

This paper describes a method for automatic singer identification from polyphonic musical audio signals that include the sounds of various instruments. Because singing voices play an important role in musical pieces with a vocal part, the identification of singer names is useful for music information retrieval systems. The main problem in automatically identifying singers is the negative influence of accompaniment sounds. To solve this problem, we developed two methods, accompaniment sound reduction and reliable frame selection. The former makes it possible to identify the singer of a singing voice after reducing accompaniment sounds: it first extracts harmonic components of the predominant melody from sound mixtures and then resynthesizes the melody by using a sinusoidal model driven by those components. The latter then judges whether each frame of the obtained melody is reliable (i.e., little influenced by accompaniment sounds) by using two Gaussian mixture models for vocal and non-vocal frames. This enables singer identification using only the reliable vocal portions of musical pieces. Experimental results with forty popular-music songs by ten singers showed that our method was able to reduce the influence of accompaniment sounds and achieved an accuracy of 95%, while the accuracy of a conventional method was 53%.
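
A minimal sketch of reliable frame selection using scikit-learn GMMs, assuming pre-trained vocal/non-vocal models and per-frame feature vectors (the margin threshold is an assumption):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def select_reliable_frames(feats, gmm_vocal, gmm_nonvocal, margin=0.0):
        """Keep frames whose vocal-GMM log-likelihood exceeds the non-vocal
        one by at least `margin`; only these frames feed singer
        identification."""
        llv = gmm_vocal.score_samples(feats)      # per-frame log-likelihood
        lln = gmm_nonvocal.score_samples(feats)
        return feats[llv - lln > margin]
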

Reliable predictability, which is tightly connected to the consistency of environmental changes, is one of the main factors that determine human behaviors. As a constructive approach to understanding this mechanism, the authors have developed a method for generating autonomous object-pushing motions based on the consistency of object motions using a humanoid robot. The method consists of constructing a dynamics prediction model using a Recurrent Neural Network with Parametric Bias (RNNPB) and searching for motions based on an object-consistency evaluation function using the steepest descent method. First, the RNNPB is trained using the observed object dynamics and robot motion sequences acquired during active sensing with objects. Next, the steepest descent method is applied to search for a reliably predictable motion through the constructed dynamics model. Finally, the obtained motion is linked to the initial object image using a hierarchical neural network. The model inputs the object image outputting ...
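
A generic sketch of the motion search step, treating the trained RNNPB predictor as a black-box consistency loss and using finite-difference gradients (all names and constants are illustrative, not the paper's implementation):

    import numpy as np

    def search_motion(loss, m0, lr=0.05, steps=200, eps=1e-4):
        """Steepest-descent search for a reliably predictable motion; `loss`
        should score how inconsistent the dynamics model's predictions are
        for motion parameters m."""
        m = np.array(m0, dtype=float)
        for _ in range(steps):
            g = np.array([(loss(m + eps * e) - loss(m - eps * e)) / (2 * eps)
                          for e in np.eye(m.size)])
            m -= lr * g
        return m
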
Analyzing user utterances in barge-in-able spoken dialogue system for improving identification accuracy
Interspeech 2010, 2010

Improving speech recognition of two simultaneous speech signals by integrating ICA BSS and automatic missing feature mask generation
Interspeech 2006, 2006
Robot audition systems require capabilities for sound source separation and the recognition of separated sounds, since we hear mixtures of sounds, especially mixtures of speech, in our daily lives. We report a robot audition system with a pair of omni-directional microphones embedded in a humanoid that recognizes two simultaneous talkers. It first separates the sound sources by Independent Component Analysis (ICA) with the single-input multiple-output (SIMO) model. Spectral distortion in the separated sounds is then estimated to generate missing feature masks. Finally, the separated sounds are recognized by missing-feature theory (MFT) for automatic speech recognition (ASR). The novel aspect of our system is the estimation of spectral distortion in the temporal-frequency domain in terms of feature vectors, based on estimated errors in the SIMO-ICA signals. The resulting system outperformed the baseline robot audition system by 7%. Index Terms: robot audition, ICA, missing-feature mask.
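
For intuition, an instantaneous-mixture toy with scikit-learn's FastICA, a simplification of the SIMO-ICA used in the paper (real room mixtures are convolutive and are typically separated per frequency bin):

    import numpy as np
    from sklearn.decomposition import FastICA

    # Synthetic stand-in for two talkers captured by two microphones.
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, 16000)
    s1 = np.sign(np.sin(2 * np.pi * 5 * t))   # "talker" 1
    s2 = rng.laplace(size=t.size)             # "talker" 2
    mix = np.c_[s1, s2] @ np.array([[0.8, 0.3], [0.4, 0.9]])

    # Unmix the two-channel recording into two source estimates.
    sources = FastICA(n_components=2, random_state=0).fit_transform(mix)
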
Analyzing temporal transition of real user's behaviors in a spoken dialogue system
Interspeech 2007, 2007

Interspeech 2008, 2008
This paper addresses automatic soft missing-feature mask (MFM) generation based on leak energy estimation for a simultaneous speech recognition system. An MFM is used as a weight for probability calculation in the recognition process. In previous work, a threshold-based zero-or-one function was applied to decide whether each spectral parameter is reliable for each frequency bin. We extend this function into a weighted sigmoid function with two free parameters. In addition, a contribution ratio of static features is introduced for the probability calculation in a recognition process in which both static and dynamic features are input; this ratio can be implemented as part of the soft mask. The average recognition rate based on the soft MFM improved by about 5% for all directions over a conventional system based on a hard MFM. Word recognition rates improved from 70% to 80% for peripheral talkers and from 93% to 97% for front speech when the speakers were 90 degrees apart.
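
The extension from a hard to a soft mask can be written down directly. The sketch below assumes the per-bin reliability input is an estimated signal-to-leak ratio, with a (slope) and b (bias) as the two free parameters mentioned above:

    import numpy as np

    def soft_mfm(reliability, a=1.0, b=0.0):
        """Weighted sigmoid soft missing-feature mask in [0, 1]; a hard
        zero-or-one mask is recovered in the limit of a steep slope, with b
        playing the role of the old threshold."""
        return 1.0 / (1.0 + np.exp(-a * (reliability - b)))
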
An Error Correction Framework Based on Drum Pattern Periodicity for Improving Drum Sound Detection
2006 IEEE International Conference on Acoustics Speed and Signal Processing Proceedings

Our aim is to acquire the attributes of concepts denoted by unknown words from users during dialogues. A word unknown to a spoken dialogue system can appear in user utterances, and the system should be capable of acquiring information about it from the conversation partner as a kind of self-learning process. As a first step, we propose a method for generating questions that are more specific than simple wh-questions to acquire the attributes, as such questions narrow down the variation of the following user response and accordingly help avoid speech recognition errors. Specifically, we obtain an appropriately distributed confidence measure (CM) on the attributes to generate more specific questions. Two basic CMs are defined using (1) character and word distributions in the target database and (2) the frequency of occurrence of restaurant attributes on Web pages. These are integrated to complement each other and used as the final CM. We evaluated the distributions of the CMs by their average errors from the reference. Results showed that the integrated CM outperformed the two basic CMs.
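
The integration step admits a very small sketch; linear interpolation with weight lam is an assumption, as the abstract states only that the two basic CMs are combined to complement each other:

    def integrated_cm(cm_database, cm_web, lam=0.5):
        """Final confidence measure on an attribute, combining the
        database-derived CM and the Web-frequency-derived CM."""
        return lam * cm_database + (1.0 - lam) * cm_web
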
Changing timbre and phrase in existing musical performances as you like
Proceedings of the 17th ACM international conference on Multimedia, 2009

Recognition of Simultaneous Speech by Estimating Reliability of Separated Signals for Robot Audition
Lecture Notes in Computer Science, 2006
“Listening to several things at once” is a long-standing human dream and one goal of AI and robot audition, because, according to psychophysical observations, people can listen to at most two things at once. Current noise reduction techniques cannot help achieve this goal because they assume quasi-stationary noises, not interfering speech signals. Since robots are used in various environments, robot audition systems require minimal a priori information about their acoustic environments and speakers. We evaluate a missing feature theory approach that interfaces between sound source separation (SSS) and automatic speech recognition. The essential part is the estimation of the reliability of each feature of the separated sounds. We tested two kinds of robot audition systems that use SSS: independent component analysis (ICA) with two microphones, and geometric source separation (GSS) with eight microphones. For each SSS method, automatic missing feature mask generation was developed. The recognition accuracy of two simultaneous speech signals improved to averages of 67.8% and 88.0% for ICA and GSS, respectively.
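
The core of the missing-feature interface can be sketched as mask-weighted likelihood scoring, a common MFT formulation (the paper's exact decoder integration may differ):

    import numpy as np

    def mft_log_likelihood(per_feature_ll, mask):
        """Weight each spectral feature's log-likelihood by its reliability
        mask (1 = clean, 0 = dominated by interference) and sum over the
        feature vector."""
        return float(np.sum(np.asarray(mask) * np.asarray(per_feature_ll)))
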
Transactions of the Japanese Society for Artificial Intelligence, 2005
Research on human-robot interaction is receiving an increasing amount of attention. Because most research has dealt with communication between one robot and one person, few researchers have studied communication between a robot and multiple people. This paper presents a method that enables robots to communicate with multiple people using the "selection priority of the interactive partner" based on the concept of Proxemics. In this method, a robot changes its active sensory-motor modalities according to the interaction distance between itself and a person. Our method was implemented in a humanoid robot, SIG2, which has various sensory-motor modalities for interacting with humans. A demonstration showed that our method selected an appropriate interaction partner during interaction with multiple people.
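
A toy version of distance-dependent modality selection; the zone boundaries follow Hall's Proxemics categories, but the modality sets are illustrative rather than SIG2's actual configuration:

    def active_modalities(distance_m):
        """Select sensory-motor modalities by interaction distance."""
        if distance_m < 1.2:    # intimate/personal distance
            return ["speech", "gaze", "gesture"]
        if distance_m < 3.6:    # social distance
            return ["speech", "gaze"]
        return ["gaze"]         # public distance: first attract attention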

2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007
This paper describes a sound source separation method for polyphonic sound mixtures of music, used to build an instrument equalizer that remixes multiple tracks separated from compact-disc recordings by changing the volume level of each track. Although such mixtures usually include both harmonic and inharmonic sounds, the difficulties in dealing with both types of sounds together have not been addressed by most previous methods, which have focused on one of the two types. We therefore developed an integrated weighted-mixture model consisting of both harmonic-structure and inharmonic-structure tone models (generative models of the power spectrogram). On the basis of MAP estimation using the EM algorithm, we estimated all parameters of this integrated model under several original constraints for preventing over-training and maintaining intra-instrument consistency. Using standard MIDI files as prior information for the model parameters, we applied this model to compact-disc recordings and realized the instrument equalizer.
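
Once per-track signals have been separated, the equalizer itself reduces to a per-track gain and a sum; a minimal sketch (function name hypothetical):

    import numpy as np

    def remix(tracks, gains):
        """Instrument equalizer output: weighted sum of separated tracks.
        tracks: (n_tracks, n_samples); gains: per-track volume levels."""
        g = np.asarray(gains, dtype=float)[:, None]
        return np.sum(g * np.asarray(tracks, dtype=float), axis=0)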

Bus Information System Based on User Models and Dynamic Generation of VoiceXML Scripts
Lecture Notes in Computer Science
We have developed a telephone-based cooperative natural language dialogue system. Since natural language involves a wide variety of expressions, a large number of VoiceXML scripts would need to be prepared to handle all possible input patterns. We therefore realize flexible dialogue management for various user utterances by generating VoiceXML scripts dynamically. Moreover, we address the issue of appropriate user modeling to generate cooperative responses. Specifically, three dimensions of user models are set up: the user's skill level with the system, knowledge level of the target domain, and degree of hastiness. The models are automatically derived by decision tree learning using real dialogue data collected by the system. Experimental evaluation showed that the cooperative responses adapted to individual users served as good guides for novices without increasing the dialogue duration for skilled users.
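
A toy sketch of deriving one user-model dimension (skill level) by decision tree learning with scikit-learn; the features and labels are hypothetical stand-ins for the dialogue statistics such a system would log:

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical per-user features: [barge-in rate, avg. ASR score,
    # avg. response delay in seconds]; labels from annotated dialogues.
    X = [[0.8, 0.9, 1.2], [0.1, 0.6, 4.0], [0.7, 0.8, 1.5], [0.0, 0.5, 5.5]]
    y = ["skilled", "novice", "skilled", "novice"]

    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(clf.predict([[0.2, 0.55, 4.2]]))  # -> ['novice']
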
International Conference on Acoustics, Speech, and Signal Processing, 2006
This paper describes a new technique for recognizing musical instruments in polyphonic music. Because the conventional framework for musical instrument recognition in polyphonic music had to estimate the onset time and fundamental frequency (F0) of each note, instrument recognition suffered directly from errors in onset detection and F0 estimation. Unlike such a note-based processing framework, our technique calculates the temporal
Real-Time Auditory and Visual Talker Tracking Through Integrating EM Algorithm and Particle Filter
Lecture Notes in Computer Science