For the first time in this paper we present results showing the effect of out-of-plane speaker head pose variation on a lip-biometric-based speaker verification system. Using appearance-based DCT features, we adopt a mutual information analysis technique to highlight the class-discriminant DCT components most robust to changes in out-of-plane pose. Experiments are conducted using the initial phase of a new multi-view audio-visual database designed for research and development of pose-invariant speech and speaker recognition. We show that verification performance can be improved by substituting higher-order horizontal DCT components for vertical ones, particularly in the case of a train/test pose-angle mismatch. We show that the best performance is achieved by combining this alternative feature selection with multi-view training, and report a relative 45% Equal Error Rate reduction over a common energy-based selection.
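As a rough illustration of the kind of analysis described above (not the authors' exact pipeline), the sketch below ranks flattened 2-D DCT lip features by their mutual information with the speaker label; all function names, the ROI block size, and the coefficient counts are assumptions.

```python
# Hypothetical sketch: ranking 2-D DCT lip features by mutual information
# with speaker identity, as a stand-in for the paper's MI analysis.
import numpy as np
from scipy.fft import dctn
from sklearn.feature_selection import mutual_info_classif

def dct_features(roi, keep=44):
    """Flatten the low-order block of the 2-D DCT of a grey-scale lip ROI."""
    coeffs = dctn(roi, norm='ortho')
    return coeffs[:8, :8].flatten()[:keep]   # row-major, block size assumed

def rank_by_mutual_information(X, speaker_labels):
    """Score each DCT component by how much it reveals about speaker identity."""
    mi = mutual_info_classif(X, speaker_labels)
    return np.argsort(mi)[::-1]               # most class-discriminant first

# X: (n_frames, n_coeffs) features pooled across poses; y: speaker IDs
# selected = rank_by_mutual_information(X, y)[:30]
```

Components that keep a high MI score across pose angles would be the ones worth retaining in a pose-mismatched train/test setting.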
This paper presents the maximum weighted stream posterior (MWSP) model as a robust and efficient stream integration method for audiovisual speech recognition in environments where the audio or video streams may be subjected to unknown and time-varying corruption. A significant advantage of MWSP is that it does not require any specific measurements of the signal in either stream to calculate appropriate stream weights during recognition, and as such it is modality-independent. This also means that MWSP complements and can be used alongside many of the other approaches that have been proposed in the literature for this problem. For evaluation we used the large XM2VTS database for speaker-independent audiovisual speech recognition. The extensive tests include both clean and corrupted utterances, with corruption added in either or both of the video and audio streams using a variety of types (e.g., MPEG-4 video compression) and levels of noise. The experiments show that this approach gives excellent performance in comparison to another well-known dynamic stream weighting approach, and also compared to any fixed-weight integration approach, both in clean conditions and when noise is added to either stream. Furthermore, our experiments show that the MWSP approach dynamically selects suitable integration weights on a frame-by-frame basis according to the level of noise in the streams, and also according to the naturally fluctuating relative reliability of the modalities even in clean conditions. The MWSP approach is shown to maintain robust recognition performance in all tested conditions, while requiring no prior knowledge about the type or level of noise.
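A minimal per-frame sketch of the underlying idea, under the assumption that "maximum weighted stream posterior" means searching a grid of stream weights and keeping the weight whose best-state posterior is largest; the grid resolution and variable names are illustrative, not the paper's specification.

```python
# Hypothetical MWSP-style frame weighting: geometric combination of the two
# stream likelihoods, with the weight chosen to maximise the top posterior.
import numpy as np

def mwsp_weight(log_lik_audio, log_lik_video, grid=np.linspace(0.0, 1.0, 21)):
    """log_lik_audio/video: per-state log-likelihoods for a single frame."""
    best_w, best_post = grid[0], -np.inf
    for w in grid:
        combined = w * log_lik_audio + (1.0 - w) * log_lik_video
        log_post = combined - np.logaddexp.reduce(combined)  # normalise over states
        if log_post.max() > best_post:
            best_post, best_w = log_post.max(), w
    return best_w
```

Because the weight is picked from the model posteriors themselves, no noise estimate from either signal is needed, which matches the modality-independence claimed above.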
In this paper, we present a new approach to visual speech recognition which improves contextual modelling by combining Inter-Frame Dependent and Hidden Markov Models. This approach captures contextual information in visual speech that may be lost using a Hidden Markov Model alone. We apply contextual modelling to a large speaker-independent isolated digit recognition task, and compare our approach to two commonly adopted feature-based techniques for incorporating speech dynamics. Results are presented from baseline feature-based systems and the combined modelling technique. We illustrate that both of these techniques achieve similar levels of performance when used independently. However, significant improvements in performance can be achieved through a combination of the two. In particular, we report a relative Word Error Rate reduction in excess of 17% in comparison to our best baseline system.
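For reference, the most common feature-based technique for incorporating speech dynamics of the kind the baselines above use is delta (regression) coefficients appended to the static features; a minimal sketch follows, with the window size as an assumption.

```python
# Illustrative baseline sketch: HTK-style delta coefficients computed by
# linear regression over a +/-window frame context, appended to statics.
import numpy as np

def add_deltas(feats, window=2):
    """feats: (n_frames, n_dims) static features -> (n_frames, 2*n_dims)."""
    pad = np.pad(feats, ((window, window), (0, 0)), mode='edge')
    num = sum(t * (pad[window + t: len(feats) + window + t] -
                   pad[window - t: len(feats) + window - t])
              for t in range(1, window + 1))
    denom = 2 * sum(t * t for t in range(1, window + 1))
    return np.hstack([feats, num / denom])
```

Model-level inter-frame dependence, by contrast, conditions each frame's emission on its predecessor inside the HMM, which is why the two approaches can contribute complementary gains when combined.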
Hidden Conditional Random Fields for Visual Speech Recognition
In this paper we present the application of hidden conditional random fields (HCRFs) to modelling speech for visual speech recognition. HCRFs may be easily adapted to model long-range dependencies across an observation sequence. As a result visual ...
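For context, the standard HCRF formulation from the literature (not necessarily the exact parameterisation used in this paper) models the posterior of a word class $y$ given an observation sequence $\mathbf{x}$ by marginalising over hidden state sequences $\mathbf{h}$:

$$
p(y \mid \mathbf{x}; \theta) \;=\; \frac{\sum_{\mathbf{h}} \exp\!\big(\Psi(y, \mathbf{h}, \mathbf{x}; \theta)\big)}{\sum_{y'} \sum_{\mathbf{h}'} \exp\!\big(\Psi(y', \mathbf{h}', \mathbf{x}; \theta)\big)}
$$

where $\Psi$ is a potential function whose feature windows may span several frames, which is what allows the long-range dependencies mentioned above to be captured directly in the model.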
For the first time in this paper we present results showing the effect of speaker head pose angle on automatic lipreading performance over a wide range of closely spaced angles. We analyse the effect head pose has upon the features themselves and show that by selecting coefficients with minimum variance w.r.t. pose angle, recognition performance can be improved when train and test pose angles differ. Experiments are conducted using the initial phase of a unique multi-view audio-visual database designed specifically for research and development of pose-invariant lipreading systems. We first show that it is the higher-order horizontal spatial frequency components that become most detrimental as the pose deviates. Secondly, we assess the performance of different feature selection masks across a range of pose angles, including a new mask based on Minimum Cross-Pose Variance coefficients. We report a relative improvement of 50% in Word Error Rate when using our selection mask over a common energy-based selection during profile-view lipreading.
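A minimal sketch of what a Minimum Cross-Pose Variance style mask could look like, assuming features grouped by pose angle; the data layout and the number of retained components are assumptions.

```python
# Hypothetical sketch: keep the DCT components whose mean value moves least
# as the head pose changes, i.e. the most pose-stable coefficients.
import numpy as np

def min_cross_pose_variance_mask(features_by_pose, n_keep=30):
    """features_by_pose: dict pose_angle -> (n_frames, n_coeffs) array."""
    pose_means = np.stack([f.mean(axis=0) for f in features_by_pose.values()])
    cross_pose_var = pose_means.var(axis=0)     # variance across pose angles
    return np.argsort(cross_pose_var)[:n_keep]  # smallest cross-pose variance
```

Under this criterion the higher-order horizontal components identified above as pose-sensitive would naturally score poorly and be excluded from the mask.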
Automatic gender classification has many security and commercial applications. Various modalities have been investigated for gender classification, with face-based classification being the most popular. In some real-world scenarios the face may be partially occluded. In these circumstances a classification based on individual parts of the face, known as local features, must be adopted. We investigate gender classification using lip movements. We show for the first time that important gender-specific information can be obtained from the way in which a person moves their lips during speech. Furthermore, our study indicates that the lip dynamics during speech provide greater gender-discriminative information than lip appearance alone. We also show that lip dynamics and appearance contain complementary gender information, such that a model which captures both traits gives the highest overall classification result. We use Discrete Cosine Transform based features and Gaussian Mixture Modelling to model lip appearance and dynamics, and employ the XM2VTS database for our experiments. Our experiments show that a model which captures lip dynamics along with appearance can improve gender classification rates by 16-21% compared to models of only lip appearance.
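An illustrative sketch of one way such a system could be assembled, not the authors' exact configuration: one Gaussian mixture per gender and per stream (static lip-DCT appearance, delta-based dynamics), fused by summing log-likelihoods; the mixture count and covariance type are assumptions.

```python
# Hypothetical appearance + dynamics fusion with per-gender GMMs.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gender_model(static_feats, delta_feats, n_mix=32):
    """Fit one GMM per stream on a single gender's training frames."""
    models = {}
    for stream, X in (('appearance', static_feats), ('dynamics', delta_feats)):
        models[stream] = GaussianMixture(n_components=n_mix,
                                         covariance_type='diag').fit(X)
    return models

def fused_score(models, static_test, delta_test):
    """Sum of average per-frame log-likelihoods from both streams."""
    return (models['appearance'].score(static_test) +
            models['dynamics'].score(delta_test))

# Classification: fit one model dict per gender, then take the argmax of
# fused_score over the two genders for each test utterance.
```

Scoring the two streams separately and summing is what lets the complementary appearance and dynamics information noted above both contribute to the final decision.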