In agglutinative languages such as Japanese and Uyghur, the selection of the lexical unit is not obvious and is one of the important issues in designing a language model for automatic speech recognition (ASR). In this paper, we propose a discriminative learning method to select word entries that reduce the word error rate (WER). We define an evaluation function for each word by a set of features and their weights, and define the optimization measure as the difference between the WERs of the two units (morpheme and word). The feature weights are then learned by a perceptron algorithm. Finally, word entries with higher evaluation scores are selected. The discriminative method is successfully applied to an Uyghur large-vocabulary continuous speech recognition system, resulting in a significant reduction in WER without a drastic increase in vocabulary size.
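The perceptron step described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the feature names (`freq`, `length`) and the toy training samples are assumptions, and the label stands in for whether adding a word entry reduced WER.

```python
# Sketch of perceptron learning of feature weights for word-entry selection.
# Feature names and samples are illustrative assumptions.

def score(features, weights):
    """Evaluation score of a word entry: weighted sum of its features."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())

def train_perceptron(samples, epochs=10, lr=1.0):
    """samples: list of (features, label); label is +1 if the entry is
    assumed to reduce WER, -1 otherwise."""
    weights = {}
    for _ in range(epochs):
        for features, label in samples:
            if label * score(features, weights) <= 0:  # misclassified
                for f, v in features.items():
                    weights[f] = weights.get(f, 0.0) + lr * label * v
    return weights

samples = [
    ({"freq": 0.9, "length": 0.3}, +1),  # hypothetical helpful entry
    ({"freq": 0.2, "length": 0.8}, -1),  # hypothetical harmful entry
]
w = train_perceptron(samples)
# Select entries with positive evaluation scores.
selected = [f for f, _ in samples if score(f, w) > 0]
```

In this sketch the lexicon is then built from the `selected` entries, mirroring the final selection step in the abstract.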
Automatic Speaker Verification (ASV) has benefits compared to other biometric verification methods, such as face recognition: it is convenient, low cost, and more privacy-protecting, so it is beginning to be used in various practical applications. However, voice verification systems are vulnerable to unknown spoofing attacks and need to be upgraded to keep pace with forgery techniques. This paper investigates a low-cost attack scenario in which a playback device is used to impersonate the real speaker. A replay attack needs only a recording device and a playback device, so it can be one of the most widespread spoofing methods. In this paper, we explore spectral clues in high-sampling-rate recorded signals and utilize this property to effectively detect replay attacks. First, a small-scale genuine-replay dataset at high sampling rates is constructed using low-cost mobile terminals; then, the signal features are investigated by comparing their spectra, and machine learning models are applied for evaluation. The experimental results verify that the high-frequency spectral clue in the replay signal provides a convenient and reliable way to detect replay attacks.
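The high-frequency clue can be illustrated with a simple energy-ratio feature. This is a toy sketch under stated assumptions, not the paper's feature set: the 8 kHz cutoff and the synthetic "genuine" vs "replay" signals are made up for demonstration.

```python
import numpy as np

def high_band_energy_ratio(signal, sr, cutoff=8000):
    """Fraction of spectral energy above `cutoff` Hz. Replayed recordings
    passed through a loudspeaker typically lose high-frequency content,
    so genuine high-sample-rate speech tends to score higher."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    total = spectrum.sum()
    return float(spectrum[freqs >= cutoff].sum() / total) if total > 0 else 0.0

# Toy check: a 44.1 kHz "genuine" signal containing a 12 kHz component,
# versus a band-limited "replay" copy with only the 1 kHz component.
sr = 44100
t = np.arange(sr) / sr
genuine = np.sin(2 * np.pi * 1000 * t) + 0.3 * np.sin(2 * np.pi * 12000 * t)
replay = np.sin(2 * np.pi * 1000 * t)
```

A classifier (as in the paper's evaluation) would consume features of this kind rather than thresholding a single ratio.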
Natural language processing for less popular languages is difficult, partly due to the high variation in writing forms. On the other hand, many minority languages in the same region share similar properties and can be processed in a similar way. This paper publishes an integrated multilingual language processing tool. Our aim is to provide an open, free and standard toolkit for minority language processing tasks, with a uniform user interface supporting multiple languages. The present implementation supports Uyghur, Kazak and Kirghiz, three major minority languages in western China, with a focus on phonetic and morphological analysis. For the phonetic analysis, we build a multilingual parallel phoneme list, with similar phonemes grouped and character codes standardized. A multilingual syllable analyzer is also developed to detect spelling mistakes and extract irregular spellings. For the morphological analysis, we build a multilingual morpheme segmentation tool that extracts morphemes by statistical analysis. This toolkit is extendable in terms of both functions and languages.
With the popularity of the mobile internet, people all over the world can easily create and publish diverse media content such as multilingual and multi-dialectal audio and video. Therefore, language or dialect identification (LID) is increasingly important for practical applications such as multilingual and cross-lingual processing, as the front end of subsequent tasks such as speech recognition and voice identification. This paper proposes a neural network framework based on a multiscale residual network (MSRN) and multi-headed self-attention (MHSA). The model uses the MSRN to extract language spectrogram features and uses MHSA to filter useful features and suppress irrelevant ones. Training and test sets are constructed from both the "Common Voice" and "Oriental Language Recognition" (AP17-OLR) datasets. The experimental results show that this model can effectively improve the accuracy and robustness of LID compared to other methods.
Multilayer structure based lexicon optimization for agglutinative languages
Summary form only given. For large vocabulary continuous speech recognition (LVCSR), selection of an appropriate lexical unit is the first important step. When the word is selected as the lexical unit, the word boundary detection problem can be avoided. But the choice of lexicon is not clear for derivative morphological structures (e.g. agglutinative languages), and many languages (Chinese, Japanese, etc.) have no word boundaries. This paper, based on the Uyghur LVCSR system, analyzes multi-layered lexicon based automatic speech recognition (ASR) systems, compares the ASR results of various linguistic layers, and proposes a new method that balances the advantages of two layers of lexicons. By aligning and comparing the ASR results of the two layers, we analyze error patterns and extract samples as training data for the alternative selection method. Experimental results show that the proposed method effectively improves ASR accuracy while maintaining a small lexicon size.
Morphologically derivative languages form words by fusing stems and suffixes, so extracting stems is important for cross-lingual alignment and knowledge transfer. As there are phonetic harmony and disharmony when linguistic particles are combined, both phonetic and morphological changes need to be analyzed. This paper proposes a multilingual stemming method that learns morpho-phonetic changes automatically based on character-based embedding and sequential modeling. First, character feature embeddings at the sentence level are used as input, and a BiLSTM model captures the forward and reverse context; an attention mechanism is added for weight learning, extracting global feature information to capture stem and affix boundaries. Finally, a CRF layer learns from the sequence features to describe context information more effectively. To verify the effectiveness of this model, it is compared with traditional models on two different datasets covering three derivative languages: Uyghur, Kazakh and Kirghiz. The experimental results show that the proposed model achieves the best stemming performance on multilingual sentence-level datasets, outperforms other traditional models, fully considers the data characteristics, and requires less human intervention.
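The CRF layer's role, picking the best stem/affix tag sequence under transition constraints, can be illustrated with a minimal Viterbi decoder. This is a sketch under stated assumptions: the emission scores stand in for BiLSTM outputs, and the tag set (`B-STEM`/`I-STEM` for stem characters, `B-SUF`/`I-SUF` for suffix characters) and scores are invented for the example.

```python
def viterbi(emissions, transitions, tags):
    """Pick the best tag path. `emissions` is a list of {tag: score}
    dicts (standing in for per-character BiLSTM outputs); `transitions`
    maps allowed (prev, cur) tag pairs to scores; disallowed pairs get
    a large negative score."""
    best = {t: (emissions[0][t], [t]) for t in tags}
    for em in emissions[1:]:
        layer = {}
        for cur in tags:
            prev_score, prev_path = max(
                (best[p][0] + transitions.get((p, cur), -1e9), best[p][1])
                for p in tags)
            layer[cur] = (prev_score + em[cur], prev_path + [cur])
        best = layer
    return max(best.values())[1]

tags = ["B-STEM", "I-STEM", "B-SUF", "I-SUF"]
# Allowed tag bigrams (score 0.0); everything else is forbidden.
transitions = {(p, c): 0.0 for p, c in [
    ("B-STEM", "I-STEM"), ("I-STEM", "I-STEM"),
    ("B-STEM", "B-SUF"), ("I-STEM", "B-SUF"),
    ("B-SUF", "I-SUF"), ("I-SUF", "I-SUF"),
    ("B-SUF", "B-SUF"), ("I-SUF", "B-SUF"),
]}
# Toy emissions for a 4-character word whose stem is the first 2 characters.
emissions = [
    {"B-STEM": 2, "I-STEM": 0, "B-SUF": 0, "I-SUF": 0},
    {"B-STEM": 0, "I-STEM": 2, "B-SUF": 0, "I-SUF": 0},
    {"B-STEM": 0, "I-STEM": 0, "B-SUF": 2, "I-SUF": 0},
    {"B-STEM": 0, "I-STEM": 0, "B-SUF": 0, "I-SUF": 2},
]
path = viterbi(emissions, transitions, tags)
```

In the full model, the transition scores are learned jointly with the BiLSTM and attention layers rather than fixed by hand.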
In this paper, based on a multilingual morphological analyzer, we study short text classification for the similar low-resource languages Uyghur and Kazakh. Generally, the online linguistic resources of these languages are noisy, so preprocessing is necessary and can significantly improve accuracy. Uyghur and Kazakh are languages with derivational morphology, in which words are coined by stems concatenated with suffixes. Usually, terms are used as the representation of text content, while functional parts are excluded as stop words. By extracting stems we can collect the necessary terms and exclude stop words. The morpheme segmentation tool can split text into morphemes with a high reliability of 95%. After preparing both word- and morpheme-based training corpora, we apply a convolutional neural network (CNN) for feature selection and text classification. Experimental results show that the morpheme-based approach outperforms the word-based approach. The word embedding technique, frequently used for text representation both within neural networks and as a value expression, maps language units into a sequential vector space based on context, and is a natural way to extract and predict out-of-vocabulary (OOV) words from context information. Multilingual morphological analysis provides a convenient way to process low-resource languages like Uyghur and Kazakh.
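The stem-as-term preprocessing step can be sketched as below. The segmentation lookup table is a stand-in assumption (the paper uses a statistical morpheme segmenter), and the example words are illustrative only.

```python
# Illustrative preprocessing: represent a text by stems only, dropping
# suffixes (functional morphemes) as stop words.
SEGMENTS = {  # word -> (stem, suffixes); toy stand-in for the segmenter
    "kitablar": ("kitab", ["lar"]),
    "mekteptin": ("mektep", ["tin"]),
}

def to_terms(words):
    """Keep stems as content terms; exclude suffixes as stop words.
    Unknown words are passed through unchanged."""
    return [SEGMENTS.get(w, (w, []))[0] for w in words]

terms = to_terms(["kitablar", "mekteptin"])
```

The resulting term sequences would then feed the morpheme-based CNN corpus, which the abstract reports outperforming the word-based one.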
A Review of Morphological Analysis Methods on Uyghur Language
A morphologically diverse language forms a huge collection of various types of words. As an agglutinative language, Uyghur forms a large number of words by attaching affixes to the front and back of the stem. Based on an analysis of Uyghur morphology, the first part focuses on three language features of Uyghur: word formation and ambiguity, cohesion, and phonetic changes. The second part discusses the characteristics, applications and research purposes of Uyghur morphological analysis. The third part introduces and describes in detail the methods of morphological analysis in domestic and foreign research, with their advantages, disadvantages and characteristics. The fourth part introduces several Uyghur stemming methods and corresponding implementation cases, embodying the characteristics of the Uyghur language. The fifth part introduces the concatenation of word embeddings and character-level embeddings to extract Uyghur stems through a BiLSTM-CRF model: first, the word embedding of each word is obtained from an unlabeled Uyghur corpus; second, character feature embeddings are obtained and directly concatenated with the word embeddings; finally, the BiLSTM-CRF model extracts the stems, reaching an accuracy of 89.21%. The conclusion summarizes stem extraction, looks forward to future trends in Uyghur morphological analysis research, and discusses the development direction of agglutinative language information processing.
A Comparative Analysis of Acoustic Characteristics between Kazak & Uyghur Mandarin Learners and Standard Mandarin Speakers
In this paper, based on vowel and phonological pronunciation corpora of 20 Kazakh undergraduate Mandarin learners, 10 Uyghur learners, and 10 standard Mandarin speakers, and within the framework of the phonetic learning model and comparative analysis, the methods of experimental phonetics are applied to the Mandarin vowels of Kazak and Uyghur learners. Acoustic characteristics such as formant frequency values, vowel duration similarity and other prosodic parameters of the learners were analyzed and compared with those of the standard speakers. These results help provide learners with effective teaching-related reference information, provide reliable and correct parameters and pronunciation assessments for computer-assisted language learning (CALL) systems, and improve the accuracy of multi-ethnic Mandarin (Putonghua) speech recognition and ethnic identification.
International Journal of Future Generation Communication and Networking, Feb 28, 2016
Uyghur is an agglutinative language in which words are derived from stems (or roots) by concatenating suffixes. This property produces a large number of morpheme combinations and greatly increases the vocabulary size, causing out-of-vocabulary (OOV) and data sparseness problems for statistical models. Words are therefore split into sub-word units for text and speech processing applications. Proper sub-word units not only provide high coverage and a smaller lexicon, but also carry the semantic and syntactic information necessary for downstream applications. This paper discusses a general-purpose morphological analyzer tool that can split a text of words into sequences of morphemes or syllables. Uyghur morpheme segmentation is a basic part of the comprehensive effort of Uyghur language corpus compilation. As there are no delimiters for sub-word units, a supervised method combining rules with a statistical learning algorithm is applied for morpheme segmentation. Phonetic units like syllables and phonemes can be extracted with high accuracy by pure rule-based methods. The most common and proper sub-words for various applications are the linguistic morphemes, as they provide linguistic information, high coverage and a small lexicon, and can easily be restored to words. As Uyghur is written as pronounced, phonetic alterations of speech are openly expressed in the text, giving many surface forms for a particular morpheme. A general-purpose morphological analyzer must be able to analyze and export both standard and surface forms, so morpho-phonetic alterations like phonetic harmony, weakening, and morphological changes are summarized and learnt from the training corpus.
A statistical morpheme segmentation model is then trained on a corpus of aligned word-morpheme sequences and applied to predict possible morpheme sequences. On an open test set with word coverage of 86.8% and morpheme coverage of 98.4%, the morpheme segmentation accuracy is 97.6%. This segmentation tool can output both standard forms and surface forms without sacrificing segmentation accuracy. Furthermore, the statistical properties of the basic lexical units (word, morpheme, and syllable) are compared as part of the comprehensive Uyghur language corpus compilation.
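The standard-vs-surface-form issue can be illustrated with a toy vowel harmony rule: one standard suffix is realized as different surface strings depending on the stem's vowels. The rule, vowel sets, and example words below are simplified assumptions for illustration, not the analyzer's actual rules.

```python
# Toy illustration of vowel harmony: one standard plural suffix ("-lAr")
# surfaces as "-lar" after back vowels and "-ler" after front vowels.
BACK_VOWELS = set("aou")
FRONT_VOWELS = set("ei")

def plural_surface(stem):
    """Pick the surface form of the plural suffix from the stem's last vowel."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS:
            return stem + "lar"
        if ch in FRONT_VOWELS:
            return stem + "ler"
    return stem + "lar"  # arbitrary default when no vowel is found

back_example = plural_surface("kitab")    # last vowel "a" is back
front_example = plural_surface("mektep")  # last vowel "e" is front
```

An analyzer exporting standard forms would map both surface variants back to the single suffix "-lAr"; the reverse direction (learned from the training corpus, per the abstract) handles harmony, weakening, and similar alterations.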
In multi-speaker scenarios, speech processing tasks like speaker identification and speech recognition are susceptible to noise and overlapped voices. As overlapped voices are a complicated mixture of signals, extracting the target from this mixture is a good front-end solution for further processing such as understanding and classification. The quality of speech separation can be assessed by the noise ratio or subjective scoring, and also by the accuracy of downstream tasks like speaker identification. To make the separation and speaker identification models better adapted to complex multi-speaker overlapping scenarios, this research investigates a speech separation model incorporated into a voiceprint recognition task. This paper proposes a feature-scale single-channel speech separation network connected to a back-end speaker verification network with MFCCT features, so that speaker identification accuracy indicates the quality of the speech separation task. The datasets are prepared by synthesizing VoxCeleb1 data and are used for training and testing. The results show that using an objective downstream evaluation can effectively improve overall performance, as the optimized speech separation model significantly reduced the speaker verification error rate.
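For comparison with the downstream-accuracy evaluation used in the paper, a common signal-level separation metric is scale-invariant SNR (SI-SNR). The sketch below is illustrative and not the paper's evaluation pipeline; the synthetic signals are assumptions.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB: higher means the separated estimate is
    closer to the clean target, regardless of overall gain."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the target to remove scale differences.
    s_target = (estimate @ target) / (target @ target + eps) * target
    e_noise = estimate - s_target
    return 10 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                      # 1 s "target" at 16 kHz
good_sep = clean + 0.1 * rng.standard_normal(16000)     # good separation result
poor_sep = clean + 1.0 * rng.standard_normal(16000)     # poor separation result
```

In the paper's setup, the analogous comparison is made on speaker verification error rates rather than on a signal metric; the two views of separation quality are complementary.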
Papers by Mijit Ablimit