CN108109615A - Construction and application method of a DNN-based Mongolian acoustic model - Google Patents
- Publication number
- CN108109615A CN108109615A CN201711390588.6A CN201711390588A CN108109615A CN 108109615 A CN108109615 A CN 108109615A CN 201711390588 A CN201711390588 A CN 201711390588A CN 108109615 A CN108109615 A CN 108109615A
- Authority
- CN
- China
- Prior art keywords
- layer
- dnn
- mongolian
- acoustic
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/065—Adaptation
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L2015/0631—Creating reference templates; Clustering
- G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
Abstract
The invention provides a method for constructing and using a DNN-based Mongolian acoustic model. A deep neural network (DNN) replaces the Gaussian mixture model (GMM) to estimate the posterior probabilities of the Mongolian acoustic states; a DNN-HMM acoustic model is constructed and then used to recognize Mongolian speech data. The invention can effectively reduce both the word error rate and the character error rate, improving the usable performance of the model.
Description
Technical Field
The invention belongs to the field of Mongolian speech recognition, and in particular relates to a method for constructing and using a DNN-based Mongolian acoustic model.
Background Art
A typical large-vocabulary continuous speech recognition (LVCSR) system consists of feature extraction, an acoustic model, a language model, and a decoder. The acoustic model is the core component of a speech recognition system, and the GMM-HMM acoustic model, built from a Gaussian mixture model (GMM) and a hidden Markov model (HMM), was once the most widely used acoustic model in LVCSR systems. In the GMM-HMM model, the GMM probabilistically models the speech feature vectors, and the EM (expectation-maximization) algorithm maximizes the probability of the observed speech features; when the number of Gaussian mixture components is large enough, the GMM can adequately fit the probability distribution of the acoustic features, and the HMM generates the temporal state sequence of the speech from the observation states fitted by the GMM. However, when the mixture probabilities of the GMM are used to describe the distribution of speech data, the GMM is essentially a shallow model and assumes independence between features when fitting the acoustic feature distribution, so it cannot fully describe the state-space distribution of the acoustic features. Moreover, the feature dimensionality used in GMM modeling is typically only a few dozen, which cannot fully capture the correlations between acoustic features, so the expressive power of the model is limited.
Summary of the Invention
Research on building acoustic models from neural networks and HMMs first appeared in the 1980s, but because computing power and training data were insufficient at the time, such models underperformed GMM-HMM. In 2010, Deng Li and Hinton's group at Microsoft Research proposed the CD-DBN-HMM (context-dependent deep belief network) hybrid acoustic model framework for large-scale continuous speech recognition tasks and carried out related experiments. The results showed that, compared with the GMM-HMM acoustic model, the CD-DBN-HMM acoustic model improved the recognition accuracy of the speech recognition system by about 30%; the proposal of the CD-DBN-HMM hybrid framework thoroughly revolutionized the previous acoustic modeling framework for speech recognition. Compared with the traditional Gaussian mixture model, a deep neural network is a deep model: it can better represent complex nonlinear functions, better capture the correlations between speech feature vectors, and more easily achieve good modeling results. Based on these achievements, the present invention proposes a method for constructing and using a DNN-based Mongolian acoustic model, so as to better accomplish the Mongolian acoustic modeling task.
The technical solution of the present invention is as follows:
1. Model construction:
A deep neural network (DNN) replaces the Gaussian mixture model (GMM) to estimate the posterior probabilities of the Mongolian acoustic states. Given a sequence of Mongolian acoustic features, the DNN first estimates the probability that the current feature belongs to each HMM state; the HMM then describes the dynamic changes of the Mongolian speech signal and captures the temporal state information of the Mongolian speech.
The training of the DNN in the Mongolian acoustic model is divided into two stages: pre-training and fine-tuning.
Pre-training of the DNN uses a layer-by-layer unsupervised training algorithm, a form of generative training. The algorithm trains the DNN one layer at a time: while a layer is being trained, the parameters of all other layers keep their initial values. During training, the error between each layer's input and output is reduced as far as possible, so that the parameters of each layer are optimal for that layer. The output of each trained layer is then used as the input of the next layer, so the next layer receives data with much smaller error than it would if the data had been propagated through the whole multi-layer network trained directly. Layer-by-layer unsupervised pre-training thus keeps the input-output error between every pair of adjacent layers relatively small.
The layer-by-layer unsupervised pre-training algorithm yields good initialization parameters for the neural network; supervised fine-tuning is then carried out with the BP (error back-propagation) algorithm using the Mongolian labeled data (i.e., feature states), finally producing a DNN model usable for acoustic state classification.
2. Model use:
After the DNN has been pre-trained and fine-tuned, the DNN-HMM acoustic model can be used to recognize Mongolian speech data. The specific procedure is as follows:
Step 1: From the input Mongolian acoustic feature vector, compute the outputs of the first L layers of the DNN.
Step 2: Use the softmax classification layer (layer L) to compute the posterior probabilities of the current feature with respect to all acoustic states, i.e., the probability that the current feature belongs to each Mongolian acoustic state.
Step 3: Following Bayes' rule, divide the posterior probability of each state by that state's prior probability to obtain the scaled likelihood of each state.
Step 4: Decode with the Viterbi algorithm to obtain the optimal path.
The prior probability of a hidden state is obtained by computing the ratio of the total number of frames aligned to that state to the total number of acoustic feature frames.
3. Model training process
The labeled data used in the DNN fine-tuning stage are obtained by forced alignment with a GMM-HMM acoustic model, and the model parameters are updated with stochastic gradient descent. Therefore, before training the DNN-HMM, a sufficiently good GMM-HMM Mongolian acoustic model must first be trained; that GMM-HMM model then generates, via the Viterbi algorithm, the labeled data required for training the DNN-HMM Mongolian acoustic model.
Because fine-tuning the DNN requires Mongolian labels aligned to Mongolian speech frames, and label quality strongly affects DNN performance, in practice we use the trained GMM-HMM Mongolian acoustic model to force-align speech features to states. The training process of the DNN-HMM acoustic model is therefore: first train the GMM-HMM Mongolian acoustic model and obtain aligned Mongolian speech feature data; then train and fine-tune the deep neural network (DNN) on the aligned feature data; finally, train the hidden Markov model (HMM) on the resulting Mongolian speech observation states.
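The three training stages above can be outlined in code. This is a schematic sketch of the order of operations only: the stage functions (`train_gmm_hmm`, `force_align`, `train_dnn`, `train_hmm`) are hypothetical placeholders, not the patent's implementation or any toolkit's API.

```python
def train_dnn_hmm(corpus):
    """Illustrative DNN-HMM training order: GMM-HMM first, then Viterbi
    forced alignment to produce frame-level state labels, then DNN
    training on those labels, and finally HMM training on the DNN's
    observation states."""
    log = []  # records the order in which the stages run

    def train_gmm_hmm(data):           # hypothetical stage 1
        log.append("gmm-hmm")
        return "gmm_hmm_model"

    def force_align(model, data):      # Viterbi forced alignment
        log.append("align")
        return [(frame, "state") for frame in data]

    def train_dnn(labels):             # pre-train + fine-tune the DNN
        log.append("dnn")
        return "dnn_model"

    def train_hmm(dnn, labels):        # HMM on the DNN observation states
        log.append("hmm")
        return "hmm_model"

    gmm = train_gmm_hmm(corpus)
    labels = force_align(gmm, corpus)
    dnn = train_dnn(labels)
    hmm = train_hmm(dnn, labels)
    return log, (dnn, hmm)
```

The key design point is simply the dependency order: the DNN cannot be fine-tuned until the GMM-HMM alignment has produced frame-level labels.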
Brief Description of the Drawings
Figure 1 is a diagram of the DNN-HMM Mongolian acoustic model.
Figure 2 is a diagram of the DNN pre-training process.
Figure 3 shows the experimental comparison results against the GMM-HMM acoustic model.
Embodiments
In order to describe the technical content of the present invention more clearly, it is further described below in conjunction with specific embodiments.
Embodiment 1:
1. Model construction:
A deep neural network (DNN) replaces the Gaussian mixture model (GMM) to estimate the posterior probabilities of the Mongolian acoustic states. Given a sequence of Mongolian acoustic features, the DNN first estimates the probability that the current feature belongs to each HMM state; the HMM then describes the dynamic changes of the Mongolian speech signal and captures the temporal state information of the Mongolian speech.
The structure of the DNN-HMM Mongolian acoustic model is shown in Figure 1. In this model, the DNN is built by stacking hidden layers from the bottom up. Here S denotes the hidden states of the HMM, A denotes the state transition probability matrix, and L denotes the number of layers of the DNN (with L-1 hidden layers, layer L0 as the input layer, and layer LL as the output layer, the DNN contains L+1 layers in total); W denotes the connection matrices between layers. Before the DNN-HMM Mongolian acoustic model can model the Mongolian speech recognition process, the DNN must be trained. Once DNN training is complete, the modeling process of the Mongolian acoustic model is the same as for the GMM-HMM model.
The training of the DNN in the Mongolian acoustic model is divided into two stages: pre-training and fine-tuning.
Pre-training of the DNN (shown in Figure 2) uses a layer-by-layer unsupervised training algorithm, a form of generative training. The algorithm trains the DNN one layer at a time: while a layer is being trained, the parameters of all other layers keep their initial values. During training, the error between each layer's input and output is reduced as far as possible, so that the parameters of each layer are optimal for that layer. The output of each trained layer is then used as the input of the next layer, so the next layer receives data with much smaller error than it would if the data had been propagated through the whole multi-layer network trained directly. Layer-by-layer unsupervised pre-training thus keeps the input-output error between every pair of adjacent layers relatively small. The pre-training algorithm is given as Algorithm 1.
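Algorithm 1 itself is not reproduced in this text, so the following is a minimal sketch of greedy layer-by-layer unsupervised pre-training, assuming a one-layer autoencoder as the per-layer trainer (RBM-based pre-training is another common realization). All names and hyperparameters are illustrative, not the patent's.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_layerwise(data, layer_sizes, epochs=100, lr=0.1, seed=0):
    """Greedy layer-wise unsupervised pre-training sketch.

    Each layer is trained as a small autoencoder on the output of the
    already-trained layers below it; the frozen layers are not updated.
    Returns (W, b) encoder parameters for every layer."""
    rng = np.random.default_rng(seed)
    h = data                      # representation fed to the layer in training
    params = []
    for n_out in layer_sizes:
        n_in = h.shape[1]
        W = rng.normal(0, 0.1, (n_in, n_out)); b = np.zeros(n_out)
        W_dec = rng.normal(0, 0.1, (n_out, n_in)); b_dec = np.zeros(n_in)
        for _ in range(epochs):
            z = sigmoid(h @ W + b)        # encode
            r = z @ W_dec + b_dec         # linear reconstruction
            err = r - h                   # input-output error to minimize
            # gradient descent on mean squared reconstruction error
            dz = (err @ W_dec.T) * z * (1 - z)
            W_dec -= lr * z.T @ err / len(h); b_dec -= lr * err.mean(0)
            W -= lr * h.T @ dz / len(h);      b -= lr * dz.mean(0)
        params.append((W, b))
        h = sigmoid(h @ W + b)    # frozen layer's output feeds the next layer
    return params
```

The resulting `params` would then serve as the initialization for supervised BP fine-tuning.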
The layer-by-layer unsupervised pre-training algorithm yields good initialization parameters for the neural network; supervised fine-tuning is then carried out with the BP (error back-propagation) algorithm using the Mongolian labeled data (i.e., feature states), finally producing a DNN model usable for acoustic state classification. The supervised fine-tuning is implemented with stochastic gradient descent; see Algorithm 2.
2. Model use:
Step 1: From the input Mongolian acoustic feature vector, compute the outputs of the first L layers of the DNN, namely:
v_α = f(z_α) = f(W_α v_{α-1} + b_α),  0 ≤ α < L    (1)

where z_α = W_α v_{α-1} + b_α is the excitation (pre-activation) vector of layer α, v_α is the activation vector, W_α is the weight matrix, b_α is the bias vector, and N_α ∈ R is the number of neural nodes in layer α. v_0 denotes the input features of the network; in the DNN-HMM acoustic model, the input features are the acoustic feature vectors, and N_0 = D is the dimensionality of the input acoustic feature vector. f(·) denotes the activation function applied to the excitation vector.
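Equation (1) can be sketched as follows, using a sigmoid for the activation function f(·) as an illustrative assumption (the text does not fix a particular choice of f):

```python
import numpy as np

def forward_hidden_layers(v0, weights, biases):
    """Forward propagation per equation (1): v_a = f(W_a v_{a-1} + b_a).

    `weights[a]` has shape (N_a, N_{a-1}) and `biases[a]` has shape (N_a,).
    Returns the list of activation vectors [v_0, v_1, ..., v_{L-1}]."""
    v = v0
    activations = [v0]
    for W, b in zip(weights, biases):
        z = W @ v + b                   # excitation vector z_a
        v = 1.0 / (1.0 + np.exp(-z))    # activation vector v_a = f(z_a)
        activations.append(v)
    return activations
```

The final element of the returned list is what the softmax classification layer of equation (2) would consume.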
Step 2: Use the softmax classification layer (layer L) to compute the posterior probabilities of the current feature with respect to all acoustic states, i.e., the probability that the current feature belongs to each Mongolian acoustic state:
v_i = P_dnn(i | O) = softmax(x_i)    (2)
In formula (2), i ∈ {1, 2, …, C}, where C is the number of hidden states of the acoustic model, x_i is the input of the i-th neural unit in the softmax layer, and v_i is the output of the i-th unit of the softmax classification layer, i.e., the posterior probability of the i-th hidden state of the acoustic model given the input acoustic feature vector O.
Step 3: Following Bayes' rule, divide the posterior probability of each state by that state's prior probability to obtain the scaled likelihood of each state.
Step 4: Decode with the Viterbi algorithm to obtain the optimal path.
The prior probability of a hidden state is obtained by computing the ratio of the total number of frames aligned to that state to the total number of acoustic feature frames.
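The prior estimate and the Bayes-rule rescaling of step 3 can be sketched as follows; the function names are illustrative:

```python
import numpy as np

def state_priors(alignment_frame_counts):
    """Prior of each hidden state: number of frames aligned to the state
    divided by the total number of acoustic feature frames."""
    counts = np.asarray(alignment_frame_counts, dtype=float)
    return counts / counts.sum()

def scaled_likelihoods(posteriors, priors):
    """Bayes-rule rescaling applied before Viterbi decoding: divide each
    state posterior P(state | O) by the state prior P(state)."""
    return np.asarray(posteriors, dtype=float) / np.asarray(priors)
```

The resulting scaled likelihoods replace the GMM observation likelihoods inside the standard Viterbi recursion.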
3. Experiments and results:
3.1 To verify the effectiveness of the proposed DNN-HMM Mongolian acoustic model, the following experiments were designed:
(1) Extract MFCC acoustic features and carry out experimental research on GMM-HMM and DNN-HMM Mongolian acoustic modeling; observe how different acoustic modeling units affect the performance of the acoustic model, and compare how different types of acoustic models affect the speech recognition system.
(2) Build DNN-HMM triphone Mongolian acoustic models with deep network structures of different depths, and experimentally study how the number of layers affects the Mongolian acoustic model and the overfitting phenomenon.
3.2 Experimental parameters:
The corpus for Mongolian speech recognition consists of 310 Mongolian teaching-speech sentences containing 2,291 Mongolian words in total, and is named the IMUT310 corpus. The corpus has three parts: audio files, pronunciation annotations, and the corresponding Mongolian text. In the experiments, the IMUT310 corpus is divided into a training set of 287 sentences and a test set of 23 sentences. The experiments were run on the Kaldi platform; the Kaldi experimental environment configuration is shown in Table 1.
Table 1: Experimental environment
In the experiments, the Mongolian acoustic features are 39-dimensional MFCC features: the first 13 dimensions consist of 12 cepstral coefficients and 1 energy coefficient, and the remaining two groups of 13 dimensions are the first-order and second-order differences of the first 13. When extracting the Mongolian MFCC features, the frame window length is 25 ms and the frame shift is 10 ms. Feature extraction on the training and test sets produces 119,960 MFCC feature frames in total: 112,535 from the training data and 7,425 from the test data. For GMM-HMM acoustic model training, the 39-dimensional MFCC features are used. In the monophone DNN-HMM experiment, the Mongolian MFCC features are 13-dimensional (excluding the first- and second-order differences); in the triphone DNN-HMM experiment, the Mongolian MFCC features are 39-dimensional.
For DNN training, feature extraction uses context splicing: the 5 frames before and the 5 frames after the current frame are concatenated with it to represent the current frame's context. Accordingly, the monophone DNN has 143 input nodes (13 × (5 + 1 + 5)) and the triphone DNN has 429 input nodes (39 × (5 + 1 + 5)). The number of output-layer nodes of the DNN equals the number of observable Mongolian phonemes; according to the corpus labeling standard, there are 27 output nodes. The number of hidden-layer nodes is set to 1024, the number of fine-tuning epochs to 60, the initial learning rate to 0.015, and the final learning rate to 0.002.
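The 5 + 1 + 5 context-splicing scheme can be sketched as follows. The edge-padding convention used here (repeating the first/last frame at utterance boundaries) is an assumption for illustration; the text does not specify how boundaries are handled.

```python
import numpy as np

def splice_frames(features, left=5, right=5):
    """Concatenate each frame with `left` preceding and `right` following
    frames, yielding D * (left + 1 + right) input dimensions per frame.

    `features` has shape (T, D); the result has shape
    (T, D * (left + 1 + right))."""
    T, D = features.shape
    padded = np.concatenate([np.repeat(features[:1], left, axis=0),
                             features,
                             np.repeat(features[-1:], right, axis=0)])
    return np.stack([padded[t:t + left + 1 + right].reshape(-1)
                     for t in range(T)])
```

With 13-dimensional MFCCs this yields the 143 input nodes of the monophone DNN, and with 39-dimensional MFCCs the 429 input nodes of the triphone DNN.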
3.3 Experiments and results:
The four experimental units are: monophone GMM-HMM, triphone GMM-HMM, monophone DNN-HMM, and triphone DNN-HMM. The experimental result data are given in Table 2, and the comparison is shown in Figure 3.
Table 2: Experimental data for the GMM-HMM and DNN-HMM Mongolian acoustic models
Figure 3(a) shows that, relative to the monophone GMM-HMM Mongolian acoustic model, the monophone DNN-HMM Mongolian acoustic model reduces the word error rate by 8.84% on the training set and by 11.14% on the test set; for the triphone models, the triphone DNN-HMM Mongolian acoustic model reduces the word error rate by 1.33% on the training set and by 7.5% on the test set compared with the triphone GMM-HMM model. Figure 3(b) shows that the monophone DNN-HMM model reduces the sentence recognition error rate by 32.43% on the training set and by 17.88% on the test set; for the triphone models, the triphone DNN-HMM Mongolian acoustic model reduces the sentence recognition error rate by 19.3% on the training set and by 13.63% on the test set compared with the triphone GMM-HMM model.
The above analysis shows that the monophone DNN-HMM Mongolian acoustic model is clearly superior to the monophone GMM-HMM Mongolian acoustic model, and that the triphone DNN-HMM Mongolian acoustic model likewise achieves a higher recognition rate than the triphone GMM-HMM Mongolian acoustic model.
The DNN-HMM Mongolian acoustic model can effectively reduce both the word error rate and the character error rate, improving the usable performance of the model.
In this specification, the invention has been described with reference to specific embodiments thereof. However, various modifications and changes can evidently be made without departing from the spirit and scope of the invention. Accordingly, the specification and drawings are to be regarded as illustrative rather than restrictive.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711390588.6A CN108109615A (en) | 2017-12-21 | 2017-12-21 | A kind of construction and application method of the Mongol acoustic model based on DNN |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711390588.6A CN108109615A (en) | 2017-12-21 | 2017-12-21 | A kind of construction and application method of the Mongol acoustic model based on DNN |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108109615A true CN108109615A (en) | 2018-06-01 |
Family
ID=62210550
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711390588.6A Pending CN108109615A (en) | 2017-12-21 | 2017-12-21 | A kind of construction and application method of the Mongol acoustic model based on DNN |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108109615A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109065029A (en) * | 2018-10-10 | 2018-12-21 | 内蒙古工业大学 | A kind of small-scale corpus DNN-HMM acoustic model |
CN109119072A (en) * | 2018-09-28 | 2019-01-01 | 中国民航大学 | Civil aviaton's land sky call acoustic model construction method based on DNN-HMM |
CN109147772A (en) * | 2018-10-10 | 2019-01-04 | 内蒙古工业大学 | A kind of DNN-HMM acoustic model parameters migration structure |
CN109410914A (en) * | 2018-08-28 | 2019-03-01 | 江西师范大学 | A kind of Jiangxi dialect phonetic and dialect point recognition methods |
CN111696525A (en) * | 2020-05-08 | 2020-09-22 | 天津大学 | Kaldi-based Chinese speech recognition acoustic model construction method |
CN112951206A (en) * | 2021-02-08 | 2021-06-11 | 天津大学 | Tibetan Tibet dialect spoken language identification method based on deep time delay neural network |
CN112992125A (en) * | 2021-04-20 | 2021-06-18 | 北京沃丰时代数据科技有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
CN113886518A (en) * | 2020-07-03 | 2022-01-04 | 阿里巴巴集团控股有限公司 | Data processing method, apparatus, electronic device, and computer-readable storage medium |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5579436A (en) * | 1992-03-02 | 1996-11-26 | Lucent Technologies Inc. | Recognition unit model training based on competing word and word string models |
CN103117060A (en) * | 2013-01-18 | 2013-05-22 | Institute of Acoustics, Chinese Academy of Sciences | Modeling approach and modeling system for an acoustic model used in speech recognition |
CN106297773A (en) * | 2015-05-29 | 2017-01-04 | Institute of Acoustics, Chinese Academy of Sciences | A neural network acoustic model training method |
CN106898354A (en) * | 2017-03-03 | 2017-06-27 | Tsinghua University | Speaker-count estimation method based on DNN and support vector machine models |
CN106991999A (en) * | 2017-03-29 | 2017-07-28 | Beijing Xiaomi Mobile Software Co., Ltd. | Speech recognition method and device |
Non-Patent Citations (1)
Title |
---|
Zhang Hongwei (张红伟): "Research on the Acoustic Model of a Mongolian Speech Recognition System Based on Deep Neural Networks", China Masters' Theses Full-text Database (《中国优秀硕士学位论文全文数据库》) * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109410914A (en) * | 2018-08-28 | 2019-03-01 | Jiangxi Normal University | A Gan (Jiangxi) dialect speech and dialect-point recognition method |
CN109410914B (en) * | 2018-08-28 | 2022-02-22 | Jiangxi Normal University | A Gan dialect speech and dialect-point recognition method |
CN109119072A (en) * | 2018-09-28 | 2019-01-01 | Civil Aviation University of China | DNN-HMM-based acoustic model construction method for civil aviation air-ground communication |
CN109065029A (en) * | 2018-10-10 | 2018-12-21 | Inner Mongolia University of Technology | A DNN-HMM acoustic model for small-scale corpora |
CN109147772A (en) * | 2018-10-10 | 2019-01-04 | Inner Mongolia University of Technology | A DNN-HMM acoustic model parameter transfer structure |
CN111696525A (en) * | 2020-05-08 | 2020-09-22 | Tianjin University | Kaldi-based acoustic model construction method for Chinese speech recognition |
CN113886518A (en) * | 2020-07-03 | 2022-01-04 | Alibaba Group Holding Ltd. | Data processing method, apparatus, electronic device, and computer-readable storage medium |
CN112951206A (en) * | 2021-02-08 | 2021-06-11 | Tianjin University | Spoken-language recognition method for a Tibetan dialect based on a deep time-delay neural network |
CN112992125A (en) * | 2021-04-20 | 2021-06-18 | Beijing Wofeng Times Data Technology Co., Ltd. | Speech recognition method and device, electronic equipment and readable storage medium |
CN112992125B (en) * | 2021-04-20 | 2021-08-03 | Beijing Wofeng Times Data Technology Co., Ltd. | Speech recognition method and device, electronic equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Li et al. | Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition | |
Li et al. | Hybrid deep neural network--hidden markov model (dnn-hmm) based speech emotion recognition | |
CN108109615A (en) | Construction and application method of a DNN-based Mongolian acoustic model | |
Mohamed et al. | Deep belief networks using discriminative features for phone recognition | |
Sun et al. | Voice conversion using deep bidirectional long short-term memory based recurrent neural networks | |
Hinton et al. | Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups | |
Seide et al. | Conversational speech transcription using context-dependent deep neural networks. | |
Miao et al. | Towards speaker adaptive training of deep neural network acoustic models. | |
US9460711B1 (en) | Multilingual, acoustic deep neural networks | |
CN107293291B (en) | An end-to-end speech recognition method based on adaptive learning rate | |
EP3076389A1 (en) | Statistical-acoustic-model adaptation method, acoustic-model learning method suitable for statistical-acoustic-model adaptation, storage medium in which parameters for building deep neural network are stored, and computer program for adapting statistical acoustic model | |
JP2019159654A (en) | Time-series information learning system, method, and neural network model | |
Li et al. | A comparative study on selecting acoustic modeling units in deep neural networks based large vocabulary Chinese speech recognition | |
CN107316654A (en) | Emotion identification method based on DIS NV features | |
Hwang et al. | Online keyword spotting with a character-level recurrent neural network | |
CN107615376A (en) | Voice recognition device and computer program | |
Lee et al. | Joint learning of phonetic units and word pronunciations for ASR | |
Hashimoto et al. | Trajectory training considering global variance for speech synthesis based on neural networks | |
Huang et al. | Beyond cross-entropy: towards better frame-level objective functions for deep neural network training in automatic speech recognition. | |
Niu et al. | Acoustic emotion recognition using deep neural network | |
Zen et al. | Context-dependent additive log f_0 model for HMM-based speech synthesis. | |
Rosdi et al. | Isolated malay speech recognition using Hidden Markov Models | |
Li et al. | Integrating acoustic and state-transition models for free phone recognition in L2 English speech using multi-distribution deep neural networks. | |
Hai et al. | Cross-lingual phone mapping for large vocabulary speech recognition of under-resourced languages | |
CN108182938A (en) | A training method for a DNN-based Mongolian acoustic model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180601 |