CN109388806B - Chinese word segmentation method based on deep learning and forgetting algorithm

Info

Publication number: CN109388806B
Application number: CN201811258651.5A
Authority: CN
Inventors: 卢学裕; 王安; 杨大海; 杨利军
Original assignee: Beijing Botbrain AI Technology Co Ltd
Current assignee: Beijing Botbrain AI Technology Co Ltd
Priority date: 2018-10-26
Filing date: 2018-10-26
Publication date: 2023-06-27
Anticipated expiration: 2038-10-26
Also published as: CN109388806A

Abstract

The invention discloses a Chinese word segmentation method based on deep learning and a forgetting algorithm, comprising the following steps. Step one: scan the sentence character by character to obtain natural language, and use a deep learning word segmentation method to divide the scanned natural language into a word sequence, which is stored in a first lexicon. Step two: scan the sentence character by character to obtain natural language, and use a forgetting-algorithm word segmentation method to break the obtained natural language into candidate words, which are stored in a second lexicon. Step three: fuse the word sequence in the first lexicon with the candidate words in the second lexicon to obtain the final segmentation result. The fusion rules are: if consecutive single characters in the second lexicon correspond to a word in the deep learning result, they are merged into a word; if an isolated single character in the second lexicon corresponds to a word in the deep learning result, it is merged forward or backward into a word. By combining the deep learning word segmentation method with the forgetting-algorithm word segmentation method, the invention can automatically detect domain knowledge, discover new words in a domain without supervision, and improve segmentation quality.

Description

A Chinese word segmentation method based on deep learning and forgetting algorithm

Technical field

The invention relates to the technical field of word segmentation, and in particular to a Chinese word segmentation method based on deep learning and a forgetting algorithm.

Background

Chinese word segmentation refers to dividing a sequence of Chinese characters into individual words. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a given specification.

1. Word segmentation based on string matching

This approach, also called mechanical word segmentation, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a given strategy. If a string is found in the dictionary, the match succeeds and a word is recognized. By scanning direction, string matching methods divide into forward matching and reverse matching. By which length is matched first, they divide into maximum (longest) matching and minimum (shortest) matching. By whether they are combined with part-of-speech tagging, they divide into pure segmentation methods and integrated methods that combine segmentation with tagging. Several commonly used mechanical segmentation methods are:

1) forward maximum matching (scanning from left to right);

2) reverse maximum matching (scanning from right to left);

3) minimum segmentation (minimizing the number of words cut from each sentence).

These methods can also be combined. For example, forward maximum matching and reverse maximum matching can be combined into bidirectional matching. Because single Chinese characters can themselves be words, forward minimum matching and reverse minimum matching are rarely used. In general, reverse matching is slightly more accurate than forward matching and encounters fewer ambiguities. Statistics show an error rate of 1/169 for forward maximum matching alone and 1/245 for reverse maximum matching alone. This accuracy is still far from sufficient for practical needs. Segmentation systems in actual use treat mechanical segmentation only as an initial pass and rely on various other linguistic information to further improve accuracy.
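Forward and reverse maximum matching can be sketched as follows. This is a minimal illustration, not code from the patent: the function names, the `max_len` limit, and the toy lexicon are assumptions, and an unmatched character is simply emitted as a single-character word.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Scan left to right, always taking the longest lexicon word available."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, lexicon, max_len=4):
    """Scan right to left, always taking the longest lexicon word available."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


# Toy lexicon (an assumption for illustration only).
lexicon = {"化妆", "和", "和服", "服装"}
print(forward_max_match("化妆和服装", lexicon))  # ['化妆', '和服', '装']
print(reverse_max_match("化妆和服装", lexicon))  # ['化妆', '和', '服装']
```

On this input the two directions disagree, which is the signal a bidirectional matcher would use to flag a possible ambiguity.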

One approach is to improve the scanning method, known as feature scanning or marker segmentation: words with obvious features are identified and cut out of the string first, and these words serve as breakpoints that divide the original string into smaller strings, which are then segmented mechanically, reducing the matching error rate. Another approach is to combine segmentation with part-of-speech tagging, using rich part-of-speech information to support segmentation decisions and, in turn, checking and adjusting the segmentation results during tagging, which greatly improves accuracy.

A general model can be established for mechanical word segmentation methods. Specialized academic papers cover this topic, so it is not discussed in detail here.

2. Word segmentation based on understanding

This approach recognizes words by having the computer simulate human understanding of a sentence. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use syntactic and semantic information to resolve ambiguity. It usually consists of three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Coordinated by the general control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences in order to judge segmentation ambiguities, that is, it simulates the process by which a person understands a sentence. This approach requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is general and complex, it is difficult to organize it into a form that machines can read directly, so understanding-based segmentation systems are still at the experimental stage.

3. Word segmentation based on statistics

Formally, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur adjacently therefore reflects fairly well how credible it is that they form a word. The frequencies of adjacent character combinations in a corpus can be counted and their mutual information computed. The mutual information of two characters is defined from the adjacent co-occurrence probability of two Chinese characters X and Y, and it reflects how tightly the characters are bound together. When the tightness exceeds a certain threshold, the character group can be considered a possible word. This method only needs to count the frequencies of character groups in the corpus and does not need a segmentation dictionary, so it is also called dictionary-free segmentation or statistical word extraction. It has limitations, however: it often extracts common character groups that co-occur frequently but are not words, such as "这一" (this one), "之一" (one of), "有的" (some), "我的" (my), and "许多的" (many); it recognizes common words with poor accuracy; and its time and space costs are high.

Statistical segmentation systems in practical use therefore rely on a basic segmentation dictionary (a dictionary of common words) for string matching, while using statistical methods to identify new words. Combining string frequency statistics with string matching exploits the speed and efficiency of matching-based segmentation as well as the ability of dictionary-free segmentation to recognize new words from context and resolve ambiguity automatically.
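The mutual information of two adjacent characters is commonly written in its pointwise form. The following is a sketch under assumptions: the count function $c(\cdot)$ and the corpus size $N$ are symbols introduced here for illustration, and the source does not state which estimator it uses.

```latex
I(X, Y) = \log \frac{P(X, Y)}{P(X)\,P(Y)},
\qquad
P(X) \approx \frac{c(X)}{N},
\quad
P(X, Y) \approx \frac{c(XY)}{N}
```

A value $I(X, Y) > 0$ means the pair co-occurs more often than independent characters would, and the pair is treated as a possible word once $I(X, Y)$ exceeds the chosen threshold.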

1. Ambiguity recognition

Ambiguity means that the same sentence can be segmented in two or more ways. For example, in "表面的", both "表面" and "面的" are words, so the phrase can be split as "表面 / 的" or as "表 / 面的". This is called crossing ambiguity. Crossing ambiguity is very common, and the "和服" (kimono) error is a typical case: "化妆和服装" (makeup and clothing) can be split as "化妆 / 和 / 服装" or as "化妆 / 和服 / 装". Without human knowledge to interpret the text, it is hard for a computer to know which split is correct.

Crossing ambiguity is relatively easy to handle compared with combinational ambiguity, which must be judged from the whole sentence. For example, in the sentence "这个门把手坏了" (this door handle is broken), "把手" (handle) is a word, but in the sentence "请把手拿开" (please take your hand away), "把手" is not a word. In the sentence "将军任命了一名中将" (the general appointed a lieutenant general), "中将" (lieutenant general) is a word, but in the sentence "产量三年中将增长两倍" (output will triple within three years), "中将" is not a word. How is a computer to recognize these cases?

Even if a computer could resolve both crossing ambiguity and combinational ambiguity, one harder problem remains: true ambiguity. True ambiguity means that, given a sentence, even a human cannot tell which strings should be words. For example, "乒乓球拍卖完了" can be split as "乒乓球拍 / 卖 / 完 / 了" (the table tennis paddles are sold out) or as "乒乓球 / 拍卖 / 完 / 了" (the table tennis ball auction is over). Without other sentences as context, no one can say whether "拍卖" (auction) counts as a word here.

2. New word recognition

New words, known technically as out-of-vocabulary words, are words that are not recorded in any dictionary but can legitimately be called words. The most typical are personal names. A person easily understands that in the sentence "王军虎去广州了" (Wang Junhu went to Guangzhou), "王军虎" is a word because it is a person's name, but this is hard for a computer to recognize. If "王军虎" were added to the dictionary as a word, the problem would remain that the world has a vast number of names and new ones appear all the time, so recording them is itself a huge undertaking. Even if that work could be completed, problems would persist: in the sentence "王军虎头虎脑的" (Wang Jun looks sturdy and guileless), can "王军虎" still count as a word?

Besides personal names, new words include organization names, place names, product names, brand names, abbreviations, and elliptical forms, all of which are hard to handle and are also words that people use frequently. For search engines, new word recognition in the segmentation system is therefore very important, and the accuracy of new word recognition has become one of the key measures of a segmentation system. Existing segmentation algorithms are lexicon-based and cannot segment words that do not appear in the lexicon.

Summary of the invention

To address the above technical problems, the invention provides a Chinese word segmentation method based on deep learning and a forgetting algorithm. By fusing the deep learning word segmentation method with the forgetting-algorithm word segmentation method, it can automatically detect domain knowledge, discover new words in a domain without supervision, and improve segmentation quality.

To solve the above technical problems, the technical solution adopted by the invention is a Chinese word segmentation method based on deep learning and a forgetting algorithm, comprising the following steps:

Step one: scan the sentence character by character to obtain natural language, and use a deep learning word segmentation method to divide the scanned natural language into a word sequence, which is stored in a first lexicon;

Step two: scan the sentence character by character to obtain natural language, and use a forgetting-algorithm word segmentation method to break the obtained natural language into candidate words, which are stored in a second lexicon;

Step three: fuse the word sequence in the first lexicon with the candidate words in the second lexicon to obtain the final segmentation result, wherein the fusion method is:

If the first lexicon and the second lexicon both give a word, the result is a word; if the first lexicon and the second lexicon both give single characters, the result is single characters; if consecutive single characters in the second lexicon correspond to a word in the deep learning result, they are merged into a word; if an isolated single character in the second lexicon corresponds to a word in the deep learning result, it is merged forward or backward into a word.

The deep learning word segmentation method of step one uses an RNN method.

The deep learning word segmentation method of step one uses the LSTM model among RNN methods.

The forgetting-algorithm word segmentation method of step two uses the following decision formula:

$P(W_n W_{n+1}) < P(W_n) \cdot P(W_{n+1})$

where $W_n$ is the n-th character of the scanned sentence.

The forgetting curve used by the forgetting algorithm in step two is a Newton cooling curve.

The beneficial effects of the invention are as follows.

The word segmentation method of the invention has the following advantages:

(1) unsupervised learning, so a large corpus can be used for training;

(2) O(N) time complexity, so large-scale segmentation completes in a relatively short time;

(3) a self-maintaining lexicon: without manual intervention, the program discovers and adds new words, adjusts word frequencies, cleans up wrong words, and removes rare words, keeping the dictionary at an appropriate size;

(4) domain adaptation: when the domain changes, entries and word frequencies adjust accordingly;

(5) support for segmenting proprietary vocabulary such as the names of obscure artists and program titles.

Description of the drawings

Fig. 1 shows the forgetting curve used for the forgetting coefficient in the Chinese word segmentation method based on deep learning and a forgetting algorithm of the invention;

Fig. 2 is a logic diagram of the LSTM model in the Chinese word segmentation method based on deep learning and a forgetting algorithm of the invention.

Detailed description

In the following detailed description of the preferred embodiments, reference is made to the accompanying drawings, which form a part of the invention. The drawings show, by way of example, specific embodiments in which the invention can be practiced. The illustrated embodiments are not intended to be exhaustive of all embodiments of the invention. Other embodiments may be used, and structural or logical changes may be made, without departing from the scope of the invention. The following detailed description is therefore not limiting, and the scope of the invention is defined by the appended claims.

A Chinese word segmentation method based on deep learning and a forgetting algorithm comprises the following steps:

The invention combines deep learning with a forgetting algorithm, so that it can automatically detect domain knowledge, discover new words in a domain without supervision, and improve segmentation quality.

The main steps of the forgetting algorithm are as follows:

The following steps complete segmentation in a single scan, in O(N) time:

Scan the sentence character by character. For each character, look up in the lexicon all words within the length limit that end with that character, compute for each the product of its probability and the probabilities of the words preceding it, take the word with the largest result, and cache the maximum probability product at the current character position together with the corresponding segmentation. Repeat these steps until the sentence has been scanned. The segmentation cached at the last character position is the segmentation of the whole sentence.
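The single-pass scan described above is a dynamic program over character positions. The sketch below is one possible reading, not the patent's implementation: `word_prob`, `max_len`, and `unknown_prob` are hypothetical names, the probabilities are invented, and log probabilities replace raw products to avoid numeric underflow. With `max_len` fixed, the work per character is bounded, so the scan stays linear in sentence length.

```python
import math


def segment(sentence, word_prob, max_len=4, unknown_prob=1e-8):
    """Return the segmentation with the largest product of word probabilities.

    word_prob maps lexicon words to positive probabilities (an assumption).
    """
    n = len(sentence)
    # best[i] = (best log-probability of sentence[:i], start of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        # Consider every candidate word that ends at position `end`.
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            prob = word_prob.get(word)
            if prob is None:
                if end - start > 1:
                    continue  # unknown multi-character strings are not words
                prob = unknown_prob  # unknown single characters stay possible
            score = best[start][0] + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk back from the last position to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(sentence[start:end])
        end = start
    words.reverse()
    return words


# Invented probabilities, for illustration only.
probs = {"化妆": 0.02, "和": 0.05, "服装": 0.02, "和服": 0.001, "装": 0.001}
print(segment("化妆和服装", probs))  # ['化妆', '和', '服装']
```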

If two adjacent characters are unrelated, the sentence can be broken between them. Scan the sentence character by character, and whenever two adjacent characters satisfy the formula below, break between them. This cuts the sentence into several substrings, which form the set of "candidate words". The decision formula is:

$P(W_n W_{n+1}) < P(W_n) \cdot P(W_{n+1})$

The parameters required by the formula can be obtained by counting: a single pass over the corpus yields the "frequency of each single character", the "co-occurrence frequency of two adjacent characters", and the "sum of the frequencies of all single characters".
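The counting pass and the break rule can be sketched as follows. This is an illustration under assumptions: the function names and the tiny corpus are invented, and both probabilities are estimated by dividing a count by the sum of all single-character frequencies, which is one plausible reading of the three statistics listed above.

```python
from collections import Counter


def count_corpus(corpus):
    """One pass over the corpus: single-character and adjacent-pair counts."""
    singles, pairs = Counter(), Counter()
    for sentence in corpus:
        singles.update(sentence)
        pairs.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return singles, pairs


def candidate_words(sentence, singles, pairs):
    """Break between W_n and W_n+1 when P(W_n W_n+1) < P(W_n) * P(W_n+1)."""
    if not sentence:
        return []
    total = sum(singles.values())
    pieces, start = [], 0
    for i in range(len(sentence) - 1):
        left, right = sentence[i], sentence[i + 1]
        p_pair = pairs[left + right] / total
        p_independent = (singles[left] / total) * (singles[right] / total)
        if p_pair < p_independent:
            pieces.append(sentence[start:i + 1])
            start = i + 1
    pieces.append(sentence[start:])
    return pieces


# Invented toy corpus: "电影" and "好看" each occur far more often than "影好".
corpus = ["电影好看"] + ["电影"] * 4 + ["好看"] * 4
singles, pairs = count_corpus(corpus)
print(candidate_words("电影好看", singles, pairs))  # ['电影', '好看']
```

Note that a character never seen in the corpus has probability zero, so the strict inequality never fires next to it. A practical system would need some smoothing, which the source does not describe.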

The forgetting curve used for the forgetting coefficient is shown in Figure 1.
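Newton's law of cooling describes exponential decay toward an ambient value. With an ambient value of zero, a weight $w_0$ decays to $w_0 e^{-kt}$ after time $t$. The source states only that the forgetting curve is a Newton cooling curve, so the sketch below is an assumption about how such a coefficient might maintain a lexicon: `ForgettingLexicon`, `half_life`, `min_weight`, and the pruning policy are all hypothetical.

```python
import math


def cooling_rate_from_half_life(half_life):
    """Rate k such that a weight halves after `half_life` time units."""
    return math.log(2) / half_life


def decayed(weight, elapsed, cooling_rate):
    """Newton-cooling style decay toward zero: weight * exp(-k * elapsed)."""
    return weight * math.exp(-cooling_rate * elapsed)


class ForgettingLexicon:
    """Hypothetical lexicon whose entries fade unless they are seen again."""

    def __init__(self, half_life=100.0, min_weight=0.1):
        self.cooling_rate = cooling_rate_from_half_life(half_life)
        self.min_weight = min_weight
        self.entries = {}  # word -> (weight, time of last update)

    def observe(self, word, now):
        """Decay the stored weight up to `now`, then reinforce it by 1."""
        weight, seen = self.entries.get(word, (0.0, now))
        faded = decayed(weight, now - seen, self.cooling_rate)
        self.entries[word] = (faded + 1.0, now)

    def prune(self, now):
        """Forget words whose decayed weight has fallen below the floor."""
        self.entries = {
            word: (weight, seen)
            for word, (weight, seen) in self.entries.items()
            if decayed(weight, now - seen, self.cooling_rate) >= self.min_weight
        }
```

Under this reading, words that keep recurring hold their weight while one-off errors fade and are pruned, which would match the self-maintenance behavior the patent lists among its advantages.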

The deep learning method is an RNN method, specifically the LSTM model.

Chinese word segmentation divides natural-language text into a word sequence. It is preferably treated as sequence labeling, that is, each character in the sentence is labeled with one of the four BMES tags (B: beginning of a word, M: middle of a word, E: end of a word, S: single-character word).

For {京东搜索与大数据平台数据挖掘算法部} (JD Search and Big Data Platform, Data Mining Algorithm Department),

the labels are {BE BE S BME BE BMME BME}.
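The BMES labeling above can be reproduced from a segmented sentence with a few lines of code. This is a minimal sketch and the function name `encode_bmes` is an assumption. The word boundaries are the ones implied by the label groups in the example.

```python
def encode_bmes(words):
    """Return one BMES tag string per word: S for a single character,
    otherwise B, then M for each middle character, then E."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.append("B" + "M" * (len(word) - 2) + "E")
    return tags


words = ["京东", "搜索", "与", "大数据", "平台", "数据挖掘", "算法部"]
print(" ".join(encode_bmes(words)))  # BE BE S BME BE BMME BME
```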

The labeled corpus, consisting of original input sequences and their output sequences, is used for training, and the model finally produces the segmentation sequence. The logic diagram of the LSTM model is shown in Figure 2, where X is the input sequence and H is the output sequence. The basic idea is still to treat segmentation as a sequence labeling problem, giving each character of a sentence one of the four BMES labels. The input to the whole model is a character sequence and the output is a label sequence, so this is a standard sequence-to-sequence problem.
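Because the model outputs a label sequence, a small decoding step turns the labels back into words. The sketch below shows only that step, not the LSTM itself. The function name `decode_bmes` and the lenient handling of malformed label sequences are assumptions.

```python
def decode_bmes(chars, tags):
    """Rebuild words from characters and their per-character BMES tags.

    A word closes on E or S. A new B or S also closes any word left open,
    so a malformed tag sequence still yields a segmentation.
    """
    words, current = [], ""
    for char, tag in zip(chars, tags):
        if tag in ("B", "S") and current:
            words.append(current)
            current = ""
        current += char
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


chars = "京东搜索与大数据平台数据挖掘算法部"
tags = "BEBESBMEBEBMMEBME"
print(decode_bmes(chars, tags))
```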

The combined method improves segmentation by fusing the results of the two methods, with the forgetting algorithm as the primary method, because:

· Variety shows, artist names, and the like play an important role in recommendation

· As an unsupervised method, the forgetting algorithm has a low cost of obtaining training corpus

· Training corpus for deep learning algorithms is scarce and training takes a long time

Merging scheme

· Consecutive single characters in the forgetting algorithm result are merged into a word if they correspond to a word in the deep learning result

· An isolated single character in the forgetting algorithm result is merged forward or backward into a word if it corresponds to a word in the deep learning result

· Merging also takes part of speech into account
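The first two merging rules can be sketched as follows. This is one interpretation, not the patent's stated procedure: a deep learning word is adopted only when both of its boundaries coincide with forgetting-algorithm boundaries and it covers at least one single-character token. Otherwise the forgetting-algorithm tokens are kept, since that algorithm is the primary method. The part-of-speech rule is not implemented, and the names `spans` and `fuse` are assumptions. The token lists are adapted from the first two headlines of Example 1.

```python
def spans(words):
    """Return the (start, end) character offsets of each word, in order."""
    offsets, pos = [], 0
    for word in words:
        offsets.append((pos, pos + len(word)))
        pos += len(word)
    return offsets


def fuse(forget_words, dl_words):
    """Merge single-character forgetting tokens into deep learning words."""
    text = "".join(forget_words)
    if text != "".join(dl_words):
        raise ValueError("both segmentations must cover the same text")
    f_spans = spans(forget_words)
    f_bounds = {0} | {end for _, end in f_spans}

    adopted = {}  # start offset -> end offset of each adopted deep learning word
    for start, end in spans(dl_words):
        if end - start < 2 or start not in f_bounds or end not in f_bounds:
            continue  # boundaries disagree: keep the forgetting tokens
        inside = [(a, b) for a, b in f_spans if start <= a and b <= end]
        if len(inside) > 1 and any(b - a == 1 for a, b in inside):
            adopted[start] = end

    result, skip_until = [], 0
    for a, b in f_spans:
        if a < skip_until:
            continue  # already covered by an adopted word
        if a in adopted:
            result.append(text[a:adopted[a]])
            skip_until = adopted[a]
        else:
            result.append(text[a:b])
    return result


forget = ["实拍", "男子", "地铁", "猥亵", "女", "乘客", "被", "热心", "乘客", "扭", "获"]
deep = ["实", "拍", "男子", "地铁", "猥亵女", "乘客", "被", "热心", "乘客", "扭获"]
print(fuse(forget, deep))
```

For this headline the sketch yields the same tokens as the merged result listed in Example 1. The inner list comprehension rescans all tokens for each deep learning word, which is fine for short titles but would need an index for long inputs.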

Example 1:

Natural language is obtained by scanning sentences and is segmented by both the forgetting algorithm and deep learning, and the two results are then fused.

The results of segmenting with each algorithm separately are as follows.

Segmentation results of the improved forgetting algorithm:

<实拍><男子><地铁><猥亵><女><乘客><><被><热心><乘客><扭><获>

<口袋妖怪><网络><版><的><注册><下载><教学><视频>

<霍><某><某><雪夜><觅><真爱><超><浪漫><表白><女生><感动><痛哭><161105><非常><完美>

<微微一笑很倾城><郑某><杨某><吻戏><玩><游戏><谈><恋爱>

<姜某某><调侃><麻将><应><进><奥运><笑称><可><与><体操><结合>

<周><某><某><守><备><站><左><外><野><球><场><喇叭><声><超大><自><备><妙招><防><敌><军>

Segmentation results of the deep learning algorithm:

<实><拍><男子><地铁><猥亵女><乘客><被><热心><乘客><扭获>

<口袋><妖怪><网络版><的><注册><下载><教学><视频>

<霍某某><雪><夜觅><真爱><超><浪漫><表白><女生><感动痛><哭><161105><非常><完美>

<微微><一笑><很><倾城><郑某><杨某><吻><戏><玩><游戏><谈恋爱>

<姜><某><某><调侃><麻将><应><进奥运><笑称><可><与><体操><结合>

<周某某><守备><站><左><外><野球场><喇叭><声超大><自备><妙><招防><敌军>

Results after merging according to the above scheme:

<实拍><男子><地铁><猥亵女><乘客><被><热心><乘客><扭获>

<口袋妖怪><网络版><的><注册><下载><教学><视频>

<霍某某><雪夜><觅><真爱><超><浪漫><表白><女生><感动><痛哭><161105><非常><完美>

<微微一笑很倾城><郑某><杨某><吻戏><玩><游戏><谈恋爱>

<周某某><守备><站><左><外><野球场><喇叭><声超大><自备><妙招><防><敌军>

Claims

1. A Chinese word segmentation method based on deep learning and a forgetting algorithm, characterized by comprising the following steps:

step one: scanning a sentence character by character to obtain natural language, dividing the scanned natural language into a word sequence using a deep learning word segmentation method, and storing the word sequence in a first lexicon;

step two: scanning the sentence character by character to obtain natural language, breaking the obtained natural language into candidate words using a forgetting-algorithm word segmentation method, and storing the candidate words in a second lexicon;

step three: fusing the word sequence in the first lexicon with the candidate words in the second lexicon to obtain a final word segmentation result, wherein the fusion method is as follows:

if the first lexicon and the second lexicon both give a word, the result is a word; if the first lexicon and the second lexicon both give single characters, the result is single characters; if consecutive single characters in the second lexicon correspond to a word in the deep learning result, they are merged into a word; if an isolated single character in the second lexicon corresponds to a word in the deep learning result, it is merged forward or backward into a word;

the forgetting-algorithm word segmentation method uses the following decision formula:

$P(W_n W_{n+1}) < P(W_n) \cdot P(W_{n+1})$

wherein $W_n$ is the n-th character of the scanned sentence;

$P(W_n)$:

$P(W_n W_{n+1})$:

and in step two, the forgetting curve used by the forgetting algorithm is a Newton cooling curve.

2. The Chinese word segmentation method based on deep learning and a forgetting algorithm according to claim 1, wherein the deep learning word segmentation method of step one uses an RNN method.

3. The Chinese word segmentation method based on deep learning and a forgetting algorithm according to claim 1 or 2, wherein the deep learning word segmentation method of step one uses the LSTM model among RNN methods.