Includes: complete thesis | Word count: 12,342 | Format: Word (*.doc)
Abstract: With the development of artificial intelligence, applications of natural language understanding have become increasingly widespread, and almost every system that works with Chinese text must first perform word segmentation. Chinese word segmentation splits a Chinese sentence into words; it is the basis on which a computer interprets Chinese text and the most important preprocessing step in Chinese information processing. The recognition of unknown words is a major factor in segmentation accuracy. Unknown words are words that are not listed in the segmentation system's dictionary of common words. They are numerous, fall into many types with differing structural patterns, and new ones are coined continually, so no dictionary can cover them all. Any unknown word in a text that goes unrecognized directly lowers the precision and recall of segmentation. Many segmentation tools exist in China and abroad, and their precision and recall on unknown words have improved, but false positives and misses still interfere with Chinese information retrieval and with segmentation itself.

This paper uses People's Daily text from 2001 to 2004 as the experimental corpus and segments it with the word segmentation software of the Chinese Academy of Sciences. It focuses on loose strings of consecutive single characters in the segmenter output (segmentation fragments) and decides whether each one is likely to be an unknown word. The segmented corpus is analyzed with Professor Chen Xiaohe's package algorithm: the single-character probability, the single-character word probability, and the single-character non-word probability are estimated from the large corpus, and the algorithm is implemented on the basis of these figures. Three test corpora were selected. From the total number of unknown words extracted, the number extracted correctly, and the number missed, precision and recall were 84.61% and 91.67%, 81.66% and 98.0%, and 83.33% and 90.91%, respectively. The results show that the system identifies unknown words at a relatively high rate.

Key words: unknown words; segmentation fragments; single-character non-word probability; single-character word probability
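The abstract does not spell out the package algorithm, so the following is only a minimal sketch of the general idea and not Professor Chen's method. It assumes that the single-character word probability of a character is the share of its occurrences in which the segmenter emitted it as a one-character token, that the non-word probability is one minus that share, and that a fragment is a candidate unknown word when the geometric mean of its characters' non-word probabilities reaches a threshold. The function names, the default of 0.5 for unseen characters, and the 0.5 threshold are all hypothetical.

```python
from collections import Counter
from math import prod


def char_stats(segmented_sentences):
    """Count how often each character occurs as a one-character token
    and how often it occurs at all in a segmented corpus."""
    as_word = Counter()
    total = Counter()
    for tokens in segmented_sentences:
        for tok in tokens:
            for ch in tok:
                total[ch] += 1
            if len(tok) == 1:
                as_word[tok] += 1
    return as_word, total


def nonword_prob(ch, as_word, total, default=0.5):
    """Assumed definition: 1 - P(character stands alone as a word).
    Characters never seen in training get the hypothetical default."""
    if total[ch] == 0:
        return default
    return 1.0 - as_word[ch] / total[ch]


def fragments(tokens, min_len=2):
    """Yield maximal runs of consecutive one-character tokens."""
    run = []
    for tok in tokens:
        if len(tok) == 1:
            run.append(tok)
        else:
            if len(run) >= min_len:
                yield "".join(run)
            run = []
    if len(run) >= min_len:
        yield "".join(run)


def candidates(tokens, as_word, total, threshold=0.5):
    """Keep fragments whose geometric-mean non-word probability
    is at least the (hypothetical) threshold."""
    kept = []
    for frag in fragments(tokens):
        probs = [nonword_prob(ch, as_word, total) for ch in frag]
        score = prod(probs) ** (1 / len(frag))
        if score >= threshold:
            kept.append(frag)
    return kept


# Toy training corpus: each sentence is a list of segmenter tokens.
train = [["我", "喜欢", "北京"], ["他", "的", "欢乐"], ["我", "的", "京剧"]]
as_word, total = char_stats(train)
print(candidates(["京", "欢", "很好", "我", "的"], as_word, total))
```

In this toy run the characters of the first fragment only ever appear inside longer words in the training data, while those of the second always appear as words on their own, so only the first fragment is kept. A real system would also need to split runs that mix genuine one-character words with pieces of an unknown word, which this sketch does not attempt.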
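The precision and recall figures follow from the three counts named in the abstract: unknown words extracted, extracted correctly, and missed. A short sketch of that arithmetic, using the standard definitions, is given here. The function name and the example counts are illustrative only; the abstract does not report the raw counts for the three test corpora.

```python
def precision_recall(extracted, correct, missed):
    """Standard definitions:
    precision = correct / extracted
    recall    = correct / (correct + missed)
    """
    if extracted <= 0 or correct + missed <= 0:
        raise ValueError("counts must allow both ratios to be computed")
    precision = correct / extracted
    recall = correct / (correct + missed)
    return precision, recall


# Illustrative counts, not taken from the thesis.
p, r = precision_recall(extracted=25, correct=20, missed=4)
print(f"precision={p:.2%} recall={r:.2%}")
```

Recall can exceed precision, as it does in all three reported pairs, whenever the system misses few true unknown words but also extracts a number of strings that are not words.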