Includes: full thesis | Word count: 13,295 | Format: Word (*.doc)
Abstract: Text classification is an important application area of natural language processing and plays a significant role in information retrieval, digital libraries, and text filtering. It can make document management more systematic and standardized, in line with the requirements of modern management practice.

This thesis first introduces the research background and significance of text classification and the state of research in China and abroad. It then describes in detail the techniques and algorithms used to implement a text classification system. Building on this account of Chinese information processing and of text classification techniques and algorithms, it implements a Chinese text classification system based on the vector space model: feature selection is used to build class models from the training sample set, these models serve as the basis for the classifier design, and the ROCCHIO and KNN methods are applied in turn to classify texts. Finally, the experimental results are analyzed and evaluated.

Automatic text classification comprises four stages: text modeling, training, classification, and performance evaluation. The text is first preprocessed, represented with the model, and subjected to feature extraction; a classifier is then constructed and trained; the classifier is used to classify new texts; and finally the classification performance is evaluated.

The Chinese corpus used in the experiments is divided into a training corpus and a test corpus covering 10 categories: computers, environment, military, transportation, education, economy, sports, medicine, art, and politics. The training corpus contains 1,430 documents and the test corpus 195, for a reported total of 1,525. The experimental data show that classification performance with the MI (mutual information) feature extraction method changes markedly as the feature dimension increases, and that the choice of K in KNN also has a considerable effect on classifier performance. With the best feature dimension and K value, the KNN classifier reaches a macro-averaged precision of 91.9% and a macro-averaged recall of 90.8%, which is a satisfactory result. The ROCCHIO classifier reaches a macro-averaged precision of 54.9% and a macro-averaged recall of 45.1%, so its classification performance is unsatisfactory compared with KNN.

Keywords: text classification, vector space model, feature extraction, training samples
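To make the pipeline in the abstract concrete, here is a minimal Python sketch of vector-space classification with a ROCCHIO-style centroid classifier, a KNN classifier, and macro-averaged precision and recall. It is not the thesis's implementation. It assumes documents are already segmented into tokens, uses raw term frequencies as weights and cosine similarity as the similarity measure, omits MI feature selection, and simplifies ROCCHIO to plain class centroids. All function names and the toy English data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def vectorize(tokens):
    """Represent a tokenized document as a sparse term-frequency vector."""
    return dict(Counter(tokens))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rocchio_train(train):
    """Build one centroid per class by averaging its training vectors."""
    sums, counts = defaultdict(Counter), Counter()
    for vec, label in train:
        sums[label].update(vec)  # add term weights into the class total
        counts[label] += 1
    return {c: {t: w / counts[c] for t, w in s.items()} for c, s in sums.items()}


def rocchio_predict(centroids, vec):
    """Assign the class whose centroid is most similar to the document."""
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))


def knn_predict(train, vec, k=3):
    """Majority vote among the k training documents most similar to vec."""
    ranked = sorted(train, key=lambda item: cosine(vec, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def macro_precision_recall(gold, predicted):
    """Average per-class precision and recall, weighting each class equally."""
    classes = sorted(set(gold))
    pairs = list(zip(gold, predicted))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for g, p in pairs if g == c and p == c)
        fp = sum(1 for g, p in pairs if g != c and p == c)
        fn = sum(1 for g, p in pairs if g == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


# Toy data for illustration only; the thesis corpus is not reproduced here.
TRAIN = [
    (vectorize(["team", "match", "goal"]), "sports"),
    (vectorize(["match", "coach", "goal"]), "sports"),
    (vectorize(["market", "bank", "trade"]), "economy"),
    (vectorize(["bank", "price", "trade"]), "economy"),
]
TEST = [
    (vectorize(["goal", "coach", "team"]), "sports"),
    (vectorize(["price", "market", "bank"]), "economy"),
]

if __name__ == "__main__":
    centroids = rocchio_train(TRAIN)
    gold = [label for _, label in TEST]
    rocchio_out = [rocchio_predict(centroids, vec) for vec, _ in TEST]
    knn_out = [knn_predict(TRAIN, vec, k=3) for vec, _ in TEST]
    print("ROCCHIO:", macro_precision_recall(gold, rocchio_out))
    print("KNN:", macro_precision_recall(gold, knn_out))
```

Macro-averaging computes precision and recall per class and then takes the unweighted mean, so each of the 10 categories would count equally regardless of how many test documents it has. The parameter `k` corresponds to the K value whose choice the abstract reports as influential.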