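Du Ruopeng, Xian Guojian, Kou Yuantao. Text Feature Extraction from Agricultural Science and Technology Literature Based on an Improved TF-IDF-CHI Algorithm [J]. Digital Library Forum, 2019, (8): 18-24 |
Text Feature Extraction from Agricultural Science and Technology Literature Based on an Improved TF-IDF-CHI Algorithm |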
Improvement and Application of TF-IDF-CHI in Agricultural Science Text Feature Extraction |
Submitted: 2019-07-06 |
DOI:10.3772/j.issn.1673-2286.2019.08.003 |
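Keywords (translated from the Chinese): feature extraction; TF-IDF; chi-square statistic; text classification; agricultural science and technology literature |
Keywords (English): Feature Extraction; TF-IDF; Chi-Square Statistics; Text Categorization; Agricultural Science |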
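Funding: This research was supported by the National Social Science Fund of China project "Research on the Construction and Application of Panoramic Abstract Knowledge Graphs for Scientific Papers" (No. 19BTQ61), the Science and Technology Innovation Program of the Chinese Academy of Agricultural Sciences (No. CAAS-ASTIP-2016-AII), and the China Knowledge Centre for Engineering Sciences and Technology construction project (No. CKCEST-2018-1-15). |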
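Author | Affiliation |
Du Ruopeng | Agricultural Information Institute, Chinese Academy of Agricultural Sciences / Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs |
Xian Guojian | Agricultural Information Institute, Chinese Academy of Agricultural Sciences / Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs |
Kou Yuantao | Agricultural Information Institute, Chinese Academy of Agricultural Sciences / Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs |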
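Abstract (translated from the Chinese): |
To address the heavy overlap in text feature information among the literature of closely related agricultural research fields, and the shortcomings of traditional text feature extraction methods, the TF-IDF algorithm is optimized and validated in application. The feature term weighting function is reconstructed by introducing the chi-square test value and a feature term frequency correction factor, yielding the improved ImpTF-IDF-CHI method. Features extracted by this method and by three traditional methods (document frequency, information gain, and TF-IDF) are applied in naive Bayes classification experiments, and the methods are compared on the experimental results. Across 58 groups of feature extraction and text classification experiments with the four methods, the feature terms extracted by ImpTF-IDF-CHI give the highest classification accuracy, with an average accuracy of 94% and an F1 value of 0.844. The results show that, for feature extraction from texts in closely related agricultural research fields, the method offers high accuracy, good stability, and strongly representative subject terms, and that it can be applied effectively to text classification, feature representation, and topic extraction for such literature. |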
Abstract (English): |
This paper aims to address the shortcomings of the traditional TF-IDF method and to verify the improvement through text classification tests in the agricultural field. The improved method, ImpTF-IDF-CHI, reconstructs the feature word weighting function by adding chi-square test values and weight correction factors. First, the ImpTF-IDF-CHI method, the document frequency method, the information gain method, and TF-IDF are each used to extract feature words. The extracted feature words are then used for text classification, and the methods are compared on the classification results. Across all tests, ImpTF-IDF-CHI gave the best results: naive Bayes text classification using its feature words reached an accuracy of 94% and an F1 value of 0.844. The experiments demonstrate the effectiveness of the ImpTF-IDF-CHI method and its advance over the three traditional methods. In text feature extraction, the method shows high accuracy, good stability, and strongly representative subject terms. It can be applied to text categorization, feature expression, and theme extraction. |
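The abstract describes reweighting TF-IDF with a chi-square value so that terms shared by closely related classes are pushed down. The sketch below illustrates only that general idea. The paper's actual weighting function and its term frequency correction factor are not given on this page, so the product used here (class-level term frequency times smoothed IDF times chi-square) is an assumption, and the names `chi_square` and `rank_features` are hypothetical.

```python
import math
from collections import Counter


def chi_square(term, label, doc_sets, labels):
    """2x2 chi-square statistic for term presence versus class membership."""
    n = len(doc_sets)
    pairs = list(zip(doc_sets, labels))
    a = sum(1 for s, y in pairs if term in s and y == label)      # present, in class
    b = sum(1 for s, y in pairs if term in s and y != label)      # present, other classes
    c = sum(1 for s, y in pairs if term not in s and y == label)  # absent, in class
    d = n - a - b - c                                             # absent, other classes
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def rank_features(docs, labels, label, k):
    """Return the k highest-weighted terms for one class.

    Assumed weight (not the paper's formula): tf_in_class * smoothed_idf * chi_square.
    """
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    df = Counter(t for s in doc_sets for t in s)  # document frequency per term
    class_tokens = [t for d, y in zip(docs, labels) if y == label for t in d]
    tf = Counter(class_tokens)                    # term counts inside the class
    total = len(class_tokens)
    scores = {}
    for term, count in tf.items():
        idf = math.log((1 + n) / (1 + df[term])) + 1
        scores[term] = (count / total) * idf * chi_square(term, label, doc_sets, labels)
    # Break ties alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus: tokenized documents from two closely related fields.
docs = [
    ["wheat", "yield", "soil"],
    ["wheat", "disease", "soil"],
    ["pig", "feed", "disease"],
    ["pig", "feed", "yield"],
]
labels = ["crop", "crop", "livestock", "livestock"]
print(rank_features(docs, labels, "crop", 2))
```

In this toy corpus, "yield" and "disease" appear equally often in both classes, so their chi-square value is zero and they fall out of the ranking even though plain TF-IDF would give them a nonzero weight. The ranked terms could then be used as the vocabulary for a naive Bayes classifier, as in the experiments the abstract reports.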