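Article abstract
Ding Ruiyi, Wang Yuzhuo, Zhang Chengzhi. Extraction Algorithmic Entity from Full Text of Academic Articles in Special Domain [J]. Digital Library Forum, 2022, (3): 2-14
Extraction Algorithmic Entity from Full Text of Academic Articles in Special Domain (基于学术论文全文内容的特定领域算法实体抽取研究)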
Submitted: 2022-03-03
DOI:10.3772/j.issn.1673-2286.2022.03.001
Keywords: Full Text of Academic Articles; Algorithmic Entity; Entity Extraction; Academic Text Mining
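Funding: This research was supported by the Jiangsu Provincial Social Science Foundation project "Research on the Evaluation and Prediction of Academic Innovation Capability from a Multidimensional Perspective" (No. 18TQD003).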
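Author  Affiliation
Ding Ruiyi  School of Economics and Management, Nanjing University of Science and Technology
Wang Yuzhuo  School of Economics and Management, Nanjing University of Science and Technology
Zhang Chengzhi  School of Economics and Management, Nanjing University of Science and Technology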
Abstract:
      Studying the algorithmic entities in academic papers can deepen our understanding of the role algorithms play in scientific research, and extracting these entities from full-text data is the basis of such work. Extracting algorithmic entities from the full text of academic papers can be regarded as a special kind of named entity recognition. Through manual identification, this paper extracts 977 algorithmic entities from 4,641 papers and builds an algorithmic entity list. Based on this list, a labeled corpus is constructed and an automatic extraction model for algorithmic entities is trained; applying the model to the remaining corpus yields 221 new algorithmic entities. Finally, the automatic and manual extraction results are integrated to obtain a total of 1,198 algorithmic entities. The results show that manual extraction can supply a certain amount of labeled corpus for automatic extraction, and that the resulting model can effectively extract new algorithmic entities missed by the manual method. It can also extract new surface forms of existing algorithmic entities, which further expands and refines the manual extraction results.
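The step from an entity list to a labeled corpus can be illustrated with a small sketch. The abstract does not say which tagging scheme or matching strategy the authors used, so the BIO scheme, the `ALG` tag name, the greedy longest-match rule, and the function name `bio_label` below are all assumptions made for illustration, not the paper's method.

```python
from typing import List, Set, Tuple


def bio_label(tokens: List[str], entity_list: Set[Tuple[str, ...]]) -> List[str]:
    """Tag tokens as B-ALG / I-ALG / O by greedy longest match against a list.

    Assumption: entities in `entity_list` are lowercased token tuples, and
    matching is case-insensitive. The paper may have used a different scheme.
    """
    max_len = max((len(e) for e in entity_list), default=0)
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in entity_list:
                matched = n
                break
        if matched:
            labels[i] = "B-ALG"
            for j in range(i + 1, i + matched):
                labels[j] = "I-ALG"
            i += matched
        else:
            i += 1
    return labels


# Hypothetical entity list and sentence, for illustration only.
entity_list = {
    ("support", "vector", "machine"),
    ("svm",),
    ("conditional", "random", "field"),
}
tokens = "We train a Support Vector Machine and compare it with SVM baselines".split()
labels = bio_label(tokens, entity_list)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Sentences labeled this way could serve as training data for a sequence-labeling model. A model trained on such data is not limited to exact list matches, which is consistent with the abstract's finding that the automatic model recovered entities and surface forms absent from the manual list.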