Analysis of Dataset Citing Behaviors in the Field of Natural Language Processing in China
Keywords: Dataset Citation; Data Citation; Natural Language Processing; Highly Cited Dataset; Dataset Reuse
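Xu Linhong (徐琳宏), School of Software, Dalian University of Foreign Languages
Wang Kaida (王凯达), School of Software, Dalian University of Foreign Languages
Zhang Lijie (张立杰), School of Software, Dalian University of Foreign Languages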
      As scientific research grows more dependent on data, analyzing how datasets are cited in natural language processing (NLP) research in China can help standardize the construction and use of datasets and support the rapid development of the field. This paper takes 1,628 papers published in the Journal of Chinese Information Processing from 2013 to 2022 as its sample and, through full-text analysis, manually annotates 1,970 dataset citation records to study how the literature cites datasets. The study finds that, in NLP research in China, the number of papers citing others' datasets is gradually increasing while the number of papers using self-built datasets is gradually decreasing, and that papers citing datasets have a higher average citation count per paper than papers using self-built datasets. There is a clear tendency to cite multiple datasets: the number of papers citing a single dataset is gradually decreasing, and papers citing 2 to 3 datasets have a higher average citation count per paper than papers citing a single dataset. Dataset reuse is relatively low, and highly cited datasets come mainly from evaluation tasks.
