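| Wang Xiaoya (王小丫), Ma Jianling (马建玲). Research on the Automatic Classification Method of Domain Think Tank Texts Based on DistilBERT [J]. Digital Library Forum (数字图书馆论坛), 2026, 22(2): 1-10 |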
| Research on the Automatic Classification Method of Domain Think Tank Texts Based on DistilBERT |
| Submission date: 2025-10-15 |
| DOI: 10.3772/j.issn.1673-2286.2026.02.001 |
| Keywords: Think Tank; Think Tank Text; Automatic Text Classification; Topic Modeling; DistilBERT; Biomedicine |
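| Funding: This research was supported by the National Science Library, Chinese Academy of Sciences project "Think Tank Resource Assurance Platform Construction and Resource Collection Services" (No. E340090701). |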
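| Author | Affiliation |
| Wang Xiaoya (王小丫) | Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences |
| Ma Jianling (马建玲) | Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences |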
|
| Abstract: |
| The output of foreign think tanks is growing rapidly, and its value as a reference for policy research and intelligence analysis is increasingly evident. Efficiently processing and analyzing this mass of think tank texts has therefore become a core problem for related research and information analysis work, and automatic classification of think tank texts is an essential foundation for improving information retrieval efficiency and supporting subsequent in-depth analysis. This study aims to build a domain-oriented automatic classification method for foreign think tank texts that provides technical support for intelligence research and analysis. To address the scarcity of labeled data for domain think tank texts and the large differences among classification standards, it proposes a hybrid automatic classification framework that combines unsupervised topic modeling with fine-tuning of a lightweight pre-trained language model: first, BERTopic performs unsupervised topic discovery to build a data-driven classification standard; second, DistilBERT, a knowledge-distilled model, undergoes domain-adaptive fine-tuning to classify domain think tank texts efficiently and accurately. Experiments on think tank texts in the biomedical domain show that the method outperforms multiple baseline models in both F1 score (0.7465) and recall (0.7500), confirming its effectiveness for domain think tank text classification. |
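The abstract reports an F1 score and a recall value but does not say how they are averaged across classes. As a minimal sketch, assuming macro averaging over the classes that occur in the gold labels, the two metrics could be computed as follows. The function name `macro_recall_f1`, the topic labels, and the toy predictions are hypothetical and are not taken from the paper.

```python
from collections import Counter


def macro_recall_f1(y_true, y_pred):
    """Return (macro_recall, macro_f1) over the classes present in y_true.

    Assumption: macro averaging; the abstract does not state the averaging mode.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fn[gold] += 1  # the gold class was missed
            fp[pred] += 1  # the predicted class was wrongly assigned

    classes = sorted(set(y_true))
    recalls, f1s = [], []
    for c in classes:
        recall = tp[c] / (tp[c] + fn[c])  # support is > 0 for every gold class
        predicted = tp[c] + fp[c]
        precision = tp[c] / predicted if predicted else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)

    n = len(classes)
    return sum(recalls) / n, sum(f1s) / n


# Hypothetical topic labels and predictions, for illustration only.
y_true = ["drug_rnd", "drug_rnd", "regulation", "regulation", "public_health", "public_health"]
y_pred = ["drug_rnd", "regulation", "regulation", "regulation", "public_health", "drug_rnd"]

recall, f1 = macro_recall_f1(y_true, y_pred)
print(f"macro recall = {recall:.4f}, macro F1 = {f1:.4f}")
```

Under macro averaging every topic counts equally regardless of how many documents it contains, so a small topic that is classified poorly lowers the score as much as a large one would.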