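Article Abstract
Peng Peng, Xu Hongjiao. Intelligent Text Classification Method for Open-Source Technology Intelligence Analysis[J]. Digital Library Forum, 2025, 21(2): 65-72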
Intelligent Text Classification Method for Open-Source Technology Intelligence Analysis
Submitted: 2025-01-23
DOI:10.3772/j.issn.1673-2286.2025.02.007
Keywords: Open-Source Technology Intelligence; Text Classification; Information Filtering; Pre-Trained Language Model
Funding:
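Author affiliations
Peng Peng, Institute of Scientific and Technical Information of China
Xu Hongjiao, Institute of Scientific and Technical Information of China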
Abstract:
      With the explosive growth of online information, identifying valuable technology intelligence in massive volumes of web text and classifying it automatically has become central to open-source technology intelligence analysis. Targeting the characteristics of open-source technology intelligence texts, this paper builds an integrated model for intelligent text denoising and classification. An automatic annotation method that combines a large language model with prompt engineering labels both the noise data and the text classification data. A pre-trained language model then identifies and filters out noise, that is, texts that are not technology intelligence. For classification, the method uses a multilingual pre-trained model and distillation techniques together with an improved loss function design to address imbalanced class distribution and insufficient data, improving to some extent the accuracy and stability of multi-label technology intelligence text classification. Experimental results show that, compared with the TextCNN and BERT methods, the proposed method achieves higher classification performance and better robustness and adaptability.
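The abstract mentions an improved loss function for imbalanced multi-label classification but does not say which modification was used. As a rough illustration only, the sketch below shows one common option, a positively weighted binary cross-entropy, in plain Python. The function names `pos_weights` and `weighted_bce` and the negatives-to-positives weighting heuristic are assumptions made for this example, not the authors' method.

```python
import math


def pos_weights(label_matrix):
    """Per-label weight = (#negatives / #positives).

    Assumed heuristic: rare labels get a larger weight on their positive term.
    label_matrix is a list of rows, each row a list of 0/1 label indicators.
    """
    n = len(label_matrix)
    num_labels = len(label_matrix[0])
    weights = []
    for j in range(num_labels):
        pos = sum(row[j] for row in label_matrix)
        # Fall back to 1.0 when a label has no positive examples.
        weights.append((n - pos) / pos if pos > 0 else 1.0)
    return weights


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def weighted_bce(logits, targets, weights):
    """Mean weighted binary cross-entropy over the labels of one sample."""
    total = 0.0
    for z, y, w in zip(logits, targets, weights):
        # Clamp the probability so log() never receives 0.
        p = min(max(sigmoid(z), 1e-12), 1.0 - 1e-12)
        total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


if __name__ == "__main__":
    # Toy data: label 0 is common, label 1 is rare.
    labels = [[1, 0], [1, 0], [1, 1], [0, 0]]
    w = pos_weights(labels)
    print("weights:", w)
    print("loss:", weighted_bce([0.0, 0.0], [0, 1], w))
```

With this weighting, missing a positive for a rare label is penalized more than missing one for a common label, which is one way to counter uneven class distribution. Other choices, such as focal loss, would serve the same purpose.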