Article Abstract
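Wang Lijun, Zhao Ziyan, Ma Li, Jiang Huichao, Zhang Ran. Triplet Extraction Method for Green and Low-Carbon Field Based on Large Language Models [J]. Digital Library Forum, 2025, 21(1): 46-54, 66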
Triplet Extraction Method for Green and Low-Carbon Field Based on Large Language Models
Submitted: 2024-10-16
DOI: 10.3772/j.issn.1673-2286.2025.01.006
Keywords: Triplet Extraction; Knowledge Graph; Large Language Model; Green and Low-Carbon
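Funding: This research was supported by the State Grid Corporation of China Headquarters Science and Technology Project "Research on Evaluation Methods and Patent Standardization of Key Supporting Technologies for Green and Low-Carbon Electric Power under the 'Dual Carbon' Goals" (No. 1400-202340338A-1-1-ZN).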
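Author  Affiliation
Wang Lijun  State Grid Information & Telecommunication Branch
Zhao Ziyan  State Grid Information & Telecommunication Branch
Ma Li  State Grid Information & Telecommunication Branch
Jiang Huichao  State Grid Information & Telecommunication Branch
Zhang Ran  State Grid Information & Telecommunication Branch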
Abstract:
      Triplet extraction identifies entities and the relations between them in text to form structured knowledge representations, and is a key technology for automated knowledge graph construction. Although conventional deep learning-based triplet extraction methods perform well when sufficient training data is available, their accuracy is limited in vertical scenarios such as the green and low-carbon domain of the power industry, where standardized supervised data is lacking, manual annotation is costly, and papers and patents contain many specialized terms. To address these issues, this paper proposes a triplet extraction method based on large language models: a closed-source large model annotates a small amount of high-quality supervised data, and retrieval-augmented techniques then guide an open-source model in performing extraction, yielding high-quality, automated extraction in a vertical domain. In addition, to improve extraction efficiency and precision in few-shot scenarios, the method includes data routing and complex-data segmentation modules, which route data according to extraction difficulty and further split complex data to simplify extraction. To evaluate the method, a dataset of power-domain patents and papers is automatically annotated with GPT-4, and well-known closed-source and open-source large models such as ChatGPT and ChatGLM are used as baselines. The experimental results show that the proposed method achieves better extraction performance.
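
The abstract gives no implementation details, so the following is only a minimal sketch of how difficulty-based routing, segmentation of complex inputs, and retrieval-augmented few-shot prompting could fit together. Every name here (`Example`, `is_complex`, `split_complex`, `retrieve`, `build_prompt`, `plan_extraction`), the length and clause thresholds, the Jaccard token-overlap retriever, and the prompt wording are assumptions made for illustration and are not taken from the paper. The call to an actual language model is deliberately left out.

```python
import re
from dataclasses import dataclass


@dataclass
class Example:
    """An annotated demonstration: source text plus its triplets (hypothetical format)."""
    text: str
    triplets: list[tuple[str, str, str]]


def tokenize(text: str) -> set[str]:
    # Lowercased word set; a real system would use a domain-aware tokenizer.
    return set(re.findall(r"\w+", text.lower()))


def is_complex(text: str, max_tokens: int = 25, max_clauses: int = 3) -> bool:
    # Assumed difficulty heuristic: long inputs or inputs with many clauses.
    n_tokens = len(re.findall(r"\w+", text))
    n_clauses = len(re.findall(r"[;,]", text)) + 1
    return n_tokens > max_tokens or n_clauses > max_clauses


def split_complex(text: str) -> list[str]:
    # Assumed segmentation rule: split on semicolons and drop empty parts.
    return [part.strip() for part in text.split(";") if part.strip()]


def retrieve(query: str, pool: list[Example], k: int = 2) -> list[Example]:
    # Rank annotated examples by Jaccard token overlap with the query.
    q = tokenize(query)

    def jaccard(ex: Example) -> float:
        t = tokenize(ex.text)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(pool, key=jaccard, reverse=True)[:k]


def build_prompt(text: str, demos: list[Example]) -> str:
    # Few-shot prompt: retrieved demonstrations first, then the target text.
    lines = ["Extract (head, relation, tail) triplets from the text."]
    for ex in demos:
        lines.append(f"Text: {ex.text}")
        lines.append(f"Triplets: {ex.triplets}")
    lines.append(f"Text: {text}")
    lines.append("Triplets:")
    return "\n".join(lines)


def plan_extraction(text: str, pool: list[Example], k: int = 2) -> list[str]:
    # Route by difficulty: simple inputs yield one prompt, complex ones one per segment.
    segments = split_complex(text) if is_complex(text) else [text]
    return [build_prompt(seg, retrieve(seg, pool, k)) for seg in segments]
```

Each prompt returned by `plan_extraction` would then be sent to an open-source model, with the pool of `Example` objects standing in for the small set of data annotated by the closed-source model.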