Article Abstract
Wang Wenjin, Li Yunhao, Zhang Yin. Empirical Research on Instruction Design and Fine-Tuning Methods for Large Language Models in Document Image Intelligent Question Answering [J]. Digital Library Forum, 2025, 21(1): 11-21, 32.
Empirical Research on Instruction Design and Fine-Tuning Methods for Large Language Models in Document Image Intelligent Question Answering
Submitted: 2024-09-30
DOI: 10.3772/j.issn.1673-2286.2025.01.003
Keywords: Document Image; Intelligent Question Answering; Large Language Model; Prompt Learning; Instruction Tuning
Funding: This research was supported by the Key Program of the Zhejiang Provincial Natural Science Foundation, "Research on Low-Resource Cross-Lingual Multimodal Content Representation Learning" (No. LZ23F020009), and the General Program of the National Natural Science Foundation of China, "Research on Multi-Turn Machine Reading Comprehension and Continual Learning for Multimodal Knowledge Search" (No. 62072399).
Author Affiliations
Wang Wenjin  College of Computer Science and Technology, Zhejiang University; Engineering Research Center of Digital Library, Ministry of Education
Li Yunhao  College of Computer Science and Technology, Zhejiang University; Engineering Research Center of Digital Library, Ministry of Education
Zhang Yin  College of Computer Science and Technology, Zhejiang University; Engineering Research Center of Digital Library, Ministry of Education
Chinese Abstract (translated):
      Document image intelligent question answering is one of the key technologies for building intelligent digital libraries. Approaches based on multimodal pre-trained models can effectively fuse textual, visual, and layout information, but they usually require targeted fine-tuning, which is costly and cannot be applied in data-scarce scenarios. Large language models such as ChatGPT have strong zero-shot learning ability and perform well on various downstream tasks without targeted fine-tuning, but they can only process plain-text instructions and cannot directly handle document images. This paper therefore proposes using spaces and line breaks to simulate the relative positional relationships between texts in a document image, generating layout-aware text, and constructing task-specific instruction templates that guide a large language model, via text instructions, to produce answers meeting the task requirements. Experiments show that this layout- and task-aware instruction design and fine-tuning method significantly improves the zero-shot document image question answering performance of multiple large language models; the best combination achieves zero-shot average normalized Levenshtein similarity (ANLS) scores of 0.8651, 0.5451, and 0.6129 on the DocVQA, InfographicVQA, and MP-DocVQA benchmarks, respectively, matching or even surpassing the fully fine-tuned performance of layout-aware pre-trained models. The method has also been applied to intelligent question answering over scanned Republican-era periodicals in the China Academic Digital Associative Library (CADAL), improving readers' efficiency in locating answers within scanned documents.
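The layout-aware text generation described in the abstract can be sketched as follows: given OCR words with pixel coordinates, whitespace reproduces their two-dimensional arrangement so a text-only large language model can "see" the layout. The input format, grid granularity (`CHAR_W`, `LINE_H`), and function name are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of layout-aware text: map OCR word boxes onto a character grid,
# using spaces for horizontal gaps and blank lines for vertical gaps.
# CHAR_W / LINE_H are assumed pixel sizes of one character cell / text line.

CHAR_W = 10   # assumed horizontal pixels per character column
LINE_H = 20   # assumed vertical pixels per text row

def layout_aware_text(words):
    """words: list of (text, x, y), with x, y the word's top-left pixel coords."""
    rows = {}
    for text, x, y in words:
        rows.setdefault(y // LINE_H, []).append((x, text))

    out, prev_row = [], None
    for row in sorted(rows):
        if prev_row is not None:
            # one newline per skipped row approximates vertical spacing
            out.append("\n" * (row - prev_row))
        line, cursor = "", 0
        for x, text in sorted(rows[row]):
            col = x // CHAR_W
            # pad with spaces up to the word's column (at least one separator)
            line += " " * max(col - cursor, 1 if cursor else 0) + text
            cursor = len(line)
        out.append(line)
        prev_row = row
    return "".join(out)
```

Words on the same visual line end up on the same text line separated by proportional runs of spaces, while vertical gaps become blank lines, which is the positional cue the abstract says the model exploits.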
English Abstract:
      Document image intelligent question answering is one of the key technologies for realizing intelligent digital libraries. While document image intelligent question answering based on multimodal pre-trained models can effectively integrate textual, visual, and layout information, such approaches typically require targeted fine-tuning, which incurs high costs and cannot be applied in data-scarce scenarios. Large language models, exemplified by ChatGPT, excel at zero-shot learning, achieving impressive performance on various downstream tasks without task-specific fine-tuning. However, these large language models are limited to processing pure text instructions and cannot directly handle document images. To address this, we propose a novel approach that utilizes spaces and line breaks to simulate the relative positional relationships between texts in document images, thereby generating layout-aware text. Additionally, we construct instruction templates tailored to different tasks, using text-based instructions to guide large language models to generate answers that align with task requirements. Our experiments demonstrate that this layout- and task-aware instruction design and fine-tuning approach significantly improves the zero-shot question answering performance of large language models on document images. The optimal combination of our approach with a large language model achieves zero-shot ANLS scores of 0.8651, 0.5451, and 0.6129 on the DocVQA, InfographicVQA, and MP-DocVQA datasets, respectively. These results are comparable to or even surpass those of fully fine-tuned layout-aware pre-trained models. The proposed approach has also been applied to an intelligent Q&A system for scanned Republican-era periodicals in the China Academic Digital Associative Library (CADAL), significantly improving readers' efficiency in locating desired answers within scanned documents.
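The task-aware instruction templates can likewise be sketched: a per-task template wraps the layout-aware document text and the question into a single plain-text prompt for the model. The template wording, task keys, and function name below are hypothetical illustrations; the paper's actual templates may differ.

```python
# Sketch of task-aware instruction templates: each task gets its own
# prompt wording, and the layout-aware text is spliced in as plain text.
# Template phrasing here is an assumption, not the paper's exact prompts.

TEMPLATES = {
    "docvqa": (
        "You are given the text of a document page; spacing and blank lines "
        "reflect the original layout.\n\n{document}\n\n"
        "Answer with a short span extracted from the document.\n"
        "Question: {question}\nAnswer:"
    ),
    "infographicvqa": (
        "Below is text extracted from an infographic, with layout preserved "
        "via whitespace.\n\n{document}\n\n"
        "Question: {question}\nAnswer:"
    ),
}

def build_instruction(task, document, question):
    """Fill the task's template with layout-aware text and the question."""
    return TEMPLATES[task].format(document=document, question=question)
```

The resulting string is what would be sent to a text-only model such as ChatGPT, so the entire method reduces to prompt construction with no change to the model itself.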