Lin Xin, Yu Huajuan, Yan Yizhen. Research on Cell Semantic Relation Recognition in Complex Table Digitization[J]. Digital Library Forum, 2022, (9): 28-35 |
Research on Cell Semantic Relation Recognition in Complex Table Digitization |
Submitted: 2022-08-27 |
DOI:10.3772/j.issn.1673-2286.2022.09.004 |
Keywords: Complex Table; Semantic Relationship; Table Digitization; Machine Vision |
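Funding: This research was supported by the National Social Science Fund of China Youth Project "Research on Knowledge Annotation Based on User Cognitive Structure in Social Networks" (No. 17CTQ024). |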
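Author | Affiliation |
Lin Xin (林鑫) | School of Information Management, Central China Normal University; Hubei Data Governance and Intelligent Decision Research Center |
Yu Huajuan (余华娟) | School of Information Management, Central China Normal University |
Yan Yizhen (闫奕臻) | School of Information Management, Central China Normal University |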
Abstract: |
Complex tables describe data in a simple, intuitive way and are widely used across industries. However, their structures are intricate, their cell types are diverse, and table documents are composed in inconsistent ways, so they must be converted into structured data before they can be shared and reused. This paper therefore builds a cell semantic relation recognition model based on unsupervised learning to digitize complex tables. It first segments complex tables using machine vision, then identifies tables that share a template based on the similarity of their structure and content. On that basis, it sets heuristic rules that use the values and positions of three cell types (header cells, explanatory cells, and body cells) to recognize the semantic relations between cells. An empirical study shows that the method achieves high accuracy and recall in complex table digitization and is feasible. |
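The same-template identification step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measures, weights, or thresholds the authors use, so everything below is an assumption made for illustration: the `Cell` record, Jaccard similarity over cell layouts and cell texts, the equal weighting `w_structure=0.5`, and the cut-off `threshold=0.6` are hypothetical choices and not the paper's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One table cell after segmentation (hypothetical representation)."""
    row: int
    col: int
    rowspan: int
    colspan: int
    text: str


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def structure_similarity(t1: list, t2: list) -> float:
    """Compare layouts only: position and span of every cell, ignoring text."""
    layout1 = {(c.row, c.col, c.rowspan, c.colspan) for c in t1}
    layout2 = {(c.row, c.col, c.rowspan, c.colspan) for c in t2}
    return jaccard(layout1, layout2)


def content_similarity(t1: list, t2: list) -> float:
    """Compare the sets of non-empty cell texts, ignoring position."""
    texts1 = {c.text for c in t1 if c.text}
    texts2 = {c.text for c in t2 if c.text}
    return jaccard(texts1, texts2)


def same_template(t1: list, t2: list,
                  w_structure: float = 0.5,
                  threshold: float = 0.6) -> bool:
    """Assumed decision rule: weighted sum of both similarities vs. a threshold."""
    score = (w_structure * structure_similarity(t1, t2)
             + (1.0 - w_structure) * content_similarity(t1, t2))
    return score >= threshold


# Two filled-in copies of one form, plus an unrelated grid table.
form_a = [Cell(0, 0, 1, 2, "Applicant information"),
          Cell(1, 0, 1, 1, "Name"), Cell(1, 1, 1, 1, "Li Lei")]
form_b = [Cell(0, 0, 1, 2, "Applicant information"),
          Cell(1, 0, 1, 1, "Name"), Cell(1, 1, 1, 1, "Han Mei")]
grid_c = [Cell(0, 0, 1, 1, "Year"), Cell(0, 1, 1, 1, "Revenue"),
          Cell(1, 0, 1, 1, "2021"), Cell(1, 1, 1, 1, "100")]

print(same_template(form_a, form_b))
print(same_template(form_a, grid_c))
```

In this sketch the two forms share their full layout and their label texts, so they group together even though the filled-in values differ, while the grid table shares little layout and no text with them. Once tables are grouped by template, cells whose text is constant across the group are natural candidates for header or explanatory cells, and cells whose text varies are candidates for body cells; that reading is likewise an interpretation of the abstract, not a rule stated in it.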