
A methodology to improve the performance of extracting information from financial documents.

Abstract

Information Extraction (IE) technology retrieves the most relevant, context-sensitive, and specific pieces of information from unstructured documents and presents them in a structured format. The IE problem is difficult for several reasons. First, the items to be retrieved have no clear boundaries. Second, information retrieval techniques that rely on a bag of words and word statistics may not suffice to retrieve most of the relevant information, because context is missing. Third, statistical techniques such as the Naive Bayes classifier or Average Mutual Information perform well on document retrieval tasks but are not directly applicable to IE tasks.

This study proposes an IE methodology that aims to extract financial information about various NASDAQ-listed companies with high precision and recall. Performance is improved partly by using a rule-based symbolic-learning model, in which a set of rules is learned by the simplest form of the Tabu search algorithm. The results show that applying the Tabu search algorithm with part-of-speech tags improves precision and recall over other methods and resources. The output of the learned model is further analyzed by a statistical method called "Max-Strength" to improve the precision of the items extracted by the symbolic learning model. The strength of the methodology is evidenced by its performance on the "Seminar Announcement" corpus, which has been used by several well-known systems.

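Bibliographic details

  • Author

    Sheikh, Mahmudul Islam

  • Author affiliation

    The University of Mississippi

  • Degree-granting institution: The University of Mississippi
  • Discipline: Business Administration Management; Computer Science
  • Degree: Ph.D.
  • Year: 2009
  • Pagination: 149 p.
  • Total pages: 149
  • Original format: PDF
  • Language of text: eng
  • Chinese Library Classification: Trade economics; automation technology, computer technology
  • Keywords

  • Date indexed: 2022-08-17 11:38:31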
