Published in: International Journal of Librarianship

Using NLP to Generate MARC Summary Fields for Notre Dame’s Catholic Pamphlets


Abstract

Three automated NLP (natural language processing) summarization techniques were tested on a special collection of Catholic pamphlets acquired by Hesburgh Libraries. The pamphlets were fed as .pdf files into an OCR pipeline, and the automated summaries were generated from the extracted text. Extensive data cleaning and text preprocessing were necessary before the summarization algorithms could be run. Measured by the standard ROUGE F1 score, the BERT Extractive Summarizer performed best, most closely matching the human reference summaries with an average ROUGE F1 of 0.239. The Gensim Python package's implementation of TextRank scored 0.151, and a hand-implemented TextRank algorithm scored 0.144. This article covers the implementation of automated pipelines to read PDF text, the strengths and weaknesses of automated summarization techniques, and what the successes and failures of these summaries mean for their potential use in Hesburgh Libraries.
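
To make the ROUGE F1 figures above easier to interpret, the following is a minimal sketch of how a unigram-overlap (ROUGE-1) F1 score between a machine summary and a human reference summary could be computed. The abstract does not say which ROUGE variant or tooling the authors used, so the choice of ROUGE-1, the simple regex tokenizer, and the function names `tokenize` and `rouge1_f1` are all assumptions made for illustration, not the article's implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens.

    Assumption: a plain regex tokenizer; real ROUGE tooling may also
    apply stemming or stop-word handling.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(candidate, reference):
    """Return a ROUGE-1 style F1 score (hypothetical helper).

    Precision is the share of candidate unigrams found in the reference,
    recall is the share of reference unigrams found in the candidate,
    and F1 is their harmonic mean. Counts are clipped, so a repeated
    word is credited at most as often as it appears in the other text.
    """
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count for each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a score such as 0.239 would mean that only a modest share of the words in the automated and human summaries coincide, which is plausible when an extractive summarizer can only reuse sentences from the OCR text while a human cataloger paraphrases freely.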
