Journal: 《计算机技术与发展》 (Computer Technology and Development)

Implementation of Academic Paper Information Extraction Based on PDFBox

Abstract

To study academic dynamics, hot topics, and development trends, data mining must be applied to academic research papers, and the first step is to extract the information of interest from a massive number of papers. Because most academic papers are currently distributed in PDF format, this work focuses on the PDF file format and the various techniques for manipulating it, and uses the open-source library PDFBox to extract information from PDF-format academic papers according to rules. The extracted information mainly includes each paper's title, authors, affiliations, keywords, publication date, and abstract. Finally, the accuracy of the extracted information is measured; the results support big-data research on academic literature.
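The abstract does not list the extraction rules themselves, so the following is only a minimal sketch of what rule-based field extraction could look like once the text has been pulled out of the PDF. It assumes the plain text was already obtained with PDFBox (for example via `PDFTextStripper.getText(PDDocument)`), and the class name `PaperFieldExtractor`, the field names, and the three rules (first non-blank line as title, an "Abstract" marker, a "Keywords" line) are hypothetical illustrations, not the authors' rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: applies simple text rules to the plain text of a paper.
// Assumption: the text was produced beforehand, e.g. by PDFBox's
// PDFTextStripper.getText(PDDocument); no PDF parsing happens here.
class PaperFieldExtractor {

    // Assumed rule: the abstract runs from an "Abstract" marker to the keywords line.
    private static final Pattern ABSTRACT = Pattern.compile(
            "(?is)\\bAbstract\\s*[:：]?\\s*(.+?)(?=\\bKey\\s?words\\b|\\z)");

    // Assumed rule: keywords sit on one line starting with "Keywords" or "Key words".
    private static final Pattern KEYWORDS = Pattern.compile(
            "(?im)^\\s*Key\\s?words\\s*[:：]?\\s*(.+)$");

    static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();

        // Assumed rule: the first non-blank line is the title.
        for (String line : text.split("\\R")) {
            if (!line.trim().isEmpty()) {
                fields.put("title", line.trim());
                break;
            }
        }

        Matcher a = ABSTRACT.matcher(text);
        if (a.find()) {
            // Collapse line breaks left over from the page layout.
            fields.put("abstract", a.group(1).trim().replaceAll("\\s+", " "));
        }

        Matcher k = KEYWORDS.matcher(text);
        if (k.find()) {
            fields.put("keywords", k.group(1).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "A Sample Title\nAlice, Bob\n"
                + "Abstract: This paper studies\nPDF extraction.\n"
                + "Keywords: PDF; data mining\n";
        System.out.println(extract(sample));
    }
}
```

Real papers vary widely in layout, so rules like these would need per-journal tuning, which is presumably why the work reports an extraction accuracy rather than assuming the rules always succeed.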
