Journal: Journal of Computers

A DOM-based Anchor-Hop-T Method for Web Application Information Extraction

Abstract

To fuse information about electronic products, the widely adopted approach is to extract information from the HTML structure of business websites and then process the data in depth. However, modeling a Web application is difficult because the data in HTML is semi-formal and appears as a DOM (Document Object Model) tree when it is analyzed with an XML schema. How to understand and extract this information is the first question to be studied. The general Anchor-Hop model, which considers the text property and the label property, handles this problem in a simple way and therefore has low effectiveness. The model is sensitive to the HTML structure: if the website structure changes even slightly, extraction accuracy suffers, and the extraction rules must be redefined for the changed structure. To improve extraction efficiency, this paper proposes Anchor-Hop-T, a DOM-based dynamic information extraction model. HTML tags including table, ol and ul can be searched and processed using XPath, which makes it convenient to extract the corresponding Anchor data block. Furthermore, the location of the Hop point is treated as invariant, and on that basis the new model, built on the Anchor and Hop points, introduces further concepts for information extraction, such as the Anchor data block, the Anchor locating library and the AH relevance value. Finally, we present an experiment to demonstrate the applicability of our approach.
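
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of using XPath to search table, ol and ul elements for the block that contains a known anchor text. The sample page, the function names `find_anchor_blocks` and `value_after_anchor`, and the choice to read the cell next to the anchor are all assumptions made for illustration. They are not the paper's definitions of the Anchor data block or the Hop point.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (not taken from the paper).
PAGE = """
<html><body>
  <div><ul><li>Home</li><li>Products</li></ul></div>
  <div>
    <table>
      <tr><td>Model</td><td>X100</td></tr>
      <tr><td>Price</td><td>199</td></tr>
    </table>
  </div>
</body></html>
"""

# The three container tags named in the abstract.
BLOCK_TAGS = ("table", "ol", "ul")


def find_anchor_blocks(root, anchor_text):
    """Return every table/ol/ul element whose text contains anchor_text."""
    blocks = []
    for tag in BLOCK_TAGS:
        # ElementTree supports a limited XPath subset; ".//tag" matches
        # all descendants with that tag name.
        for element in root.findall(f".//{tag}"):
            text = " ".join(t.strip() for t in element.itertext() if t.strip())
            if anchor_text in text:
                blocks.append(element)
    return blocks


def value_after_anchor(table, anchor_text):
    """Assumed simplification: return the cell that follows the anchor cell."""
    for row in table.findall(".//tr"):
        cells = ["".join(td.itertext()).strip() for td in row.findall("td")]
        if anchor_text in cells:
            index = cells.index(anchor_text)
            if index + 1 < len(cells):
                return cells[index + 1]
    return None


root = ET.fromstring(PAGE.strip())
blocks = find_anchor_blocks(root, "Price")
print([block.tag for block in blocks])
print(value_after_anchor(blocks[0], "Price"))
```

Because the lookup is keyed on the anchor text rather than on a fixed path from the document root, wrapping the table in another `div` would not change the result in this sketch. Real-world HTML is often not well-formed XML, so a production version would need a tolerant HTML parser in place of `xml.etree.ElementTree`.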