Source: 《中国电子商情·通信市场》 (Chinese-language journal)

Web Data Extraction Based on a Simplified DOM Tree

         

Abstract

Building on web data analysis by DOM tree matching, this paper proposes a DOM tree simplification method based on a whitelist strategy. Following the whitelist matching principle, the method prunes and compresses the nested structure of a web page, so that the resulting tree contains only the content blocks relevant to retrieval. The paper also proposes a web data extraction method that operates on the simplified DOM tree. This method speeds up extraction and shortens the time needed for web data analysis without losing the page's main data. The analysis method is evaluated on pages from e-commerce websites. Experiments show that the extracted data is complete and highly relevant to the topic, in line with expectations.
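The abstract does not give the whitelist or the exact pruning and compression rules, so the following is only a minimal sketch of how whitelist-based simplification might look, using Python's standard `html.parser`. The `WHITELIST` contents, the `Node` class, and the helper names `prune`, `compress`, `simplify`, and `extract_text` are all hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

# Hypothetical whitelist of structural/content tags (an assumption, not the paper's list).
WHITELIST = {"html", "body", "div", "ul", "li", "table", "tr", "td",
             "p", "span", "a", "h1", "h2"}
# Tags that never have a closing tag, so they must not become the "current" node.
VOID_TAGS = {"br", "img", "input", "meta", "link", "hr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # non-empty text fragments directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a simple in-memory tree from HTML source."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.current.text.append(stripped)


def prune(node):
    """Drop subtrees whose tag is not whitelisted, and nodes left empty."""
    kept = []
    for child in node.children:
        if child.tag in WHITELIST:
            simplified = prune(child)
            if simplified is not None:
                kept.append(simplified)
    node.children = kept
    return node if (kept or node.text) else None


def compress(node):
    """Collapse wrapper nodes that have one child and no text of their own."""
    node.children = [compress(child) for child in node.children]
    if len(node.children) == 1 and not node.text:
        return node.children[0]
    return node


def simplify(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    tree = prune(builder.root)
    return compress(tree) if tree is not None else None


def extract_text(node):
    """Collect text fragments from the simplified tree in document order."""
    out = list(node.text)
    for child in node.children:
        out.extend(extract_text(child))
    return out
```

In this sketch, pruning removes non-whitelisted subtrees such as `script`, and compression collapses chains of wrapper elements, so extraction only has to walk the much smaller remaining tree. A real whitelist would likely also consider attributes such as `class` or `id`, which this sketch ignores.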
