
Cleaning Web Pages for Effective Web Content Mining


Abstract

Classifying and mining noise-free web pages improves the accuracy of search results as well as search speed, and may benefit web-page organization applications (e.g., keyword-based search engines and taxonomic web page categorization applications). Noise on a web page is content irrelevant to the main content being mined, and includes advertisements, navigation bars, and copyright notices. The few existing works on web page cleaning detect noise blocks with exactly matching contents but are weak at detecting near-duplicate blocks, such as navigation bars. This paper proposes a system, WebPageCleaner, that eliminates noise blocks from web pages in order to improve the accuracy and efficiency of web content mining. A vision-based technique is employed to extract blocks from web pages. Relevant blocks are then identified as those with a high importance level, determined by analyzing physical features of the blocks such as block location, the percentage of web links in the block, and the level of similarity of the block's contents to other blocks. Important blocks are exported for web content mining using Naive Bayes text classification. Experiments show that WebPageCleaner leads to more accurate and efficient web page classification results than comparable existing approaches.
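The abstract names three block features (location, link percentage, similarity to other blocks) but does not give the scoring formula. The following is only a minimal sketch of how such a score might be combined, not the WebPageCleaner implementation. The `Block` fields, the weights, the threshold, the use of word-shingle Jaccard similarity for near-duplicate detection, and the choice to compare against blocks from other pages of the same site are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical block record; the field names are illustrative assumptions."""
    text: str        # visible text of the block
    link_chars: int  # number of characters that sit inside hyperlinks
    position: str    # assumed label such as "center", "top", "left"


def shingles(text, k=3):
    """Return the set of k-word shingles of the text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_ratio(block):
    """Fraction of the block's text that is link text; empty blocks count as all links."""
    if not block.text:
        return 1.0
    return min(1.0, block.link_chars / len(block.text))


def importance(block, other_blocks, weights=(0.4, 0.3, 0.3)):
    """Combine the three features into a score in [0, 1] (weights are assumed)."""
    w_pos, w_link, w_sim = weights
    pos_score = 1.0 if block.position == "center" else 0.0
    own = shingles(block.text)
    # Near-duplicate check: highest similarity to any block seen elsewhere.
    max_sim = max((jaccard(own, shingles(o.text)) for o in other_blocks), default=0.0)
    return w_pos * pos_score + w_link * (1.0 - link_ratio(block)) + w_sim * (1.0 - max_sim)


def clean(blocks, other_blocks, threshold=0.6):
    """Keep only blocks whose importance reaches the (assumed) threshold."""
    return [b for b in blocks if importance(b, other_blocks) >= threshold]


nav = Block("Home News Sports Contact About", link_chars=30, position="top")
article = Block("Researchers describe a new method for mining web content.", 0, "center")
other_pages = [Block("Home News Sports Contact About Login", 36, "top")]
kept = clean([nav, article], other_pages)
```

In this toy input the navigation bar differs from the other page's bar by one word, so exact matching would miss it, while the shingle similarity still marks it as a near duplicate and it is dropped. The surviving blocks would then be passed to a text classifier such as Naive Bayes.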
