Cleaning Web Pages for Effective Web Content Mining


Abstract

Classifying and mining noise-free web pages will improve the accuracy of search results as well as search speed, and may benefit web-page organization applications (e.g., keyword-based search engines and taxonomic web page categorization applications). Noise on web pages is content irrelevant to the main content of the pages being mined, and includes advertisements, navigation bars, and copyright notices. The few existing works on web page cleaning detect noise blocks with exactly matching contents but are weak at detecting near-duplicate blocks, such as navigation bars. This paper proposes a system, WebPageCleaner, for eliminating noise blocks from web pages in order to improve the accuracy and efficiency of web content mining. A vision-based technique is employed for extracting blocks from web pages. Relevant blocks are then identified as those with a high importance level by analyzing physical features of the blocks such as block location, the percentage of web links in the block, and the level of similarity of the block's contents to other blocks. Important blocks are exported for web content mining using Naive Bayes text classification. Experiments show that WebPageCleaner leads to more accurate and efficient web page classification results than comparable existing approaches.
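
The abstract names three block features (location, link percentage, similarity to other blocks) but does not give the scoring formula. The following is only a minimal sketch of how such features might be combined, not the paper's method: the `Block` type, the position weights, the Jaccard word-overlap measure, the multiplicative score, and the 0.3 threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Hypothetical position weights: central blocks are assumed more likely to be content.
POSITION_WEIGHT = {"center": 1.0, "left": 0.4, "right": 0.4, "top": 0.3, "bottom": 0.2}


@dataclass
class Block:
    text: str        # visible text of the block
    link_chars: int  # number of characters that sit inside hyperlinks
    position: str    # coarse location label, e.g. "center" or "top"


def jaccard(a: str, b: str) -> float:
    """Word-set overlap in [0, 1]; a stand-in for any text similarity measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def importance(block: Block, other_blocks: list[str]) -> float:
    """Illustrative score: high for central, link-poor blocks unlike the other blocks."""
    link_ratio = min(block.link_chars / max(len(block.text), 1), 1.0)
    # Near-duplicates (e.g. a navigation bar repeated with small changes) score high here.
    max_sim = max((jaccard(block.text, o) for o in other_blocks), default=0.0)
    return POSITION_WEIGHT.get(block.position, 0.5) * (1.0 - link_ratio) * (1.0 - max_sim)


def clean(blocks: list[Block], other_blocks: list[str], threshold: float = 0.3) -> list[Block]:
    """Keep only the blocks whose importance reaches the (assumed) threshold."""
    return [b for b in blocks if importance(b, other_blocks) >= threshold]
```

In this sketch a navigation bar made up entirely of links scores 0 regardless of its position, while a central text block with no links and no overlap with the other blocks scores 1. The surviving blocks would then be passed on to a text classifier.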


Bibliographic details

Source
Database and Expert Systems Applications; Lecture Notes in Computer Science; 4080 | 2006 | pp. 560-571 | 12 pages
Conference location: Krakow (PL)
Authors
Jing Li; C.I. Ezeife;
Author affiliation

School of Computer Science, University of Windsor, Windsor, Ontario, Canada N9B 3P4;

Original format: PDF
Text language: English
Chinese Library Classification: TP311.13;
Keywords
web page cleaning; noise block; web content mining; classification; near-duplicate; text similarity;


Similar literature

1. Dynamic user profiles using fusion of Web Structure, Web content and Web Usage Mining [J]. Prof. Gajendra S.Chandel, Prof. Ravindra Gupta, Mr. Hemant k. Dhamecha. International Journal of Engineering Research and Applications. 2012, Issue 3
2. An Effective Fuzzy Clustering Algorithm for Web Document Classification: A Case Study in Cultural Content Mining [J]. George E. Tsekouras, Damianos Gavalas. International Journal of Software Engineering and Knowledge Engineering. 2013, Issue 6
3. Cleaning Web Pages for Effective Web Content Mining [C]. Jing Li, C.I. Ezeife. Database and Expert Systems Applications; Lecture Notes in Computer Science; 4080. 2006
4. Cleaning Web pages for effective Web content mining. [D]. Li, Jing. 2006
5. AHCODA-DB: a data repository with web-based mining tools for the analysis of automated high-content mouse phenomics data [O]. Bastijn Koopmans, August B. Smit, Matthijs Verhage, 2017
6. An Efficient Method of Web Page Noise Cleaning for Effective Web Mining [O]. S. S., B. V. 2016
