Journal: International journal of software engineering and knowledge engineering

INFORMATION EXTRACTION VERSUS TEXT SEGMENTATION FOR WEB CONTENT MINING

Abstract

The information explosion on the Web aggravates the problem of effective information retrieval. Although various approaches in the literature aim to enhance retrieval, they prove insufficient because the actual content of a page is poorly exploited with regard to specific semantic content. This paper extends an existing method for automatic semantic segmentation. The existing method first partitions a web page into blocks based on its visual layout and a set of heuristics. A subsequent step partitions the page further based on the appearance of specific types of named entities, with the help of a machine learning algorithm. Our work extends the initial method in several directions. First, it examines alternative named entities as features in the learning step. Second, it extends the initial corpus. Third, it evaluates and compares the initial method using metrics from text segmentation. Furthermore, the result of text segmentation is incorporated as a feature in the learning process. Finally, two text segmentation algorithms are applied to evaluate the effectiveness of manual annotation. Reported results show that the synergy of semantic-based and text segmentation algorithms strongly depends on the predefined semantic model used for text segmentation.
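The abstract does not name the text segmentation metrics it uses. As an illustration only, the sketch below implements WindowDiff, one metric commonly used to compare a hypothesized segmentation against a reference. The function name `window_diff`, the 0/1 boundary encoding, and the window size `k` are assumptions made for this example, not details taken from the paper. Published variants differ slightly in how they normalize; this version counts every full window of `k` positions.

```python
from typing import Sequence


def window_diff(reference: Sequence[int], hypothesis: Sequence[int], k: int) -> float:
    """Return the fraction of length-k windows whose boundary counts differ.

    Each sequence holds one 0/1 flag per unit (e.g. per block or sentence);
    1 means a segment boundary follows that unit. This encoding is an
    assumption of the sketch.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must have the same length")
    n = len(reference)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= len(reference)")

    windows = n - k + 1  # number of full windows of size k
    errors = 0
    for i in range(windows):
        ref_count = sum(reference[i:i + k])
        hyp_count = sum(hypothesis[i:i + k])
        # A window is penalized once if the two boundary counts disagree.
        if ref_count != hyp_count:
            errors += 1
    return errors / windows


if __name__ == "__main__":
    ref = [0, 0, 1, 0, 0, 0, 1, 0]
    hyp = [0, 0, 0, 1, 0, 0, 1, 0]  # first boundary placed one unit late
    print(window_diff(ref, ref, k=3))  # identical segmentations score 0.0
    print(window_diff(ref, hyp, k=3))
```

A score of 0 means the two segmentations agree in every window; higher values mean more disagreement. A near-miss boundary is penalized only in the windows that contain one boundary but not the other, which is why such metrics are preferred over exact-match precision and recall for segmentation.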
