...
Journal: Expert Systems with Applications

A text mining approach for automatic construction of hypertexts



Abstract

Research on automatic hypertext construction has grown rapidly in the last decade because there is an urgent need to translate the gigantic amount of legacy documents into web pages. Unlike traditional 'flat' texts, a hypertext contains a number of navigational hyperlinks that point to related hypertexts or to locations within the same hypertext. Traditionally, these hyperlinks were constructed by the creators of the web pages, with or without the help of authoring tools. However, the gigantic amount of documents produced each day makes such manual construction impractical. Thus an automatic hypertext construction method is necessary for content providers to efficiently produce adequate information that can be used by web surfers. Although most web pages contain non-textual data such as images, sounds, and video clips, text data still contribute the major part of the information about the pages. Therefore, it is not surprising that most automatic hypertext construction methods inherit from traditional information retrieval research. In this work, we propose a new automatic hypertext construction method based on a text mining approach. Our method applies the self-organizing map algorithm to cluster some flat text documents in a training corpus and generate two maps. We then use these maps to identify the sources and destinations of some important hyperlinks within these training documents. The constructed hyperlinks are then inserted into the training documents to translate them into hypertext form. Such translated documents form the new corpus. Incoming documents can also be translated into hypertext form and added to the corpus through the same approach. Our method has been tested on a set of flat text documents collected from a newswire site. Although we only use Chinese text documents, our approach can be applied to any documents that can be transformed into a set of index terms.
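
The clustering step mentioned in the abstract can be illustrated with a minimal self-organizing map sketch in Python. This is not the authors' implementation: the function names `train_som` and `assign_clusters`, the toy binary index-term vectors, the 2x2 map size, and the linear decay schedules are all assumptions made for illustration. The abstract does not say how the two maps or the hyperlinks are derived from the trained map, so the sketch stops at assigning each document to a map neuron.

```python
import numpy as np

def train_som(docs, rows=2, cols=2, epochs=50, lr0=0.5, seed=0):
    """Train a small self-organizing map on document vectors (one row per document).

    Hypothetical helper: hyperparameters are illustrative, not from the paper.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = docs.shape
    weights = rng.random((rows * cols, n_terms))
    # Grid coordinates of each neuron, used to measure neighborhood distance.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    sigma0 = max(rows, cols) / 2.0
    total_steps = epochs * n_docs
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(n_docs):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                   # learning rate decays linearly
            sigma = max(sigma0 * (1.0 - frac), 0.5)   # neighborhood radius shrinks, floored at 0.5
            x = docs[i]
            # Best-matching unit: the neuron whose weight vector is closest to x.
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            # Gaussian neighborhood around the best-matching unit on the grid.
            d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def assign_clusters(docs, weights):
    """Return, for each document, the index of its nearest neuron."""
    d = ((docs[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

# Toy corpus (assumed): 6 documents over 6 index terms, 1 = term present.
docs = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
], dtype=float)

weights = train_som(docs)
labels = assign_clusters(docs, weights)
print(labels)  # neuron index per document
```

Documents that land on the same or neighboring neurons are candidates for being related, which is the kind of information a link-construction step could build on.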
