Journal of Intelligent Learning Systems and Applications

A SOM-Based Document Clustering Using Frequent Max Substrings for Non-Segmented Texts



Abstract

This paper proposes a clustering method for non-segmented documents that combines a self-organizing map (SOM) with the frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for document clustering and has been successful in many applications. However, when it is applied to non-segmented documents, the challenge is to identify interesting patterns efficiently. The proposed method has two main phases: a preprocessing phase and a clustering phase. In the preprocessing phase, the frequent max substring technique is first applied to the non-segmented texts to discover the patterns of interest, called Frequent Max substrings, which are long and frequent substrings rather than individual words. These discovered patterns are then used as indexing terms. The indexing terms, together with their numbers of occurrences, form a document vector. In the clustering phase, SOM generates the document cluster map from the feature vectors of Frequent Max substrings. To demonstrate the proposed technique, experimental studies and comparison results on clustering Thai text documents, which consist of non-segmented texts, are presented. The results show that the proposed technique can be used for Thai texts. The document cluster map generated with the method can be used to find relevant documents more efficiently.
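
The two phases described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. The substring step is a brute-force stand-in that assumes "frequent max" means a substring occurring at least `min_freq` times that is not contained in any longer frequent substring; the paper's actual algorithm and definition may differ. The function names, parameters (`min_freq`, `min_len`, `max_len`, grid size, learning rate) and the toy documents are all hypothetical, and the SOM is a plain textbook online update written without external libraries.

```python
import math
import random
from collections import Counter


def frequent_max_substrings(docs, min_freq=2, min_len=2, max_len=10):
    """Brute-force stand-in (assumption): substrings seen >= min_freq times
    across the corpus, keeping only those not inside a longer frequent one."""
    counts = Counter()
    for doc in docs:
        for n in range(min_len, max_len + 1):
            for i in range(len(doc) - n + 1):
                counts[doc[i:i + n]] += 1
    frequent = {s for s, c in counts.items() if c >= min_freq}
    maximal = [s for s in frequent
               if not any(s != t and s in t for t in frequent)]
    return sorted(maximal)


def count_occurrences(text, term):
    """Count (possibly overlapping) occurrences of term in text."""
    return sum(1 for i in range(len(text) - len(term) + 1)
               if text.startswith(term, i))


def document_vectors(docs, terms):
    """One vector per document: occurrence count of each indexing term."""
    return [[float(count_occurrences(d, t)) for t in terms] for d in docs]


def best_matching_unit(weights, vector):
    """Grid node whose weight vector is closest (squared Euclidean) to vector."""
    return min(weights, key=lambda node: sum(
        (w - x) ** 2 for w, x in zip(weights[node], vector)))


def train_som(vectors, rows=2, cols=2, epochs=50, lr0=0.5, seed=0):
    """Minimal online SOM with a Gaussian neighbourhood on a rows x cols grid."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood radius
        for v in vectors:
            bmu = best_matching_unit(weights, v)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                for k in range(dim):
                    w[k] += lr * h * (v[k] - w[k])
    return weights


# Toy unsegmented "documents" (hypothetical data, no word boundaries).
docs = ["somclustermapsomcluster", "clustermapsom",
        "thaitextindexthaitext", "indexthaitext"]
terms = frequent_max_substrings(docs, min_freq=2, min_len=3, max_len=8)
vectors = document_vectors(docs, terms)
som = train_som(vectors)
for doc, vec in zip(docs, vectors):
    print(doc, "->", best_matching_unit(som, vec))
```

Each document is mapped to the grid node with the closest weight vector, so documents sharing many long frequent substrings tend to land on the same or neighbouring nodes. The brute-force enumeration grows quickly with `max_len` and corpus size, so it is only suitable for illustrating the idea on small inputs.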
