
Rule Mining in Textual Data Using Passages


Abstract

As interest in and the need for Knowledge Discovery and Data Mining (KDD) in texts increase, applying association rule mining, a successful standard KDD method, to texts has attracted great attention. Contrary to expectations, however, most of this work has yielded only syntactic rules or word collocations, which are unsatisfying in the context of KDD, where the objective is to extract previously unknown, useful information. One reason for these disappointing results may be that most previous work processes texts on a syntactic basis. For example, past work used, as items and transactions respectively, words and documents, words and windows, terms and documents, and words and passages (segments of text). Here we propose using passages as items and documents as transactions. According to [5], breaking long texts down into passages improves the results of information retrieval. This result suggests that passages are a good indication of users' interests. We follow and extend this view, taking passages as an indication of the topics in a document. Our goal is to find associations between topics in documents rather than associations between words. The key issue in using passages is how to compare them, since a passage usually consists of a set of words. Because the number and frequency of words differ from passage to passage, passages cannot be compared directly; they must be converted to some other processable representation. In this paper we propose a representation of passages and discuss a way to compare passages that supports soft matching.
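To make the idea of "passages as items, documents as transactions" concrete, the following is a minimal sketch under stated assumptions, not the representation or matching method proposed in the paper, which the abstract does not specify. It assumes a bag-of-words term-frequency vector per passage, cosine similarity with a hypothetical `threshold` as the soft match, and a simple greedy assignment of passages to topic ids. The function names (`passage_vector`, `assign_topics`, `pair_rules`) and all parameter values are illustrative only.

```python
import math
from collections import Counter
from itertools import combinations


def passage_vector(passage):
    """Assumed representation: term-frequency vector of a passage."""
    return Counter(passage.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_topics(documents, threshold=0.5):
    """Turn each document (a list of passages) into a set of topic ids.

    A passage soft-matches an existing topic when its cosine similarity
    to that topic's first passage reaches `threshold`; otherwise it
    starts a new topic. This greedy scheme is an assumption.
    """
    prototypes = []    # one representative vector per topic id
    transactions = []  # one set of topic ids per document
    for doc in documents:
        items = set()
        for passage in doc:
            vec = passage_vector(passage)
            for topic_id, proto in enumerate(prototypes):
                if cosine(vec, proto) >= threshold:
                    items.add(topic_id)
                    break
            else:
                prototypes.append(vec)
                items.add(len(prototypes) - 1)
        transactions.append(items)
    return transactions


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for topic pairs."""
    n = len(transactions)
    item_count = Counter(i for t in transactions for i in t)
    pair_count = Counter(
        p for t in transactions for p in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules
```

Calling `assign_topics` on a list of documents, each given as a list of passage strings, yields one transaction per document, and `pair_rules` then reports which topics tend to co-occur. Passages that share most of their vocabulary fall into the same topic even when their wording differs, which is the role soft matching plays here.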
