Journal: Fuzzy Systems, IEEE Transactions on

Cross-Lingual Document Representation and Semantic Similarity Measure: A Fuzzy Set and Rough Set Based Approach

Abstract

As cross-lingual information retrieval attracts increasing attention, tools that measure cross-lingual semantic similarity between documents are becoming desirable. In this paper, two aspects of cross-lingual semantic document similarity measures are investigated: one is document representation, and the other is the formulation of similarity measures. Fuzzy set and rough set theories are applied to capture the inherently fuzzy relationships among concepts expressed in natural languages. Our approach first develops a language-independent, sense-level document representation based on the fuzzy set model to reduce the barrier between different languages. It then explores a fuzzy–rough hybrid approach that partitions the integrated sense association network of the document collection into macrosenses, yielding a more robust macrosense-level document representation. Next, Tversky's notion of similarity and the F1 measure from information retrieval are adopted to formulate, respectively, two document similarity measures using fuzzy set operations on the two proposed document representations. The effectiveness of our approach is demonstrated by its success rate in identifying the English translations of the corresponding Chinese documents in a collection of Chinese–English parallel documents. Moreover, the proposed approach can be easily extended to process documents in other languages. It is believed that the proposed representations, along with the similarity measures, will enable more effective text mining processes.
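To make the two similarity measures more concrete, the following is a minimal sketch of how a Tversky-style ratio and an F1-style score might be computed over fuzzy sets of senses. It is not the paper's formulation, which the abstract does not give. The representation (a dictionary from hypothetical sense identifiers to membership degrees), the choice of `min` for intersection, the bounded difference, the sigma-count cardinality, and the `alpha`/`beta` weights are all assumptions made for illustration.

```python
from typing import Dict

# Assumed representation: sense identifier -> membership degree in [0, 1].
FuzzySet = Dict[str, float]


def cardinality(a: FuzzySet) -> float:
    """Sigma-count cardinality: the sum of all membership degrees."""
    return sum(a.values())


def intersection(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Standard fuzzy intersection: element-wise minimum."""
    return {k: min(a[k], b[k]) for k in a.keys() & b.keys()}


def difference(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Bounded difference a - b: max(a(x) - b(x), 0) for each element."""
    return {k: max(m - b.get(k, 0.0), 0.0) for k, m in a.items()}


def tversky(a: FuzzySet, b: FuzzySet, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky ratio model with fuzzy set operations (illustrative weights)."""
    common = cardinality(intersection(a, b))
    denom = (
        common
        + alpha * cardinality(difference(a, b))
        + beta * cardinality(difference(b, a))
    )
    return common / denom if denom > 0 else 0.0


def fuzzy_f1(a: FuzzySet, b: FuzzySet) -> float:
    """F1-style score: harmonic mean of fuzzy 'precision' and 'recall'."""
    common = cardinality(intersection(a, b))
    size_a, size_b = cardinality(a), cardinality(b)
    if common == 0 or size_a == 0 or size_b == 0:
        return 0.0
    precision = common / size_a
    recall = common / size_b
    return 2 * precision * recall / (precision + recall)


# Hypothetical sense-level representations of an English and a Chinese document.
doc_en = {"sense_1": 0.8, "sense_2": 0.4}
doc_zh = {"sense_1": 0.6, "sense_3": 0.5}

print(tversky(doc_en, doc_zh))
print(fuzzy_f1(doc_en, doc_zh))
```

Under these particular assumptions, setting `alpha = beta = 0.5` makes the Tversky ratio coincide with the F1-style score, because both reduce to twice the shared cardinality divided by the sum of the two cardinalities. Other weights or other fuzzy operators would separate the two measures.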
