Journal: Information Processing & Management

DeASCIIfication approach to handle diacritics in Turkish information retrieval


Abstract

The absence of diacritics in text documents or search queries is a serious problem for Turkish information retrieval because it creates homographic ambiguity. Thus, the inappropriate handling of diacritics reduces the retrieval performance in search engines. A straightforward solution to this problem is to normalize tokens by replacing diacritic characters with their American Standard Code for Information Interchange (ASCII) counterparts. However, this so-called ASCIIfication produces either synthetic words that are not legitimate Turkish words or legitimate words with meanings that are completely different from those of the original words. These non-valid synthetic words cannot be processed by morphological analysis components (such as stemmers or lemmatizers), which expect the input to be valid Turkish words. By contrast, synthetic words are not a problem when no stemmer or a simple first-n-characters-stemmer is used in the text analysis pipeline. This difference emphasizes the notion of the diacritic sensitivity of stemmers. In this study, we propose and evaluate an alternative solution based on the application of deASCIIfication, which restores accented letters in query terms or text documents. Our risk-sensitive evaluation results showed that the diacritics restoration approach yielded more effective and robust results compared with normalizing tokens to remove diacritics.
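The contrast between ASCIIfication and deASCIIfication described above can be illustrated with a minimal Python sketch. It is not the method evaluated in the study: the function names `asciify` and `deasciify_candidates`, the `VARIANTS` table, and the three-word `LEXICON` are hypothetical and chosen only for illustration. The restoration step is a naive lexicon lookup, whereas a practical deASCIIfier would also need context to choose among several valid candidates.

```python
from itertools import product

# Turkish letters with diacritics and their ASCII counterparts.
ASCII_MAP = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")

# For each ambiguous ASCII letter, the letters it may stand for (lowercase only).
VARIANTS = {"c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü"}

# Hypothetical toy lexicon of valid Turkish words (assumption for this sketch).
LEXICON = {"oldu", "öldü", "göz"}


def asciify(token: str) -> str:
    """Replace Turkish diacritic letters with their ASCII counterparts."""
    return token.translate(ASCII_MAP)


def deasciify_candidates(token: str, lexicon: set) -> list:
    """Return every diacritic variant of an ASCII token found in the lexicon.

    The search is exponential in the number of ambiguous letters, so this
    is only suitable for short tokens in a demonstration.
    """
    options = [VARIANTS.get(ch, ch) for ch in token]
    candidates = ("".join(chars) for chars in product(*options))
    return sorted(c for c in candidates if c in lexicon)


if __name__ == "__main__":
    # "göz" (eye) becomes the synthetic non-word "goz".
    print(asciify("göz"))
    # "öldü" (died) collapses onto the valid but unrelated word "oldu" (happened).
    print(asciify("öldü"))
    # Restoration recovers the single valid reading of "goz" ...
    print(deasciify_candidates("goz", LEXICON))
    # ... but "oldu" stays ambiguous without context.
    print(deasciify_candidates("oldu", LEXICON))
```

In this sketch the synthetic token "goz" would be rejected by a stemmer that expects valid Turkish input, while the restored form "göz" would not. The ambiguous case "oldu" shows why a lexicon lookup alone is insufficient for restoration.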
