
Data Analysis Project: Leveraging Massive Textual Corpora Using n-Gram Statistics

Abstract

We study methods of efficiently leveraging massive textual corpora through n-gram statistics. Specifically, we explore algorithms that use a database of frequency counts for token sequences in a teraword Web corpus to correct spelling mistakes and to extract a list of instances of a category given only the name of the target category. For spelling correction, we use a novel correction algorithm and demonstrate high accuracy in correcting both real-word errors and non-word errors. For category extraction, we show promising preliminary results for a variety of categories. We conclude that n-gram statistics provide an efficient way to use the information contained in a massive text corpus with relatively simple algorithms. The report ends with a reflection on problems encountered, possible solutions, and future work.
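
To make the idea of count-based spelling correction concrete, here is a minimal Python sketch. It is not the report's algorithm, which the abstract does not describe; it is a generic context-scoring approach under stated assumptions. The tiny `NGRAM_COUNTS` table is invented toy data standing in for a Web-scale count database, and the helper names (`edits1`, `context_score`, `correct`) are hypothetical. Each token is compared with known words one edit away, and the candidate whose surrounding trigrams have the highest total count is kept.

```python
from collections import Counter

# Toy stand-in for a Web-scale trigram count database (invented numbers).
NGRAM_COUNTS = Counter({
    ("received", "a", "letter"): 200,
    ("a", "letter", "from"): 400,
    ("letter", "from", "the"): 500,
    ("from", "the", "bank"): 300,
    ("a", "letter", "form"): 3,
    ("letter", "form", "the"): 2,
    ("form", "the", "bank"): 1,
})
# Vocabulary is assumed to be every word that occurs in some counted n-gram.
VOCAB = {word for gram in NGRAM_COUNTS for word in gram}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return all strings one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts


def context_score(tokens, i, candidate, n=3):
    """Sum the counts of every n-gram that covers position i after substitution."""
    trial = tokens[:i] + [candidate] + tokens[i + 1:]
    total = 0
    for start in range(max(0, i - n + 1), min(i, len(trial) - n) + 1):
        total += NGRAM_COUNTS[tuple(trial[start:start + n])]
    return total


def correct(tokens):
    """Correct tokens left to right, keeping the original word on ties."""
    result = list(tokens)
    for i, word in enumerate(result):
        # Known words within one edit; fall back to the word itself if none.
        candidates = sorted(({word} | edits1(word)) & VOCAB) or [word]
        result[i] = max(
            candidates,
            key=lambda c: (context_score(result, i, c), c == word),
        )
    return result


print(correct("received a letter form the bank".split()))  # real-word error
print(correct("received a lettr from the bank".split()))   # non-word error
```

Because candidates are ranked by context counts rather than by dictionary membership alone, the same routine can handle a real-word error ("form" for "from") and a non-word error ("lettr"). A real system would query a large count store instead of an in-memory `Counter` and would need smoothing or backoff for unseen contexts.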
