Data Analysis Project: Leveraging Massive Textual Corpora Using n-Gram Statistics

Abstract

We study methods of efficiently leveraging massive textual corpora through n-gram statistics. Specifically, we explore algorithms that use a database of frequency counts for sequences of tokens in a teraword Web corpus to correct spelling mistakes and to extract a list of instances of some category given only the name of the target category. For spelling correction, we use a novel correction algorithm and demonstrate high accuracy in correcting both real-word errors and non-word errors. For category extraction, we show promising preliminary results for a variety of categories. We conclude that n-gram statistics provide an efficient way to use information contained in a massive corpus of text using relatively simple algorithms. The report ends with a reflection on problems encountered, possible solutions, and future work.
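The abstract does not spell out the algorithms, so the following is only a minimal sketch of how n-gram frequency counts could drive both tasks. It is not the authors' method. The count table `NGRAM_COUNTS`, the confusion sets, the "such as" pattern, and all function names are assumptions made for illustration, and every count is invented.

```python
from collections import Counter

# Toy stand-in for the n-gram count database; every count below is invented.
NGRAM_COUNTS = Counter({
    ("a", "piece", "of"): 1200,
    ("a", "peace", "of"): 10,
    ("piece", "of", "cake"): 900,
    ("peace", "of", "cake"): 3,
    ("peace", "of", "mind"): 800,
    ("piece", "of", "mind"): 40,
    ("cities", "such", "as", "London"): 70,
    ("cities", "such", "as", "Paris"): 50,
    ("cities", "such", "as", "the"): 5,
})

# Hypothetical confusion sets: words that are easily mistaken for each other.
CONFUSION_SETS = [frozenset({"piece", "peace"})]


def candidates(word):
    """Return the confusion set containing `word`, or just the word itself."""
    for group in CONFUSION_SETS:
        if word in group:
            return group
    return frozenset({word})


def context_score(tokens, i, candidate, n=3):
    """Sum the counts of all n-grams covering position i with `candidate` there."""
    trial = tokens[:i] + [candidate] + tokens[i + 1:]
    first = max(0, i - n + 1)
    last = min(i, len(trial) - n)
    return sum(NGRAM_COUNTS[tuple(trial[s:s + n])] for s in range(first, last + 1))


def correct(tokens):
    """Replace a word only when another candidate fits its context strictly better."""
    out = list(tokens)
    for i, word in enumerate(tokens):
        best, best_score = word, context_score(out, i, word)
        for cand in sorted(candidates(word)):
            score = context_score(out, i, cand)
            if score > best_score:
                best, best_score = cand, score
        out[i] = best
    return out


def extract_instances(category_plural, min_count=10):
    """Collect fillers Y from 4-grams '<category> such as Y' seen often enough."""
    prefix = (category_plural, "such", "as")
    found = {
        ngram[3]: count
        for ngram, count in NGRAM_COUNTS.items()
        if len(ngram) == 4 and ngram[:3] == prefix and count >= min_count
    }
    return sorted(found, key=found.get, reverse=True)
```

With these toy counts, `correct("a peace of cake".split())` yields `['a', 'piece', 'of', 'cake']`, while `peace of mind` is left unchanged. A real system would query a corpus-scale count store and would need candidate generation for non-word errors, which this sketch omits.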

Bibliographic record

Authors: Carlson, A.; Mitchell, T. M.; Fette, I.
Year: 2008
Pages: 1-31
Total pages: 31
Original format: PDF
Language: English (eng)
Chinese Library Classification: Industrial technology
Keywords: Corrections; Natural language; Text processing; Algorithms; Internet; Extraction; Learning machines; Statistics; Data processing; Errors


